analytics SAL

2023-03-07 §
14:59	<nfraison>	force startup of nodemanager on analytics_cluster	[analytics]
14:58	<btullis>	pooled druid1004	[analytics]
14:57	<btullis>	pooling aqs1010 and aqs1016	[analytics]
14:56	<btullis>	pooling datahubsearch1001	[analytics]
14:53	<btullis>	leaving safe mode on hdfs	[analytics]
13:59	<btullis>	disabled puppet temporarily on an-master100[1-2] to avoid an automatic restart of yarn	[analytics]
13:57	<btullis>	stopped `hadoop-yarn-resourcemanager.service` on both an-master100[1-2]	[analytics]
13:54	<btullis>	entering safe mode with `sudo -u hdfs kerberos-run-command hdfs hadoop dfsadmin -safemode enter` on an-master1002	[analytics]
12:57	<btullis>	depooled druid1004 for T329073	[analytics]
12:56	<btullis>	depooled datahubsearch1001 for T329073	[analytics]
12:51	<btullis>	disabled gobblin timers on an-launcher1002	[analytics]
12:46	<btullis>	depooling aqs1016 for T329073	[analytics]
12:45	<btullis>	depooling aqs1010 for T329073	[analytics]
08:00	<nfraison>	Reimage an-conf1003 to upgrade to bullseye T329362	[analytics]
2023-03-06 §
23:12	<mforns>	deployed airflow analytics to unbreak druid-load-edit-hourly	[analytics]
15:26	<mforns>	deployed airflow analytics to unbreak druid-load-edit-hourly	[analytics]
13:53	<btullis>	failing over the production hadoop cluster namenode service to an-master1002	[analytics]
13:17	<btullis>	failing over analytics test cluster namenode service to an-test-master1002 T329073	[analytics]
12:26	<nfraison>	Reimage an-conf1002 to upgrade to bullseye T329362	[analytics]
10:15	<ottomata>	deploy mediawiki_history_reduced_2023_02 snapshot to AQS	[analytics]
09:23	<nfraison>	Reimage an-conf1001 to upgrade to bullseye T329362	[analytics]
2023-03-03 §
16:48	<xcollazo>	Deleted snapshot=2023-02-20 for tables image_suggestions_search_index_full, image_suggestions_search_index_delta, image_suggestions_lead_image_data and image_suggestions_wikidata_data from the analytics_platform_eng schema. This data will be regenerated. See https://phabricator.wikimedia.org/T330688.	[analytics]
15:53	<mforns>	deployed airflow analytics to unbreak edit_hourly_dag	[analytics]
15:44	<xcollazo>	Deploying latest image_suggestions DAG on platform_eng Airflow instance	[analytics]
07:29	<elukey>	truncate /var/log/auth.log.1 on krb1001 to free space (root partition almost filled up)	[analytics]
2023-03-02 §
13:27	<nfraison>	airflow on an-test-client1001 is migrated to version 2.5.1	[analytics]
12:32	<joal>	Rerun mediawiki-history-denormalize-wf-2023-02	[analytics]
10:00	<btullis>	commencing second attempt to upgrade airflow on an-test-client1001 to version 2.5.1	[analytics]
2023-03-01 §
22:45	<mforns>	re-deployed airflow analytics with some forgotten changes	[analytics]
22:42	<mforns>	deployed Airflow analytics	[analytics]
22:30	<mforns>	finished refinery deployment, although didn't manage to run refinery-deploy-to-hdfs without warnings...	[analytics]
21:48	<mforns>	kill edit-hourly-coord in Hue to migrate it to Airflow	[analytics]
21:26	<mforns>	starting refinery deploy	[analytics]
19:38	<SandraEbele>	rerunning webrequest load text for 2023-03-01-08 hour.	[analytics]
18:54	<joal>	Create empty partitions in event.mediawiki_page_move table for codfw datacenter from beginning of week (2023-02-27T00 -> 2023-02-28T13)	[analytics]
10:25	<nfraison>	rebooting an-worker1132, which is slower than the other nodes (potential issue with RAID card/disks)	[analytics]
07:59	<nfraison>	restarted hiveserver2 in analytics-test to take into account the -XX:MaxMetaspaceSize=512m JVM parameter	[analytics]
2023-02-28 §
21:33	<xcollazo>	Deploying section_image_recommendations DAG to platform_eng Airflow instance	[analytics]
11:38	<btullis>	cancelled merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128	[analytics]
11:32	<btullis>	merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128	[analytics]
09:42	<nfraison>	restart presto prod coordinator to take into account the heap size change	[analytics]
09:38	<nfraison>	Failover hive servers to active server: an-coord1001	[analytics]
09:32	<nfraison>	restarted hive-metastore and hiveserver2 on an-coord1001 (non-active hive server)	[analytics]
08:22	<nfraison>	Failover hive servers to standby server: https://gerrit.wikimedia.org/r/c/operations/dns/+/892460	[analytics]
2023-02-27 §
14:52	<nfraison>	restarted hive-metastore and hiveserver2 on an-coord1002 (standby hive server)	[analytics]
2023-02-22 §
19:39	<mforns>	restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer	[analytics]
11:07	<nfraison>	roll restart presto clusters to take into account the fix for the node.environment typo	[analytics]
2023-02-21 §
19:01	<mforns>	re airflow silent failure: the job was pageview_actor_hourly	[analytics]
19:00	<mforns>	we had another silent failure in airflow, a sensor that failed without sending an email. the logs are missing.	[analytics]
09:33	<nfraison>	adding last batch of 5 nodes to the presto prod cluster	[analytics]