analytics SAL

2023-09-14 §
14:13	<stevemunene>	powercycle an-worker1138, investigating failures related to reimage T332570	[analytics]
11:42	<btullis>	deploying conda-analytics version 0.0.20 to the test cluster for T337258	[analytics]
2023-09-12 §
14:59	<btullis>	successfully failed back the HDFS namenode services to an-master1001	[analytics]
11:21	<btullis>	demonstrated the use of SAL for T343762	[analytics]
09:54	<btullis>	btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet	[analytics]
2023-09-07 §
16:55	<btullis>	restarting the aqs service on all aqs* servers in batches to pick up new MW_history snapshot.	[analytics]
13:43	<mforns>	(actual timestamp: 2023-09-06, 19:10:29 UTC) cleared airflow task mediawiki_history_reduced.check_mediawiki_history_reduced_error_folder (and subsequent tasks) for snapshot=2023-08. This was due to false positive errors having been generated by the checker.	[analytics]
2023-09-05 §
14:26	<btullis>	completed eventstreams and eventstreams-internal deployments.	[analytics]
14:23	<btullis>	deploying eventstreams for T344688	[analytics]
14:15	<btullis>	deploying eventstreams-internal for T344688	[analytics]
12:35	<stevemunene>	power cycle an-worker1132. Host is stuck on debian install after a failed reimage.	[analytics]
10:35	<joal>	Rerun cassandra_load_pageview_top_articles_monthly	[analytics]
10:35	<joal>	Clear airflow false-failed tasks for pageview_hourly (log-aggregation issue)	[analytics]
2023-09-01 §
07:43	<stevemunene>	powercycle an-worker1145.eqiad.wmnet host cpus soft lockup T345413	[analytics]
2023-08-31 §
13:02	<aqu>	Deployed refinery using scap, then deployed onto hdfs	[analytics]
12:01	<aqu>	About to deploy analytics refinery (weekly train)	[analytics]
2023-08-30 §
15:43	<stevemunene>	restart hadoop-yarn-nodemanager.service on an-worker11[29-48].eqiad.wmnet in batches of 2 with 3 minutes in between	[analytics]
14:46	<stevemunene>	restart hadoop-yarn-nodemanager.service on an-worker11[00-28].eqiad.wmnet in batches of 2 with 3 minutes in between	[analytics]
14:08	<stevemunene>	restart hadoop-yarn-nodemanager.service on an-worker10[78-99].eqiad.wmnet in batches of 2 with 3 minutes in between	[analytics]
12:41	<stevemunene>	disable puppet on an-worker1147 to test the hadoop-yarn log aggregation compression algorithm. The compression was set to gzip but should have been set to gz	[analytics]
12:26	<stevemunene>	restart hadoop-yarn-nodemanager.service on an-worker1147	[analytics]
2023-08-29 §
11:01	<joal>	Update mediawiki_history_check_denormalize airflow job variables to send job-reports to both data-engineering-alerts and product-analytics	[analytics]
10:52	<joal>	Deploy airflow-dags/analytics	[analytics]
2023-08-24 §
18:20	<btullis>	attempting another failback of the hadoop namenode services	[analytics]
16:47	<btullis>	start hadoop namenode on an-master1001 after crash.	[analytics]
16:46	<btullis>	failback unsuccessful. namenode services still running on an-master1002.	[analytics]
16:43	<btullis>	going for failback of HDFS namenode service from an-master1002 to an-master1001	[analytics]
16:10	<btullis>	about to reboot an-master1001	[analytics]
16:09	<btullis>	failing over yarn resourcemanager to an-master1002	[analytics]
16:07	<btullis>	failing over hdfs namenode from an-master1001 to an-master1002	[analytics]
12:40	<btullis>	rebooting an-coord1001	[analytics]
12:08	<btullis>	failing over hive to an-coord1002 in advance of reboot of an-coord1001	[analytics]
11:24	<btullis>	btullis@cp3074:~$ sudo systemctl start varnishkafka-webrequest.service	[analytics]
2023-08-23 §
14:50	<btullis>	rebooting an-launcher1002	[analytics]
08:22	<btullis>	beginning a rolling reboot of kafka-jumbo	[analytics]
2023-08-22 §
17:24	<joal>	Redeploying refinery onto Hadoop-test to try to fix jar issue	[analytics]
14:29	<gmodena>	deploying refinery with hdfs	[analytics]
14:08	<gmodena>	deploying refinery using scap	[analytics]
13:03	<btullis>	deploying the change to the yarn log retention and compression for T342923	[analytics]
2023-08-17 §
15:12	<btullis>	failing hive back to an-coord1001 following maintenance	[analytics]
14:59	<btullis>	restarting hive-server2 and hive-metastore services on an-coord1001 after failover.	[analytics]
14:49	<btullis>	failing over hive to an-coord1002 to permit restart of hive on an-coord1001	[analytics]
09:29	<btullis>	deploying airflow-analytics	[analytics]
2023-08-16 §
17:06	<btullis>	aqs deploy completed successfully.	[analytics]
17:05	<btullis>	re-ran the failed refine_eventlogging_analytics job and sent follow-up email.	[analytics]
16:52	<btullis>	deploying aqs again	[analytics]
16:43	<btullis>	deploying aqs	[analytics]
2023-08-14 §
09:27	<btullis>	rebooted an-worker1124 due to CPU lockups	[analytics]
2023-08-12 §
14:16	<btullis>	re-ran refine_event job for 'mediawiki_revision_create|mediawiki_page_create'	[analytics]
2023-08-10 §
16:59	<btullis>	re-enabled airflow jobs on analytics_test instance	[analytics]