2023-09-14 §
14:13 <stevemunene> powercycle an-worker1138, investigating failures related to reimage T332570 [analytics]
11:42 <btullis> deploying conda-analytics version 0.0.20 to the test cluster for T337258 [analytics]
2023-09-12 §
14:59 <btullis> successfully failed back the HDFS namenode services to an-master1001 [analytics]
11:21 <btullis> demonstrated the use of SAL for T343762 [analytics]
09:54 <btullis> btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [analytics]
2023-09-07 §
16:55 <btullis> restarting the aqs service on all aqs* servers in batches to pick up new MW_history snapshot. [analytics]
13:43 <mforns> (actual timestamp: 2023-09-06, 19:10:29 UTC) cleared airflow task mediawiki_history_reduced.check_mediawiki_history_reduced_error_folder (and subsequent tasks) for snapshot=2023-08. This was due to false positive errors having been generated by the checker. [analytics]
2023-09-05 §
14:26 <btullis> completed eventstreams and eventstreams-internal deployments. [analytics]
14:23 <btullis> deploying eventstreams for T344688 [analytics]
14:15 <btullis> deploying eventstreams-internal for T344688 [analytics]
12:35 <stevemunene> power cycle an-worker1132. Host is stuck on debian install after a failed reimage. [analytics]
10:35 <joal> Rerun cassandra_load_pageview_top_articles_monthly [analytics]
10:35 <joal> Clear airflow false-failed tasks for pageview_hourly (log-aggregation issue) [analytics]
2023-09-01 §
07:43 <stevemunene> powercycle an-worker1145.eqiad.wmnet, host CPUs in soft lockup T345413 [analytics]
2023-08-31 §
13:02 <aqu> Deployed refinery using scap, then deployed onto hdfs [analytics]
12:01 <aqu> About to deploy analytics refinery (weekly train) [analytics]
2023-08-30 §
15:43 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker11[29-48].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
14:46 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker11[00-28].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
14:08 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker10[78-99].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
12:41 <stevemunene> disable puppet on an-worker1147 to test the hadoop-yarn log aggregation compression algorithm. The compression was set to gzip but should have been set to gz [analytics]
12:26 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker1147 [analytics]
2023-08-29 §
11:01 <joal> Update mediawiki_history_check_denormalize airflow job variables to send job-reports to both data-engineering-alerts and product-analytics [analytics]
10:52 <joal> Deploy airflow-dags/analytics [analytics]
2023-08-24 §
18:20 <btullis> attempting another failback of the hadoop namenode services [analytics]
16:47 <btullis> start hadoop namenode on an-master1001 after crash. [analytics]
16:46 <btullis> failback unsuccessful. namenode services still running on an-master1002. [analytics]
16:43 <btullis> going for failback of HDFS namenode service from an-master1002 to an-master1001 [analytics]
16:10 <btullis> about to reboot an-master1001 [analytics]
16:09 <btullis> failing over yarn resourcemanager to an-master1002 [analytics]
16:07 <btullis> failing over hdfs namenode from an-master1001 to an-master1002 [analytics]
12:40 <btullis> rebooting an-coord1001 [analytics]
12:08 <btullis> failing over hive to an-coord1002 in advance of reboot of an-coord1001 [analytics]
11:24 <btullis> btullis@cp3074:~$ sudo systemctl start varnishkafka-webrequest.service [analytics]
2023-08-23 §
14:50 <btullis> rebooting an-launcher1002 [analytics]
08:22 <btullis> beginning a rolling reboot of kafka-jumbo [analytics]
2023-08-22 §
17:24 <joal> Redeploying refinery onto Hadoop-test to try to fix jar issue [analytics]
14:29 <gmodena> deploying refinery with hdfs [analytics]
14:08 <gmodena> deploying refinery using scap [analytics]
13:03 <btullis> deploying the change to the yarn log retention and compression for T342923 [analytics]
2023-08-17 §
15:12 <btullis> failing hive back to an-coord1001 following maintenance [analytics]
14:59 <btullis> restarting hive-server2 and hive-metastore services on an-coord1001 after failover. [analytics]
14:49 <btullis> failing over hive to an-coord1002 to permit restart of hive on an-coord1001 [analytics]
09:29 <btullis> deploying airflow-analytics [analytics]
2023-08-16 §
17:06 <btullis> aqs deploy completed successfully. [analytics]
17:05 <btullis> re-ran the failed refine_eventlogging_analytics job and sent follow-up email. [analytics]
16:52 <btullis> deploying aqs again [analytics]
16:43 <btullis> deploying aqs [analytics]
2023-08-14 §
09:27 <btullis> rebooted an-worker1124 due to CPU lockups [analytics]
2023-08-12 §
14:16 <btullis> re-ran refine_event job for 'mediawiki_revision_create|mediawiki_page_create' [analytics]
2023-08-10 §
16:59 <btullis> re-enabled airflow jobs on analytics_test instance [analytics]