2023-09-01 §
07:43 <stevemunene> powercycle an-worker1145.eqiad.wmnet host cpus soft lockup T345413 [analytics]
2023-08-31 §
13:02 <aqu> Deployed refinery using scap, then deployed onto hdfs [analytics]
12:01 <aqu> About to deploy analytics refinery (weekly train) [analytics]
2023-08-30 §
15:43 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker11[29-48].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
14:46 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker11[00-28].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
14:08 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker10[78-99].eqiad.wmnet in batches of 2 with 3 minutes in between [analytics]
12:41 <stevemunene> disable puppet on an-worker1147 to test hadoop-yarn log aggregation compression algorithm. The compression was set to gzip but should have been set to gz [analytics]
12:26 <stevemunene> restart hadoop-yarn-nodemanager.service on an-worker1147 [analytics]
2023-08-29 §
11:01 <joal> Update mediawiki_history_check_denormalize airflow job variables to send job-reports to both data-engineering-alerts and product-analytics [analytics]
10:52 <joal> Deploy airflow-dags/analytics [analytics]
2023-08-24 §
18:20 <btullis> attempting another failback of the hadoop namenode services [analytics]
16:47 <btullis> start hadoop namenode on an-master1001 after crash. [analytics]
16:46 <btullis> failback unsuccessful. namenode services still running on an-master1002. [analytics]
16:43 <btullis> going for failback of HDFS namenode service from an-master1002 to an-master1001 [analytics]
16:10 <btullis> about to reboot an-master1001 [analytics]
16:09 <btullis> failing over yarn resourcemanager to an-master1002 [analytics]
16:07 <btullis> failing over hdfs namenode from an-master1001 to an-master1002 [analytics]
12:40 <btullis> rebooting an-coord1001 [analytics]
12:08 <btullis> failing over hive to an-coord1002 in advance of reboot of an-coord1001 [analytics]
11:24 <btullis> btullis@cp3074:~$ sudo systemctl start varnishkafka-webrequest.service [analytics]
2023-08-23 §
14:50 <btullis> rebooting an-launcher1002 [analytics]
08:22 <btullis> beginning a rolling reboot of kafka-jumbo [analytics]
2023-08-22 §
17:24 <joal> Redeploying refinery onto Hadoop-test to try to fix jar issue [analytics]
14:29 <gmodena> deploying refinery with hdfs [analytics]
14:08 <gmodena> deploying refinery using scap [analytics]
13:03 <btullis> deploying the change to the yarn log retention and compression for T342923 [analytics]
2023-08-17 §
15:12 <btullis> failing hive back to an-coord1001 following maintenance [analytics]
14:59 <btullis> restarting hive-server2 and hive-metastore services on an-coord1001 after failover. [analytics]
14:49 <btullis> failing over hive to an-coord1002 to permit restart of hive on an-coord1001 [analytics]
09:29 <btullis> deploying airflow-analytics [analytics]
2023-08-16 §
17:06 <btullis> aqs deploy completed successfully. [analytics]
17:05 <btullis> re-ran refine_eventlogging_analytics failed job and sent follow-up email. [analytics]
16:52 <btullis> deploying aqs again [analytics]
16:43 <btullis> deploying aqs [analytics]
2023-08-14 §
09:27 <btullis> rebooted an-worker1124 due to CPU lockups [analytics]
2023-08-12 §
14:16 <btullis> re-ran refine_event job for 'mediawiki_revision_create|mediawiki_page_create' [analytics]
2023-08-10 §
16:59 <btullis> re-enabled airflow jobs on analytics_test instance [analytics]
08:58 <btullis> rebooting an-db1001 [analytics]
08:57 <btullis> stopped all airflow-scheduler services [analytics]
08:57 <btullis> paused all dags on all airflow instances [analytics]
2023-08-09 §
14:22 <btullis> failing over namenode on test cluster from an-test-master1001 to an-test-master1002 after upgrade of an-test-master1002 to bullseye [analytics]
11:31 <btullis> I did systemctl reset-failed logrotate.service on datahubsearch1002 [analytics]
11:08 <btullis> starting hadoop-hdfs-namenode.service on an-master1002 [analytics]
11:02 <btullis> failing over namenode services to an-master1002 so that I can reboot an-master1001 [analytics]
09:49 <btullis> restarted systemd-timedate service on an-worker1086 [analytics]
2023-08-07 §
17:09 <btullis> deploying new mediawiki_history snapshot to AQS [analytics]
2023-08-02 §
20:42 <xcollazo> deployed latest for Airflow analytics instance. [analytics]
19:30 <xcollazo> deploying refinery to try and fix https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/thread/QKXYMYKMWXGRNYZ77CENA5F2EGA66QQ2/ [analytics]
12:42 <xcollazo> Redeploy of analytics_product Airflow instance to see if it clears a Spark issue [analytics]
2023-08-01 §
11:37 <btullis> ran apt clean on an-tool1009 to free up disk space [analytics]