2023-10-23 §
08:28 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1005.eqiad.wmnet - T336044 [analytics]
2023-10-19 §
19:58 <xcollazo> ran "sudo -u hdfs hdfs dfs -cp /user/xcollazo/artifacts/spark-3.3.2-assembly.zip /user/spark/share/lib/" and "sudo -u hdfs hdfs dfs -chmod o+r /user/spark/share/lib/spark-3.3.2-assembly.zip" to make the Spark 3.3.2 assembly available for other folks. [analytics]
19:54 <xcollazo> ran "sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.backup" to remove old spark assembly backup from May 25 2023. [analytics]
19:52 <xcollazo> ran "sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.bak" to remove old spark assembly backup from Jun 13 2023. [analytics]
15:22 <brouberol> The kafka service has been stopped on kafka-jumbo100[1-6] - T336044 [analytics]
15:04 <brouberol> sudo cumin --batch-size 1 --batch-sleep 60 'kafka-jumbo100[1-6].eqiad.wmnet' 'sudo systemctl stop kafka.service' - T336044 [analytics]
15:02 <brouberol> disabling puppet on kafka-jumbo100[1-6] to make sure kafka isn't restarted - T336044 [analytics]
12:13 <brouberol> disabling puppet on kafka-jumbo nodes so we can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/966497 [analytics]
09:42 <btullis> re-running airflow jobs for missing webrequest data on hadoop-test [analytics]
2023-10-18 §
18:03 <stevemunene> revert Add analytics-wmde service user to the Yarn production queue T340648 [analytics]
17:43 <tchin> deploying mw-page-content-change-enrich [analytics]
16:53 <stevemunene> Add analytics-wmde service user to the Yarn production queue T340648 [analytics]
09:14 <btullis> rebooting stat100[6-7] [analytics]
09:07 <btullis> rebooting stat1004 [analytics]
07:01 <aqu> Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train [analytics]
2023-10-17 §
16:17 <btullis> restarting hadoop-yarn-nodemanager on an-test-worker1001 [analytics]
14:01 <tchin> deploying airflow analytics [analytics]
13:39 <tchin> deploying refinery [analytics]
12:56 <btullis> deploying multiple spark shufflers to the test cluster [analytics]
09:51 <btullis> re-enabling all previously paused dags [analytics]
09:50 <btullis> restarting all airflow schedulers after rebooting an-db1001 [analytics]
09:10 <btullis> pausing both active dags on the analytics_product airflow instance [analytics]
09:09 <btullis> pausing all 7 active dags on airflow-platform_eng airflow instance [analytics]
09:07 <btullis> pausing all 3 active dags on airflow-research instance [analytics]
09:07 <btullis> pausing all 28 active airflow dags on airflow-search instance [analytics]
09:03 <btullis> pausing all airflow dags on analytics instance [analytics]
2023-10-16 §
13:05 <brouberol> deploying mw-page-content-change-enrich with the new kafka broker list T336044 [analytics]
10:06 <btullis> deploying presto version 0.283 to production for T342343 with `sudo debdeploy deploy -u 2023-10-12-presto.yaml -Q 'P{O:analytics_cluster::presto::server} or P{O:analytics_cluster::coordinator} or A:stat'` [analytics]
08:49 <brouberol> redeploying datahub with the new kafka broker list T336044 [analytics]
08:42 <brouberol> redeploying eventgate-analytics-external with the new kafka broker list T336044 [analytics]
08:38 <brouberol> redeploying eventgate-analytics with the new kafka broker list T336044 [analytics]
08:34 <brouberol> redeploying eventstreams-internal with the new kafka broker list T336044 [analytics]
2023-10-12 §
13:22 <btullis> rebooting archiva1002.wikimedia.org for T344671 [analytics]
12:00 <btullis> pushing out presto version 0.283 to the test cluster. [analytics]
09:31 <btullis> rebooting an-coord1002 for T344671 [analytics]
09:18 <btullis> power cycling an-master1002 to address unresponsiveness [analytics]
2023-10-11 §
09:27 <btullis> trigger rolling-restart of aqs services with `sudo cumin -b 2 -s 20 A:aqs 'systemctl restart aqs'` [analytics]
2023-10-09 §
18:35 <mforns> deployed airflow analytics [analytics]
10:46 <btullis> started rolling restart of an-worker1[078-156] for T344587 [analytics]
08:55 <btullis> started rolling restart of analytics10[70-77] for T344587 [analytics]
2023-10-05 §
15:30 <btullis> failed over test cluster hadoop namenode services to an-test-master1002 [analytics]
2023-10-04 §
06:19 <Surbhi_> Deployed refinery using scap, then deployed onto hdfs [analytics]
2023-10-02 §
16:45 <joal> Silenced the "High Kafka consumer lag for mw_page_content_change_enrich in codfw" alert for 3 days [analytics]
13:40 <stevemunene> roll-restart druid public workers to pick up a new worker node. T336042 [analytics]
13:28 <joal> Manually mark wikidata_item_page_link_weekly.wait_for_mediawiki_page_move task successful (with note) to overcome datacenter switchover sensor issue [analytics]
13:27 <joal> Manually mark wikidata_item_page_link_weekly.wait_for_mediawiki_page_move [analytics]
07:36 <joal> deploying mw-page-content-change-enrich on codfw after kafka has finished synchronizing its replicas [analytics]
2023-09-29 §
13:10 <btullis> systemctl reset-failed on kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service on kafka-jumbo1001 [analytics]
12:07 <joal> mw_page_content_change_enrich alert silenced for the weekend, the app is down, more investigation next week [analytics]
12:06 <joal> Various restarts of mw_page_content_change_enrich k8s app since yesterday - the app is failing to send data to kafka [analytics]