2023-06-16 §
12:11 <btullis> restarting refine_event_sanitized_main_delayed.service on an-launcher1002 [analytics]
12:03 <btullis> restarting refine_event_sanitized_analytics_delayed.service on an-launcher1002 [analytics]
11:14 <btullis> rebooting an-test-worker1002 for T335358 and stuck gobblin [analytics]
10:13 <joal> rerun druid_load_edit_hourly to reload full snapshot [analytics]
2023-06-15 §
19:27 <btullis> restarting aqs service on A:aqs in batches of 2, 10 seconds apart [analytics]
17:02 <joal> Deploying airflow (again) to fix memory issues [analytics]
15:58 <joal> Rerun druid indexation for mediawiki_history_reduced [analytics]
15:56 <joal> Deploy airflow to fix druid loading jobs using snapshot [analytics]
15:53 <milimetric> refinery-source 0.2.17 deployed, refinery updated and synced to hdfs [analytics]
12:47 <stevemunene> running sre.hadoop.roll-restart-masters to completely remove any reference to analytics1058-1060 for T317861 [analytics]
12:34 <joal> Deploy analytics-airflow to patch mediawiki_history_reduced druid loading [analytics]
09:05 <elukey> move varnishkafka instances in ulsfo to PKI [analytics]
2023-06-14 §
20:18 <milimetric> reran mediawiki_history_reduced druid load task after deploying Joseph's fix [analytics]
13:15 <stevemunene> running puppet on an-master100[1-2]: Remove analytics58_60 from the HDFS topology T317861 [analytics]
2023-06-13 §
19:27 <btullis> restarting the hive-server2 and hive-metastore services on an-coord1001 [analytics]
19:03 <btullis> freeing up space in /srv on an-launcher1002 with `btullis@an-launcher1002:/srv/airflow-analytics/logs/scheduler$ find -maxdepth 1 -type d -mtime +15 -print0 | xargs -0 sudo rm -rf` for T339002 [analytics]
16:41 <ottomata> deploying refinery for weekly train [analytics]
15:45 <SandraEbele> Deployed refinery-source using jenkins [analytics]
15:19 <ottomata> drop event.mediawiki_page_outlink_topic_prediction_change table and data - T337395 [analytics]
15:13 <SandraEbele> deploying refinery source [analytics]
15:05 <ottomata> dropping hive table event.mediawiki_page_change_v1 to pick up backwards incompatible schema change - T337395 [analytics]
15:03 <btullis> failing over the analytics-hive cname to an-coord1002 [analytics]
13:45 <elukey> fixed broken graphs in the varnishkafka's dashboard [analytics]
13:37 <btullis> restarting hive-server2 and hive-metastore on an-coord1002 prior to failover. [analytics]
13:00 <btullis> rolled out conda-analytics 0.0.18 to analytics-airflow and hadoop-coordinator [analytics]
12:25 <btullis> beginning rollout of conda-analytics 0.0.18 to hadoop-workers [analytics]
07:10 <elukey> move varnishkafka instances on cp4037 to PKI TLS certs [analytics]
2023-06-12 §
12:39 <btullis> ran apt clean on an-testui1001 to get some free disk space. [analytics]
11:30 <btullis> resuming deployment of eventgate-main [analytics]
09:58 <btullis> deploying eventgate-main [analytics]
08:52 <btullis> restart monitor_refine_netflow service on an-launcher1002 after successful job re-run. [analytics]
08:36 <btullis> re-running the refine_netflow task [analytics]
2023-06-09 §
20:40 <btullis> restarting the aqs service more quickly with: `sudo cumin -b 2 -s 10 A:aqs 'systemctl restart aqs'` [analytics]
20:23 <btullis> btullis@cumin1001:~$ sudo cookbook sre.aqs.roll-restart-reboot --alias aqs restart_daemons --reason aqs_rollback_btullis [analytics]
20:22 <btullis> merged and deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/928927 to revert aqs mediawiki snapshot change [analytics]
2023-06-08 §
17:12 <btullis> running the sre.hadoop.roll-restart-masters cookbook for the analytics cluster, to pick up the new journalnode for T338336 [analytics]
17:01 <btullis> running puppet on an-worker1142 to start the new journalnode [analytics]
06:42 <stevemunene> stop hadoop-hdfs-journalnode on analytics1069 in order to swap the journal node with an-worker1142 T338336 [analytics]
06:10 <elukey> kill remaining processes for `andyrussg` on stat100x nodes to unblock puppet [analytics]
2023-06-07 §
15:38 <btullis> installing presto 0.281 to the test cluster [analytics]
15:23 <elukey> all varnishkafka instances on caching nodes are getting restarted due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087 - T337825 [analytics]
14:13 <btullis> running `sudo cumin A:wikireplicas-web 'maintain-views --all-databases --table abuse_filter_history --replace-all'` on A:wikireplicas-web [analytics]
14:04 <btullis> running `maintain-views --all-databases --table abuse_filter_history --replace-all` on A:wikireplicas-analytics [analytics]
11:52 <btullis> running `sudo maintain-views --all-databases --table abuse_filter_history --replace-all` on clouddbd1021 for T315426 [analytics]
08:02 <elukey> set "loadByPeriod(P15D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460 [analytics]
2023-06-06 §
15:52 <elukey> restart yarn resourcemanager on an-master1002 to restore the Yarn UI (that works only when the active yarn RM is on 1001) [analytics]
15:07 <mforns> deployed airflow analytics to try and fix the edit_hourly DAG again [analytics]
13:09 <ottomata> EventStreamConfig - temporarily Disable canary events and hadoop ingestion for development.network.probe stream - T332024 [analytics]
11:29 <stevemunene> service hadoop-yarn-resourcemanager restart for T317861 [analytics]
11:13 <btullis> restart airflow-scheduler service on an-test-client1001 for analytics_test instance [analytics]