2023-11-14 §
14:50 <btullis> roll-restarting the presto cluster to pick up new puppet 7 CA settings [analytics]
14:28 <btullis> performing a rolling restart of the mariadb services on dbstore100[3,5,7] post this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668 [analytics]
11:03 <stevemunene> depool druid100[4-6] set pooled=inactive [analytics]
2023-11-13 §
23:34 <btullis> rebooting clouddb1021 to pick up new kernel and puppet 7 CA. [analytics]
21:28 <btullis> deploying updated datahub containers for T348647 [analytics]
21:27 <btullis> reloading haproxy on dbproxy1018 post maintenance [analytics]
17:07 <ottomata> deploying refinery with refinery source 0.2.25 jars and using 0.2.25 for refine job - T321854 [analytics]
13:57 <btullis> reloaded haproxy on dbproxy1018 to depool the analytics wikireplicas cluster [analytics]
12:31 <btullis> repooled clouddb10[13-16] post maintenance. [analytics]
11:08 <btullis> rebooting clouddb1013 to pick up new kernel and SSL CA settings [analytics]
10:49 <btullis> systemctl reload haproxy on dbproxy1019 to depool the web wikireplica cluster [analytics]
2023-11-09 §
14:43 <btullis> pooled druid10[09-11] in the druid-public cluster. [analytics]
12:29 <btullis> Proceeding to roll-restart yarn nodemanagers with `sudo cumin A:hadoop-worker -b 1 -s 30 'systemctl restart hadoop-yarn-nodemanager.service'` for T344910 [analytics]
11:47 <btullis> restarting yarn-nodemanager service on an-worker1100.eqiad.wmnet as a canary for T344910 [analytics]
11:14 <btullis> deploying multiple spark shufflers to production for T344910 [analytics]
09:53 <btullis> executed `helmfile -e eqiad --state-values-set roll_restart=1 sync` to roll-restart datahub in eqiad [analytics]
09:43 <btullis> executed `helmfile -e codfw --state-values-set roll_restart=1 sync` to roll-restart datahub in codfw [analytics]
2023-11-08 §
15:52 <stevemunene> Add analytics-wmde service user to the Yarn production queue T340648 [analytics]
13:55 <btullis> beginning rolling restart of all hadoop workers in production, to pick up new puppet 7 CA settings. [analytics]
10:33 <btullis> restarting hadoop-hdfs-datanode.service and hadoop-yarn-nodemanager.service on an-worker1111 to pick up puppet7 changes. [analytics]
10:27 <brouberol> running scap deploy for airflow-dags/analytics [analytics]
2023-11-07 §
20:48 <xcollazo> Ran 'kerberos-run-command hdfs hdfs dfs -chmod -R g+w /wmf/data/wmf_dumps/wikitext_raw_rc2' to ease experimentation on this release candidate table. [analytics]
15:52 <btullis> restart airflow-scheduler and airflow-webserver services on an-test-client1002 [analytics]
15:50 <btullis> restart mariadb service on an-test-coord1001 [analytics]
15:50 <btullis> restart mariadb service on an-test-coord100 [analytics]
15:49 <btullis> restart presto-server service on an-test-coord1001 and an-test-presto1001 to pick up new puppet 7 CA settings [analytics]
15:48 <btullis> restart hive-server2 and hive-metastore services on an-test-coord1001 to pick up new puppet 7 CA settings. [analytics]
15:35 <btullis> roll-restarting hadoop workers in test, to test new puppet 7 CA settings. [analytics]
14:52 <btullis> roll-restarting hadoop masters on the test cluster, after upgrading to puppet 7 [analytics]
12:05 <btullis> deploying datahub to prod for the pki certificates. [analytics]
11:36 <btullis> deploying datahub to staging to start using pki certificates - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969345/ [analytics]
10:40 <btullis> re-running the kafka_jumbo_ingestion in analytics airflow [analytics]
2023-11-06 §
18:38 <milimetric> deployed refinery-source, starting to deploy analytics airflow dags [analytics]
13:57 <stevemunene> roll-restart druid public workers to pick up a new zookeeper node druid1009. T336042 [analytics]
13:32 <stevemunene> restart zookeeper leader to pick up new host druid1009 T336042 [analytics]
13:25 <stevemunene> stop and disable zookeeper on druid1004 T336042 [analytics]
13:19 <stevemunene> disable puppet on druid1004 and druid10[09-11] to onboard new druid1009 to the ZooKeeper cluster for the `druid-public-eqiad` cluster [analytics]
2023-11-01 §
15:58 <stevemunene> powercycle stat1008, host is frozen/stuck in an unresponsive state [analytics]
2023-10-31 §
09:26 <brouberol> I replaced the self-signed skein certificate with one issued by our cfssl PKI on an-test1002 - T329398 [analytics]
2023-10-26 §
16:18 <stevemunene> roll-restart druid public workers to pick up new zookeeper hosts. T336042 [analytics]
15:29 <stevemunene> stop zookeeper on druid1005, the current leader for the `druid-public-eqiad` cluster; this will trigger the election of a new leader T336042 [analytics]
10:18 <stevemunene> restart zookeeper leader to pick up new host druid1011 T336042 [analytics]
09:18 <stevemunene> stop zookeeper on druid1006 T336042 [analytics]
08:48 <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009 [analytics]
08:06 <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008 [analytics]
2023-10-24 §
16:46 <xcollazo> Deploying latest DAGs to analytics Airflow instance [analytics]
12:41 <joal> Drop wmf.referrer_daily hive table and data [analytics]
10:07 <btullis> transferring snapshot s2.2023-10-23--01-34-18 from dbprov1004 to dbstore1007:/srv/sqldata.s2 [analytics]
10:02 <btullis> stopping and deleting s2 on dbstore1007. [analytics]
2023-10-23 §
10:14 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1001.eqiad.wmnet [analytics]