2026-03-09
22:03 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
22:02 <alexsanford> Redeployed security fix for T419186 [production]
21:44 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:40 <andrew@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:37 <cdobbins@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet [production]
21:34 <cdobbins@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS trixie [production]
21:29 <alexsanford> Deployed security fix for T419186 [production]
21:22 <andrew@cumin2002> START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
21:21 <andrew@cumin2002> END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
21:17 <dani@deploy2002> Finished scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] (duration: 08m 15s) [production]
21:13 <dani@deploy2002> dani: Continuing with sync [production]
21:11 <dani@deploy2002> dani: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
21:09 <dani@deploy2002> Started scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] [production]
21:08 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:05 <cdobbins@cumin2002> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [production]
21:02 <cdobbins@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [production]
21:01 <andrew@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:01 <tgr_> removed private code for T397244 [production]
21:01 <ryankemper> [WDQS] Alright, these are re-entering a failed state soon enough that we will need to identify the offender if we want to restore proper service. We could put some temporary hack to restart every few minutes so we at least maintain some uptime, but root cause is the usual 'we need a requestctl rule to block whoever's killing us' scenario [production]
21:00 <cdobbins@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: Trixie reimaging] [production]
20:57 <ryankemper> [WDQS] Auto-remediation would have eventually restarted these, but some of them were staying below our current threshold of `threads > 1200`. May want to lower threshold, or examine an additional metric-type to look at in the future [production]
20:56 <ryankemper> [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs1*}' 'systemctl restart wdqs-blazegraph'` [production]
20:54 <ryankemper> [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs2*}' 'systemctl restart wdqs-blazegraph'` [production]
20:44 <andrew@cumin2002> START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
20:43 <tgr@deploy2002> Unlocked for deployment [MediaWiki]: working on private change (duration: 10m 10s) [production]
20:36 <cdobbins@cumin2002> START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS trixie [production]
20:33 <tgr@deploy2002> Locking from deployment [MediaWiki]: working on private change [production]
20:31 <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] (duration: 13m 36s) [production]
20:27 <tgr@deploy2002> cscott, tgr, anzx: Continuing with sync [production]
20:19 <tgr@deploy2002> cscott, tgr, anzx: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
20:17 <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] [production]
20:13 <aaron@deploy2002> Finished scap sync-world: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] (duration: 06m 56s) [production]
20:09 <aaron@deploy2002> aaron: Continuing with sync [production]
20:08 <aaron@deploy2002> aaron: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
20:06 <aaron@deploy2002> Started scap sync-world: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] [production]
20:01 <brett@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp7016.* [production]
19:54 <cdobbins@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7001.magru.wmnet with OS trixie [production]
19:51 <brett@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7016.magru.wmnet with OS trixie [production]
19:49 <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] (duration: 06m 04s) [production]
19:45 <zabe@deploy2002> zabe: Continuing with sync [production]
19:44 <zabe@deploy2002> zabe: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
19:43 <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] [production]
19:29 <btullis@deploy2002> helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [production]
19:28 <btullis@deploy2002> helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [production]
19:28 <cdobbins@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [production]
19:24 <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [production]
19:23 <cdobbins@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [production]
19:19 <brett@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [production]
19:15 <cwhite@deploy2002> Finished deploy [performance/arc-lamp@aa8da8b]: Ie7e0355f89294a2927f9dbc28afec3a62d1752de (duration: 00m 08s) [production]
19:15 <cwhite@deploy2002> Started deploy [performance/arc-lamp@aa8da8b]: Ie7e0355f89294a2927f9dbc28afec3a62d1752de [production]