production SAL

2025-05-01
19:30	<dduvall@deploy1003>	Finished scap sync-world: retrying sync-world following spurious helmfile apply error (mw-jobrunner codfw) (duration: 11m 24s)	[production]
19:20	<sukhe>	sukhe@netbox1003:~$ sudo systemctl start uwsgi-netbox.service: service was OOM'ed, restarting	[production]
19:18	<dduvall@deploy1003>	Started scap sync-world: retrying sync-world following spurious helmfile apply error (mw-jobrunner codfw)	[production]
19:16	<jhathaway@dns1004>	END - running authdns-update	[production]
19:14	<jhathaway@dns1004>	START - running authdns-update	[production]
19:09	<ryankemper>	T376151 [wdqs-internal lvs teardown] running puppet across `A:wdqs-internal` now that pybal has been restarted	[production]
19:09	<dduvall>	deployment of mw-jobrunner-main for codfw failed during scap train (group2) (T386222)	[production]
19:09	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] all IPVS diff check alerts have recovered, rolling restart complete	[production]
19:06	<dduvall>	helm error during group2 deployment "Get "https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/mw-jobrunner/services/mediawiki-main-tls-service": dial tcp 10.2.1.8:6443: connect: no route to host - error from a previous attempt: read tcp 10.64.16.93:41894->10.2.1.8:6443: read: connection reset by peer"	[production]
19:04	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] `ipvsadm --delete-service --tcp-service 10.2.2.41:80` on `lvs1019` and `lvs1020`	[production]
19:03	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] `ipvsadm --delete-service --tcp-service 10.2.1.41:80` on `A:lvs-secondary-codfw OR A:lvs-low-traffic-codfw`(lvs2013, lvs2014)	[production]
18:59	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-low-traffic-codfw` (lvs2013)	[production]
18:58	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-secondary-codfw` (lvs2014), waiting 2 mins before proceeding	[production]
18:55	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-low-traffic-eqiad` (lvs1019), waiting few mins before proceeding	[production]
18:48	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-secondary-eqiad`, it only restarted on ` lvs1020` but for some reason ` lvs1013` doesn't have a pybal service running	[production]
18:44	<ryankemper>	T376151 [wdqs-internal lvs teardown -> pybal rolling restart] ran puppet on `O:Lvs::balancer` after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136747	[production]
18:32	<eevans@deploy1003>	helmfile [eqiad] DONE helmfile.d/services/echostore: apply	[production]
18:31	<eevans@deploy1003>	helmfile [eqiad] START helmfile.d/services/echostore: apply	[production]
18:30	<eevans@deploy1003>	helmfile [codfw] DONE helmfile.d/services/echostore: apply	[production]
18:29	<eevans@deploy1003>	helmfile [codfw] START helmfile.d/services/echostore: apply	[production]
18:28	<eevans@deploy1003>	helmfile [staging] DONE helmfile.d/services/echostore: apply	[production]
18:27	<eevans@deploy1003>	helmfile [staging] START helmfile.d/services/echostore: apply	[production]
18:26	<ryankemper>	T376151 (wdqs-internal lvs teardown) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136744 to flip `wdqs-internal` service state to `lvs_setup` and running puppet across `A:dnsbox`	[production]
18:24	<dduvall@deploy1003>	rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.27 refs T386222	[production]
18:23	<ryankemper@dns1004>	END - running authdns-update	[production]
18:21	<ryankemper@dns1004>	START - running authdns-update	[production]
17:31	<jhathaway>	testing sasl email relaying on mx-in{1001,2001}	[production]
16:40	<btullis@deploy1003>	helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply	[production]
16:40	<btullis@deploy1003>	helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply	[production]
16:39	<btullis@deploy1003>	helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply	[production]
16:38	<btullis@deploy1003>	helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply	[production]
16:04	<jhancock@cumin2002>	END (PASS) - Cookbook sre.dns.netbox (exit_code=0)	[production]
16:02	<jhancock@cumin2002>	START - Cookbook sre.dns.netbox	[production]
16:01	<jhancock@cumin2002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2045.codfw.wmnet with OS bookworm	[production]
16:01	<jhancock@cumin2002>	END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"	[production]
15:58	<jhancock@cumin2002>	START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"	[production]
15:42	<jhancock@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2045.codfw.wmnet with reason: host reimage	[production]
15:40	<jhancock@cumin2002>	START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2045.codfw.wmnet with reason: host reimage	[production]
15:34	<hnowlan@deploy1003>	helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply	[production]
15:34	<hnowlan@deploy1003>	helmfile [eqiad] START helmfile.d/services/mw-cron: apply	[production]
15:29	<jhancock@cumin2002>	START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm	[production]
15:29	<jhancock@cumin2002>	START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm	[production]
15:28	<jhancock@cumin2002>	START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm	[production]
15:02	<jhancock@cumin2002>	END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
15:00	<jhancock@cumin2002>	END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
15:00	<jhancock@cumin2002>	END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
14:55	<jhancock@cumin2002>	START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
14:54	<jhancock@cumin2002>	START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
14:54	<jhancock@cumin2002>	START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART	[production]
14:52	<jhancock@cumin2002>	END (PASS) - Cookbook sre.dns.netbox (exit_code=0)	[production]