2015-10-09 §
14:37 <Coren> Beginning rotation of execution nodes to apply fix for T106170 [tools]
2015-10-06 §
04:35 <yuvipanda> created tools-puppetmaster-02 as hot spare [tools]
2015-10-01 §
23:38 <yuvipanda> actually rebooting tools-worker-02, had actually rebooted -01 earlier #facepalm [tools]
23:20 <yuvipanda> rebooting tools-worker-02 to pickup new kernel [tools]
23:10 <yuvipanda> failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel [tools]
22:58 <yuvipanda> rebooted tools-proxy-02 to pick up new kernel [tools]
2015-09-30 §
07:12 <yuvipanda> deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now [tools]
06:40 <yuvipanda> migrated webproxy to tools-proxy-01 [tools]
2015-09-28 §
15:24 <Coren> rebooting tools-shadow after mount option changes. [tools]
2015-09-23 §
18:22 <valhallasw`cloud> here = https://etherpad.wikimedia.org/p/74j8K2zIob [tools]
18:22 <valhallasw`cloud> experimenting with https://github.com/jordansissel/fpm on tools-packages, and manually installing packages for that. Noting them here. [tools]
2015-09-16 §
01:17 <YuviPanda> attempting to move grrrit-wm to kubernetes [tools]
01:17 <YuviPanda> attempting to move to kubernetes [tools]
2015-09-08 §
08:05 <valhallasw`cloud> Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated. Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated. [tools]
08:04 <valhallasw`cloud> added all packages in data/project/.system/deb-precise to aptly repo precise-tools [tools]
08:03 <valhallasw`cloud> added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools [tools]
2015-09-07 §
18:49 <valhallasw`cloud> ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot [tools]
18:47 <valhallasw`cloud> switched static webserver to tools-static-02 [tools]
18:45 <valhallasw`cloud> weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting. [tools]
17:57 <YuviPanda> created tools-k8s-master-01 with jessie, will be etcd and kubernetes master [tools]
2015-09-03 §
07:09 <valhallasw`cloud> and just re-running puppet solves the issue. Sigh. [tools]
07:09 <valhallasw`cloud> last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file [tools]
07:07 <valhallasw`cloud> err, is empty. [tools]
07:07 <valhallasw`cloud> Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?! [tools]
2015-09-02 §
13:58 <valhallasw`cloud> rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load [tools]
13:55 <valhallasw`cloud> restarted gridengine_exec on tools-exec-1403 [tools]
13:53 <valhallasw`cloud> tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py. Rescheduled that job. [tools]
13:16 <YuviPanda> deleted all jobs of ralgisbot [tools]
13:12 <YuviPanda> suspended all jobs in ralgisbot temporarily [tools]
12:57 <YuviPanda> rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles [tools]
2015-09-01 §
21:01 <valhallasw`cloud> killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately. [tools]
15:47 <valhallasw`cloud> git reset --hard cdnjs on tools-web-static-01 [tools]
06:23 <valhallasw`cloud> seems to have worked. SGE :( [tools]
06:17 <valhallasw`cloud> going to restart sge_qmaster, hoping this solves the issue :/ [tools]
06:07 <valhallasw`cloud> e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?! [tools]
06:06 <valhallasw`cloud> test job does not get submitted because all queues are overloaded?! [tools]
06:06 <valhallasw`cloud> investigating SGE issues reported on irc/email [tools]
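Note on the 06:07 entry: the quoted figures are consistent with the scheduler adding a per-job load adjustment, divided over the cores, to the measured load, i.e. 0.07 + 0.50 * 14 / 4 = 1.82. That would explain why the queue counts as overloaded while the actual load is low. A minimal sketch of that arithmetic, assuming this reading of the message; the function and parameter names are hypothetical and are not SGE identifiers:

```python
def adjusted_np_load(measured: float, adjustment: float,
                     recent_jobs: float, nproc: int) -> float:
    """Measured per-core load plus a per-job adjustment spread over nproc cores.

    Assumption: this mirrors how the quoted scheduler message combines its
    numbers; it is not taken from the SGE source code.
    """
    return measured + adjustment * recent_jobs / nproc


# Figures quoted in the 06:07 log entry.
value = adjusted_np_load(measured=0.07, adjustment=0.50, recent_jobs=14.0, nproc=4)
threshold = 1.75

print(round(value, 2))     # 1.82
print(value >= threshold)  # True: the queue instance is treated as overloaded
```

Under this reading, the measured part (0.07) is negligible; almost all of the 1.82 comes from the adjustment term.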
2015-08-31 §
21:21 <valhallasw`cloud> webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest) [tools]
21:20 <valhallasw`cloud> restarted webservicemonitor [tools]
21:19 <valhallasw`cloud> seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2 [tools]
21:18 <valhallasw`cloud> running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running [tools]
21:15 <valhallasw`cloud> several webservices seem to actually have not gotten back online?! what on earth is going on. [tools]
21:10 <valhallasw`cloud> some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again [tools]
20:29 <valhallasw`cloud> |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time. [tools]
20:25 <valhallasw`cloud> ca 500 jobs @ 5s/job = approx 40 minutes [tools]
20:23 <valhallasw`cloud> doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh* [tools]
20:21 <valhallasw`cloud> now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues [tools]
19:36 <valhallasw`cloud> last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs [tools]
19:35 <valhallasw`cloud> one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi? [tools]
19:31 <valhallasw`cloud> https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues [tools]
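Note on the 19:31, 20:21 and 20:25 entries: the approach described there (walk a list of job ids and reschedule one at a time with a fixed pause) can be sketched as below. This is an illustration under assumptions, not the script that was actually run: `reschedule_throttled`, `estimated_minutes` and the `reschedule` callable are hypothetical names, and the log does not say which command was used (on Grid Engine it might be something like `qmod -rj <job id>`).

```python
import time
from typing import Callable, Iterable, List


def estimated_minutes(job_count: int, interval_s: float) -> float:
    """Rough wall-clock estimate for rescheduling job_count jobs."""
    return job_count * interval_s / 60.0


def reschedule_throttled(job_ids: Iterable[int],
                         reschedule: Callable[[int], None],
                         interval_s: float = 5.0,
                         sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Call reschedule(job_id) for each id, pausing interval_s between calls.

    Returns the ids handled so far, so a run that is interrupted can be
    resumed from the last restarted job.
    """
    done: List[int] = []
    for index, job_id in enumerate(job_ids):
        if index > 0:
            sleep(interval_s)  # throttle so the scheduler is not flooded
        reschedule(job_id)     # hypothetical hook, e.g. a subprocess call
        done.append(job_id)
    return done


# Roughly 41.7 minutes, in line with the "approx 40 minutes" estimate.
print(estimated_minutes(500, 5))

# Dry run with made-up job ids: print instead of rescheduling, and skip the real sleep.
handled = reschedule_throttled([101, 102, 103], reschedule=print,
                               interval_s=5.0, sleep=lambda seconds: None)
```

Passing the pause and the rescheduling action in as parameters keeps the sketch testable without touching a real grid.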