| 2015-08-31
      
      | 
    
  | 21:21 | <valhallasw`cloud> | webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest) | [tools] | 
            
  | 21:20 | <valhallasw`cloud> | restarted webservicemonitor | [tools] | 
            
  | 21:19 | <valhallasw`cloud> | seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2 | [tools] | 
            
  | 21:18 | <valhallasw`cloud> | running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running | [tools] | 
            
  | 21:15 | <valhallasw`cloud> | several webservices seem to actually have not gotten back online?! what on earth is going on. | [tools] | 
            
  | 21:10 | <valhallasw`cloud> | some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again | [tools] | 
            
  | 20:29 | <valhallasw`cloud> | |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time. | [tools] | 
            
  | 20:25 | <valhallasw`cloud> | ca 500 jobs @ 5s/job = approx 40 minutes | [tools] | 
            
  | 20:23 | <valhallasw`cloud> | doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh* | [tools] | 
            
  | 20:21 | <valhallasw`cloud> | now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues | [tools] | 
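
The 5-second pacing and the rough 40-minute estimate in these entries can be sketched as below. This is only an illustration, not the command that was actually run: the log does not say how jobs were rescheduled, so the command is passed in as an argument, and the names `paced_run` and `estimate_minutes` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per job ID read from stdin,
# sleeping a fixed number of seconds after each run.
# Usage: paced_run INTERVAL_SECONDS COMMAND [ARGS...] < job_ids
paced_run() {
    interval=$1
    shift
    while read -r job_id; do
        [ -n "$job_id" ] || continue        # skip blank lines
        "$@" "$job_id" < /dev/null          # keep the command from eating the ID list
        sleep "$interval"
    done
}

# Rough duration in whole minutes (rounded up) for N jobs at S seconds per job.
estimate_minutes() {
    echo $(( ($1 * $2 + 59) / 60 ))
}

# Usage sketch (assumption: `qmod -rj` is the reschedule command; the file name is made up):
#   paced_run 5 qmod -rj < sorted_job_ids.txt
#   estimate_minutes 500 5
```

Passing the command as an argument makes it possible to dry-run the loop with `echo` before touching the grid.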
            
  | 19:36 | <valhallasw`cloud> | last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs | [tools] | 
            
  | 19:35 | <valhallasw`cloud> | one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi? | [tools] | 
            
  | 19:31 | <valhallasw`cloud> | https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues | [tools] | 
            
  | 07:31 | <valhallasw`cloud> | removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs) | [tools] | 
            
  
    | 2015-08-18
      
      | 
    
  | 13:57 | <valhallasw`cloud> | same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs. | [tools] | 
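
A quick way to spot this kind of mismatch is to look for short host names that show up under more than one fully qualified name. The sketch below assumes a list of FQDNs, one per line, on stdin (for example collected from `qconf -sel` plus the host group entries); the function name `ambiguous_hosts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read FQDNs (one per line) on stdin and print every short
# host name that appears with more than one distinct FQDN.
ambiguous_hosts() {
    sort -u |          # collapse exact duplicate FQDNs
        cut -d. -f1 |  # keep only the first label (short host name)
        sort |
        uniq -d        # print short names that occur more than once
}
```

Fed the names from this entry, it would report `tools-exec-1401` and `tools-exec-catscan`, since each appears both with and without the `.tools` label.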
            
  | 13:55 | <valhallasw`cloud> | no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that DNS mess as well. | [tools] | 
            
  | 13:54 | <valhallasw`cloud> | tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state. | [tools] | 
            
  | 13:47 | <valhallasw`cloud> | that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state | [tools] | 
            
  | 13:46 | <valhallasw`cloud> | starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using <code>for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done</code> | [tools] | 
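
The one-liner in this entry relies on `<name>` sitting within six lines above `<state>` in the `qstat -f -xml` output. A minimal sketch of a variant that does not depend on that distance is shown below. It assumes each queue instance lists `<name>queue@host.domain</name>` somewhere before its `<state>` element; the function name `au_hosts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f -xml` output on stdin and print the short
# host name of every queue instance whose state starts with "au".
# Assumes each instance's <name>queue@host.domain</name> precedes its <state>.
au_hosts() {
    awk '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name)      # drop everything up to the opening tag
            sub(/<\/name>.*/, "", name)    # drop the closing tag and what follows
        }
        /<state>au/ {
            host = name
            sub(/.*@/, "", host)           # drop the queue part before "@"
            sub(/\..*/, "", host)          # keep only the short host name
            print host
        }
    ' | sort -u                            # one line per host, even with several queues
}

# Usage sketch (mirrors the loop in the log entry):
#   for h in $(qstat -f -xml | au_hosts); do echo "$h"; ssh "$h" sudo service gridengine-exec start; done
```

Unlike the original pipeline, this prints each host once even when several of its queues are in the 'au' state.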
            
  | 08:37 | <valhallasw`cloud> | sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs | [tools] | 
            
  | 08:33 | <valhallasw`cloud> | tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available) | [tools] | 
            
  | 08:30 | <valhallasw`cloud> | hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config | [tools] | 
            
  | 08:21 | <valhallasw`cloud> | still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" | [tools] | 
            
  | 08:20 | <valhallasw`cloud> | sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs | [tools] |