| 
      
        2016-01-27
      
     | 
  
    
  | 18:26 | 
  <valhallasw`cloud> | 
  messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate | 
  [tools] | 
            
  | 18:24 | 
  <valhallasw`cloud> | 
  'sleep' test job also seems to work without issues | 
  [tools] | 
            
  | 18:23 | 
  <valhallasw`cloud> | 
  no errors in log file, qstat works | 
  [tools] | 
            
  | 18:23 | 
  <chasemp> | 
  master sge restarted post dump and restart for jobs db | 
  [tools] | 
            
  | 18:22 | 
  <valhallasw`cloud> | 
  messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' | 
  [tools] | 
            
  | 18:20 | 
  <chasemp> | 
  master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job | 
  [tools] | 
            
  | 18:19 | 
  <valhallasw`cloud> | 
  dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M | 
  [tools] | 
            
  | 18:17 | 
  <valhallasw`cloud> | 
  SGE Configuration successfully saved to /root/sge_maint_01272016 directory. | 
  [tools] | 
            
  | 18:14 | 
  <chasemp> | 
  grid master stopped | 
  [tools] | 
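
The qmaster message quoted in the 18:26 entry has a regular shape, so the job ID and queue instance can be pulled out of the `messages` file mechanically when hunting for similar ghost jobs. The following is a minimal Python sketch, assuming every such line follows exactly the wording quoted above. The function name `parse_ghost_job` and the field name `role` (for the `master` part after the slash) are illustrative labels, not gridengine terminology taken from this log.

```python
import re

# Pattern mirrors the single message quoted in the 18:26 entry.
# Other qmaster message variants are NOT covered (assumption).
GHOST_JOB = re.compile(
    r'reports running job \((?P<job>\d+)\.(?P<task>\d+)/(?P<role>[^)]+)\) '
    r'in queue "(?P<queue>[^@"]+)@(?P<host>[^"]+)" '
    r'that was not supposed to be there'
)


def parse_ghost_job(line):
    """Return job/queue details for a matching line, or None if no match."""
    m = GHOST_JOB.search(line)
    if m is None:
        return None
    return {
        "job_id": int(m["job"]),
        "task_id": int(m["task"]),
        "role": m["role"],      # hypothetical label for the part after "/"
        "queue": m["queue"],
        "host": m["host"],
    }


# The line below is copied from the 18:26 log entry.
SAMPLE = (
    '01/27/2016 18:26:17|worker|tools-grid-master|E|'
    'execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job '
    '(2551539.1/master) in queue '
    '"webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" '
    'that was not supposed to be there - killing'
)

if __name__ == "__main__":
    print(parse_ghost_job(SAMPLE))
```

Running the parser over the whole `messages` file and collecting the distinct `(job_id, host)` pairs would show whether the reports concern one job on one node or something wider.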
            
  
    | 
      
        2016-01-21
      
     | 
  
    
  | 22:24 | 
  <YuviPanda> | 
  deleted tools-redis-01 and -02 (are on 1001 and 1002 now) | 
  [tools] | 
            
  | 21:13 | 
  <YuviPanda> | 
  repooled exec nodes on labvirt1010 | 
  [tools] | 
            
  | 21:08 | 
  <YuviPanda> | 
  gridengine-master started, verified shadow hasn't started | 
  [tools] | 
            
  | 21:00 | 
  <YuviPanda> | 
  stop gridengine master | 
  [tools] | 
            
  | 20:51 | 
  <YuviPanda> | 
  correction: last message should read "repooled exec nodes on labvirt1007" | 
  [tools] | 
            
  | 20:51 | 
  <YuviPanda> | 
  repooled exec nodes on labvirt1006 | 
  [tools] | 
            
  | 20:39 | 
  <YuviPanda> | 
  failover tools-static to tools-web-static-01 | 
  [tools] | 
            
  | 20:38 | 
  <YuviPanda> | 
  failover tools-checker to tools-checker-01 | 
  [tools] | 
            
  | 20:32 | 
  <YuviPanda> | 
  depooled exec nodes on 1007 | 
  [tools] | 
            
  | 20:32 | 
  <YuviPanda> | 
  repooled exec nodes on 1006 | 
  [tools] | 
            
  | 20:14 | 
  <YuviPanda> | 
  depooled all exec nodes in labvirt1006 | 
  [tools] | 
            
  | 20:11 | 
  <YuviPanda> | 
  repooled exec nodes on 1005 | 
  [tools] | 
            
  | 19:53 | 
  <YuviPanda> | 
  depooled exec nodes on labvirt1005 | 
  [tools] | 
            
  | 19:49 | 
  <YuviPanda> | 
  repooled exec nodes from labvirt1004 | 
  [tools] | 
            
  | 19:48 | 
  <YuviPanda> | 
  failed over proxy to tools-proxy-01 again | 
  [tools] | 
            
  | 19:31 | 
  <YuviPanda> | 
  depooled exec nodes from labvirt1004 | 
  [tools] | 
            
  | 19:29 | 
  <YuviPanda> | 
  repooled exec nodes from labvirt1003 | 
  [tools] | 
            
  | 19:13 | 
  <YuviPanda> | 
  depooled instances on labvirt1003 | 
  [tools] | 
            
  | 19:06 | 
  <YuviPanda> | 
  re-enabled queues on exec nodes that were on labvirt1002 | 
  [tools] | 
            
  | 19:02 | 
  <YuviPanda> | 
  failed over tools proxy to tools-proxy-02 | 
  [tools] | 
            
  | 18:46 | 
  <YuviPanda> | 
  drained and disabled queues on all nodes on labvirt1002 | 
  [tools] | 
            
  | 18:38 | 
  <YuviPanda> | 
  restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead | 
  [tools] | 
            
  
    | 
      
        2016-01-11
      
     | 
  
    
  | 22:19 | 
  <valhallasw`cloud> | 
  reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30 | 
  [tools] | 
            
  | 22:12 | 
  <YuviPanda> | 
  restarted gridengine master again | 
  [tools] | 
            
  | 22:07 | 
  <valhallasw`cloud> | 
  set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0 | 
  [tools] | 
            
  | 22:05 | 
  <valhallasw`cloud> | 
  set maxujobs back to 0, but doesn't help | 
  [tools] | 
            
  | 21:57 | 
  <valhallasw`cloud> | 
  reset to 7:30 | 
  [tools] | 
            
  | 21:57 | 
  <valhallasw`cloud> | 
  that cleared the measure, but jobs still not starting. Ugh! | 
  [tools] | 
            
  | 21:55 | 
  <valhallasw`cloud> | 
  set load_adjustment_decay_time = 0:0:0 | 
  [tools] | 
            
  | 21:45 | 
  <YuviPanda> | 
  restarted gridengine master | 
  [tools] | 
            
  | 21:43 | 
  <valhallasw`cloud> | 
  qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting | 
  [tools] | 
            
  | 21:42 | 
  <valhallasw`cloud> | 
  resetting to 0:7:30, as it's not having the intended effect | 
  [tools] | 
            
  | 21:41 | 
  <valhallasw`cloud> | 
  currently 353 jobs in qw state | 
  [tools] |
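
Because the 2016-01-11 entries change scheduler settings back and forth several times, it can help to compare the live configuration against the intended end state recorded in the 22:19 entry. The following is a rough Python sketch, assuming the scheduler configuration is available as plain `name value` lines (the shape `qconf -ssconf` prints). The sample text, `parse_sched_conf` and `diff_settings` are hypothetical illustrations. Only the two settings named in full in the 22:19 entry are checked; the truncated third setting is left out.

```python
def parse_sched_conf(text):
    """Parse 'name value' lines into a dict; lines without a value are skipped."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def diff_settings(conf, expected):
    """Return {name: (actual, wanted)} for every setting that differs."""
    return {
        name: (conf.get(name), wanted)
        for name, wanted in expected.items()
        if conf.get(name) != wanted
    }


# End state taken from the 22:19 entry (only the fully named settings).
EXPECTED = {
    "maxujobs": "128",
    "job_load_adjustments": "np_load_avg=0.50",
}

# Hypothetical sample resembling the state at 22:07; NOT real captured output.
SAMPLE = """\
maxujobs                 0
job_load_adjustments     none
"""

if __name__ == "__main__":
    for name, (actual, wanted) in diff_settings(parse_sched_conf(SAMPLE), EXPECTED).items():
        print(f"{name}: is {actual!r}, expected {wanted!r}")
```

An empty result from `diff_settings` would indicate that both settings match the values the log says they were reset to.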