analytics SAL

2018-02-13 §
11:42	<elukey>	force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)	[analytics]
2018-02-12 §
23:16	<elukey>	re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure)	[analytics]
23:13	<elukey>	manual restart of Yarn Node Managers on analytics1058/31	[analytics]
23:09	<elukey>	cleaned up tmp files on all analytics hadoop worker nodes, job filling up tmp	[analytics]
17:18	<elukey>	home dirs on stat1004 moved to /srv/home (/home symlinks to it)	[analytics]
17:15	<ottomata>	restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL)	[analytics]
14:46	<ottomata>	deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES	[analytics]
2018-02-09 §
12:19	<joal>	Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8	[analytics]
2018-02-08 §
16:23	<elukey>	stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one	[analytics]
2018-02-07 §
13:55	<joal>	Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01	[analytics]
13:49	<elukey>	restart overlord/middlemanager on druid1005	[analytics]
2018-02-06 §
19:40	<joal>	Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01	[analytics]
15:36	<elukey>	drain + shutdown of analytics1038 to replace faulty BBU	[analytics]
09:58	<elukey>	applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing	[analytics]
2018-02-05 §
15:51	<elukey>	live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291	[analytics]
11:03	<elukey>	restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero)	[analytics]
2018-02-02 §
17:09	<joal>	Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - they are false positives	[analytics]
09:56	<joal>	unique_devices-per_project_family-monthly-wf-2018-1 after failure	[analytics]
2018-02-01 §
17:00	<ottomata>	killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.	[analytics]
14:06	<joal>	Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives	[analytics]
12:17	<joal>	Restart cassandra monthly bundle after January deploy	[analytics]
2018-01-23 §
20:10	<ottomata>	hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419	[analytics]
09:26	<joal>	Dataloss warning for upload and text 2018-01-23:06 is confirmed to be a false positive	[analytics]
2018-01-22 §
17:36	<joal>	Kill-Restart clickstream oozie job after deploy	[analytics]
17:12	<joal>	deploying refinery onto HDFS	[analytics]
17:12	<joal>	Refinery deployed from scap	[analytics]
2018-01-18 §
19:11	<joal>	Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05	[analytics]
19:10	<joal>	Add fake data to cassandra to silence alarms (Thanks again ema)	[analytics]
18:56	<joal>	Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload	[analytics]
15:21	<mforns>	refinery deployment using scap and then deploying onto hdfs finished	[analytics]
15:07	<mforns>	starting refinery deployment	[analytics]
12:43	<elukey>	piwik on bohrium re-enabled	[analytics]
12:40	<elukey>	set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)	[analytics]
09:38	<elukey>	reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites	[analytics]
09:37	<elukey>	resumed druid hourly index jobs via hue and restored pivot's configuration	[analytics]
09:21	<elukey>	reboot druid1001 for kernel upgrades	[analytics]
09:00	<elukey>	suspended hourly druid batch index jobs via Hue	[analytics]
08:58	<elukey>	temporarily set druid1002 in superset's druid cluster config (via UI)	[analytics]
08:53	<elukey>	temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted)	[analytics]
08:52	<elukey>	disable druid1001's middlemanager as prep step for reboot	[analytics]
07:11	<elukey>	re-run webrequest-load-wf-misc-2018-1-18-3 via Hue	[analytics]
2018-01-17 §
17:33	<elukey>	killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)	[analytics]
17:29	<elukey>	restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)	[analytics]
16:24	<elukey>	re-run all the pageview-druid-hourly failed jobs via Hue	[analytics]
14:42	<elukey>	restart druid middlemanager on druid1003 as attempt to unblock realtime streaming	[analytics]
14:21	<elukey>	forced kill of banner impression data streaming job to get it restarted	[analytics]
11:44	<elukey>	re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)	[analytics]
11:44	<elukey>	restart druid middlemanager on druid1002	[analytics]
10:38	<elukey>	stopped all crons on hadoop-coordinator-1	[analytics]
10:37	<elukey>	re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)	[analytics]