analytics SAL

2022-01-24 §
18:23	<razzi>	downtime an-coord1001 while attempting to fix /srv partition	[analytics]
11:48	<elukey>	roll restart of kafka test brokers to pick up the new keystore/tls-certs (1y of validity)	[analytics]
2022-01-22 §
08:36	<elukey>	`apt-get clean` on an-test-coord1001 to free some space	[analytics]
2022-01-21 §
01:03	<milimetric>	rerunning the eventlogging_to_druid_network_flows_internal-sanitization_daily timer that failed to get logs	[analytics]
2022-01-20 §
11:58	<btullis>	re-enabled puppet on all hive nodes, deploying the updated log4j configuration for parquet	[analytics]
11:36	<btullis>	temporarily disabling puppet on servers with hive installed T297734	[analytics]
07:49	<joal>	Rerun failed webrequest jobs (text and upload, 2022-01-19T19:00)	[analytics]
2022-01-19 §
15:44	<ottomata>	installing anaconda-wmf_2020.02~wmf6_amd64.deb on all analytics cluster nodes. - T292699	[analytics]
14:00	<ottomata>	installing anaconda-wmf_2020.02~wmf6_amd64.deb on stat1004 - T292699	[analytics]
2022-01-17 §
07:19	<elukey>	launch webrequest bundle from 2022-01-16T01:00 (first hour missing for text) - 0003712-220113112502223-oozie-oozi-B	[analytics]
07:17	<elukey>	kill webrequest bundle, text coordinator failed (logs/info/etc.: https://hue.wikimedia.org/hue/jobbrowser/#!id=0024621-210701181527401-oozie-oozi-B)	[analytics]
07:13	<elukey>	umount/mount /mnt/hdfs on an-coord1001 to pick up java upgrades	[analytics]
2022-01-16 §
16:43	<elukey>	`elukey@an-launcher1002:~$ sudo systemctl reset-failed eventlogging_to_druid_network_internal_flows-sanitization_daily.service eventlogging_to_druid_network_internal_flows_daily.service eventlogging_to_druid_network_internal_flows_hourly.service`	[analytics]
2022-01-13 §
12:41	<joal>	rerun failed instances of webrequest-load-coord	[analytics]
11:59	<btullis>	stopped eventlogging service on eventlog1003 with 1 hour's downtime.	[analytics]
11:52	<btullis>	Upgrading hive packages on stat1005	[analytics]
11:26	<btullis>	restarted hive-metastore and hive-server2 on an-coord1001 after running puppet.	[analytics]
11:23	<btullis>	btullis@an-coord1001:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie oozie-client	[analytics]
11:18	<btullis>	btullis@an-coord1002:~$ sudo systemctl restart hive-metastore hive-server2	[analytics]
09:53	<btullis>	DNS change deployed, failing over hive to an-coord1002.	[analytics]
09:42	<btullis>	btullis@an-coord1002:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie-client	[analytics]
08:45	<joal>	Kill-restart wikidata-json_entity-weekly-coord after deploy	[analytics]
2022-01-12 §
21:13	<joal>	Deploying refinery to HDFS	[analytics]
20:46	<joal>	Deploying refinery with scap	[analytics]
20:35	<joal>	refinery-source v0.1.24 released on archiva	[analytics]
11:25	<elukey>	move kafka-jumbo nodes to fixed kafka uid/gid	[analytics]
07:46	<elukey>	`systemctl reset-failed product-analytics-movement-metrics.service` on stat1007	[analytics]
2022-01-10 §
13:56	<btullis>	Upgrading oozie packages on an-test-coord1001 to test new log4j versions	[analytics]
2022-01-08 §
10:51	<elukey>	start hive-server2 on an-coord1002 - failed to connect to the metastore due to restart	[analytics]
10:41	<elukey>	restart hive daemons on an-coord1002 (after my last upgrade/rollback of packages the prometheus agent settings were not picked up, so no metrics)	[analytics]
2022-01-07 §
20:16	<ottomata>	altering hive table MobileWikiAppiOSUserHistory field event.device_level_enabled to string - T298721	[analytics]
17:29	<btullis>	deployed updated hive packages to an-test-worker100[1-3] and an-test-ui1001	[analytics]
14:52	<btullis>	root@aqs1014:~# jmap -dump:live,format=b,file=/srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof 4468	[analytics]
2022-01-06 §
18:02	<btullis>	btullis@aqs1010:~$ sudo systemctl restart cassandra-a.service	[analytics]
12:22	<btullis>	restarting cassandra-a service on aqs1004.eqiad.wmnet in order to troubleshoot logging.	[analytics]
11:24	<btullis>	restarting cassandra-a service on aqs1010.eqiad.wmnet in order to troubleshoot logging.	[analytics]
08:12	<joal>	Rerun failed webrequest-load-wf-text-2022-1-6-7	[analytics]
07:58	<joal>	Rerun refine_event_sanitized_analytics_immediate missing hours after errors from the past days	[analytics]
07:39	<joal>	Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-05T2[123]:00:00 and 2022-01-06T00:00:00, dropping malformed rows as discussed with schema owner	[analytics]
2022-01-05 §
19:16	<joal>	Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-04T1[5789]:00:00, dropping malformed rows as discussed with schema owner	[analytics]
11:37	<btullis>	Upgrading hive on an-test-client1001 in order to test log4j upgrade	[analytics]
11:35	<btullis>	Upgrading hive packages on an-test-coord1001 to test log4j changes.	[analytics]
2022-01-04 §
10:39	<elukey>	restart cassandra-a on aqs1010 (heap size used in full, high GC)	[analytics]
10:20	<elukey>	restart cassandra-a on aqs1015 (heap size used in full, high GC)	[analytics]
2022-01-03 §
18:26	<joal>	rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1	[analytics]
16:08	<joal>	Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1	[analytics]
11:26	<elukey>	restart cassandra-b on aqs1015 (instance not responding, probably thrashing)	[analytics]
11:16	<elukey>	restart cassandra-b on aqs1010 (stuck thrashing)	[analytics]
10:34	<elukey>	depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra-b instance	[analytics]
10:22	<elukey>	powercycle an-worker1114 (CPU soft lockup errors in mgmt console)	[analytics]