2023-03-07 §
14:59 <nfraison> force startup of nodemanager on analytics_cluster [analytics]
14:58 <btullis> pooled druid1004 [analytics]
14:57 <btullis> pooling aqs1010 and aqs1016 [analytics]
14:56 <btullis> pooling datahubsearch1001 [analytics]
14:53 <btullis> leaving safe mode on hdfs [analytics]
13:59 <btullis> disabled puppet temporarily on an-master100[1-2] to avoid an automatic restart of yarn [analytics]
13:57 <btullis> stopped `hadoop-yarn-resourcemanager.service` on both an-master100[1-2] [analytics]
13:54 <btullis> entering safe mode with `sudo -u hdfs kerberos-run-command hdfs hadoop dfsadmin -safemode enter` on an-master1002 [analytics]
12:57 <btullis> depooled druid1004 for T329073 [analytics]
12:56 <btullis> depooled datahubsearch1001 for T329073 [analytics]
12:51 <btullis> disabled gobblin timers on an-launcher1002 [analytics]
12:46 <btullis> depooling aqs1016 for T329073 [analytics]
12:45 <btullis> depooling aqs1010 for T329073 [analytics]
08:00 <nfraison> Reimage an-conf1003 to upgrade to bullseye T329362 [analytics]
2023-03-06 §
23:12 <mforns> deployed airflow analytics to unbreak druid-load-edit-hourly [analytics]
15:26 <mforns> deployed airflow analytics to unbreak druid-load-edit-hourly [analytics]
13:53 <btullis> failing over the production hadoop cluster namenode service to an-master1002 [analytics]
13:17 <btullis> failing over analytics test cluster namenode service to an-test-master1002 T329073 [analytics]
12:26 <nfraison> Reimage an-conf1002 to upgrade to bullseye T329362 [analytics]
10:15 <ottomata> deploy mediawiki_history_reduced_2023_02 snapshot to AQS [analytics]
09:23 <nfraison> Reimage an-conf1001 to upgrade to bullseye T329362 [analytics]
2023-03-03 §
16:48 <xcollazo> Deleted snapshot=2023-02-20 for tables image_suggestions_search_index_full, image_suggestions_search_index_delta, image_suggestions_lead_image_data and image_suggestions_wikidata_data from the analytics_platform_eng schema. This data will be regenerated. See https://phabricator.wikimedia.org/T330688. [analytics]
15:53 <mforns> deployed airflow analytics to unbreak edit_hourly_dag [analytics]
15:44 <xcollazo> Deploying latest image_suggestions DAG on platform_eng Airflow instance [analytics]
07:29 <elukey> truncate /var/log/auth.log.1 on krb1001 to free space (root partition almost filled up) [analytics]
2023-03-02 §
13:27 <nfraison> airflow on an-test-client1001 is migrated to version 2.5.1 [analytics]
12:32 <joal> Rerun mediawiki-history-denormalize-wf-2023-02 [analytics]
10:00 <btullis> commencing second attempt to upgrade airflow on an-test-client1001 to version 2.5.1 [analytics]
2023-03-01 §
22:45 <mforns> re-deployed airflow analytics with some forgotten changes [analytics]
22:42 <mforns> deployed Airflow analytics [analytics]
22:30 <mforns> finished refinery deployment, although didn't manage to run refinery-deploy-to-hdfs without warnings... [analytics]
21:48 <mforns> kill edit-hourly-coord in Hue to migrate it to Airflow [analytics]
21:26 <mforns> starting refinery deploy [analytics]
19:38 <SandraEbele> rerunning webrequest load text for 2023-03-01-08 hour. [analytics]
18:54 <joal> Create empty partitions in event.mediawiki_page_move table for codfw datacenter from beginning of week (2023-02-27T00 -> 2023-02-28T13) [analytics]
10:25 <nfraison> rebooting an-worker1132, which is slower than other nodes (potential issue with raid card/disks) [analytics]
07:59 <nfraison> restarted hiveserver2 in analytics-test to take into account -XX:MaxMetaspaceSize=512m JVM parameter [analytics]
2023-02-28 §
21:33 <xcollazo> Deploying section_image_recommendations DAG to platform_eng Airflow instance [analytics]
11:38 <btullis> cancelled merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128 [analytics]
11:32 <btullis> merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128 [analytics]
09:42 <nfraison> restart presto prod coordinator to take into account heap size change [analytics]
09:38 <nfraison> Failover hive servers to active server: an-coord1001 [analytics]
09:32 <nfraison> restarted hive-metastore and hiveserver2 on an-coord1001 (non-active hive server) [analytics]
08:22 <nfraison> Failover hive servers to standby server: https://gerrit.wikimedia.org/r/c/operations/dns/+/892460 [analytics]
2023-02-27 §
14:52 <nfraison> restarted hive-metastore and hiveserver2 on an-coord1002 (standby hive server) [analytics]
2023-02-22 §
19:39 <mforns> restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer [analytics]
11:07 <nfraison> roll restart presto clusters to take into account fix on node.environment typo [analytics]
2023-02-21 §
19:01 <mforns> re airflow silent failure: the job was pageview_actor_hourly [analytics]
19:00 <mforns> we had another silent failure in airflow, a sensor that failed without sending an email. the logs are missing. [analytics]
09:33 <nfraison> adding last batch of 5 nodes to the presto prod cluster [analytics]