Switchover s3 master (db1157 -> db1123)
Closed, Resolved, Public

Description

When: Tuesday 29th March 2022 - 06:00 AM UTC
Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

NEW primary: db1123
OLD primary: db1157
  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1157.eqiad.wmnet h=db1123.eqiad.wmnet
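
A minimal sketch of turning that comparison into a go/no-go check. It assumes pt-config-diff exits 0 when the configurations match and 1 when they differ; `config_gate` is a hypothetical helper, not part of any existing tooling.

```shell
#!/bin/sh
# config_gate: run the given command and map its exit status to a message.
# Assumption: 0 = no differences, 1 = differences, anything else = error.
config_gate() {
    rc=0
    "$@" || rc=$?
    case "$rc" in
        0) echo "configs match" ;;
        1) echo "differences found: review before promoting" ;;
        *) echo "comparison failed with exit status $rc" ;;
    esac
    return "$rc"
}

# Usage (command taken from the step above):
# config_gate sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=db1157.eqiad.wmnet h=db1123.eqiad.wmnet
```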

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T301850" 'A:db-section-s3'
  • Set NEW primary with weight 0
sudo dbctl instance db1123 set-weight 0
sudo dbctl config commit -m "Set db1123 with weight 0 T301850"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move db1157 db1123
  • Disable puppet on both nodes
sudo cumin 'db1157* or db1123*' 'disable-puppet "primary switchover T301850"'
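
The prep commands above differ between switchovers only in the host names, section and task ID, so they can be generated from a template. A sketch, assuming a hypothetical helper `print_prep_steps`; it only prints the commands for review and runs nothing.

```shell
#!/bin/sh
# print_prep_steps OLD NEW SECTION TASK
# Prints the failover prep commands from this checklist with the values filled in.
print_prep_steps() {
    old="$1"; new="$2"; section="$3"; task="$4"
    if [ -z "$old" ] || [ -z "$new" ] || [ "$old" = "$new" ]; then
        echo "refusing: OLD and NEW must be set and different" >&2
        return 1
    fi
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Usage:
# print_prep_steps db1157 db1123 s3 T301850
```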

Failover:

  • Log the failover:
!log Starting s3 eqiad failover from db1157 to db1123 - T301850
  • Set section read-only:
sudo dbctl --scope eqiad section s3 ro "Maintenance until 06:15 UTC - T301850"
sudo dbctl config commit -m "Set s3 eqiad as read-only for maintenance - T301850"
  • Check s3 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1157 db1123
echo "===== db1157 (OLD)"; sudo db-mysql db1157 -e 'show slave status\G'
echo "===== db1123 (NEW)"; sudo db-mysql db1123 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s3 set-master db1123
sudo dbctl --scope eqiad section s3 rw
sudo dbctl config commit -m "Promote db1123 to s3 primary and set section read-write T301850"
  • Restart puppet on both hosts:
sudo cumin 'db1157* or db1123*' 'run-puppet-agent -e "primary switchover T301850"'
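
The two `show slave status\G` calls in the "Switch primaries" step are read by eye. A sketch of checking the old primary's output automatically, assuming the standard `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields; `check_replica` is a hypothetical helper.

```shell
#!/bin/sh
# check_replica EXPECTED_MASTER
# Reads 'show slave status\G' output on stdin. Prints "ok" and returns 0 only
# when the host replicates from EXPECTED_MASTER and both replication threads run.
check_replica() {
    expected_master="$1"
    status="$(cat)"
    master="$(printf '%s\n' "$status" | sed -n 's/^ *Master_Host: *//p')"
    io="$(printf '%s\n' "$status" | sed -n 's/^ *Slave_IO_Running: *//p')"
    sql="$(printf '%s\n' "$status" | sed -n 's/^ *Slave_SQL_Running: *//p')"
    case "$master" in
        "$expected_master"*) ;;
        *) echo "unexpected master: ${master:-none}"; return 1 ;;
    esac
    if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
        echo "ok"
    else
        echo "replication threads not running (IO=$io SQL=$sql)"
        return 1
    fi
}

# Usage: after the switch, the OLD primary should replicate from the NEW one.
# sudo db-mysql db1157 -e 'show slave status\G' | check_replica db1123
```

The new primary (db1123) is expected to return no replication status at all, so the same helper reports "unexpected master: none" for it.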

Clean up tasks:

  • Clean up heartbeat table(s) (delete from heartbeat.heartbeat where server_id=171966508)
  • Change events for query killer:
events_coredb_master.sql on the new primary db1123
events_coredb_slave.sql on the new slave db1157
  • Update the candidate primary in dbctl and orchestrator:
sudo dbctl instance db1157 set-candidate-master --section s3 true
sudo dbctl instance db1123 set-candidate-master --section s3 false
(dborch1001): sudo orchestrator-client -c untag -i db1123 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1157 --tag name=candidate
  • Depool the old primary:
sudo dbctl instance db1157 depool
sudo dbctl config commit -m "Depool db1157 T301850"
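
The heartbeat cleanup deletes rows by a hard-coded server_id, so a typo there would remove the wrong rows. A sketch of building the statement from a validated value; `heartbeat_cleanup_sql` is a hypothetical helper that only prints the SQL.

```shell
#!/bin/sh
# heartbeat_cleanup_sql SERVER_ID
# Prints the DELETE statement from the checklist, refusing anything that is
# not a plain run of digits.
heartbeat_cleanup_sql() {
    server_id="$1"
    case "$server_id" in
        ''|*[!0-9]*)
            echo "refusing: server_id must contain digits only" >&2
            return 1
            ;;
    esac
    printf 'DELETE FROM heartbeat.heartbeat WHERE server_id=%s;\n' "$server_id"
}

# Usage: compare the ID with the old primary's own value before deleting.
# sudo db-mysql db1157 -e 'select @@server_id'
# heartbeat_cleanup_sql 171966508
```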

Event Timeline

Marostegui triaged this task as Medium priority. Feb 16 2022, 8:29 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to Blocked on the DBA board.
Marostegui updated the task description.

Going to reboot db1123 (future master) so it can pick up the new kernel (T303174)

Future master rebooted with the new kernel.

Trizek-WMF subscribed.

Now that we changed the process in T303605: Stop announcing and scheduling primary database switchovers, you have to use this tag in your switchover tasks. :)

Thanks - I will change the other task we've scheduled for the 31st then.

Change 774373 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1123 to s3 master

https://gerrit.wikimedia.org/r/774373

Change 774374 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s3 master CNAME

https://gerrit.wikimedia.org/r/774374

Mentioned in SAL (#wikimedia-operations) [2022-03-29T05:02:02Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 20 hosts with reason: Primary switchover s3 T301850

Mentioned in SAL (#wikimedia-operations) [2022-03-29T05:02:16Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 20 hosts with reason: Primary switchover s3 T301850

Mentioned in SAL (#wikimedia-operations) [2022-03-29T05:02:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1123 with weight 0 T301850', diff saved to https://phabricator.wikimedia.org/P23438 and previous config saved to /var/cache/conftool/dbconfig/20220329-050234-root.json

Change 774373 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1123 to s3 master

https://gerrit.wikimedia.org/r/774373

Mentioned in SAL (#wikimedia-operations) [2022-03-29T06:00:08Z] <marostegui> Starting s3 eqiad failover from db1157 to db1123 - T301850

Mentioned in SAL (#wikimedia-operations) [2022-03-29T06:00:24Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T301850', diff saved to https://phabricator.wikimedia.org/P23448 and previous config saved to /var/cache/conftool/dbconfig/20220329-060024-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-03-29T06:00:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1123 to s3 primary and set section read-write T301850', diff saved to https://phabricator.wikimedia.org/P23449 and previous config saved to /var/cache/conftool/dbconfig/20220329-060059-marostegui.json

Change 774374 merged by Marostegui:

[operations/dns@master] wmnet: Update s3 master CNAME

https://gerrit.wikimedia.org/r/774374

Mentioned in SAL (#wikimedia-operations) [2022-03-29T06:05:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1157 T301850', diff saved to https://phabricator.wikimedia.org/P23450 and previous config saved to /var/cache/conftool/dbconfig/20220329-060532-root.json

The switchover is done. Read-only time was 35 seconds.
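
The 35 seconds matches the gap between the two dbctl commits logged in SAL above (read-only at 06:00:24, read-write at 06:00:59), assuming that is how it was measured. A sketch of the arithmetic with hypothetical helper names:

```shell
#!/bin/sh
# hms_to_seconds HH:MM:SS
# Converts a time of day to seconds since midnight.
hms_to_seconds() {
    old_ifs="$IFS"; IFS=:
    set -- $1
    IFS="$old_ifs"
    # Strip one leading zero so values like "08" are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# read_only_seconds START END
# Seconds between two times on the same day.
read_only_seconds() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

read_only_seconds 06:00:24 06:00:59   # prints 35
```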

Closing this ticket as all the pending schema changes have their own task, so they can be tracked there.