@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge
@title Cluster: Databases
@group cluster

Configuring Phorge to use multiple database hosts.

Overview
========

You can deploy Phorge with multiple database hosts, configured as a master
and a set of replicas. The advantages of doing this are:

  - faster recovery from disasters by promoting a replica;
  - graceful degradation if the master fails; and
  - some tools to help monitor and manage replica health.

This configuration is complex, and many installs do not need to pursue it.

If you lose the master, Phorge can degrade automatically into read-only
mode and remain available, but can not fully recover without operational
intervention unless the master recovers on its own.

Phorge will not currently send read traffic to replicas unless the master
has failed, so configuring a replica will not currently spread any load away
from the master. Future versions of Phorge are expected to be able to
distribute some read traffic to replicas.

Phorge can not currently be configured into a multi-master mode, nor can
it be configured to automatically promote a replica to become the new master.
There are no current plans to support multi-master mode or autonomous failover,
although this may change in the future.

Phorge applications //can// be partitioned across multiple database
masters. This does not provide redundancy and generally does not increase
resilience or resistance to data loss, but can help you scale and operate
Phorge. For details, see
@{article:Cluster: Partitioning and Advanced Configuration}.


Setting up MySQL Replication
============================

To begin, set up a replica database server and configure MySQL replication.

If you aren't sure how to do this, refer to the MySQL manual for instructions.
The MySQL documentation is comprehensive and walks through the steps and
options in good detail.
You should understand MySQL replication before
deploying it in production: Phorge layers on top of it, and does not
attempt to abstract it away.

Some useful notes for configuring replication for Phorge:

**Binlog Format**: Phorge issues some queries which MySQL will detect as
unsafe if you use the `STATEMENT` binlog format (the default). Instead, use
`MIXED` (recommended) or `ROW` as the `binlog_format`.

**Grant `REPLICATION CLIENT` Privilege**: If you give the user that Phorge
will use to connect to the replica database server the `REPLICATION CLIENT`
privilege, Phorge's status console can give you more information about
replica health and state.

**Copying Data to Replicas**: Phorge uses InnoDB tables if available.
Phorge still allows MyISAM for older database server software.

An approach you may want to consider to limit downtime and produce a dump is
to leave Phorge running but configured in read-only mode while dumping:

  - Stop all the daemons.
  - Set `cluster.read-only` to `true` and deploy the new configuration. The
    web UI should now show that Phorge is in "Read Only" mode.
  - Dump the database. You can do this with `bin/storage dump --for-replica`
    to add the `--master-data` flag to the underlying command and include a
    `CHANGE MASTER ...` statement in the dump.
  - Once the dump finishes, turn `cluster.read-only` off again to restore
    service. Continue loading the dump into the replica normally.

**Log Expiration**: You can configure MySQL to automatically clean up old
binary logs on startup with the `expire_logs_days` option. If you do not
configure this and do not explicitly purge old logs with `PURGE BINARY LOGS`,
the binary logs on disk will grow without bound, and relatively quickly.

Once you have a working replica, continue below to tell Phorge about it.
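
Taken together, the notes above amount to a small amount of MySQL server
configuration. A minimal `my.cnf` fragment for the master might look like this
(a sketch; the `server_id` and retention values here are hypothetical and must
be adapted to your environment):

```ini
# my.cnf fragment for the master (hypothetical values)
[mysqld]
# Required for replication; each host needs a unique server_id.
log_bin          = mysql-bin
server_id        = 1

# Phorge issues queries that MySQL considers unsafe under STATEMENT
# binlogging; use MIXED (recommended) or ROW instead.
binlog_format    = MIXED

# Prevent binary logs on disk from growing without bound.
expire_logs_days = 14
```

You would also grant the Phorge user the diagnostic privilege on each replica,
for example with `GRANT REPLICATION CLIENT ON *.* TO ...` for the user Phorge
connects as.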


Configuring Replicas
====================

Once your replicas are in working order, tell Phorge about them by
configuring the `cluster.databases` option. This option must be configured from
the command line or in configuration files because Phorge needs to read
it //before// it can connect to databases.

This option value will list all of the database hosts that you want Phorge
to interact with: your master and all your replicas. Each entry in the list
should have these keys:

  - `host`: //Required string.// The database host name.
  - `role`: //Required string.// The cluster role of this host, one of
    `master` or `replica`.
  - `port`: //Optional int.// The port to connect to. If omitted, the default
    port from `mysql.port` will be used.
  - `user`: //Optional string.// The MySQL username to use to connect to this
    host. If omitted, the default from `mysql.user` will be used.
  - `pass`: //Optional string.// The password to use to connect to this host.
    If omitted, the default from `mysql.pass` will be used.
  - `disabled`: //Optional bool.// If set to `true`, Phorge will not
    connect to this host. You can use this to temporarily take a host out
    of service.

When `cluster.databases` is configured, the `mysql.host` option is not used.
The other MySQL connection configuration options (`mysql.port`, `mysql.user`,
`mysql.pass`) are used only to provide defaults.

Once you've configured this option, restart Phorge for the changes to take
effect, then continue to "Monitoring Replicas" to verify the configuration.


Monitoring Replicas
===================

You can monitor replicas in {nav Config > Services > Database Servers}. This
interface shows you a quick overview of replicas and their health, and can
detect some common issues with replication.

The table on this page shows each database and its current status.
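
For reference, the `cluster.databases` configuration that produces the entries
in this table might look like this in `conf/local/local.json` (a sketch of a
minimal two-host cluster; the host names are hypothetical):

```json
{
  "cluster.databases": [
    {
      "host": "db001.example.net",
      "role": "master"
    },
    {
      "host": "db002.example.net",
      "role": "replica"
    }
  ]
}
```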

NOTE: This page runs its diagnostics //from the web server that is serving the
request//. If you are recovering from a disaster, the view this page shows
may be partial or misleading, and two requests served by different servers may
see different views of the cluster.

**Connection**: Phorge tries to connect to each configured database, then
shows the result in this column. If it fails, a brief diagnostic message with
details about the error is shown. If it succeeds, the column shows a rough
measurement of latency from the current webserver to the database.

**Replication**: This is a summary of replication status on the database. If
things are properly configured and stable, the replicas should be actively
replicating and no more than a few seconds behind the master, and the master
should //not// be replicating from another database.

To report this status, the user Phorge is connecting as must have the
`REPLICATION CLIENT` privilege (or the `SUPER` privilege) so it can run the
`SHOW REPLICA STATUS` command. The `REPLICATION CLIENT` privilege only enables
the user to run diagnostic commands, so it should be reasonable to grant it in
most cases, but it is not required. If you choose not to grant it, this page
can not show any useful diagnostic information about replication status, but
everything else will still work.

If a replica is more than a second behind the master, this page will show the
current replication delay. If the replication delay is more than 30 seconds,
it will report "Slow Replication" with a warning icon.

If replication is delayed, data is at risk: if you lose the master and can not
later recover it (for example, because a meteor has obliterated the datacenter
housing the physical host), data which did not make it to the replica will be
lost forever.
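
If you want to inspect the delay yourself, you can run the same diagnostic
Phorge uses directly on the replica from a MySQL shell (this requires the
`REPLICATION CLIENT` privilege described above; on versions before MySQL
8.0.22 the statement is `SHOW SLAVE STATUS` instead):

```sql
-- Run on the replica. The Seconds_Behind_Source field
-- (Seconds_Behind_Master on older versions) reports the current
-- replication delay in seconds.
SHOW REPLICA STATUS\G
```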

Beyond the risk of data loss, any read-only traffic sent to the replica will
see an older view of the world, which could be confusing for users: it may
appear that their data has been lost, even if it is safe and just hasn't
replicated yet.

Phorge will attempt to prevent clients from seeing out-of-date views, but
sometimes sending traffic to a delayed replica is the best available option
(for example, if the master can not be reached).

**Health**: This column shows the result of recent health checks against the
server. After several checks in a row fail, Phorge will mark the server
as unhealthy and stop sending traffic to it until several checks in a row
later succeed.

Note that each web server tracks database health independently, so if you have
several servers they may have different views of database health. This is
normal and not problematic.

For more information on health checks, see "Unreachable Masters" below.

**Messages**: This column has additional details about any errors shown in the
other columns. These messages can help you understand or resolve problems.


Testing Replicas
================

To test that your configuration can survive a disaster, turn off the master
database. Do this with great ceremony, making a cool explosion sound as you
run the `mysqld stop` command.

If things have been set up properly, Phorge should degrade to a temporary
read-only mode immediately. After a brief period of unresponsiveness, it will
degrade further into a longer-term read-only mode. For details on how this
works internally, see "Unreachable Masters" below.

Once satisfied, turn the master back on. After a brief delay, Phorge
should recognize that the master is healthy again and recover fully.

Throughout this process, the {nav Database Servers} console will show a
current view of the world from the perspective of the web server handling the
request. You can use it to monitor state.

You can perform a narrower test by enabling `cluster.read-only` in
configuration. This will put Phorge into read-only mode immediately
without turning off any databases.

You can use this mode to understand which capabilities will and will not be
available in read-only mode, and make sure any information you want to remain
accessible in a disaster (like wiki pages or contact information) is really
accessible.

See the next section, "Degradation to Read-Only Mode", for more details about
when, why, and how Phorge degrades.

If you run custom code or extensions, they may not accommodate read-only mode
properly. You should specifically test that they function correctly in
read-only mode and do not prevent you from accessing important information.


Degradation to Read-Only Mode
=============================

Phorge will degrade to read-only mode when any of these conditions occur:

  - you turn it on explicitly;
  - you configure cluster mode, but don't set up any masters;
  - the master can not be reached while handling a request; or
  - recent attempts to connect to the master have consistently failed.

When Phorge is running in read-only mode, users can still read data and
browse and clone repositories, but they can not edit, update, or push new
changes. For example, users can still read disaster recovery information on
the wiki or emergency contact information on user profiles.

You can enable this mode explicitly by configuring `cluster.read-only`.
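
For example, assuming a standard install layout, you might toggle the mode
from the command line with `bin/config` (a sketch; run it from your Phorge
directory):

```shell
# Put the install into read-only mode.
./bin/config set cluster.read-only true

# ... perform testing or maintenance ...

# Restore normal read-write service.
./bin/config set cluster.read-only false
```
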
Some
reasons you might want to do this include:

  - to test that the mode works like you expect it to;
  - to make sure that information you need will be available;
  - to prevent new writes while performing database maintenance; or
  - to permanently archive a Phorge install.

You can also enable this mode implicitly by configuring `cluster.databases`
but disabling the master, or by not specifying any host as a master. This may
be more convenient than turning it on explicitly during the course of
operations work.

If Phorge is unable to reach the master database, it will degrade into
read-only mode automatically. See "Unreachable Masters" below for details on
how this process works.

If you end up in a situation where you have lost the master and can not get it
back online (or can not restore it quickly), you can promote a replica to
become the new master. See the next section, "Promoting a Replica", for
details.


Promoting a Replica
===================

If you lose access to the master database, Phorge will degrade into
read-only mode. This is described in greater detail below.

The easiest way to get out of read-only mode is to restore the master database.
If the database recovers on its own or operations staff can revive it,
Phorge will return to full working order after a few moments.

If you can't restore the master or are unsure you will be able to restore the
master quickly, you can promote a replica to become the new master instead.

Before doing this, you should first assess how far behind the master the
replica was when the link died. Any data which was not replicated will either
be lost or become very difficult to recover after you promote a replica.

For example, if some task `T1234` had been created on the master but had not
yet replicated and you promote the replica, a new `T1234` may be created on the
replica after promotion.
Even if you can recover the master later, merging
the data will be difficult because each database may contain conflicting
changes.

If there was a significant replication delay at the time of the failure, you
may want to try harder or spend more time attempting to recover the master
before choosing to promote.

If you have made the choice to promote, disable replication on the replica and
mark it as the `master` in `cluster.databases`. Remove the original master and
deploy the configuration change to all surviving hosts.

Once write service is restored, you should provision, deploy, and configure a
new replica by following the steps you took the first time around. You are
critically vulnerable to a second disruption until you have restored this
redundancy.


Unreachable Masters
===================

This section describes how Phorge determines that a master has been lost,
marks it unreachable, and degrades into read-only mode.

Phorge degrades into read-only mode automatically in two ways: very
briefly in response to a single connection failure, or more permanently in
response to a series of connection failures.

In the first case, if a request needs to connect to the master but is not able
to, Phorge will temporarily degrade into read-only mode for the remainder
of that request. The alternative is to fail abruptly, but Phorge can
sometimes degrade successfully and still respond to the user's request, so it
makes an effort to finish serving the request from replicas.

If the request was a write (like posting a comment) it will fail anyway, but
if it was a read that did not actually need to use the master it may succeed.

This temporary mode is intended to recover as gracefully as possible from brief
interruptions in service (a few seconds), like a server being restarted, a
network link becoming temporarily unavailable, or brief periods of load-related
disruption. If the anomaly is temporary, Phorge should recover immediately
(on the next request once service is restored).

This mode can be slow for users (they need to wait on connection attempts to
the master which fail) and does not reduce load on the master (requests still
attempt to connect to it).

The second way Phorge degrades is by running periodic health checks
against databases, and marking them unhealthy if they fail over a longer period
of time. This mechanism is very similar to the health checks that most HTTP
load balancers perform against web servers.

If a database fails several health checks in a row, Phorge will mark it as
unhealthy and stop sending all traffic (except for more health checks) to it.
This improves performance during a service interruption and reduces load on the
master, which may help it recover from load problems.

You can monitor the status of health checks in the {nav Database Servers}
console. The "Health" column shows how many checks have run recently and
how many have succeeded.

Health checks run every 3 seconds, and 5 checks in a row must fail or succeed
before Phorge marks the database as unhealthy or healthy, so it will
generally take about 15 seconds for a database to change state after it goes
down or comes up.

If all of the recent checks fail, Phorge will mark the database as
unhealthy and stop sending traffic to it. If the master was the database that
was marked as unhealthy, Phorge will actively degrade into read-only mode
until it recovers.

This mode only attempts to connect to the unhealthy database once every few
seconds to see if it is recovering, so performance will be better on average
(users rarely need to wait for bad connections to fail or time out) and the
database will receive less load.

Once all of the recent checks succeed, Phorge will mark the database as
healthy again and resume sending traffic to it.

Health checks are tracked individually for each web server, so some web servers
may see a host as healthy while others see it as unhealthy. This is normal, and
can accurately reflect the state of the world: for example, the link between
datacenters may have been lost, so hosts in one datacenter can no longer see
the master, while hosts in the other datacenter still have a healthy link to
it.


Backups
=======

Even if you configure replication, you should still retain separate backup
snapshots. Replicas protect you from data loss if you lose a host, but they do
not let you recover from data mutation mistakes.

If something issues `DELETE` or `UPDATE` statements and destroys data on the
master, the mutation will propagate to the replicas almost immediately and the
data will be gone forever. Normally, the only way to recover this data is from
backup snapshots.

Although you should still have a backup process, your backup process can
safely pull dumps from a replica instead of the master. This operation can
be slow, so offloading it to a replica can make the performance of the master
more consistent.

To dump from a replica, you can use `bin/storage dump --host <host>` to
control which host the command connects to. (You may still want to execute
this command //from// that host, to avoid sending the whole dump over the
network.)
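
For example, a backup job running on the replica host itself might look like
this (a sketch; the host name and output path are hypothetical, and this
assumes your version of `bin/storage` supports the `--compress` and `--output`
flags):

```shell
# Run on the replica host itself so the dump does not cross the network.
# Dumping from the replica leaves the master's load unaffected.
./bin/storage dump --host db002.example.net --compress \
  --output /backups/phorge-snapshot.sql.gz
```
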

With the `--for-replica` flag, the `bin/storage dump` command creates dumps
with `--master-data`, which includes a `CHANGE MASTER` statement in the output.
This may be helpful when initially setting up new replicas, as it can make it
easier to set the binlog coordinates to the correct position for the dump.

With recent versions of MySQL, it is also possible to configure a //delayed//
replica which intentionally lags behind the master (say, by 12 hours). In the
event of a bad mutation, this could give you a larger window of time to
recognize the issue and recover the lost data from the delayed replica (which
might be quick) without needing to restore backups (which might be very slow).

Delayed replication is outside the scope of this document, but may be worth
considering as an additional data security step on top of backup snapshots,
depending on your resources and needs. If you configure a delayed replica, do
not add it to the `cluster.databases` configuration: Phorge should never
send traffic to it, and does not need to know about it.


Next Steps
==========

Continue by:

  - returning to @{article:Clustering Introduction}.