Logical replication is a powerful feature in PostgreSQL that allows for real-time data replication between databases. It can be used for performing major version upgrades using a blue/green setup where you have two databases, allowing you to test and switch over to a new version with minimal downtime. Logical replication can also be use to facilitate database migrations between different environments, using the same technique and tooling.
In this post, I will describe the process of setting up simple replication and switchover between two databases using pg_easy_replicate. For the purposes of this post, we are using Aurora RDS PostgreSQL. However, the principles and steps apply to any PostgreSQL database.
What is pg_easy_replicate
?
At its core, pg_easy_replicate is a CLI orchestrator tool that makes the process of setting up and managing logical replication between PostgreSQL databases a breeze. It’s designed to ensure zero data loss and minimal downtime, making operations like major version upgrades, load testing with blue/green database setups, and other similar tasks more manageable.
Working with pg_easy_replicate
The tool itself isn’t responsible for syncing data, but it takes care of setting up all the logistics required to ensure that you, as a developer, don’t need to do those things yourself manually.
I usually prefer to execute the commands as one-off tasks in the environment and network that have access to the database. Usually, it’s a container or pod in a production environment running the shayonj/pg_easy_replicate:latest
Docker image.
Lets gets started.
Modifying Aurora PostgreSQL Parameters
First, you will need to ensure that your Aurora Database is setup for logical replication. Start by updating the following parameters.
rds.logical_replication
- Set this parameter to
1
to enable logical replication capabilities on your instance.
- Set this parameter to
max_replication_slots
- Set this to at least
2
for each group you intend to replicate sincepg_easy_replicate
sets up one publisher and one subscriber per group. If planning for multiple groups, increase this number accordingly.
- Set this to at least
max_wal_senders
- This should match or exceed the number set for
max_replication_slots
. It controls the number of concurrent Write-Ahead Logging (WAL) sender processes.
- This should match or exceed the number set for
max_logical_replication_workers
- Generally, set this equal to the number of
max_replication_slots
to ensure each slot can be actively processed by a dedicated worker.
- Generally, set this equal to the number of
max_worker_processes
- Increase this parameter to accommodate the total of
max_logical_replication_workers
,autovacuum_max_workers
, andmax_parallel_workers
.
- Increase this parameter to accommodate the total of
These parameters are adjusted in the AWS Aurora parameter group associated with your database. Applying changes will require a reboot of your instance to take effect.
These should apply to other Postgres installations too, except rds.logical_replication
. In other installations, it’s the standard wal_level
config.
Requirements
- PostgreSQL 10+,
- If not using docker image
- Ruby 3.0+
pg_dump
Step 1: Install pg_easy_replicate
pg_easy_replicate
can be installed via RubyGems or Docker, based on your environment setup. I usually run these from within a containerized environment. At Tines, we use Tines Stories to kick these off without needing to exec into any production environment.
Ruby Environment:
gem install pg_easy_replicate
Docker:
docker pull shayonj/pg_easy_replicate:latest
Step 2: Set Up Environment Variables
Define the source and target database URLs in your environment. The source is the primary database where all the data resides. The target is the database to which you are replicating the data.
export SOURCE_DB_URL="postgres://USERNAME:PASSWORD@source_host:5432/SOURCE_DB"
export TARGET_DB_URL="postgres://USERNAME:PASSWORD@target_host:5433/TARGET_DB"
Step 3: User Permissions
To successfully set up and manage logical replication with pg_easy_replicate
, the database user must have the appropriate permissions. In Aurora PostgreSQL, you need to ensure the user has permissions to perform replication tasks as well as the ability to create databases and tables.
Here’s how you can set up the necessary permissions using SQL commands:
Grant the necessary role for replication
In Aurora PostgreSQL, you can grant the rds_replication
role, which is specifically designed to provide the permissions needed for logical replication.
-- Grant rds_replication role
GRANT rds_replication TO your_user_name;
Allow user to create Databases and Tables You must also ensure the user has the rights to create databases and manage tables, which are essential for setting up new schemas and managing data structure during replication.
-- Grant CREATE DATABASE permission
ALTER ROLE your_user_name CREATEDB;
-- Grant CREATE permission on the schema
GRANT CREATE ON SCHEMA public TO your_user_name;
Assuming the schema your data is in is called public
Step 4: Configuration Check
Run a configuration check to ensure both databases are properly configured for replication. pg_easy_replicate
does some pre-flight checks to ensure that base configurations are met.
pg_easy_replicate config_check --special-user-role=rds_superuser
--special-user-role
is only required for AWS Aurora or GCP (cloudsqlsuperuser
) where they don’t allow superuser
permissions on the user.
Step 5: Bootstrap
Bootstrap is a required step and allows pg_easy_replicate
to manage some internal state to keep track of the instructions provided. You can use pg_easy_replicate
to perform multiple parallel replications on your database. Each replication that you set up is identified as a group
in pg_easy_replicate
and can be identified through the --group-name
flag.
The following will prepare the metadata tables and roles necessary for replication, and also ensure that the schema on the target database matches that of the source, indicated through the --copy-schema
flag. It uses pg_dump
behind the scenes for the same.
pg_easy_replicate bootstrap --group-name your_group_name --copy-schema --special-user-role=rds_superuser
Step 6: Start the Replication
By default all tables are part of the replication process. If you wish to only include specific tables, you pass a comma separate list through the --tables
flag.
pg_easy_replicate start_sync --group-name your_group_name --schema-name public
For faster initial COPY
of the data when replication starts, pg_easy_replicate
attempts to drop all the indexes before the COPY
and recreates them for you post COPY
as part of the switchover
task (see below for more details). Since this part is still experimental, you can skip this operation using the --skip-recreate-indexes-post-copy
flag passed to start_sync
.
Step 7: Monitor the Replication
Once the replication has started you can monitor the stats (returned in JSON
) using stats
command. Depending on how big the database is and the number of replication workers, it can take a while for tables to report the replicating
status.
pg_easy_replicate stats --group-name your_group_name
{
"lag_stats": [
{
"pid": 66,
"client_addr": "192.168.128.2",
"user_name": "jamesbond",
"application_name": "pger_subscription_database_cluster_1",
"state": "streaming",
"sync_state": "async",
"write_lag": "0.0",
"flush_lag": "0.0",
"replay_lag": "0.0"
}
],
"message_lsn_receipts": [
{
"received_lsn": "0/1674688",
"last_msg_send_time": "2023-06-19 19:56:35 UTC",
"last_msg_receipt_time": "2023-06-19 19:56:35 UTC",
"latest_end_lsn": "0/1674688",
"latest_end_time": "2023-06-19 19:56:35 UTC"
}
],
"sync_started_at": "2023-06-19 19:54:54 UTC",
"sync_failed_at": null,
"switchover_completed_at": null
....
The write_lag
, flush_lag
, and replay_lag
tell you how far behind the workers are, in terms of kilobytes
, for performing those specific operations.
pg_easy_replicate
exposes additional attributes like sync_started_at
, sync_failed_at
, and switchover_completed_at
for the respective group. You can use this information to keep track of the overall process.]
Step 8: Perform the Switchover
Depending on what you are using pg_easy_replicate
for, you may not need to using switchover
.
pg_easy_replicate
doesn’t kick off the switchover on its own. When you start the sync via start_sync
, it starts the replication between the two databases. Once you have had the time to monitor stats and any other key metrics, you can kick off the switchover
.
switchover
will wait until all tables in the group are replicating and the delta for lag is <200kb
(by calculating the pg_wal_lsn_diff
between sent_lsn
and write_lsn
) and then perform the switch.
Additionally, switchover
will take care of re-adding the indices (it had removed in start_sync
) in the target database before hand. Depending on the size of the tables, the recreation of indexes (which happens CONCURRENTLY
) may take a while.
The switch is made by putting the user on the source database in READ ONLY
mode, so that it is not accepting any more writes and waits for the flush lag to be 0
. It’s up to the user to kick off a rolling restart of their application containers or failover DNS after the switchover is complete, so that your application isn’t sending any read + write requests to the old/source database.
pg_easy_replicate switchover --group-name your_group_name
You can control the delta size for your operations using --lag-delta-size
flag. Default is 200
kb.
Optional Step: Cleanup
Use cleanup
if you want to remove all bootstrapped data for the specified group. Additionally you can pass -e
or --everything
in order to clean up all schema changes for bootstrapped tables, users and any publication/subscription data.
pg_easy_replicate cleanup --group-name your_group_name --everything
What to Watch Out For
- Performance Impact: Keep an eye on the database performance, especially if you’ve increased the
max_worker_processes
. Depending on the size of your data you may see an increase I/O due to the initial COPY that happens when the sync starts. - Connection Limits: Make sure your Aurora instance can handle the additional connections needed for replication.
Switchover strategies with minimal downtime
For minimal downtime, it’d be best to watch/tail the stats and wait until switchover_completed_at
is updated with a timestamp. Once that happens you can perform any of the following strategies.
Note: These are just suggestions and pg_easy_replicate
doesn’t provide any functionalities for this. See conclusion on whats coming soon 👀.
Rolling restart strategy
In this strategy, you ideally have a change ready to go which instructs your application to start connecting to the new database. Either using an environment variable or similar. Depending on the application type, it also may or may not require a rolling restart.
Next, you can set up a program (thinking a bash script on loop) that watches the stats
and waits until switchover_completed_at
is reporting as true
. Once that happens it kicks off a rolling restart of your application containers so they can start making connections to the DNS of the new database.
DNS Failover strategy
In this strategy, you have a weighted based DNS system (example AWS Route53 weighted records) where 100% of traffic goes to a primary origin and 0% to a secondary origin. The primary origin here is the DNS host for your source database and secondary origin is the DNS host for your target database. You can set up your application ahead of time to interact with the database using DNS from the weighted group.
Next, you can set up a program that watches the stats
and waits until switchover_completed_at
is reporting as true
. Once that happens it updates the weight in the DNS weighted group where 100% of the requests now go to the new/target database. Note: Keeping a low ttl
is recommended.
Conclusion
Using pg_easy_replicate
to handle logical replication on Aurora PostgreSQL is a simple and effective way to manage database replication and switchover. I hope you found this post useful. I am sure at some point in future this post would get outdate, so I recommend keeping an eye on the Github Readme page for the most up to data information https://github.com/shayonj/pg_easy_replicate
What’s next
I have also been working on a side project to introduce a more managed offering to automate these kinds of database tasks, make “DB Ops” much simpler in PostgreSQL, and provide a more user-friendly UI/UX to manage them. If you are interested in learning more, working together, or have feature requests or just want to jam on PostgreSQL, please don’t hesitate to reach out to me on Twitter/X or via email at [email protected] 🙂.