ReplicaDB: A First Look at Cross-Database Replication
Notes on where ReplicaDB fits, what it gets right, and where a heavier replication tool is still the better call.
What Problem Does ReplicaDB Solve?
Data replication between different systems is core to any distributed system, and you can do it in a ton of different ways. You could write REST endpoints and add logic so the systems push data to each other. You could set up queues with ActiveMQ/RabbitMQ/Kafka and let many systems subscribe to the data via topics. Or you could do flat-out database copies that you ship to different instances.
You typically do this kind of copying and synchronization between non-colocated systems - essentially when they can’t share the data in any reasonable way. There are whole suites of replication tools out there we can use, and they tend to shine when bulk data movement is required; for small updates they can be overkill.
So let’s explore ReplicaDB.
How It Works
ReplicaDB runs as a command line utility by default, but since it’s a jar it’s also feasible to incorporate it into your own system. Running it on the command line is probably the best approach: you can hook it up to cron or any other scheduling system.
As an example command (lifted from their website):
replicadb --mode=complete -j=1 \
--sink-connect=jdbc:oracle:thin:@$ORAHOST:$ORAPORT:$ORASID \
--sink-user=$ORAUSER \
--sink-password=$ORAPASS \
--sink-table=dept \
--source-connect=jdbc:postgresql://$PGHOST/osalvador \
--source-table=dept \
--source-columns=dept.*
You could also shove these settings into property files you can point to. For example:
mode=complete
jobs=1
source.connect=jdbc:postgresql://{ip}:{port}/{SOURCE_DB}
source.user={DB_USER}
source.password={DB_PASSWORD}
source.table=customers
sink.connect=jdbc:postgresql://{ip}:{port}/{SINK_DB}
sink.user={DB_USER}
sink.password={DB_PASSWORD}
sink.table=customers
sink.staging-schema=public
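Since it’s just a jar with a launcher, wiring it into a scheduler is mostly a matter of shelling out and checking the exit code. Here’s a minimal sketch in Python, assuming the replicadb launcher is on the PATH and the properties above are saved as replicadb.conf (both names are my assumptions), using the --options-file flag to point at that file:

import subprocess

def run_replication(options_file: str = "replicadb.conf") -> None:
    # Shell out to the replicadb launcher; the options file is the
    # properties file shown above (path and filename are assumptions).
    result = subprocess.run(
        ["replicadb", "--options-file", options_file],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface ReplicaDB's own output so whatever runs this can alert on it.
        raise RuntimeError(f"replicadb exited {result.returncode}:\n{result.stderr}")

if __name__ == "__main__":
    run_replication()

Drop that into cron, a systemd timer, or your application’s own scheduler and the “any other scheduling system” part is covered.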
Features
Complete Mode
Complete mode is probably the most expensive, but also the most robust, strategy. The sink table is truncated and the source table is fully copied over into the sink.
The drawback is that the synchronization scales with the size of the data, not with the number of changes. This naturally increases bandwidth and can make the synchronization take longer.
So, in the code below from the GitHub repo, the data copy from the source to the sink is an overwrite: existing data in the sink table is not preserved.
# Sink has data that source does not — complete should wipe it
insert_customers(source_conn, [RETAIL_ROW])
insert_customers(sink_conn, [ENTERPRISE_ROW])
replicadb(mode="complete")
assert count_rows(sink_conn) == 1
with sink_conn.cursor() as cur:
    cur.execute("SELECT customer_type FROM customers")
    assert cur.fetchone()[0] == "retail"
Incremental Mode
Incremental mode allows you to specify a filter to apply to the source’s data before it’s copied to the sink. A truncate is no longer required; new data is simply appended. This is useful for a couple of situations:
- Sending a list of events, like user interactions, logs, etc. These are append-only events that don’t alter previous rows.
- Sending data that needs to be processed, like row updates, so you can write code to handle the data updates (ehhh… this is kinda like events?)
Because you can specify a filter, you can use a watermark technique to pull only new records. The watermark could be either an auto-incrementing ID or an “updated_at” timestamp column. Storing that watermark is of course on you, which honestly might be preferable; a sketch of one way to handle it follows the example below.
So, in the example below from the GitHub code, we can see incremental mode in action.
# Complete sync first
insert_customers(source_conn, [RETAIL_ROW, ENTERPRISE_ROW])
replicadb(mode="complete")
assert count_rows(sink_conn) == 2
with source_conn.cursor() as cur:
    cur.execute("SELECT MAX(id) FROM customers")
    last_id = cur.fetchone()[0]

# Add a third row and sync incrementally
new_row = (
    "retail", "Bob Jones", "bob@example.com",
    "555-0003", "Denver", "USA", None, None,
)
insert_customers(source_conn, [new_row])
replicadb(mode="incremental", extra={"source.where": f"id > {last_id}"})
# All three rows should be in the sink — existing ones untouched
assert count_rows(sink_conn) == 3
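About that watermark bookkeeping: it doesn’t take much. Here’s a minimal sketch, assuming a psycopg2 connection string for the source, a small local state file for the last-synced id (the filename is my placeholder), and the same replicadb() wrapper used in the test above:

from pathlib import Path

import psycopg2

STATE_FILE = Path("customers.watermark")  # assumed location for the last-synced id

def sync_new_rows(source_dsn: str, replicadb) -> None:
    # `replicadb` is the same wrapper callable used in the incremental test above.
    last_id = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    # Capture the upper bound before copying so rows inserted mid-run
    # aren't silently skipped on the next pass.
    with psycopg2.connect(source_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(id), %s) FROM customers", (last_id,))
        new_max = cur.fetchone()[0]
    if new_max > last_id:
        replicadb(mode="incremental",
                  extra={"source.where": f"id > {last_id} AND id <= {new_max}"})
        STATE_FILE.write_text(str(new_max))

Bounding the filter on both ends and only advancing the watermark after a successful run keeps a crashed sync from losing rows.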
Why Not Just Write Your Own SQL?
For the situations I foresee, I don’t necessarily need it, but it depends:
- For small, single-vendor tables: you probably don’t need it
- Where ReplicaDB pulls ahead:
  - Parallelism - jobs=N splits the table into chunks across multiple connections; correct range partitioning is non-trivial to implement
  - Staging - writes to a staging table first, then promotes; a failed run doesn’t leave the sink half-written
  - Cross-vendor type mapping - handles Oracle to Postgres and similar combinations without a hand-rolled translation layer
  - Memory management - streams in fetch-size chunks; easy to OOM yourself loading a large result set naively
- Rule of thumb: roll your own if same-vendor plus modest table size plus no parallel requirement; reach for ReplicaDB when any of those stops being true
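For the roll-your-own side of that rule of thumb, here’s roughly what the hand-rolled version looks like for a same-vendor Postgres-to-Postgres copy (table name and DSNs are placeholders). The server-side cursor is the part that keeps you out of the memory trap mentioned above; everything else from that list - staging, parallel ranges, type mapping - you’d still be building yourself:

import psycopg2
from psycopg2.extras import execute_values

BATCH = 10_000  # rows per round trip; tune to taste

def copy_table(source_dsn: str, sink_dsn: str, table: str = "customers") -> None:
    with psycopg2.connect(source_dsn) as src, psycopg2.connect(sink_dsn) as dst:
        # A named (server-side) cursor streams rows from Postgres in batches
        # instead of materializing the whole result set in memory.
        src_cur = src.cursor(name="copy_stream")
        src_cur.execute(f"SELECT * FROM {table}")
        with dst.cursor() as dst_cur:
            # Same wipe-and-reload semantics as ReplicaDB's complete mode.
            dst_cur.execute(f"TRUNCATE {table}")
            while True:
                rows = src_cur.fetchmany(BATCH)
                if not rows:
                    break
                execute_values(dst_cur, f"INSERT INTO {table} VALUES %s", rows)
        src_cur.close()

That really is all the happy path needs for one modest table, which is why the rule of thumb leans toward rolling your own until parallelism, restartability, or a second vendor shows up.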
What It Gets Right
- Zero infrastructure overhead
- Multi-vendor via JDBC
- Simple mental model
Where It Falls Short
- No native upsert, so duplicate rows still fail on constraints
- No delete awareness without soft-delete workarounds (one is sketched after this list)
- Watermark state management is entirely on you
- No schema migration
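On the delete gap: the usual soft-delete workaround is to have the source mark rows instead of removing them (for example with a deleted_at column - my assumption, not something ReplicaDB provides), let the normal copy carry that flag to the sink, and run a cleanup there. Complete mode already propagates hard deletes because the sink is reloaded from scratch, so this mostly matters for incremental runs, and only once the flagged row actually reaches the sink. A minimal sketch of the sink-side cleanup:

import psycopg2

def purge_soft_deletes(sink_dsn: str) -> int:
    # ReplicaDB never issues deletes on the sink; after a replication run,
    # remove the rows the source has flagged. Returns the number removed.
    with psycopg2.connect(sink_dsn) as conn, conn.cursor() as cur:
        cur.execute("DELETE FROM customers WHERE deleted_at IS NOT NULL")
        return cur.rowcount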
Verdict
- Good fit for simple, scheduled bulk copies
- Reach for a heavier tool (Debezium, Airbyte, or pglogical) when you need upserts, hard deletes, or automatic checkpointing