ReplicaDB: A First Look at Cross-Database Replication
Notes on where ReplicaDB fits, what it gets right, and where a heavier replication tool is still the better call.
What Problem Does ReplicaDB Solve?
Data replication between different systems is core to any distributed system, and you can do it in a ton of different ways. You could write REST endpoints and add logic so the systems push data to each other. You could set up queues with ActiveMQ/RabbitMQ/Kafka and let many systems subscribe to the data via topics. Or you could do flat-out database copies that you ship to different instances.
You typically do this kind of copying and synchronization between non-colocated systems - essentially when they can’t share the data in any reasonable way. There are whole suites of replication tools out there we can use, and they tend to shine when bulk data movement is required; for small updates they can be overkill.
So let’s explore ReplicaDB.
How It Works
ReplicaDB runs as a command line utility by default, but since it’s a jar it’s also feasible to incorporate it into your own system. Running it on the command line is probably the best approach: you can hook it up to cron or any other scheduling system.
As an example command (lifted from their website):
replicadb --mode=complete -j=1 \
--sink-connect=jdbc:oracle:thin:@$ORAHOST:$ORAPORT:$ORASID \
--sink-user=$ORAUSER \
--sink-password=$ORAPASS \
--sink-table=dept \
--source-connect=jdbc:postgresql://$PGHOST/osalvador \
--source-table=dept \
--source-columns=dept.*
You could also shove these settings into property files you can point to. For example:
mode=complete
jobs=1
source.connect=jdbc:postgresql://{ip}:{port}/{SOURCE_DB}
source.user={DB_USER}
source.password={DB_PASSWORD}
source.table=customers
sink.connect=jdbc:postgresql://{ip}:{port}/{SINK_DB}
sink.user={DB_USER}
sink.password={DB_PASSWORD}
sink.table=customers
sink.staging-schema=public
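Since it’s just a jar with a launcher, wiring it into a scheduler is mostly a matter of shelling out and checking the exit code. Here’s a minimal sketch in Python, assuming the replicadb launcher is on the PATH and the properties above are saved as replicadb.conf (both names are my assumptions), using the --options-file flag to point at that file:

import subprocess

def run_replication(options_file: str = "replicadb.conf") -> None:
    # Shell out to the replicadb launcher; the options file is the
    # properties file shown above (path and filename are assumptions).
    result = subprocess.run(
        ["replicadb", "--options-file", options_file],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface ReplicaDB's own output so whatever runs this can alert on it.
        raise RuntimeError(f"replicadb exited {result.returncode}:\n{result.stderr}")

if __name__ == "__main__":
    run_replication()

Drop that into cron, a systemd timer, or your application’s own scheduler and the “any other scheduling system” part is covered.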
Features
Complete Mode
Complete mode is probably the most expensive, but also the most robust, strategy. The sink table is truncated and the source table is fully copied over into the sink.
The drawback is that the synchronization scales with the size of the data, not with the number of changes. This naturally increases bandwidth and can make the synchronization take longer.
So, in the code below from the GitHub repo, the data copy from the source to the sink is an overwrite: existing data in the sink table is not preserved.
# Sink has data that source does not — complete should wipe it
insert_customers(source_conn, [RETAIL_ROW])
insert_customers(sink_conn, [ENTERPRISE_ROW])
replicadb(mode="complete")
assert count_rows(sink_conn) == 1
with sink_conn.cursor() as cur:
    cur.execute("SELECT customer_type FROM customers")
    assert cur.fetchone()[0] == "retail"
Incremental Mode
Incremental mode allows you to specify a filter to apply to the source’s data before it’s copied to the sink. A truncate is no longer required; new data is simply appended. This is useful for a couple of situations:
- Sending a list of events, like user interactions, logs, etc. These are append-only events that don’t alter previous rows.
- Sending data that needs to be processed, like row updates, so you can write code to handle the data updates (ehhh… this is kinda like events?)
Because you can specify a filter, you can use a watermark technique to pull only new records. The watermark could be either an auto-incrementing ID or an “updated_at” timestamp column. Storing that watermark is of course on you, which honestly might be preferable; a sketch of one way to handle it follows the example below.
So, in the example below from the GitHub code, we can see incremental mode in action.
# Complete sync first
insert_customers(source_conn, [RETAIL_ROW, ENTERPRISE_ROW])
replicadb(mode="complete")
assert count_rows(sink_conn) == 2
with source_conn.cursor() as cur:
    cur.execute("SELECT MAX(id) FROM customers")
    last_id = cur.fetchone()[0]

# Add a third row and sync incrementally
new_row = (
    "retail", "Bob Jones", "bob@example.com",
    "555-0003", "Denver", "USA", None, None,
)
insert_customers(source_conn, [new_row])
replicadb(mode="incremental", extra={"source.where": f"id > {last_id}"})
# All three rows should be in the sink — existing ones untouched
assert count_rows(sink_conn) == 3
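About that watermark bookkeeping: it doesn’t take much. Here’s a minimal sketch, assuming a psycopg2 connection string for the source, a small local state file for the last-synced id (the filename is my placeholder), and the same replicadb() wrapper used in the test above:

from pathlib import Path

import psycopg2

STATE_FILE = Path("customers.watermark")  # assumed location for the last-synced id

def sync_new_rows(source_dsn: str, replicadb) -> None:
    # `replicadb` is the same wrapper callable used in the incremental test above.
    last_id = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    # Capture the upper bound before copying so rows inserted mid-run
    # aren't silently skipped on the next pass.
    with psycopg2.connect(source_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(id), %s) FROM customers", (last_id,))
        new_max = cur.fetchone()[0]
    if new_max > last_id:
        replicadb(mode="incremental",
                  extra={"source.where": f"id > {last_id} AND id <= {new_max}"})
        STATE_FILE.write_text(str(new_max))

Bounding the filter on both ends and only advancing the watermark after a successful run keeps a crashed sync from losing rows.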
Why Not Just Write Your Own SQL?
For the situations I foresee, I don’t necessarily need it, but it depends:
- For small, single-vendor tables: you probably don’t need it
- Where ReplicaDB pulls ahead:
  - Parallelism - jobs=N splits the table into chunks across multiple connections; correct range partitioning is non-trivial to implement
  - Staging - writes to a staging table first, then promotes; a failed run doesn’t leave the sink half-written
  - Cross-vendor type mapping - handles Oracle to Postgres and similar combinations without a hand-rolled translation layer
  - Memory management - streams in fetch-size chunks; easy to OOM yourself loading a large result set naively
- Rule of thumb: roll your own if same-vendor plus modest table size plus no parallel requirement; reach for ReplicaDB when any of those stops being true
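For the roll-your-own side of that rule of thumb, here’s roughly what the hand-rolled version looks like for a same-vendor Postgres-to-Postgres copy (table name and DSNs are placeholders). The server-side cursor is the part that keeps you out of the memory trap mentioned above; everything else from that list - staging, parallel ranges, type mapping - you’d still be building yourself:

import psycopg2
from psycopg2.extras import execute_values

BATCH = 10_000  # rows per round trip; tune to taste

def copy_table(source_dsn: str, sink_dsn: str, table: str = "customers") -> None:
    with psycopg2.connect(source_dsn) as src, psycopg2.connect(sink_dsn) as dst:
        # A named (server-side) cursor streams rows from Postgres in batches
        # instead of materializing the whole result set in memory.
        src_cur = src.cursor(name="copy_stream")
        src_cur.execute(f"SELECT * FROM {table}")
        with dst.cursor() as dst_cur:
            # Same wipe-and-reload semantics as ReplicaDB's complete mode.
            dst_cur.execute(f"TRUNCATE {table}")
            while True:
                rows = src_cur.fetchmany(BATCH)
                if not rows:
                    break
                execute_values(dst_cur, f"INSERT INTO {table} VALUES %s", rows)
        src_cur.close()

That really is all the happy path needs for one modest table, which is why the rule of thumb leans toward rolling your own until parallelism, restartability, or a second vendor shows up.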
What It Gets Right
- Zero infrastructure overhead
- Multi-vendor via JDBC
- Simple mental model
Where It Falls Short
- No native upsert, so duplicate rows still fail on constraints
- No delete awareness without soft-delete workarounds (one is sketched after this list)
- Watermark state management is entirely on you
- No schema migration
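On the delete gap: the usual soft-delete workaround is to have the source mark rows instead of removing them (for example with a deleted_at column - my assumption, not something ReplicaDB provides), let the normal copy carry that flag to the sink, and run a cleanup there. Complete mode already propagates hard deletes because the sink is reloaded from scratch, so this mostly matters for incremental runs, and only once the flagged row actually reaches the sink. A minimal sketch of the sink-side cleanup:

import psycopg2

def purge_soft_deletes(sink_dsn: str) -> int:
    # ReplicaDB never issues deletes on the sink; after a replication run,
    # remove the rows the source has flagged. Returns the number removed.
    with psycopg2.connect(sink_dsn) as conn, conn.cursor() as cur:
        cur.execute("DELETE FROM customers WHERE deleted_at IS NOT NULL")
        return cur.rowcount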
Verdict
- Good fit for simple, scheduled bulk copies
- Reach for a heavier tool (Debezium, Airbyte, or pglogical) when you need upserts, hard deletes, or automatic checkpointing