Building manga-cdc: a change-data-capture pipeline for manga releases

Manga chapters do not arrive in one place. Official simulpubs land on Manga Plus. Community metadata lives on MangaDex. Aggregators like MangaFire, MangaPill, and MangaTown each have different HTML layouts. Scanlation groups post to sites like Asura Scans on their own schedule.

I was tired of opening six tabs every evening. So I built manga-cdc: a change-data-capture-style pipeline that scrapes sources on a schedule, diffs against PostgreSQL, publishes chapter events through Kafka or QStash, and routes notifications to Discord, Slack, or Telegram.

This is Part 1 of the manga-cdc series — the architecture story and the decisions that shaped it.

The problem is not “write a scraper”

A Discord webhook script breaks the moment you care about reliability:

Failure	What happens
Duplicate notifications	Retries and republication look like new chapters
Lost notifications	Process crashes between DB write and webhook
Untrusted writes	Public webhook endpoints get probed
Ops blindness	A source silently returns zero rows for a week
Deploy fragility	Secrets and URLs hard-coded in one file

manga-cdc treats this as a small distributed system: ingest, durable state, event delivery, policy layer, operator UI.

That framing matters. The goal was never “minimum lines of Go.” It was provable end-to-end behavior on real sites, with a credible path toward true CDC later.

The four-box architecture

At the center is a pipeline with clear boundaries:

flowchart LR
  scraper["Scraper (Go)"]
  db[(PostgreSQL)]
  bus["Kafka / QStash"]
  notifier["Notifier\n(Spring Boot)"]
  channels["Discord / Slack / Telegram"]

  scraper --> db
  db --> bus
  bus --> notifier
  notifier --> channels

Two edge surfaces sit on top:

flowchart TB
  scraper["Scraper"]
  db[(PostgreSQL)]
  notifier["Notifier"]
  dashboard["Dashboard (Svelte)"]
  status["Status page (Vercel)"]

  scraper --> db
  notifier --> db
  dashboard -. read API .-> db
  status -. health polls .-> notifier

Dashboard (Svelte) — operator UI with a BFF proxy to the read API
Status page (Vercel) — public pipeline health for you and contributors

Why Go for scraping and Java for notifications?

The scrape path is stateless batch work: adapters, diff engine, optional publish. It runs on a schedule, exits, and should fail in isolation — a broken MangaTown adapter must not take down notification delivery.

The notifier is long-lived policy + HTTP: webhook consumers, channel routing, per-series preferences, read APIs for the dashboard. Spring Boot fits that operational profile, and the split gives each side a clear scaling unit.

PostgreSQL stays authoritative. The event bus decouples scrape latency from notify latency. That is the same shape as application-level CDC:

Phase 3 may add WAL CDC with Debezium. Phase 1 deliberately uses application publish so we can run serverless and keep costs near zero.

flowchart LR
  insert["INSERT chapter"]
  event["Debezium-shaped JSON"]
  policy["Notifier policy"]
  deliver["Channel delivery"]

  insert --> event
  event --> policy
  policy --> deliver

Six sources, one adapter contract

Each source adapter implements the same contract: return normalized series and chapter rows. Behind that interface, everything is different:

Source	Access model	Stability risk
MangaDex	Official API	Rate limits, API versioning
Manga Plus	Official API	Regional availability
MangaFire	HTML scraping	Layout changes
MangaPill	HTML scraping	Anti-bot behavior
MangaTown	HTML scraping	Pagination quirks
Asura Scans	HTML scraping	Irregular release timing

The diff engine compares incoming rows against PostgreSQL. New chapters get inserted, flagged, and published. The design doc calls this Real > Impressive: fixture tests are necessary, but tagged releases still run against live behavior.

Consistency: what is actually guaranteed

This pipeline is not linearizable end-to-end. That is fine — but you have to know the gaps:

Step	Guarantee
DB insert of new chapter	ACID per transaction
Event publish after insert	Best effort in-process
Notifier delivery	At-least-once with idempotent handlers
Dashboard freshness	Cached read path

A crash after commit but before publish leaves is_new true for the next run to reconcile. Notification handlers dedupe on chapter identity so retries do not spam channels.

stateDiagram-v2
  [*] --> Scraped: adapter run
  Scraped --> Committed: DB insert
  Committed --> Published: event publish
  Published --> Delivered: notifier
  Committed --> Reconcile: crash before publish
  Reconcile --> Published: next scrape run
  Delivered --> [*]

Operators see health through the status page and pipeline health endpoints — because silent scrape failure is the most common production incident class.

Notification policy, not just delivery

The notifier is more than a webhook forwarder. It implements:

Per-series preferences — which channels, which series, batching rules
Channel adapters — Discord, Slack, Telegram with separate formatting
Mutation guards — write APIs protected when mutations are disabled in prod
Notification log — audit trail of what was sent and why

The dashboard proxies read APIs so browser clients never hold service credentials. The status page polls pipeline health into a Vercel-friendly KV model for public visibility.

Deployment and the cost profile

Phase 1 optimizes for BYO data plane and scale-to-zero where possible:

Scraper as a scheduled job (Railway, Cloud Run job, cron)
PostgreSQL on Neon, Supabase, or self-hosted
Kafka on Upstash or QStash as a lighter alternative
Notifier on Cloud Run
Dashboard and status on Vercel

Terraform modules exist for multiple clouds — not because multi-cloud is fun, but because portability is part of the credibility goal. The env parsing patterns are shared; the topology is the same.

CI is a release train: PR validation, image builds, Helm charts, and e2e gates on tagged releases. The project even dogfoods cross-repo orchestration patterns (more on that in the pipeline-compose series).

What I would do differently

If I started today with only personal use in mind, I might defer Kafka and use QStash-only eventing longer. The abstraction still pays off — switching buses should not require rewriting the diff engine — but operating Kafka for a single-tenant setup is real overhead.

I would also ship the watchlist contribution flow earlier. Community series discovery is the main growth loop for a project like this.

What is next in this series

Future posts will go deeper into:

Source adapter design and scrape SLOs
The dual-write path and idempotency contracts
Notification filter graph and auth model
CI/CD gates and the release train
The Phase 2/3 roadmap (SaaS, true WAL CDC)

If you are building something that looks like “scrape → diff → notify,” the manga-cdc architecture docs are the design record. Start with the architecture reading guide on the repo.

Links: GitHub · Dashboard · Status