Part 1

Building manga-cdc: a change-data-capture pipeline for manga releases


Manga chapters do not arrive in one place. Official simulpubs land on Manga Plus. Community metadata lives on MangaDex. Aggregators like MangaFire, MangaPill, and MangaTown each have different HTML layouts. Scanlation groups post to sites like Asura Scans on their own schedule.

I was tired of opening six tabs every evening. So I built manga-cdc: a change-data-capture-style pipeline that scrapes sources on a schedule, diffs against PostgreSQL, publishes chapter events through Kafka or QStash, and routes notifications to Discord, Slack, or Telegram.

This is Part 1 of the manga-cdc series — the architecture story and the decisions that shaped it.

The problem is not “write a scraper”

A Discord webhook script breaks the moment you care about reliability:

| Failure | What happens |
| --- | --- |
| Duplicate notifications | Retries and republication look like new chapters |
| Lost notifications | Process crashes between DB write and webhook |
| Untrusted writes | Public webhook endpoints get probed |
| Ops blindness | A source silently returns zero rows for a week |
| Deploy fragility | Secrets and URLs hard-coded in one file |

manga-cdc treats this as a small distributed system: ingest, durable state, event delivery, policy layer, operator UI.

That framing matters. The goal was never “minimum lines of Go.” It was provable end-to-end behavior on real sites, with a credible path toward true CDC later.

The four-box architecture

At the center is a pipeline with clear boundaries:

flowchart LR
  scraper["Scraper (Go)"]
  db[(PostgreSQL)]
  bus["Kafka / QStash"]
  notifier["Notifier\n(Spring Boot)"]
  channels["Discord / Slack / Telegram"]

  scraper --> db
  db --> bus
  bus --> notifier
  notifier --> channels

Two edge surfaces sit on top:

flowchart TB
  scraper["Scraper"]
  db[(PostgreSQL)]
  notifier["Notifier"]
  dashboard["Dashboard (Svelte)"]
  status["Status page (Vercel)"]

  scraper --> db
  notifier --> db
  dashboard -. read API .-> db
  status -. health polls .-> notifier
  • Dashboard (Svelte) — operator UI with a BFF proxy to the read API
  • Status page (Vercel) — public pipeline health for you and contributors

Why Go for scraping and Java for notifications?

The scrape path is stateless batch work: adapters, diff engine, optional publish. It runs on a schedule, exits, and should fail in isolation — a broken MangaTown adapter must not take down notification delivery.

The notifier is long-lived policy + HTTP: webhook consumers, channel routing, per-series preferences, read APIs for the dashboard. Spring Boot fits that operational profile, and the split gives each side a clear scaling unit.

PostgreSQL stays authoritative. The event bus decouples scrape latency from notify latency. Phase 1 deliberately publishes from the application so the pipeline can run serverless and keep costs near zero; Phase 3 may add WAL-based CDC with Debezium. Even so, the event flow already has the shape of application-level CDC:

flowchart LR
  insert["INSERT chapter"]
  event["Debezium-shaped JSON"]
  policy["Notifier policy"]
  deliver["Channel delivery"]

  insert --> event
  event --> policy
  policy --> deliver
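
To make "Debezium-shaped JSON" concrete, here is a minimal sketch of an insert event in Go. The `before`, `after`, `source`, `op`, and `ts_ms` fields follow Debezium's change-event envelope, where `op` is `"c"` for a create. The `ChapterRow` columns and the trimmed `source` block are assumptions for illustration, not the project's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChapterRow is a hypothetical subset of the chapter table's columns.
type ChapterRow struct {
	ID       int64  `json:"id"`
	SeriesID string `json:"series_id"`
	Number   string `json:"number"`
}

// Envelope mirrors the top-level fields of a Debezium change event.
// A real Debezium "source" block carries more metadata than this.
type Envelope struct {
	Before *ChapterRow       `json:"before"`
	After  *ChapterRow       `json:"after"`
	Source map[string]string `json:"source"`
	Op     string            `json:"op"`
	TsMs   int64             `json:"ts_ms"`
}

// InsertEvent builds the event for a newly inserted chapter:
// no "before" image, the new row as "after", and op "c" (create).
func InsertEvent(row ChapterRow, tsMs int64) Envelope {
	return Envelope{
		Before: nil,
		After:  &row,
		Source: map[string]string{"table": "chapters"},
		Op:     "c",
		TsMs:   tsMs,
	}
}

// Encode serializes the envelope to the JSON that would go on the bus.
func Encode(e Envelope) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	ev := InsertEvent(ChapterRow{ID: 42, SeriesID: "series-1", Number: "12"}, 1700000000000)
	out, err := Encode(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Keeping the envelope close to Debezium's shape means the notifier would not need a new parser if WAL-based CDC replaces application publish later.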

Six sources, one adapter contract

Each source adapter implements the same contract: return normalized series and chapter rows. Behind that interface, everything is different:

| Source | Access model | Stability risk |
| --- | --- | --- |
| MangaDex | Official API | Rate limits, API versioning |
| Manga Plus | Official API | Regional availability |
| MangaFire | HTML scraping | Layout changes |
| MangaPill | HTML scraping | Anti-bot behavior |
| MangaTown | HTML scraping | Pagination quirks |
| Asura Scans | HTML scraping | Irregular release timing |
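
One way the shared contract could look in Go is sketched below. The `Adapter` interface, the `Chapter` fields, and `RunAll` are hypothetical names, and the sketch returns only chapter rows to stay short. The point it illustrates is isolation: a failing source is recorded and skipped, and the other sources still deliver their rows.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Chapter is a hypothetical normalized row; the real schema may differ.
type Chapter struct {
	Source   string // e.g. "mangadex"
	SeriesID string // source-local series identifier
	Number   string // string, so values like "10.5" survive
	URL      string
}

// Adapter is the one contract every source implements (assumed shape).
type Adapter interface {
	Name() string
	Fetch(ctx context.Context) ([]Chapter, error)
}

// ErrLayoutChanged simulates the kind of failure an HTML adapter hits.
var ErrLayoutChanged = errors.New("layout changed")

// fakeAdapter stands in for a real API or HTML adapter.
type fakeAdapter struct {
	name     string
	chapters []Chapter
	err      error
}

func (a fakeAdapter) Name() string { return a.name }

func (a fakeAdapter) Fetch(ctx context.Context) ([]Chapter, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return a.chapters, a.err
}

// RunAll fetches from every adapter with its own timeout. A failing source
// lands in the error map and is skipped; the rest still contribute rows.
func RunAll(adapters []Adapter) ([]Chapter, map[string]error) {
	var all []Chapter
	failed := make(map[string]error)
	for _, a := range adapters {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		rows, err := a.Fetch(ctx)
		cancel()
		if err != nil {
			failed[a.Name()] = err
			continue
		}
		all = append(all, rows...)
	}
	return all, failed
}

func main() {
	adapters := []Adapter{
		fakeAdapter{name: "mangadex", chapters: []Chapter{{Source: "mangadex", SeriesID: "s1", Number: "12"}}},
		fakeAdapter{name: "mangatown", err: ErrLayoutChanged},
	}
	rows, failed := RunAll(adapters)
	fmt.Println(len(rows), "rows;", len(failed), "failed sources")
}
```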

The diff engine compares incoming rows against PostgreSQL. New chapters get inserted, flagged, and published. The design doc calls this principle "Real > Impressive": fixture tests are necessary, but tagged releases are still gated on runs against the live sites.
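
The core of the diff step can be sketched as a set difference. In the sketch below an in-memory map stands in for the PostgreSQL lookup, and `ChapterKey` is an assumed notion of chapter identity (source, series, chapter number), not necessarily the key the project uses.

```go
package main

import "fmt"

// ChapterKey is a hypothetical chapter identity.
type ChapterKey struct {
	Source   string
	SeriesID string
	Number   string
}

// Diff returns the incoming chapters that are not already stored,
// in arrival order, dropping duplicates within the same batch.
func Diff(existing map[ChapterKey]bool, incoming []ChapterKey) []ChapterKey {
	seen := make(map[ChapterKey]bool, len(incoming))
	var fresh []ChapterKey
	for _, k := range incoming {
		if existing[k] || seen[k] {
			continue
		}
		seen[k] = true
		fresh = append(fresh, k)
	}
	return fresh
}

func main() {
	existing := map[ChapterKey]bool{
		{Source: "mangadex", SeriesID: "s1", Number: "11"}: true,
	}
	incoming := []ChapterKey{
		{Source: "mangadex", SeriesID: "s1", Number: "11"},
		{Source: "mangadex", SeriesID: "s1", Number: "12"},
		{Source: "mangadex", SeriesID: "s1", Number: "12"}, // listed twice by the source
	}
	for _, k := range Diff(existing, incoming) {
		fmt.Println("new chapter:", k.SeriesID, k.Number)
	}
}
```

In a database-backed version, a unique constraint on the same identity columns would enforce the rule even if two scrape runs overlap.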

Consistency: what is actually guaranteed

This pipeline is not linearizable end-to-end. That is fine — but you have to know the gaps:

| Step | Guarantee |
| --- | --- |
| DB insert of new chapter | ACID per transaction |
| Event publish after insert | Best effort in-process |
| Notifier delivery | At-least-once with idempotent handlers |
| Dashboard freshness | Cached read path |

A crash after commit but before publish leaves the chapter's is_new flag set to true, so the next scrape run can find it and publish it. Notification handlers dedupe on chapter identity, so retries do not spam channels.

stateDiagram-v2
  [*] --> Scraped: adapter run
  Scraped --> Committed: DB insert
  Committed --> Published: event publish
  Published --> Delivered: notifier
  Committed --> Reconcile: crash before publish
  Reconcile --> Published: next scrape run
  Delivered --> [*]
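
The reconcile and dedupe behavior described above can be sketched in a few lines of Go. This assumes the flag is cleared only after a successful publish and that the dedupe set is keyed by a chapter identity string. Both `Reconcile` and `Deduper` are hypothetical names, and a real deployment would keep this state in durable storage, not in memory.

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a stored chapter with its publish flag.
type Row struct {
	Key   string // chapter identity, e.g. "source/series/number"
	IsNew bool
}

// Reconcile republishes every row still flagged as new and clears the flag
// only after publish succeeds, so a failed publish is retried next run.
func Reconcile(rows []Row, publish func(key string) error) int {
	published := 0
	for i := range rows {
		if !rows[i].IsNew {
			continue
		}
		if err := publish(rows[i].Key); err != nil {
			continue // stays flagged for the next run
		}
		rows[i].IsNew = false
		published++
	}
	return published
}

// Deduper makes delivery idempotent per chapter identity.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduper() *Deduper { return &Deduper{seen: make(map[string]bool)} }

// Deliver runs send unless the key was already delivered. A failed send is
// not recorded, so an at-least-once retry can still get through.
func (d *Deduper) Deliver(key string, send func() error) (bool, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[key] {
		return false, nil
	}
	if err := send(); err != nil {
		return false, err
	}
	d.seen[key] = true
	return true, nil
}

func main() {
	rows := []Row{{Key: "mangadex/s1/12", IsNew: true}, {Key: "mangadex/s1/11"}}
	d := NewDeduper()
	publish := func(key string) error {
		_, err := d.Deliver(key, func() error {
			fmt.Println("notify", key)
			return nil
		})
		return err
	}
	fmt.Println("republished:", Reconcile(rows, publish))
	fmt.Println("republished:", Reconcile(rows, publish))
}
```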

Operators see health through the status page and pipeline health endpoints — because silent scrape failure is the most common production incident class.
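
A health check for silent scrape failure could be as simple as the sketch below: flag any source whose most recent runs all returned zero rows. The `SilentSources` function, the per-source history map, and the threshold are assumptions for illustration; the actual health endpoints may use different signals.

```go
package main

import (
	"fmt"
	"sort"
)

// SilentSources returns, in sorted order, every source whose last
// `threshold` runs all produced zero rows. Sources with fewer recorded
// runs than the threshold are not flagged.
func SilentSources(history map[string][]int, threshold int) []string {
	var silent []string
	for source, counts := range history {
		if threshold <= 0 || len(counts) < threshold {
			continue
		}
		allZero := true
		for _, c := range counts[len(counts)-threshold:] {
			if c != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			silent = append(silent, source)
		}
	}
	sort.Strings(silent)
	return silent
}

func main() {
	// Row counts per run, oldest first.
	history := map[string][]int{
		"mangatown": {12, 0, 0, 0},
		"mangadex":  {5, 0, 7},
	}
	fmt.Println("silent sources:", SilentSources(history, 3))
}
```

A zero-row run is not always a failure, since some sources release irregularly, so the threshold would likely need tuning per source.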

Notification policy, not just delivery

The notifier is more than a webhook forwarder. It implements:

  • Per-series preferences — which channels, which series, batching rules
  • Channel adapters — Discord, Slack, Telegram with separate formatting
  • Mutation guards — write APIs protected when mutations are disabled in prod
  • Notification log — audit trail of what was sent and why
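
The policy pieces listed above can be sketched as small pure functions. The sketch is in Go only for consistency with the other examples here; the notifier itself is Spring Boot. `Preference`, `Route`, `Format`, and `AllowRequest` are hypothetical names. The bold syntax per channel reflects each platform's formatting (Discord markdown, Slack mrkdwn, Telegram HTML parse mode).

```go
package main

import "fmt"

type Channel string

const (
	Discord  Channel = "discord"
	Slack    Channel = "slack"
	Telegram Channel = "telegram"
)

// Preference is a hypothetical per-series routing rule.
type Preference struct {
	Channels []Channel
	Muted    bool
}

// Route returns the channels a series should notify, or nil when the
// series is unknown or muted.
func Route(prefs map[string]Preference, seriesID string) []Channel {
	p, ok := prefs[seriesID]
	if !ok || p.Muted {
		return nil
	}
	return p.Channels
}

// Format renders the same event differently per channel. Real code would
// also escape the title for each platform.
func Format(ch Channel, title, number string) string {
	switch ch {
	case Discord:
		return fmt.Sprintf("**%s** chapter %s", title, number)
	case Slack:
		return fmt.Sprintf("*%s* chapter %s", title, number)
	case Telegram:
		return fmt.Sprintf("<b>%s</b> chapter %s", title, number)
	default:
		return fmt.Sprintf("%s chapter %s", title, number)
	}
}

// AllowRequest is a mutation guard: reads always pass, writes pass only
// when mutations are enabled for this deployment.
func AllowRequest(method string, mutationsEnabled bool) bool {
	if method == "GET" || method == "HEAD" {
		return true
	}
	return mutationsEnabled
}

func main() {
	prefs := map[string]Preference{
		"s1": {Channels: []Channel{Discord, Telegram}},
		"s2": {Channels: []Channel{Slack}, Muted: true},
	}
	for _, ch := range Route(prefs, "s1") {
		fmt.Println(ch, "->", Format(ch, "Example Series", "12"))
	}
	fmt.Println("POST allowed in prod:", AllowRequest("POST", false))
}
```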

The dashboard proxies read APIs so browser clients never hold service credentials. The status page polls pipeline health into a Vercel-friendly KV model for public visibility.

Deployment and the cost profile

Phase 1 optimizes for BYO data plane and scale-to-zero where possible:

  • Scraper as a scheduled job (Railway, Cloud Run job, cron)
  • PostgreSQL on Neon, Supabase, or self-hosted
  • Kafka on Upstash, or QStash as a lighter alternative
  • Notifier on Cloud Run
  • Dashboard and status on Vercel

Terraform modules exist for multiple clouds — not because multi-cloud is fun, but because portability is part of the credibility goal. The env parsing patterns are shared; the topology is the same.
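
As an illustration of a shared env parsing pattern, the sketch below loads configuration through an injected lookup function, so the same code runs against the real environment and against a map in tests. The variable names `DATABASE_URL` and `EVENT_BUS`, and the `qstash` default, are assumptions, not the project's actual settings.

```go
package main

import (
	"fmt"
	"os"
)

// Config holds the settings every deployment target must supply.
type Config struct {
	DatabaseURL string
	Bus         string // "kafka" or "qstash"
}

// Load reads config via lookup, which has the same signature as
// os.LookupEnv. It fails fast on missing or invalid values.
func Load(lookup func(string) (string, bool)) (Config, error) {
	db, ok := lookup("DATABASE_URL")
	if !ok || db == "" {
		return Config{}, fmt.Errorf("DATABASE_URL is required")
	}
	bus, ok := lookup("EVENT_BUS")
	if !ok || bus == "" {
		bus = "qstash" // assumed default: the lighter option
	}
	if bus != "kafka" && bus != "qstash" {
		return Config{}, fmt.Errorf("EVENT_BUS must be kafka or qstash, got %q", bus)
	}
	return Config{DatabaseURL: db, Bus: bus}, nil
}

func main() {
	cfg, err := Load(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	// Print only non-secret fields.
	fmt.Println("event bus:", cfg.Bus)
}
```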

CI is a release train: PR validation, image builds, Helm charts, and e2e gates on tagged releases. The project even dogfoods cross-repo orchestration patterns (more on that in the pipeline-compose series).

What I would do differently

If I started today with only personal use in mind, I might defer Kafka and use QStash-only eventing longer. The abstraction still pays off — switching buses should not require rewriting the diff engine — but operating Kafka for a single-tenant setup is real overhead.
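
The bus abstraction mentioned above can be as thin as a one-method interface. In this sketch `Publisher`, `Recorder`, `Down`, and `PublishNew` are hypothetical names, and the topic `chapters.new` is made up. The in-memory `Recorder` stands in for a Kafka or QStash client, which would each get their own implementation behind the same interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is the only thing the diff engine needs to know about the bus.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// Recorder is an in-memory stand-in for a real bus client.
type Recorder struct {
	Sent []string
}

func (r *Recorder) Publish(topic string, payload []byte) error {
	r.Sent = append(r.Sent, topic+":"+string(payload))
	return nil
}

// ErrBusDown simulates an unavailable bus.
var ErrBusDown = errors.New("bus unavailable")

// Down is a Publisher that always fails.
type Down struct{}

func (Down) Publish(string, []byte) error { return ErrBusDown }

// PublishNew sends one event per new chapter key and stops at the first
// failure, returning how many events went out.
func PublishNew(p Publisher, keys []string) (int, error) {
	for i, k := range keys {
		if err := p.Publish("chapters.new", []byte(k)); err != nil {
			return i, fmt.Errorf("publish %q: %w", k, err)
		}
	}
	return len(keys), nil
}

func main() {
	rec := &Recorder{}
	n, err := PublishNew(rec, []string{"mangadex/s1/12", "mangadex/s1/13"})
	fmt.Println(n, err, rec.Sent)

	n, err = PublishNew(Down{}, []string{"mangadex/s1/14"})
	fmt.Println(n, err)
}
```

Swapping buses then means swapping the value passed to `PublishNew`; the diff engine that produces the keys does not change.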

I would also ship the watchlist contribution flow earlier. Community series discovery is the main growth loop for a project like this.

What is next in this series

Future posts will go deeper into:

  • Source adapter design and scrape SLOs
  • The dual-write path and idempotency contracts
  • Notification filter graph and auth model
  • CI/CD gates and the release train
  • The Phase 2/3 roadmap (SaaS, true WAL CDC)

If you are building something that looks like “scrape → diff → notify,” the manga-cdc architecture docs are the design record. Start with the architecture reading guide on the repo.

Links: GitHub · Dashboard · Status