Kafka to NSQ Migration Playbook

Version: 1.0 Status: Draft Last Updated: 2026-04-03

Purpose

Migrate Simpaisa's messaging infrastructure from Kafka to NSQ. Kafka is currently undocumented — no topic naming conventions, partitioning strategies, or consumer group configurations exist. This playbook defines the discovery, migration, and cutover process.

Why NSQ

  • Simpler operations: no ZooKeeper/KRaft dependency, no broker rebalancing.
  • Decentralised topology: nsqd runs per service host, no single cluster bottleneck.
  • Operationally transparent: nsqadmin provides immediate visibility into message flow.
  • Sufficient for our scale: 270M+ transactions processed reliably with at-least-once delivery.

Phase 1 — Discovery & Documentation

Goal: document what exists before changing anything.

  • Audit all Kafka topics across PK, BD, NP, IQ, EG environments.
  • For each topic, record: name, producer service(s), consumer group(s), partition count, average throughput (msg/s), message format, retention policy.
  • Identify ordering dependencies (topics where partition key matters).
  • Output: a Kafka inventory spreadsheet/table — this is the migration source of truth.
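The inventory row can be captured as a structured record so the audit output is machine-checkable rather than free-form. A sketch, with field names taken from the list above (the example values are placeholders, not real audit data):

```python
from dataclasses import dataclass

@dataclass
class KafkaTopicRecord:
    """One row of the Phase 1 Kafka inventory (illustrative schema)."""
    name: str
    producers: list            # producing service names
    consumer_groups: list
    partitions: int
    avg_throughput_msg_s: float
    message_format: str        # e.g. "json", "avro"
    retention: str             # e.g. "7d"
    ordered: bool = False      # True if partition-key ordering matters

# Placeholder example row.
row = KafkaTopicRecord(
    name="payin-transactions",
    producers=["payin-service"],
    consumer_groups=["ledger-service", "notification-service"],
    partitions=12,
    avg_throughput_msg_s=150.0,
    message_format="json",
    retention="7d",
    ordered=True,
)
```

Records flagged `ordered=True` feed directly into the ordering-mitigation row in Risks & Mitigations.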

Phase 2 — NSQ Topic Naming Convention

All NSQ topics follow {domain}.{event}.{version}, where {event} is an {entity}.{action} pair (e.g. transaction.completed).

Examples:

  • payin.transaction.completed.v1
  • payout.transfer.initiated.v1
  • remittance.quote.created.v1
  • card.authorisation.approved.v1

Channels map to consumer groups: {consuming-service}-{purpose} e.g., ledger-service-reconciliation, notification-service-alerts
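A small validator can reject malformed topic and channel names in CI before they reach production. The regexes below are a sketch of the stated conventions (note NSQ itself only restricts names to letters, digits, `.`, `_`, `-` and a 64-character limit; the stricter patterns are this playbook's convention):

```python
import re

# {domain}.{entity}.{action}.{version} — e.g. payin.transaction.completed.v1
TOPIC_RE = re.compile(r"^[a-z]+(\.[a-z]+){2}\.v\d+$")
# {consuming-service}-{purpose} — e.g. ledger-service-reconciliation
CHANNEL_RE = re.compile(r"^[a-z][a-z-]*-[a-z]+$")

def valid_topic(name: str) -> bool:
    # NSQ caps names at 64 characters; enforce that alongside the convention.
    return len(name) <= 64 and bool(TOPIC_RE.match(name))

def valid_channel(name: str) -> bool:
    return len(name) <= 64 and bool(CHANNEL_RE.match(name))
```

Running this against the Phase 1 inventory gives an early list of topics whose NSQ names need to be agreed before dual-write starts.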

Phase 3 — Deploy NSQ Alongside Kafka

  • Deploy nsqd on each service host (co-located with producers/consumers).
  • Deploy 3x nsqlookupd instances for topic discovery.
  • Deploy nsqadmin for operational visibility.
  • Configure nsq_exporter to push metrics to Grafana.
  • Validate: publish test messages to a local nsqd, then confirm a test consumer discovers the topic via nsqlookupd and receives the messages.

Rollback: tear down NSQ components. Kafka unaffected.

Phase 4 — Dual-Write

  • Modify producers to write to both Kafka and NSQ simultaneously.
  • NSQ messages use the standard envelope (see Message Format below).
  • Monitor: compare message counts between Kafka and NSQ for each topic.
  • Duration: minimum 2 weeks per market, longer if discrepancies found.
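The monitoring bullet above can be automated as a per-topic count comparison. This sketch assumes counts have already been collected from Kafka consumer offsets and nsqd stats for the same window (the collection itself is out of scope here):

```python
def dual_write_discrepancies(kafka_counts, nsq_counts, tolerance=0.001):
    """Compare per-topic message counts from Kafka and NSQ.

    Returns topics whose relative difference exceeds `tolerance`
    (0.1% by default). Topics present on only one side always flag.
    """
    flagged = {}
    for topic in set(kafka_counts) | set(nsq_counts):
        k = kafka_counts.get(topic, 0)
        n = nsq_counts.get(topic, 0)
        if k == 0 and n == 0:
            continue
        diff = abs(k - n) / max(k, n)
        if diff > tolerance:
            flagged[topic] = {"kafka": k, "nsq": n, "diff": diff}
    return flagged
```

Any flagged topic resets that market's 2-week clock until the discrepancy is explained.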

Rollback: disable NSQ writes in producer config. Kafka continues unchanged.

Phase 5 — Consumer Migration

Migrate consumers one service at a time, lowest-risk first:

  1. Notification consumers (non-critical, idempotent).
  2. Audit/logging consumers.
  3. Reconciliation consumers.
  4. Core transaction processing consumers (highest risk, last).

For each consumer:

  • Switch from the Kafka consumer to an NSQ subscriber.
  • Verify message processing rate, error rate, and latency.
  • Keep the Kafka consumer running in shadow mode for 1 week.
  • Disable the Kafka consumer after verification.
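A minimal sketch of the NSQ-side handler for a migrated consumer, including the idempotency check that this phase (and Delivery Guarantees below) requires. The dedup store here is an in-memory set purely for illustration; a real service would use Redis or a database keyed on traceId or a business idempotency key:

```python
import json

class IdempotentHandler:
    """Processes at-least-once NSQ deliveries safely by deduplicating on traceId."""

    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.seen = set()  # stand-in for a persistent dedup store

    def handle(self, raw_body: bytes) -> bool:
        """Return True to FIN (ack) the message, False to REQ (requeue)."""
        envelope = json.loads(raw_body)
        trace_id = envelope["traceId"]
        if trace_id in self.seen:
            return True        # duplicate delivery: ack without reprocessing
        try:
            self.process_fn(envelope["payload"])
        except Exception:
            return False       # let NSQ requeue with backoff
        self.seen.add(trace_id)
        return True
```

The same handler, fed from the Kafka shadow consumer, gives a like-for-like comparison during the 1-week shadow window.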

Rollback: re-enable Kafka consumer, disable NSQ subscriber.

Phase 6 — Verification & Cutover

  • Confirm all consumers are on NSQ with zero Kafka dependencies.
  • Compare metrics: message throughput, p99 latency, error rates.
  • Disable Kafka producers (dual-write off).
  • Monitor for 1 week post-cutover.

Phase 7 — Decommission Kafka

  • Remove Kafka client libraries from services.
  • Tear down Kafka brokers and ZooKeeper nodes.
  • Archive Kafka configuration for reference.
  • Update infrastructure-as-code to remove Kafka resources.

Message Format

Standard JSON envelope for all NSQ messages:

{
  "eventType": "payin.transaction.completed.v1",
  "timestamp": "2026-04-03T14:30:00Z",
  "traceId": "abc123-def456-ghi789",
  "source": "payin-service",
  "payload": { ... }
}

All fields are mandatory. traceId links to OpenTelemetry distributed traces.
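Since all fields are mandatory, producers can validate the envelope before publishing. A sketch, with the field list taken from the envelope above (the timestamp check is deliberately minimal):

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("eventType", "timestamp", "traceId", "source", "payload")

def validate_envelope(raw: bytes) -> dict:
    """Parse an NSQ message envelope; raise ValueError if a field is missing."""
    envelope = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if f not in envelope]
    if missing:
        raise ValueError(f"envelope missing fields: {missing}")
    # Minimal RFC 3339 sanity check on the timestamp (UTC "Z" suffix).
    datetime.fromisoformat(envelope["timestamp"].replace("Z", "+00:00"))
    return envelope
```

Rejecting malformed envelopes at publish time keeps the dual-write comparison clean: every message that reaches NSQ is countable and traceable.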

Delivery Guarantees

  • At-least-once delivery: NSQ delivers each message at least once; duplicates are possible after timeouts or requeues.
  • Consumers MUST be idempotent. Use traceId or a business-level idempotency key to deduplicate.
  • Max-in-flight: configure per consumer based on processing capacity. Start with 5, tune upward.
  • Failure handling: exponential backoff with jitter. NSQ requeues failed messages automatically.
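The failure-handling bullet can be made concrete as a requeue-delay calculation. A sketch using full jitter, with the cap chosen to stay within the 1h max-req-timeout default below (delays beyond that exceed the daemon's configured maximum):

```python
import random

def requeue_delay_s(attempt: int, base: float = 2.0, cap: float = 3600.0) -> float:
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    `cap` matches nsqd's default max-req-timeout of 1h.
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)
```

Full jitter spreads retries across the window, so a burst of failures does not requeue in lockstep and hammer a recovering downstream dependency.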

NSQ Configuration Defaults

  • mem-queue-size: 10,000. Messages held in memory before spilling to disk.
  • max-msg-size: 1 MB. Reject messages exceeding this.
  • msg-timeout: 60s. Consumer must ACK within this window.
  • max-req-timeout: 1h. Maximum requeue delay.

Monitoring

  • nsqadmin: real-time topic/channel depth, in-flight counts, consumer connections.
  • Grafana dashboards: nsq_exporter metrics — message rate, channel depth, requeue rate, timeout rate.
  • Alerting: channel depth > 10,000 (warning), > 100,000 (critical). Consumer disconnect alerts.
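The depth thresholds above map directly to alert severities. A sketch of the mapping as it might live in an alerting rule or health check (thresholds copied from the bullet; the function name is illustrative):

```python
def depth_alert_level(channel_depth: int) -> str:
    """Map NSQ channel depth to an alert severity per the thresholds above."""
    if channel_depth > 100_000:
        return "critical"
    if channel_depth > 10_000:
        return "warning"
    return "ok"
```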

Risks & Mitigations

  • Risk: Undocumented Kafka usage discovered mid-migration. Mitigation: thorough Phase 1 discovery; the dual-write phase catches remaining gaps.
  • Risk: Message ordering requirements. Mitigation: identify in Phase 1; for strict ordering, route the topic through a single nsqd and a single channel.
  • Risk: NSQ message loss on nsqd crash. Mitigation: enable --data-path disk persistence (set mem-queue-size to 0 for critical topics); the nsqd-per-host topology limits the blast radius of a single host failure (note NSQ does not replicate messages).
  • Risk: Consumer cannot handle at-least-once delivery. Mitigation: enforce idempotency in code review before migration.