Kafka to NSQ Migration Playbook¶
Version: 1.0 Status: Draft Last Updated: 2026-04-03
Purpose¶
Migrate Simpaisa's messaging infrastructure from Kafka to NSQ. Kafka is currently undocumented — no topic naming conventions, partitioning strategies, or consumer group configurations exist. This playbook defines the discovery, migration, and cutover process.
Why NSQ¶
- Simpler operations: no ZooKeeper/KRaft dependency, no broker rebalancing.
- Decentralised topology: nsqd runs per service host, no single cluster bottleneck.
- Operationally transparent: nsqadmin provides immediate visibility into message flow.
- Sufficient for our scale: 270M+ transactions processed reliably with at-least-once delivery.
Phase 1 — Discovery & Documentation¶
Goal: document what exists before changing anything.
- Audit all Kafka topics across PK, BD, NP, IQ, EG environments.
- For each topic, record: name, producer service(s), consumer group(s), partition count, average throughput (msg/s), message format, retention policy.
- Identify ordering dependencies (topics where partition key matters).
- Output: a Kafka inventory spreadsheet/table — this is the migration source of truth.
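To keep the audit output machine-checkable rather than free-form, each inventory row can be modeled as a small record. This is a sketch only; the field names below are illustrative, not an agreed schema:

```python
from dataclasses import dataclass

@dataclass
class KafkaTopicRecord:
    """One row of the Phase 1 inventory (illustrative field names)."""
    name: str
    producers: list          # producing service names
    consumer_groups: list    # consumer group ids
    partitions: int
    avg_msgs_per_sec: float
    message_format: str      # e.g. "json", "avro"
    retention: str           # e.g. "7d"
    ordering_sensitive: bool = False  # does the partition key matter?
```

A list of these records, exported to the spreadsheet, becomes the migration source of truth.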
Phase 2 — NSQ Topic Naming Convention¶
All NSQ topics follow: {domain}.{entity}.{action}.{version}
Examples:
- payin.transaction.completed.v1
- payout.transfer.initiated.v1
- remittance.quote.created.v1
- card.authorisation.approved.v1
Channels map to consumer groups: {consuming-service}-{purpose}
e.g., ledger-service-reconciliation, notification-service-alerts
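The conventions above can be enforced at publish time with a simple check. The patterns below are a sketch of this playbook's convention, not an NSQ requirement; NSQ itself only constrains names to 1–64 characters from `[.a-zA-Z0-9_-]`:

```python
import re

# Topic: lowercase dot-separated segments ending in a version (e.g. .v1).
TOPIC_RE = re.compile(r"^[a-z][a-z0-9]*(\.[a-z][a-z0-9]*){2,}\.v\d+$")
# Channel: lowercase hyphen-separated, i.e. {consuming-service}-{purpose}.
CHANNEL_RE = re.compile(r"^[a-z][a-z0-9-]*-[a-z][a-z0-9-]*$")

def valid_topic(name: str) -> bool:
    # 64-char cap comes from NSQ's own name limit.
    return len(name) <= 64 and TOPIC_RE.fullmatch(name) is not None

def valid_channel(name: str) -> bool:
    return len(name) <= 64 and CHANNEL_RE.fullmatch(name) is not None
```

Running the check in CI (or in a shared publisher wrapper) prevents convention drift before it reaches production.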
Phase 3 — Deploy NSQ Alongside Kafka¶
- Deploy nsqd on each service host (co-located with producers/consumers).
- Deploy 3x nsqlookupd instances for topic discovery.
- Deploy nsqadmin for operational visibility.
- Configure nsq_exporter to push metrics to Grafana.
- Validate: publish test messages, confirm delivery through nsqlookupd.
Rollback: tear down NSQ components. Kafka unaffected.
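The topology above can be sketched with standard NSQ daemon flags. Hostnames and paths are placeholders; values mirror the NSQ Configuration Defaults table later in this document:

```shell
# One nsqd per service host, registered with all three nsqlookupd instances.
nsqd \
  --lookupd-tcp-address=lookupd-1:4160 \
  --lookupd-tcp-address=lookupd-2:4160 \
  --lookupd-tcp-address=lookupd-3:4160 \
  --data-path=/var/lib/nsq \
  --mem-queue-size=10000 \
  --max-msg-size=1048576 \
  --msg-timeout=60s \
  --max-req-timeout=1h

# Topic discovery (run 3x on separate hosts for availability).
nsqlookupd

# Operational dashboard, reading cluster state from the lookupd instances.
nsqadmin \
  --lookupd-http-address=lookupd-1:4161 \
  --lookupd-http-address=lookupd-2:4161 \
  --lookupd-http-address=lookupd-3:4161
```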
Phase 4 — Dual-Write¶
- Modify producers to write to both Kafka and NSQ simultaneously.
- NSQ messages use the standard envelope (see Message Format below).
- Monitor: compare message counts between Kafka and NSQ for each topic.
- Duration: minimum 2 weeks per market, longer if discrepancies found.
Rollback: disable NSQ writes in producer config. Kafka continues unchanged.
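The dual-write path can be sketched as a thin wrapper around the two clients. `kafka_send` and `nsq_send` are hypothetical injected callables, not real client APIs; the key design point is that Kafka remains the source of truth during Phase 4, so an NSQ failure is logged and counted but never fails the business transaction:

```python
import logging

log = logging.getLogger("dual-write")

def publish_dual(kafka_send, nsq_send, topic: str, message: bytes) -> None:
    """Write to Kafka (authoritative) and NSQ (under evaluation)."""
    kafka_send(topic, message)       # existing path, unchanged; errors propagate
    try:
        nsq_send(topic, message)     # new path; failures are observed, not fatal
    except Exception:
        log.exception("NSQ dual-write failed for topic %s", topic)
```

Logged failures feed the per-topic count comparison; any sustained divergence extends the 2-week window.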
Phase 5 — Consumer Migration¶
Migrate consumers one service at a time, lowest-risk first:
- Notification consumers (non-critical, idempotent).
- Audit/logging consumers.
- Reconciliation consumers.
- Core transaction processing consumers (highest risk, last).
For each consumer:
- Switch from the Kafka consumer to an NSQ subscriber.
- Verify: message processing rate, error rate, latency.
- Keep the Kafka consumer running in shadow mode for 1 week.
- Disable the Kafka consumer after verification.
Rollback: re-enable Kafka consumer, disable NSQ subscriber.
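The shadow-mode check can be made concrete: both consumers record the traceId of every message they process during the 1-week window, and any asymmetry blocks disabling the Kafka consumer. A minimal sketch:

```python
def shadow_diff(kafka_ids: set, nsq_ids: set) -> dict:
    """Compare traceIds processed by the Kafka shadow vs the NSQ subscriber."""
    return {
        "missing_from_nsq": kafka_ids - nsq_ids,  # NSQ dropped or lagging
        "extra_in_nsq": nsq_ids - kafka_ids,      # duplicates or stray publishes
    }
```

An empty diff for the full verification period is the exit criterion for each consumer.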
Phase 6 — Verification & Cutover¶
- Confirm all consumers are on NSQ with zero Kafka dependencies.
- Compare metrics: message throughput, p99 latency, error rates.
- Disable Kafka producers (dual-write off).
- Monitor for 1 week post-cutover.
Phase 7 — Decommission Kafka¶
- Remove Kafka client libraries from services.
- Tear down Kafka brokers and ZooKeeper nodes.
- Archive Kafka configuration for reference.
- Update infrastructure-as-code to remove Kafka resources.
Message Format¶
Standard JSON envelope for all NSQ messages:
{
"eventType": "payin.transaction.completed.v1",
"timestamp": "2026-04-03T14:30:00Z",
"traceId": "abc123-def456-ghi789",
"source": "payin-service",
"payload": { ... }
}
All fields are mandatory. traceId links to OpenTelemetry distributed traces.
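A sketch of building and validating the envelope. Function names are illustrative; generating a traceId locally is a fallback only, since normally it is propagated from the incoming request's trace context:

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_FIELDS = {"eventType", "timestamp", "traceId", "source", "payload"}

def make_envelope(event_type, source, payload, trace_id=None):
    """Wrap a payload in the standard envelope."""
    return {
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "traceId": trace_id or str(uuid.uuid4()),  # fallback; normally propagated
        "source": source,
        "payload": payload,
    }

def validate_envelope(raw: bytes) -> dict:
    """Reject any message missing a mandatory field before processing it."""
    env = json.loads(raw)
    missing = REQUIRED_FIELDS - env.keys()
    if missing:
        raise ValueError(f"envelope missing fields: {sorted(missing)}")
    return env
```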
Delivery Guarantees¶
- At-least-once delivery: every message is delivered at least once; duplicates are possible after requeues or client reconnects.
- Consumers MUST be idempotent. Use traceId or a business-level idempotency key to deduplicate.
- Max-in-flight: configure per consumer based on processing capacity. Start with 5, tune upward.
- Failure handling: exponential backoff with jitter. nsqd requeues messages that time out or that a consumer explicitly requeues (REQ).
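The two consumer-side requirements above can be sketched together. The TraceId-based dedup and the full-jitter backoff are illustrative patterns, not an NSQ client API; a real service would back the seen-set with a TTL'd store (e.g. Redis), not process memory:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=3600.0):
    """Full-jitter exponential backoff: requeue delay (seconds) for the
    Nth failed attempt, capped at max-req-timeout (1h)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class IdempotentHandler:
    """Deduplicate on traceId so at-least-once delivery cannot
    double-apply an event."""
    def __init__(self, process):
        self.process = process   # the actual business logic
        self.seen = set()        # illustrative; use a shared TTL'd store in practice

    def handle(self, envelope: dict) -> bool:
        trace_id = envelope["traceId"]
        if trace_id in self.seen:
            return True          # duplicate: ACK without reprocessing
        self.process(envelope)
        self.seen.add(trace_id)  # mark done only after success
        return True              # returning True ACKs the message
```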
NSQ Configuration Defaults¶
| Parameter | Value | Notes |
|---|---|---|
| mem-queue-size | 10,000 | Messages held in memory before disk spill. |
| max-msg-size | 1 MB | Reject messages exceeding this. |
| msg-timeout | 60s | Consumer must ACK within this window. |
| max-req-timeout | 1h | Maximum requeue delay. |
Monitoring¶
- nsqadmin: real-time topic/channel depth, in-flight counts, consumer connections.
- Grafana dashboards: nsq_exporter metrics — message rate, channel depth, requeue rate, timeout rate.
- Alerting: channel depth > 10,000 (warning), > 100,000 (critical). Consumer disconnect alerts.
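The alert thresholds above translate directly into the evaluation rule, sketched here as a plain function (thresholds are this playbook's, defaults shown for clarity):

```python
def channel_depth_severity(depth, warning=10_000, critical=100_000):
    """Map an NSQ channel depth to the alert levels defined above."""
    if depth > critical:
        return "critical"
    if depth > warning:
        return "warning"
    return "ok"
```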
Risks & Mitigations¶
| Risk | Mitigation |
|---|---|
| Undocumented Kafka usage discovered mid-migration | Phase 1 discovery is thorough; dual-write phase catches gaps |
| Message ordering requirements | Identify in Phase 1; for strict ordering, use single nsqd + single channel |
| NSQ message loss on nsqd crash | Enable --data-path disk persistence; set mem-queue-size=0 for critical topics so messages are never held only in memory |
| Consumer cannot handle at-least-once | Enforce idempotency in code review before migration |