Building Event-Driven Systems with Kafka: Lessons from 11 Years in the Field

By Gaurav Pratap Singh  ·  March 2026  ·  ~12 min read
Tags: Kafka · Distributed Systems · Java · Event-Driven · Spring Boot

Key Takeaways
  • Consumer group semantics are not optional knowledge — misunderstand them and you will lose messages in production
  • Partition count is a decision you can almost never reverse; overestimate deliberately
  • Schema Registry is not a nice-to-have — it is the contract layer of your event-driven system
  • Dead-letter topics are where you discover your real error handling strategy
  • Kafka on Kubernetes needs stateful-set awareness that most blog posts skip entirely
  • Claude Code can scaffold correct Kafka boilerplate in seconds — but you still need to understand what it generates

I have been building event-driven systems with Apache Kafka since 2018. Across healthcare claims processing (EDI 835/837), telecom network orchestration, financial trade data pipelines, and workforce management platforms, Kafka has been the backbone. The official docs are excellent. The Confluent tutorials are solid. But none of them will tell you the things that actually hurt in production at scale.

This post collects the hard-won lessons — the ones I learned by shipping broken things into production, by debugging 3 a.m. incidents, and by reviewing dozens of Kafka implementations across teams. It is written for engineers who already understand what Kafka is and want to understand how to use it well.


Lesson 1: Consumer Group Offset Management — When At-Least-Once Is a Lie You’re Telling Yourself

Every Kafka tutorial says: Kafka guarantees at-least-once delivery. What they quietly skip is the part where you are responsible for making that guarantee actually hold, and it is very easy to violate it without realizing.

The most common mistake: committing offsets before processing is complete. Here is the anti-pattern in Spring Kafka:

// DANGEROUS: offsets can be committed before the downstream write completes.
// With enable.auto.commit=true, the Kafka client commits on a timer,
// independent of whether processing has actually finished.
@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event) {
    claimService.persist(event); // if this throws after a commit has fired, the message is LOST
}

The fix: use AckMode.MANUAL_IMMEDIATE and commit only after your downstream operation succeeds:

@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event, Acknowledgment ack) {
    try {
        claimService.persist(event);
        ack.acknowledge(); // only commit after success
    } catch (RetryableException e) {
        throw e; // do NOT ack: let Kafka redeliver
    } catch (NonRetryableException e) {
        deadLetterRouter.send(event, e); // route to DLT first...
        ack.acknowledge(); // ...then commit, so a failed DLT write still triggers redelivery
    }
}

The deeper lesson: decide your delivery guarantee before you write any code. At-least-once, at-most-once, and effectively-once require fundamentally different implementations. Effectively-once (idempotent consumers + transactional producers) is the hardest — and the one most teams think they don’t need until they do.
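The idempotent-consumer half of effectively-once comes down to one move: record which event IDs you have already applied, and skip duplicates on redelivery. A minimal in-memory sketch of the idea — `IdempotentHandler`, the event-ID parameter, and the stand-in for `claimService.persist()` are all hypothetical names; in a real system the seen-set lives in the same datastore (and transaction) as the write itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent consumer: redelivered events are detected by ID
// and skipped, so at-least-once delivery behaves like effectively-once.
public class IdempotentHandler {
    // In production this lives in the same datastore (and transaction)
    // as the business write, not in memory.
    private final Map<String, Boolean> processed = new ConcurrentHashMap<>();
    private final List<String> persisted = new ArrayList<>();

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean handle(String eventId, String payload) {
        if (processed.putIfAbsent(eventId, Boolean.TRUE) != null) {
            return false; // already applied — this is a redelivery, skip it
        }
        persisted.add(payload); // stand-in for claimService.persist(event)
        return true;
    }

    public int persistedCount() {
        return persisted.size();
    }

    public static void main(String[] args) {
        IdempotentHandler h = new IdempotentHandler();
        h.handle("evt-1", "claim A");
        h.handle("evt-1", "claim A"); // redelivered duplicate — ignored
        h.handle("evt-2", "claim B");
        System.out.println(h.persistedCount()); // prints 2
    }
}
```

The key design point: the duplicate check and the write must be atomic, which is why the seen-set belongs next to the data, not in a separate cache.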

Lesson 2: Partition Count Is a Decision You Will Regret (Or Thank Yourself For) in 18 Months

Kafka does not let you easily decrease partitions. Increasing partitions mid-stream disrupts key-based ordering guarantees. This makes the initial partition count one of the most consequential — and most under-thought — choices in any Kafka deployment.

Rules I follow:

  • Start with max expected consumer parallelism, then add 20%. If you plan to run 12 consumer instances, start with 15 partitions, not 12.
  • Never set partitions below the planned consumer count. Idle consumers waste memory and thread pool slots without adding throughput.
  • Round to a factor-friendly number. 12, 24, 48, 60 — numbers divisible by 2, 3, 4, 6 give you flexibility later.
  • High-cardinality key topics need more partitions. Topics keyed by userId in a multi-million-user system need far more partitions than topics keyed by accountType with 5 possible values.

In a healthcare claims pipeline I built, we started with 6 partitions on an EDI 837 ingest topic. Twelve months later, claim volume tripled. We needed 18 consumer instances for throughput but were capped at 6 due to partition count. The migration to 24 partitions took three weeks of carefully managed dual-read/write infrastructure. Starting at 24 would have cost nothing.
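The ordering disruption is easy to see with a toy model of key-based partitioning. Kafka's DefaultPartitioner actually uses murmur2 over the key bytes; plain `hashCode()` is used here only for illustration, but the remapping effect is the same: a key's partition is hash mod partition count, so changing the count moves most keys — and messages for the same claim ID land on a different partition than their earlier siblings.

```java
// Toy model of key-based partitioning: hash(key) mod numPartitions.
// (Kafka really uses murmur2 over the serialized key, but the
// remapping effect when the partition count changes is identical.)
public class PartitionRemap {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int moved = 0;
        for (int i = 0; i < 1_000; i++) {
            String claimId = "claim-" + i;
            if (partitionFor(claimId, 6) != partitionFor(claimId, 24)) {
                moved++; // this key's per-partition ordering line is broken
            }
        }
        System.out.println(moved + " of 1000 keys change partition going 6 -> 24");
    }
}
```

Roughly three quarters of keys move in this model — which is why "just add partitions later" is not a free operation on a keyed topic.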

Lesson 3: Schema Registry Is a First-Class Citizen, Not an Afterthought

I have seen teams run Kafka for months with raw JSON strings, no schema enforcement, and no contract between producers and consumers. Then one producer silently renames a field. Three downstream consumers break. The incident spans six hours. The post-mortem says "we need better documentation."

What they actually need is Schema Registry with subject compatibility rules enforced at produce time. Confluent Schema Registry with Avro or Protobuf schemas gives you:

  • Schema evolution with backward/forward compatibility checks — a producer cannot publish a breaking change without an explicit compatibility mode update
  • Compact wire format — Avro binary is significantly smaller than JSON for high-throughput topics
  • Schema-as-code — your .avsc files live in Git; schema history is auditable
  • Generated POJOs — the Avro Maven plugin generates your event DTOs automatically

# application.yml: Kafka producer with Schema Registry
spring:
  kafka:
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
      properties:
        schema.registry.url: http://schema-registry:8081
        auto.register.schemas: false   # NEVER true in production
        use.latest.version: true

Setting auto.register.schemas: false in production is not optional. If you let producers auto-register schemas, any developer can accidentally push an incompatible schema and break consumers. Treat schema registration as a deployment artifact, gated by CI/CD.
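For illustration, a minimal hypothetical .avsc for the claim-events topic — the field names and namespace are invented, but the shape is what compatibility checks care about: a namespaced record, and a default on the optional field so adding it remains a backward-compatible change.

```json
{
  "type": "record",
  "name": "ClaimEvent",
  "namespace": "com.example.rcm.events",
  "fields": [
    { "name": "claimId", "type": "string" },
    { "name": "payerId", "type": "string" },
    { "name": "amountCents", "type": "long" },
    { "name": "status", "type": { "type": "enum", "name": "ClaimStatus",
        "symbols": ["RECEIVED", "ADJUDICATED", "PAID", "DENIED"] } },
    { "name": "submittedAt", "type": ["null", "string"], "default": null }
  ]
}
```

Renaming claimId in a later version would fail the backward-compatibility check at registration time — which is exactly the six-hour incident above, caught in CI instead.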

Lesson 4: Dead-Letter Topics Are Where You Discover Your Real Error-Handling Strategy

A dead-letter topic (DLT) is not a trash can. It is a signal queue. Every message that lands in your DLT is a bug, a data contract violation, or a capacity problem you haven’t solved yet.

Three DLT patterns I have shipped in production:

Pattern 1 — Single DLT per topic: Simple. claim-events.DLT alongside claim-events. Works for low-volume topics where you manually inspect failures.

Pattern 2 — Retry topics + DLT: claim-events.RETRY-1 to claim-events.RETRY-3 to claim-events.DLT. Spring Kafka's @RetryableTopic (built on DeadLetterPublishingRecoverer) handles this out of the box, with exponential back-off between retry attempts and without blocking the main partition.

Pattern 3 — DLT with replay capability: Store DLT messages with full context headers (original topic, original offset, exception class, stack trace, timestamp). Build an ops tool that re-publishes selected DLT messages back to the original topic after a fix deploys. This is the pattern that saves you during production incidents.

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // routes to <topic>.DLT, same partition — this assumes the DLT was
    // created with at least as many partitions as the source topic
    var recoverer = new DeadLetterPublishingRecoverer(template,
        (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));
    // the recoverer also stamps kafka_dlt-* headers (original topic,
    // partition, offset, exception class and message) — the context
    // Pattern 3's replay tooling depends on

    // exponential back-off: 1s, 2s, 4s — max 3 retries
    var backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);
    backOff.setMultiplier(2.0);

    return new DefaultErrorHandler(recoverer, backOff);
}

Lesson 5: Kafka + Kubernetes — The Gotchas Nobody Warns You About

Running Kafka consumers on Kubernetes is mostly fine. Running Kafka brokers on Kubernetes is where things get interesting.

StatefulSet pod identity matters for broker configuration. Kafka brokers identify themselves by broker.id. When a Kubernetes pod restarts, it keeps its StatefulSet ordinal (kafka-0, kafka-1, kafka-2), which maps to broker ID. If you use Deployments instead of StatefulSets, broker IDs become non-deterministic and your cluster enters split-brain scenarios. Always StatefulSet for brokers.

PersistentVolume reclaim policy must be Retain. If a pod gets evicted and its PVC is deleted with the Delete reclaim policy, you have just lost a broker’s log directory. Set persistentVolumeReclaimPolicy: Retain and manage PV lifecycle manually.
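The reclaim policy is set on the StorageClass (or patched onto individual PVs after provisioning). A sketch of the StorageClass side — the class name is a placeholder and the provisioner is an assumption; swap in whatever CSI driver your cluster runs:

```yaml
# StorageClass for broker volumes: Retain, so deleting a PVC
# never takes the broker's log directory with it
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-broker-storage   # placeholder name
provisioner: ebs.csi.aws.com   # assumption: replace with your CSI driver
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

With Retain, a released PV must be cleaned up and rebound manually — that manual step is the point: losing a broker's log directory should never be a side effect of an eviction.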

Consumer group rebalancing amplifies pod disruption. Every rolling deployment triggers a consumer group rebalance. During rebalance, partitions are unassigned and no messages are consumed. For high-throughput topics, a 30-second rebalance window is a real processing gap. Solutions: cooperative sticky rebalancing (partition.assignment.strategy=CooperativeStickyAssignor), or static membership (group.instance.id) to skip rebalance on reconnects within session.timeout.ms.

Resource limits kill brokers silently. Kafka brokers are JVM processes with large heaps. If your Kubernetes resource limits are set too low, the OOM killer terminates the broker JVM without a clean shutdown, leaving dirty partition logs that require recovery on restart. Size your limits generously and monitor kafka_server_BrokerTopicMetrics_MessagesInPerSec in Grafana.

Lesson 6: Monitoring Is Not Optional — The Metrics That Actually Matter

Kafka exposes hundreds of JMX metrics. Here are the five that matter most:

  • records-lag-max — consumer lag by partition. Alert when this exceeds 10,000 sustained.
  • UnderReplicatedPartitions — if non-zero, a broker is struggling. Alert immediately.
  • Produce TotalTimeMs p99 — p99 produce latency. Spikes indicate broker I/O pressure.
  • MessagesInPerSec — ingestion rate trend. Sudden drops mean producer-side issues.
  • rebalance-latency-avg — how long rebalances are taking. Sustained rebalances indicate consumer instability.

Set these up in Grafana before you go to production. Not after the first 3 a.m. incident.
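As a concrete starting point, a hedged Prometheus alerting rule for the lag threshold above — the metric name kafka_consumergroup_lag is what the commonly used kafka_exporter emits, so adjust it to whatever your exporter actually exposes:

```yaml
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagHigh
        # metric name assumes the kafka_exporter (danielqsj) naming
        expr: max by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 10m   # "sustained" — not a momentary spike during a rebalance
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lagging on {{ $labels.topic }}"
```

The `for: 10m` clause matters: lag spikes briefly during every deployment rebalance, and an alert that pages on those teaches the on-call to ignore it.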

Bonus: Using Claude Code to Scaffold Kafka Boilerplate

Since I started using Claude Code in my daily workflow, Kafka setup code has become significantly faster to write. A prompt like:

Generate a Spring Boot 3 / Spring Kafka consumer for topic payment-events using manual offset acknowledgment, exponential back-off retry (3 retries), dead-letter topic routing, and Avro deserialization with Schema Registry at localhost:8081. Include the error handler bean and application.yml configuration.

…produces a complete, correct implementation in seconds — consumer listener, error handler, DLT configuration, and YAML — that would take 20–30 minutes to assemble from documentation and Stack Overflow.

The critical caveat: Claude Code generates correct boilerplate, but it does not understand your specific system’s throughput requirements, your Schema Registry subject naming convention, your partition count rationale, or your DLT replay strategy. The code it generates is a correct starting point — not a finished solution. Apply Lessons 1–6 above to make it production-grade.

For test generation in particular, I use it to produce consumer unit tests with EmbeddedKafkaBroker — a setup that used to take 45 minutes to get right now takes under 5.


Frequently Asked Questions

How many partitions should a Kafka topic have?

Start with the maximum number of consumer instances you plan to run, then add 20–30% headroom. For high-throughput topics (millions of messages/day), start at 24 or 48. For low-throughput internal event topics, 3–6 is fine. Increasing partitions later disrupts key-based ordering, so plan high initially.

What is the difference between a Kafka consumer group and a consumer instance?

A consumer group is a logical unit identified by a group ID. Kafka guarantees each partition is assigned to exactly one consumer instance within a group at a time. Multiple groups can all consume the same topic independently (fan-out pattern). Within a group, parallelism is limited by partition count — adding more consumer instances than partitions does nothing.

What is a Kafka dead-letter topic and when should I use one?

A dead-letter topic (DLT) receives messages your consumer could not process after all retry attempts. Use DLTs whenever you have non-retryable errors (malformed data, schema violations, business logic rejections). Never silently drop failed messages — route them to a DLT so you can audit, alert, and replay after a fix deploys.

Should I use self-managed Kafka on Kubernetes or a managed service like Confluent Cloud or AWS MSK?

For most teams: use a managed service. Self-managed Kafka on Kubernetes is operationally expensive — broker lifecycle management, PersistentVolume management, and rebalancing tuning all require sustained expertise. Confluent Cloud and AWS MSK eliminate 80% of that overhead. Only self-manage if you have strict data residency requirements or a dedicated platform team with deep Kafka expertise.

What is Schema Registry and do I need it?

Schema Registry stores and validates Avro/Protobuf/JSON Schema schemas for Kafka topics. It enforces compatibility rules between schema versions, preventing producers from publishing breaking changes. If you have more than one team producing or consuming events from the same topics, Schema Registry is not optional — it is the contract layer of your event-driven architecture.


The Bottom Line

Kafka is a mature, battle-tested technology. It handles extraordinary scale when configured correctly, and it fails in deeply confusing ways when not. The lessons above are not theoretical — each one is sourced from a real production incident or a real architectural decision I have made and later reconsidered.

If you are starting a new Kafka-based system: treat offset management, partition strategy, Schema Registry, and DLT as architecture decisions — not implementation details. The cost of fixing them after you are in production is orders of magnitude higher than getting them right upfront.

If you are using Claude Code or GitHub Copilot to scaffold your Kafka infrastructure: the generated code is a correct starting point. You need the mental model from Lessons 1–6 to evaluate, adapt, and safely operate what the AI generates. The AI knows syntax; you need to know the system.
