Key Takeaways
- Consumer group semantics are not optional knowledge — misunderstand them and you will lose messages in production
- Partition count is a decision you can almost never reverse; overestimate deliberately
- Schema Registry is not a nice-to-have — it is the contract layer of your event-driven system
- Dead-letter topics are where you discover your real error handling strategy
- Kafka on Kubernetes needs stateful-set awareness that most blog posts skip entirely
- In the Agentic SDLC era: a complete, correct Kafka consumer with DLT, retry, and Schema Registry takes 5 minutes to scaffold. The hard part was never the boilerplate — it was always the reasoning about partitions, offsets, and exactly-once semantics. AI can’t do that for you yet.
I have been building event-driven systems with Apache Kafka since 2018. Across healthcare claims processing, telecom network orchestration, financial trade data pipelines, and workforce management platforms, Kafka has been the backbone. The official docs are excellent. The Confluent tutorials are solid. But none of them will tell you the things that actually hurt in production at scale.
This post collects those hard-won lessons — and compares how Kafka development looked before agentic coding tools versus how it works in 2026.
Lesson 1: Consumer Group Offset Management — When At-Least-Once Is a Lie You’re Telling Yourself
Every Kafka tutorial says: Kafka guarantees at-least-once delivery. What they quietly skip is the part where you are responsible for making that guarantee actually hold.
The most common mistake: committing offsets before processing is complete. Here is the anti-pattern in Spring Kafka:
// DANGEROUS: with auto-commit enabled, the offset can be committed
// before the downstream write completes
@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event) {
    claimService.persist(event); // if this throws after the commit, the message is LOST
}
The fix: use AckMode.MANUAL_IMMEDIATE and commit only after your downstream operation succeeds:
@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event, Acknowledgment ack) {
    try {
        claimService.persist(event);
        ack.acknowledge();
    } catch (RetryableException e) {
        throw e; // do NOT ack: let Kafka redeliver
    } catch (NonRetryableException e) {
        deadLetterRouter.send(event, e); // route to DLT first...
        ack.acknowledge();               // ...then ack, so a failed DLT send is not lost
    }
}
The deeper lesson: decide your delivery guarantee before you write any code. At-least-once, at-most-once, and effectively-once require fundamentally different implementations. Effectively-once (idempotent consumers + transactional producers) is the hardest — and the one most teams think they don’t need until they do.
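To make "effectively-once" concrete, here is a minimal sketch of the idempotent-consumer half of that guarantee, assuming every event carries a unique ID. The class name and in-memory set are illustrative only; in production the "seen" set would be a durable store such as a database unique constraint.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent-consumer sketch: processing is skipped when the event ID
// has been seen before, so at-least-once redeliveries become harmless.
public class IdempotentHandler {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the event was processed, false if it was a duplicate. */
    public boolean handle(String eventId, Runnable businessLogic) {
        // Set.add() is atomic here: only the first delivery of an ID wins
        if (!processed.add(eventId)) {
            return false; // duplicate redelivery: safe to ack and move on
        }
        businessLogic.run();
        return true;
    }
}
```

Pair this with transactional producers on the write side and redelivery stops being scary: the duplicate simply no-ops.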
Lesson 2: Partition Count Is the Decision You’ll Regret (Or Thank Yourself For) in 18 Months
Kafka does not let you easily decrease partitions. Increasing partitions mid-stream disrupts key-based ordering guarantees. This makes partition count one of the most consequential and most under-thought choices in any Kafka deployment.
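The ordering disruption is easy to see from how the default partitioner works: a key maps to `hash(key) mod partitionCount`, so changing the count can move a key to a different partition. The sketch below uses `String.hashCode` for illustration; the real producer uses murmur2.

```java
// Why growing partition count breaks key-based ordering: the same key
// can land on a different partition once the modulus changes.
// (Kafka's default partitioner uses murmur2; hashCode() is a stand-in.)
public class PartitionDrift {
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        String key = "user-42";
        System.out.println("6 partitions  -> " + partitionFor(key, 6));
        System.out.println("12 partitions -> " + partitionFor(key, 12));
        // If the two differ, messages for user-42 produced after the
        // resize are no longer ordered relative to earlier ones.
    }
}
```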
Rules I follow:
- Start with max expected consumer parallelism, then add at least 20%. If you plan to run 12 consumer instances, that means 15 or more partitions.
- Never set partitions below the planned consumer count. Idle consumers waste memory without adding throughput.
- Round to a factor-friendly number. 12, 24, 48 — numbers divisible by 2, 3, 4, 6 give you flexibility later.
- High-cardinality key topics need more partitions. Topics keyed by userId in a multi-million-user system need far more than topics keyed by accountType with 5 possible values.
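One way to combine the headroom rule with the factor-friendly rule is a small sizing helper. This is a sketch of one possible interpretation (rounding up to a multiple of 6), not a universal formula:

```java
// Sizing sketch: planned parallelism + ~20% headroom, rounded up to a
// multiple of 6 so 2, 3, and 6 consumer counts all divide evenly.
public class PartitionSizing {
    static int suggestPartitions(int plannedConsumers) {
        int withHeadroom = (int) Math.ceil(plannedConsumers * 1.2);
        return ((withHeadroom + 5) / 6) * 6; // round up to next multiple of 6
    }
}
```

For 12 planned consumers this yields 18 partitions; for 20 it yields 24. Treat the output as a floor, then sanity-check it against projected throughput and key cardinality.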
In a healthcare claims pipeline I built, we started with 6 partitions on an EDI 837 ingest topic. Twelve months later, claim volume tripled. We needed 18 consumer instances but were capped at 6. The migration took three weeks. Starting at 24 would have cost nothing.
Lesson 3: Schema Registry Is a First-Class Citizen, Not an Afterthought
I have seen teams run Kafka for months with raw JSON strings, no schema enforcement, and no contract between producers and consumers. Then one producer silently renames a field. Three downstream consumers break. Six-hour incident. Post-mortem says “we need better documentation.”
What they actually need is Schema Registry with subject compatibility rules enforced at produce time:
spring:
  kafka:
    producer:
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
      properties:
        schema.registry.url: http://schema-registry:8081
        auto.register.schemas: false  # NEVER true in production
        use.latest.version: true
Setting auto.register.schemas: false in production is not optional. Treat schema registration as a deployment artifact, gated by CI/CD.
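As a sketch of what that CI/CD gate can look like, here is a shell step using Schema Registry's REST API: check compatibility first, register only if the check passes. The schema file path and the `claim-events-value` subject name are assumptions for this example.

```shell
# CI step sketch: gate registration behind a compatibility check.
# Assumes SCHEMA_REGISTRY_URL is set and the Avro schema lives at the path below.
SCHEMA=$(jq -Rs '{schema: .}' < src/main/avro/claim-event.avsc)

# 1. Fail the build if the new schema is incompatible with the latest version
curl -fsS -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d "$SCHEMA" \
  "$SCHEMA_REGISTRY_URL/compatibility/subjects/claim-events-value/versions/latest" \
  | jq -e '.is_compatible == true' > /dev/null

# 2. Register only after the compatibility gate passes
curl -fsS -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d "$SCHEMA" \
  "$SCHEMA_REGISTRY_URL/subjects/claim-events-value/versions"
```

The `jq -e` step matters: the compatibility endpoint returns HTTP 200 even when the schema is incompatible, so you must inspect `is_compatible` rather than rely on the status code.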
Lesson 4: Dead-Letter Topics Are Where You Discover Your Real Error-Handling Strategy
A DLT is not a trash can. It is a signal queue. Every message that lands there is a bug, a data contract violation, or a capacity problem you haven’t solved yet.
The pattern I use in production — retry topics + replay-capable DLT:
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    var recoverer = new DeadLetterPublishingRecoverer(template,
        (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));
    var backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);
    backOff.setMultiplier(2.0);
    return new DefaultErrorHandler(recoverer, backOff);
}
Store DLT messages with full context headers (original topic, original offset, exception class, stack trace, timestamp). Build an ops tool that re-publishes selected DLT messages after a fix deploys. This is the pattern that saves you during production incidents.
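The context headers can be sketched in plain Java. Spring's `DeadLetterPublishingRecoverer` already attaches similar `kafka_dlt-*` headers automatically; this sketch (with hypothetical header names) just shows the shape an ops/replay tool needs to consume.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the context a DLT record should carry so an ops tool can
// triage failures and replay them against the original topic.
public class DltContext {
    static Map<String, String> headersFor(String topic, int partition,
                                          long offset, Exception cause) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("dlt-original-topic", topic);
        h.put("dlt-original-partition", String.valueOf(partition));
        h.put("dlt-original-offset", String.valueOf(offset));
        h.put("dlt-exception-class", cause.getClass().getName());
        h.put("dlt-exception-message", String.valueOf(cause.getMessage()));
        h.put("dlt-failed-at", Instant.now().toString());
        return h;
    }
}
```

The original topic/partition/offset triple is what makes replay deterministic: you can verify the fix reprocesses exactly the records that failed.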
Lesson 5: Kafka + Kubernetes — The Gotchas Nobody Warns You About
StatefulSet pod identity matters for broker configuration. Kafka brokers identify themselves by broker.id. If you use Deployments instead of StatefulSets, broker IDs become non-deterministic. Always StatefulSet for brokers.
PersistentVolume reclaim policy must be Retain. If a pod gets evicted and its PVC is deleted with the Delete reclaim policy, you have just lost a broker’s log directory.
Consumer group rebalancing amplifies pod disruption. Every rolling deployment triggers a rebalance. Solutions: cooperative sticky rebalancing (partition.assignment.strategy=CooperativeStickyAssignor), or static membership (group.instance.id) to skip rebalance on reconnects within session.timeout.ms.
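Both mitigations can be expressed in Spring Boot YAML. A sketch, assuming `HOSTNAME` resolves to the StatefulSet pod name (which gives each pod a stable, unique instance ID):

```yaml
spring:
  kafka:
    consumer:
      properties:
        # Option A: incremental rebalances instead of stop-the-world
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        # Option B: static membership -- a pod restart that reconnects
        # within session.timeout.ms does not trigger a rebalance at all
        group.instance.id: ${HOSTNAME}
        session.timeout.ms: 60000
```

With static membership, size `session.timeout.ms` above your worst-case pod restart time, or the broker will evict the member and rebalance anyway.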
Resource limits kill brokers silently. Size Kafka broker resource limits generously and monitor kafka_server_BrokerTopicMetrics_MessagesInPerSec in Grafana.
Lesson 6: The Metrics That Actually Matter
- records-lag-max — the maximum consumer lag across a client’s assigned partitions. Alert when sustained above 10,000.
- UnderReplicatedPartitions — if non-zero, alert immediately.
- Produce TotalTimeMs p99 — spikes indicate broker I/O pressure.
- MessagesInPerSec — sudden drops usually point to producer-side issues.
- rebalance-latency-avg — sustained rebalances indicate consumer instability.
Set these up in Grafana before you go to production. Not after the first 3 a.m. incident.
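For the two highest-signal metrics above, a Prometheus alerting-rule sketch. The metric names depend on your JMX exporter / Micrometer mapping, so treat them as placeholders to adjust:

```yaml
groups:
  - name: kafka-core
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # metric name varies by exporter mapping -- adjust to yours
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 1m
        labels: {severity: page}
        annotations:
          summary: "Broker has under-replicated partitions"
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_records_lag_max > 10000
        for: 10m
        labels: {severity: warn}
        annotations:
          summary: "Sustained consumer lag above 10k records"
```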
Kafka Development in the Agentic SDLC: Before and After
The lessons above haven’t changed. What has changed is the cost of the mechanical work that surrounds them.
Before (Pre-2024): What Kafka Setup Actually Took
Setting up a new Spring Kafka consumer service with production-grade configuration wasn’t hard. It was just slow and repetitive. Every engineer who built more than two Kafka consumers from scratch knows the ritual:
Pre-2024: Time to scaffold a production-grade Kafka consumer microservice
| Task | Time |
| --- | --- |
| KafkaListener + manual ACK + error handler bean | ~20 min |
| DLT routing + exponential back-off retry config | ~20 min |
| Avro schema + Schema Registry serializer config | ~25 min |
| Unit tests with EmbeddedKafkaBroker | ~45 min |
| application.yml for all environments + Dockerfile | ~15 min |
| Total ceremony before writing one line of business logic | ~2 hours |
And this assumes you already knew the right patterns. A junior engineer or someone new to a stack spending that 2 hours would often get the AckMode wrong, miss the auto.register.schemas: false flag, or write a DLT handler that doesn’t preserve the original offset. The ceremony was not just slow — it was a surface area for subtle mistakes.
After (Agentic SDLC): The Same Setup in 5 Minutes
Here is the prompt I use today:
Generate a production-grade Spring Boot 3 / Spring Kafka consumer for topic
"claim-events" with:
- Manual offset acknowledgment (AckMode.MANUAL_IMMEDIATE)
- ExponentialBackOffWithMaxRetries(3) with 1s initial interval
- DeadLetterPublishingRecoverer routing to claim-events.DLT
- Avro deserialization with Schema Registry at ${SCHEMA_REGISTRY_URL}
- auto.register.schemas=false
- EmbeddedKafkaBroker unit tests for: happy path, RetryableException
(offset NOT committed), NonRetryableException (routed to DLT)
- application.yml with dev/staging/prod profiles
- Dockerfile using eclipse-temurin:21-jre
Claude Code generates all of it: listener, error handler bean, serializer config, YAML, Dockerfile, and three unit tests — in under 5 minutes. The output is production-correct because the prompt specifies the constraints. No manual cross-referencing of Spring Kafka docs, no copy-paste from a previous project.
What AI Still Cannot Do With Kafka
The boilerplate is solved. The hard parts remain hard:
- Partition count decision: AI doesn’t know your projected message throughput, key cardinality, or consumer scaling plan. It will suggest a number. That number may be wrong for your system. You need to reason about this from your own domain knowledge.
- Exactly-once semantics design: Whether your system actually requires idempotent producers + transactional consumers depends on business requirements and acceptable failure modes. AI can explain the options; it cannot determine which one your use case needs.
- DLT replay strategy: How you replay DLT messages after a fix, in what order, at what rate, with what validation — this is a production operations design question that depends on your downstream systems’ behavior under load.
- Rebalancing strategy under your specific deployment pattern: Whether cooperative sticky or static membership is right depends on your pod restart frequency and session timeout tolerances. AI can describe both; you have to model the tradeoff for your system.
The agentic SDLC saved the 2 hours of ceremony. It did not replace the 11 years of knowing what can go wrong at 3 a.m.
Frequently Asked Questions
How many partitions should a Kafka topic have?
Start with the maximum number of consumer instances you plan to run, then add 20–30% headroom. For high-throughput topics (millions of messages/day), start at 24 or 48. For low-throughput internal event topics, 3–6 is fine. Increasing partitions later disrupts key-based ordering, so plan high initially.
What is the difference between a Kafka consumer group and a consumer instance?
A consumer group is a logical unit identified by a group ID. Kafka guarantees each partition is assigned to exactly one consumer instance within a group at a time. Multiple groups can all consume the same topic independently (fan-out). Within a group, parallelism is limited by partition count — adding more consumer instances than partitions does nothing.
What is a Kafka dead-letter topic and when should I use one?
A DLT receives messages your consumer could not process after all retry attempts. Use DLTs whenever you have non-retryable errors (malformed data, schema violations, business logic rejections). Never silently drop failed messages — route them to a DLT so you can audit, alert, and replay after a fix deploys.
Should I use self-managed Kafka on Kubernetes or a managed service?
For most teams: use a managed service (Confluent Cloud or AWS MSK). Self-managed Kafka on Kubernetes is operationally expensive — broker lifecycle, PersistentVolume management, and rebalancing tuning require sustained expertise. Only self-manage if you have strict data residency requirements or a dedicated platform team.
Can Claude Code generate correct Kafka configuration?
Yes, for the boilerplate — consumer listener, error handler, DLT routing, Schema Registry config, and unit tests — if your prompt specifies the constraints precisely. It does not know your partition count rationale, your DLT replay strategy, or your exactly-once requirements. The generated code is a correct starting point. The architectural decisions in Lessons 1–6 are yours to make.
The Bottom Line
Kafka is a mature, battle-tested technology. It handles extraordinary scale when configured correctly, and it fails in deeply confusing ways when not.
In 2026, the agentic SDLC eliminates the 2-hour setup ceremony. A well-prompted Claude Code session generates production-correct consumer infrastructure in minutes. The knowledge required to understand what it generated, to decide whether the partition strategy fits your workload, and to diagnose a 3 a.m. consumer lag incident — that knowledge is still yours to build. The AI saved the time. It did not acquire the experience on your behalf.
More from this blog:
- The Agentic SDLC: How Claude Code and Codex Are Rewriting the Developer Workflow
- TDD + BDD in the Agentic SDLC Era: Before, After, and What Actually Changed
- Distributed Systems Meets Machine Learning: Notes from Georgia Tech OMSCS
Gaurav Pratap Singh is a Principal / Staff Software Engineer at UKG with 11 years in distributed systems and enterprise Java. He is pursuing an MS in CS at Georgia Tech (OMSCS — Computing Systems + ML).