Building Event-Driven Systems with Kafka: 11 Years of Lessons + the Agentic SDLC

By Gaurav Pratap Singh  ·  March 2026  ·  15 min read
Tags: Apache Kafka · Distributed Systems · Agentic SDLC · Java · Spring Boot

Key Takeaways

  • Consumer group semantics are not optional knowledge — misunderstand them and you will lose messages in production
  • Partition count is a decision you can almost never reverse; overestimate deliberately
  • Schema Registry is not a nice-to-have — it is the contract layer of your event-driven system
  • Dead-letter topics are where you discover your real error handling strategy
  • Kafka on Kubernetes needs stateful-set awareness that most blog posts skip entirely
  • In the Agentic SDLC era: a complete, correct Kafka consumer with DLT, retry, and Schema Registry takes 5 minutes to scaffold. The hard part was never the boilerplate — it was always the reasoning about partitions, offsets, and exactly-once semantics. AI can’t do that for you yet.

I have been building event-driven systems with Apache Kafka since 2018. Across healthcare claims processing, telecom network orchestration, financial trade data pipelines, and workforce management platforms, Kafka has been the backbone. The official docs are excellent. The Confluent tutorials are solid. But none of them will tell you the things that actually hurt in production at scale.

This post collects those hard-won lessons — and compares how Kafka development looked before agentic coding tools versus how it works in 2026.


Lesson 1: Consumer Group Offset Management — When At-Least-Once Is a Lie You’re Telling Yourself

Every Kafka tutorial says: Kafka guarantees at-least-once delivery. What they quietly skip is the part where you are responsible for making that guarantee actually hold.

The most common mistake: committing offsets before processing is complete. Here is the anti-pattern in Spring Kafka:

// DANGEROUS: with enable.auto.commit=true (or an error handler that swallows
// the exception), the offset can be committed before the downstream write completes
@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event) {
    claimService.persist(event); // if this throws after the offset is committed, the message is LOST
}

The fix: use AckMode.MANUAL_IMMEDIATE and commit only after your downstream operation succeeds:

@KafkaListener(topics = "claim-events", groupId = "rcm-processor")
public void onMessage(ClaimEvent event, Acknowledgment ack) {
    try {
        claimService.persist(event);
        ack.acknowledge();
    } catch (RetryableException e) {
        throw e; // do NOT ack: let Kafka redeliver
    } catch (NonRetryableException e) {
        deadLetterRouter.send(event, e); // route to DLT first...
        ack.acknowledge();               // ...then ack, so a failed DLT send is not lost
    }
}
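For this listener to receive an Acknowledgment at all, the container must be switched to manual acking. In Spring Boot that is a one-line property (a sketch, assuming Spring Boot's spring.kafka auto-configuration):

```yaml
spring:
  kafka:
    listener:
      ack-mode: manual_immediate   # commit each offset only when ack.acknowledge() is called
```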

The deeper lesson: decide your delivery guarantee before you write any code. At-least-once, at-most-once, and effectively-once require fundamentally different implementations. Effectively-once (idempotent consumers + transactional producers) is the hardest — and the one most teams think they don’t need until they do.
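For reference, the Spring Boot configuration for the effectively-once building blocks looks roughly like this (a sketch: the transaction-id-prefix value is a placeholder, and you still need idempotent consumer logic on top of it):

```yaml
spring:
  kafka:
    producer:
      transaction-id-prefix: claims-tx-  # enables the transactional producer (implies idempotence)
    consumer:
      isolation-level: read_committed    # consumers see only committed transactional writes
```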

Lesson 2: Partition Count Is the Decision You’ll Regret (Or Thank Yourself For) in 18 Months

Kafka does not let you easily decrease partitions. Increasing partitions mid-stream disrupts key-based ordering guarantees. This makes partition count one of the most consequential and most under-thought choices in any Kafka deployment.

Rules I follow:

  • Start with max expected consumer parallelism, then add 20–30%. If you plan to run 12 consumer instances, 15 is the arithmetic answer; round up to 16 to keep it factor-friendly.
  • Never set partitions below the planned consumer count. Idle consumers waste memory without adding throughput.
  • Round to a factor-friendly number. 12, 24, 48 — numbers divisible by 2, 3, 4, 6 give you flexibility later.
  • Key cardinality caps your usable partitions. Topics keyed by userId in a multi-million-user system can exploit many partitions; a topic keyed by accountType with 5 possible values can never spread load across more than 5 partitions, no matter how many you create.

In a healthcare claims pipeline I built, we started with 6 partitions on an EDI 837 ingest topic. Twelve months later, claim volume tripled. We needed 18 consumer instances but were capped at 6. The migration took three weeks. Starting at 24 would have cost nothing.
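The ordering disruption comes from the default key-to-partition mapping: partition = hash(key) mod partitionCount, so changing the count remaps keys. A minimal sketch of that arithmetic (Kafka's default partitioner actually runs murmur2 over the serialized key; plain floorMod on a precomputed hash is used here purely for illustration):

```java
public class PartitionMath {
    // Stand-in for Kafka's default partitioner: hash(key) mod numPartitions.
    static int partitionFor(int keyHash, int numPartitions) {
        return Math.floorMod(keyHash, numPartitions);
    }

    public static void main(String[] args) {
        int keyHash = 7; // pretend this is the murmur2 hash of some claim key
        // With 6 partitions the key lands on partition 1...
        System.out.println(partitionFor(keyHash, 6));  // prints 1
        // ...after expanding to 12 partitions the same key moves to partition 7,
        // so new events for that key are no longer ordered behind its old events.
        System.out.println(partitionFor(keyHash, 12)); // prints 7
    }
}
```

Roughly half of all keys move when you double the partition count, which is why the safe moment to pick a big number is before the first keyed message is produced.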

Lesson 3: Schema Registry Is a First-Class Citizen, Not an Afterthought

I have seen teams run Kafka for months with raw JSON strings, no schema enforcement, and no contract between producers and consumers. Then one producer silently renames a field. Three downstream consumers break. Six-hour incident. Post-mortem says “we need better documentation.”

What they actually need is Schema Registry with subject compatibility rules enforced at produce time:

spring:
  kafka:
    producer:
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
      properties:
        schema.registry.url: http://schema-registry:8081
        auto.register.schemas: false   # NEVER true in production
        use.latest.version: true

Setting auto.register.schemas: false in production is not optional. Treat schema registration as a deployment artifact, gated by CI/CD.
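What a backward-compatible change looks like in practice: add the new field with a default, so records written before the change can still be read with the new schema. A hypothetical claim-event schema for illustration (names are placeholders), where payerId is the later addition:

```json
{
  "type": "record",
  "name": "ClaimEvent",
  "namespace": "com.example.claims",
  "fields": [
    {"name": "claimId", "type": "string"},
    {"name": "amount",  "type": "double"},
    {"name": "payerId", "type": ["null", "string"], "default": null}
  ]
}
```

Renaming or removing a field without a default is exactly the silent producer change that breaks downstream consumers; with BACKWARD compatibility enforced on the subject, the registry rejects it at registration time instead of at 2 a.m.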

Lesson 4: Dead-Letter Topics Are Where You Discover Your Real Error-Handling Strategy

A DLT is not a trash can. It is a signal queue. Every message that lands there is a bug, a data contract violation, or a capacity problem you haven’t solved yet.

The pattern I use in production — retry topics + replay-capable DLT:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    var recoverer = new DeadLetterPublishingRecoverer(template,
        (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));

    var backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);
    backOff.setMultiplier(2.0);

    return new DefaultErrorHandler(recoverer, backOff);
}

Store DLT messages with full context headers: original topic, original offset, exception class, stack trace, timestamp. Spring's DeadLetterPublishingRecoverer stamps most of these headers automatically; keep them, and add your own context (tenant, correlation ID) where you need it. Then build an ops tool that re-publishes selected DLT messages after a fix deploys. This is the pattern that saves you during production incidents.

Lesson 5: Kafka + Kubernetes — The Gotchas Nobody Warns You About

StatefulSet pod identity matters for broker configuration. Kafka brokers identify themselves by broker.id. If you use Deployments instead of StatefulSets, broker IDs become non-deterministic. Always StatefulSet for brokers.

PersistentVolume reclaim policy must be Retain. If a pod gets evicted and its PVC is deleted with the Delete reclaim policy, you have just lost a broker’s log directory.
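A StorageClass sketch that makes Retain the default for broker volumes (the name is a placeholder and the provisioner is cluster-specific):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-broker-storage        # placeholder name
provisioner: kubernetes.io/aws-ebs  # cluster-specific
reclaimPolicy: Retain               # PV survives PVC deletion; the broker log dir stays recoverable
volumeBindingMode: WaitForFirstConsumer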

Consumer group rebalancing amplifies pod disruption. Every rolling deployment triggers a rebalance. Solutions: cooperative sticky rebalancing (partition.assignment.strategy=CooperativeStickyAssignor), or static membership (group.instance.id) to skip rebalance on reconnects within session.timeout.ms.
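Both mitigations are plain consumer properties, passed through in Spring Boot via spring.kafka.consumer.properties (a sketch: group.instance.id must be stable per pod, e.g. injected from the StatefulSet pod name):

```yaml
spring:
  kafka:
    consumer:
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        group.instance.id: ${POD_NAME}   # static membership: stable per-pod identity
        session.timeout.ms: 45000        # reconnects inside this window skip the rebalance
```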

Resource limits kill brokers silently. Kafka leans heavily on the OS page cache, so a tight container memory limit degrades throughput long before anything shows up in application logs, and an OOMKill looks like a random broker restart. Size broker resource limits generously, watch for OOMKilled pod events, and monitor kafka_server_BrokerTopicMetrics_MessagesInPerSec in Grafana for sudden drops.

Lesson 6: The Metrics That Actually Matter

  • records-lag-max — consumer lag by partition. Alert when sustained above 10,000.
  • UnderReplicatedPartitions — if non-zero, alert immediately.
  • Produce TotalTimeMs p99 — spikes indicate broker I/O pressure.
  • MessagesInPerSec — sudden drops mean producer-side issues.
  • rebalance-latency-avg — sustained rebalances indicate consumer instability.

Set these up in Grafana before you go to production. Not after the first 3 a.m. incident.
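The consumer-lag alert as a Prometheus rule, sketched under the assumption that the metric is exported via Micrometer as kafka_consumer_fetch_manager_records_lag_max (the exact exported name depends on your exporter):

```yaml
groups:
  - name: kafka-consumer
    rules:
      - alert: KafkaConsumerLagHigh
        expr: max_over_time(kafka_consumer_fetch_manager_records_lag_max[10m]) > 10000
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Consumer lag sustained above 10k on {{ $labels.topic }}"
```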


Kafka Development in the Agentic SDLC: Before and After

The lessons above haven’t changed. What has changed is the cost of the mechanical work that surrounds them.

Before (Pre-2024): What Kafka Setup Actually Took

Setting up a new Spring Kafka consumer service with production-grade configuration wasn’t hard. It was just slow and repetitive. Every engineer who built more than two Kafka consumers from scratch knows the ritual:

Pre-2024: Time to scaffold a production-grade Kafka consumer microservice

  • KafkaListener + manual ACK + error handler bean: ~20 min
  • DLT routing + exponential back-off retry config: ~20 min
  • Avro schema + Schema Registry serializer config: ~25 min
  • Unit tests with EmbeddedKafkaBroker: ~45 min
  • application.yml for all environments + Dockerfile: ~15 min
  • Total ceremony before writing one line of business logic: ~2 hours

And this assumes you already knew the right patterns. A junior engineer or someone new to a stack spending that 2 hours would often get the AckMode wrong, miss the auto.register.schemas: false flag, or write a DLT handler that doesn’t preserve the original offset. The ceremony was not just slow — it was a surface area for subtle mistakes.

After (Agentic SDLC): The Same Setup in 5 Minutes

Here is the prompt I use today:

Generate a production-grade Spring Boot 3 / Spring Kafka consumer for topic
"claim-events" with:
- Manual offset acknowledgment (AckMode.MANUAL_IMMEDIATE)
- ExponentialBackOffWithMaxRetries(3) with 1s initial interval
- DeadLetterPublishingRecoverer routing to claim-events.DLT
- Avro deserialization with Schema Registry at ${SCHEMA_REGISTRY_URL}
- auto.register.schemas=false
- EmbeddedKafkaBroker unit tests for: happy path, RetryableException 
  (offset NOT committed), NonRetryableException (routed to DLT)
- application.yml with dev/staging/prod profiles
- Dockerfile using eclipse-temurin:21-jre

Claude Code generates all of it: listener, error handler bean, serializer config, YAML, Dockerfile, and three unit tests — in under 5 minutes. The output is production-correct because the prompt specifies the constraints. No manual cross-referencing of Spring Kafka docs, no copy-paste from a previous project.

The Before/After Comparison Table

Consumer scaffold
  Before (pre-2024): ~2 hrs manual. Common mistakes: wrong AckMode, missing error handler, DLT not preserving partition.
  After (agentic SDLC): <5 min with a spec prompt. Production-correct configuration if the prompt specifies the constraints.

EmbeddedKafka tests
  Before: ~45 min setup. The EmbeddedKafkaBroker config was the thing everyone copied from a previous project and hoped still worked.
  After: Generated alongside the consumer code. Tests cover retry paths, DLT routing, and offset-commit semantics correctly when prompted.

Debug consumer lag
  Before: Engineer reads Grafana metrics, correlates with app logs, hypothesizes a cause. An entirely manual loop.
  After: Paste lag metrics + recent logs into Claude Code. AI identifies candidate causes (slow downstream, rebalancing, GC pause). Still requires your system knowledge to confirm.

Avro schema design
  Before: Engineer writes the .avsc manually, checks compatibility by hand or after a failed CI run.
  After: AI drafts an Avro schema from a domain description and suggests a forward/backward-compatible evolution path. Still requires domain review for business field semantics.

Kubernetes manifests
  Before: Manual. The StatefulSet vs Deployment decision was often wrong for brokers. Resource limits were guessed.
  After: AI generates StatefulSet + PVC with Retain policy correctly when prompted. Resource sizing still requires your load knowledge.

Architecture reasoning
  Before: Partition strategy, offset semantics, DLT replay design required experience. Juniors got it wrong, often silently.
  After: AI can explain partition strategy tradeoffs. It cannot decide for you based on your specific throughput, key cardinality, and team size. This is still your job.

What AI Still Cannot Do With Kafka

The boilerplate is solved. The hard parts remain hard:

  • Partition count decision: AI doesn’t know your projected message throughput, key cardinality, or consumer scaling plan. It will suggest a number. That number may be wrong for your system. You need to reason about this from your own domain knowledge.
  • Exactly-once semantics design: Whether your system actually requires idempotent producers + transactional consumers depends on business requirements and acceptable failure modes. AI can explain the options; it cannot determine which one your use case needs.
  • DLT replay strategy: How you replay DLT messages after a fix, in what order, at what rate, with what validation — this is a production operations design question that depends on your downstream systems’ behavior under load.
  • Rebalancing strategy under your specific deployment pattern: Whether cooperative sticky or static membership is right depends on your pod restart frequency and session timeout tolerances. AI can describe both; you have to model the tradeoff for your system.

The agentic SDLC saved the 2 hours of ceremony. It did not replace the 11 years of knowing what can go wrong at 3 a.m.


Frequently Asked Questions

How many partitions should a Kafka topic have?

Start with the maximum number of consumer instances you plan to run, then add 20–30% headroom. For high-throughput topics (millions of messages/day), start at 24 or 48. For low-throughput internal event topics, 3–6 is fine. Increasing partitions later disrupts key-based ordering, so plan high initially.

What is the difference between a Kafka consumer group and a consumer instance?

A consumer group is a logical unit identified by a group ID. Kafka guarantees each partition is assigned to exactly one consumer instance within a group at a time. Multiple groups can all consume the same topic independently (fan-out). Within a group, parallelism is limited by partition count — adding more consumer instances than partitions does nothing.

What is a Kafka dead-letter topic and when should I use one?

A DLT receives messages your consumer could not process after all retry attempts. Use DLTs whenever you have non-retryable errors (malformed data, schema violations, business logic rejections). Never silently drop failed messages — route them to a DLT so you can audit, alert, and replay after a fix deploys.

Should I use self-managed Kafka on Kubernetes or a managed service?

For most teams: use a managed service (Confluent Cloud or AWS MSK). Self-managed Kafka on Kubernetes is operationally expensive — broker lifecycle, PersistentVolume management, and rebalancing tuning require sustained expertise. Only self-manage if you have strict data residency requirements or a dedicated platform team.

Can Claude Code generate correct Kafka configuration?

Yes, for the boilerplate — consumer listener, error handler, DLT routing, Schema Registry config, and unit tests — if your prompt specifies the constraints precisely. It does not know your partition count rationale, your DLT replay strategy, or your exactly-once requirements. The generated code is a correct starting point. The architectural decisions in Lessons 1–6 are yours to make.


The Bottom Line

Kafka is a mature, battle-tested technology. It handles extraordinary scale when configured correctly, and it fails in deeply confusing ways when not.

In 2026, the agentic SDLC eliminates the 2-hour setup ceremony. A well-prompted Claude Code session generates production-correct consumer infrastructure in minutes. The knowledge required to understand what it generated, to decide whether the partition strategy fits your workload, and to diagnose a 3 a.m. consumer lag incident — that knowledge is still yours to build. The AI saved the time. It did not acquire the experience on your behalf.

Gaurav Pratap Singh is a Principal / Staff Software Engineer at UKG with 11 years in distributed systems and enterprise Java. He is pursuing an MS in CS at Georgia Tech (OMSCS — Computing Systems + ML).