Key Takeaways
- Writing tests first is now faster than writing implementation first — the agentic SDLC inverted the time equation that used to make TDD feel like overhead.
- Before 2024, TDD was widely understood but inconsistently practiced. The bottleneck was never comprehension — it was the time cost of writing thorough tests under sprint pressure.
- Claude Code generates test scaffolding (including edge cases you would have skipped) in under a minute. Your job becomes domain review, not boilerplate authoring.
- BDD with Cucumber now starts from AI-drafted Gherkin scenarios that the BA refines — reversing the old bottom-up flow where developers wrote scenarios the BA never read.
- TDD fundamentals don’t change. The design pressure of test-first, the Red-Green-Refactor discipline, and the value of mutation testing are as relevant as ever — now just with a much lower entry cost.
I have been doing TDD since 2014. I have seen it dismissed as academic overhead, cargo-culted by teams chasing coverage metrics, done correctly by rare teams that treated it as a design discipline — and now, since 2024, I have watched the agentic SDLC change its economics entirely.
This post is a before/after account. Not a tutorial. Not a framework pitch. A practitioner’s honest comparison of how TDD and BDD actually worked in enterprise Java before AI coding agents became part of the workflow, and how they work now.
Part 1: How TDD Actually Worked Before (2014–2023)
The honest version, not the conference-talk version.
The Time Tax That Made TDD Politically Difficult
Every engineer who has tried to introduce TDD to a team under delivery pressure has encountered the same objection: we don’t have time to write tests first. This objection was never really about laziness. It was a rational response to a genuine cost.
Here is what writing comprehensive tests for a Spring Boot service actually took, pre-agentic tooling:
Pre-2024: Time to write tests for a PaymentService (one class, ~8 methods)
| Task | Time |
|---|---|
| Unit tests (JUnit + Mockito) for happy paths | ~25 min |
| Unit tests for error paths + edge cases | ~40 min |
| Spring slice test setup (@WebMvcTest / @DataJpaTest) | ~30 min |
| Cucumber BDD scenarios + step definitions | ~60 min |
| Total for one service class | ~2.5 hours |
For a feature sprint with 4–6 new service classes, that’s ten to fifteen hours of test writing that teams consistently deprioritized when sprint velocity was the metric being tracked. Even engineers who believed in TDD often wrote tests after the fact — not because they were lazy, but because the ratio of test-writing time to implementation time was unsustainable under deadline pressure.
The Coverage Theater Problem
The response to TDD resistance was often: enforce a coverage threshold. 70%, 80%, 90% line coverage as a CI gate. This created a worse problem than no coverage: coverage theater.
Engineers under coverage pressure wrote the fastest tests possible to hit the threshold — tests that exercised code paths without asserting anything meaningful about behavior. A test that calls paymentService.process(validPayment()) and asserts assertNotNull(result) contributes 100% line coverage to the service method and tells you absolutely nothing about whether it works correctly.
The coverage number looked good. The tests provided no safety net. Regressions still shipped.
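To make this concrete, here is a minimal sketch — with a hypothetical, deliberately broken `PaymentService` — showing how a coverage-theater check passes on a service that does nothing useful, while a behavioral assertion catches the bug:

```java
import java.math.BigDecimal;

// Hypothetical result type: status plus the amount actually settled.
class PaymentResult {
    final String status;
    final BigDecimal settledAmount;
    PaymentResult(String status, BigDecimal settledAmount) {
        this.status = status;
        this.settledAmount = settledAmount;
    }
}

// Deliberately broken service: it "processes" payments but never
// settles the amount. Every line still executes.
class PaymentService {
    PaymentResult process(BigDecimal amount) {
        return new PaymentResult("PROCESSED", BigDecimal.ZERO); // bug: ignores amount
    }
}

public class CoverageTheaterDemo {
    public static void main(String[] args) {
        PaymentResult result = new PaymentService().process(new BigDecimal("1500.00"));

        // Coverage-theater check: 100% line coverage, zero insight.
        boolean theaterPasses = result != null;

        // Behavioral check: compares the settled amount to the input.
        boolean behaviorPasses =
            new BigDecimal("1500.00").compareTo(result.settledAmount) == 0;

        System.out.println("theater=" + theaterPasses + " behavior=" + behaviorPasses);
        // prints "theater=true behavior=false"
    }
}
```

The not-null check is green against a service that settles nothing; only the behavioral assertion distinguishes a working implementation from a broken one.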
The BDD Adoption Failure Pattern
BDD with Cucumber had an even more consistent failure pattern. The theoretical workflow:
- Business analyst writes Gherkin scenarios in plain English
- Developer writes step definitions and implementation
- Everyone stays aligned on requirements
The actual workflow, at most companies:
- BA writes requirements in Confluence
- Developer writes implementation
- Developer writes Gherkin feature files that describe what they already built (not what was required)
- BA never reads the feature files
- Step definitions are written to call internal service methods directly, not observable API behavior
This produced Cucumber tests that were essentially JUnit tests in a more verbose syntax, with no collaboration benefit and significantly more maintenance overhead. Many teams abandoned Cucumber after a year because the cost didn’t justify the value — and they were right, given how they were using it.
Part 2: The Agentic SDLC Changes the Economics (2024–Present)
The 2024–2026 shift happened because of one thing: the time cost of writing comprehensive tests dropped by 80–90%. When the bottleneck disappears, the political objection disappears with it.
Test-First Is Now Faster Than Implementation-First
This is the inversion that matters. Before agentic tools, writing tests first added time to the development cycle. Now it often saves time, because the test spec gives the AI a precise target that produces cleaner implementation on the first pass.
Here is my current workflow for a new service class, using Claude Code:
Step 1 — Write the spec, not the implementation:
I need a PaymentService that:
- Processes PaymentEvent objects from a Kafka consumer
- Checks for duplicate processing using an idempotency key against PostgreSQL
- If not duplicate: updates AccountBalance, publishes BalanceUpdatedEvent to Kafka
- If duplicate: returns immediately (no-op, no exception)
- For malformed events (null amount, negative amount, missing accountId): throws NonRetryableException
- For transient failures (DB timeout, Kafka publish failure): throws RetryableException
Write ALL JUnit 5 + Mockito tests first.
Cover: happy path, duplicate detection, malformed events (3 variants), DB timeout, Kafka failure, null input.
Use @ExtendWith(MockitoExtension.class). Do not write the implementation yet.
Step 2 — Review and add domain knowledge:
Claude Code generates 12–15 test methods covering every case I specified plus several I didn’t explicitly mention (empty string accountId, zero-value amount, concurrent duplicate check). I review each test for domain correctness. In a payment system, I know that amounts of exactly 0.00 are valid for refund reversal events — the AI doesn’t. I add or modify those 1–2 domain-specific tests. The rest are correct as generated.
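The zero-amount rule is a one-line domain fact, but it is exactly the kind the generated suite misses. Here is a minimal, hypothetical sketch of that rule — the names are illustrative, not the generated code:

```java
import java.math.BigDecimal;

// Hypothetical encoding of the domain rule the AI could not infer from
// the prompt: negative amounts are malformed, but an amount of exactly
// 0.00 is a valid refund-reversal event and must NOT be rejected.
class PaymentAmountRule {
    static boolean isMalformed(BigDecimal amount) {
        return amount == null || amount.signum() < 0; // zero deliberately allowed
    }
}
```

The test I add pins the boundary: `isMalformed(new BigDecimal("0.00"))` must be false. Without that test, a later "fix" that tightens the check to `signum() <= 0` would pass every AI-generated test while silently breaking refund reversals.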
Step 3 — Generate implementation against passing tests:
Now implement PaymentService to make all tests pass.
Use Spring Data JPA for the repository, KafkaTemplate for publishing.
Add @Transactional where appropriate.
All tests must pass before you finish.
Claude Code writes the implementation, runs the tests, fixes failures, and returns when green. Total time: under 15 minutes for a service that previously took 2.5 hours to write with tests.
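For readers who want the shape on paper, here is a framework-free sketch of an implementation that satisfies the spec above. The repository and publisher interfaces are simplified stand-ins for Spring Data JPA and KafkaTemplate, and every name beyond those in the spec is illustrative:

```java
import java.math.BigDecimal;

class NonRetryableException extends RuntimeException {
    NonRetryableException(String message) { super(message); }
}

class RetryableException extends RuntimeException {
    RetryableException(String message, Throwable cause) { super(message, cause); }
}

record PaymentEvent(String accountId, BigDecimal amount, String idempotencyKey) {}

// Stand-in for the Spring Data repository.
interface PaymentRepository {
    boolean existsByIdempotencyKey(String key);
    void saveProcessed(PaymentEvent event);
}

// Stand-in for a KafkaTemplate wrapper.
interface BalanceEventPublisher {
    void publishBalanceUpdated(String accountId, BigDecimal amount);
}

class PaymentService {
    private final PaymentRepository repository;
    private final BalanceEventPublisher publisher;

    PaymentService(PaymentRepository repository, BalanceEventPublisher publisher) {
        this.repository = repository;
        this.publisher = publisher;
    }

    void process(PaymentEvent event) {
        // Malformed events fail fast: the consumer must not redeliver them.
        if (event == null || event.accountId() == null || event.accountId().isBlank()
                || event.amount() == null || event.amount().signum() < 0) {
            throw new NonRetryableException("malformed payment event");
        }
        // Duplicate delivery is a silent no-op.
        if (repository.existsByIdempotencyKey(event.idempotencyKey())) {
            return;
        }
        try {
            repository.saveProcessed(event);
            publisher.publishBalanceUpdated(event.accountId(), event.amount());
        } catch (RuntimeException transientFailure) {
            // Infrastructure failures (DB timeout, Kafka publish) surface as retryable.
            throw new RetryableException("transient payment failure", transientFailure);
        }
    }
}
```

The two exception types encode the one decision that matters for a Kafka consumer: whether redelivery can ever succeed. Bad data can’t be fixed by retrying; a flaky database can.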
2026: Time to write tests + implementation for a PaymentService
| Task | Time |
|---|---|
| Write spec prompt (replaces test writing) | ~3 min |
| AI generates tests; you review + add domain cases | ~5 min |
| AI generates implementation + runs tests to green | ~5 min |
| Spring slice test setup (generated) | ~2 min |
| Total for one service class | ~15 min |
BDD + Cucumber: How the Workflow Reversed
The biggest shift in BDD practice is that the workflow now runs top-down instead of bottom-up.
Before (bottom-up, developer-driven):
1. BA writes requirements in Confluence
2. Developer reads requirements, writes implementation
3. Developer writes Gherkin to describe what was built
4. Gherkin reflects implementation, not requirements
5. BA never reads it; step definitions call internal methods
6. BDD provides no collaboration value
After (top-down, spec-driven):
1. BA writes requirements (anywhere — Jira, Confluence, verbal)
2. Engineer prompts Claude Code:
"Given these requirements, write Cucumber Gherkin feature file
with scenarios for the happy path, edge cases, and error states"
3. AI generates draft Gherkin that both BA and engineer review together
4. BA corrects scenarios that don't match intent (this is the valuable alignment step)
5. Engineer prompts: "Generate step definitions that test via the REST API"
6. Implementation follows from the agreed scenarios
7. BDD now genuinely serves collaboration
The key insight: AI is good at generating syntactically correct Gherkin from requirements text. What it cannot do is know whether the scenarios match business intent — that’s the BA’s job. The AI draft makes the alignment conversation happen earlier and more concretely than a blank Confluence page ever did.
Here is an example of what Claude Code generates from a requirements prompt for a payment feature:
Feature: Payment Processing
  As a financial platform
  I want to process payment events reliably
  So that account balances reflect all transactions accurately

  Background:
    Given the payment service is running
    And account "ACC-123" has a balance of 0.00

  Scenario: Successful payment is applied to account balance
    Given a payment event for account "ACC-123" of amount 1500.00
    When the payment processor handles the event
    Then the account balance for "ACC-123" should be 1500.00
    And a balance-updated event should be published

  Scenario: Duplicate payment is rejected idempotently
    Given a payment event with idempotency key "PAY-456" has already been processed
    When a second payment event with idempotency key "PAY-456" arrives
    Then the account balance should remain unchanged
    And no duplicate event should be published

  Scenario: Payment with negative amount is rejected
    Given a payment event for account "ACC-123" of amount -50.00
    When the payment processor handles the event
    Then a NonRetryableException should be thrown
    And the account balance should remain unchanged

  Scenario: Payment fails due to database timeout and is retried
    Given the database is intermittently unavailable
    When a payment event for account "ACC-123" of amount 100.00 arrives
    Then a RetryableException should be thrown
    And the event should be eligible for redelivery
This draft takes 30 seconds to generate and replaces the 60–90 minutes engineers used to spend writing Gherkin from scratch. The BA can now review it, correct scenario 3 (“actually, negative amounts are valid for refund reversals up to the account balance”), and the alignment happens before a single line of implementation is written.
Part 3: What Didn’t Change (And Never Will)
Speed improvements don’t change the fundamentals. These principles matter as much in 2026 as they did in 2014 — maybe more, because AI makes it easier to violate them at scale.
TDD Is Still a Design Practice First
Writing tests before implementation is still the correct discipline, and not because it improves coverage. It improves design. A class that is hard to test is hard to test because it is poorly designed: too many dependencies, too many responsibilities, too much hidden state.
In the agentic era, this means: if you ask Claude Code to write tests for an existing class and the generated test setup is complicated — many mocks, complex fixtures, indirect access to state — that complexity is a signal that your design needs work. The feedback is the same as before; it just arrives faster, because you see the sprawling setup seconds after generation instead of an hour into writing it yourself.
Spring Test Slices: Still the Correct Architecture
The fundamental principle — load only what you need to test — is unchanged. What changed is that AI generates the correct slice configuration without you having to look up the docs:
// Test only the web layer
@WebMvcTest(PaymentController.class)
class PaymentControllerTest {

    @Autowired MockMvc mockMvc;
    @MockBean PaymentService paymentService;

    @Test
    void shouldReturn201ForValidPayment() throws Exception {
        given(paymentService.process(any()))
            .willReturn(PaymentResult.success("PAY-001"));

        mockMvc.perform(post("/api/payments")
                .contentType(APPLICATION_JSON)
                .content("""
                    { "accountId": "ACC-123", "amount": 1500.00,
                      "idempotencyKey": "PAY-001" }
                    """))
            .andExpect(status().isCreated())
            .andExpect(jsonPath("$.paymentId").value("PAY-001"));
    }

    @Test
    void shouldReturn400ForNegativeAmount() throws Exception {
        mockMvc.perform(post("/api/payments")
                .contentType(APPLICATION_JSON)
                .content("""
                    { "accountId": "ACC-123", "amount": -50.00,
                      "idempotencyKey": "PAY-002" }
                    """))
            .andExpect(status().isBadRequest())
            .andExpect(jsonPath("$.error").value("INVALID_AMOUNT"));
    }
}
// Test only the JPA layer
@DataJpaTest
class PaymentRepositoryTest {

    @Autowired PaymentRepository repository;

    @Test
    void shouldDetectDuplicateByIdempotencyKey() {
        repository.save(Payment.builder()
            .idempotencyKey("PAY-456")
            .accountId("ACC-123")
            .amount(BigDecimal.valueOf(1500.00))
            .status(PaymentStatus.PROCESSED)
            .build());

        assertThat(repository.existsByIdempotencyKey("PAY-456")).isTrue();
        assertThat(repository.existsByIdempotencyKey("PAY-999")).isFalse();
    }
}
Mutation Testing: More Important, Not Less
In the pre-agentic era, mutation testing (PIT for Java) was a nice-to-have that few teams ran regularly because the feedback loop was slow. In the agentic era, it is more important than ever — because AI-generated tests can produce high line coverage while missing semantically meaningful assertions.
PIT introduces deliberate modifications to your production code — changing > to >=, negating a boolean, replacing a return value with null — and checks whether your tests catch the change. A high mutation score means your tests detect incorrect behavior, not just execute the code.
<!-- pom.xml: Add PIT mutation testing -->
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.3</version>
  <configuration>
    <targetClasses>
      <param>com.example.payments.service.*</param>
    </targetClasses>
    <targetTests>
      <param>com.example.payments.service.*Test</param>
    </targetTests>
    <mutationThreshold>80</mutationThreshold>
    <coverageThreshold>90</coverageThreshold>
  </configuration>
</plugin>
In my current workflow, once Claude Code’s generated tests are green, I run PIT to validate their quality. If the mutation score is below 75%, I prompt Claude Code: These mutants survived. For each one, explain why the test didn’t catch it and add a test that would. The fixes are consistently accurate; the remaining human judgment is whether the added assertions reflect actual business requirements.
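Here is a concrete example of the kind of mutant this workflow closes, using the `>`-to-`>=` boundary mutation and a hypothetical $10,000 compliance rule (names and threshold are illustrative):

```java
import java.math.BigDecimal;

// Hypothetical rule: only payments strictly above $10,000 need a compliance check.
class ComplianceRule {
    static final BigDecimal LIMIT = new BigDecimal("10000.00");

    static boolean requiresComplianceCheck(BigDecimal amount) {
        // PIT's conditionals-boundary mutator turns this '>' into '>='.
        // Only a test at exactly 10000.00 distinguishes the two.
        return amount.compareTo(LIMIT) > 0;
    }
}
```

A suite that tests only 9,999.99 and 10,000.01 lets the `>=` mutant survive, because both operators agree on those inputs. Adding an assertion that exactly 10,000.00 does not require the check kills the mutant — and pins a real business boundary in the process.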
Part 4: New Anti-Patterns Introduced by Agentic TDD
The agentic SDLC introduces failure modes that didn’t exist before. These are the new ways TDD goes wrong in 2026.
Anti-Pattern 1: Trusting AI Coverage Without Domain Review
AI-generated tests cover the spec you gave them, not the requirements you meant to give them. If your prompt says “validate that amount is positive,” the AI correctly tests that negative amounts throw an exception. It has no way to know that in your domain, amounts of exactly zero are valid for refund reversals, or that amounts greater than $10,000 require a separate compliance check.
The fix: treat AI-generated tests as comprehensive scaffolding, not complete specification. Your domain expertise is what converts scaffolding into a meaningful safety net. Read every generated test as critically as you would read generated implementation.
Anti-Pattern 2: Spec-by-Prompt Instead of Spec-by-Conversation
The worst agentic TDD workflow I’ve seen: engineer gets a ticket from Jira, pastes the ticket description into Claude Code, gets tests back, gets implementation back, opens a PR. The BA never saw the Gherkin. The business intent was never validated against the AI interpretation.
AI-generated Gherkin from a Jira ticket reflects the quality of the ticket, not the quality of the requirement. If the ticket is ambiguous, the Gherkin will be too — just confidently so. The collaboration step — BA reviewing AI-drafted scenarios before implementation — is not optional.
Anti-Pattern 3: Skipping the Red Phase
A seductive shortcut: ask Claude Code to write tests AND implementation simultaneously. This eliminates the design pressure of TDD entirely. You lose the feedback loop that tells you your design is too complex to test — because the AI will write tests that accommodate any design.
Always run the tests first and confirm they fail before writing implementation. The failing test is not just a process step — it’s the moment you confirm the test actually tests what you think it tests.
Anti-Pattern 4: Over-Mocking in AI-Generated Tests
AI tends to mock aggressively — if a class has six collaborators, the generated tests will mock all six for every test, even when a real instance would be simpler and more meaningful. This produces tests that are tightly coupled to implementation details rather than behavior.
Review generated mock usage critically. Tests that mock the class under test’s own methods are always wrong. Tests that mock 5+ collaborators for a single assertion are usually a sign the class has too many responsibilities.
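A heuristic that catches most of this: collaborators that are pure and deterministic (calculators, mappers, validators) should usually be real instances, not mocks. A sketch with hypothetical names and a made-up 2.9% fee:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical pure collaborator: no I/O, deterministic — use the real
// thing in tests rather than a mock.
class FeeCalculator {
    BigDecimal feeFor(BigDecimal amount) {
        return amount.multiply(new BigDecimal("0.029"))
                     .setScale(2, RoundingMode.HALF_UP);
    }
}

class PaymentPricer {
    private final FeeCalculator fees;
    PaymentPricer(FeeCalculator fees) { this.fees = fees; }

    BigDecimal totalCharge(BigDecimal amount) {
        return amount.add(fees.feeFor(amount));
    }
}
```

Mocking `FeeCalculator` here would force the test to restate the fee math as stubbed return values, coupling the test to implementation detail; passing the real instance keeps the test about observable behavior — `totalCharge(100.00)` is `102.90` — with one less mock to maintain.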
My Prompt Library for TDD + BDD (March 2026)
These are the prompts I actually use, stored in our team’s shared Claude Code prompt library:
# Unit test generation (tests only, no implementation)
Write JUnit 5 + Mockito unit tests for [ClassName]. Behavior to test: [describe in plain English]. Cover: happy path, all explicit error cases, null inputs, boundary values, and any concurrency concerns. Use @ExtendWith(MockitoExtension.class). Assert on return values and side effects (verify() calls). Do NOT write the implementation.
# BDD feature file generation
Generate a Cucumber Gherkin feature file for: [feature description]. Include: one happy-path scenario, the top 3 error/edge scenarios, and any performance or concurrency scenarios relevant to this domain. Scenarios should describe observable behavior, not internal methods. Use Background: for shared setup. Keep step language business-readable.
# Mutation testing gap closure
PIT mutation testing shows these surviving mutants in [ClassName]: [paste PIT output] For each surviving mutant: 1. Explain in one sentence why the existing test didn't catch it 2. Write a new test that would kill this mutant 3. Flag any mutant where killing it would require testing an implementation detail (not a behavior) — skip those.
# Spring slice setup
Create a @WebMvcTest for [ControllerClass] covering: [list of endpoints and key scenarios] Use MockMvc + @MockBean for service dependencies. Test: 200/201 success cases, 400 validation failures, 404 not-found, and 500 upstream error handling. Match our existing test naming convention: shouldReturn[Status]When[Condition].
Frequently Asked Questions
Does AI-generated test code make TDD easier to skip?
Potentially, yes — if your team conflates “tests exist” with “TDD was practiced.” You can ask Claude Code to write implementation and tests simultaneously and get both in one step. This is not TDD. It eliminates the design feedback loop. The value of test-first comes from writing the failing test before the implementation and feeling the design pressure. Teams that use AI to generate tests alongside implementation are doing code coverage generation, not test-driven development.
What should I review in AI-generated tests?
Three things: (1) Domain correctness — does the test reflect actual business rules, not just syntactic interpretation of the prompt? (2) Assertion depth — does the test assert on meaningful behavior, not just “result is not null”? (3) Mock appropriateness — are the things being mocked actually external dependencies, or is the AI mocking the class under test’s own collaborators unnecessarily? Line coverage is not a useful review dimension.
How do I convince my team to adopt agentic TDD?
Run a live demo. Take a real upcoming service class from your backlog. Time yourself writing tests and implementation with Claude Code in front of the team. The combination of seeing the speed and seeing that the generated tests are actually comprehensive (not just happy-path) addresses both the “we don’t have time” and “AI tests aren’t good enough” objections simultaneously. One 15-minute demo is worth 10 architecture decision records about why TDD is important.
Do I still need BDD with AI tools available?
Yes — more than before. The value of BDD was never in Cucumber syntax or in the testing framework. It was in the conversation between business and engineering that Gherkin scenarios force. AI can generate Gherkin from requirements text, but it cannot validate whether the scenarios match business intent. The collaboration step — BA reviewing AI-drafted scenarios before implementation — is the highest-value part of BDD, and it’s faster now because you start from a draft rather than a blank page.
What is the right test pyramid ratio in 2026?
I use: 65% unit tests (JUnit + Mockito, fast, test one class in isolation), 25% integration/slice tests (@WebMvcTest, @DataJpaTest, TestContainers for real DB), 10% BDD acceptance tests (Cucumber, test observable behavior via API). The ratio matters less than the principle: each layer tests what the layer below can’t adequately cover. Unit tests can’t test HTTP routing; slice tests can. Slice tests can’t test business workflow across multiple services; BDD scenarios can.
The Bottom Line
The agentic SDLC did not change why TDD works. It changed whether it’s practical to do it consistently.
Before 2024, the failure mode of TDD was time pressure. Comprehensive tests took 2–3 hours per service class. Sprints were two weeks. The math didn’t always work out, and coverage theater filled the gap.
In 2026, the failure mode of TDD is review discipline. Tests take 15 minutes per service class. The time objection is gone. The new risk is that engineers treat AI-generated tests as complete rather than as scaffolding requiring domain review. The gap between “AI said the tests pass” and “these tests verify the system is correct for our domain” is where bugs will hide.
TDD is a design practice. Agentic tools accelerate the mechanical parts. The design thinking — what behavior should this system have, what are its invariants, what should it refuse to do — is still yours. The tools give you more time to focus on that thinking. Use it.
More from this blog:
- The Agentic SDLC: How Claude Code and Codex Are Rewriting the Developer Workflow
- Building Event-Driven Systems with Kafka: Lessons from 11 Years in the Field
- Distributed Systems Meets Machine Learning: Notes from Georgia Tech OMSCS
Gaurav Pratap Singh is a Principal / Staff Software Engineer at UKG with 11 years in distributed systems and enterprise Java. He is pursuing an MS in CS at Georgia Tech (OMSCS — Computing Systems + ML). He writes about engineering practice, agentic AI tooling, and the evolving software development lifecycle.