TDD + BDD in the Agentic SDLC Era: Before, After, and What Actually Changed

By Gaurav Pratap Singh  ·  March 2026  ·  14 min read

Key Takeaways

  • Writing tests first is now faster than writing implementation first — the agentic SDLC inverted the time equation that used to make TDD feel like overhead.
  • Before 2024, TDD was widely understood but inconsistently practiced. The bottleneck was never comprehension — it was the time cost of writing thorough tests under sprint pressure.
  • Claude Code generates test scaffolding (including edge cases you would have skipped) in under a minute. Your job becomes domain review, not boilerplate authoring.
  • BDD with Cucumber now starts from AI-drafted Gherkin scenarios that the BA refines — reversing the old bottom-up flow where developers wrote scenarios the BA never read.
  • TDD fundamentals don’t change. The design pressure of test-first, the Red-Green-Refactor discipline, and the value of mutation testing are as relevant as ever — now just with a much lower entry cost.

I have been doing TDD since 2014. I have seen it dismissed as academic overhead, cargo-culted by teams chasing coverage metrics, done correctly by rare teams that treated it as a design discipline — and now, since 2024, I have watched the agentic SDLC change its economics entirely.

This post is a before/after account. Not a tutorial. Not a framework pitch. A practitioner’s honest comparison of how TDD and BDD actually worked in enterprise Java before AI coding agents became part of the workflow, and how they work now.


Part 1: How TDD Actually Worked Before (2014–2023)

The honest version, not the conference-talk version.

The Time Tax That Made TDD Politically Difficult

Every engineer who has tried to introduce TDD to a team under delivery pressure has encountered the same objection: “we don’t have time to write tests first.” This objection was never really about laziness. It was a rational response to a genuine cost.

Here is what writing comprehensive tests for a Spring Boot service actually took, pre-agentic tooling:

Pre-2024: Time to write tests for a PaymentService (one class, ~8 methods)

  Unit tests (JUnit + Mockito) for happy paths: ~25 min
  Unit tests for error paths + edge cases: ~40 min
  Spring slice test setup (@WebMvcTest / @DataJpaTest): ~30 min
  Cucumber BDD scenarios + step definitions: ~60 min
  Total for one service class: ~2.5 hours

For a feature sprint with 4–6 new service classes, that’s a full day of test writing that teams consistently deprioritized when sprint velocity was the metric being tracked. Even engineers who believed in TDD often wrote tests after the fact — not because they were lazy, but because the ratio of test-writing time to implementation time was unsustainable under deadline pressure.

The Coverage Theater Problem

The response to TDD resistance was often: enforce a coverage threshold. 70%, 80%, 90% line coverage as a CI gate. This created a worse problem than no coverage: coverage theater.

Engineers under coverage pressure wrote the fastest tests possible to hit the threshold — tests that exercised code paths without asserting anything meaningful about behavior. A test that calls paymentService.process(validPayment()) and asserts assertNotNull(result) contributes 100% line coverage to the service method and tells you absolutely nothing about whether it works correctly.

The coverage number looked good. The tests provided no safety net. Regressions still shipped.
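To make the failure mode concrete, here is a minimal, self-contained sketch (the PaymentResult shape and process method are hypothetical stand-ins, not code from any real system): a deliberately broken implementation sails past an assertNotNull-style check but fails the moment a test asserts on actual behavior.

```java
import java.math.BigDecimal;

// Hypothetical types for illustration only: process() is deliberately
// buggy in a way a not-null check can never catch.
class PaymentResult {
    final BigDecimal newBalance;
    PaymentResult(BigDecimal newBalance) { this.newBalance = newBalance; }
}

class CoverageTheaterDemo {
    // Bug: the payment amount is never applied to the balance.
    static PaymentResult process(BigDecimal balance, BigDecimal amount) {
        return new PaymentResult(balance);
    }

    public static void main(String[] args) {
        PaymentResult result =
            process(new BigDecimal("0.00"), new BigDecimal("1500.00"));

        // Coverage-theater assertion: exercises the code, catches nothing.
        System.out.println(result != null); // true, despite the bug

        // Behavioral assertion: compares against the expected balance.
        System.out.println(
            result.newBalance.compareTo(new BigDecimal("1500.00")) == 0); // false: bug caught
    }
}
```

Both checks contribute identical line coverage; only the second one is a safety net.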

The BDD Adoption Failure Pattern

BDD with Cucumber had an even more consistent failure pattern. The theoretical workflow:

  1. Business analyst writes Gherkin scenarios in plain English
  2. Developer writes step definitions and implementation
  3. Everyone stays aligned on requirements

The actual workflow, at most companies:

  1. BA writes requirements in Confluence
  2. Developer writes implementation
  3. Developer writes Gherkin feature files that describe what they already built (not what was required)
  4. BA never reads the feature files
  5. Step definitions are written to call internal service methods directly, not observable API behavior

This produced Cucumber tests that were essentially JUnit tests in a more verbose syntax, with no collaboration benefit and significantly more maintenance overhead. Many teams abandoned Cucumber after a year because the cost didn’t justify the value — and they were right, given how they were using it.


Part 2: The Agentic SDLC Changes the Economics (2024–Present)

The 2024–2026 shift happened because of one thing: the time cost of writing comprehensive tests dropped by 80–90%. When the bottleneck disappears, the political objection disappears with it.

Test-First Is Now Faster Than Implementation-First

This is the inversion that matters. Before agentic tools, writing tests first added time to the development cycle. Now it often saves time, because the test spec gives the AI a precise target that produces cleaner implementation on the first pass.

Here is my current workflow for a new service class, using Claude Code:

Step 1 — Write the spec, not the implementation:

I need a PaymentService that:
- Processes PaymentEvent objects from a Kafka consumer
- Checks for duplicate processing using an idempotency key against PostgreSQL
- If not duplicate: updates AccountBalance, publishes BalanceUpdatedEvent to Kafka
- If duplicate: returns immediately (no-op, no exception)
- For malformed events (null amount, negative amount, missing accountId): 
  throws NonRetryableException
- For transient failures (DB timeout, Kafka publish failure): 
  throws RetryableException

Write ALL JUnit 5 + Mockito tests first. 
Cover: happy path, duplicate detection, malformed events (3 variants), 
DB timeout, Kafka failure, null input.
Use @ExtendWith(MockitoExtension.class). Do not write the implementation yet.

Step 2 — Review and add domain knowledge:

Claude Code generates 12–15 test methods covering every case I specified plus several I didn’t explicitly mention (empty string accountId, zero-value amount, concurrent duplicate check). I review each test for domain correctness. In a payment system, I know that amounts of exactly 0.00 are valid for refund reversal events — the AI doesn’t. I add or modify those 1–2 domain-specific tests. The rest are correct as generated.
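For illustration, the kind of domain case I add by hand looks like this (PaymentValidator and isValidAmount are hypothetical stand-ins for the real validation logic): zero is accepted because refund reversal events carry a 0.00 amount, while negative amounts stay rejected.

```java
import java.math.BigDecimal;

// Hypothetical validator encoding the domain rule the AI cannot infer:
// 0.00 is a valid amount (refund reversal events); negatives are malformed.
class PaymentValidator {
    static boolean isValidAmount(BigDecimal amount) {
        return amount != null && amount.signum() >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidAmount(new BigDecimal("0.00")));   // true: refund reversal
        System.out.println(isValidAmount(new BigDecimal("-50.00"))); // false: malformed
        System.out.println(isValidAmount(null));                     // false: malformed
    }
}
```

A spec that only says “amount must be positive” would have led the AI to reject the 0.00 case; this is exactly the kind of test only a domain reviewer catches.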

Step 3 — Generate implementation against the failing tests:

Now implement PaymentService to make all tests pass. 
Use Spring Data JPA for the repository, KafkaTemplate for publishing. 
Add @Transactional where appropriate.
All tests must pass before you finish.

Claude Code writes the implementation, runs the tests, fixes failures, and returns when green. Total time: under 15 minutes for a service that previously took 2.5 hours to write with tests.
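The core behavioral invariant those tests pin down, duplicates are a silent no-op, can be sketched in plain Java (illustrative only: a HashSet stands in for the PostgreSQL idempotency-key lookup, and a counter stands in for the balance update plus Kafka publish):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the duplicate-handling invariant from the spec:
// the Set stands in for the idempotency-key table, the counter for the
// balance update + event publish.
class IdempotentProcessor {
    private final Set<String> processedKeys = new HashSet<>();
    private int appliedCount = 0;

    void process(String idempotencyKey) {
        if (!processedKeys.add(idempotencyKey)) {
            return; // duplicate: no-op, no exception
        }
        appliedCount++;
    }

    int appliedCount() { return appliedCount; }

    public static void main(String[] args) {
        IdempotentProcessor processor = new IdempotentProcessor();
        processor.process("PAY-456");
        processor.process("PAY-456"); // redelivered duplicate
        System.out.println(processor.appliedCount()); // 1
    }
}
```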

2026: Time to write tests + implementation for a PaymentService

  Write spec prompt (replaces test writing): ~3 min
  AI generates tests; you review + add domain cases: ~5 min
  AI generates implementation + runs tests to green: ~5 min
  Spring slice test setup (generated): ~2 min
  Total for one service class: ~15 min

The Before/After Comparison: TDD Workflow

Red (failing test)
  Before: Engineer writes the test manually; 25–40 min for a class. Edge cases often missed under pressure.
  After: AI generates a comprehensive test class from the spec in under 60 seconds. Engineer adds domain-specific cases (2–3 min).

Green (implementation)
  Before: Engineer writes the implementation. Tests sometimes written to fit the implementation rather than the other way around.
  After: AI writes the implementation targeting the tests. Tests spec the behavior; implementation follows. Design pressure preserved.

Refactor
  Before: Often skipped under time pressure. Technical debt accumulates here.
  After: AI executes refactor tasks on command, with the tests as a safety net. Skipping the refactor is a choice, not a necessity.

Edge case coverage
  Before: Proportional to the engineer’s experience and available time. Frequently incomplete.
  After: AI suggests edge cases systematically (nulls, boundary values, exception paths). The domain expert adds what the AI can’t know.

Spring slice setup
  Before: Manual. Boilerplate for @WebMvcTest, @DataJpaTest, and @MockBean configuration took 20–30 min per class.
  After: AI generates the correct slice test setup. The engineer specifies what to test; the AI handles the infrastructure configuration.

Test failure debugging
  Before: Engineer reads the stack trace, identifies the root cause, fixes, re-runs; the loop is repeated manually.
  After: AI reads the stack trace, identifies the root cause, proposes a fix, applies it, and re-runs in one agentic loop. You review the fix.

BDD + Cucumber: How the Workflow Reversed

The biggest shift in BDD practice is that the workflow now runs top-down instead of bottom-up.

Before (bottom-up, developer-driven):

1. BA writes requirements in Confluence
2. Developer reads requirements, writes implementation
3. Developer writes Gherkin to describe what was built
4. Gherkin reflects implementation, not requirements
5. BA never reads it; step definitions call internal methods
6. BDD provides no collaboration value

After (top-down, spec-driven):

1. BA writes requirements (anywhere — Jira, Confluence, verbal)
2. Engineer prompts Claude Code:
   "Given these requirements, write Cucumber Gherkin feature file 
    with scenarios for the happy path, edge cases, and error states"
3. AI generates draft Gherkin that both BA and engineer review together
4. BA corrects scenarios that don't match intent (this is the valuable alignment step)
5. Engineer prompts: "Generate step definitions that test via the REST API"
6. Implementation follows from the agreed scenarios
7. BDD now genuinely serves collaboration

The key insight: AI is good at generating syntactically correct Gherkin from requirements text. What it cannot do is know whether the scenarios match business intent — that’s the BA’s job. The AI draft makes the alignment conversation happen earlier and more concretely than a blank Confluence page ever did.

Here is an example of what Claude Code generates from a requirements prompt for a payment feature:

Feature: Payment Processing
  As a financial platform
  I want to process payment events reliably
  So that account balances reflect all transactions accurately

  Background:
    Given the payment service is running
    And account "ACC-123" has a balance of 0.00

  Scenario: Successful payment is applied to account balance
    Given a payment event for account "ACC-123" of amount 1500.00
    When the payment processor handles the event
    Then the account balance for "ACC-123" should be 1500.00
    And a balance-updated event should be published

  Scenario: Duplicate payment is rejected idempotently
    Given a payment event with idempotency key "PAY-456" has already been processed
    When a second payment event with idempotency key "PAY-456" arrives
    Then the account balance should remain unchanged
    And no duplicate event should be published

  Scenario: Payment with negative amount is rejected
    Given a payment event for account "ACC-123" of amount -50.00
    When the payment processor handles the event
    Then a NonRetryableException should be thrown
    And the account balance should remain unchanged

  Scenario: Payment fails due to database timeout and is retried
    Given the database is intermittently unavailable
    When a payment event for account "ACC-123" of amount 100.00 arrives
    Then a RetryableException should be thrown
    And the event should be eligible for redelivery

This draft takes 30 seconds to generate and replaces the 60–90 minutes engineers used to spend writing Gherkin from scratch. The BA can now review it, correct scenario 3 (“actually, negative amounts are valid for refund reversals up to the account balance”), and the alignment happens before a single line of implementation is written.


Part 3: What Didn’t Change (And Never Will)

Speed improvements don’t change the fundamentals. These principles matter as much in 2026 as they did in 2014 — maybe more, because AI makes it easier to violate them at scale.

TDD Is Still a Design Practice First

Writing tests before implementation is still the correct discipline, and not because it improves coverage. It improves design. A class that is hard to test is hard to test because it is poorly designed: too many dependencies, too many responsibilities, too much hidden state.

In the agentic era, this means: if you ask Claude Code to write tests for an existing class and the generated test setup is complicated — many mocks, complex fixtures, indirect access to state — that complexity is a signal that your design needs work. The feedback is the same as before; the signal is now easier to see, because the AI faithfully reproduces every mock and fixture your design forces, concentrating the complexity in one generated file you can read at a glance.

Spring Test Slices: Still the Correct Architecture

The fundamental principle — load only what you need to test — is unchanged. What changed is that AI generates the correct slice configuration without you having to look up the docs:

// Static imports used by the tests below:
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.BDDMockito.given;
import static org.springframework.http.MediaType.APPLICATION_JSON;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

// Test only the web layer
@WebMvcTest(PaymentController.class)
class PaymentControllerTest {
    @Autowired MockMvc mockMvc;
    @MockBean PaymentService paymentService;

    @Test
    void shouldReturn201ForValidPayment() throws Exception {
        given(paymentService.process(any()))
            .willReturn(PaymentResult.success("PAY-001"));

        mockMvc.perform(post("/api/payments")
                .contentType(APPLICATION_JSON)
                .content("""
                    { "accountId": "ACC-123", "amount": 1500.00,
                      "idempotencyKey": "PAY-001" }
                    """))
            .andExpect(status().isCreated())
            .andExpect(jsonPath("$.paymentId").value("PAY-001"));
    }

    @Test
    void shouldReturn400ForNegativeAmount() throws Exception {
        mockMvc.perform(post("/api/payments")
                .contentType(APPLICATION_JSON)
                .content("""
                    { "accountId": "ACC-123", "amount": -50.00,
                      "idempotencyKey": "PAY-002" }
                    """))
            .andExpect(status().isBadRequest())
            .andExpect(jsonPath("$.error").value("INVALID_AMOUNT"));
    }
}
import static org.assertj.core.api.Assertions.assertThat;

// Test only the JPA layer
@DataJpaTest
class PaymentRepositoryTest {
    @Autowired PaymentRepository repository;

    @Test
    void shouldDetectDuplicateByIdempotencyKey() {
        repository.save(Payment.builder()
            .idempotencyKey("PAY-456")
            .accountId("ACC-123")
            .amount(BigDecimal.valueOf(1500.00))
            .status(PaymentStatus.PROCESSED)
            .build());

        assertThat(repository.existsByIdempotencyKey("PAY-456")).isTrue();
        assertThat(repository.existsByIdempotencyKey("PAY-999")).isFalse();
    }
}

Mutation Testing: More Important, Not Less

In the pre-agentic era, mutation testing (PIT for Java) was a nice-to-have that few teams ran regularly because the feedback loop was slow. In the agentic era, it is more important than ever — because AI-generated tests can produce high line coverage while missing the semantically meaningful assertions.

PIT introduces deliberate modifications to your production code — changing > to >=, negating a boolean, replacing a return value with null — and checks whether your tests catch the change. A high mutation score means your tests detect incorrect behavior, not just execute the code.
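Here is why the boundary mutation matters in practice (the names and the $10,000 threshold are illustrative, echoing a typical compliance rule, not code from any real system): tests that only check amounts well above and well below a threshold let the > to >= mutant survive, because both versions agree everywhere except at the exact boundary.

```java
import java.math.BigDecimal;

// Illustrative boundary condition: PIT's conditionals-boundary mutator
// would rewrite > as >= here. Only an assertion at exactly 10000.00
// distinguishes the original from the mutant.
class ComplianceCheck {
    static final BigDecimal REVIEW_THRESHOLD = new BigDecimal("10000.00");

    static boolean requiresManualReview(BigDecimal amount) {
        return amount.compareTo(REVIEW_THRESHOLD) > 0;
    }

    public static void main(String[] args) {
        // Off-boundary checks: original and >= mutant agree on both,
        // so these assertions alone leave the mutant alive.
        System.out.println(requiresManualReview(new BigDecimal("10000.01"))); // true
        System.out.println(requiresManualReview(new BigDecimal("9999.99")));  // false

        // Boundary check: original says false, the >= mutant says true.
        // This is the assertion that kills the mutant.
        System.out.println(requiresManualReview(new BigDecimal("10000.00"))); // false
    }
}
```

Line coverage is identical with or without the boundary assertion; only the mutation score reveals the difference.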

<!-- pom.xml: Add PIT mutation testing -->
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.3</version>
  <configuration>
    <targetClasses>
      <param>com.example.payments.service.*</param>
    </targetClasses>
    <targetTests>
      <param>com.example.payments.service.*Test</param>
    </targetTests>
    <mutationThreshold>80</mutationThreshold>
    <coverageThreshold>90</coverageThreshold>
  </configuration>
</plugin>

In my current workflow: after Claude Code generates and passes the tests, I run PIT to validate test quality. If the mutation score falls below the 80% threshold configured above, I prompt Claude Code: “These mutants survived. For each one, explain why the test didn’t catch it and add a test that would.” The AI’s analysis is consistently accurate and its fixes close the gap. The human review step is checking whether the added assertions reflect actual business requirements.


Part 4: New Anti-Patterns Introduced by Agentic TDD

The agentic SDLC introduces failure modes that didn’t exist before. These are the new ways TDD goes wrong in 2026.

Anti-Pattern 1: Trusting AI Coverage Without Domain Review

AI-generated tests cover the spec you gave them, not the requirements you meant to give them. If your prompt says “validate that amount is positive,” the AI correctly tests that negative amounts throw an exception. It has no way to know that in your domain, amounts of exactly zero are valid for refund reversals, or that amounts greater than $10,000 require a separate compliance check.

The fix: treat AI-generated tests as comprehensive scaffolding, not complete specification. Your domain expertise is what converts scaffolding into a meaningful safety net. Read every generated test as critically as you would read generated implementation.

Anti-Pattern 2: Spec-by-Prompt Instead of Spec-by-Conversation

The worst agentic TDD workflow I’ve seen: engineer gets a ticket from Jira, pastes the ticket description into Claude Code, gets tests back, gets implementation back, opens a PR. The BA never saw the Gherkin. The business intent was never validated against the AI interpretation.

AI-generated Gherkin from a Jira ticket reflects the quality of the ticket, not the quality of the requirement. If the ticket is ambiguous, the Gherkin will be too — just confidently so. The collaboration step — BA reviewing AI-drafted scenarios before implementation — is not optional.

Anti-Pattern 3: Skipping the Red Phase

A seductive shortcut: ask Claude Code to write tests AND implementation simultaneously. This eliminates the design pressure of TDD entirely. You lose the feedback loop that tells you your design is too complex to test — because the AI will write tests that accommodate any design.

Always run the tests first and confirm they fail before writing implementation. The failing test is not just a process step — it’s the moment you confirm the test actually tests what you think it tests.
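One convention that keeps the red phase honest with agent-generated code (a team habit I find useful, not a feature of any tool): have the agent emit a stub whose methods throw until the green phase, so the first test run demonstrably fails for the right reason, an assertion or exception, rather than a compile or configuration error.

```java
// Red-phase stub convention (illustrative): every method throws until
// the green phase, so generated tests provably fail before any
// implementation exists.
class PaymentService {
    void process(Object event) {
        throw new UnsupportedOperationException("red phase: not implemented");
    }

    public static void main(String[] args) {
        try {
            new PaymentService().process(new Object());
            System.out.println("green"); // unreachable in the red phase
        } catch (UnsupportedOperationException expected) {
            System.out.println("red: " + expected.getMessage());
        }
    }
}
```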

Anti-Pattern 4: Over-Mocking in AI-Generated Tests

AI tends to mock aggressively — if a class has six collaborators, the generated tests will mock all six for every test, even when a real instance would be simpler and more meaningful. This produces tests that are tightly coupled to implementation details rather than behavior.

Review generated mock usage critically. Tests that mock the class under test’s own methods are always wrong. Tests that mock 5+ collaborators for a single assertion are usually a sign the class has too many responsibilities.
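A common concrete case: generated tests frequently mock a time source, where a real java.time.Clock.fixed instance is simpler and asserts on behavior rather than on interactions (PaymentTimestamper is a hypothetical example class, not from the article's codebase):

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical class with a time-source collaborator. A real fixed
// Clock replaces a Mockito mock: no stubbing, no verify() coupling to
// implementation details.
class PaymentTimestamper {
    private final Clock clock;

    PaymentTimestamper(Clock clock) { this.clock = clock; }

    Instant stampedAt() { return clock.instant(); }

    public static void main(String[] args) {
        Instant fixedPoint = Instant.parse("2026-03-01T00:00:00Z");
        PaymentTimestamper timestamper =
            new PaymentTimestamper(Clock.fixed(fixedPoint, ZoneOffset.UTC));
        System.out.println(timestamper.stampedAt().equals(fixedPoint)); // true
    }
}
```

The test stays meaningful under refactoring: it checks the observable timestamp, not which methods the class happened to call.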


My Prompt Library for TDD + BDD (March 2026)

These are the prompts I actually use, stored in our team’s shared Claude Code prompt library:

# Unit test generation (tests only, no implementation)

Write JUnit 5 + Mockito unit tests for [ClassName].
Behavior to test: [describe in plain English].
Cover: happy path, all explicit error cases, null inputs, 
boundary values, and any concurrency concerns.
Use @ExtendWith(MockitoExtension.class).
Assert on return values and side effects (verify() calls).
Do NOT write the implementation.

# BDD feature file generation

Generate a Cucumber Gherkin feature file for: [feature description].
Include: one happy-path scenario, the top 3 error/edge scenarios,
and any performance or concurrency scenarios relevant to this domain.
Scenarios should describe observable behavior, not internal methods.
Use Background: for shared setup. Keep step language business-readable.

# Mutation testing gap closure

PIT mutation testing shows these surviving mutants in [ClassName]: 
[paste PIT output]
For each surviving mutant:
1. Explain in one sentence why the existing test didn't catch it
2. Write a new test that would kill this mutant
3. Flag any mutant where killing it would require testing 
   an implementation detail (not a behavior) — skip those.

# Spring slice setup

Create a @WebMvcTest for [ControllerClass] covering:
[list of endpoints and key scenarios]
Use MockMvc + @MockBean for service dependencies.
Test: 200/201 success cases, 400 validation failures, 
404 not-found, and 500 upstream error handling.
Match our existing test naming convention: 
shouldReturn[Status]When[Condition].

Frequently Asked Questions

Does AI-generated test code make TDD easier to skip?

Potentially, yes — if your team conflates “tests exist” with “TDD was practiced.” You can ask Claude Code to write implementation and tests simultaneously and get both in one step. This is not TDD. It eliminates the design feedback loop. The value of test-first comes from writing the failing test before the implementation and feeling the design pressure. Teams that use AI to generate tests alongside implementation are doing code coverage generation, not test-driven development.

What should I review in AI-generated tests?

Three things: (1) Domain correctness — does the test reflect actual business rules, not just syntactic interpretation of the prompt? (2) Assertion depth — does the test assert on meaningful behavior, not just “result is not null”? (3) Mock appropriateness — are the things being mocked actually external dependencies, or is the AI mocking the class under test’s own collaborators unnecessarily? Line coverage is not a useful review dimension.

How do I convince my team to adopt agentic TDD?

Run a live demo. Take a real upcoming service class from your backlog. Time yourself writing tests and implementation with Claude Code in front of the team. The combination of seeing the speed and seeing that the generated tests are actually comprehensive (not just happy-path) addresses both the “we don’t have time” and “AI tests aren’t good enough” objections simultaneously. One 15-minute demo is worth 10 architecture decision records about why TDD is important.

Do I still need BDD with AI tools available?

Yes — more than before. The value of BDD was never in Cucumber syntax or in the testing framework. It was in the conversation between business and engineering that Gherkin scenarios force. AI can generate Gherkin from requirements text, but it cannot validate whether the scenarios match business intent. The collaboration step — BA reviewing AI-drafted scenarios before implementation — is the highest-value part of BDD, and it’s faster now because you start from a draft rather than a blank page.

What is the right test pyramid ratio in 2026?

I use: 65% unit tests (JUnit + Mockito, fast, test one class in isolation), 25% integration/slice tests (@WebMvcTest, @DataJpaTest, TestContainers for real DB), 10% BDD acceptance tests (Cucumber, test observable behavior via API). The ratio matters less than the principle: each layer tests what the layer below can’t adequately cover. Unit tests can’t test HTTP routing; slice tests can. Slice tests can’t test business workflow across multiple services; BDD scenarios can.


The Bottom Line

The agentic SDLC did not change why TDD works. It changed whether it’s practical to do it consistently.

Before 2024, the failure mode of TDD was time pressure. Comprehensive tests took 2–3 hours per service class. Sprints were two weeks. The math didn’t always work out, and coverage theater filled the gap.

In 2026, the failure mode of TDD is review discipline. Tests take 15 minutes per service class. The time objection is gone. The new risk is that engineers treat AI-generated tests as complete rather than as scaffolding requiring domain review. The gap between “AI said the tests pass” and “these tests verify the system is correct for our domain” is where bugs will hide.

TDD is a design practice. Agentic tools accelerate the mechanical parts. The design thinking — what behavior should this system have, what are its invariants, what should it refuse to do — is still yours. The tools give you more time to focus on that thinking. Use it.


Gaurav Pratap Singh is a Principal / Staff Software Engineer at UKG with 11 years in distributed systems and enterprise Java. He is pursuing an MS in CS at Georgia Tech (OMSCS — Computing Systems + ML). He writes about engineering practice, agentic AI tooling, and the evolving software development lifecycle.