Technical Due Diligence When the Codebase Was Built with AI: The New CTO Checklist
- 10 min read
Technical Due Diligence When the Codebase Was Built with AI: The New CTO Checklist
The standard technical due diligence checklist was written for codebases built by developers making deliberate decisions. When AI wrote 40-60% of the code, the checklist still applies - but it misses a different class of failure entirely.
I do technical due diligence for acquirers and investors. In the last 18 months, most of the codebases I’ve reviewed have significant AI-generated content. Some of them are fine. Some have problems that wouldn’t surface in a standard review but become expensive after close. Here’s what the updated checklist looks like.
Start With the Question Most Reviewers Skip
Before you look at a single line of code, ask: “What percentage of this codebase was AI-generated? Which tools? Over what time period?”
Most founders can’t answer this precisely. That’s not disqualifying - the tooling to track AI code provenance hasn’t been standard practice. But the answer tells you a lot about how intentionally the team was working.
A team that says “roughly 40%, mostly Claude Code for boilerplate and Copilot for tests, starting about 14 months ago” is a team that was paying attention. A team that says “I don’t know, we just used whatever” is a team where the code is going to surprise you. Both types of codebases exist. Your review process should differ.
The follow-up question: “Did you review AI-generated code before merging, or accept suggestions directly?” There’s no wrong answer, but the answer calibrates what you’re about to find.
Test Coverage: Quality vs. Quantity
AI writes tests prolifically. You will almost certainly see high line coverage numbers. Those numbers are nearly meaningless on their own.
The failure mode is this: AI generates tests that verify the code does exactly what it does - not that the code does what it should. A function that returns the wrong result gets a test that asserts the wrong result. Everything passes. Nothing is actually tested.
The diagnostic: run mutation testing. Mutation testing modifies your code in small ways (flipping a > to >=, changing a true to false) and checks whether your tests catch the change. If they don’t, the tests aren’t doing real work.
- Python:
mutmut - JavaScript/TypeScript:
stryker
A mutation score below 40% means the test suite is not catching bugs. This is common in AI-generated test suites. It’s not a dealbreaker, but it means the coverage numbers in the due diligence deck are misleading and the actual bug risk is higher than they suggest.
Green signal: Mutation score above 60%, tests that exercise error paths and edge cases, test names that describe behavior not implementation.
Yellow signal: High line coverage, low mutation score. The tests exist but don’t protect much. Fixable, but requires real work.
Red signal: High coverage numbers, mutation score below 25%, no tests for error handling. The test suite is a confidence illusion.
Secrets and Credentials: Scan the Full History
Run trufflehog or gitleaks on the full git history - not just the current HEAD.
AI-generated code hardcodes credentials more often than human-written code. The Cloud Security Alliance’s 2026 data puts 40-62% of AI-generated code as containing security vulnerabilities. A more specific data point: AI-assisted commits expose secrets at 3.2% versus 1.5% for human-only commits - more than double the rate. And GitHub’s 2025 data found 28.65 million hardcoded secrets in public repos, up 34% year over year.
The current HEAD might be clean because someone caught it. But the credential may have been committed, used, and is now compromised - even if it’s been removed from current files. Git history is permanent.
trufflehog git file://./path/to/repo --since-commit HEAD~500
Look specifically at environment configuration files, test fixtures, and any file with “config” or “settings” in the name. These are where AI most commonly generates example credentials that then get committed.
Green signal: Clean history scan, .env in .gitignore from the first commit, secrets management through a proper vault or environment variable system.
Yellow signal: A few historical findings that were rotated after discovery. Check that the rotation actually happened - that the credential in question is no longer valid.
Red signal: Active credentials in git history that haven’t been rotated, or a .env file committed at any point with real values.
Dependency Audit: Beyond npm audit
Standard dependency audits check for known vulnerabilities in packages you’re using. That’s still worth running. But AI introduces a different problem: packages that are wrong, outdated, or don’t exist at all.
AI models hallucinate package names. Less often than they used to, but it still happens. More commonly, they suggest packages that were real but are now abandoned, deprecated, or superseded. The code looks correct. The package installs. But it’s doing something subtly different from what the code assumes, or it hasn’t received a security update in three years.
The manual check: take five non-trivial dependencies you haven’t heard of and verify that they actually do what the code uses them for. Check last release date, download trends, and whether the functionality has been absorbed into a framework natively.
Also check for version mismatches. AI frequently generates code that works against a package’s API as it existed 18 months ago. The package is on a newer major version. The code works because the old API still exists as a compatibility shim - until the next major version drops it.
npm audit
npx depcheck # finds unused dependencies
npm outdated # flags packages behind their published version
Green signal: Active dependencies with recent releases, no audit findings at critical or high severity, version ranges that make sense.
Yellow signal: Several outdated packages, minor audit findings. Common and manageable.
Red signal: Packages that don’t exist on npm/PyPI, critical vulnerability findings, or dependencies pinned to versions from 3+ years ago.
Architecture Coherence: The Three-Pattern Problem
AI generates locally consistent code. It sees the code around it and matches the style. The problem surfaces at the global level.
The most common pattern I find: three different approaches to the same architectural problem in different parts of the codebase, each of which is “correct” in isolation. A service layer implemented three different ways. API responses formatted with three different conventions. Database queries handled with three different abstraction patterns. Each section looks fine. The whole is incoherent.
This happens because AI was given different context at different points in the project’s development. The architecture evolved, the CLAUDE.md (or equivalent) wasn’t updated, and subsequent code generation followed the old pattern or invented a new one.
The diagnostic question: ask the team to explain the service layer architecture, or whatever the core domain logic pattern is. Can they give you a consistent answer? Or do different developers describe it differently?
The architecture itself might be fine. What you’re actually testing is whether the team understands what they built. A team that understands the inconsistencies and has a plan to address them is in a different position than a team discovering them for the first time during your review.
Green signal: Consistent patterns across the codebase, team can articulate the architecture and the reasoning behind it.
Yellow signal: Inconsistencies present but team is aware of them and has a documented remediation plan.
Red signal: Multiple conflicting patterns with no team awareness of the inconsistency, or team members giving materially different descriptions of the same system.
The “Can They Explain It” Test
This is the highest-signal check I run, and it takes 30 minutes.
Pick a non-trivial piece of AI-generated code - something with real business logic, not a CRUD endpoint. Find something that was committed 6-8 weeks ago. Ask the developer who wrote it to walk you through the design decisions.
Not “what does this code do” - that they can read. “Why is it structured this way? What alternatives did you consider? What would need to change if requirement X shifted?”
If a developer can’t explain the design decisions in code they shipped two months ago, they didn’t understand it when they shipped it. They accepted a suggestion that looked right. This is a team capability signal, not a blame assessment - it tells you how much hidden complexity you’re acquiring.
The inverse is also true. A developer who can walk you through AI-generated code, explain where they steered it differently than its first suggestion, and articulate what they’d change with hindsight is a developer who was using AI as a tool rather than a replacement for thinking. That team is in good shape.
Green signal: Developer can explain design decisions, trade-offs considered, and evolution of the approach.
Yellow signal: Developer understands what the code does but is vague on why it’s structured a particular way. Common, and recoverable.
Red signal: Developer cannot explain code they committed, or discovers during the walkthrough that the code doesn’t actually do what they thought it did.
Prompt Injection Surface Area
If the product passes user input to AI models - which is increasingly common, and increasingly often not the core feature but a secondary “AI-powered” addition - check the input sanitization.
Prompt injection is not the same problem as SQL injection, and developers who know how to prevent SQL injection often don’t know what to look for here. The attack surface is anywhere user-controlled text enters a prompt template without sanitization.
Look for:
- String interpolation of user input directly into prompt templates
- System prompts that can be overridden by sufficiently crafted user input
- Outputs from AI models being passed back into other AI calls without validation
- Tool call definitions that user input can influence
This is an emerging area and the tooling is immature. A manual review of how user input flows into any AI call is more reliable than automated scanning right now.
Green signal: User input is treated as data, not as instructions. Prompt templates have clear boundaries between system and user content.
Yellow signal: Some interpolation of user input but limited model capabilities (output-only, no tool calls). Lower risk but worth documenting.
Red signal: Raw user input in system prompts, AI models with tool-call capabilities where user input can influence the tool selection.
License Contamination
AI models were trained on code with a range of licenses, including GPL and other copyleft licenses. If the model reproduced GPL-licensed code verbatim and that code is now in your product, you have a license problem that survives acquisition.
Most of the time this isn’t an issue. But “most of the time” is not due diligence.
The question to ask: “Has any IP review or license audit been done on this codebase?” Most early-stage companies will say no. That’s not unusual. What matters is whether an acquirer’s legal team has flagged this as a concern and whether you need to add a license scan to the pre-close checklist.
Tools: FOSSA or Black Duck for automated license scanning. Neither is perfect for AI-generated code specifically, but they’ll catch obvious GPL reproduction and flag dependencies with license complications.
Green signal: License audit has been run, FOSSA or equivalent is integrated into CI, no copyleft findings in application code.
Yellow signal: No audit run but no obvious high-risk patterns (no AI-generated code that closely resembles known open-source implementations in regulated or IP-sensitive domains).
Red signal: AI-generated code that appears to reproduce non-trivially from identifiable open-source projects under restrictive licenses, with no audit and no legal review.
The updated due diligence checklist isn’t longer than the standard one - it’s differently focused. Standard DD asks “is the code well-written.” AI-augmented DD asks “does the team understand what they built, and are the specific failure modes of AI generation under control.”
Those are answerable questions. Teams that used AI well will show it in their answers. Teams that used AI as a shortcut will show that too.