---
title: "Building an AI code review agent: lessons from production"
canonical: "https://agenticup.dev/posts/building-ai-code-review-agent/"
pubDate: "2026-06-01T00:00:00.000Z"
description: "I built an AI code review agent that posts comments on GitHub PRs. The architecture was the easy part. The failure modes (hallucinated bugs, missing real issues, arguing with human reviewers) nearly made me scrap the project."
tags: [code review, ai agents, github, production, pr automation]
---

The [Claude Code documentation](https://docs.anthropic.com/en/docs/claude-code/overview) describes how code review fits into the agentic coding workflow: models that read files, run linters, and post structured feedback on PRs.

**TL;DR:** I built a GitHub PR review agent that posts useful comments on about 70% of PRs. The architecture (webhook → classify → diff → review → post) was the easy part. The failure modes (hallucinated bugs, missing issues in large files, duplicate comments, arguing with human reviewers) took 11 weeks to solve. Cost: ~$50/month for 50 PRs.

I spent three months building an AI code review agent for GitHub PRs. The first prototype took a weekend. Making it reliable enough for production, where false positives erode trust and missed bugs defeat the purpose, took the remaining eleven weeks. Here's what I learned.

> **Key takeaways:**
> - The architecture is straightforward: webhook → classify → fetch diff → review with LLM → post comments
> - The system prompt matters more than the model: "senior engineer reviewing a junior's PR" works best
> - Chunking large diffs is essential but introduces its own problems (duplicate comments, lost cross-file context)
> - False positives are the #1 reliability challenge; confidence thresholds and severity classification help
> - The most surprising failure mode: the AI arguing with human reviewers in comment threads

Confession

I nearly scrapped this project three times. Each time, a breakthrough in prompt design or architecture kept me going. The current version runs on about 20 repos and posts reviews that are useful — not perfect, but useful — on about 70% of PRs.

## Architecture

The agent runs as a simple web service organized as a five-stage pipeline:

```
GitHub Webhook → Event Classifier → Diff Fetcher → LLM Reviewer → Comment Poster
                        ↓                ↓              ↓
                  Skip if not PR   Chunk if large  Aggregate results
```

### 1. Webhook receiver

```python
import asyncio
from dataclasses import dataclass

from fastapi import FastAPI, Request

app = FastAPI()

# Keep references so background tasks aren't garbage-collected mid-review
background_tasks: set[asyncio.Task] = set()


@dataclass
class PRContext:
    owner: str
    repo: str
    number: int
    title: str
    base_sha: str
    head_sha: str
    is_draft: bool = False


@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    event = request.headers.get("x-github-event")

    if event == "pull_request" and payload["action"] in ["opened", "synchronize"]:
        pr = PRContext(
            owner=payload["repository"]["owner"]["login"],
            repo=payload["repository"]["name"],
            number=payload["pull_request"]["number"],
            title=payload["pull_request"]["title"],
            base_sha=payload["pull_request"]["base"]["sha"],
            head_sha=payload["pull_request"]["head"]["sha"],
            is_draft=payload["pull_request"].get("draft", False),
        )
        # Run the review in the background so the webhook returns immediately
        task = asyncio.create_task(review_pull_request(pr))
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)

    return {"status": "ok"}
```

The key insight: process reviews asynchronously. GitHub expects a webhook response within 10 seconds, and a full review takes 30-90 seconds. Kick off a background task and return immediately.

### 2. Event classifier

Not every PR needs reviewing. I classify events to save cost and reduce noise:

```python
async def should_review(pr: PRContext) -> bool:
    """Skip drafts, docs-only changes, and trivial PRs."""
    # Skip draft PRs
    if pr.is_draft:
        return False

    # Get changed files (assumed here to be a list of path strings)
    files = await github.get_pr_files(pr.owner, pr.repo, pr.number)

    # Skip docs-only PRs
    if all(f.endswith(".md") for f in files):
        return False

    # Skip trivial changes (less than 10 lines changed)
    diff = await github.get_pr_diff(pr.owner, pr.repo, pr.number)
    if len(diff) < 200:  # Rough: 200 chars ≈ 10 lines
        return False

    return True
```

This filters out about 30% of webhook events and saves about ₹650 ($8) per month in unnecessary LLM calls.

### 3. Diff fetching and chunking

Large diffs are the hardest problem. A 2000-line diff won't fit in a single review context without losing quality.
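
Before a diff can be chunked, it has to be split per file. The sketch below shows one way that step might look. `parse_diff_files` is the helper name the chunking code in this section calls, but the body here is an assumption for illustration, not the production implementation. It expects standard `git diff` output and does not handle paths that contain the literal text ` b/`.

```python
import re

# Hypothetical helper body: an illustrative assumption, not the production code.
# Matches the "diff --git a/<old> b/<new>" header that starts each file section.
_FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def parse_diff_files(diff: str) -> list[tuple[str, str]]:
    """Split a unified git diff into (file_path, file_diff) pairs."""
    files: list[tuple[str, str]] = []
    current_path = None
    current_lines: list[str] = []

    for line in diff.splitlines():
        match = _FILE_HEADER.match(line)
        if match:
            # A new header closes the previous file's section
            if current_path is not None:
                files.append((current_path, "\n".join(current_lines)))
            current_path = match.group(2)  # path on the "b/" (new) side
            current_lines = [line]
        elif current_path is not None:
            current_lines.append(line)

    # Flush the last file section
    if current_path is not None:
        files.append((current_path, "\n".join(current_lines)))
    return files
```

Each pair keeps the file's own header lines, so a chunk can still be read as a valid diff fragment on its own.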

```python
async def chunk_diff(diff: str, max_file_changes: int = 500) -> list[DiffChunk]:
    """Split a large diff into reviewable chunks."""
    files = parse_diff_files(diff)
    chunks = []

    for file_path, file_diff in files:
        # Measure size in lines, not characters (assumes file_diff is diff text)
        if len(file_diff.splitlines()) <= max_file_changes:
            chunks.append(DiffChunk(file=file_path, diff=file_diff))
        else:
            # Split by function boundaries within the file
            function_diffs = split_by_functions(file_diff)
            for i, func_diff in enumerate(function_diffs):
                chunks.append(
                    DiffChunk(
                        file=f"{file_path}#function-{i}",
                        diff=func_diff,
                    )
                )

    return chunks
```

Each chunk gets an independent review. Then I aggregate the results:

```python
async def review_pull_request(pr: PRContext):
    # Run the event classifier first so skipped PRs cost nothing
    if not await should_review(pr):
        return

    diff = await github.get_pr_diff(pr.owner, pr.repo, pr.number)
    chunks = await chunk_diff(diff)

    # Review each chunk independently
    all_findings = []
    for chunk in chunks:
        findings = await review_chunk(pr, chunk)
        all_findings.extend(findings)

    # Deduplicate and aggregate
    consolidated = consolidate_findings(all_findings)

    # Post comments
    for finding in consolidated:
        if finding.confidence > 0.7:  # Confidence threshold
            await post_comment(pr, finding)
```

### 4. The system prompt

This was the most iterated part. Here's what works:

```python
REVIEW_PROMPT = """You are a senior engineer reviewing a junior developer's pull request.

Focus ONLY on genuine issues:

CRITICAL issues (block merge):
- Logic errors that would cause incorrect behavior
- Security vulnerabilities (injection, auth bypass, data leaks)
- Performance problems in hot paths
- Race conditions or concurrency bugs

WARNINGS (should fix):
- Error handling gaps (uncaught exceptions, silent failures)
- Resource leaks (unclosed connections, file handles)
- Testing gaps that would miss real bugs

Do NOT comment on:
- Code style preferences (use the project's formatter)
- Missing docstrings on private methods
- Variable naming unless genuinely confusing
- Patterns that are unconventional but correct

For each issue found, provide:
1. File and line number
2. Severity (CRITICAL or WARNING)
3. Clear explanation of the problem
4. Specific code suggestion (exact diff if possible)
5. Confidence (0.0 to 1.0)

If you're unsure about something, skip it. False positives erode trust."""
```

The key decisions:

- **"Senior engineer reviewing a junior"** sets the right tone: helpful, not pedantic
- **Explicit "do not comment on" list** reduces noise by about 40%
- **Confidence scoring** lets me filter low-confidence findings
- **"If unsure, skip"** is the most important instruction, because it directly fights false positives

## The surprising failure modes

### 1. Hallucinated bugs

The AI would find bugs that didn't exist. Here's a real example:

The PR changed a CSS class name from `btn-primary` to `btn-main`. The AI commented: "This class is missing the `:hover` state — users won't see any visual feedback on mouseover." But the hover state was defined in a parent class in a different file that the AI couldn't see.

**Fix:** I added cross-file context by including the most relevant files (imports, parent classes) in the review context. I also lowered expectations: the agent now adds a disclaimer, "Based on the diff alone; verify against full codebase."

### 2. Missing real bugs in large files

The agent did great on files under 300 lines. Above that, quality dropped sharply. In a 600-line file, it missed a null pointer dereference that a human reviewer caught immediately.

**Fix:** Chunking at function boundaries, not file boundaries. Each function gets its own review pass. But this introduced the next problem...

### 3. Duplicate comments

When the same issue appears in two chunks, or when two chunks touch adjacent code, the AI would flag the same problem twice: once in chunk A, once in chunk B. Sometimes the two comments used slightly different wording, which made deduplication non-trivial.
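
To make the wording problem concrete, here is a small sketch of why an exact-match check misses these duplicates. It uses `difflib.SequenceMatcher` from the standard library as a lexical stand-in for semantic similarity, which is an assumption made for illustration: the function name `wording_similarity` and the sample findings are hypothetical, and a character-level ratio is not a real semantic comparison.

```python
from difflib import SequenceMatcher


def wording_similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 lexical similarity ratio between two comment texts."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical findings, invented for illustration
from_chunk_a = "Possible null dereference of user before access"
from_chunk_b = "Possible null dereference of user before it is accessed"
unrelated = "SQL query is built with string concatenation"

# Exact comparison treats the two reworded findings as different
print(from_chunk_a == from_chunk_b)

# A similarity ratio scores the reworded pair well above the unrelated pair
print(wording_similarity(from_chunk_a, from_chunk_b))
print(wording_similarity(from_chunk_a, unrelated))
```

A ratio like this is cheap to compute, but it only catches rewordings that share most of their characters. Two comments that describe the same bug in different terms would need something stronger.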

**Fix:** A deduplication pass that compares findings by location and semantic similarity. The snippet shows the location grouping:

```python
def consolidate_findings(findings: list[Finding]) -> list[Finding]:
    """Merge duplicate findings from different chunks."""
    grouped = {}

    for f in findings:
        # Group by file + 10-line block (lines 0-9, 10-19, ...).
        # Findings on either side of a block boundary are not merged.
        key = (f.file, f.line // 10)

        if key not in grouped:
            grouped[key] = f
        else:
            # Keep the higher severity; otherwise the first finding wins
            existing = grouped[key]
            if f.severity == "CRITICAL" and existing.severity == "WARNING":
                grouped[key] = f

    return list(grouped.values())
```

### 4. Arguing with human reviewers

This was the most bizarre failure. A human would comment "Actually, this pattern is intentional for performance reasons." The AI agent, in a follow-up review of the same PR, would post a rebuttal: "While performance is a concern, the correctness issue outweighs it..."

The AI didn't know it was arguing with a human. It saw new context (the human's comment) and treated it as new code to review.

**Fix:** The agent reads existing review comments before posting and skips topics that have been discussed:

```python
async def has_been_discussed(pr: PRContext, finding: Finding) -> bool:
    existing_comments = await github.get_pr_comments(
        pr.owner, pr.repo, pr.number
    )

    for comment in existing_comments:
        # Outdated review comments can have no line number; skip those
        if comment.path == finding.file and comment.line is not None:
            # Simple overlap check
            if abs(comment.line - finding.line) < 5:
                return True

    return False
```

### 5. Cost variation

Early estimates were way off. Here's actual cost data from 100 PRs:

| PR Size | Average Cost | p90 Cost | Average Time |
|---|---|---|---|
| Small (<100 lines) | ₹65 ($0.80) | ₹100 ($1.20) | 20s |
| Medium (100-500 lines) | ₹165 ($2.00) | ₹250 ($3.00) | 45s |
| Large (500-2000 lines) | ₹500 ($6.00) | ₹820 ($10.00) | 90s |
| Monstrous (>2000 lines) | ₹1,000 ($12.00) | ₹1,650 ($20.00) | 180s |

**Total monthly cost at 50 PRs/month (my current volume):** approximately ₹4,100 ($50).
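
As a rough sanity check, the monthly figure can be approximated from the per-size averages in the table above. The sketch below does that arithmetic. The average costs come from the table, but the mix of PR sizes is a hypothetical assumption made up for illustration (the actual distribution isn't stated), and `estimate_monthly_cost` is an illustrative name.

```python
# Average cost per PR in USD, taken from the cost table
AVG_COST_USD = {"small": 0.80, "medium": 2.00, "large": 6.00, "monstrous": 12.00}

# HYPOTHETICAL monthly mix of PR sizes, invented for illustration only
EXAMPLE_MIX = {"small": 43, "medium": 6, "large": 1, "monstrous": 0}


def estimate_monthly_cost(mix: dict[str, int], avg_cost: dict[str, float]) -> float:
    """Sum (PR count x average cost) over every size bucket."""
    return sum(count * avg_cost[size] for size, count in mix.items())


total = estimate_monthly_cost(EXAMPLE_MIX, AVG_COST_USD)
print(f"{sum(EXAMPLE_MIX.values())} PRs -> ${total:.2f}/month")
```

With this made-up mix the estimate comes to $52.40, in the same range as the roughly $50 reported above. Moving even two or three PRs from the small bucket into the large one shifts the total noticeably, which is why size-based estimates are worth redoing against real traffic.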

## When NOT to use AI code review

I learned this the hard way. These scenarios will waste your money and frustrate your team:

1. **Codebases you can't send to external APIs**: if your company bans sending code to third-party LLMs, skip it
2. **Generated code**: the AI will flag patterns it doesn't recognize as odd (and they are odd, because they were generated)
3. **As a replacement for humans**: AI catches style issues and simple bugs. Humans catch architectural problems and design inconsistencies
4. **For greenfield projects**: the first few PRs on a new project are mostly scaffolding. AI reviews add little value
5. **When you can't handle false positives**: if your team will start ignoring all AI comments after 3 bad ones, don't start

## What I'd do differently

If I were starting over:

1. **Start with a prompt-only approach**: I jumped to complex chunking and aggregation too early. A good prompt on the full diff catches 70% of issues
2. **Invest in deduplication earlier**: duplicate comments erode trust faster than missed bugs
3. **Track per-developer false positive rate**: some developers get more false positives because their code style triggers the AI more. Adjust sensitivity per developer
4. **Add human feedback loop**: let developers thumbs-up/thumbs-down comments. Use that to fine-tune

The agent runs daily on my repos now. Each week it catches about 3-5 real issues, misses about 1-2 that humans catch, and posts about 2-3 false positives. Not perfect. But it makes the team a little better, and that's enough.

---

*Related: [How to build your first AI agent in 2026](/posts/how-to-build-first-ai-agent-2026/), a step-by-step tutorial from scratch. Also see [LangGraph tutorial for beginners](/posts/langgraph-tutorial-beginners/) for building agent workflows.*