We Ran a Complex Task — A LangChain Repo Analysis with Five Claude Models

Anthropic just shipped Claude Fable. We wanted a real answer to a practical question:

If you run the same complex engineering task on Opus, Fable, Sonnet, and Haiku — what do you actually get back?

Not a benchmark score. Not a vibe check. A full principal-engineer audit of a production open-source monorepo — with evidence, severity labels, and an execution plan.

We ran that experiment inside CTRL NODE: one prompt, five agents, five models, one cloned repository.

1. The goal: one hard task, five models

LangChain principal-engineer audit prompt in a CTRL NODE task

What we tested

We gave every model the same four-phase audit prompt and the same target: the LangChain Python monorepo (a large, mature library ecosystem — not a toy repo).

The prompt asks for:

Repository Map — explore first, judge second
Audit Report — architecture, security, tests, performance, deps, DX, docs (with file:line citations)
Improvement Strategy — themes, trade-offs, measurable “done” criteria
Task Plan — milestones M0–M3, quick wins, effort/risk/deps on each item

Every finding must be evidence-based. Guessing is explicitly forbidden.

That is a genuinely heavy task: thousands of files, real CI configs, security-sensitive deserialization paths, and god-class modules on hot code paths. It is the kind of work teams normally spread across several senior engineers.

Why Fable vs the rest

Fable is positioned as a strong reasoning model for long, structured work. We included it alongside:

Model	Role in the experiment
Claude Opus 4.8	Premium tier — threat modeling baseline
Claude Fable 5	New tier — strategy & execution planning
Claude Sonnet 5	Current Sonnet — primary audit pass
Claude Sonnet 4.6	Previous Sonnet — ops / CI lens
Claude Haiku 4.5	Fast tier — exploration & map

The hypothesis was not “Fable wins everything.” It was: each tier sees different things, and Fable might be the best at turning findings into a shippable backlog.

The prompt

The full prompt lives in our catalog as langchain-prompt.md. Core instruction (abbreviated):

You are a world-class, principal-engineer-level software engineer and technical audit expert.
Perform an in-depth analysis of this code repository, provide an honest audit report,
and offer a prioritized, actionable improvement plan.

Follow four phases in order: Discovery → Audit → Strategy → Task Plan.
All judgments must cite real file paths and line numbers. Do not guess.

Deliverables requested per run:

audit-report-<model>.md — full Markdown report
audit-report-<model>.html — interactive dark-theme dashboard (tabs: Overview, Map, Audit, Strategy, Tasks)

Summary of the prompt: resumen-langchain-prompt.md.

2. How we set it up in CTRL NODE

CTRL NODE project configured for the LangChain audit experiment

We did not paste the prompt into five browser tabs. We ran it the way a team would: Bridge on a real machine, a project work directory pointing at the clone, one agent per model tier.

Prerequisites

Bridge (ctrlnode) installed and paired — see Bridge setup.
Claude SDK API key set in ~/.ctrlnode/.env (providers load automatically — no PROVIDERS flag needed):
```
ANTHROPIC_API_KEY=sk-ant-...
BASE_PATH=/home/you/workspace
```
LangChain cloned on the Bridge host under BASE_PATH (CTRL NODE does not git-clone for you; the work directory points at an existing folder).

Project

In the web app: + NEW PROJECT

Field	Value
NAME	`langchain-audit-experiment`
AGENT TYPE	Claude
WORK DIRECTORY	Browse → select the LangChain clone → USE THIS DIRECTORY
DESCRIPTION	Five-model audit benchmark

The work directory is what lets agents read the full tree in WORK DIRECTORY task mode — the same scope a staff engineer would need.

Agents (one per model)

Team → + ADD AGENT — we created five agents on the same project:

Agent name	MODEL field	Purpose
`audit-opus`	`claude-opus-4-8`	Threat & design review
`audit-fable`	`claude-fable-5`	Strategy & task plan
`audit-sonnet-5`	`claude-sonnet-5`	Primary audit
`audit-sonnet-46`	`claude-sonnet-4-6`	CI / ops pass
`audit-haiku`	`claude-haiku-4-5`	Fast map

Models are selected in the MODEL combobox (synced from Bridge when online) or typed manually. Fable appears as claude-fable-5 in the Bridge model manifest (v2026.2.4+).

Optional AGENT SYSTEM INSTRUCTIONS were left minimal — we wanted the task prompt to carry the spec, not per-agent persona drift.

3. How we ran the prompt

Five audit agents — one per Claude model tier — in CTRL NODE Team

For each agent, same procedure:

+ NEW TASK on the project
TITLE: LangChain principal audit — <model>
INSTRUCTIONS: paste full contents of langchain-prompt.md
ASSIGN TO AGENT: pick the matching agent chip
OUTPUT MODE: WORK DIRECTORY (full repo scope; optional focus paths left empty)
NEW TASK → task lands in Backlog
RUN → dispatches to Bridge → agent moves to In progress

Bridge delivers the task with repositoryPaths and repo dispatch context so the Claude SDK runs against the LangChain tree on disk. Outputs (audit-report-*.md / .html) were collected from the agent’s work directory and copied into our marketing catalog folder.

Tip for reproducibility: use the same commit SHA for every run. Our reports reference LangChain master at 2b47357 where noted.

4. What Fable returned

Audit deliverables in CTRL NODE Files — report and dashboard output

Fable graded the repo A− — the same calibration as Opus, more honest than Haiku’s self-awarded A.

Executive summary (Fable)

Top 3 risks

Complexity concentration — five files exceed 1,800 lines; runnables/base.py is 6,574 LOC. High blast radius on every invoke/stream path.
Unsafe-by-default deserialization — langchain_core.load defaults to allowed_objects='core', documented as unsafe for untrusted manifests. Safe options exist but are opt-in.
Type-safety escape hatches — 208 type: ignore comments in langchain-core alone; disallow_any_generics=false weakens the public API contract.

Top 3 opportunities

Flip deserialization default to a safe allowlist ('messages') on the next major version.
Burn down parked lint TODOs (BLE, ANN401, ERA) — enforcement infra already exists.
Decompose the top god files behind unchanged public façades (zero API break).

What stood out

Fable’s differentiator was not a hotter take on security headlines. It was Phase 3 and Phase 4:

Four strategic themes (complexity, switched-off guardrails, safe-by-default trust boundaries, workspace hygiene)
Explicit non-goals (e.g. don’t rewrite vendored mustache.py this cycle — add property tests instead)
Milestones M0–M3 with workload badges (S/M/L/XL), risk, dependencies, and acceptance criteria
Quick wins you could ship in an afternoon (.gitignore for audit artifacts, logger.debug on swallowed AttributeError in callbacks/usage.py, CI ratchet on type: ignore count)

Near-exclusive Fable findings:

Vendored 704-line Mustache engine (mustache.py) with its own security surface
McCabe C90 complexity lint explicitly disabled — no automated backpressure on god-file growth
Thin test breadth vs complexity for langchain_v1/agents/factory.py (56 test files vs 1,891-line factory)

What Fable did not emphasize

Fable did not surface several issues other models caught:

TOCTOU / DNS rebinding on SSRF paths (Opus)
ShellToolMiddleware host execution by default (Opus)
SSRF transport adopted in only two call sites + unprotected graph_mermaid.py fetch (Sonnet 5)
Commented lockfile check in CI _lint.yml (Sonnet 4.6)
Broken README model example / missing SECURITY.md (Sonnet 4.6)

That gap is the point: Fable is not a replacement for a multi-model pipeline.

Full report: audit-report-fable.md · Interactive dashboard: audit-report-fable.html

5. How the five models compare

Model	Grade	Best at	Weak at
Opus 4.8	A−	Threat modeling (TOCTOU, agent shell defaults, env bypass)	CI lockfile, default `load()`, README gaps
Fable 5	A−	Strategy, milestones, quick wins, engineering debt	Agent-specific threats, SSRF adoption map
Sonnet 5	B+	SSRF infra vs adoption, silent `except`, repo hygiene	Lockfile CI, README, SECURITY.md
Sonnet 4.6	B+	Ops: lockfile CI, `load()` default, onboarding docs	Newer SSRF adoption analysis
Haiku 4.5	A*	Fast LOC map, callback cycles, duplicate translators	*Inflated grade; factual CI error on lockfile

*Haiku’s A looks confident on paper. Cross-checking against Sonnet 4.6 showed a wrong claim about lockfile validation in CI.

Exclusive findings matrix (selected)

Finding	Op	Fb	S5	S4.6	Hk
TOCTOU / DNS rebinding	✓	—	—	—	—
Shell host by default	✓	—	—	—	—
SSRF transport ~2 call sites	—	—	✓	—	—
`graph_mermaid.py` no SSRF	—	—	✓	—	—
Default `load()` unsafe	—	✓	—	✓	—
Plan M0–M3 + non-goals	—	✓	—	—	—
`mustache.py` / C90 off	—	✓	—	—	—
Lockfile CI commented	—	—	—	✓	✗ wrong
Callback/tracer cycles	—	—	—	—	✓

The pipeline we’d actually use

Haiku        → fast map & architecture hotspots
Sonnet 5     → primary audit + security adoption gaps
Sonnet 4.6   → CI, docs, onboarding landmines
Opus         → threat review for agent-facing surfaces
Fable        → merge into one prioritized backlog
Human        → verify _lint.yml, load.py, README in your checkout

No single model replaces this chain. Paying only for Opus — or only for Fable — leaves blind spots.

Deep dive: comparison-models-report.md

Slide deck for the story

We also built a 14-slide presenter deck for video walkthroughs: model-comparison-presentation.html (←/→ navigate, F fullscreen).

6. What this means for CTRL NODE users

Model choice is a workflow decision, not a vanity tier pick. Use Haiku to scout, Sonnet to audit, Opus for threats, Fable to plan — on the same project and work directory.
WORK DIRECTORY mode matters for tasks like this. An output-only sandbox would not have produced file:line citations across CI, core, and partner packages.
Fable earns a slot after discovery, not instead of Sonnet or Opus. Its A− grade matched Opus; its deliverable shape (milestones, ratchets, non-goals) was the most actionable.
Re-run the experiment on your repo — clone under Bridge BASE_PATH, point a Claude project at it, duplicate the task five times with different MODEL values.

7. References — all artifacts

The full experiment — every prompt, per-model report, and the comparison deck — is published below as supporting material for this article.

Prompt

File	Description
`langchain-prompt.md`	Full four-phase audit prompt (English)
`resumen-langchain-prompt.md`	Prompt summary (Spanish)

Per-model reports

Model	Markdown	HTML dashboard
Claude Fable 5	`audit-report-fable.md`	`audit-report-fable.html`
Claude Opus 4.8	`audit-report-opus.md`	`audit-report-opus.html`
Claude Sonnet 5	`audit-report-sonnet-5.md`	`audit-report-sonnet-5.html`
Claude Sonnet 4.6	`audit-report-sonnet-4-6.md`	`audit-report-sonnet-4-6.html`
Claude Haiku 4.5	`audit-report-haiku.md`	`audit-report-haiku.html`

The prompt asks every model for paired .md + .html outputs. Every model in this batch produced both formats.

Comparison & media

File	Description
`comparison-models-report.md`	Full five-model written comparison
`model-comparison-presentation.html`	Animated 14-slide deck (Op · Fb · S5 · S4.6 · Hk)

Try it yourself

Start free — create a Claude project and pair Bridge.
Clone the repo you care about on the Bridge machine; set WORK DIRECTORY.
Register agents with different MODEL values (claude-fable-5, claude-opus-4-8, …).
Paste the audit prompt into INSTRUCTIONS, assign, RUN, compare outputs.

Questions or want us to run this on your stack? info@ctrlnode.ai

Experiment date: 17 June 2026 · CTRL NODE — orchestrate Claude, Copilot, Gemini, Cursor, and more from one control plane.