
LangChain Monorepo — Technical Audit Report

Scope: langchain/ Python monorepo (langchain-core, langchain (v1), langchain-classic, text-splitters, standard-tests, model-profiles, 16 partner packages).
Method: Evidence-based static review. All file:line references are to the repository tree rooted at the directory containing this file. Where something could not be verified statically (e.g. live coverage %, runtime behavior), it is labeled as such.
Audit date: 2026-06-17.


1. Executive Summary

Overall health grade: A− (strong, mature, production-grade project with a small number of real but bounded security/design risks).

LangChain is a large, actively maintained, MIT-licensed Python monorepo that is the de-facto standard framework for building LLM applications and agents. The engineering culture is unusually disciplined for an OSS project of this size: ruff is configured with select = ["ALL"], mypy runs in strict mode, GitHub Actions are pinned to full commit SHAs, CI is change-scoped for speed, dependency ranges are bounded, and there is a dedicated _security package with SSRF protection and a documented usedforsecurity=False posture around SHA-1. The codebase is well-documented (Google-style docstrings enforced) and has a deep test footprint (167 test files in core, 90 in langchain-v1). The grade is held just below A because of a handful of architectural and security items that matter at this project's scale: an SSRF guard that is inherently vulnerable to DNS-rebinding (time-of-check/time-of-use), an environment-variable-driven validation bypass that is broader than its docstring claims, a host-shell agent tool that defaults to full host access, and several genuine God-files (notably runnables/base.py at 6,574 lines).

Top 3 risks

  1. SSRF protection is TOCTOU-vulnerable (DNS rebinding). validate_safe_url / validate_url resolve DNS at validation time, but the real HTTP request resolves DNS again later — an attacker-controlled DNS record can pass validation then re-point to a private IP. (libs/core/langchain_core/_security/_ssrf_protection.py:86, libs/core/langchain_core/_security/_policy.py:259)
  2. Environment-variable SSRF bypass is broader than documented. _effective_allowed_hosts allows localhost/testserver for any LANGCHAIN_ENV starting with "local", while validate_safe_url's own bypass and docstrings describe a narrower local_test condition. (libs/core/langchain_core/_security/_policy.py:231, _ssrf_protection.py:69)
  3. ShellToolMiddleware defaults to HostExecutionPolicy (full host shell, redaction is post-execution only). Safe defaults matter because agents execute model-chosen commands. (libs/langchain_v1/langchain/agents/middleware/shell_tool.py:503, :565, :538)

Top 3 opportunities

  1. Adopt connection-time IP pinning / a custom transport for SSRF to close the DNS-rebinding gap (a _transport.py already exists — wire validation into the actual socket connect).
  2. Decompose the God-files (runnables/base.py 6,574 lines; callbacks/manager.py 2,792; language_models/chat_models.py 2,714) to improve navigability, review velocity, and import cost.
  3. Tighten the security defaults and make them explicit (make the host shell opt-in, narrow the env bypass, document the default key_encoder); high trust impact, low effort.

2. Repository Map (Phase 1)

Purpose & maturity

  • Purpose: "The agent engineering platform" — a framework for building agents and LLM-powered applications with a standard interface across model providers, embeddings, vector stores, retrievers, and tools.
  • Intended users: Python application developers building LLM/agent apps; partner integrators.
  • Maturity: Production library. pyproject.toml classifiers declare Development Status :: 5 - Production/Stable. langchain-core==1.4.3, langchain==1.3.6. (libs/core/pyproject.toml:11, :24; libs/langchain_v1/pyproject.toml:24)

Tech stack

Area                     | Choice
Language                 | Python >=3.10,<4.0 (3.10–3.14 classifiers)
Packaging/build          | uv workspace + hatchling build backend; per-package pyproject.toml + uv.lock
Core runtime deps (core) | pydantic>=2.7.4,<3, langsmith, tenacity, jsonpatch, PyYAML, typing-extensions, packaging, uuid-utils, langchain-protocol
Agents                   | langgraph>=1.2.4,<1.3 (langchain v1 depends on langgraph)
Lint/format              | ruff (select = ["ALL"])
Types                    | mypy strict = true, pydantic mypy plugin
Tests                    | pytest, pytest-asyncio (auto), syrupy snapshots, pytest-socket (no-network enforcement), pytest-xdist, blockbuster, pytest-benchmark/codspeed
CI/CD                    | GitHub Actions (27 workflows), change-scoped matrix, SHA-pinned actions, manual release workflow

Architecture sketch

langchain-protocol (external)        langgraph (external, 1.2.x)
        │                                   │
        ▼                                   ▼
  langchain-core  ──────────────────►  langchain (v1, public)
  (base abstractions:                  (init_chat_model, create_agent,
   Runnables, messages,                 middleware, tools, structured output)
   tools, callbacks,                          │
   _security, indexing)                       │  optional extras
        ▲                                     ▼
        │                       partners/* (openai, anthropic, ollama, groq,
  text-splitters                 mistralai, huggingface, qdrant, chroma, exa,
  standard-tests                 nomic, fireworks, deepseek, openrouter,
  model-profiles                 perplexity, xai)
        │
        └──► langchain-classic (libs/langchain) — legacy, maintenance-only

Key directories (one line each)

Path                              | Description
libs/core/langchain_core/         | Base abstractions: Runnables, messages, tools, callbacks, tracers, indexing, _security.
libs/langchain_v1/langchain/      | Actively maintained public langchain package: init_chat_model, agents, middleware, tools.
libs/langchain/langchain_classic/ | Legacy langchain-classic package (maintenance only, no new features).
libs/partners/*/                  | 16 first-party provider integrations, each its own package.
libs/text-splitters/              | Document chunking utilities.
libs/standard-tests/              | Shared standardized test suite for partner integrations.
libs/model-profiles/              | Model capability profile data + langchain-profiles CLI.
.github/workflows/                | 27 CI/CD workflows (lint, test, release, labeling, codspeed perf).

What surprised me (positively & otherwise)

  • A dedicated _security package with a real, policy-driven SSRF implementation (IPv4/IPv6 blocklists, cloud-metadata IPs, NAT64-embedded-IPv4 extraction, k8s .svc.cluster.local blocking). This is far more than most OSS libraries ship. (libs/core/langchain_core/_security/_policy.py)
  • ruff select = ["ALL"] + mypy strict across the monorepo — an aggressive quality bar that is rare at this scale.
  • A LANGCHAIN_ENV-based validation bypass baked into the security policy (_policy.py:231) — convenient for tests but a security-relevant surprise.
  • AGENTS.md and CLAUDE.md are byte-identical (318 lines each) — duplicated guidance rather than one file referencing the other.
  • The directory layout has a doubled root (langchain/langchain/) and the repo is a shallow git clone (.git/shallow present), so full history-based analysis is not possible here.

3. Audit Report (Phase 2)

Findings are grouped by dimension and sorted by severity. Each is tagged [Fact] (directly verifiable in a file) or [Judgment] (informed assessment).

Security

S1 — SSRF validation is TOCTOU / DNS-rebinding vulnerable — High [Fact + Judgment]

  • What: validate_safe_url resolves the hostname via socket.getaddrinfo and validates the returned IPs, then returns the URL string. The actual HTTP request happens later in the caller and re-resolves DNS. An attacker controlling the DNS record can return a public IP during validation and a private/metadata IP at fetch time.
  • Where: libs/core/langchain_core/_security/_ssrf_protection.py:86–98; async equivalent libs/core/langchain_core/_security/_policy.py:259–268.
  • Why it matters: The function's stated purpose is to "prevent SSRF" (_ssrf_protection.py:49). Validating the URL but not pinning the validated IP at connection time means the guarantee does not hold against an active attacker. Consequences: access to cloud metadata (credentials) and internal services.
  • Severity: High.
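
The check-then-use gap can be illustrated with a small, self-contained sketch. It does not use LangChain's code: RebindingResolver, is_public, and validate_then_fetch are hypothetical names, and the resolver is a stub standing in for attacker-controlled DNS.

```python
import ipaddress


class RebindingResolver:
    """Stub DNS resolver (hypothetical): public IP first, metadata IP afterwards."""

    def __init__(self) -> None:
        self.calls = 0

    def resolve(self, hostname: str) -> str:
        self.calls += 1
        # The first lookup (validation) sees a public address; later lookups
        # (the real request) see the cloud-metadata address.
        return "93.184.216.34" if self.calls == 1 else "169.254.169.254"


def is_public(ip: str) -> bool:
    """Return True if the address is globally routable."""
    return ipaddress.ip_address(ip).is_global


def validate_then_fetch(hostname: str, resolver: RebindingResolver) -> str:
    """Check-then-use: validates one lookup, then connects using a second one."""
    if not is_public(resolver.resolve(hostname)):
        raise ValueError("blocked at validation time")
    # An HTTP client would resolve again here; that answer is never validated.
    return resolver.resolve(hostname)


connected_to = validate_then_fetch("attacker.example", RebindingResolver())
print(connected_to, is_public(connected_to))
```

Validation succeeds, yet the address actually used is the link-local metadata IP. Any fix has to make the validated address and the connected address the same value.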

S2 — Env-driven SSRF bypass is broader than its docstring — Medium [Fact]

  • What: _effective_allowed_hosts adds localhost and testserver to the allow-list whenever LANGCHAIN_ENV starts with "local" (e.g. local, localdev, local_anything). Separately, validate_safe_url has its own bypass requiring LANGCHAIN_ENV == "local_test" AND hostname test...server.
  • Where: libs/core/langchain_core/_security/_policy.py:231; libs/core/langchain_core/_security/_ssrf_protection.py:69–74.
  • Why it matters: Two different bypass conditions for the same subsystem are confusing and the _policy.py one is wider than a reader of validate_safe_url would expect. If an environment is misconfigured (or an attacker can influence env), localhost SSRF is silently re-enabled. The bypass is undocumented in the public docstring.
  • Severity: Medium.
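
To make the mismatch concrete, here is a minimal sketch of the two environment checks as this finding describes them. The function names are hypothetical, the bodies are a paraphrase rather than the repository's code, and the additional hostname condition in validate_safe_url is deliberately left out.

```python
def policy_allows_localhost(langchain_env: str) -> bool:
    """Wider check, as this finding describes _effective_allowed_hosts."""
    return langchain_env.startswith("local")


def validator_env_matches(langchain_env: str) -> bool:
    """Narrower check, as this finding describes validate_safe_url.

    The additional hostname condition is left out of this sketch.
    """
    return langchain_env == "local_test"


# Values that satisfy the wide check but not the narrow one.
surprising = [
    env
    for env in ("local", "localdev", "local_test", "production")
    if policy_allows_localhost(env) and not validator_env_matches(env)
]
print(surprising)
```

Every value in the resulting list re-enables localhost through the policy layer even though a reader of validate_safe_url would expect no bypass to apply.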

S3 — ShellToolMiddleware defaults to full host shell access — High [Fact + Judgment]

  • What: When no execution_policy is supplied, the middleware uses HostExecutionPolicy() — the model can run arbitrary commands on the host. Redaction rules are applied after execution and explicitly "do not prevent exfiltration of secrets" under host policy.
  • Where: libs/langchain_v1/langchain/agents/middleware/shell_tool.py:503 (class docstring), :565 (default), :538 (warning).
  • Why it matters: This is opt-out rather than opt-in for the most dangerous capability an agent can have. The risk is partially mitigated by documentation, but a "safe by default" posture (e.g. require an explicit policy, or default to a sandbox when available) is the safer design.
  • Severity: High (by impact; it is an intentional, documented design choice, so partly a Judgment on default-selection).

S4 — SHA-1 is the default key_encoder for the indexing API — Low [Fact]

  • What: index/aindex default key_encoder="sha1". A one-time UserWarning is emitted and usedforsecurity=False is set, but SHA-1 remains the default fingerprint algorithm.
  • Where: libs/core/langchain_core/indexing/api.py:307, :646, :46, :55–70.
  • Why it matters: SHA-1 is not collision-resistant; the code itself warns of this. For document de-duplication this is mostly a correctness/robustness concern (deliberate collisions could cause documents to be treated as identical). Defaulting to blake2b/sha256 would be safer, but changing a default is a breaking change — hence Low + documented.
  • Severity: Low.
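
For reference, a minimal sketch of what a content fingerprint with a selectable algorithm can look like using only hashlib. The fingerprint helper is hypothetical and is not the indexing API's implementation; it only shows the usedforsecurity=False flag and the stronger alternatives named above.

```python
import hashlib


def fingerprint(text: str, algorithm: str = "sha1") -> str:
    """Hash document content for de-duplication (hypothetical helper)."""
    data = text.encode("utf-8")
    if algorithm == "sha1":
        # usedforsecurity=False declares a non-cryptographic use of SHA-1.
        return hashlib.sha1(data, usedforsecurity=False).hexdigest()
    if algorithm in {"sha256", "blake2b"}:
        return hashlib.new(algorithm, data).hexdigest()
    raise ValueError(f"unsupported algorithm: {algorithm}")


for name in ("sha1", "sha256", "blake2b"):
    print(name, len(fingerprint("same document", name)))
```

Switching algorithms changes every key, which is why an index built with one encoder cannot be reused with another. That is the breaking-change concern noted above.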

S5 — Proactive dependency-CVE pinning is present (positive, but note maintenance burden) — Low [Fact]

  • What: constraint-dependencies pin pygments>=2.20.0 # CVE-2026-4539 (core) and urllib3>=2.6.3, pygments>=2.20.0 (langchain v1).
  • Where: libs/core/pyproject.toml:82; libs/langchain_v1/pyproject.toml:96.
  • Why it matters: Demonstrates active CVE tracking. The minor risk is that hand-maintained constraint comments can drift; these belong in a tracked SCA process. Largely a strength.
  • Severity: Low.

Architecture & Design

A1 — God-file: runnables/base.py at 6,574 lines — Medium [Fact + Judgment]

  • What: The core Runnable abstraction file is 6,574 lines; callbacks/manager.py 2,792; language_models/chat_models.py 2,714; messages/utils.py 2,400.
  • Where: libs/core/langchain_core/runnables/base.py (6574 LOC).
  • Why it matters: Single very large modules raise the cost of review, increase merge-conflict surface, slow IDE/type-checker performance, and inflate import time. Runnable is the most central abstraction, so the blast radius of any change here is large.
  • Severity: Medium (it is cohesive and stable, so this is partly Judgment).

A2 — init_chat_model provider registry is a hardcoded God-dict — Low [Fact]

  • What: _BUILTIN_PROVIDERS hardcodes 28 providers with import paths/class names/creator lambdas, plus a parallel _attempt_infer_model_provider prefix table and a docstring list — three sources of the same truth that must be kept in sync.
  • Where: libs/langchain_v1/langchain/chat_models/base.py:38–100, :521–594, :207–309 (docstring).
  • Why it matters: Adding/renaming a provider requires editing three places; drift produces confusing inference behavior. Low because it is well-contained and covered by the CLAUDE.md "FOR CONTRIBUTORS" note.
  • Severity: Low.

A3 — Three coexisting langchain packages (core / v1 / classic) — Low [Judgment]

  • What: libs/core (langchain-core), libs/langchain_v1 (langchain), libs/langchain (langchain-classic) coexist; CLAUDE.md labels classic "legacy, no new features."
  • Where: libs/langchain/, libs/langchain_v1/, CLAUDE.md:16–17.
  • Why it matters: Necessary for a major-version migration, but newcomers can edit the wrong package. The directory name langchain_v1 vs published name langchain is a known footgun.
  • Severity: Low.

Code Quality

Q1 — Broad-exception handling is intentionally allowed and used — Medium [Fact]

  • What: ruff ignores the BLE (blind-except) rule monorepo-wide; 28 except (Base)Exception/bare-pattern occurrences exist across 9 files in langchain_v1/langchain, e.g. _create_resources catches BaseException (shell_tool.py:716, :775).
  • Where: libs/core/pyproject.toml:114 and libs/langchain_v1/pyproject.toml:145 ("BLE" ignored); occurrences in factory.py, structured_output.py, model_fallback.py, summarization.py, types.py, shell_tool.py, etc.
  • Why it matters: Catching BaseException can swallow KeyboardInterrupt/SystemExit and mask real errors. Several uses are legitimate (resource cleanup re-raises), but disabling the lint rule globally removes the guardrail that would force each case to be justified.
  • Severity: Medium.
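
The difference between a legitimate broad catch and a harmful one can be shown with a short sketch. Resource, create_with_cleanup, and create_swallowing are hypothetical and do not reproduce the code in shell_tool.py.

```python
from typing import Callable, Optional


class Resource:
    """Hypothetical resource that records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def create_with_cleanup(resource: Resource, setup: Callable[[Resource], None]) -> Resource:
    """Broad catch that is safe: release the resource, then re-raise unchanged."""
    try:
        setup(resource)
    except BaseException:
        resource.close()
        raise
    return resource


def create_swallowing(resource: Resource, setup: Callable[[Resource], None]) -> Optional[Resource]:
    """Broad catch that hides the error, including KeyboardInterrupt."""
    try:
        setup(resource)
    except BaseException:
        resource.close()
        return None
    return resource


def interrupt(_: Resource) -> None:
    """Simulate the user pressing Ctrl-C during setup."""
    raise KeyboardInterrupt
```

With interrupt as the setup step, create_with_cleanup closes the resource and lets the interrupt propagate, while create_swallowing returns None and the caller never learns that the user tried to stop the program. A per-case lint justification is what distinguishes the two in review.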

Q2 — mypy strictness is partially disabled with TODO markers — Low [Fact]

  • What: core sets disallow_any_generics = false # TODO: activate for 'strict' checking; langchain v1 sets warn_return_any = false # TODO. v1 also excludes several agent test trees from type checking.
  • Where: libs/core/pyproject.toml:94–95; libs/langchain_v1/pyproject.toml:112–120.
  • Why it matters: These are honest, tracked gaps in an otherwise strict config; they leave some Any-leakage unchecked in central code.
  • Severity: Low.

Q3 — ANN401 (no Any in annotations) globally ignored — Low [Fact]

  • What: Any annotations are pervasive (e.g. _ConfigurableModel.invoke(... ) -> Any, **kwargs: Any). The ANN401 rule is in the ignore list.
  • Where: libs/core/pyproject.toml:113; libs/langchain_v1/pyproject.toml:144; usage throughout chat_models/base.py.
  • Why it matters: Any is sometimes unavoidable at framework boundaries (pluggable kwargs), but blanket-ignoring the rule means accidental Any is invisible. Low — largely a pragmatic framework tradeoff.
  • Severity: Low.

Testing

T1 — Substantial unit-test footprint with network isolation enforced — Strength/Low [Fact]

  • What: 167 test files in libs/core, 90 in libs/langchain_v1; pytest-socket blocks network in unit tests; snapshot testing via syrupy; blockbuster detects blocking calls in async paths.
  • Where: test trees under libs/*/tests; libs/core/pyproject.toml:61–78, :146–154.
  • Why it matters: Strong baseline. The one caveat: actual coverage % could not be measured statically here, so coverage-gap claims are deferred.
  • Severity: Low (informational).

T2 — Whole agent test trees excluded from type checking — Medium [Fact]

  • What: mypy excludes tests/unit_tests/agents/middleware/, .../specifications/, and test_*.py under agents; ruff also relaxes ANN/ARG for tests/unit_tests/agents/* and disables ALL rules for test_react_agent.py.
  • Where: libs/langchain_v1/pyproject.toml:112–117, :161–168.
  • Why it matters: The agents subsystem is the newest and highest-churn area; excluding its tests from type/lint checks reduces the safety net exactly where it is most needed.
  • Severity: Medium.

Performance

P1 — Linear blocklist scans per IP in the SSRF hot path — Low [Fact]

  • What: _ip_in_blocked_networks iterates all blocked networks for each address; the code itself notes "if profiling shows this is a hot path, consider memoising".
  • Where: libs/core/langchain_core/_security/_policy.py:138–183 (note at :143).
  • Why it matters: Negligible for typical request volumes; only relevant if used in tight retrieval loops. Documented and bounded.
  • Severity: Low.
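
If profiling ever did justify it, the memoisation the code comment hints at could look roughly like the sketch below. The network list is an illustrative subset and ip_is_blocked is a hypothetical name, not the signature of _ip_in_blocked_networks.

```python
import ipaddress
from functools import lru_cache

# Illustrative subset of blocked ranges; the real policy covers more.
BLOCKED_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "127.0.0.0/8",
        "169.254.0.0/16",
        "192.168.0.0/16",
        "fc00::/7",
    )
)


@lru_cache(maxsize=4096)
def ip_is_blocked(ip: str) -> bool:
    """Linear scan over the blocklist, memoised per address string."""
    addr = ipaddress.ip_address(ip)
    # Containment across IP versions is simply False, so a mixed list is safe.
    return any(addr in net for net in BLOCKED_NETWORKS)


print(ip_is_blocked("169.254.169.254"), ip_is_blocked("8.8.8.8"))
print(ip_is_blocked("169.254.169.254"), ip_is_blocked.cache_info().hits)
```

The cache is keyed on the address string, so it only helps when the same addresses recur. It also needs clearing if the blocklist can change at runtime.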

P2 — Per-line encode("utf-8") and list-append accumulation in shell output collection — Low [Judgment]

  • What: _collect_output encodes every line to measure bytes and appends to a Python list; for very chatty commands this is O(lines) allocations.
  • Where: libs/langchain_v1/langchain/agents/middleware/shell_tool.py:277–298.
  • Why it matters: Output is already truncated by line/byte limits, so unbounded growth is mitigated; allocation overhead is minor.
  • Severity: Low.

Dependencies

D1 — Bounded version ranges + per-package lockfiles — Strength/Low [Fact]

  • What: All runtime deps use bounded ranges (e.g. pydantic>=2.7.4,<3, langgraph>=1.2.4,<1.3); each package ships uv.lock; dependabot.yml present.
  • Where: libs/core/pyproject.toml:26–36; libs/langchain_v1/pyproject.toml:26–30; libs/*/uv.lock.
  • Why it matters: Reproducible builds and controlled upgrades. Strong.
  • Severity: Low (informational).

Developer Experience & Operations

O1 — Pre-commit lint/format hooks omit several partner packages — Medium [Fact]

  • What: .pre-commit-config.yaml defines per-package format/lint hooks for core, langchain, standard-tests, text-splitters, anthropic, chroma, exa, fireworks, groq, huggingface, mistralai, nomic, ollama, openai, and qdrant, but deepseek, openrouter, perplexity, and xai (which exist under libs/partners/) have no corresponding local hook.
  • Where: .pre-commit-config.yaml:48–113; partner dirs from libs/partners/ listing.
  • Why it matters: Contributors editing those partner packages get no local format/lint enforcement; they rely solely on CI. Inconsistent DX and a drift risk as packages are added.
  • Severity: Medium.
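
A drift check for this finding can be very small. The sketch below hardcodes the package names listed above. missing_hooks is a hypothetical helper, and a real check would read the names from libs/partners/ and from .pre-commit-config.yaml rather than from literals.

```python
def missing_hooks(partner_dirs: set[str], hooked_packages: set[str]) -> list[str]:
    """Return partner packages that have no local pre-commit hook."""
    return sorted(partner_dirs - hooked_packages)


# Names taken from this finding, hardcoded for illustration only.
hooked = {
    "anthropic", "chroma", "exa", "fireworks", "groq", "huggingface",
    "mistralai", "nomic", "ollama", "openai", "qdrant",
}
partners = hooked | {"deepseek", "openrouter", "perplexity", "xai"}

print(missing_hooks(partners, hooked))
```

Failing CI whenever the returned list is non-empty would keep new partner packages from silently skipping local enforcement.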

O2 — AGENTS.md and CLAUDE.md are duplicated verbatim — Low [Fact]

  • What: The two files are identical 318-line copies of the same guidance.
  • Where: AGENTS.md, CLAUDE.md.
  • Why it matters: Two copies will drift; one should be the source of truth and the other a pointer (or a symlink / generated file checked in CI). There is a check_agents_sync.yml workflow, which suggests sync is enforced — but maintaining two full copies is still heavier than necessary.
  • Severity: Low.

O3 — Mature, security-conscious CI — Strength/Low [Fact]

  • What: 27 workflows; change-scoped test matrix; GitHub Actions pinned to full commit SHAs (e.g. actions/checkout@de0fac2e…); least-privilege permissions: contents: read; concurrency cancellation.
  • Where: .github/workflows/check_diffs.yml:33–56; CLAUDE.md:310–312 (SHA-pin policy).
  • Why it matters: Strong supply-chain hygiene.
  • Severity: Low (informational).

Documentation

DOC1 — Docstrings are extensive and enforced — Strength [Fact]

  • What: Google-style docstrings enforced via ruff pydocstyle; init_chat_model has a multi-hundred-line docstring with examples; security functions document Raises.
  • Where: libs/*/pyproject.toml pydocstyle config; chat_models/base.py:218–474.
  • Severity: Strength.

DOC2 — The SSRF bypass behavior is not surfaced in the public docstring — Low [Fact]

  • What: validate_safe_url's docstring describes blocking private/metadata but not the LANGCHAIN_ENV test bypass (_ssrf_protection.py:69) nor the _policy.py:231 localhost allowance.
  • Where: _ssrf_protection.py:47–63 vs :69–74; _policy.py:231.
  • Severity: Low.

Strengths (preserve these)

  • Dedicated, policy-based SSRF protection with IPv6/NAT64/cloud-metadata awareness — rare and valuable. (_policy.py)
  • ruff ALL + mypy strict monorepo-wide quality bar.
  • SHA-pinned GitHub Actions, least-privilege permissions, change-scoped CI.
  • Bounded dependency ranges + per-package lockfiles and active CVE pinning.
  • Deep unit-test footprint with network isolation (pytest-socket) and async-blocking detection (blockbuster).
  • Strong, enforced documentation standards and contributor guidance (CLAUDE.md/AGENTS.md).
  • Clean layered architecture (core → langchain → partners) with a deliberate classic/v1 split for migration.

4. Improvement Strategy (Phase 3)

Theme 1 — "Security guarantees should be end-to-end, not point-in-time"

  • Explains: S1 (TOCTOU), S2 (env bypass), DOC2.
  • Target state: SSRF validation pins the validated IP through to the actual socket connect (no second, unvalidated DNS resolution), the env bypass has exactly one well-documented condition, and all bypasses are documented in the public docstring.
  • Principles: Time-of-check must equal time-of-use; least surprise; document security-relevant escape hatches.

Theme 2 — "Dangerous capabilities should be safe-by-default and opt-in"

  • Explains: S3 (host shell default).
  • Target state: The most dangerous middleware (host shell) requires an explicit execution policy or defaults to the strongest available sandbox; the host policy is a conscious opt-in.
  • Principles: Secure defaults; principle of least privilege for agent tools.

Theme 3 — "Decompose the central God-files to protect velocity"

  • Explains: A1, partially A2.
  • Target state: runnables/base.py and the other 2k+-line core modules are split along cohesive seams (sync/async, declarative ops, schema) behind a stable public surface, with no public API changes.
  • Principles: High cohesion / low coupling; keep public __init__ exports stable (CLAUDE.md's stable-interface rule).

Theme 4 — "Make the quality net uniform across the monorepo"

  • Explains: O1 (missing pre-commit hooks), T2 (agent tests excluded from typing), Q2/Q3 (strictness TODOs).
  • Target state: Every package present in libs/partners/ has a pre-commit hook; agent tests are type-checked; strictness TODOs are burned down or ticketed.
  • Principles: Consistency reduces cognitive load and drift; the safety net should be strongest in the highest-churn area (agents).

Trade-offs — what NOT to fix now (and why)

  • Do not change key_encoder default from SHA-1 (S4) — it is a breaking change for existing indexes; the warning + usedforsecurity=False are adequate for now. Revisit at the next major version.
  • Do not re-enable BLE/ANN401 globally overnight (Q1/Q3) — would generate large, low-signal churn across a 1.4M+ token codebase. Burn down per-package instead.
  • Do not merge classic/v1/core (A3) — the split is intentional for the v1 migration; consolidating now is high-risk and low-reward.
  • Do not micro-optimize the SSRF blocklist (P1) or shell output loop (P2) — bounded and not on a measured hot path; the code already flags where to optimize if profiling justifies it.

Definition of done (measurable signals)

  • No High security findings remain (S1, S3 resolved or explicitly accepted with mitigations).
  • SSRF subsystem has exactly one documented env bypass; a regression test asserts a rebinding-style scenario is blocked at connect time.
  • ShellToolMiddleware has no implicit HostExecutionPolicy default (or a test asserting the documented opt-in).
  • Every directory under libs/partners/ has a matching pre-commit hook (CI check passes).
  • Agent test trees are type-checked (removed from mypy exclude) or each exclusion has a tracked ticket.
  • runnables/base.py reduced below an agreed LOC budget with no public API diff (snapshot of __init__ exports unchanged).

5. Task Plan (Phase 4)

Workload: S = <2h · M = half day · L = 1–2 days · XL = needs breakdown.

⚡ Quick Wins (high-impact, S-effort, do immediately)

  • QW1 — Unify the SSRF env bypass + document it. Make _effective_allowed_hosts use the same single, narrow condition as validate_safe_url, and document the bypass in the public docstring. (S, low risk)
  • QW2 — Add pre-commit hooks for deepseek, openrouter, perplexity, xai. Mirror existing per-package hook blocks. (S, low risk)
  • QW3 — Collapse AGENTS.md/CLAUDE.md duplication to one source + a pointer, relying on check_agents_sync.yml. (S, low risk)
  • QW4 — Document the SHA-1 key_encoder default and recommend blake2b/sha256 in the index/aindex docstrings (no behavior change). (S, no risk)

Milestone 0 — Safety Net (do before refactoring)

M0.1 — Add SSRF rebinding regression tests

  • Description: Add unit tests that simulate a host resolving to a public IP at validation and a private/metadata IP at "connect" time, asserting the request is blocked.
  • Affected: libs/core/tests/unit_tests/_security/, _security/_ssrf_protection.py, _policy.py.
  • Acceptance: Test fails against current code (demonstrating the gap), passes after S1 fix.
  • Workload: M · Risk: Low · Depends on: none.

M0.2 — Snapshot public API surface of langchain_core.runnables

  • Description: Capture the exported names of runnables/__init__.py as a test fixture to guard the M2 refactor.
  • Affected: libs/core/tests/unit_tests/runnables/.
  • Acceptance: A test asserts the export set is unchanged.
  • Workload: S · Risk: Low · Depends on: none.
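
One possible shape for such a guard, sketched against a stand-in module so it runs without the repository. export_set and diff_exports are hypothetical helpers, and a real test would import langchain_core.runnables and compare against a checked-in snapshot.

```python
import types


def export_set(module: types.ModuleType) -> frozenset[str]:
    """Return the names a module declares as public via __all__."""
    return frozenset(getattr(module, "__all__", ()))


def diff_exports(snapshot: frozenset[str], current: frozenset[str]) -> dict[str, list[str]]:
    """Describe drift between a stored snapshot and the current exports."""
    return {
        "removed": sorted(snapshot - current),
        "added": sorted(current - snapshot),
    }


# Stand-in module; a real test would import langchain_core.runnables.
fake = types.ModuleType("fake_runnables")
fake.__all__ = ["Runnable", "RunnableParallel", "RunnableSequence"]
snapshot = export_set(fake)

# Simulate a refactor that accidentally drops one export.
fake.__all__ = ["Runnable", "RunnableSequence"]
drift = diff_exports(snapshot, export_set(fake))
print(drift)
```

Reporting removed and added names separately makes the failure message actionable: a removal is a breaking change, while an addition may only need the snapshot updated.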

Milestone 1 — Critical Fixes (security & correctness)

M1.1 — Close the SSRF TOCTOU gap (IP pinning at connect) (TOP PRIORITY #1)

  • Description: Wire validated IPs into the actual transport so the connection uses the IP that was validated, eliminating the second DNS resolution. Leverage the existing _security/_transport.py.
  • Affected: _security/_transport.py, _security/_ssrf_protection.py, callers that fetch URLs.
  • Acceptance: M0.1 rebinding test passes; existing SSRF tests pass; no public signature change to validate_safe_url.
  • Workload: L · Risk: Medium (touches request path) · Depends on: M0.1.

M1.2 — Make ShellToolMiddleware safe-by-default (TOP PRIORITY #2)

  • Description: Require an explicit execution_policy, OR default to the strongest available sandbox (CodexSandboxExecutionPolicy/DockerExecutionPolicy) when present, falling back to host only with an explicit flag.
  • Affected: libs/langchain_v1/langchain/agents/middleware/shell_tool.py:508–571.
  • Acceptance: Constructing the middleware without a policy does not silently grant host shell; a test asserts the documented default; docstring updated.
  • Workload: M · Risk: Medium (default change is user-visible — follow CLAUDE.md stable-interface rule, use keyword-only + warn) · Depends on: none.

M1.3 — Unify & document the env bypass (QW1, promoted)

  • Description: Single bypass condition + public docstring note.
  • Affected: _policy.py:231, _ssrf_protection.py:69.
  • Acceptance: One code path; test covers it; docstring documents it.
  • Workload: S · Risk: Low · Depends on: none.

Milestone 2 — High-Leverage Improvements

M2.1 — Decompose runnables/base.py (TOP PRIORITY #3)

  • Description: Split the 6,574-line module into cohesive submodules (e.g. base protocol, sync impl, async impl, declarative/config ops, schema) re-exported from runnables/__init__.py.
  • Affected: libs/core/langchain_core/runnables/base.py (+ new submodules), runnables/__init__.py.
  • Acceptance: M0.2 export snapshot unchanged; mypy strict + ruff pass; import time not regressed.
  • Workload: XL (needs design breakdown) · Risk: Medium-High (most central abstraction) · Depends on: M0.2.

M2.2 — Type-check the agents test trees

  • Description: Remove the mypy excludes for agents tests; fix resulting errors incrementally.
  • Affected: libs/langchain_v1/pyproject.toml:112–117, :161–168; agent test files.
  • Acceptance: mypy . passes without the excludes (or excludes reduced with tickets for the rest).
  • Workload: L · Risk: Low · Depends on: none.

M2.3 — Single source of truth for the provider registry

  • Description: Derive the inference prefix table and docstring provider list from _BUILTIN_PROVIDERS (or a generated check) to prevent drift.
  • Affected: libs/langchain_v1/langchain/chat_models/base.py:38–100, :521–594.
  • Acceptance: A test asserts inference table ⊆ registry; adding a provider requires one edit.
  • Workload: M · Risk: Low · Depends on: none.
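
A rough sketch of the drift check and the derived docstring list follows. The registry contents, the prefix table, and both helper names are illustrative assumptions; the real _BUILTIN_PROVIDERS has more entries and a different value shape.

```python
# Illustrative registry; the real one has more entries and fields.
PROVIDERS: dict[str, tuple[str, str]] = {
    "openai": ("langchain_openai", "ChatOpenAI"),
    "anthropic": ("langchain_anthropic", "ChatAnthropic"),
}

# Hypothetical model-name prefixes mapped to provider keys.
PREFIX_TO_PROVIDER: dict[str, str] = {"gpt-": "openai", "claude": "anthropic"}


def unknown_providers(
    prefixes: dict[str, str], registry: dict[str, tuple[str, str]]
) -> list[str]:
    """Return providers named by the prefix table but absent from the registry."""
    return sorted(set(prefixes.values()) - set(registry))


def provider_doc_lines(registry: dict[str, tuple[str, str]]) -> str:
    """Generate the docstring provider list from the registry itself."""
    return "\n".join(
        f"- {name} -> {package}.{cls}"
        for name, (package, cls) in sorted(registry.items())
    )


print(unknown_providers(PREFIX_TO_PROVIDER, PROVIDERS))
print(provider_doc_lines(PROVIDERS))
```

An empty result from unknown_providers is the "inference table ⊆ registry" assertion in the acceptance criterion. Generating the docstring list removes the third copy entirely.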

Milestone 3 — Quality & Polish

M3.1 — Burn down BLE (blind-except) per package

  • Description: Re-enable BLE package-by-package; replace except BaseException/broad catches with specific exceptions or justified # noqa with a reason.
  • Affected: ignore lists in libs/*/pyproject.toml; ~9 files in langchain_v1/langchain.
  • Acceptance: BLE enabled for at least core + langchain_v1; remaining exceptions justified inline.
  • Workload: L · Risk: Low-Medium · Depends on: none.

M3.2 — Burn down mypy strictness TODOs

  • Description: Enable disallow_any_generics (core) and warn_return_any (v1), fixing fallout.
  • Affected: libs/core/pyproject.toml:94, libs/langchain_v1/pyproject.toml:120.
  • Acceptance: Flags enabled; mypy passes.
  • Workload: L · Risk: Low · Depends on: M2.1 (touches same central code).

M3.3 — Make SHA-1 default explicit / plan migration

  • Description: Keep SHA-1 default but document it loudly and schedule a default change for the next major.
  • Affected: libs/core/langchain_core/indexing/api.py docstrings.
  • Acceptance: Docstrings recommend stronger algorithms; a tracked issue exists for the major-version change.
  • Workload: S · Risk: Low · Depends on: none.

Implementation sketches — Top 3 tasks

#1 — M1.1: Close the SSRF TOCTOU gap

  • Approach: Resolve the hostname once, validate every resolved IP, then connect to the validated IP directly (passing the original hostname for TLS SNI / Host header). Implement via a custom requests/httpx/urllib transport adapter (a _transport.py already exists to build on).
  • Key steps: (1) Extend the transport to accept a pre-validated IP set; (2) have validate_safe_url/validate_url return the validated IP(s), not just the string; (3) route fetches through the transport; (4) add the M0.1 rebinding test using a stub resolver.
  • Pitfalls: Breaking TLS hostname verification if you connect by IP without preserving SNI; IPv6 literal formatting in the Host header; keeping validate_safe_url's public signature stable (return type must stay str — expose IPs via a new internal function). Round-robin DNS / multiple A records must all be validated and the connect must use a validated one.
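
The resolve-once-and-pin idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the design of the existing _transport.py: resolve_and_pin, system_resolver, and the injectable resolver parameter are hypothetical, and is_global stands in for the much richer policy blocklist.

```python
import ipaddress
import socket
from typing import Callable

Resolver = Callable[[str], list[str]]


def system_resolver(hostname: str) -> list[str]:
    """Resolve a hostname once and return its unique addresses."""
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def resolve_and_pin(hostname: str, resolver: Resolver = system_resolver) -> str:
    """Resolve once, validate every address, and return the IP to connect to."""
    addresses = resolver(hostname)
    if not addresses:
        raise ValueError(f"{hostname} did not resolve")
    for ip in addresses:
        # Reject the whole answer if any record is non-public, so round-robin
        # DNS cannot smuggle in a private address.
        if not ipaddress.ip_address(ip).is_global:
            raise ValueError(f"{hostname} resolves to blocked address {ip}")
    # The caller connects to this IP and still sends the original hostname
    # for TLS SNI and the Host header.
    return addresses[0]


def stub(_hostname: str) -> list[str]:
    """Stub resolver so the example runs without network access."""
    return ["93.184.216.34"]


print(resolve_and_pin("example.test", stub))
```

Because the function returns the address rather than the URL string, it would sit behind validate_safe_url as a new internal helper, which keeps the public return type unchanged as the pitfalls above require.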

#2 — M1.2: Safe-by-default shell middleware

  • Approach: Change the constructor so an unspecified policy does not mean "host". Prefer a sandbox if available; otherwise require an explicit HostExecutionPolicy() (or an allow_host=True keyword-only flag) and emit a warning.
  • Key steps: (1) Add keyword-only opt-in; (2) detect sandbox availability (Codex/Docker) and select it; (3) update class docstring + the post-exec-redaction warning; (4) add tests for each default path.
  • Pitfalls: This is a user-visible behavior change — follow CLAUDE.md's stable-interface rule: introduce via keyword-only argument with a deprecation/transition warning rather than silently flipping the default; document clearly in release notes.
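
The selection order described above might be expressed as in the following sketch. HostPolicy and SandboxPolicy are stand-in classes, and resolve_policy, allow_host, and sandbox_available are hypothetical names; none of this is the middleware's current constructor.

```python
import warnings
from typing import Optional


class HostPolicy:
    """Stand-in for a policy that runs commands directly on the host."""


class SandboxPolicy:
    """Stand-in for a sandboxed execution policy."""


def resolve_policy(
    policy: Optional[object] = None,
    *,
    allow_host: bool = False,
    sandbox_available: bool = False,
) -> object:
    """Pick an execution policy without ever defaulting silently to the host."""
    if policy is not None:
        return policy  # An explicit choice always wins.
    if sandbox_available:
        return SandboxPolicy()
    if allow_host:
        warnings.warn(
            "Running shell commands on the host; output redaction does not "
            "prevent secret exfiltration.",
            stacklevel=2,
        )
        return HostPolicy()
    raise ValueError("No execution policy given; pass one or set allow_host=True.")
```

Making allow_host keyword-only keeps the opt-in visible at every call site. During a transition period the final ValueError could instead be a deprecation warning that still returns the host policy.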

#3 — M2.1: Decompose runnables/base.py

  • Approach: Identify cohesive seams (core Runnable/RunnableSerializable base, RunnableSequence/RunnableParallel, binding/config/declarative ops, schema generation) and move each into a submodule, re-exporting from runnables/__init__.py so the public surface is byte-identical.
  • Key steps: (1) Land M0.2 export snapshot; (2) move one cohesive group at a time, running mypy+ruff+tests after each; (3) keep relative-import ban in mind (ruff ban-relative-imports = "all").
  • Pitfalls: Circular imports between the split modules (use TYPE_CHECKING guards, already idiomatic here); import-time regressions; accidental changes to __all__. Because Runnable is the most depended-on abstraction, do this in small, individually-reviewable PRs, not one mega-diff.

End of report.