From 0 to 250K Lines: How a 100% AI-Coded Project Builds a Governance Closed Loop
on , tagged translations, ai, processes
Over the past month, while creating the Routa project, I made a bold decision: to let AI drive the project, turning my ideas into issues and then pushing those issues into executable code.
What truly gave me the confidence to do this wasn’t just interest or model capabilities, but the fact that I finally had access to enough AI tool quotas to treat this as a continuous experiment. Based on my own experience, a so-called “100% agent” project isn’t just about buying a membership. You probably need three times the maximum quota—or around $600 worth of continuous model resources—for AI to truly become your daily workhorse, not just an occasional helper.
During this month, I used the following tools:
- Initially, Copilot Pro (open-source sponsorship) and Cursor (billed per request), then, after exhausting my request quota at the end of February, Augment Code (open-source sponsorship).
- Currently, I mainly use Codex (open-source sponsorship), Kiro Opus 4.6 (billed per credit), and Augment Code.
- Additionally, I use GLM in the pipeline, as well as other tools like Qoder, Antigravity, and Kimi Code.
But I soon realized that what really matters isn’t how many AIs I connected, but what happens to the project itself when AI truly begins to participate.
About a year ago, I wrote an article titled “Two Weeks, 30K Lines of Code! Our 7 AI ‘Dung Heap’ Survival Coding Practices”, documenting how we used AI for a highly experimental “vibe coding” session. The focus of that practice was to verify whether AI could write code and stack up results. To put it more bluntly, the question at the time was still: is AI usable?
But this time is different. In this round of practice with Routa, I began to truly treat AI agents as teammates. Once the relationship shifted from tool to teammate, the question was no longer whether it could produce code, but whether it could integrate into the project’s collaboration system: Does it have clear boundaries? Does it follow current rules? Does its output still fall within the shared contract? Can failures be promptly corrected, and can those failures be sedimented into more stable harness engineering [translator’s note: specialized infrastructure designed to wrap around an AI model to test, benchmark, and validate its performance]?
What this article truly aims to discuss is this: 250,000 lines of code are just the result; the key is how a fully AI-driven project gradually develops its own rule entry points, feedback loops, and governance closed loops.
Teammates: AI Is No Longer an Assistant, but a Working Member Within the System
Looking at the co-author statistics in the repository, it’s clear that this is no longer just a single AI helping to write code. In Routa, there have been a cumulative 846 Co-authored-by entries across 834 commits, involving 24 tools and 25 models working together. The most frequently used is OpenAI Codex/GPT-5, but we also see continuous participation from tools like Kiro, Claude, Augment, GitHub Copilot Agent, Qoder, and Cursor in the delivery pipeline.
This indicates a significant shift: The issue is no longer “AI appearing among contributors,” but rather that AI is beginning to enter the engineering system like a human, leaving governable operational traces.
In this context, GitHub’s “Contributors” view is no longer sufficient. At best, it can only show who participated, but it cannot answer the questions that engineering governance truly cares about: Which AI wrote what code in which scenarios? Which issues might be related to specific tool/model combinations? Which rules should be prioritized for feedback to which agents?
Thus, Co-authored-by here is not just a ceremonial signature, but a source of provenance. The project begins to know who wrote a piece of code and through which model, allowing subsequent rule feedback, failure attribution, and architecture fitness functions to become truly governable system capabilities.
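As a side note, this kind of trailer-based provenance is easy to mine mechanically. The sketch below is generic TypeScript over plain git history, not Routa’s actual tooling; it simply counts `Co-authored-by` trailers per co-author:

```typescript
import { execSync } from "node:child_process";

// Dump every commit body, NUL-separated so bodies with blank lines stay intact.
const log = execSync('git log --format="%B%x00"', {
  encoding: "utf8",
  maxBuffer: 64 * 1024 * 1024, // large history; raise if needed
});

// "Co-authored-by: Name <email>" is the standard git trailer convention.
const counts = new Map<string, number>();
for (const body of log.split("\0")) {
  for (const m of body.matchAll(/^Co-authored-by:\s*(.+?)\s*<(.+?)>/gim)) {
    const key = `${m[1]} <${m[2]}>`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
}

// Most active AI (or human) co-authors first.
for (const [who, n] of [...counts.entries()].sort((a, b) => b[1] - a[1])) {
  console.log(`${String(n).padStart(5)}  ${who}`);
}
```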
In other words, what’s most worth discussing about Routa isn’t “250,000 lines of code generated by AI,” but rather how, once AI truly becomes part of the project’s workforce, the project gradually develops its own feedback system.
I. The First Feedback Loop: Agent + AGENTS.md, Turning Collaboration Rules Into a Unified Entry Point
The starting point of this system isn’t a contract or CI, but rules.
Routa’s rule entry point is AGENTS.md; CLAUDE.md is just a soft link pointing to it. In other words, there is only a single source of truth for the rules, not multiple copies maintained for different tools.
Progressive Disclosure: Not Dumping All Knowledge on the Agent at Once
This might seem like a file organization issue, but it actually reflects a governance mindset: Rules should not be scattered across local configurations for different tools; instead, they should converge into a unified entry point that the entire engineering system can reference.
More importantly, this entry point doesn’t dump all knowledge on the agent at once; it plays a “progressive disclosure” role.
AGENTS.md is more like a directory and operational contract than an encyclopedia. It first tells the agent what stable entry points exist in the repository: Product boundaries are in docs/product-specs/FEATURE_TREE.md, system boundaries in docs/ARCHITECTURE.md, design intent in docs/design-docs/, execution plans in docs/exec-plans/, and quality and verification in docs/fitness/. Only when addressing a specific issue does the agent drill down to the next layer of documentation and implementation details.
The significance of this organization is that it avoids prompt overload. For AI, dumping too many rules at once doesn’t mean stronger constraints; often, it just becomes noise. Instead, by designing clear entry points, boundaries, and drill-down paths, the agent can more easily access the right constraints at the right time.
The Evolution of AGENTS.md: From Collaboration Discipline to Governance Engine
The evolution of AGENTS.md can be roughly divided into three stages.
First stage: Collaboration discipline is documented. Rules gradually clarify baby-step commits, co-author norms, pre-commit checks, and issue feedback processes. The focus is still on making “how to collaborate with AI” clear, so that coding behavior begins to have repeatability.
Second stage: Fitness-ization. Quality gates begin to be explicitly written into the process. Rules are no longer just “suggestions” but start to enter the executable check chain. The system no longer just tells AI “what it should do” but begins to have a mechanism for “what will be blocked if done wrong.”
Third stage: Governance converges toward Entrix. Rules originally scattered across documents, hooks, and CI begin to converge around a unified governance engine. Fitness is no longer a collection of isolated scripts but becomes a consistent engineering adjudication logic.
Thus, the rule here is not about “writing a long prompt and handing it to the agent,” but about turning collaboration rules into a layered knowledge entry point: first expose the smallest stable collaboration contract, then use the document tree, hooks, contracts, and fitness to unfold stronger constraints layer by layer.
This step is crucial. We’ve upgraded “how AI should act” from natural-language prompts into a structured, drillable, and executable system of engineering constraints.
II. The Second Feedback Loop: Monorepo + Contract, Bringing Implementation Back Within Shared Boundaries
If the first feedback loop constrains the agent’s behavior, the second feedback loop constrains the implementation itself.
Monorepo: Putting Project-Level Changes Back in the Same Context
Many teams think of a monorepo first in terms of code hosting: putting frontend, backend, scripts, and tools into a single repository. But in an AI project, its more important meaning is shared context engineering.
The most problematic aspect of AI isn’t that it can’t write a certain layer of code, but that when a requirement spans frontend, backend, desktop shell, and multiple services, it often receives several fragmented local contexts. Each local piece can be written, and each piece may seem valid, but without a unified workspace, the agent struggles to converge these changes into a complete, verifiable project-level change.
Thus, a monorepo here is not just code centralization but context convergence.
In a dispersed collaboration model, a requirement is split across frontend, backend, desktop, scripts, and multiple services, with each repo or directory evolving separately and dependencies existing implicitly. The agent often switches between different sessions, resulting in context fragmentation.
In a shared context model, project-level changes are converged into the same workspace. Product boundaries, architectural boundaries, implementation code, governance scripts, fitness evidence, and issue feedback are all placed in a continuous context, giving AI the opportunity to move from local coding to cross-layer linkage and one-time completion.
Routa’s current approach essentially turns shared context into an engineering structure, rather than relying on human memory to fill the gaps:
- `AGENTS.md` and `docs/` provide unified navigation, allowing the agent to first enter through the same project entry point.
- `src/`, `crates/`, `apps/`, and `tools/` coexist in one repository, enabling the frontend, Rust backend, desktop shell, and governance tools to interact within the same workspace.
- `docs/fitness/`, `docs/issues/`, and `docs/exec-plans/` place verification evidence, failure feedback, and execution plans in the same context, rather than scattering them across external systems.
Only by first converging the context can the contract truly land.
Contract: Preemptively Defining System Boundaries as Shared Agreements
What first spirals out of control in a fully AI-driven project is usually not code style, but boundary drift. Especially in a dual-backend project like Routa, without a strongly constrained single source of truth, AI can easily write a locally reasonable implementation on one end while quietly introducing systemic deviations on the other.
This is where api-contract.yaml comes in. It’s not just an interface description file, but a shared boundary definition for both backends. The Next.js and Rust ends must implement the same set of endpoints and maintain consistent request and response shapes.
More importantly, this is not a static document. It has entered the validation process: Change the contract first, then the implementation; define the boundary first, then allow coding. The project no longer defaults to “write the code first, then patch consistency,” but requires that implementation falls within the shared contract from the start.
From the current implementation, this contract has at least three layers of constraints in effect simultaneously.
First layer: docs/fitness/api-contract.md explicitly defines api-contract.yaml as the single source of truth and requires that new endpoints be pushed in the order of “change contract first, then change src/app/api/, then change crates/routa-server/src/api/.”
Second layer: The npm run api:schema:validate and npm run api:check checks pull schema validity, Next.js implementation, and Rust implementation into the same comparison plane. Here, it’s not just about “whether the interface exists,” but also whether breaking changes have been unintentionally introduced.
Third layer: Runtime contract tests like tests/api-contract/test-schema-validation.ts directly read the contract, check operationId, request schema, and response schema, and perform AJV validation on real API responses. In other words, the contract in Routa constrains both design-time boundaries and runtime shapes.
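To make that third layer concrete, here is a minimal sketch of what such a runtime contract test can look like. This is not Routa’s actual test file; it assumes an OpenAPI-style contract with an inline (non-`$ref`) response schema, a server on `localhost:3000`, and an illustrative `/api/health` operation:

```typescript
import fs from "node:fs";
import { load } from "js-yaml";
import Ajv from "ajv";

// Read the single source of truth and pick one operation's declared
// 200-response schema. The path and shape here are illustrative.
const contract = load(fs.readFileSync("api-contract.yaml", "utf8")) as any;
const op = contract.paths["/api/health"]?.get;
if (!op?.operationId) throw new Error("operation is missing an operationId");

const responseSchema = op.responses["200"].content["application/json"].schema;

// strict: false tolerates OpenAPI-specific keywords that plain JSON Schema lacks.
const ajv = new Ajv({ strict: false });
const validate = ajv.compile(responseSchema);

// Hit the real endpoint and validate the live response against the contract,
// so runtime shape drift fails the test, not just a missing route.
const res = await fetch("http://localhost:3000/api/health");
const body = await res.json();
if (!validate(body)) {
  console.error(validate.errors);
  throw new Error(`${op.operationId}: response violates api-contract.yaml`);
}
console.log(`${op.operationId}: live response matches the contract`);
```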
This means AI can explore implementations, but it cannot arbitrarily rewrite boundaries. For a fully AI-driven project this is crucial, because what usually collapses first is not the functionality itself, but consistency.
III. The Third Feedback Loop: Git Hooks + Lifecycle, Moving Constraints to Pre-Commit
If the rule addresses “how the agent should act,” and the contract addresses “implementation boundaries must not drift,” then the third feedback loop addresses this: AI cannot first create large local deviations and then throw the entire problem to remote CI or manual review.
Git Hook Lifecycle: Local Checks Are Not Just Checks, but Preemptive Adjudication
The current repository’s pre-commit is lightweight, only running lint. The real key is pre-push and smart-check.sh. They not only run Entrix-based checks but also identify whether the current environment is an AI agent environment: If it’s AI, they return structured failure information and require fixes before retrying; if it’s a human environment, they retain a more flexible interactive fix path.
In other words, these hooks are no longer just Git hooks, but preemptive adjudicators.
- `scripts/smart-check.sh` actually calls `python3 -m entrix.cli run` before push, defaulting to clear metrics like `eslint_pass`, `ts_typecheck_pass`, `ts_test_pass`, and `markdown_external_links`, rather than arbitrarily concatenating shell commands.
- The same script also executes `entrix.cli review-trigger`. This step doesn’t test right or wrong, but risk: for example, `docs/fitness/review-triggers.yaml` has already marked directories like `src/core/acp/**`, `src/core/orchestration/**`, and `crates/routa-server/src/api/**` as high-risk boundaries; any change to `api-contract.yaml` or `defense.yaml`, or a simultaneous cross-boundary change spanning `web`/`rust`/`tools`, will trigger human review.
- More crucially, the evidence-gap check has been moved into the hook itself. It’s not just “whether the code change passes tests,” but “you changed a core path without synchronously updating `docs/fitness/**` or the contract evidence,” and that behavior itself will be flagged.
What’s reflected here isn’t script trickery, but a governance mindset: The same feedback loop should return different types of feedback for different executors. For AI, feedback must be explicit; for humans, feedback can retain judgment space.
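The real `smart-check.sh` is a shell script; the TypeScript sketch below only illustrates this dual-mode idea. The agent-detection variables and metric commands are assumptions for illustration, not Routa’s actual list:

```typescript
import { execSync } from "node:child_process";

// Heuristic: many coding agents export identifying environment variables.
// These names are illustrative assumptions, not Routa's actual detection list.
const AGENT_ENV_VARS = ["CLAUDECODE", "CURSOR_TRACE_ID", "CODEX_SANDBOX"];
const isAgent = AGENT_ENV_VARS.some((v) => process.env[v] !== undefined);

// Stand-in for `python3 -m entrix.cli run`: each metric maps to one command.
const METRICS: Record<string, string> = {
  eslint_pass: "npx eslint .",
  ts_typecheck_pass: "npx tsc --noEmit",
};

const failures: { metric: string; output: string }[] = [];
for (const [metric, cmd] of Object.entries(METRICS)) {
  try {
    execSync(cmd, { encoding: "utf8", stdio: "pipe" });
  } catch (e: any) {
    failures.push({ metric, output: String(e.stdout ?? e.message) });
  }
}

if (failures.length > 0) {
  if (isAgent) {
    // Agents get machine-readable feedback: fix, then retry the push.
    console.log(JSON.stringify({ status: "blocked", failures }, null, 2));
  } else {
    // Humans get a readable summary and keep the interactive fix path.
    for (const f of failures) console.error(`FAIL ${f.metric}\n${f.output}`);
  }
  process.exit(1);
}
```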
Automatically Triggering Refactoring: First Stop Further Deterioration, Then Talk About a Complete Cure
Going further, code governance is no longer just a simple wc -l. The repository has now evolved this into a budget and freeze mechanism: Changed files must be controlled within the budget, and historically oversized hotspot files are not required to be completely cured at once, but they cannot continue to expand—they can only gradually shrink.
This has already been concretely implemented in Routa.
- `tools/entrix/file_budgets.json` writes the budget as configuration: the default upper limit for `.ts`/`.tsx` files is 1,000 lines, and for `.rs` files, 800 lines.
- `tools/entrix/entrix/file_budgets.py` doesn’t just statically compare absolute thresholds; it reads the line count of the historical version in `HEAD` and freezes files that are already over the limit at their current baseline.
- This creates a very practical ratchet mechanism (sketched below): normal files cannot exceed the budget, while historical hotspot files don’t need to return to an ideal state at once, but their ceiling is locked, and subsequent commits can only make them smaller, not larger.
- This mechanism has already entered fitness. In `docs/fitness/code-quality.md`, one metric checks whether changed files meet the budget, and another, `legacy_hotspot_budget_guard`, acts as a hard gate, specifically ensuring that registered hotspot files can only shrink, not expand.
- It has also entered hooks and refactoring workflows: `.husky/post-commit` calls `entrix hook file-length` to issue budget warnings, and `docs/REFACTOR.md` spells out the explicit sequence “address budget violations first, then sort by hotspot and change frequency.”
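Routa implements this in Python (`tools/entrix/entrix/file_budgets.py`); the TypeScript sketch below restates just the ratchet rule, with budget values taken from the post and the config shape assumed:

```typescript
import fs from "node:fs";
import { execSync } from "node:child_process";

// Budgets as described in tools/entrix/file_budgets.json; the exact config
// shape is an assumption, the limits are the ones quoted in the post.
const BUDGETS: Record<string, number> = { ".ts": 1000, ".tsx": 1000, ".rs": 800 };

const countLines = (text: string) => text.split("\n").length;

// Line count of the file as it existed in HEAD (0 if the file is new).
function headLineCount(path: string): number {
  try {
    return countLines(execSync(`git show HEAD:${path}`, { encoding: "utf8" }));
  } catch {
    return 0;
  }
}

export function checkBudget(path: string): { ok: boolean; reason?: string } {
  const budget = BUDGETS[path.slice(path.lastIndexOf("."))];
  if (budget === undefined) return { ok: true }; // extension not governed

  const now = countLines(fs.readFileSync(path, "utf8"));
  const before = headLineCount(path);

  if (before > budget) {
    // Frozen hotspot: already over budget in HEAD, so it may only shrink.
    return now <= before
      ? { ok: true }
      : { ok: false, reason: `hotspot ${path} grew: ${before} -> ${now} lines` };
  }
  // Normal file: must stay within the absolute budget.
  return now <= budget
    ? { ok: true }
    : { ok: false, reason: `${path} is ${now} lines (budget ${budget})` };
}
```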
The significance of this mechanism is that it doesn’t attempt to eliminate historical burdens all at once, but first establishes a boundary of no further deterioration. For a fully AI-driven project, this is more realistic and more important than perfect governance.
IV. The Fourth Feedback Loop: CI/CD + Fitness, Elevating Local Constraints to Repository-Level Adjudication
The first three feedback loops occur more locally and during development. But if the project truly enters a fully AI-driven state, local constraints alone are not enough. The team also needs a repository-level, shared, and auditable system to determine the actual quality state of the current changes.
This is the role of docs/fitness/, Entrix, and the defense workflow.
In Routa, fitness is no longer just a “test supplement,” but a repository-level adjudication mechanism. It emphasizes contract-first, blast radius control, and anti-entropy, breaking repository governance into multiple quality dimensions and executing them through a unified engine.
Looking at the implementation details, this fitness system has several notable points.
First, it is Markdown frontmatter-driven. The files in docs/fitness/*.md are not ordinary documentation but dimension declaration files. Each file defines dimension, weight, tier, threshold, and metrics, placing rules and evidence in the same carrier.
Second, it has a clear layered execution model. The current docs/fitness/README.md divides checks into fast, normal, and deep layers, corresponding to lint/contract checks, unit/API tests, and E2E/security scans. In other words, fitness here is not a one-time comprehensive check, but is organized in layers based on feedback timing.
Third, it already has a unified execution engine. tools/entrix/entrix/engine.py loads these dimensions, dispatches them to shell runners, SARIF runners, and graph probe runners based on metric type, and then summarizes them into a report. Metrics like graph:impact and graph:test-radius have also entered the same execution path, rather than being scattered in script corners.
Fourth, it doesn’t just score; it also does incremental filtering. Entrix infers the domain based on changed files and only runs metrics related to the current changes. This way, the AI receives more focused feedback and isn’t overwhelmed by repository-wide noise every time.
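Entrix itself is Python, so the TypeScript sketch below only illustrates the incremental-filtering idea; the domain map and most metric names are assumptions (only the ones named earlier in this post are real):

```typescript
// Map path prefixes to governance domains; the mapping is illustrative.
// More specific prefixes come first, and the first match wins.
const DOMAIN_RULES: [prefix: string, domain: string][] = [
  ["src/app/api/", "api-contract"],
  ["crates/routa-server/src/api/", "api-contract"],
  ["docs/fitness/", "governance"],
  ["src/", "web"],
  ["crates/", "rust"],
];

// Metrics per domain (only the names quoted earlier in the post are real).
const METRICS_BY_DOMAIN: Record<string, string[]> = {
  "api-contract": ["api_schema_valid", "api_parity_check"],
  web: ["eslint_pass", "ts_typecheck_pass", "ts_test_pass"],
  rust: ["cargo_clippy_pass", "cargo_test_pass"],
  governance: ["markdown_external_links"],
};

export function selectMetrics(changedFiles: string[]): string[] {
  const domains = new Set<string>();
  for (const file of changedFiles) {
    for (const [prefix, domain] of DOMAIN_RULES) {
      if (file.startsWith(prefix)) {
        domains.add(domain);
        break; // first (most specific) rule wins for this file
      }
    }
  }
  // Only metrics for touched domains run, keeping feedback focused.
  return [...domains].flatMap((d) => METRICS_BY_DOMAIN[d] ?? []);
}

// Example: a Rust API change routes to the api-contract domain only.
console.log(selectMetrics(["crates/routa-server/src/api/sessions.rs"]));
```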
We further expand this through GitHub Actions (.github/workflows/defense.yaml). It’s no longer just CI in the traditional sense, but breaks down dimensions like code quality, testability, security, API contract, design system, evolvability, UI consistency, observability, and performance into independent jobs, executing and summarizing them separately.
This change is significant: The team no longer just knows “this commit failed,” but knows whether it failed due to contracts, testability, security, or evolvability. Quality issues are no longer a black-box result but a set of governance dimensions that can be discussed, divided, and continuously optimized.
Auto-Fixing CI: Failures Aren’t Just Blocked, They Must Be Able to Flow Back
Going further, when CI fails at night, the system hands it over to the coding agent for automatic fixes. This shows that the system’s goal isn’t just to “keep errors out,” but to attempt to build a closed loop where failures can flow back, rules can sediment, and fixes can be automated.
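The post doesn’t detail the mechanics of that nightly handoff, so the sketch below is just one plausible shape: a failed fitness report becomes a focused task for a headless coding agent. The report fields and the `codex exec` invocation are illustrative assumptions:

```typescript
import fs from "node:fs";
import { execFileSync } from "node:child_process";

// Assumed shape of a failed fitness run's report; field names are illustrative.
interface FitnessReport {
  commit: string;
  failures: { dimension: string; metric: string; detail: string }[];
}

const report: FitnessReport = JSON.parse(
  fs.readFileSync("fitness-report.json", "utf8"),
);

if (report.failures.length > 0) {
  // Turn the structured verdict into a focused task prompt, so the agent
  // receives the governance dimensions, not raw CI logs.
  const task = [
    `Fix the following fitness failures on commit ${report.commit}.`,
    "Stay within the affected dimensions and update docs/fitness/** evidence.",
    ...report.failures.map((f) => `- [${f.dimension}] ${f.metric}: ${f.detail}`),
  ].join("\n");

  // Hand off to a non-interactive coding agent; any CLI agent that accepts
  // a task as an argument would fit in this slot.
  execFileSync("codex", ["exec", task], { stdio: "inherit" });
}
```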
This is precisely the mark of a mature fully AI-driven project: It’s not that AI always succeeds on the first try, but whether the system can turn failures into input for the next round of convergence.
The Real Threshold: Not Model Capability, but Governance Capability
When many teams talk about AI development, they tend to focus on models, agents, and the degree of automation. But these are mostly supply-side capabilities.
What truly makes the difference is whether the project is ready with a system to receive these capabilities.
What Routa currently demonstrates isn’t just “AI writing a lot of code,” but “the project gradually integrating AI into the engineering system”: First, using rules to constrain agents, then using a monorepo to converge context, then using contracts to lock boundaries, then using hooks to preempt constraints, and finally using fitness and CI for repository-level adjudication.
The code itself is just the result; what truly supports this result is a continuously tightening, continuously feeding back, and continuously converging engineering closed loop.
This path demonstrates one thing: In the era of AI programming, true competitiveness may not come first from how strong the model is, but from whether the team can weave rules, verification, feedback, and fixes into a stable, operating engineering system.
Code scale is just the result; governance capability is the moat.
Conclusion: Full AI Is Not About Abandoning Governance, but Strengthening It
Many people, when talking about “fully AI-driven projects,” tend to imagine something like autonomous driving: As long as the model is strong enough, the system will run more and more stably on its own.
But engineering reality is often the opposite.
The more you rely on AI, the more you need clearer boundaries, more explicit rules, higher-frequency feedback, and more systematic verification. What truly supports a fully AI-driven project isn’t abandoning governance, but moving governance forward, refining it, and automating it.
Thus, for a project like Routa, what’s most worth telling isn’t the result “250,000 lines of code written by AI,” but the engineering judgment behind this result: The team didn’t treat AI as just a faster code generator, but continuously integrated it into an increasingly rigorous feedback system.
This is the moment when a fully AI-driven project truly begins to take shape.
(This post is a machine-made, human-reviewed, and authorized translation of phodal.com/blog/ai-coded-project-governance-evolution-250k-lines/.)