Most platforms made it easy to build a first agent. Nobody has solved how to run one in production with the same discipline we apply to software. That gap has a name. It’s going to close the same way every gap like it has closed before.
We’ve Been Here Before
Cast your mind back to 2010. Every serious engineering organization had servers. Those servers were configured manually. Someone logged in, ran commands, made changes. Each one was slightly different from the others. Nobody was entirely sure what was running where. When something broke, you debugged by comparing what you remembered against what was actually there.
Then Infrastructure as Code happened. Chef. Puppet. Ansible. Terraform. And the industry stopped treating infrastructure as something you configure by hand and started treating it as something you define in a file, check into source control, and promote through a pipeline. The infrastructure didn’t change. The discipline around it did.
The same arc played out with cloud services. With application deployments. With configuration management. Every time a new class of system emerged, teams first figured out how to build it, then how to run it reliably. Eventually, treating it like deployable software became table stakes.
The pattern is consistent. Servers became Infrastructure as Code. Cloud services trended toward declarative and pipeline-managed operation. Serious engineering organizations stopped treating manual console configuration as the target operating model, because Terraform and CloudFormation made repeatable, versioned infrastructure the standard. Applications got CI/CD pipelines with gates, approvals, and rollbacks. Every mature category trended in the same direction: versioning, automation, and controlled promotion.
Agents are next. And right now, most of the market is still in the “configure it by hand and hope” phase.
The Scenario That Exposes the Gap
You have an application or pipeline that invokes an AI agent. The agent has a configuration: instructions, skills, connectors, a knowledge source, model and runtime dependencies. It does something: generates a brief, processes a document, handles a customer query.
You build in Dev. You test in Test. You promote to Production.
Your application code moves through this pipeline cleanly. It lives in source control. Changes are tracked. The same artifact that passes tests in Test gets deployed to Prod. That’s the discipline you’ve built over years of software engineering.
Now ask yourself what happens to the agent configuration when you promote.
Too often, the answer is: someone recreates it. They open the agent platform in the next environment, manually reconstruct the instructions, re-add the connectors, reselect the model or runtime configuration, and hope they got it right. Then they point the application config at the new agent ID and call it a promotion.
That’s not a pipeline. That’s copy-paste with extra steps.
I ran into this directly while building a test harness for a production agent workflow. The right instinct surfaced immediately: you need a Dev agent, a Test agent, and a Production agent. Separate IDs, each environment isolated. An environment dropdown in the harness was the obvious design. But the moment you build it, the problem becomes visible. You’re not promoting one agent through three environments. You’re maintaining three parallel agents and trusting that they stay in sync through manual effort.
That trust is misplaced. And the consequences are specific.
Three Failure Modes
You tested one agent. You deployed another.
The agent that passed UAT in Test is not the agent running in Production. It’s a manually reconstructed approximation of it. One missed instruction. A different connector setting. Runtime or model configuration that drifted from what was validated in Test. You don’t know until something breaks in production. And when it does, you can’t easily tell whether the code changed or the agent changed or both.
Deployment is copy-paste.
Every time the business owner wants to adjust agent logic, someone makes that change in Dev, then makes it again in Test, then makes it again in Production. Three separate manual operations, each one an opportunity for error or omission. At one agent, this is annoying. At ten agents across two platforms, it’s a bottleneck. At fifty agents, it’s operationally unsustainable.
You have no cross-environment diff. You have no idea whether your environments match.
Some platforms show version history inside a single agent’s UI. Some DevOps tooling gives you pipeline history within a specific stack. That’s useful. But it’s not the same as a cross-environment diff on the full agent configuration as a single governed artifact. You can’t ask “what’s different between the Test agent and the Production agent right now?” and get a reliable answer. The platform might tell you what changed inside each environment independently. It won’t tell you whether the two agents are in sync with each other.
That’s not a governance failure. It’s a tooling gap. But the consequences land the same way.
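The question “what’s different between the Test agent and the Production agent right now?” can be sketched as a field-by-field diff, assuming each environment’s agent configuration can be exported as a nested dictionary. The export format, the field names, and the model names below are hypothetical; this is an illustration of the missing capability, not any platform’s API.

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys so fields compare one by one."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(left: dict[str, Any], right: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field: (left_value, right_value)} for every field that differs."""
    a, b = flatten(left), flatten(right)
    missing = "<missing>"
    return {
        key: (a.get(key, missing), b.get(key, missing))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, missing) != b.get(key, missing)
    }


# Hypothetical exports of the Test and Production agents.
test_agent = {
    "instructions": "Summarize the document in five bullets.",
    "model": {"name": "example-model-v2", "temperature": 0.2},
    "connectors": ["crm", "sharepoint"],
}
prod_agent = {
    "instructions": "Summarize the document in five bullets.",
    "model": {"name": "example-model-v1", "temperature": 0.2},
    "connectors": ["crm"],
}

drift = diff_configs(test_agent, prod_agent)
for field_name, (test_value, prod_value) in drift.items():
    print(f"{field_name}: test={test_value!r} prod={prod_value!r}")
```

An empty result would mean the two exports match. Here the model name and the connector list have drifted, which is exactly the kind of difference that per-environment version history does not surface.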
What the Platforms Are Doing
This isn’t a platform comparison piece. I covered the landscape in the companion article on agent change management. The short version: the platforms with the most complete story inherited it from existing enterprise DevOps infrastructure. Salesforce Agentforce treats agents as deployable metadata within the Salesforce release-management model. Microsoft Copilot Studio agents live inside Power Platform solutions that move through environment pipelines. Both work, if your application lives inside their ecosystem.
Outside those ecosystems, native end-to-end environment promotion for agents is still immature, inconsistent, or absent as a standard operating model. Most pure-play AI platforms have added prompt versioning or playbook history. OpenAI is adding controlled sandboxes and expanding runtime controls, but it does not yet present the same kind of full enterprise ALM surface that Microsoft and Salesforce do. The tooling exists to route your application to different agent IDs per environment. What’s missing is a general-purpose way to define an agent as a governed artifact, version it in source control, diff it across environments, and promote that same artifact through gates the way we do with software.
One distinction matters here. Code-first platforms are closer to Agent as Code today. If you build directly against an LLM API, or use Amazon Bedrock with the AWS CDK, Google ADK, or LangChain, your system prompt and skill files are text files that live in your repo. They version naturally. They promote with your application code because they are your application code. The tradeoff is that you own the infrastructure: the agent loop, tool execution, session management, connectors.
Managed platforms optimized for enterprise workflow features and business-user accessibility make the first mile much faster. The tradeoff is that much of the agent configuration often lives in the platform, not natively in your repo. It’s a record in their system, not a file in yours. That’s an architectural choice those platforms made deliberately, and it’s a reasonable one for many use cases. But teams using those platforms inherit the gap this article describes. The separate-agent-IDs-per-environment pattern is a reasonable workaround given that constraint. It’s not wrong. It’s just not a pipeline.
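For reference, the separate-agent-IDs-per-environment workaround usually amounts to a lookup like the one below. This is a minimal sketch; the agent IDs and the function name are hypothetical.

```python
# Hypothetical agent IDs. On a managed platform these point at three separate
# records, each configured by hand in its own environment.
AGENT_IDS = {
    "dev": "agent-dev-001",
    "test": "agent-test-001",
    "prod": "agent-prod-001",
}


def resolve_agent_id(environment: str) -> str:
    """Return the agent ID the application should call in this environment."""
    try:
        return AGENT_IDS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


current_agent = resolve_agent_id("test")
```

The mapping routes traffic correctly, but nothing in it records what each agent contains or whether the three definitions match.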
The parallel to the Time to Hello World article is direct. The platforms are competing on how fast you can get an agent running. Not on how safely you can run it in production. Time to first agent is the metric everyone optimizes for. Time to production-grade deployment is the problem the market still hasn’t solved cleanly.
What Agent as Code Actually Requires
The concept isn’t complicated. The implementation isn’t trivial. But the shape of the solution is clear.
Agent configuration needs to be representable in source control alongside application code. Not a screenshot of the platform UI. Not a manually maintained runbook. A file, or a set of files, that captures what the agent actually is: its instructions, its model and runtime dependencies controlled explicitly rather than left to drift, its tool connections, its knowledge source references. When that definition changes, your pipeline knows. When a business owner wants to adjust agent logic, they make a change to that file, submit it for review, and the pipeline takes it from there.
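A minimal sketch of what such a definition could look like, assuming the agent can be described by a handful of fields. The schema, the field names, and the example values are hypothetical rather than any platform’s format. The canonical serialization is what you would commit to source control, and the fingerprint gives the pipeline a cheap way to notice that the definition changed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AgentDefinition:
    """Hypothetical schema; the field names are illustrative only."""
    name: str
    instructions: str
    model: str
    connectors: tuple[str, ...] = ()
    knowledge_sources: tuple[str, ...] = ()

    def to_canonical_json(self) -> str:
        # Sorted keys and fixed separators yield identical text for identical content.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))

    def fingerprint(self) -> str:
        # SHA-256 of the canonical text identifies this exact definition.
        return hashlib.sha256(self.to_canonical_json().encode("utf-8")).hexdigest()


brief_agent = AgentDefinition(
    name="brief-writer",
    instructions="Generate a one-page brief from the supplied document.",
    model="example-model-v2",
    connectors=("crm",),
    knowledge_sources=("kb://product-docs",),
)

# A business owner's edit produces a new definition with a different fingerprint.
edited_agent = replace(brief_agent, instructions="Generate a two-page brief.")

print(brief_agent.fingerprint() == edited_agent.fingerprint())
```

Because the dataclass is frozen, a change always creates a new object instead of mutating the approved one, which keeps the reviewed definition and the proposed definition separate.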
The pipeline deploys the same governed definition to each environment in sequence. Not a recreation. The same artifact, evaluated in Dev, promoted to Test, tested under real conditions, approved by a human, deployed to Prod. What goes to Prod is exactly what was tested. Not an approximation.
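The “same artifact” rule can be enforced mechanically. The sketch below assumes the agent definition is serialized to bytes and that each environment records the digest of what it is running. The environment names, the in-memory `deployed` record, and the `promote` function are hypothetical stand-ins for a real deployment system, and the human approval step is left out for brevity.

```python
import hashlib

ORDER = ["dev", "test", "prod"]


class PromotionError(Exception):
    """Raised when an artifact skips or differs from the previous environment."""


def digest(artifact: bytes) -> str:
    return hashlib.sha256(artifact).hexdigest()


def promote(artifact: bytes, target: str, deployed: dict[str, str]) -> None:
    """Deploy `artifact` to `target` only if the previous environment runs the same bytes."""
    index = ORDER.index(target)
    if index > 0:
        previous = ORDER[index - 1]
        if deployed.get(previous) != digest(artifact):
            raise PromotionError(f"{target}: artifact was not validated in {previous}")
    deployed[target] = digest(artifact)


deployed: dict[str, str] = {}
artifact = b'{"instructions":"Summarize in five bullets.","model":"example-model-v2"}'

for environment in ORDER:
    promote(artifact, environment, deployed)

# A hand-edited copy is rejected because Test never ran these exact bytes.
tampered = artifact.replace(b"five", b"six")
try:
    promote(tampered, "prod", deployed)
except PromotionError as error:
    print(error)
```

The check is deliberately strict: any difference in the bytes, however small, sends the change back to the start of the sequence.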
Evaluation gates reduce manual testing and make what remains systematic, with human review where the workflow requires it. An evaluation set of representative inputs with expected behavioral ranges runs automatically when the configuration changes. Pass the gate, advance to the next environment. Fail the gate, stop and notify. The same discipline as automated testing in software CI/CD, applied to agent behavior.
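A minimal sketch of such a gate, assuming the agent can be called as a function from prompt to response. The checks here (required phrases and a length bound) are simple stand-ins for “expected behavioral ranges”; the `EvalCase` fields, the threshold, and the stub agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_include: tuple[str, ...]  # phrases the response has to contain
    max_words: int                 # upper bound on response length


def passes(case: EvalCase, response: str) -> bool:
    text = response.lower()
    has_phrases = all(phrase.lower() in text for phrase in case.must_include)
    return has_phrases and len(response.split()) <= case.max_words


def run_gate(agent: Callable[[str], str], cases: list[EvalCase], threshold: float = 0.9) -> bool:
    """Return True when the share of passing cases meets the threshold."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(passes(case, agent(case.prompt)) for case in cases)
    return passed / len(cases) >= threshold


def stub_agent(prompt: str) -> str:
    # Stand-in for a real agent call.
    return f"Refund policy: returns are accepted within 30 days. ({prompt})"


cases = [
    EvalCase("What is the refund window?", ("30 days",), max_words=40),
    EvalCase("Can I return an opened item?", ("refund",), max_words=40),
]

if run_gate(stub_agent, cases):
    print("gate passed: advance to the next environment")
else:
    print("gate failed: stop and notify")
```

In a real pipeline the boolean would decide whether the promotion step runs, and a failing result would carry the individual case outcomes so reviewers can see what regressed.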
Rollback becomes straightforward because the prior approved configuration is stored. Something breaks in Prod? You reactivate the last known good version. Not from memory. From the record.
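As a sketch of “from the record”, assume each environment keeps an ordered list of approved configuration versions. The `ReleaseHistory` class and the version labels below are hypothetical; a real system would store this history durably and redeploy the restored version.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReleaseHistory:
    """Approved configuration versions for one environment, oldest first."""
    approved: list[str] = field(default_factory=list)
    active: Optional[str] = None

    def activate(self, version: str) -> None:
        self.approved.append(version)
        self.active = version

    def rollback(self) -> str:
        if len(self.approved) < 2:
            raise RuntimeError("no earlier approved version to roll back to")
        self.approved.pop()              # drop the failing version
        self.active = self.approved[-1]  # reactivate the last known good one
        return self.active


prod = ReleaseHistory()
prod.activate("agent-config@a1b2c3")  # hypothetical version labels
prod.activate("agent-config@d4e5f6")

restored = prod.rollback()
print(restored)
```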
LLMOps and AgentOps describe pieces of this. Neither one, as currently tooled, fully solves governed promotion. The vocabulary is new. The discipline is the same one that solved infrastructure, cloud services, and application deployment before it.
This Is How the Category Resolves
It doesn’t become standard because a platform announces a feature. It becomes standard because the first few production incidents make the current approach too costly to defend.
Regulation adds pressure. The EU AI Act’s August 2026 applicability milestone is a real pressure point for organizations running AI in regulated workflows. Auditors and diligence teams asking for change-management evidence will accelerate the timeline. But regulation isn’t what closes this gap. Incidents close it. The customer service agent that returns bad responses for three days because someone manually updated the wrong environment. The compliance audit that asks for evidence of controlled deployment and finds copy-paste. The transaction due diligence that surfaces three parallel agent configurations that don’t match each other.
Every category that preceded this one resolved the same way. Not because someone made a prediction. Because the pain of the alternative became obvious at scale.
Infrastructure moved toward Infrastructure as Code when managing servers by hand stopped scaling. Cloud services moved toward Terraform and CloudFormation when manual console provisioning became indefensible for serious engineering teams. Application delivery moved toward CI/CD when hand-managed releases became too fragile to defend. Agents will move the same way.
We’re not far from that moment.
The Closing Question
Can you deploy an agent change the same way you deploy a code change?
Same source control. Same review process. Same pipeline. Same governed definition in every environment. Same rollback capability.
If the answer is no, you have a copy-paste pipeline, not a real one. And the longer your agents run in production without it, the more expensive that gap becomes to close.
This article is a companion to the Managing the Digital Workforce series. The agent change management fundamentals that precede this piece are covered in “Your Agents Are Software. Treat Them Like It.” The registry that makes governed deployment possible is covered in Part 2. Platform posture management, which covers monitoring what’s actually running against what was approved, is covered in Part 5.