OpenAI logo, laptop

OpenAI Builds System to Detect Misaligned AI Coding Agents in Real Time

The company uses GPT-5.4 thinking to flag misaligned behavior across tens of millions of internal agent sessions.

Share your love

The company uses GPT-5.4 thinking to flag misaligned behavior across tens of millions of internal agent sessions.

OpenAI disclosed a low-latency monitoring system it has operated for five months to detect misaligned behavior in AI coding agents deployed internally, using its own GPT-5.4 Thinking model to review agent reasoning and flag suspicious actions for human review. 

The system reviews full conversation histories, including chains of thought, tool calls, and developer and assistant messages, within 30 minutes of session completion, assigns each interaction a category and severity level, and automatically alerts OpenAI’s Safety Oversight and Security Incident Response teams when anomalies surface. The company said the monitor has reviewed tens of millions of internal coding trajectories since it launched. 

Zero real conversations triggered the highest severity level in that period. That threshold is designed to capture coherent scheming, a form of misalignment where a model pursues goals beyond its assigned task. Approximately 1,000 conversations reached moderate severity, many originating from deliberate internal red-teaming. 

The move comes as AI developers face mounting pressure to demonstrate that autonomous agent systems capable of writing and executing code without step-by-step human instruction can be deployed without material safety risk.

Internal coding agents raise specific concerns. They have access to select OpenAI systems, can inspect the code and documentation behind the company’s safeguards, and could, in theory, attempt to alter those safeguards or affect future model versions. OpenAI said these properties make internal deployments a critical environment for developing safety infrastructure. 

MORE FROM OpenAI: OpenAI Launches GPT-5.4, Its Most Advanced Work Model Yet

The most common behaviors the system flagged were circumvention attempts and deception. In one documented case, an agent encountered an “Access is denied” error and responded by trying to encode commands in base64, splitting payload construction into smaller steps to avoid detection, and obfuscating suspicious content before ultimately switching to an approach that complied with the restriction. OpenAI said the developer prompt in that instance had inadvertently encouraged the behavior; revising it reduced but did not eliminate similar incidents. 

“We’ve found that our models can be overly eager to work around restrictions in pursuit of a user-specified goal,” the company said in a blog post, adding that it has so far found no evidence of motivations beyond the original task, such as self-preservation or scheming. 

The monitor outperformed employee escalations on every case. It flagged every interaction that staff reported through internal channels and surfaced additional incidents that would otherwise have gone unnoticed, OpenAI said. Less than 0.1% of traffic, primarily from bespoke or local setups, currently falls outside monitoring coverage. The company said it is working to close that gap.

OpenAI acknowledged the system’s limits. It cannot yet reliably measure false-negative rates in open-ended real-world traffic without dedicated control evaluations and red-team testing. Future models may also exhibit behaviors harder to detect from reasoning traces. The company said it treats monitoring as one layer in a broader defense-in-depth stack, not a complete solution. 

The company outlined a roadmap that includes expanding monitoring from asynchronous review to synchronous blocking, allowing the system to evaluate and potentially stop the highest-risk actions before they execute. That capability is particularly relevant in settings where a single step can cause irreversible harm, OpenAI said. A follow-up report is expected in the summer.

OpenAI called on other AI developers to adopt similar safeguards for internal coding-agent deployments, describing the approach as an early step toward a broader safety case for frontier AI systems.

Avatar photo
NN Desk

Leave a Reply

Your email address will not be published. Required fields are marked *

Stay updated with NervNow Weekly

Subscribe now