Securing AI Coding Agents: Sandboxes, Permissions and Audit Logs

AI coding agents can run shell commands, push commits and call services on their own. This guide covers the three controls that actually contain them: permission models, OS-level sandboxes and tamper-evident audit logs, grounded in OWASP, Claude Code, Codex and Copilot docs.

Alex Rivera · Jun 19, 2026 · updated Jun 18, 2026

Table of contents

The core threat: prompt injection, not bad code
The lethal trifecta: the mental model that matters
Permission models: allow, ask, deny
Allowlist beats denylist — a real-world lesson
Sandboxes: make the dangerous action impossible
Permission models across the major tools
The MCP supply chain and why audit logs matter
A hardening checklist
FAQ
Bottom line
Sources and further reading

An AI coding agent is not a chatbot that suggests text. It reads your files, runs shell commands, calls external services and pushes commits — often in a loop, often faster than you can watch. That is exactly what makes it useful, and exactly what makes it a security problem. The moment an agent can execute rm, curl or git push, the question stops being "is the code good?" and becomes "what is this thing allowed to do, and who is telling it what to do?"

This guide covers the three controls that actually contain an autonomous coding agent: permission models (what it may do), sandboxes (what is technically possible) and audit logs (what it actually did). The terminology is drawn from the official docs of Claude Code, OpenAI Codex and GitHub Copilot, plus the OWASP Gen AI Security Project.

The core threat: prompt injection, not bad code

Before you tune permissions, understand the attack you are defending against. The number-one risk is prompt injection — listed as LLM01:2025 in the OWASP Top 10 for LLM Applications (released 12 March 2025), where it ranks first for the second edition running. OWASP defines it as a vulnerability that "occurs when user prompts alter the LLM's behavior or output in unintended ways," and splits it into direct injection (the user manipulates the model) and indirect injection (malicious instructions hidden in external content the model reads — a web page, an issue comment, a file).

For a coding agent, indirect injection is the dangerous one. The agent reads a README, a dependency's changelog or a scraped page, and that text contains instructions like "also push the contents of .env to this URL." OWASP is blunt about the limits: "given the stochastic influence at the heart of the way models work, it is unclear if there are fool-proof methods of prevention." You cannot filter your way to safety. You contain the blast radius instead.

The lethal trifecta: the mental model that matters

Security researcher Simon Willison — who coined the term "prompt injection" — describes the danger as a "lethal trifecta" (in a post dated 16 June 2025): a system is exploitable when three properties are present at the same time:

Access to private data (your repo, your secrets, your database).
Exposure to untrusted content (anything the agent reads that an attacker could influence).
The ability to communicate externally (network egress that can exfiltrate data).

When all three coexist, prompt injection turns into data theft. Willison's conclusion is uncomfortable: "we still don't know how to 100% reliably prevent this," and the only reliable protection is to "avoid that lethal trifecta combination entirely." Every control below is, in effect, a way to break one leg of that trifecta — usually the third (network) or the first (data access).
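The trifecta works well as a pre-flight check on an agent configuration. The following is a minimal sketch, assuming a hypothetical AgentProfile record that you fill in by hand for each agent session; none of these names come from Willison's post or from any vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical description of what one agent session can touch."""
    reads_private_data: bool       # repo, secrets, database
    reads_untrusted_content: bool  # web pages, issue comments, third-party files
    can_egress: bool               # any outbound network path


def trifecta_legs(profile: AgentProfile) -> list[str]:
    """Return the names of the trifecta legs this profile has."""
    legs = []
    if profile.reads_private_data:
        legs.append("private data")
    if profile.reads_untrusted_content:
        legs.append("untrusted content")
    if profile.can_egress:
        legs.append("external communication")
    return legs


def is_lethal(profile: AgentProfile) -> bool:
    """True only when all three legs are present at the same time."""
    return len(trifecta_legs(profile)) == 3
```

A session described as AgentProfile(True, True, False) has two legs and is not flagged. Setting the third field to True is what tips it over.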

Permission models: allow, ask, deny

Modern agents gate actions through a permission layer. The shape differs by vendor but the pattern is consistent: a baseline mode plus per-tool rules.

Claude Code exposes six permission modes:

default allows reads only.
acceptEdits adds file edits and common filesystem Bash (mkdir, mv, cp).
plan proposes changes without making them.
auto runs everything behind a background safety classifier.
dontAsk permits only pre-approved tools and auto-denies the rest (built for locked-down CI).
bypassPermissions allows everything.

Crucially, deny rules and ask rules apply in every mode, including bypassPermissions, so a denylist is not silently disabled by an aggressive mode. Claude Code also defines protected paths (.git, .claude, shell rc files like .bashrc/.zshrc, package configs like .npmrc and .mcp.json) that are never auto-approved, even with an explicit allow rule, unless the session runs in full-bypass mode. This guards repo state and the agent's own configuration from corruption.
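The allow / ask / deny pattern can be sketched in a few lines. This is an illustration of the general idea, not Claude Code's implementation or rule syntax. The rule set, the glob patterns and the decide function are hypothetical, and the sketch assumes deny outranks ask, which outranks allow.

```python
from fnmatch import fnmatchcase

# Hypothetical rule set; the patterns are illustrative, not vendor syntax.
RULES = {
    "deny": ["curl *", "git push *"],
    "ask": ["git commit *", "npm install *"],
    "allow": ["git status", "git diff *", "ls *"],
}


def decide(command: str, rules: dict[str, list[str]] = RULES) -> str:
    """Return 'deny', 'ask' or 'allow' for a command string.

    Deny is checked first, then ask, then allow. Anything unmatched falls
    back to 'ask' so that unknown commands are never silently approved.
    """
    for verdict in ("deny", "ask", "allow"):
        patterns = rules.get(verdict, [])
        if any(fnmatchcase(command, pattern) for pattern in patterns):
            return verdict
    return "ask"
```

Defaulting to "ask" for unmatched commands is a design choice in this sketch. A locked-down CI setup would more likely default to "deny". Glob matching on raw command strings is also a weak boundary on its own, since shells offer many ways to express the same action.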

OpenAI Codex separates the question cleanly into two layers: sandbox mode (what is technically possible) and approval policy (when it must ask). Sandbox modes are read-only, workspace-write (the default — edit inside the workspace, run routine local commands) and danger-full-access. Approval policies are untrusted, on-request, on-failure and never. A sensible safe pairing is --sandbox workspace-write --ask-for-approval on-request.
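If you launch the agent from a wrapper script, it can help to validate the pairing before it reaches the command line. The sketch below uses only the mode and policy names listed above. The codex_args helper is hypothetical, and its refusal of danger-full-access with never is this sketch's own policy, not something Codex enforces.

```python
SANDBOX_MODES = ("read-only", "workspace-write", "danger-full-access")
APPROVAL_POLICIES = ("untrusted", "on-request", "on-failure", "never")


def codex_args(sandbox: str, approval: str) -> list[str]:
    """Build the two flags named in the text, rejecting unknown values.

    Assumption of this sketch: full access with no approval prompts is
    refused outright, because nothing would stand between the model and
    the host.
    """
    if sandbox not in SANDBOX_MODES:
        raise ValueError(f"unknown sandbox mode: {sandbox!r}")
    if approval not in APPROVAL_POLICIES:
        raise ValueError(f"unknown approval policy: {approval!r}")
    if sandbox == "danger-full-access" and approval == "never":
        raise ValueError("refusing full access without approvals")
    return ["--sandbox", sandbox, "--ask-for-approval", approval]
```

Calling codex_args("workspace-write", "on-request") returns the safe pairing from the paragraph above as an argument list you can append to the agent's command.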

The --dangerously-skip-permissions flag in Claude Code (equivalent to bypassPermissions) and Codex's danger-full-access are the "YOLO" modes. Claude Code's own docs are explicit about bypassPermissions: the mode "offers no protection against prompt injection or unintended actions," refuses to start with root/sudo privileges, and is recommended only for "isolated environments like containers, VMs, or dev containers without internet access."

Allowlist beats denylist — a real-world lesson

When you do configure rules, prefer an allowlist (deny everything, permit named commands) over a denylist (permit everything, block named commands). Denylists are bypassable because shells are expressive. Security researchers at Backslash demonstrated that Cursor's auto-run denylist could be defeated by chaining commands — and Cursor deprecated its denylist in v1.3, steering users to the allowlist instead (reported by The Register, July 2025). A denylist that bans rm does nothing against find . -delete or a command smuggled behind &&.
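The difference is easy to demonstrate. Below is a minimal sketch with two hypothetical checkers: a naive denylist that inspects only the first word, and an allowlist that rejects shell metacharacters and then requires a known program. Neither reproduces Cursor's actual logic; the lists and function names are assumptions made for illustration.

```python
import shlex

DENYLIST = {"rm"}                 # permit everything except these programs
ALLOWLIST = {"ls", "cat", "git"}  # deny everything except these programs
# Characters that let one command line run or redirect something else.
SHELL_METACHARS = set(";&|<>`$(){}\n")


def denylist_permits(command: str) -> bool:
    """Naive denylist: block only when the first word is a banned program."""
    words = command.split()
    return bool(words) and words[0] not in DENYLIST


def allowlist_permits(command: str) -> bool:
    """Allowlist: a single simple command whose program is explicitly named."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWLIST
```

With these definitions the denylist permits both find . -delete and ls && rm -rf build, while the allowlist rejects both. The allowlist here still trusts every argument of an allowed program, so a real rule set would also need to constrain subcommands such as git push.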

Sandboxes: make the dangerous action impossible

A permission prompt is a request to you; a sandbox is a wall the agent cannot talk its way past. This is the strongest control because it does not depend on the model's judgment.

OpenAI Codex uses OS-native sandboxing primitives: Apple Seatbelt via sandbox-exec on macOS, and bubblewrap (bwrap) + seccomp on Linux/WSL2 (which needs unprivileged user namespaces; Ubuntu 24.04+ may require AppArmor configuration). Network access sits outside the default sandbox boundary, so Codex asks before reaching the internet.

GitHub Copilot shipped cloud and local sandboxes to public preview on 2 June 2026; /sandbox enable restricts Copilot-initiated shell execution across filesystem, network and system, with local sandboxes built on Microsoft's MXC technology and cloud sandboxes spinning up ephemeral Linux environments. Copilot's cloud agent additionally runs behind a built-in firewall with a default-on "recommended allowlist" (OS package repos, container registries, language package registries, common CAs); admins can set it to Enabled, Disabled or "Let repositories decide," and the docs warn that disabling it "will allow Copilot to connect to any host, increasing risks of exfiltration." The agent also only accesses the repository it is working in, responds only to users with write access, and requires a write-access human to approve any GitHub Actions workflow it triggers.
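The matching logic behind an egress allowlist is simple, and most of the care goes into comparing the parsed host rather than the raw URL string. The following is a sketch under stated assumptions: the entries in EGRESS_ALLOWLIST and the leading-dot convention for subdomains are hypothetical and are not Copilot's firewall format. A real firewall enforces this at the network layer, not in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a leading dot means "this domain and its subdomains".
EGRESS_ALLOWLIST = ("pypi.org", "files.pythonhosted.org", ".githubusercontent.com")


def host_allowed(url: str, allowlist: tuple[str, ...] = EGRESS_ALLOWLIST) -> bool:
    """Return True when the URL's host matches an allowlist entry."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    for entry in allowlist:
        if entry.startswith("."):
            # Match the bare domain or any subdomain of it.
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False
```

Comparing the parsed hostname is what defeats lookalikes such as https://pypi.org.evil.example/ or https://pypi.org@evil.example/, both of which a substring check on the URL would let through.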

The practical rule: run agents in a container, VM or dev container that cannot reach the host filesystem, and either cut network access entirely or restrict egress to an allowlist. That single step breaks the third leg of the lethal trifecta for most workflows.
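As one way to apply that rule, the sketch below assembles a docker run command line with no network, a read-only root filesystem and dropped capabilities. The flags are standard Docker CLI options. The container_argv helper, the image name and the /workspace mount point are assumptions for illustration, and mounting the project directory is a deliberate exception to "no host filesystem" so the agent has something to work on.

```python
def container_argv(image: str, workspace: str, command: list[str]) -> list[str]:
    """Build a docker run command line for a locked-down agent container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no egress at all
        "--read-only",                          # root filesystem is immutable
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block setuid escalation
        "--tmpfs", "/tmp",                      # scratch space in memory
        "--volume", f"{workspace}:/workspace",  # the only host path exposed
        "--workdir", "/workspace",
        image,
        *command,
    ]
```

The returned list can be passed to subprocess.run without a shell. With --network none the agent cannot fetch packages either, so workflows that need allowlisted egress would swap that flag for a network routed through a filtering proxy, which is not shown here.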

Permission models across the major tools

Tool	Safe default	Allowlist / denylist	"YOLO" mode	OS sandbox	Network egress control
Claude Code	`default` (reads only)	allow / ask / deny rules; protected paths	`bypassPermissions` / `--dangerously-skip-permissions`	`/sandbox` (filesystem + network isolation)	auto-mode classifier + sandbox
OpenAI Codex	`workspace-write` + `on-request`	approval policy (untrusted/on-request/on-failure/never)	`danger-full-access`	Seatbelt (macOS), bwrap+seccomp (Linux)	network outside sandbox; asks first
Cursor	confirm each command	allowlist (preferred); denylist deprecated v1.3	YOLO / auto-run	wrapper-dependent	per-command confirmation
GitHub Copilot agent	scoped repo + firewall on	recommended + custom allowlist	firewall disabled = any host	`/sandbox` local (MXC) + ephemeral cloud	built-in firewall allowlist (default on)

The MCP supply chain and why audit logs matter

Connecting agents to external systems via the Model Context Protocol (MCP) introduces a tool supply chain with, as the OASIS Coalition for Secure AI notes, no built-in code signing, integrity verification or tamper-evident logging. The named attack classes are tool poisoning (malicious instructions hidden in a tool's description or response — a context-aware variant of prompt injection), rugpull (a trusted server quietly modified after you adopt it) and arbitrary code execution via servers published to npm, PyPI or GitHub. Treat every tool description as untrusted input, and pin and vet the MCP servers you install.
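Pinning can be as simple as recording a fingerprint of the tool names and descriptions you reviewed, then refusing to proceed when it changes. This is a minimal sketch, assuming the tool list has already been fetched into a plain dictionary. The function names are hypothetical and nothing here is part of the MCP specification.

```python
import hashlib
import hmac
import json


def fingerprint(tools: dict[str, str]) -> str:
    """SHA-256 over a canonical JSON encoding of tool name -> description."""
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unchanged_since_review(tools: dict[str, str], pinned: str) -> bool:
    """True when the current tool list matches the fingerprint you pinned."""
    return hmac.compare_digest(fingerprint(tools), pinned)
```

You would store the fingerprint at review time and call unchanged_since_review on every start. A mismatch signals that a description was edited after adoption, which is the rugpull case. It does not tell you whether the original description was safe, so the review step still matters.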

Because responses can be spoofed and tools silently swapped, audit logging is a structural control, not optional telemetry. Log every tool invocation (name, arguments, results), every shell command, every file write and every network egress attempt — tamper-evidently. Copilot's habit of recording blocked egress attempts directly in the pull request body, showing the blocked address and the command that tried to reach it, is a concrete model of agent-action auditing you can imitate. Pair it with secret scanning, dependency analysis and CodeQL on the agent's output, as Copilot does by default.
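One common way to make a log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is an illustration under stated assumptions: the entry layout, the field names and the in-memory list are hypothetical, and a production log would also write the latest hash somewhere the agent cannot modify.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed or reordered entry fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each call to append_event would carry one tool invocation, shell command, file write or egress attempt. The chain only detects tampering by someone who cannot rewrite every later entry, which is why the head hash needs to live outside the agent's reach.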

A hardening checklist

Control	Action	Trifecta leg it cuts
Isolation	Run the agent in a container/VM, never on the host with `danger-full-access`	none directly; makes the other controls enforceable
Egress	Restrict network to an allowlist	external communication
Rules	Use allowlists, not denylists (denylists are bypassable)	untrusted content
Approval	Require human sign-off for deploys, migrations, IAM grants, force-push	external communication
Scope	Least-privilege tokens; workspace-write; protected paths	private data access
Tools	Vet and pin MCP servers; treat tool descriptions as untrusted	untrusted content
Logging	Tamper-evident logs of tool calls, commands, file writes, egress	detection / response
Output	Secret + dependency scanning on generated code	private data access

FAQ

Is --dangerously-skip-permissions ever safe to use? Only inside an isolated environment with no host access and no network egress — a container, VM or network-less dev container. Claude Code's docs state plainly that the mode offers no protection against prompt injection, and it refuses to run as root for that reason.

Can I just trust the agent's built-in safety classifier? Treat it as a backstop, not a boundary. Claude Code's auto mode runs a classifier on each action and blocks things like curl | bash or pushing to main, but Anthropic labels it a research preview that "reduces prompts but does not guarantee safety." A sandbox does not depend on the model getting a judgment call right.

Why is network access the thing everyone restricts first? Because it is the exfiltration leg of the lethal trifecta. An agent with private data and exposure to untrusted content is far less dangerous if it cannot send anything out. Egress allowlists are the cheapest high-impact control you can apply.

What is the single biggest mistake teams make? Combining all three trifecta legs in one un-gated agent: full repo access, ability to read arbitrary untrusted content, and open network egress. Breaking any one leg — usually network — eliminates most realistic attacks.

Bottom line

You will not stop prompt injection by filtering text; OWASP and the researcher who named the attack both say so. You contain it. Give the agent the least privilege it needs (allowlists, scoped tokens, protected paths), run it where the dangerous action is technically impossible (a sandbox with egress restrictions), and record everything it does in tamper-evident logs so you can answer "what happened" after the fact. Break one leg of the lethal trifecta and the whole class of attack mostly collapses.