A mental model for agent harnesses
Most confusion around AI coding tools comes from mixing layers.
The same names often point at different layers. "Claude" might mean a model such as Claude Opus, the Claude Code CLI, a Slack bot built on Claude, or an Anthropic API call. "Codex" can mean a model family, a product surface, or a coding-agent experience. MCP is neither a model nor an agent; it is a protocol for exposing tools and context. Once you split the stack into inference runtime, provider protocol, auth surface, and agent harness, the system becomes much easier to reason about.
The shortest useful model is this:
- Model runtime: predicts tokens.
- Provider API: accepts structured requests and returns structured events.
- Harness: decides context, tools, loop, approvals, memory, and execution.
- Auth surface: decides which endpoint and entitlement path you can use.
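The four layers can also be written down as interfaces. This is a minimal sketch for orientation, not any library's API: every name here (`ModelRuntime`, `ProviderApi`, `AuthSurface`, `Harness`, `createHarness`, and the stub values) is hypothetical, and the endpoint is a placeholder.

```typescript
// Hypothetical interfaces, one per layer. None of these names come from a real SDK.
interface ModelRuntime {
  // Predicts the next token ID from the token IDs seen so far.
  nextToken(tokens: number[]): number
}

interface AuthSurface {
  // Decides which endpoint and credential headers a request may use.
  endpoint: string
  headers: Record<string, string>
}

interface ProviderApi {
  // Accepts a structured request and returns structured events.
  send(request: { messages: string[] }, auth: AuthSurface): { type: string; text: string }[]
}

interface Harness {
  // Decides context, tools, loop, approvals, memory, and execution.
  run(task: string): string
}

// Stub provider that echoes the last message back as a single text event.
const echoProvider: ProviderApi = {
  send: (request) => [
    { type: 'text', text: `echo: ${request.messages[request.messages.length - 1]}` },
  ],
}

// Placeholder auth surface; the endpoint and header values are made up.
const apiKeyAuth: AuthSurface = {
  endpoint: 'https://example.invalid/v1',
  headers: { Authorization: 'Bearer <api key>' },
}

// A one-turn harness: it owns the message list and picks the provider and auth path.
function createHarness(provider: ProviderApi, auth: AuthSurface): Harness {
  return {
    run(task) {
      const events = provider.send({ messages: [task] }, auth)
      return events.map((event) => event.text).join('')
    },
  }
}

const stubHarness = createHarness(echoProvider, apiKeyAuth)
const stubReply = stubHarness.run('hello') // 'echo: hello'
```

The point of the split is that the harness can swap the provider or the auth surface without changing what it decides about context and tools.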
Where each thing sits
Lower layers are raw capability. Upper layers decide behavior, context, tools, safety, session shape, and UX.
Product UI
The native app, web UI, CLI, or IDE surface where users see sessions, logs, diffs, files, tools, approvals, and final output.
Provider binaries
Claude Code, Codex, and similar binaries bundle auth, prompt, tools, session behavior, and execution semantics.
OpenCode
OpenCode is a third-party harness with many providers. More open, but your product still adapts to its prompt/tool/session model.
Pi
Pi is a minimal customizable harness. Prompt, tools, extensions, compaction, sessions, RPC, and SDK are meant to be changed.
Custom harness
Your application owns prompt, tools, workers, compaction, session tree, and event stream. Providers become transports.
Subscriptions / OAuth
Entitlement path for ChatGPT Plus/Pro, Claude Pro/Max, Copilot, etc. Useful when users already pay, but product dialect constraints leak into adapters.
API keys / direct billing
Clean for custom harnesses, but usually metered token billing. This path has fewer product-protocol constraints.
Two loops, not one
One loop generates tokens. The other loop decides what to do with model output. Tool use, MCP, files, shell, approvals, and retries live in the agent loop.
Inference loop
Owned by OpenAI, Anthropic, Ollama, vLLM, llama.cpp, or another model runtime. It is one model turn.
prompt_tokens = tokenize(rendered_prompt)
new_tokens = []
while True:
    logits = model.forward(prompt_tokens + new_tokens)
    next_token = sampler(logits)
    new_tokens.append(next_token)
    if next_token == EOS or hit_stop_sequence(new_tokens) or len(new_tokens) >= max_tokens:
        break
return detokenize(new_tokens)
Agent loop
Owned by the harness: OpenCode, Claude Code, Codex, TanStack AI, Pi, or your own app. It is many model turns plus effects.
messages = [user_task]
for turn in range(max_turns):
    output = run_inference(messages, tools)
    if output.tool_call:
        result = execute_tool(output.tool_call)
        messages += [output.tool_call, result]
        continue
    return output.final_text
raise TurnLimitExceeded  # no final answer within max_turns
Providers can blur this boundary with server-side tools, managed agents, or built-in code execution. But client-side tool execution still needs a harness loop.
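To make the agent loop concrete, here is a small runnable sketch with the model replaced by a scripted stand-in. All names (`fakeInference`, `agentLoop`, the message and tool shapes) are hypothetical and chosen for illustration; a real harness would call a provider where `fakeInference` is.

```typescript
type ToolCall = { name: string; args: Record<string, string> }
type Message = { role: 'user' | 'assistant' | 'tool'; content: string }
type ModelOutput = { kind: 'tool_call'; call: ToolCall } | { kind: 'final'; text: string }
type Tool = (args: Record<string, string>) => string

// Stand-in for one inference turn: asks for a tool first, then answers
// once a tool result is present in the message history.
function fakeInference(messages: Message[]): ModelOutput {
  const toolResult = messages.find((m) => m.role === 'tool')
  if (!toolResult) {
    return { kind: 'tool_call', call: { name: 'read_file', args: { path: 'notes.txt' } } }
  }
  return { kind: 'final', text: `Summary: ${toolResult.content}` }
}

// The agent loop: many model turns plus tool effects, bounded by maxTurns.
function agentLoop(task: string, tools: Record<string, Tool>, maxTurns = 5): string {
  const messages: Message[] = [{ role: 'user', content: task }]
  for (let turn = 0; turn < maxTurns; turn += 1) {
    const output = fakeInference(messages)
    if (output.kind === 'final') return output.text
    const tool = tools[output.call.name]
    // Unknown tools are reported back to the model instead of crashing the loop.
    const result = tool ? tool(output.call.args) : `error: unknown tool ${output.call.name}`
    messages.push({ role: 'assistant', content: JSON.stringify(output.call) })
    messages.push({ role: 'tool', content: result })
  }
  throw new Error(`no final answer within ${maxTurns} turns`)
}

const demoTools: Record<string, Tool> = {
  read_file: (args) => `contents of ${args['path']}`,
}
const demoAnswer = agentLoop('Summarize notes.txt', demoTools)
// demoAnswer === 'Summary: contents of notes.txt'
```

The inference stand-in never touches the file system; only the loop does, which is the boundary the two-loop model is drawing.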
Everything becomes tokens, eventually
For the text path, a model does not directly see JavaScript objects, JSON objects, or HTTP requests. The runtime first turns text into tokens: small chunks such as words, word pieces, punctuation, or special control markers. Each token has a numeric ID in the model's vocabulary. Those numbers are what the model receives.
Text:
"The capital of France is"
One possible tokenization:
["The", " capital", " of", " France", " is"]
Token IDs sent to the model:
[791, 6864, 315, 9822, 374]
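The text-to-ID step can be sketched with a toy tokenizer. This is an illustration only: the vocabulary holds just the five pieces from the example above, and the greedy longest-match strategy is a simplification, not the merge-based procedure real tokenizers use. The names `toyVocab`, `toyTokenize`, and `toyDetokenize` are hypothetical.

```typescript
// Toy vocabulary: just the five pieces from the example above.
// Real vocabularies hold tens of thousands of entries learned from data.
const toyVocab = new Map<string, number>([
  ['The', 791],
  [' capital', 6864],
  [' of', 315],
  [' France', 9822],
  [' is', 374],
])

// Greedy longest-match tokenizer: at each position, take the longest known piece.
function toyTokenize(text: string): number[] {
  const ids: number[] = []
  let position = 0
  while (position < text.length) {
    let matched = false
    for (let end = text.length; end > position; end -= 1) {
      const id = toyVocab.get(text.slice(position, end))
      if (id !== undefined) {
        ids.push(id)
        position = end
        matched = true
        break
      }
    }
    if (!matched) throw new Error(`no token for text at position ${position}`)
  }
  return ids
}

// Inverse lookup so IDs can be turned back into text.
const toyPieces = new Map<number, string>()
toyVocab.forEach((id, piece) => toyPieces.set(id, piece))

function toyDetokenize(ids: number[]): string {
  return ids.map((id) => toyPieces.get(id) ?? '').join('')
}

const tokenIds = toyTokenize('The capital of France is') // [791, 6864, 315, 9822, 374]
const roundTrip = toyDetokenize(tokenIds) // 'The capital of France is'
```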
An API request may start as structured data:
{
  "role": "tool",
  "content": "test failed: expected 4 got 5"
}
Before inference, the provider or local runtime renders that into a model-specific format, conceptually something like:
<|tool|>
test failed: expected 4 got 5
<|end|>
<|assistant|>
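A hedged sketch of that rendering step, assuming the made-up `<|role|>` and `<|end|>` markers shown above. Real models each define their own chat template, so `renderPrompt` and `ChatMessage` are hypothetical names, not a provider's implementation.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string }

// Renders messages into a made-up template using the <|role|> ... <|end|>
// markers from the example above.
function renderPrompt(messages: ChatMessage[]): string {
  const rendered = messages.map((m) => `<|${m.role}|>\n${m.content}\n<|end|>\n`)
  // The trailing assistant marker cues the model to write the next assistant turn.
  return rendered.join('') + '<|assistant|>\n'
}

const renderedPrompt = renderPrompt([
  { role: 'tool', content: 'test failed: expected 4 got 5' },
])
```

The structured fields disappear at this point: the role becomes a control marker and the content becomes plain text in the same string.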
Then a tokenizer converts it into token IDs. The model computes raw scores for the next token. Those scores are called logits.
Prompt: "The capital of France is"
Raw next-token logits:
Paris: 8.2
London: 2.1
banana: -4.0
.: 0.3
The runtime turns logits into probabilities, samples a token, appends it, and repeats.
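The logits-to-token step can be sketched with a softmax and a sampler, using the four example logits above. This is a minimal illustration: `softmax` and `sampleIndex` are hypothetical helper names, and real runtimes add options such as top-k or top-p filtering that are left out here.

```typescript
// Convert raw logits into probabilities with a temperature-scaled softmax.
function softmax(logits: number[], temperature = 1): number[] {
  const scaled = logits.map((logit) => logit / temperature)
  const max = Math.max(...scaled) // subtract the max for numerical stability
  const exps = scaled.map((value) => Math.exp(value - max))
  const total = exps.reduce((sum, value) => sum + value, 0)
  return exps.map((value) => value / total)
}

// Sample an index from a probability distribution.
// `random` is injectable so the example stays deterministic in tests.
function sampleIndex(probabilities: number[], random: () => number = Math.random): number {
  const threshold = random()
  let cumulative = 0
  for (let index = 0; index < probabilities.length; index += 1) {
    cumulative += probabilities[index] ?? 0
    if (threshold < cumulative) return index
  }
  return probabilities.length - 1 // guard against floating-point rounding
}

const candidates = ['Paris', 'London', 'banana', '.']
const exampleLogits = [8.2, 2.1, -4.0, 0.3]
const exampleProbabilities = softmax(exampleLogits)

// Greedy decoding always takes the most likely token.
const greedyChoice = candidates[exampleProbabilities.indexOf(Math.max(...exampleProbabilities))] // 'Paris'
// Sampling draws from the distribution; with a fixed draw of 0.5 it also lands on 'Paris'.
const sampledChoice = candidates[sampleIndex(exampleProbabilities, () => 0.5)] // 'Paris'
```

Raising the temperature flattens the distribution, so lower-scored tokens such as "London" get picked more often.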
There are caveats. Images and audio may become embeddings or multimodal tokens rather than plain text. Cloud providers may add hidden system/tool scaffolding. Strict structured outputs can use constrained decoding, where invalid next tokens are masked. But for chat, tool calls, and reasoning text, the useful mental model remains: every next token the model predicts depends on the token sequence it has received so far.
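The constrained-decoding caveat can be sketched as a mask applied to the logits before a token is chosen. This is a toy under stated assumptions: the four-entry vocabulary, the logits, and the "a JSON object must start with `{`" rule are all invented for illustration, and `maskLogits` and `argmax` are hypothetical helpers.

```typescript
// Mask logits for tokens that are not allowed next.
// Setting a logit to -Infinity gives it zero probability after softmax.
function maskLogits(logits: number[], allowed: boolean[]): number[] {
  return logits.map((logit, index) => (allowed[index] ? logit : -Infinity))
}

// Index of the largest value; ties resolve to the earliest index.
function argmax(values: number[]): number {
  let best = 0
  values.forEach((value, index) => {
    if (value > (values[best] ?? -Infinity)) best = index
  })
  return best
}

// Toy vocabulary and logits: unconstrained, the model prefers the word "Sure".
const toyVocabulary = ['Sure', '{', '"', '}']
const rawLogits = [5.0, 3.0, 1.0, 0.5]

// Hypothetical grammar state: a JSON object must start with "{".
const allowedAtStart = toyVocabulary.map((token) => token === '{')

const unconstrainedToken = toyVocabulary[argmax(rawLogits)] // 'Sure'
const constrainedToken = toyVocabulary[argmax(maskLogits(rawLogits, allowedAtStart))] // '{'
```

The model's scores are unchanged; the runtime simply refuses to pick tokens the grammar rules out at this position.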
What actually happens on a prompt
Example: a harness sends requests to an OpenAI-hosted model and exposes MCP tools. The model sees rendered tokens; the harness owns execution and reprompting. The loop shape below is distilled from TanStack AI, but the important part is the runtime behavior, not the library names.
1. User sends a prompt
The product user only sends the prompt. The harness, written by the product developer, decides which tools are available.
const messages = [
  {
    role: 'user',
    content: 'Read src/app/page.tsx and summarize it.',
  },
]
await runAgentLoop({
  model: 'gpt-5.5',
  messages,
  tools: [readFileTool],
})
2. MCP tools become executable harness tools
For this read_file example, the schema tells the model that a valid request needs a path, such as "src/app/page.tsx". The execute function is not shown to the model; the harness runs it after validating the tool request.
const mcpTool = {
  name: 'read_file',
  description: 'Read a UTF-8 file from the workspace.',
  inputSchema: {
    type: 'object',
    properties: { path: { type: 'string' } },
    required: ['path'],
  },
}
const readFileTool = {
  name: mcpTool.name,
  description: mcpTool.description,
  inputSchema: mcpTool.inputSchema,
  execute: async ({ path }, { abortSignal }) => {
    return await mcp.callTool(
      { name: 'read_file', arguments: { path } },
      { signal: abortSignal },
    )
  },
}
3. Chat engine starts the agent loop
The harness alternates tool execution and model turns. The call to askModel({ messages, tools }) is the handoff to the next step.
const maxTurns = 5
let pendingToolCalls = []
for (let turn = 0; turn < maxTurns; turn += 1) {
  if (pendingToolCalls.length > 0) {
    await runToolsAndAppendResults(pendingToolCalls, messages)
  }
  const modelTurn = await askModel({ messages, tools })
  if (modelTurn.finalAnswer) return modelTurn.finalAnswer
  pendingToolCalls = modelTurn.toolCalls
}
4. Adapter prepares one provider turn
askModel does not stream immediately. It first converts harness state into the provider's request shape: messages plus provider-formatted tool contracts.
async function askModel({ messages, tools }) {
  const providerTools = tools.map(toProviderTool)
  const request = toProviderRequest({
    model: 'gpt-5.5',
    messages,
    tools: providerTools,
  })
  return streamProviderTurn(request)
}
5. Adapter converts the tool contract
This is the useful adapter move: an executable local tool becomes a provider-visible contract. The model can request read_file with arguments matching the schema; later, the harness maps that request back to readFileTool.
function toProviderTool(tool) {
  return {
    type: 'function',
    name: tool.name,
    description: tool.description,
    parameters: tool.inputSchema,
  }
}
const openAITool = toProviderTool(readFileTool)
// openAITool now has this shape (the execute function is left behind):
// {
//   type: 'function',
//   name: 'read_file',
//   description: 'Read a UTF-8 file from the workspace.',
//   parameters: {
//     type: 'object',
//     properties: { path: { type: 'string' } },
//     required: ['path'],
//   },
// }
6. Request streams to the provider
Now the request leaves the harness. The provider runs one inference turn. Internally, structured fields become model-specific token/control-token context, plus any hidden provider scaffolding.
const request = {
  model: 'gpt-5.5',
  instructions: 'system/developer prompts...',
  input: [
    { role: 'user', content: 'Read src/app/page.tsx and summarize it.' }
  ],
  tools: [
    {
      type: 'function',
      name: 'read_file',
      description: 'Read a UTF-8 file from the workspace.',
      parameters: {
        type: 'object',
        properties: { path: { type: 'string' } },
        required: ['path'],
      }
    }
  ],
  stream: true,
}
for await (const event of provider.stream(request)) {
  handleProviderEvent(event)
}
7. Provider adapter normalizes stream events
This happens in the adapter/harness layer, not inside the model. The provider emits raw stream events; the adapter maps them into the harness's event vocabulary.
if (chunk.type === 'response.output_text.delta') {
  yield { type: EventType.TEXT_MESSAGE_CONTENT, delta: chunk.delta }
}
if (chunk.type === 'response.function_call_arguments.done') {
  yield {
    type: EventType.TOOL_CALL_END,
    toolCallId: chunk.item_id,
    toolName: metadata.name,
    input: JSON.parse(chunk.arguments),
  }
}
if (chunk.type === 'response.reasoning_text.delta') {
  yield { type: EventType.STEP_FINISHED, stepType: 'thinking', delta: chunk.delta }
}
8. Harness executes tool and reprompts
If the model requested a tool, the harness appends the assistant tool-call message, executes the matching tool, appends the trusted result, and starts another inference turn.
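const toolCall = {
  id: 'call_123',
  name: 'read_file',
  arguments: { path: 'src/app/page.tsx' },
}
// Record the model's request first so the history stays in order.
messages.push({ role: 'assistant', toolCall })
// execute destructures a second options argument, so always pass one.
const controller = new AbortController()
const result = await readFileTool.execute(toolCall.arguments, {
  abortSignal: controller.signal,
})
messages.push({
  role: 'tool',
  toolCallId: toolCall.id,
  content: result,
})
await runNextModelTurn({ messages, tools })
[Sequence diagram: the same flow, summarized.]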
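Steps 2, 5, and 8 rely on the harness validating a tool request and mapping it back to an executable tool. A minimal sketch of that dispatch step follows, assuming only the small schema subset used in this walkthrough (an object with typed properties and a `required` list). The names `validateArguments`, `dispatchToolCall`, and `fakeReadFile` are hypothetical, and the fake tool returns a fixed string instead of calling MCP.

```typescript
type JsonSchema = {
  type: 'object'
  properties: Record<string, { type: 'string' | 'number' | 'boolean' }>
  required: string[]
}
type HarnessTool = {
  name: string
  inputSchema: JsonSchema
  execute: (args: Record<string, unknown>) => string
}
type ToolRequest = { id: string; name: string; arguments: Record<string, unknown> }
type ToolResultMessage = { role: 'tool'; toolCallId: string; content: string }

// Minimal check: required keys must be present, and declared properties
// must have the right primitive type. Not a full JSON Schema validator.
function validateArguments(schema: JsonSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = []
  for (const key of schema.required) {
    if (!(key in args)) problems.push(`missing required argument "${key}"`)
  }
  for (const key of Object.keys(schema.properties)) {
    const expected = schema.properties[key]?.type
    if (key in args && typeof args[key] !== expected) {
      problems.push(`argument "${key}" should be a ${expected}`)
    }
  }
  return problems
}

// Look up the requested tool by name, validate, execute, and wrap the outcome
// as a tool message. Failures go back to the model as text rather than throwing.
function dispatchToolCall(request: ToolRequest, tools: HarnessTool[]): ToolResultMessage {
  const tool = tools.find((candidate) => candidate.name === request.name)
  let content: string
  if (!tool) {
    content = `error: unknown tool "${request.name}"`
  } else {
    const problems = validateArguments(tool.inputSchema, request.arguments)
    content = problems.length > 0 ? `error: ${problems.join('; ')}` : tool.execute(request.arguments)
  }
  return { role: 'tool', toolCallId: request.id, content }
}

// Stand-in for the read_file tool; it returns a fixed string instead of reading disk.
const fakeReadFile: HarnessTool = {
  name: 'read_file',
  inputSchema: { type: 'object', properties: { path: { type: 'string' } }, required: ['path'] },
  execute: (args) => `contents of ${String(args['path'])}`,
}

const validResult = dispatchToolCall(
  { id: 'call_123', name: 'read_file', arguments: { path: 'src/app/page.tsx' } },
  [fakeReadFile],
)
const invalidResult = dispatchToolCall(
  { id: 'call_124', name: 'read_file', arguments: {} },
  [fakeReadFile],
)
```

Returning validation errors as tool messages lets the model correct its own request on the next turn instead of ending the session.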
Where chain of thought lives
Chain of thought is not one clean layer. There is private model reasoning, visible reasoning text, provider summaries, and harness policy about what to store or show.
Local model / Ollama
If the model is trained to emit reasoning tags, the inference loop can generate those tokens like any other text. The runtime may stream, split, or hide them.
<|user|>
Solve 2+2.
<|assistant|>
<think>
Need add 2 and 2. Result 4.
</think>
The answer is 4.
Cloud provider
Providers often keep full chain of thought private. APIs may expose summaries, reasoning token counts, encrypted continuity blocks, or no reasoning.
request.reasoning = { effort: 'high', summary: 'auto' }
stream events may include:
- response.reasoning_summary_text.delta
- encrypted reasoning continuity
- usage.reasoning_tokens
full hidden chain of thought: provider-private
External deliberation
The harness can create visible planning by asking the model to plan, critique, call tools, revise, and verify across multiple turns.
messages.push({ role: 'system', content: 'First produce a short plan.' })
const plan = await runInference(messages)
messages.push({ role: 'assistant', content: plan })
messages.push({ role: 'system', content: 'Critique the plan against repo evidence.' })
const critique = await runInference(messages)
messages.push({ role: 'assistant', content: critique })
messages.push({ role: 'system', content: 'Now revise and execute with tools.' })
const final = await agentLoop(messages, tools)
Subscription auth is a product dialect
One more layer matters: how you are authenticated.
Direct API-key access usually means using the public developer endpoint documented for integrations, with metered billing. Subscription-based access often means speaking the product protocol used by the official app or CLI.
Some harnesses implement product-compatible OAuth and stream directly to provider/product endpoints instead of shelling out to provider binaries. The catch is that the product protocol differs from the public API-key endpoint in its endpoints, headers, identity strings, tool names, and rate/feature rules.
OpenAI / Codex
// direct API-key style
POST https://api.openai.com/v1/responses
Authorization: Bearer sk-...
// ChatGPT/Codex subscription style
POST https://chatgpt.com/backend-api/codex/responses
Authorization: Bearer <ChatGPT OAuth access token>
chatgpt-account-id: <account id from access token>
originator: <client identifier>
OpenAI-Beta: responses=experimental
Anthropic / Claude OAuth
A Claude Code-compatible subscription adapter speaks a product dialect: bearer OAuth token, beta flags, CLI-ish headers, Claude Code identity preamble, and canonical Claude Code tool names.
new Anthropic({
  apiKey: null,
  authToken: oauthAccessToken,
  baseURL: 'https://api.anthropic.com',
  defaultHeaders: {
    'anthropic-beta': 'claude-code-20250219,oauth-2025-04-20,...',
    'user-agent': 'claude-cli/2.1.75',
    'x-app': 'cli',
  },
})
params.system = [
  { type: 'text', text: "You are Claude Code, Anthropic's official CLI for Claude." },
  { type: 'text', text: customSystemPrompt },
]
toClaudeCodeName('bash') // 'Bash'
toClaudeCodeName('read') // 'Read'
GitHub Copilot
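// 1. exchange the GitHub token for a short-lived Copilot token
GET https://api.github.com/copilot_internal/v2/token
Authorization: Bearer <GitHub device-flow access token>
// 2. the Copilot token carries a proxy endpoint; the API base URL is derived from it
proxy-ep=proxy.individual.githubcopilot.com
baseUrl = 'https://api.individual.githubcopilot.com'
// 3. chat requests to baseUrl carry editor-style headers
User-Agent: GitHubCopilotChat/0.35.0
Editor-Version: vscode/1.107.0
Copilot-Integration-Id: vscode-chat
X-Initiator: user | agent
Openai-Intent: conversation-edits
What dialect means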
Product-compatible integration means matching protocol shape, not just replacing an API key with an OAuth token.
type ProductDialect = {
  endpoint: string
  auth: 'api-key' | 'chatgpt-oauth' | 'claude-oauth' | 'copilot-token'
  headers: Record<string, string>
  toolFormat: 'openai-functions' | 'anthropic-tools' | 'claude-code-compatible-tools'
  identity?: 'raw-api-client' | 'claude-code-compatible-cli' | 'vscode-copilot-chat'
}
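As a hedged sketch of how a harness might consume this type, the snippet below builds the transport-level parts of a request from a dialect. `prepareRequest`, `toDialectToolName`, and `PreparedRequest` are hypothetical names. The capitalization rule is a simplification of the 'bash' to 'Bash' mapping shown earlier, and the bearer header is an assumption that fits the two dialects used here; other API-key endpoints may expect a different credential header.

```typescript
// Repeats the ProductDialect type from above so the snippet is self-contained.
type ProductDialect = {
  endpoint: string
  auth: 'api-key' | 'chatgpt-oauth' | 'claude-oauth' | 'copilot-token'
  headers: Record<string, string>
  toolFormat: 'openai-functions' | 'anthropic-tools' | 'claude-code-compatible-tools'
  identity?: 'raw-api-client' | 'claude-code-compatible-cli' | 'vscode-copilot-chat'
}

type PreparedRequest = { url: string; headers: Record<string, string>; toolNames: string[] }

// Hypothetical mapping from harness tool names to the canonical names a
// Claude Code-compatible dialect expects ('bash' -> 'Bash', 'read' -> 'Read').
function toDialectToolName(name: string, dialect: ProductDialect): string {
  if (dialect.toolFormat !== 'claude-code-compatible-tools') return name
  return name.charAt(0).toUpperCase() + name.slice(1)
}

// Build the transport-level request parts for a given dialect and credential.
function prepareRequest(
  dialect: ProductDialect,
  credential: string,
  toolNames: string[],
): PreparedRequest {
  return {
    url: dialect.endpoint,
    headers: { ...dialect.headers, Authorization: `Bearer ${credential}` },
    toolNames: toolNames.map((name) => toDialectToolName(name, dialect)),
  }
}

const openAiDirect: ProductDialect = {
  endpoint: 'https://api.openai.com/v1/responses',
  auth: 'api-key',
  headers: {},
  toolFormat: 'openai-functions',
  identity: 'raw-api-client',
}

const claudeCodeCompatible: ProductDialect = {
  endpoint: 'https://api.anthropic.com',
  auth: 'claude-oauth',
  headers: { 'x-app': 'cli' },
  toolFormat: 'claude-code-compatible-tools',
  identity: 'claude-code-compatible-cli',
}

const directRequest = prepareRequest(openAiDirect, '<api key>', ['bash', 'read'])
const subscriptionRequest = prepareRequest(claudeCodeCompatible, '<oauth token>', ['bash', 'read'])
// directRequest.toolNames is ['bash', 'read']
// subscriptionRequest.toolNames is ['Bash', 'Read']
```

The harness's own tool list stays the same in both cases; only the adapter output changes with the dialect.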