What Conductor is (and is not)
Conductor isn't a regular API gateway — gateways do routing + billing + observability. Conductor is the AI Runtime: a decision layer between your app and the actual model, packing every capability a serious call needs into one round-trip.
One model=nexevo-auto call, internally:
- Checks local semantic cache (hit → return immediately, 0 tokens)
- Injects cross-model memory (user's capsules + context match → prompt prefix)
- Dynamic model selection (catalog + bandit + ELO weighted scoring)
- Executes the LLM call with auto tool-arg retry
- Verify anti-hallucination (triggered on legal / medical / code_critical intents)
- Decides whether to escalate to multi-step agent (task decomposition + tool loop)
- Async auto-curator saves high-confidence answers into Recall
Conductor isn't: a replacement for OpenAI / Anthropic model capabilities, and it doesn't lock you to a single upstream. Switch to model=gpt-5 / model=claude-opus-4-7 at any time to bypass Conductor decisions and go passthrough.
9-step pipeline (in order)
- 01 Cache lookup: Local semantic cache, cosine ≥ 0.95 = hit; per-user isolated, 1h TTL, options-hash keyed. Hit → return immediately, skip all remaining steps (0 tokens billed).
- 02 Memory.attach: Recall capsule injection. sim ≥ 0.7, top-K, ≤800-token budget, ≤30% of context. options.recall=off skips; options.recall={ids:[...]} pins specific capsules.
- 03 Context guard: Auto-compress / truncate over-threshold prompts; code blocks preserved verbatim.
- 04 Sticky session: Lock model selection within a conversation (avoids turn-by-turn drift). Triggered by metadata.conversation_id.
- 05 LLM call: Layer1 catalog + Layer2 bandit + Layer3 ELO weighted scoring picks the best model. tools=[...] schema passed through.
- 06 Tool validate: If the model's tool_call args fail schema validation → auto-retry once with inline schema-error feedback.
- 07 Verify (anti-hallucination): Intent whitelist detection. Intents such as legal / medical / financial / security / code_critical trigger a cheap judge re-evaluation. options.verify=always forces it on every call; off disables it.
- 08 Agent multi-step: If the task is judged multi-step (explicit tool_call + long reasoning trace + verify needs_review) → escalate to the agent sandbox loop. options.agent=always forces it; off disables it.
- 09 Auto-curator: High-confidence answers are saved into Recall asynchronously (gray zone refined by a lightweight LLM, admin-configurable). Does not affect response latency.
Each step's outcome is appended to conductor.pipeline in the response (also pipe-delimited in X-Nexevo-Pipeline header).
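To make the thresholds in steps 01 and 02 concrete, here is a minimal sketch of how a cache hit and a capsule selection could be decided. It is not Conductor's implementation: the in-memory list scan, the `CacheEntry` shape, the `top_k=5` default, and reading "≤30% of context" as 30% of the prompt's token count are all assumptions made for illustration.

```python
import math
import time
from dataclasses import dataclass

CACHE_SIM_THRESHOLD = 0.95   # step 01: cosine >= 0.95 counts as a hit
CACHE_TTL_SECONDS = 3600     # step 01: 1h TTL
RECALL_SIM_THRESHOLD = 0.7   # step 02: sim >= 0.7
RECALL_TOKEN_BUDGET = 800    # step 02: at most 800 tokens
RECALL_CONTEXT_SHARE = 0.30  # step 02: at most 30% of context (assumed: prompt tokens)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CacheEntry:  # hypothetical record shape
    user_id: str
    options_hash: str
    embedding: list
    answer: str
    created_at: float


def cache_lookup(entries, user_id, options_hash, query_embedding, now=None):
    """Return the most similar unexpired answer for this user/options, or None."""
    now = time.time() if now is None else now
    best, best_sim = None, 0.0
    for e in entries:
        if e.user_id != user_id or e.options_hash != options_hash:
            continue  # per-user isolation, options-hash keyed
        if now - e.created_at > CACHE_TTL_SECONDS:
            continue  # entry is older than the TTL
        sim = cosine(e.embedding, query_embedding)
        if sim >= CACHE_SIM_THRESHOLD and sim > best_sim:
            best, best_sim = e, sim
    return best.answer if best else None


def select_capsules(capsules, context_tokens, top_k=5):
    """Greedy pick of (capsule_id, similarity, token_count) tuples within budget."""
    budget = min(RECALL_TOKEN_BUDGET, int(context_tokens * RECALL_CONTEXT_SHARE))
    eligible = sorted(
        (c for c in capsules if c[1] >= RECALL_SIM_THRESHOLD),
        key=lambda c: c[1],
        reverse=True,
    )[:top_k]
    chosen, used = [], 0
    for cap_id, _sim, tokens in eligible:
        if used + tokens <= budget:
            chosen.append(cap_id)
            used += tokens
    return chosen, used
```

A production cache would use a vector index rather than a linear scan; the scan is kept here only so the threshold, TTL, and isolation checks are visible in one place.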
Two call modes
Conductor pipeline runs on both endpoints; only response shape differs:
- OpenAI-compat shim: POST /v1/chat/completions with model=nexevo-auto. Response shape is strictly OpenAI-compatible (existing SDKs work unchanged); conductor metadata is sideband via X-Nexevo-* response headers.
- Clean conductor entry: POST /v1/conductor/chat. Response body has a top-level conductor metadata block (pipeline / cache / memory / cost / elapsed), so no header parsing is needed.
Which to use? Migrating existing OpenAI code → use the shim (just change base_url). Writing new code that wants conductor metadata → use the clean entry (no header parsing).
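The difference between the two modes comes down to the path and whether `model` is sent. The sketch below builds (but does not send) a request for either mode using only the Python standard library. The `build_request` helper and its `mode` argument are hypothetical names; the host is the one used in the curl examples on this page.

```python
import json
import urllib.request

BASE_URL = "https://api.nexevo.ai"  # host as used in this page's curl examples


def build_request(messages, api_key, mode="shim"):
    """Build an unsent POST request for the shim or the clean conductor entry."""
    if mode == "shim":
        # OpenAI-compatible endpoint; model=nexevo-auto opts in to Conductor.
        path = "/v1/chat/completions"
        payload = {"model": "nexevo-auto", "messages": messages}
    elif mode == "clean":
        # Clean entry; the minimal example on this page sends no model field.
        path = "/v1/conductor/chat"
        payload = {"messages": messages}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; with the shim, metadata then arrives in response headers, with the clean entry it arrives in the body.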
ConductorOptions reference
All options are optional; defaults are sane and fit the vast majority of use cases.
- recall: "auto" | "off" | { ids: ["cap_..."] }. Default: "auto". auto = match capsules by context; off = skip memory injection; {ids:[...]} = pin specific capsules.
- verify: "off" | "auto" | "always". Default: "auto". auto = triggered by intent (legal / medical / code_critical etc.); always = run on every call; off = disabled.
- agent: "off" | "auto-if-multi-step" | "always". Default: "off". off = no agent escalation (chat-only); auto-if-multi-step = automatic; always = force agent sandbox.
- cache: "auto" | "strict-fresh" | "off". Default: "auto". auto = cosine ≥ 0.95 hit; strict-fresh = skip cache lookup, force LLM call; off = neither read nor write cache (test mode).
- max_cost_usd: float. Default: 0.10. Per-call cost ceiling (USD). If exceeded → block escalation to pricier models, then downgrade or return max_cost_exceeded.
- stream: boolean. Default: false. true = SSE streaming (token-by-token + step events); false = single non-streaming response.
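Client code may want to catch a mistyped option before sending a request. The following is a small sketch of a client-side mirror of the options above, with the documented values and defaults. The `ConductorOptions` class here is a hypothetical helper, not part of any official SDK, and the server remains the authority on validation.

```python
from dataclasses import asdict, dataclass
from typing import Union

VERIFY_MODES = {"off", "auto", "always"}
AGENT_MODES = {"off", "auto-if-multi-step", "always"}
CACHE_MODES = {"auto", "strict-fresh", "off"}
RECALL_MODES = {"auto", "off"}


@dataclass
class ConductorOptions:
    """Client-side mirror of the documented options and their defaults."""

    recall: Union[str, dict] = "auto"
    verify: str = "auto"
    agent: str = "off"
    cache: str = "auto"
    max_cost_usd: float = 0.10
    stream: bool = False

    def __post_init__(self):
        if isinstance(self.recall, dict):
            # Pinned capsules: the only documented dict form is {"ids": [...]}.
            if set(self.recall) != {"ids"} or not isinstance(self.recall["ids"], list):
                raise ValueError('recall dict must look like {"ids": [...]}')
        elif not (isinstance(self.recall, str) and self.recall in RECALL_MODES):
            raise ValueError(f"invalid recall: {self.recall!r}")
        for name, allowed in (
            ("verify", VERIFY_MODES),
            ("agent", AGENT_MODES),
            ("cache", CACHE_MODES),
        ):
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"invalid {name}: {value!r}")

    def to_payload(self):
        """Return a plain dict suitable for the request's "options" field."""
        return asdict(self)
```

For example, `ConductorOptions(verify="always", max_cost_usd=0.20).to_payload()` yields a dict that can be placed under `"options"` in the request body.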
Response shape · conductor metadata block
When hitting /v1/conductor/chat, the response body has a top-level conductor block:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "claude-opus-4-7", // ← model actually selected by Conductor
  "choices": [{ "message": { "role": "assistant", "content": "..." } }],
  "usage": { "prompt_tokens": 450, "completion_tokens": 120, "total_tokens": 570 },
  "conductor": {
    "pipeline": ["cache_miss", "memory_attached(3)", "model:claude-opus-4-7", "verify:pass"],
    "model_chosen": "claude-opus-4-7",
    "cache": { "hit": false, "sim": 0.78, "via_prewarm": false },
    "memory": { "attached": true, "tokens": 642, "caps": 3, "diff": false },
    "usage": { "input_tokens": 450, "output_tokens": 120 },
    "elapsed_ms": { "total": 1234, "cache": 8, "memory": 42, "llm": 1180 },
    "cost_usd": "0.005670",
    "saved_usd": null,
    "sticky": null
  }
}
For the OpenAI shim (/v1/chat/completions) the body is plain OpenAI-shape; the same info is in X-Nexevo-* response headers (next section).
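A client reading the body of /v1/conductor/chat might reduce the conductor block to a few fields it cares about. The sketch below does that for a parsed response. The `summarize` helper is hypothetical, and treating `total - llm` as pipeline overhead assumes that `elapsed_ms.total` includes the LLM time. Note that `cost_usd` is a string in the sample, so it is parsed with `Decimal` to avoid float rounding.

```python
from decimal import Decimal


def summarize(body):
    """Pick out key fields from a parsed /v1/conductor/chat response body."""
    c = body["conductor"]
    elapsed = c["elapsed_ms"]
    saved = c["saved_usd"]  # null in the sample when there was no cache hit
    return {
        "model": c["model_chosen"],
        "cache_hit": c["cache"]["hit"],
        "capsules": c["memory"]["caps"],
        # Assumption: total covers the whole call, llm only the upstream call.
        "overhead_ms": elapsed["total"] - elapsed["llm"],
        "cost_usd": Decimal(c["cost_usd"]),
        "saved_usd": Decimal(saved) if saved is not None else None,
    }
```

With the sample response shown in this section, `overhead_ms` comes out as 54 (1234 minus 1180).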
X-Nexevo-* response headers
| Header | Type | Meaning |
|---|---|---|
| X-Nexevo-Pipeline | pipe-delimited | 9-step execution order, e.g. cache_miss\|model:opus-4-7\|verify:pass |
| X-Nexevo-Cache-Hit | true/false | Whether this call hit cache |
| X-Nexevo-Cache-Score | 0.0–1.0 | Cosine similarity score (≥0.95 = hit) |
| X-Nexevo-Cache-Via-Prewarm | true/false | Whether hit was via cluster pre-warm job |
| X-Nexevo-XMM-Attached | true/false | Whether cross-model memory was injected |
| X-Nexevo-XMM-Tokens | int | Memory tokens attached |
| X-Nexevo-XMM-Caps | int | Capsule count attached |
| X-Nexevo-XMM-Family | string | Target model family (for diff encoding) |
| X-Nexevo-Cost-Usd | float (6 decimal places) | Actual LLM cost for this call (USD) |
| X-Nexevo-Saved-Usd | float (6 decimal places) | Cost saved by cache hit (USD) |
| X-Nexevo-Elapsed-Ms | int | End-to-end latency (ms) |
| X-Nexevo-Judge-Verdict | pass/needs_review/fail | Verify step verdict |
| X-Nexevo-Tool-Retried | true/false | Whether tool args were auto-retried once |
| X-Usage-Input-Tokens | int | Billed input tokens |
| X-Usage-Output-Tokens | int | Billed output tokens |
| X-Trace-ID | uuid | Request correlation ID; include when reporting issues |
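When using the shim, these headers arrive as strings. A rough sketch of turning them into typed values follows. The `parse_nexevo_headers` helper and its output key names are hypothetical, and the sketch assumes that a header may be absent when the corresponding step did not run, in which case the value is `None`.

```python
from decimal import Decimal


def _flag(value):
    """Map 'true'/'false' header text to bool; None if the header is absent."""
    return None if value is None else value.strip().lower() == "true"


def _num(value, cast):
    """Cast a header value with `cast`; None if the header is absent."""
    return None if value is None else cast(value.strip())


def parse_nexevo_headers(headers):
    """Convert X-Nexevo-* response headers into typed Python values."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    pipeline = h.get("x-nexevo-pipeline")
    return {
        "pipeline": pipeline.split("|") if pipeline else [],
        "cache_hit": _flag(h.get("x-nexevo-cache-hit")),
        "cache_score": _num(h.get("x-nexevo-cache-score"), float),
        "memory_attached": _flag(h.get("x-nexevo-xmm-attached")),
        "memory_tokens": _num(h.get("x-nexevo-xmm-tokens"), int),
        "memory_caps": _num(h.get("x-nexevo-xmm-caps"), int),
        "cost_usd": _num(h.get("x-nexevo-cost-usd"), Decimal),
        "saved_usd": _num(h.get("x-nexevo-saved-usd"), Decimal),
        "elapsed_ms": _num(h.get("x-nexevo-elapsed-ms"), int),
        "judge_verdict": h.get("x-nexevo-judge-verdict"),
        "tool_retried": _flag(h.get("x-nexevo-tool-retried")),
        "trace_id": h.get("x-trace-id"),
    }
```

Most HTTP clients expose response headers as a mapping, so the result of a request can usually be passed in as `parse_nexevo_headers(dict(response.headers))`.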
Full curl examples
Minimal call (all options default):
# Clean entry: get the conductor metadata block directly
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'
# OpenAI-compatible shim: existing OpenAI code only needs a new base_url
curl https://api.nexevo.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nexevo-auto",
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'
With explicit options + advanced usage:
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a senior security engineer." },
      { "role": "user", "content": "Review this JWT verification code" }
    ],
    "options": {
      "recall": "auto",
      "verify": "always",
      "agent": "auto-if-multi-step",
      "cache": "auto",
      "max_cost_usd": 0.20,
      "stream": false
    },
    "metadata": {
      "conversation_id": "conv_abc123",
      "user_intent": "code_review"
    },
    "temperature": 0.4,
    "max_tokens": 4096
  }'
FAQ
How is Conductor different from a regular AI gateway?
Regular gateways do routing + billing + observability. Conductor is the AI Runtime: one call gives you dynamic model selection + local cache + cross-model memory + verify + on-demand agent. These are not separate products glued together, but one cooperating pipeline.
I'm already on OpenAI SDK — how costly is migration?
Change base_url and API key (two lines), and set model=nexevo-auto so the call goes through the Conductor pipeline. Response shape is strictly OpenAI-compatible, so SDK code stays untouched. Switch to /v1/conductor/chat later if you want explicit conductor metadata.
What happens when max_cost_usd is exceeded?
Conductor first tries to downgrade to a cheaper model. If no acceptable downgrade exists → returns max_cost_exceeded error rather than silently overcharging.
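The downgrade-then-fail behavior can be pictured with a small sketch. This is not Conductor's selection logic: the candidate list, the per-million-token prices, the model names, and the `MaxCostExceeded` exception are all hypothetical, and "acceptable" is reduced here to "fits under the ceiling".

```python
class MaxCostExceeded(Exception):
    """Raised when no candidate fits under the per-call cost ceiling."""


def estimate_cost_usd(model, tokens_in, tokens_out):
    """Estimated cost from hypothetical per-million-token prices."""
    return (tokens_in * model["usd_per_m_in"] + tokens_out * model["usd_per_m_out"]) / 1_000_000


def pick_model(candidates, tokens_in, tokens_out, max_cost_usd):
    """Return the first candidate (best first) whose estimate fits the ceiling."""
    for model in candidates:
        if estimate_cost_usd(model, tokens_in, tokens_out) <= max_cost_usd:
            return model["name"]
    # No cheaper model fits: fail loudly instead of overcharging.
    raise MaxCostExceeded(f"no model fits under ${max_cost_usd:.2f}")


# Hypothetical catalog, ordered from most to least preferred.
CANDIDATES = [
    {"name": "model-large", "usd_per_m_in": 15.0, "usd_per_m_out": 75.0},
    {"name": "model-small", "usd_per_m_in": 1.0, "usd_per_m_out": 5.0},
]
```

With 10,000 input and 2,000 output tokens, the large model is estimated at $0.30 and the small one at $0.02, so a ceiling of 0.10 selects `model-small`, while a ceiling of 0.01 raises `MaxCostExceeded`.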
How do I read conductor metadata in streaming mode?
An extra SSE event (type=conductor.metadata) is sent at end-of-stream with the same payload as the non-streaming conductor block. X-Nexevo-* headers are also present in the HTTP response (streams have headers too).
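A sketch of pulling that final event out of a stream is shown below. The exact wire format is not specified on this page, so the code makes two assumptions: events follow the standard SSE framing (blank-line-separated blocks with `event:` and `data:` lines), and the metadata event is marked either by an `event: conductor.metadata` line or by a `"type": "conductor.metadata"` field in the JSON data. The `extract_conductor_metadata` helper is hypothetical.

```python
import json


def extract_conductor_metadata(sse_text):
    """Return the JSON payload of the conductor.metadata event, or None."""
    for block in sse_text.strip().split("\n\n"):
        event_name, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if not data_lines:
            continue
        try:
            payload = json.loads("\n".join(data_lines))
        except json.JSONDecodeError:
            continue  # non-JSON sentinels such as a "[DONE]" marker
        if not isinstance(payload, dict):
            continue
        if event_name == "conductor.metadata" or payload.get("type") == "conductor.metadata":
            return payload
    return None
```

A real client would process events incrementally as they arrive rather than buffering the whole stream; buffering keeps the sketch short.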
Is the cache per-user or shared across the tenant?
Per-user (API key scoped), no cross-user leakage. Org-level shared cache is on roadmap, not enabled today.
Related
- MCP integration doc — One-click Conductor in Claude Desktop / Cursor
- Recall long-term memory doc — capsule architecture / pricing / REST API for the memory subsystem
- Tasks doc (task-as-a-service) — Planner + Verifier + Auto-repair loop
- Conductor product page — value props / comparison / customer scenarios