Conductor · AI Runtime: one call = routing + cache + memory + verify + agent

Conductor is Nexevo's main entry point — dynamic model selection, local semantic cache, cross-model memory injection, anti-hallucination verify, and on-demand multi-step agent all wrapped into one OpenAI-compatible call. This doc covers: the 9-step pipeline, full ConductorOptions schema, every X-Nexevo-* response header, two call modes (/v1/chat/completions shim and /v1/conductor/chat clean entry), and FAQ.

01

What Conductor is (and is not)

Conductor isn't a regular API gateway — gateways do routing + billing + observability. Conductor is the AI Runtime: a decision layer between your app and the actual model, packing every capability a serious call needs into one round-trip.

One model=nexevo-auto call, internally:

  • Checks local semantic cache (hit → return immediately, 0 tokens)
  • Injects cross-model memory (user's capsules + context match → prompt prefix)
  • Dynamic model selection (catalog + bandit + ELO weighted scoring)
  • Executes the LLM call with auto tool-arg retry
  • Verify anti-hallucination (triggered on legal / medical / code_critical intents)
  • Decides whether to escalate to multi-step agent (task decomposition + tool loop)
  • Async auto-curator saves high-confidence answers into Recall

Conductor is not a replacement for OpenAI / Anthropic model capabilities, and it does not lock you to a single upstream: switch to model=gpt-5 / model=claude-opus-4-7 at any time to bypass Conductor's decisions and pass the request straight through.

02

9-step pipeline (in order)

  1. Cache lookup
     Local semantic cache, cosine ≥ 0.95 = hit; per-user isolated, 1h TTL, options-hash keyed. Hit → return immediately, skip all remaining steps (0 tokens billed).
  2. Memory.attach
     Recall capsule injection. sim ≥ 0.7, top-K, ≤800-token budget, ≤30% of context. options.recall=off skips; options.recall={ids:[...]} pins specific capsules.
  3. Context guard
     Auto-compress / truncate over-threshold prompts; code blocks preserved verbatim.
  4. Sticky session
     Lock model selection within a conversation (avoids turn-by-turn drift). Triggered by metadata.conversation_id.
  5. LLM call
     Layer1 catalog + Layer2 bandit + Layer3 ELO weighted scoring picks the best model. tools=[...] schema passed through.
  6. Tool validate
     If the model's tool_call args fail schema validation → auto-retry once with inline schema-error feedback.
  7. Verify (anti-hallucination)
     Intent whitelist detection: legal / medical / financial / security / code_critical etc. trigger a cheap judge re-evaluation. options.verify=always forces verification on every call; off disables it.
  8. Agent multi-step
     If the task is judged multi-step (explicit tool_call + long reasoning trace + verify needs_review) → escalate to the agent sandbox loop. options.agent=always forces escalation; off disables it.
  9. Auto-curator
     High-confidence answers are saved asynchronously into Recall (gray zone refined by a lightweight LLM, admin-configurable). Does not affect response latency.
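
Step 1 can be illustrated with a small in-memory sketch. It uses only the figures stated above (cosine ≥ 0.95, 1h TTL, per-user isolation, options-hash key); the class `SemanticCache`, the helper `options_hash`, and the toy two-dimensional embeddings are hypothetical and are not Nexevo's implementation.

```python
import hashlib
import json
import math
import time

COSINE_THRESHOLD = 0.95  # hit threshold stated in the pipeline description
TTL_SECONDS = 3600       # 1h TTL stated in the pipeline description


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def options_hash(options):
    """Stable hash of the options dict so different options never share entries."""
    blob = json.dumps(options, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class SemanticCache:
    """Hypothetical cache keyed by (user_id, options hash)."""

    def __init__(self):
        # (user_id, options hash) -> list of (embedding, answer, stored_at)
        self._entries = {}

    def put(self, user_id, options, embedding, answer, now=None):
        now = time.time() if now is None else now
        key = (user_id, options_hash(options))
        self._entries.setdefault(key, []).append((embedding, answer, now))

    def get(self, user_id, options, embedding, now=None):
        """Return (answer, score) on a hit, or (None, best_score) on a miss."""
        now = time.time() if now is None else now
        key = (user_id, options_hash(options))
        best_answer, best_score = None, 0.0
        for stored_embedding, answer, stored_at in self._entries.get(key, []):
            if now - stored_at > TTL_SECONDS:
                continue  # entry is older than the TTL
            score = cosine(embedding, stored_embedding)
            if score > best_score:
                best_answer, best_score = answer, score
        if best_score >= COSINE_THRESHOLD:
            return best_answer, best_score
        return None, best_score
```

Keying on the user and the options hash means a request with `verify=always` never reuses an answer that was cached under `verify=auto`, which matches the "options-hash keyed" wording above.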

Each step's outcome is appended to conductor.pipeline in the response (also pipe-delimited in X-Nexevo-Pipeline header).
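
A minimal sketch of splitting the pipe-delimited form into steps. The `name:detail` token shape is inferred from the examples in this doc (such as `model:opus-4-7` and `verify:pass`), not from a published grammar, and `parse_pipeline` is a hypothetical helper name.

```python
def parse_pipeline(header_value):
    """Split a pipe-delimited pipeline string into (name, detail) pairs.

    Tokens without a colon, such as "cache_miss", get a detail of None.
    """
    steps = []
    for token in header_value.split("|"):
        token = token.strip()
        if not token:
            continue  # tolerate empty segments
        name, sep, detail = token.partition(":")
        steps.append((name, detail if sep else None))
    return steps


steps = parse_pipeline("cache_miss|model:opus-4-7|verify:pass")
```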

03

Two call modes

The Conductor pipeline runs on both endpoints; only the response shape differs:

  • OpenAI-compat shim: POST /v1/chat/completions with model=nexevo-auto. The response shape is strictly OpenAI-compatible (existing SDKs work unchanged); conductor metadata travels sideband in X-Nexevo-* response headers.
  • Clean conductor entry: POST /v1/conductor/chat. The response body has a top-level conductor metadata block (pipeline / cache / memory / cost / elapsed), so no header parsing is needed.

Which to use? Migrating existing OpenAI code → use the shim (change base_url and set model=nexevo-auto). Writing new code that wants conductor metadata → use the clean entry (no header parsing).
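
The difference between the two modes on the request side can be sketched with the Python standard library. The host is the one used in this doc's curl examples; `build_request` and its `mode` argument are hypothetical names introduced for illustration.

```python
import json
import urllib.request

BASE_URL = "https://api.nexevo.ai"  # host from the curl examples in this doc


def build_request(messages, api_key, mode="clean"):
    """Build (but do not send) a request for either call mode."""
    if mode == "shim":
        # OpenAI-compatible shim: the model name opts in to Conductor.
        path = "/v1/chat/completions"
        body = {"model": "nexevo-auto", "messages": messages}
    elif mode == "clean":
        # Clean entry: no model field, metadata comes back in the body.
        path = "/v1/conductor/chat"
        body = {"messages": messages}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops before the network call so it can be inspected offline.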

04

ConductorOptions reference

All options are optional; defaults are sane and fit the vast majority of use cases.

recall: "auto" | "off" | { ids: ["cap_..."] } (default: "auto")

auto = match capsules by context; off = skip memory injection; { ids: [...] } = pin specific capsules

verify: "off" | "auto" | "always" (default: "auto")

auto = triggered by intent (legal / medical / code_critical etc.); always = run every call; off = disabled

agent: "off" | "auto-if-multi-step" | "always" (default: "off")

off = no agent escalation (chat-only); auto-if-multi-step = automatic; always = force agent sandbox

cache: "auto" | "strict-fresh" | "off" (default: "auto")

auto = cosine ≥ 0.95 hit; strict-fresh = skip cache lookup, force LLM call; off = neither read nor write cache (test mode)

max_cost_usd: float (default: 0.10)

Per-call cost ceiling (USD). If exceeded → block escalation to pricier models, then downgrade or return max_cost_exceeded

stream: boolean (default: false)

true = SSE streaming (token-by-token + step events); false = single non-streaming response
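
A client-side sketch that fills in the documented defaults and rejects values outside the documented sets before a request is sent. `normalize_options` is a hypothetical helper; how the server itself reacts to invalid options is not stated in this doc.

```python
# Defaults and allowed values copied from the reference above.
DEFAULTS = {
    "recall": "auto",
    "verify": "auto",
    "agent": "off",
    "cache": "auto",
    "max_cost_usd": 0.10,
    "stream": False,
}
ALLOWED = {
    "verify": ("off", "auto", "always"),
    "agent": ("off", "auto-if-multi-step", "always"),
    "cache": ("auto", "strict-fresh", "off"),
}


def normalize_options(options=None):
    """Merge user options over the defaults and validate each field."""
    merged = {**DEFAULTS, **(options or {})}
    unknown = set(merged) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    for key, allowed in ALLOWED.items():
        if merged[key] not in allowed:
            raise ValueError(f"{key} must be one of {allowed}")
    recall = merged["recall"]
    if isinstance(recall, dict):
        ids = recall.get("ids")
        valid_ids = isinstance(ids, list) and all(isinstance(i, str) for i in ids)
        if set(recall) != {"ids"} or not valid_ids:
            raise ValueError('recall object must look like {"ids": ["cap_..."]}')
    elif recall not in ("auto", "off"):
        raise ValueError('recall must be "auto", "off", or {"ids": [...]}')
    cost = merged["max_cost_usd"]
    if isinstance(cost, bool) or not isinstance(cost, (int, float)) or cost <= 0:
        raise ValueError("max_cost_usd must be a positive number")
    if not isinstance(merged["stream"], bool):
        raise ValueError("stream must be a boolean")
    return merged
```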

05

Response shape · conductor metadata block

When hitting /v1/conductor/chat, the response body has a top-level conductor block:

json
{
  "id":      "chatcmpl-...",
  "object":  "chat.completion",
  "model":   "claude-opus-4-7",      // ← model actually selected by Conductor
  "choices": [{ "message": { "role": "assistant", "content": "..." } }],
  "usage":   { "prompt_tokens": 450, "completion_tokens": 120, "total_tokens": 570 },

  "conductor": {
    "pipeline":     ["cache_miss", "memory_attached(3)", "model:claude-opus-4-7", "verify:pass"],
    "model_chosen": "claude-opus-4-7",
    "cache":        { "hit": false, "sim": 0.78, "via_prewarm": false },
    "memory":       { "attached": true, "tokens": 642, "caps": 3, "diff": false },
    "usage":        { "input_tokens": 450, "output_tokens": 120 },
    "elapsed_ms":   { "total": 1234, "cache": 8, "memory": 42, "llm": 1180 },
    "cost_usd":     "0.005670",
    "saved_usd":    null,
    "sticky":       null
  }
}

For the OpenAI shim (/v1/chat/completions) the body is plain OpenAI-shape; the same info is in X-Nexevo-* response headers (next section).
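
A sketch of pulling a few figures out of the conductor block, using only field names from the example above. `summarize_conductor` is a hypothetical helper, and it expects plain JSON, so the explanatory `//` comment in the example would not appear in a real response body.

```python
import json
from decimal import Decimal


def summarize_conductor(response_body):
    """Extract a small summary from a /v1/conductor/chat response body."""
    conductor = json.loads(response_body)["conductor"]
    elapsed = conductor["elapsed_ms"]
    memory = conductor["memory"]
    saved = conductor.get("saved_usd")
    return {
        "model": conductor["model_chosen"],
        "cache_hit": conductor["cache"]["hit"],
        "memory_caps": memory["caps"] if memory["attached"] else 0,
        # cost_usd is a string in the example; Decimal keeps all six places.
        "cost_usd": Decimal(str(conductor["cost_usd"])),
        "saved_usd": None if saved is None else Decimal(str(saved)),
        # Time spent outside the LLM call itself (cache, memory, and so on).
        "overhead_ms": elapsed["total"] - elapsed["llm"],
    }
```

With the example values above, the overhead works out to 1234 - 1180 = 54 ms.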

06

X-Nexevo-* response headers

Header                      Type                    Meaning
X-Nexevo-Pipeline           pipe-delimited          9-step execution order, e.g. cache_miss|model:opus-4-7|verify:pass
X-Nexevo-Cache-Hit          true/false              Whether this call hit cache
X-Nexevo-Cache-Score        0.0–1.0                 Cosine similarity score (≥0.95 = hit)
X-Nexevo-Cache-Via-Prewarm  true/false              Whether hit was via cluster pre-warm job
X-Nexevo-XMM-Attached       true/false              Whether cross-model memory was injected
X-Nexevo-XMM-Tokens         int                     Memory tokens attached
X-Nexevo-XMM-Caps           int                     Capsule count attached
X-Nexevo-XMM-Family         string                  Target model family (for diff encoding)
X-Nexevo-Cost-Usd           float 6dp               Actual LLM cost for this call (USD)
X-Nexevo-Saved-Usd          float 6dp               Cost saved by cache hit (USD)
X-Nexevo-Elapsed-Ms         int                     End-to-end latency (ms)
X-Nexevo-Judge-Verdict      pass/needs_review/fail  Verify step verdict
X-Nexevo-Tool-Retried       true/false              Whether tool args were auto-retried once
X-Usage-Input-Tokens        int                     Billed input tokens
X-Usage-Output-Tokens       int                     Billed output tokens
X-Trace-ID                  uuid                    Request correlation ID; include when reporting issues
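
When using the shim, these headers can be converted into typed values. The sketch below covers a subset of the headers listed above; `parse_nexevo_headers` is a hypothetical helper, and treating a missing header as `None` is an assumption, since this doc does not say which headers are omitted when a step does not run.

```python
from decimal import Decimal


def parse_nexevo_headers(headers):
    """Convert selected X-Nexevo-* headers into typed Python values."""
    # HTTP header names are case-insensitive, so normalize the keys.
    lowered = {k.lower(): v.strip() for k, v in headers.items()}

    def text(name):
        return lowered.get(name.lower())

    def flag(name):
        value = text(name)
        return None if value is None else value.lower() == "true"

    def number(name, cast):
        value = text(name)
        return None if value in (None, "") else cast(value)

    pipeline = text("X-Nexevo-Pipeline")
    return {
        "pipeline": pipeline.split("|") if pipeline else [],
        "cache_hit": flag("X-Nexevo-Cache-Hit"),
        "cache_score": number("X-Nexevo-Cache-Score", float),
        "memory_attached": flag("X-Nexevo-XMM-Attached"),
        "memory_tokens": number("X-Nexevo-XMM-Tokens", int),
        "cost_usd": number("X-Nexevo-Cost-Usd", Decimal),
        "saved_usd": number("X-Nexevo-Saved-Usd", Decimal),
        "elapsed_ms": number("X-Nexevo-Elapsed-Ms", int),
        "judge_verdict": text("X-Nexevo-Judge-Verdict"),
        "tool_retried": flag("X-Nexevo-Tool-Retried"),
        "trace_id": text("X-Trace-ID"),
    }
```

The function accepts any mapping of header names to strings, so the headers object returned by most HTTP clients can be passed in after conversion with `dict(...)`.
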
07

Full curl examples

Minimal call (all options default):

bash
# Clean entry: returns the conductor metadata block directly
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'

# OpenAI-compatible shim: existing OpenAI code only needs a new base_url
curl https://api.nexevo.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nexevo-auto",
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'

With explicit options + advanced usage:

bash
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a senior security engineer." },
      { "role": "user",   "content": "Review this JWT verification code" }
    ],
    "options": {
      "recall":       "auto",
      "verify":       "always",
      "agent":        "auto-if-multi-step",
      "cache":        "auto",
      "max_cost_usd": 0.20,
      "stream":       false
    },
    "metadata": {
      "conversation_id": "conv_abc123",
      "user_intent":     "code_review"
    },
    "temperature": 0.4,
    "max_tokens":  4096
  }'
08

FAQ

How is Conductor different from a regular AI gateway?

Regular gateways do routing + billing + observability. Conductor is the AI Runtime: one call gives you dynamic model selection + local cache + cross-model memory + verify + on-demand agent. These are not separate products glued together but one cooperating pipeline.

I'm already on OpenAI SDK — how costly is migration?

Change base_url and the API key (two lines) and set model=nexevo-auto. The response shape is strictly OpenAI-compatible, so the rest of your SDK code stays untouched. Switch to /v1/conductor/chat later if you want explicit conductor metadata.

What happens when max_cost_usd is exceeded?

Conductor first tries to downgrade to a cheaper model. If no acceptable downgrade exists → it returns a max_cost_exceeded error rather than silently overcharging.
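
The downgrade-or-fail behavior can be pictured with a toy selector. Everything here is illustrative: `pick_model`, `MaxCostExceeded`, and the scores and cost estimates are made up, and this doc does not describe how Conductor estimates cost or decides what counts as an acceptable downgrade.

```python
class MaxCostExceeded(Exception):
    """Raised when no candidate fits under the per-call cost ceiling."""


def pick_model(candidates, max_cost_usd=0.10):
    """Pick the highest-scoring candidate whose estimated cost fits the ceiling.

    candidates: list of (model_name, score, estimated_cost_usd) tuples.
    """
    affordable = [c for c in candidates if c[2] <= max_cost_usd]
    if not affordable:
        raise MaxCostExceeded("max_cost_exceeded")
    # Best remaining score wins, which is the "downgrade" when the top pick is too expensive.
    return max(affordable, key=lambda c: c[1])[0]


# Hypothetical scores and cost estimates, for illustration only.
CANDIDATES = [
    ("claude-opus-4-7", 0.90, 0.30),
    ("gpt-5", 0.80, 0.08),
]
```

With the default ceiling of 0.10, the higher-scoring model is over budget, so the selector falls back to the cheaper one; with a ceiling below every estimate, it raises instead of choosing anything.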

How do I read conductor metadata in streaming mode?

An extra SSE event (type=conductor.metadata) is sent at end-of-stream with the same payload as the non-streaming conductor block. X-Nexevo-* headers are also present in the HTTP response (streams have headers too).
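
A sketch of finding that final event in a stream. It assumes the type is carried in the standard SSE `event:` field with a JSON `data:` payload; the exact wire framing is not shown in this doc, so adjust the check if the type arrives inside the JSON instead. Both function names are hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield event, "\n".join(data)


def find_conductor_metadata(lines):
    """Return the decoded conductor.metadata payload, or None if absent."""
    for event, data in iter_sse_events(lines):
        if event == "conductor.metadata":  # assumed framing, see lead-in
            return json.loads(data)
    return None
```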

Is the cache per-user or shared across the tenant?

Per-user (API key scoped), with no cross-user leakage. An org-level shared cache is on the roadmap but not enabled today.
