Nexevo.ai

What Conductor is (and is not)

Conductor isn't a regular API gateway — gateways do routing + billing + observability. Conductor is the AI Runtime: a decision layer between your app and the actual model, packing every capability a serious call needs into one round-trip.

One model=nexevo-auto call, internally:

Checks local semantic cache (hit → return immediately, 0 tokens)
Injects cross-model memory (user's capsules + context match → prompt prefix)
Dynamic model selection (catalog + bandit + ELO weighted scoring)
Executes the LLM call with auto tool-arg retry
Verify anti-hallucination (triggered on legal / medical / code_critical intents)
Decides whether to escalate to multi-step agent (task decomposition + tool loop)
Async auto-curator saves high-confidence answers into Recall

Conductor isn't a replacement for OpenAI / Anthropic model capabilities, and it doesn't lock you to a single upstream: switch to model=gpt-5 / model=claude-opus-4-7 at any time to bypass Conductor's decisions and go passthrough.

9-step pipeline (in order)

01
Cache lookup
Local semantic cache, cosine ≥ 0.95 = hit; per-user isolated, 1h TTL, options-hash keyed. Hit → return immediately, skip all remaining steps (0 tokens billed).
02
Memory.attach
Recall capsule injection. sim ≥ 0.7, top-K, ≤800-token budget, ≤30% of context. options.recall=off skips; options.recall={ids:[...]} pins specific capsules.
03
Context guard
Auto-compress / truncate over-threshold prompts; code blocks preserved verbatim.
04
Sticky session
Lock model selection within a conversation (avoids turn-by-turn drift). Triggered by metadata.conversation_id.
05
LLM call
Layer1 catalog + Layer2 bandit + Layer3 ELO weighted scoring picks the best model. tools=[...] schema passed through.
06
Tool validate
If model's tool_call args fail schema validation → auto-retry once with inline schema-error feedback.
07
Verify (anti-hallucination)
Intent whitelist detection — legal / medical / financial / security / code_critical etc. trigger a cheap judge re-evaluation. options.verify=always forces every call; off disables.
08
Agent multi-step
If task is judged multi-step (explicit tool_call + long reasoning trace + verify needs_review) → escalate to agent sandbox loop. options.agent=always forces; off disables.
09
Auto-curator
High-confidence answers async-saved into Recall (gray zone refined by a lightweight LLM, admin-configurable). Does not affect response latency.
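To make the step 01 rules concrete, here is a minimal in-memory sketch of a semantic cache lookup. It is not Conductor's implementation: the class and function names are hypothetical, embeddings are assumed to be plain float lists produced elsewhere, and only the thresholds stated above (cosine ≥ 0.95, per-user isolation, 1h TTL, options-hash keying) come from this document.

```python
import math
import time
from dataclasses import dataclass

HIT_THRESHOLD = 0.95  # cosine similarity required for a hit (step 01)
TTL_SECONDS = 3600    # 1h TTL (step 01)


@dataclass
class CacheEntry:
    embedding: list
    response: str
    created_at: float


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache: one bucket per (user_id, options_hash) pair."""

    def __init__(self):
        self._buckets = {}

    def put(self, user_id, options_hash, embedding, response, now=None):
        now = time.time() if now is None else now
        bucket = self._buckets.setdefault((user_id, options_hash), [])
        bucket.append(CacheEntry(embedding, response, now))

    def lookup(self, user_id, options_hash, embedding, now=None):
        """Return (response, similarity); response is None on a miss."""
        now = time.time() if now is None else now
        best, best_sim = None, 0.0
        for entry in self._buckets.get((user_id, options_hash), []):
            if now - entry.created_at > TTL_SECONDS:
                continue  # expired entries never count as hits
            sim = cosine(embedding, entry.embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is not None and best_sim >= HIT_THRESHOLD:
            return best.response, best_sim
        return None, best_sim
```

Keying buckets by both user and options hash means a near-identical prompt from another user, or the same prompt with different options, is a miss in this sketch.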

Each step's outcome is appended to conductor.pipeline in the response (also pipe-delimited in X-Nexevo-Pipeline header).
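A client can turn the pipe-delimited header value into structured records. The sketch below assumes tokens take one of the three shapes that appear in this document's examples (`name`, `name(count)`, `name:value`); the function name and the returned dict layout are illustrative, not part of the API.

```python
import re

# Matches "cache_miss", "memory_attached(3)", or "model:claude-opus-4-7".
_TOKEN = re.compile(r"^(?P<name>[^:(]+)(?:\((?P<count>\d+)\)|:(?P<value>.+))?$")


def parse_pipeline(header_value):
    """Split an X-Nexevo-Pipeline value into [{'name': ..., 'detail': ...}, ...]."""
    steps = []
    for token in header_value.split("|"):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            # Unknown shape: keep the raw token rather than dropping it.
            steps.append({"name": token, "detail": None})
            continue
        detail = match.group("count") or match.group("value")
        steps.append({"name": match.group("name"), "detail": detail})
    return steps
```

The same function works on the `conductor.pipeline` array if its entries are joined with `|` first.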

Two call modes

Conductor pipeline runs on both endpoints; only response shape differs:

OpenAI-compat shim: POST /v1/chat/completions with model=nexevo-auto. Response shape is strict OpenAI-compatible (existing SDKs work unchanged); conductor metadata is sideband via X-Nexevo-* response headers.
Clean conductor entry: POST /v1/conductor/chat. Response body has a top-level conductor metadata block (pipeline / cache / memory / cost / elapsed), so no header parsing is needed.

Which to use? Migrating existing OpenAI code → use the shim (just change base_url). Writing new code that wants conductor metadata → use the clean entry (no header parsing).
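The choice between the two endpoints can be expressed as a small request builder. This is a sketch using only the Python standard library: `build_request` and its `want_metadata_in_body` flag are hypothetical names, and the host is the one used in this document's curl examples. The request object is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.nexevo.ai"  # host as used in the curl examples


def build_request(messages, api_key, want_metadata_in_body):
    """Build (but do not send) a POST request for one of the two call modes."""
    if want_metadata_in_body:
        # Clean conductor entry: metadata arrives in the response body.
        path = "/v1/conductor/chat"
        body = {"messages": messages}
    else:
        # OpenAI-compat shim: metadata arrives in X-Nexevo-* headers.
        path = "/v1/chat/completions"
        body = {"model": "nexevo-auto", "messages": messages}
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; that step is left out so the sketch has no network side effects.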

ConductorOptions reference

All options are optional; defaults are sane and fit the vast majority of use cases.

recall: "auto" | "off" | { ids: ["cap_..."] } (default: "auto")

auto = match capsules by context; off = skip memory injection; {ids:[...]} pin specific capsules

verify: "off" | "auto" | "always" (default: "auto")

auto = triggered by intent (legal / medical / code_critical etc.); always = run every call; off = disabled

agent: "off" | "auto-if-multi-step" | "always" (default: "off")

off = no agent escalation (chat-only); auto-if-multi-step = automatic; always = force agent sandbox

cache: "auto" | "strict-fresh" | "off" (default: "auto")

auto = cosine ≥ 0.95 hit; strict-fresh = skip cache lookup, force LLM call; off = neither read nor write cache (test mode)

max_cost_usd: float (default: 0.10)

Per-call cost ceiling (USD). If exceeded → block escalation to pricier models, downgrade or return max_cost_exceeded

stream: boolean (default: false)

true = SSE streaming (token-by-token + step events); false = single non-streaming response
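For client code that wants to catch typos before sending a request, the option table above can be mirrored in a small dataclass. This is an illustrative client-side sketch, not an official SDK type: the class and method names are hypothetical, and the rule that max_cost_usd must be positive is an assumption rather than something this reference states.

```python
from dataclasses import asdict, dataclass
from typing import Union


@dataclass
class ConductorOptions:
    # Defaults copied from the option reference above.
    recall: Union[str, dict] = "auto"
    verify: str = "auto"
    agent: str = "off"
    cache: str = "auto"
    max_cost_usd: float = 0.10
    stream: bool = False

    def validate(self):
        """Raise ValueError on values outside the documented sets."""
        if isinstance(self.recall, dict):
            ids = self.recall.get("ids")
            if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
                raise ValueError("recall.ids must be a list of capsule id strings")
        elif self.recall not in ("auto", "off"):
            raise ValueError("recall must be 'auto', 'off', or {'ids': [...]}")
        allowed = {
            "verify": ("off", "auto", "always"),
            "agent": ("off", "auto-if-multi-step", "always"),
            "cache": ("auto", "strict-fresh", "off"),
        }
        for field_name, values in allowed.items():
            if getattr(self, field_name) not in values:
                raise ValueError(f"{field_name} must be one of {values}")
        if self.max_cost_usd <= 0:  # assumption: a ceiling must be positive
            raise ValueError("max_cost_usd must be positive")
        return self

    def to_payload(self):
        """Validated dict suitable for the request's "options" field."""
        return asdict(self.validate())
```

For example, `ConductorOptions(verify="always", max_cost_usd=0.20).to_payload()` yields a dict that can be placed under `"options"` in the request body.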

Response shape · conductor metadata block

When hitting /v1/conductor/chat, the response body has a top-level conductor block:

```json
{
  "id":      "chatcmpl-...",
  "object":  "chat.completion",
  "model":   "claude-opus-4-7",      // ← the model Conductor actually selected
  "choices": [{ "message": { "role": "assistant", "content": "..." } }],
  "usage":   { "prompt_tokens": 450, "completion_tokens": 120, "total_tokens": 570 },

  "conductor": {
    "pipeline":     ["cache_miss", "memory_attached(3)", "model:claude-opus-4-7", "verify:pass"],
    "model_chosen": "claude-opus-4-7",
    "cache":        { "hit": false, "sim": 0.78, "via_prewarm": false },
    "memory":       { "attached": true, "tokens": 642, "caps": 3, "diff": false },
    "usage":        { "input_tokens": 450, "output_tokens": 120 },
    "elapsed_ms":   { "total": 1234, "cache": 8, "memory": 42, "llm": 1180 },
    "cost_usd":     "0.005670",
    "saved_usd":    null,
    "sticky":       null
  }
}
```

For the OpenAI shim (/v1/chat/completions) the body is plain OpenAI-shape; the same info is in X-Nexevo-* response headers (next section).
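As an illustration of consuming the conductor block, the sketch below reduces it to a few numbers a dashboard might track. The function name and output keys are hypothetical; the input fields are the ones shown in the example response, with cost_usd handled as a decimal string and null values tolerated.

```python
from decimal import Decimal


def summarize_conductor(conductor):
    """Reduce a conductor metadata block to a flat summary dict."""
    elapsed = conductor["elapsed_ms"]
    cost = conductor.get("cost_usd")
    saved = conductor.get("saved_usd")
    return {
        "model": conductor["model_chosen"],
        "cache_hit": conductor["cache"]["hit"],
        # Time spent outside the LLM call itself (cache, memory, other steps).
        "overhead_ms": elapsed["total"] - elapsed["llm"],
        # Decimal avoids float rounding on 6-decimal USD amounts.
        "cost_usd": Decimal(cost) if cost is not None else None,
        "saved_usd": Decimal(saved) if saved is not None else Decimal("0"),
    }
```

Applied to the example response above, the overhead is 1234 - 1180 = 54 ms, of which the cache and memory steps account for 8 + 42 = 50 ms.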

X-Nexevo-* response headers

| Header | Type | Meaning |
| --- | --- | --- |
| `X-Nexevo-Pipeline` | `pipe-delimited` | 9-step execution order, e.g. cache_miss\|model:claude-opus-4-7\|verify:pass |
| `X-Nexevo-Cache-Hit` | `true/false` | Whether this call hit cache |
| `X-Nexevo-Cache-Score` | `0.0–1.0` | Cosine similarity score (≥0.95 = hit) |
| `X-Nexevo-Cache-Via-Prewarm` | `true/false` | Whether hit was via cluster pre-warm job |
| `X-Nexevo-XMM-Attached` | `true/false` | Whether cross-model memory was injected |
| `X-Nexevo-XMM-Tokens` | `int` | Memory tokens attached |
| `X-Nexevo-XMM-Caps` | `int` | Capsule count attached |
| `X-Nexevo-XMM-Family` | `string` | Target model family (for diff encoding) |
| `X-Nexevo-Cost-Usd` | `float 6dp` | Actual LLM cost for this call (USD) |
| `X-Nexevo-Saved-Usd` | `float 6dp` | Cost saved by cache hit (USD) |
| `X-Nexevo-Elapsed-Ms` | `int` | End-to-end latency (ms) |
| `X-Nexevo-Judge-Verdict` | `pass/needs_review/fail` | Verify step verdict |
| `X-Nexevo-Tool-Retried` | `true/false` | Whether tool args were auto-retried once |
| `X-Usage-Input-Tokens` | `int` | Billed input tokens |
| `X-Usage-Output-Tokens` | `int` | Billed output tokens |
| `X-Trace-ID` | `uuid` | Request correlation ID; include when reporting issues |

Full curl examples

Minimal call (all options default):

```bash
# Clean entry: returns the conductor metadata block directly
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'

# OpenAI-compatible shim: existing OpenAI code only needs a new base_url
curl https://api.nexevo.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nexevo-auto",
    "messages": [{ "role": "user", "content": "Explain the CAP theorem" }]
  }'
```

With explicit options + advanced usage:

```bash
curl https://api.nexevo.ai/v1/conductor/chat \
  -H "Authorization: Bearer $NEXEVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a senior security engineer." },
      { "role": "user",   "content": "Review this JWT verification code" }
    ],
    "options": {
      "recall":       "auto",
      "verify":       "always",
      "agent":        "auto-if-multi-step",
      "cache":        "auto",
      "max_cost_usd": 0.20,
      "stream":       false
    },
    "metadata": {
      "conversation_id": "conv_abc123",
      "user_intent":     "code_review"
    },
    "temperature": 0.4,
    "max_tokens":  4096
  }'
```

FAQ

How is Conductor different from a regular AI gateway?

Regular gateways do routing + billing + observability. Conductor is the AI Runtime: one call gives you dynamic model selection + local cache + cross-model memory + verify + on-demand agent. These are not separate products glued together but one cooperating pipeline.

I'm already on OpenAI SDK — how costly is migration?

Change base_url and the API key (two lines) and set model=nexevo-auto. The response shape is strictly OpenAI-compatible, so existing SDK code stays untouched. Switch to /v1/conductor/chat later if you want explicit conductor metadata.

What happens when max_cost_usd is exceeded?

Conductor first tries to downgrade to a cheaper model. If no acceptable downgrade exists, it returns a max_cost_exceeded error rather than silently overcharging.
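The downgrade-then-fail behavior can be pictured with a small selection function. This is only a sketch of the idea: how Conductor estimates cost and what it counts as an "acceptable" downgrade are not specified here, so the candidate tuples, the `min_score` cutoff, and the exception class are all hypothetical.

```python
class MaxCostExceeded(Exception):
    """Stand-in for the max_cost_exceeded error."""


def pick_within_budget(candidates, max_cost_usd=0.10, min_score=0.0):
    """Pick the best-scoring model whose estimated cost fits the ceiling.

    candidates: iterable of (model_name, score, estimated_cost_usd) tuples.
    min_score: hypothetical cutoff for what counts as an acceptable downgrade.
    """
    affordable = [
        c for c in candidates
        if c[2] <= max_cost_usd and c[1] >= min_score
    ]
    if not affordable:
        # Nothing acceptable fits the ceiling: fail loudly instead of overspending.
        raise MaxCostExceeded("max_cost_exceeded")
    return max(affordable, key=lambda c: c[1])[0]
```

With candidates `("large", 0.9, 0.25)`, `("medium", 0.8, 0.08)` and `("small", 0.6, 0.01)` and the default ceiling of 0.10, the function skips `large` and returns `medium`.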

How do I read conductor metadata in streaming mode?

An extra SSE event (type=conductor.metadata) is sent at end-of-stream with the same payload as the non-streaming conductor block. X-Nexevo-* headers are also present in the HTTP response (streams have headers too).
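A client can pick the metadata event out of a buffered stream as sketched below. The exact wire format is an assumption: the sketch accepts either a standard SSE `event: conductor.metadata` field or a JSON `data:` payload carrying `"type": "conductor.metadata"`, and it skips an OpenAI-style `[DONE]` sentinel if one is present. The function name is hypothetical.

```python
import json


def extract_conductor_metadata(sse_text):
    """Return the conductor metadata payload from raw SSE text, or None."""
    # SSE events are separated by a blank line.
    for block in sse_text.replace("\r\n", "\n").split("\n\n"):
        event_name, data_lines = None, []
        for line in block.split("\n"):
            if line.startswith("event:"):
                event_name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].lstrip())
        if not data_lines:
            continue
        data = "\n".join(data_lines)
        if data == "[DONE]":
            continue
        try:
            payload = json.loads(data)
        except json.JSONDecodeError:
            continue  # not JSON, so not the metadata event
        if event_name == "conductor.metadata":
            return payload
        if isinstance(payload, dict) and payload.get("type") == "conductor.metadata":
            return payload
    return None
```

A production client would parse incrementally while the stream is open; buffering the full text keeps the sketch short.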

Is the cache per-user or shared across the tenant?

Per-user (API key scoped), with no cross-user leakage. An org-level shared cache is on the roadmap but not enabled today.

MCP integration doc — One-click Conductor in Claude Desktop / Cursor
Recall long-term memory doc — capsule architecture / pricing / REST API for the memory subsystem
Tasks doc (task-as-a-service) — Planner + Verifier + Auto-repair loop
Conductor product page — value props / comparison / customer scenarios
