v0.2.0 · ocp-router · New package

Hybrid local/cloud model routing.

Instead of sending every request to a paid API, OCP now scores each request for complexity and dispatches it to the right model tier automatically. Simple tasks stay local. Complex reasoning escalates to your paid provider.

Zero IDE changes · Deterministic scoring · Vendor-neutral backends · 54 tests passing
Install

New package: ocp-router

Install the core package plus whichever paid backend you use. ocp-server and ocp-client are unchanged at v0.1.1 — no breaking changes to existing deployments.

pip install ocp-router                  Core — Ollama local backend included
pip install "ocp-router[anthropic]"     + Anthropic Claude paid backend
pip install "ocp-router[openai]"        + OpenAI paid backend
# Upgrade from earlier version
pip install --upgrade ocp-router

# ocp-server and ocp-client are unchanged
# No breaking changes to existing deployments
How it works

Classify → dispatch → trace.

Three components work together: TaskClassifier scores complexity, OCPRouter dispatches to the right backend, and every call returns a full RouteResult trace.

1. TaskClassifier — instant complexity scoring

Every request is scored 0.0 → 1.0 using deterministic heuristics — no model required, runs in microseconds.

Request                              Score   Destination
"explain this function"              0.00    Local
"find all usages of db.connect"      0.00    Local
"summarise the last session"         0.00    Local
"refactor the login function"        0.25    Local
── threshold 0.5 (configurable) ──────────────────
"review security vulnerabilities"    0.55    Paid
"design the payment architecture"    0.55    Paid
"debug this production deadlock"     0.60    Paid
"refactor auth across all files"     0.80    Paid

Escalates to paid ↑

security     +0.55
architect    +0.55
migrate      +0.55
deadlock     +0.40
multi-file   +0.35
refactor     +0.25
Token length, code block size, file path count   +varies

Keeps local ↓

explain                   −0.10
summarise                 −0.10
search                    −0.10
Short token count         −varies
No code blocks            −varies
No file path references   −varies
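
For intuition, the additive scoring can be sketched in a few lines of Python. This is an illustration reconstructed from the signal tables above, not the package's internals; the real classifier also weighs token length, code block size, and file path count (the "varies" rows), which the sketch omits, so its scores will not always match the table.

# Illustrative sketch only, reconstructed from the signal tables above.
ESCALATE   = {"security": 0.55, "architect": 0.55, "migrate": 0.55,
              "deadlock": 0.40, "multi-file": 0.35, "refactor": 0.25}
KEEP_LOCAL = {"explain": -0.10, "summarise": -0.10, "search": -0.10}

def score(request: str) -> float:
    # Sum every keyword weight that fires, then clamp to 0.0 → 1.0.
    text = request.lower()
    total = sum(w for kw, w in {**ESCALATE, **KEEP_LOCAL}.items() if kw in text)
    return max(0.0, min(1.0, total))

score("refactor the login function")       # 0.25, stays local
score("review security vulnerabilities")   # 0.55, escalates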

2. OCPRouter — one call, full trace

make_router() reads all config from env vars. Every call returns the answer and a full trace of the routing decision.

from ocp_router import make_router

router = await make_router() # reads env vars

result = await router.route("explain the auth middleware")
result.route_to                     "local"                    which tier
result.classify.complexity_score    0.00                       0.0 → 1.0
result.classify.signals             []                         heuristics fired
result.model                        "llama3.2"                 model used
result.prompt_tokens                34                         tokens in
result.completion_tokens            41                         tokens out
result.duration_ms                  82.0                       measured
result.text                         "The auth middleware..."   the answer
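
The same trace makes an escalation auditable. A sketch using the classifier table above; the exact contents of the signals list are an assumption:

result = await router.route("refactor auth across all files")

result.route_to                     # "paid": score 0.80 clears the 0.5 threshold
result.classify.complexity_score    # 0.80
result.classify.signals             # e.g. ["refactor", "multi-file"] (illustrative)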

3. OllamaBackend — verified locally

Integration-tested against llama3.2 running via ollama serve. Ollama uses Metal GPU acceleration on Apple Silicon and CUDA on NVIDIA hardware — a dedicated GPU host is recommended for production.

# Integration test result
test_ollama_live_generate PASSED (10.59s including model cold start)
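
The backend also works on its own, outside the router. A minimal sketch: the constructor arguments are assumed to mirror the OCP_LOCAL_MODEL and OCP_OLLAMA_URL env vars documented below, while is_available() and model come from the ModelBackend protocol in the next section.

from ocp_router import OllamaBackend

# Constructor arguments assumed to mirror OCP_LOCAL_MODEL / OCP_OLLAMA_URL.
backend = OllamaBackend(model="llama3.2", base_url="http://localhost:11434")

if await backend.is_available():      # ModelBackend protocol method
    print(f"{backend.model} is up")   # "llama3.2 is up"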

4. Vendor-neutral backend protocol

Both local and paid slots accept any object implementing the ModelBackend protocol — three methods, no base class required. AnthropicBackend and OpenAIBackend are convenience adapters.

# Names as used in this section; the import path is assumed.
from ocp_router import GenerateRequest, GenerateResponse, OCPRouter, TaskClassifier

class MyBackend:
    @property
    def model(self) -> str: ...

    async def is_available(self) -> bool: ...

    async def generate(self, request: GenerateRequest) -> GenerateResponse: ...

# Wire it in:
router = OCPRouter(
    local=MyBackend(),
    paid=MyBackend(),
    classifier=TaskClassifier(),
)
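
Because the protocol is structural, even a stub for tests qualifies. A sketch that assumes GenerateRequest carries a prompt field and GenerateResponse accepts text; the actual dataclass fields are not shown in this release note.

from ocp_router import GenerateRequest, GenerateResponse

class EchoBackend:
    """Satisfies ModelBackend with no model behind it; handy in tests."""

    @property
    def model(self) -> str:
        return "echo"

    async def is_available(self) -> bool:
        return True

    async def generate(self, request: GenerateRequest) -> GenerateResponse:
        # Field names are assumptions; adjust to the real dataclasses.
        return GenerateResponse(text=request.prompt)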
IDE integration

Three env vars. Nothing else changes.

Add three lines to your .mcp.json env block. Works with Claude Code, Cursor, Windsurf, and any MCP-compatible IDE.

{ "mcpServers": { "ocp": { "command": "uvx", "args": ["ocp-server"], "env": { "OCP_DB_PATH": "${workspaceFolder}/.ocp.db", "OCP_LOCAL_MODEL": "llama3.2", ← new "OCP_OLLAMA_URL": "http://localhost:11434", ← new "OCP_ROUTE_THRESHOLD": "0.5" ← new } } } }
Configuration

All config via env vars.

Every option has a sensible default. Override only what you need.

Variable              Default                  Description
OCP_LOCAL_MODEL       llama3.2                 Ollama model to use for local inference
OCP_OLLAMA_URL        http://localhost:11434   Ollama base URL
OCP_PAID_BACKEND      anthropic                anthropic or openai
OCP_PAID_MODEL        claude-sonnet-4-6        Paid model identifier
OCP_ROUTE_THRESHOLD   0.5                      Complexity score above which requests escalate to paid. Lower it toward 0.0 to escalate more requests; raise it toward 1.0 to keep everything local.
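
Since all configuration is environmental, switching the paid tier is a two-line override before make_router() runs. A sketch; the model identifier shown is illustrative.

import os
from ocp_router import make_router

os.environ["OCP_PAID_BACKEND"] = "openai"     # swap the paid tier
os.environ["OCP_PAID_MODEL"] = "gpt-4o"       # illustrative model id
os.environ["OCP_ROUTE_THRESHOLD"] = "0.7"     # escalate less often

router = await make_router()                  # picks up the overrides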
Full pipeline

Where routing fits in OCP.

Routing builds on OCP's existing context primitives. Before a complex request reaches your paid provider, OCP assembles an enriched prompt from three sources — so the model gets targeted, context-rich input, not just the raw request.

ocp.context.search    Retrieve the relevant chunks from the workspace index for this specific request
ocp.session.restore   Inject prior decisions, summaries, and state from earlier sessions
ocp.model.route       Score complexity → assemble enriched prompt → dispatch to local or paid — new in v0.2.0
  ↳ local model       Raw request → Ollama · fast, private, free
  ↳ paid provider     Enriched prompt → Claude / GPT-4 · context-rich input, better output
ocp.result.save       Persist the full RouteResult trace for the next session
What the paid model actually receives

Context   Retrieved workspace chunks relevant to this request — not the whole codebase, only what matters
Session   Summary of prior decisions, agreed conventions, and state from earlier agent sessions
Request   The original developer request, unchanged

Enriched prompt — targeted, context-rich, session-aware. The paid model reasons over the right information rather than starting from scratch every call.

Simple tasks routed to the local model skip this assembly — they receive only the raw request, which is why they return in under 100ms.
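
The shape of that assembly can be illustrated with a hypothetical template; OCP's actual prompt format and section markers are not documented in this release note.

def build_enriched_prompt(chunks: list[str], session: str, request: str) -> str:
    """Hypothetical template for the three-part prompt described above."""
    return "\n\n".join([
        "## Context\n" + "\n".join(chunks),   # from ocp.context.search
        "## Session\n" + session,             # from ocp.session.restore
        "## Request\n" + request,             # original request, unchanged
    ])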

What's next

ocp.prompt.prepare

Before a complex request reaches your paid provider, a local step compresses and optimises the prompt. Fewer tokens in, better answer out. Coming in a future release.