LLM Interface

The interface layer handles model communication, key rotation, and rate limiting.

OpenAICompatible

Works with any provider that implements the OpenAI Chat Completions API:

from SimpleLLMFunc import OpenAICompatible

# From provider.json
models = OpenAICompatible.load_from_json_file("provider.json")
llm = models["openrouter"]["openai/gpt-4o"]

# Direct construction
from SimpleLLMFunc import APIKeyPool

llm = OpenAICompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
    max_retries=3,
    retry_delay=1.0,
    rate_limit_capacity=20,
    rate_limit_refill_rate=3.0,
    context_window=128_000,
)

Compatible with: OpenAI, OpenRouter, Together, Groq, local vLLM, Ollama, etc.

OpenAIResponsesCompatible

For providers implementing OpenAI’s Responses API:

from SimpleLLMFunc import OpenAIResponsesCompatible

llm = OpenAIResponsesCompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
)

Differences from OpenAICompatible:

Maps system prompts to instructions field
Handles Responses-specific streaming events
Supports reasoning={...} kwargs for reasoning effort
Different wire format for tool calls

From your decorator code, both adapters look the same. The wire-format differences are handled internally.

APIKeyPool

Manages multiple keys with round-robin rotation:

from SimpleLLMFunc import APIKeyPool

pool = APIKeyPool(
    api_keys=["sk-key-1", "sk-key-2", "sk-key-3"],
    provider_id="openrouter-gpt4",
)

When a key hits rate limits, the pool rotates to the next. Put your highest-rate keys first.

Rate Limiting

Built-in token bucket rate limiter:

# Configured via constructor
llm = OpenAICompatible(
    ...,
    rate_limit_capacity=20,       # Max concurrent "tokens" in the bucket
    rate_limit_refill_rate=3.0,   # Tokens added per second
)

# Check status
status = llm.get_rate_limit_status()
# {"available": 15, "capacity": 20, "refill_rate": 3.0}

# Reset after rate limit errors
llm.reset_rate_limit()

The rate limiter is per-instance. Multiple OpenAICompatible instances for the same model can have different rate limits.

Passing LLM kwargs

Extra parameters are forwarded to the provider:

@llm_chat(
    llm_interface=llm,
    temperature=0.7,
    max_tokens=4096,
    top_p=0.9,
)
async def agent(message: str, history: list | None = None):
    """My agent."""
    pass

For OpenAIResponsesCompatible, you can pass reasoning effort:

@llm_chat(
    llm_interface=llm,
    reasoning_effort="high",
)
async def reasoning_agent(message: str, history: list | None = None):
    """An agent that reasons deeply."""
    pass

Context Window

Set context_window to enable framework features that depend on knowing the model’s capacity:

llm = OpenAICompatible(
    ...,
    context_window=128_000,  # GPT-4o's context window
)

Used by: auto-compaction threshold calculations, token usage tracking. Default: 200,000 tokens. → API Reference: Interfaces

​LLM Interface

​OpenAICompatible

​OpenAIResponsesCompatible

​APIKeyPool

​Rate Limiting

​Passing LLM kwargs

​Context Window