Skip to main content

LLM Interface

The interface layer handles model communication, key rotation, and rate limiting.

OpenAICompatible

Works with any provider that implements the OpenAI Chat Completions API:
from SimpleLLMFunc import OpenAICompatible

# From provider.json
models = OpenAICompatible.load_from_json_file("provider.json")
llm = models["openrouter"]["openai/gpt-4o"]

# Direct construction
from SimpleLLMFunc import APIKeyPool

llm = OpenAICompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
    max_retries=3,
    retry_delay=1.0,
    rate_limit_capacity=20,
    rate_limit_refill_rate=3.0,
    context_window=128_000,
)
Compatible with: OpenAI, OpenRouter, Together, Groq, local vLLM, Ollama, etc.

OpenAIResponsesCompatible

For providers implementing OpenAI’s Responses API:
from SimpleLLMFunc import OpenAIResponsesCompatible

llm = OpenAIResponsesCompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
)
Differences from OpenAICompatible:
  • Maps system prompts to instructions field
  • Handles Responses-specific streaming events
  • Supports reasoning={...} kwargs for reasoning effort
  • Different wire format for tool calls
From your decorator code, both adapters look the same. The wire-format differences are handled internally.

APIKeyPool

Manages multiple keys with round-robin rotation:
from SimpleLLMFunc import APIKeyPool

pool = APIKeyPool(
    api_keys=["sk-key-1", "sk-key-2", "sk-key-3"],
    provider_id="openrouter-gpt4",
)
When a key hits rate limits, the pool rotates to the next. Put your highest-rate keys first.

Rate Limiting

Built-in token bucket rate limiter:
# Configured via constructor
llm = OpenAICompatible(
    ...,
    rate_limit_capacity=20,       # Max concurrent "tokens" in the bucket
    rate_limit_refill_rate=3.0,   # Tokens added per second
)

# Check status
status = llm.get_rate_limit_status()
# {"available": 15, "capacity": 20, "refill_rate": 3.0}

# Reset after rate limit errors
llm.reset_rate_limit()
The rate limiter is per-instance. Multiple OpenAICompatible instances for the same model can have different rate limits.

Passing LLM kwargs

Extra parameters are forwarded to the provider:
@llm_chat(
    llm_interface=llm,
    temperature=0.7,
    max_tokens=4096,
    top_p=0.9,
)
async def agent(message: str, history: list | None = None):
    """My agent."""
    pass
For OpenAIResponsesCompatible, you can pass reasoning effort:
@llm_chat(
    llm_interface=llm,
    reasoning_effort="high",
)
async def reasoning_agent(message: str, history: list | None = None):
    """An agent that reasons deeply."""
    pass

Context Window

Set context_window to enable framework features that depend on knowing the model’s capacity:
llm = OpenAICompatible(
    ...,
    context_window=128_000,  # GPT-4o's context window
)
Used by: auto-compaction threshold calculations, token usage tracking. Default: 200,000 tokens. API Reference: Interfaces