Skip to main content

Documentation Index

Fetch the complete documentation index at: https://simplellmfunc.cn/llms.txt

Use this file to discover all available pages before exploring further.

LLM Interface

The interface layer handles model communication, key rotation, and rate limiting.

OpenAICompatible

Works with any provider that implements the OpenAI Chat Completions API:
from SimpleLLMFunc import OpenAICompatible

# From provider.json
models = OpenAICompatible.load_from_json_file("provider.json")
llm = models["openrouter"]["openai/gpt-4o"]

# Direct construction
from SimpleLLMFunc import APIKeyPool

llm = OpenAICompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
    max_retries=3,
    retry_delay=1.0,
    rate_limit_capacity=20,
    rate_limit_refill_rate=3.0,
    context_window=128_000,
)
Compatible with: OpenAI, OpenRouter, Together, Groq, local vLLM, Ollama, etc.

OpenAIResponsesCompatible

For providers implementing OpenAI’s Responses API:
from SimpleLLMFunc import OpenAIResponsesCompatible

llm = OpenAIResponsesCompatible(
    api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
    model_name="gpt-4o",
    base_url="https://api.openai.com/v1",
)
Differences from OpenAICompatible:
  • Maps system prompts to instructions field
  • Handles Responses-specific streaming events
  • Supports reasoning={...} kwargs for reasoning effort
  • Different wire format for tool calls
From your decorator code, both adapters look the same. The wire-format differences are handled internally.

APIKeyPool

Manages multiple keys with round-robin rotation:
from SimpleLLMFunc import APIKeyPool

pool = APIKeyPool(
    api_keys=["sk-key-1", "sk-key-2", "sk-key-3"],
    provider_id="openrouter-gpt4",
)
When a key hits rate limits, the pool rotates to the next. Put your highest-rate keys first.

Rate Limiting

Built-in token bucket rate limiter:
# Configured via constructor
llm = OpenAICompatible(
    ...,
    rate_limit_capacity=20,       # Max concurrent "tokens" in the bucket
    rate_limit_refill_rate=3.0,   # Tokens added per second
)

# Check status
status = llm.get_rate_limit_status()
# {"available": 15, "capacity": 20, "refill_rate": 3.0}

# Reset after rate limit errors
llm.reset_rate_limit()
The rate limiter is per-instance. Multiple OpenAICompatible instances for the same model can have different rate limits.

Passing LLM kwargs

Extra parameters are forwarded to the provider:
@llm_chat(
    llm_interface=llm,
    temperature=0.7,
    max_tokens=4096,
    top_p=0.9,
)
async def agent(message: str, history: list | None = None):
    """My agent."""
    pass
For OpenAIResponsesCompatible, you can pass reasoning effort:
@llm_chat(
    llm_interface=llm,
    reasoning_effort="high",
)
async def reasoning_agent(message: str, history: list | None = None):
    """An agent that reasons deeply."""
    pass

Context Window

Set context_window to enable framework features that depend on knowing the model’s capacity:
llm = OpenAICompatible(
    ...,
    context_window=128_000,  # GPT-4o's context window
)
Used by: auto-compaction threshold calculations, token usage tracking. Default: 200,000 tokens. API Reference: Interfaces