Documentation Index
Fetch the complete documentation index at: https://simplellmfunc.cn/llms.txt
Use this file to discover all available pages before exploring further.
LLM Interface
The interface layer handles model communication, key rotation, and rate limiting.
OpenAICompatible
Works with any provider that implements the OpenAI Chat Completions API:
from SimpleLLMFunc import OpenAICompatible
# From provider.json
models = OpenAICompatible.load_from_json_file("provider.json")
llm = models["openrouter"]["openai/gpt-4o"]
# Direct construction
from SimpleLLMFunc import APIKeyPool
llm = OpenAICompatible(
api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
model_name="gpt-4o",
base_url="https://api.openai.com/v1",
max_retries=3,
retry_delay=1.0,
rate_limit_capacity=20,
rate_limit_refill_rate=3.0,
context_window=128_000,
)
Compatible with: OpenAI, OpenRouter, Together, Groq, local vLLM, Ollama, etc.
OpenAIResponsesCompatible
For providers implementing OpenAI’s Responses API:
from SimpleLLMFunc import OpenAIResponsesCompatible
llm = OpenAIResponsesCompatible(
api_key_pool=APIKeyPool(api_keys=["sk-key"], provider_id="openai"),
model_name="gpt-4o",
base_url="https://api.openai.com/v1",
)
Differences from OpenAICompatible:
- Maps system prompts to
instructions field
- Handles Responses-specific streaming events
- Supports
reasoning={...} kwargs for reasoning effort
- Different wire format for tool calls
From your decorator code, both adapters look the same. The wire-format differences are handled internally.
APIKeyPool
Manages multiple keys with round-robin rotation:
from SimpleLLMFunc import APIKeyPool
pool = APIKeyPool(
api_keys=["sk-key-1", "sk-key-2", "sk-key-3"],
provider_id="openrouter-gpt4",
)
When a key hits rate limits, the pool rotates to the next. Put your highest-rate keys first.
Rate Limiting
Built-in token bucket rate limiter:
# Configured via constructor
llm = OpenAICompatible(
...,
rate_limit_capacity=20, # Max concurrent "tokens" in the bucket
rate_limit_refill_rate=3.0, # Tokens added per second
)
# Check status
status = llm.get_rate_limit_status()
# {"available": 15, "capacity": 20, "refill_rate": 3.0}
# Reset after rate limit errors
llm.reset_rate_limit()
The rate limiter is per-instance. Multiple OpenAICompatible instances for the same model can have different rate limits.
Passing LLM kwargs
Extra parameters are forwarded to the provider:
@llm_chat(
llm_interface=llm,
temperature=0.7,
max_tokens=4096,
top_p=0.9,
)
async def agent(message: str, history: list | None = None):
"""My agent."""
pass
For OpenAIResponsesCompatible, you can pass reasoning effort:
@llm_chat(
llm_interface=llm,
reasoning_effort="high",
)
async def reasoning_agent(message: str, history: list | None = None):
"""An agent that reasons deeply."""
pass
Context Window
Set context_window to enable framework features that depend on knowing the model’s capacity:
llm = OpenAICompatible(
...,
context_window=128_000, # GPT-4o's context window
)
Used by: auto-compaction threshold calculations, token usage tracking.
Default: 200,000 tokens.
→ API Reference: Interfaces