This page describes SimpleLLMFunc’s LLM interface layer design, including unified interface abstraction, the OpenAICompatible and OpenAIResponsesCompatible adapters, key pools, and flow control.

What will you use?

LLM_Interface

Abstract base class that uniformly defines chat() and chat_stream().

OpenAICompatible

The default implementation for integrating with OpenAI-compatible interfaces, suitable as the entry point for most projects.

OpenAIResponsesCompatible

An adapter for OpenAI Responses API endpoints while keeping the same decorator-facing usage model.

APIKeyPool

Load-balances and allocates tasks across multiple API keys.

TokenBucket

Responsible for request rate control to avoid hitting backend rate limits.
If you only want to quickly integrate the model, prefer using OpenAICompatible.load_from_json_file(...) or OpenAIResponsesCompatible.load_from_json_file(...). Only when you need programmatic control over the key pool or rate limiting parameters should you manually instantiate the interface object.
Simple rule: use OpenAICompatible for normal chat/completions-compatible endpoints, and OpenAIResponsesCompatible for OpenAI Responses API endpoints. Both use the same provider.json structure.

Quick Start

1. Create provider.json

{
  "openai": [
    {
      "model_name": "gpt-3.5-turbo",
      "api_keys": ["sk-key1", "sk-key2"],
      "base_url": "https://api.openai.com/v1",
      "max_retries": 5,
      "retry_delay": 1.0,
      "rate_limit_capacity": 20,
      "rate_limit_refill_rate": 3.0
    }
  ],
  "deepseek": [
    {
      "model_name": "deepseek-chat",
      "api_keys": ["your-deepseek-key"],
      "base_url": "https://api.deepseek.com/v1"
    }
  ]
}
2. Load models from the configuration file

from SimpleLLMFunc import OpenAICompatible, OpenAIResponsesCompatible

llm = OpenAICompatible.load_from_json_file("provider.json")["openai"]["gpt-3.5-turbo"]
responses_llm = OpenAIResponsesCompatible.load_from_json_file("provider.json")["openai"]["gpt-3.5-turbo"]
3. Use the interface with a decorator

from SimpleLLMFunc import llm_function

@llm_function(llm_interface=llm)
async def my_task(text: str) -> str:
    """Process a text task."""
    pass

Component Description

LLM_Interface is the abstract base class for all LLM implementations, defining a unified interface specification. Its goal is to converge call patterns and return formats into a single consistent interface. Both OpenAICompatible and OpenAIResponsesCompatible inherit from it.
Core features:
  • Standardized interface: unified chat() and chat_stream()
  • Type safety: works with Python type annotations
  • Async-native: suited to high-concurrency and event-stream scenarios
  • Extensible: easy to integrate new OpenAI-compatible services
from abc import ABC, abstractmethod
from typing import Iterable, Dict, Optional, Literal, AsyncGenerator

from openai.types.chat.chat_completion import ChatCompletion
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

# Trace-ID helper used in the default arguments (import path may vary by version)
from SimpleLLMFunc.logger import get_current_trace_id

class LLM_Interface(ABC):
    @abstractmethod
    async def chat(
        self,
        trace_id: str = get_current_trace_id(),
        stream: Literal[False] = False,
        messages: Iterable[Dict[str, str]] = [{"role": "user", "content": ""}],
        timeout: Optional[int] = None,
        *args,
        **kwargs,
    ) -> ChatCompletion:
        pass

    @abstractmethod
    async def chat_stream(
        self,
        trace_id: str = get_current_trace_id(),
        stream: Literal[True] = True,
        messages: Iterable[Dict[str, str]] = [{"role": "user", "content": ""}],
        timeout: Optional[int] = None,
        *args,
        **kwargs,
    ) -> AsyncGenerator[ChatCompletionChunk, None]:
        # The unreachable yield marks this abstract method as an async generator.
        if False:
            yield ChatCompletionChunk(id="", created=0, model="", object="", choices=[])
OpenAICompatible is the default implementation of LLM_Interface and supports any service compatible with the OpenAI API, including:
  • OpenAI
  • DeepSeek
  • Anthropic Claude (via its OpenAI-compatible endpoint)
  • Volcano Engine Ark
  • Baidu Qianfan
  • Ollama, vLLM, and other local model services
  • Other OpenAI-compatible providers
Additional capabilities:
  • Automatic retry
  • Token usage statistics
  • Rate limiting
  • Multi-key rotation
OpenAIResponsesCompatible integrates with OpenAI Responses API endpoints while keeping the same @llm_function / @llm_chat surface for application code.
Key points:
  • It uses the same provider.json structure as OpenAICompatible
  • Its constructor still takes APIKeyPool, model_name, and base_url
  • Decorator code still builds normal system/user messages first; the adapter then maps the selected system prompt to Responses instructions
  • Responses-specific features such as reasoning={...} are forwarded by the adapter without forcing application code to speak the raw Responses schema
APIKeyPool uses a min-heap to track the current load of each key, preferentially assigning the most idle one.
You get:
  • Automatic load balancing
  • Concurrent task-count tracking
  • Thread safety under lock protection
  • Shared state across instances with the same provider_id
Heap structure: [(task_count, api_key), ...]
The top of the heap is always the least-loaded key.
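The min-heap idea can be sketched in a few lines of plain Python (this is an illustration of the data structure, not the library's actual implementation; `acquire_key` is a hypothetical helper):

```python
import heapq

# The heap holds (task_count, api_key) pairs, so heap[0] is always the
# least-loaded key.
heap = [(0, "sk-key1"), (0, "sk-key2"), (0, "sk-key3")]
heapq.heapify(heap)

def acquire_key(heap):
    """Pop the least-loaded key, bump its task count, and push it back."""
    count, key = heapq.heappop(heap)
    heapq.heappush(heap, (count + 1, key))
    return key
```

Because every acquisition pushes the key back with an incremented count, successive calls naturally rotate across idle keys.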
TokenBucket uses the classic token bucket algorithm to control the request rate.
Algorithm key points:
  1. Tokens are refilled at a fixed rate
  2. Each request consumes tokens
  3. Bucket capacity is limited; excess tokens are discarded
  4. Burst requests are allowed while the bucket holds enough tokens
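The four points above can be sketched as a minimal synchronous token bucket (an illustration only; the library's TokenBucket is async and its API may differ):

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: refill at a fixed rate, cap at capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full, so bursts are allowed
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        # Tokens accumulate continuously; anything above capacity is discarded.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Consume tokens if available; return False when rate-limited."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

A bucket with capacity 20 and refill_rate 3.0 therefore allows a burst of 20 requests, then settles to a sustained 3 requests per second.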

How to Create Interface Instances

from SimpleLLMFunc import OpenAICompatible

all_models = OpenAICompatible.load_from_json_file("provider.json")

gpt35 = all_models["openai"]["gpt-3.5-turbo"]
gpt4 = all_models["openai"]["gpt-4"]
deepseek = all_models["deepseek"]["deepseek-chat"]

APIKeyPool Usage Example

from SimpleLLMFunc.interface import APIKeyPool

key_pool = APIKeyPool(
    api_keys=["sk-key1", "sk-key2", "sk-key3"],
    provider_id="my-provider",
)

key = key_pool.get_least_loaded_key()
key_pool.increment_task_count(key)
key_pool.decrement_task_count(key)

TokenBucket Parameter Suggestions

| Parameter | Type | Explanation | Recommended range |
| --- | --- | --- | --- |
| capacity | int | Token bucket capacity | 10-50 |
| refill_rate | float | Tokens refilled per second | 0.5-5.0 |
Common scenarios:
# High-throughput APIs (for example OpenAI)
capacity = 20
refill_rate = 3.0

# Standard API
capacity = 10
refill_rate = 1.0

# Strictly rate-limited API
capacity = 5
refill_rate = 0.5

Production Mode

from SimpleLLMFunc import OpenAICompatible, llm_function

models = OpenAICompatible.load_from_json_file("provider.json")

fast_llm = models["openai"]["gpt-3.5-turbo"]
powerful_llm = models["openai"]["gpt-4"]
deepseek_llm = models["deepseek"]["deepseek-chat"]

@llm_function(llm_interface=fast_llm)
async def simple_task(text: str) -> str:
    """Use a faster model for simple tasks."""
    pass

@llm_function(llm_interface=powerful_llm)
async def complex_task(text: str) -> str:
    """Use a more capable model for complex tasks."""
    pass

@llm_function(llm_interface=fast_llm)
async def robust_call(text: str) -> str:
    """A resilient LLM call on the primary model."""
    pass

@llm_function(llm_interface=deepseek_llm)
async def backup_call(text: str) -> str:
    """The same task on a backup model."""
    pass

async def call_with_fallback():
    try:
        return await robust_call("test")
    except Exception as e:
        print(f"Primary model failed: {e}")
        return await backup_call("test")
from SimpleLLMFunc.hooks.events import LLMCallEndEvent
from SimpleLLMFunc.hooks.stream import is_event_yield

# summarize_text is assumed to be a decorated streaming function whose output
# interleaves hook events with content.
async for output in summarize_text("..."):
    if is_event_yield(output) and isinstance(output.event, LLMCallEndEvent):
        print(output.event.usage)

least_loaded = llm.key_pool.get_least_loaded_key()
print(f"Least-loaded key: {least_loaded}")
print(llm.get_rate_limit_status())

Best Practices

  • Use different key sets for different environments
  • Set more conservative retry and rate-limit parameters for high-cost models
  • Do not let multiple environments share the same high-traffic production key
  • Set capacity and refill_rate according to the vendor's rate limit
  • For local models, both values can be set higher
  • When rate limiting kicks in, adjust these parameters first, then consider adding keys
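Applied to a high-cost model, these practices might look like the following provider.json entry (a sketch; the field names mirror the example at the top of this page, and the key is a placeholder):

```json
{
  "openai": [
    {
      "model_name": "gpt-4",
      "api_keys": ["sk-prod-only-key"],
      "base_url": "https://api.openai.com/v1",
      "max_retries": 3,
      "retry_delay": 2.0,
      "rate_limit_capacity": 5,
      "rate_limit_refill_rate": 0.5
    }
  ]
}
```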
import asyncio
from typing import Optional

async def call_with_exponential_backoff(
    llm_call,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    for attempt in range(max_retries):
        try:
            return await llm_call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))

Troubleshooting

Rate-limit errors:
  • Increase rate_limit_capacity or rate_limit_refill_rate
  • Check whether the configuration matches the vendor's restrictions
Authentication or connection errors:
  • Check that the API key is valid and has remaining quota
  • Check that base_url is correct
Incomplete token statistics:
  • Some vendors do not return complete token statistics
  • The framework tries to estimate usage, but estimates may not always be fully accurate
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("SimpleLLMFunc")
logger.setLevel(logging.DEBUG)

Summary

The interface layer of SimpleLLMFunc centralizes model access, rate limiting, and key management into a unified abstraction:
  1. LLM_Interface defines the unified interface
  2. OpenAICompatible and OpenAIResponsesCompatible provide the default implementations
  3. APIKeyPool balances load across multiple keys
  4. TokenBucket controls the request rate
This design is suitable for quick integration as well as gradual scaling to production environments.