This page describes SimpleLLMFunc’s LLM interface layer design, including unified interface abstraction, the OpenAICompatible and OpenAIResponsesCompatible adapters, key pools, and flow control.

What will you use?

LLM_Interface

Abstract base class that uniformly defines chat() and chat_stream().

OpenAICompatible

The default implementation for integrating with OpenAI-compatible interfaces, suitable as the entry point for most projects.

OpenAIResponsesCompatible

An adapter for OpenAI Responses API endpoints while keeping the same decorator-facing usage model.

APIKeyPool

Load-balances and allocates tasks across multiple API keys.

TokenBucket

Responsible for request rate control to avoid hitting backend rate limits.
If you only want to quickly integrate the model, prefer using OpenAICompatible.load_from_json_file(...) or OpenAIResponsesCompatible.load_from_json_file(...). Only when you need programmatic control over the key pool or rate limiting parameters should you manually instantiate the interface object.
Simple rule: use OpenAICompatible for normal chat/completions-compatible endpoints, and OpenAIResponsesCompatible for OpenAI Responses API endpoints. Both use the same provider.json structure.

Quick Start

1. Create provider.json

{
  "openai": [
    {
      "model_name": "gpt-3.5-turbo",
      "api_keys": ["sk-key1", "sk-key2"],
      "base_url": "https://api.openai.com/v1",
      "max_retries": 5,
      "retry_delay": 1.0,
      "rate_limit_capacity": 20,
      "rate_limit_refill_rate": 3.0
    }
  ],
  "deepseek": [
    {
      "model_name": "deepseek-chat",
      "api_keys": ["your-deepseek-key"],
      "base_url": "https://api.deepseek.com/v1"
    }
  ]
}
2. Load models from the configuration file

from SimpleLLMFunc import OpenAICompatible, OpenAIResponsesCompatible

llm = OpenAICompatible.load_from_json_file("provider.json")["openai"]["gpt-3.5-turbo"]
responses_llm = OpenAIResponsesCompatible.load_from_json_file("provider.json")["openai"]["gpt-3.5-turbo"]
3. Use the interface with a decorator

from SimpleLLMFunc import llm_function

@llm_function(llm_interface=llm)
async def my_task(text: str) -> str:
    """Process a text task."""
    pass

Component Description

LLM_Interface is the abstract base class for all LLM implementations, defining a unified interface specification. Its goal is to converge call patterns and return formats into a single consistent interface. Both OpenAICompatible and OpenAIResponsesCompatible inherit from it.
Core features:
  • Standardized interface: unified chat() and chat_stream()
  • Type safety: works with Python type annotations
  • Async-native: suited to high-concurrency and event-stream scenarios
  • Extensible: easy to integrate new OpenAI-compatible services
from abc import ABC, abstractmethod
from typing import Iterable, Dict, Optional, Literal, AsyncGenerator

from openai.types.chat.chat_completion import ChatCompletion
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

# Trace-ID helper used in the default arguments (import path may vary by version)
from SimpleLLMFunc.logger import get_current_trace_id

class LLM_Interface(ABC):
    @abstractmethod
    async def chat(
        self,
        trace_id: str = get_current_trace_id(),
        stream: Literal[False] = False,
        messages: Iterable[Dict[str, str]] = [{"role": "user", "content": ""}],
        timeout: Optional[int] = None,
        *args,
        **kwargs,
    ) -> ChatCompletion:
        pass

    @abstractmethod
    async def chat_stream(
        self,
        trace_id: str = get_current_trace_id(),
        stream: Literal[True] = True,
        messages: Iterable[Dict[str, str]] = [{"role": "user", "content": ""}],
        timeout: Optional[int] = None,
        *args,
        **kwargs,
    ) -> AsyncGenerator[ChatCompletionChunk, None]:
        # The unreachable yield marks this abstract method as an async generator.
        if False:
            yield ChatCompletionChunk(id="", created=0, model="", object="", choices=[])
OpenAICompatible is the default implementation of LLM_Interface and supports any service compatible with the OpenAI API, including:
  • OpenAI
  • DeepSeek
  • Anthropic Claude (via its OpenAI-compatible endpoint)
  • Volcano Engine Ark
  • Baidu Qianfan
  • Ollama, vLLM, and other local model services
  • Other OpenAI-compatible providers
Additional capabilities:
  • Automatic retry
  • Token usage statistics
  • Rate limiting
  • Multi-key rotation
OpenAIResponsesCompatible integrates with OpenAI Responses API endpoints while keeping the same @llm_function / @llm_chat surface for application code.
Key points:
  • It uses the same provider.json structure as OpenAICompatible
  • Its constructor still takes APIKeyPool, model_name, and base_url
  • Decorator code still builds normal system/user messages first; the adapter then maps the selected system prompt to Responses instructions
  • Responses-specific features such as reasoning={...} are forwarded by the adapter without forcing application code to speak the raw Responses schema
APIKeyPool uses a min-heap to track the current load of each key, preferentially assigning the most idle one.
You get:
  • Automatic load balancing
  • Concurrent task-count tracking
  • Thread safety under lock protection
  • Shared state across instances with the same provider_id
Heap structure: [(task_count, api_key), ...]
The top of the heap is always the least-loaded key.
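The min-heap idea can be sketched in a few lines of plain Python (this is an illustration of the data structure, not the library's actual implementation; `acquire_key` is a hypothetical helper):

```python
import heapq

# The heap holds (task_count, api_key) pairs, so heap[0] is always the
# least-loaded key.
heap = [(0, "sk-key1"), (0, "sk-key2"), (0, "sk-key3")]
heapq.heapify(heap)

def acquire_key(heap):
    """Pop the least-loaded key, bump its task count, and push it back."""
    count, key = heapq.heappop(heap)
    heapq.heappush(heap, (count + 1, key))
    return key
```

Because every acquisition pushes the key back with an incremented count, successive calls naturally rotate across idle keys.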
TokenBucket uses the classic token bucket algorithm to control the request rate.
Algorithm key points:
  1. Tokens are refilled at a fixed rate
  2. Each request consumes tokens
  3. Bucket capacity is limited; excess tokens are discarded
  4. Burst requests are allowed while the bucket holds enough tokens
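The four points above can be sketched as a minimal synchronous token bucket (an illustration only; the library's TokenBucket is async and its API may differ):

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: refill at a fixed rate, cap at capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full, so bursts are allowed
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        # Tokens accumulate continuously; anything above capacity is discarded.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Consume tokens if available; return False when rate-limited."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

A bucket with capacity 20 and refill_rate 3.0 therefore allows a burst of 20 requests, then settles to a sustained 3 requests per second.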

How to Create Interface Instances

from SimpleLLMFunc import OpenAICompatible

all_models = OpenAICompatible.load_from_json_file("provider.json")

gpt35 = all_models["openai"]["gpt-3.5-turbo"]
gpt4 = all_models["openai"]["gpt-4"]
deepseek = all_models["deepseek"]["deepseek-chat"]

APIKeyPool Usage Example

from SimpleLLMFunc.interface import APIKeyPool

key_pool = APIKeyPool(
    api_keys=["sk-key1", "sk-key2", "sk-key3"],
    provider_id="my-provider",
)

key = key_pool.get_least_loaded_key()
key_pool.increment_task_count(key)
key_pool.decrement_task_count(key)

TokenBucket Parameter Suggestions

| Parameter | Type | Explanation | Recommended range |
| --- | --- | --- | --- |
| capacity | int | Token bucket capacity | 10-50 |
| refill_rate | float | Tokens refilled per second | 0.5-5.0 |
Common scenarios:
# High-throughput APIs (for example OpenAI)
capacity = 20
refill_rate = 3.0

# Standard API
capacity = 10
refill_rate = 1.0

# Strictly rate-limited API
capacity = 5
refill_rate = 0.5

Production Mode

from SimpleLLMFunc import OpenAICompatible, llm_function

models = OpenAICompatible.load_from_json_file("provider.json")

fast_llm = models["openai"]["gpt-3.5-turbo"]
powerful_llm = models["openai"]["gpt-4"]
deepseek_llm = models["deepseek"]["deepseek-chat"]

@llm_function(llm_interface=fast_llm)
async def simple_task(text: str) -> str:
    """Use a faster model for simple tasks."""
    pass

@llm_function(llm_interface=powerful_llm)
async def complex_task(text: str) -> str:
    """Use a more capable model for complex tasks."""
    pass

@llm_function(llm_interface=fast_llm)
async def robust_call(text: str) -> str:
    """A resilient LLM call on the primary model."""
    pass

@llm_function(llm_interface=deepseek_llm)
async def backup_call(text: str) -> str:
    """The same task on a backup model."""
    pass

async def call_with_fallback():
    try:
        return await robust_call("test")
    except Exception as e:
        print(f"Primary model failed: {e}")
        return await backup_call("test")
from SimpleLLMFunc.hooks.events import LLMCallEndEvent
from SimpleLLMFunc.hooks.stream import is_event_yield

# summarize_text is assumed to be a decorated streaming function whose output
# interleaves hook events with content.
async for output in summarize_text("..."):
    if is_event_yield(output) and isinstance(output.event, LLMCallEndEvent):
        print(output.event.usage)

least_loaded = llm.key_pool.get_least_loaded_key()
print(f"Least-loaded key: {least_loaded}")
print(llm.get_rate_limit_status())

Best Practices

  • Use different key sets for different environments
  • Set more conservative retry and rate-limit parameters for high-cost models
  • Do not let multiple environments share the same high-traffic production key
  • Set capacity and refill_rate according to the vendor's rate limit
  • For local models, both values can be set higher
  • When rate limiting kicks in, adjust these parameters first, then consider adding keys
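Applied to a high-cost model, these practices might look like the following provider.json entry (a sketch; the field names mirror the example at the top of this page, and the key is a placeholder):

```json
{
  "openai": [
    {
      "model_name": "gpt-4",
      "api_keys": ["sk-prod-only-key"],
      "base_url": "https://api.openai.com/v1",
      "max_retries": 3,
      "retry_delay": 2.0,
      "rate_limit_capacity": 5,
      "rate_limit_refill_rate": 0.5
    }
  ]
}
```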
import asyncio
from typing import Optional

async def call_with_exponential_backoff(
    llm_call,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    for attempt in range(max_retries):
        try:
            return await llm_call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))

Troubleshooting

Rate-limit errors:
  • Increase rate_limit_capacity or rate_limit_refill_rate
  • Check whether the configuration matches the vendor's restrictions
Authentication or connection errors:
  • Check that the API key is valid and has remaining quota
  • Check that base_url is correct
Incomplete token statistics:
  • Some vendors do not return complete token statistics
  • The framework tries to estimate usage, but estimates may not always be fully accurate
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("SimpleLLMFunc")
logger.setLevel(logging.DEBUG)

Summary

The interface layer of SimpleLLMFunc centralizes model access, rate limiting, and key management into a unified abstraction:
  1. LLM_Interface defines the unified interface
  2. OpenAICompatible and OpenAIResponsesCompatible provide the default implementations
  3. APIKeyPool balances load across multiple keys
  4. TokenBucket controls the request rate
This design is suitable for quick integration as well as gradual scaling to production environments.