
Conversation

AlaeMghirbi (Collaborator) commented Jan 20, 2026

This PR implements pre-flight validation for the completion route to ensure message consistency and context window compliance before reaching the LLM. By verifying role alternation and token limits locally, the system provides immediate feedback to the client and prevents "context window exceeded" errors. This approach enhances stability while saving significant GPU/CPU resources and costs by avoiding invalid inference requests.

Closes #198

Summary by CodeRabbit

  • New Features

    • Configurable LLM context size limit (default 8192 tokens).
    • Pre‑flight token‑limit validation for chat and completion endpoints; oversize requests return HTTP 413 with a clear message.
  • Improvements

    • Centralized token-counting and model-query utilities.
    • Clearer error responses when configured models aren’t available.
  • Tests

    • Added unit and API tests covering token validation, boundaries, and 413 behavior.
  • Chores

    • Updated local test orchestration and CI test invocation; minor test infra cleanup.



coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Adds a declarative llm_context.max_llm_context_size config and implements pre-flight token-limit checks for OpenAI-compatible /v1/chat/completions and /v1/completions endpoints by querying model limits (vLLM) with a config fallback; oversized requests return HTTP 413.
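For orientation, here is a minimal sketch of how such a pre-flight check could be wired into an endpoint. It reuses helper names introduced by this PR (get_max_model_tokens, validate_tokens_limit, OpenAIChatCompletionRequest), but the actual endpoint code is not shown in this conversation, so treat the wiring as an illustration rather than the PR's exact implementation.

from fastapi import APIRouter, HTTPException

router = APIRouter()

@router.post("/v1/chat/completions")
async def chat_completions(request: OpenAIChatCompletionRequest):
    # Resolve the limit from vLLM's /v1/models, falling back to the configured value.
    max_allowed = await get_max_model_tokens(request.model)
    is_valid, error = validate_tokens_limit(request, max_tokens_allowed=max_allowed)
    if not is_valid:
        # Reject before any inference work is scheduled on the GPU/CPU side.
        raise HTTPException(status_code=413, detail=error)
    # ... forward the request to the LLM backend as before ...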

Changes

• Configuration (.hydra_config/config.yaml)
  Added top-level llm_context.max_llm_context_size sourced from env max_llm_context_size, default 8192.
• API router / token validation (openrag/routers/openai.py)
  Added startup caching of model max tokens, get_max_model_tokens()/_fetch_max_model_tokens(), validate_tokens_limit() and check_tokens_limit(); integrated token checks into /v1/chat/completions and /v1/completions, raising 413 for oversized requests; imports get_num_tokens.
• Token utilities (openrag/components/utils.py, openrag/routers/utils.py)
  Added get_num_tokens() (cached token-count function) and async get_openai_models(base_url, api_key); refactored callers to use these helpers.
• Unit tests, token validation (openrag/test_token_validation.py)
  New tests for validate_tokens_limit() covering boundary, default, error-message and exception-handling scenarios with a patched token counter.
• Integration tests / CI (tests/api_tests/test_openai_compat.py, .github/workflows/api_tests.yml, tests/api_tests/api_run/*)
  Added a parameterized integration test asserting 413 when payloads exceed token limits; adjusted the workflow/test-run invocation to cd tests/api_tests && python3 -m pytest ...; added a local test runner script and cleaned comments in the docker-compose/mock files.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as /v1/chat/completions<br/>/v1/completions
    participant vLLM as vLLM /v1/models
    participant Config as Config (llm.max_tokens / llm_context.max_llm_context_size)
    participant Validator as validate_tokens_limit()
    participant TokenCounter as get_num_tokens()

    Client->>Endpoint: POST messages/prompt (+ optional model_id)
    Endpoint->>vLLM: GET /v1/models (or cached value)
    alt vLLM returns model info
        vLLM-->>Endpoint: model.max_tokens
    else vLLM unavailable or model missing
        Endpoint->>Config: read configured max tokens
        Config-->>Endpoint: fallback max_tokens
    end
    Endpoint->>Validator: validate_tokens_limit(request, max_tokens)
    Validator->>TokenCounter: count tokens (messages/prompt + params)
    TokenCounter-->>Validator: token_count
    alt token_count ≤ max_tokens
        Validator-->>Endpoint: valid
        Endpoint->>Client: proceed (200 / streamed response)
    else token_count > max_tokens
        Validator-->>Endpoint: invalid (error message)
        Endpoint-->>Client: 413 Payload Too Large (detail: exceeds maximum token limit)
    end
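As a concrete illustration of the flow above, an oversized request against a running instance should now be rejected before inference. This sketch uses httpx against the local test URL used elsewhere in this PR (http://localhost:8080); the model id and the exact detail text are placeholders, not values confirmed by the PR.

import httpx

payload = {
    "model": "test-model",  # placeholder model id
    "messages": [{"role": "user", "content": "word " * 50_000}],  # deliberately oversized
    "max_tokens": 100000,
}
resp = httpx.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.status_code)           # expected: 413 once the pre-flight check trips
print(resp.json().get("detail"))  # e.g. "Request exceeds maximum token limit. ..."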

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I count each token with a careful hop,
I ask the model where the top does stop,
If your prompt spills past the line,
I thump and return a gentle sign,
Now requests stay tidy — little hop, big bop.

🚥 Pre-merge checks | ✅ 1 | ❌ 4
❌ Failed checks (3 warnings, 1 inconclusive)
• Linked Issues check ⚠️ Warning: the PR implements token/context-window limit enforcement (#198), but the changes do not include validation for role alternation in the messages array, which is a key requirement of the linked issue. Resolution: add validation logic for message role alternation (after an optional system message: user/assistant/user/assistant/...) as specified in issue #198.
• Out of Scope Changes check ⚠️ Warning: several changes appear out of scope: refactoring the get_openai_models utility, cleanup of docker-compose comments, and removal of comments from mock_vllm.py are not directly required by issue #198's token-limit validation. Resolution: isolate token validation changes from the refactoring work; move utility changes and cleanup to separate PRs or justify their necessity for token validation.
• Docstring Coverage ⚠️ Warning: docstring coverage is 40.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
• Title check ❓ Inconclusive: the title "Fix/api token limit test" is only partially related to the changeset. The PR does address token limit validation, but the title emphasizes "test" when the primary change is token-aware pre-flight validation in the API router itself. Resolution: consider a more descriptive title such as "Implement pre-flight token validation for completion API".
✅ Passed checks (1 passed)
• Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.hydra_config/config.yaml:
- Around line 75-76: The env var name under llm_context -> max_llm_context_size
is using lowercase; update the oc.env reference to use the uppercase variable
MAX_LLM_CONTEXT_SIZE (i.e., change the expression
`${oc.decode:${oc.env:max_llm_context_size, 8192}}` to reference
`MAX_LLM_CONTEXT_SIZE`) so it matches the project's env naming convention and
prevents silent deployment failures.

In `@openrag/routers/openai.py`:
- Around line 10-13: Reformat the import block and remove trailing/blank
whitespace to satisfy Ruff (I001, W291, W293): group and order imports into
standard/third‑party/local sections and alphabetize within groups for the
imports that include Document, ChatOpenAI, AsyncOpenAI and the models.openai
imports, and remove any trailing spaces and extra blank lines in the
token-validation block (the new token validation code near the top that produced
blank-line/trailing-space warnings). Run ruff/format or a similar formatter to
apply the import grouping and whitespace fixes across the file (also address the
other reported occurrences around the token-validation area).
- Around line 157-229: The token validation currently fails open in
validate_tokens_limit by returning (True, "") on exceptions; change the except
block to fail closed by returning (False, f"Token validation error: {e}")
(include the exception text) and log the error (e.g., with logger.error or
logger.warning) so callers can reject the request (HTTP 413); refer to
validate_tokens_limit, ChatOpenAI instantiation, and the existing logger usage
to locate and update the except block.
🧹 Nitpick comments (1)
openrag/routers/openai.py (1)

115-154: Avoid per‑request /v1/models calls; reuse & close the client.

Each request constructs a new AsyncOpenAI client and calls /v1/models, which adds latency and can leak connections if the client isn’t closed. Consider a shared client + cached max_model_len (TTL) or an async with AsyncOpenAI(...) context to ensure cleanup.
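One possible shape of that suggestion, sketched under assumptions: a module-level client plus a small TTL cache keyed by model id. The helper names and the 300-second TTL are illustrative and not part of the PR; the max_model_len lookup mirrors the pattern discussed later in this review.

import time
from openai import AsyncOpenAI

_client: AsyncOpenAI | None = None
_max_len_cache: dict[str, tuple[int, float]] = {}  # model_id -> (max_model_len, fetched_at)
_TTL_SECONDS = 300

def _get_client(base_url: str, api_key: str) -> AsyncOpenAI:
    global _client
    if _client is None:
        _client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    return _client

async def get_cached_max_model_len(model_id: str, base_url: str, api_key: str, default: int) -> int:
    cached = _max_len_cache.get(model_id)
    if cached and time.monotonic() - cached[1] < _TTL_SECONDS:
        return cached[0]
    try:
        models = await _get_client(base_url, api_key).models.list()
        for m in models.data:
            if m.id == model_id:
                data = m.model_dump()
                max_len = data.get("max_model_len") or (data.get("model_extra") or {}).get("max_model_len")
                if max_len:
                    _max_len_cache[model_id] = (int(max_len), time.monotonic())
                    return int(max_len)
    except Exception:
        pass  # fall back to the configured default on any lookup failure
    return default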

Comment on lines +75 to +76
llm_context:
  max_llm_context_size: ${oc.decode:${oc.env:max_llm_context_size, 8192}}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"

echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100

echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yaml

Repository: linagora/openrag

Length of output: 176


🏁 Script executed:

#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"

echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100

echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yaml

Repository: linagora/openrag

Length of output: 10027


Use uppercase MAX_LLM_CONTEXT_SIZE for the environment variable name at line 76.

The environment variable reference uses lowercase max_llm_context_size, which breaks the uppercase convention used consistently throughout the config (e.g., LLM_SEMAPHORE, MAX_MODEL_LEN, EMBEDDER_BASE_URL, etc.). Deployments setting MAX_LLM_CONTEXT_SIZE will fail silently. Change to ${oc.decode:${oc.env:MAX_LLM_CONTEXT_SIZE, 8192}} to match the established convention.

🤖 Prompt for AI Agents
In @.hydra_config/config.yaml around lines 75 - 76, The env var name under
llm_context -> max_llm_context_size is using lowercase; update the oc.env
reference to use the uppercase variable MAX_LLM_CONTEXT_SIZE (i.e., change the
expression `${oc.decode:${oc.env:max_llm_context_size, 8192}}` to reference
`MAX_LLM_CONTEXT_SIZE`) so it matches the project's env naming convention and
prevents silent deployment failures.

Comment on lines 157 to 229
def validate_tokens_limit(
    request: OpenAIChatCompletionRequest | OpenAICompletionRequest,
    max_tokens_allowed: int | None = None,
) -> tuple[bool, str]:
    """Validate if the request respects the maximum token limit.

    Args:
        request: The OpenAI request object
        max_tokens_allowed: Maximum allowed tokens for the request.
            If None, retrieves from config.

    Returns:
        Tuple of (is_valid, error_message)
    """
    try:
        llm = ChatOpenAI(**config.llm)
        _length_function = llm.get_num_tokens

        if max_tokens_allowed is None:
            max_tokens_allowed = config.llm.get("max_tokens", 4096)

        if isinstance(request, OpenAIChatCompletionRequest):
            message_tokens = sum(
                _length_function(m.content) + 4
                for m in request.messages
            )
            requested_tokens = request.max_tokens or 1024
            total_tokens_needed = message_tokens + requested_tokens

            logger.debug(
                "Token validation for chat completion",
                message_tokens=message_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Messages: {message_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        elif isinstance(request, OpenAICompletionRequest):
            prompt_tokens = _length_function(request.prompt)
            requested_tokens = request.max_tokens or 512
            total_tokens_needed = prompt_tokens + requested_tokens

            logger.debug(
                "Token validation for completion",
                prompt_tokens=prompt_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Prompt: {prompt_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        return True, ""

    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
        return True, ""

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's confirm the file exists and check its structure
fd "openai.py" | grep -i routers

Repository: linagora/openrag

Length of output: 85


🏁 Script executed:

# Find the function and examine it
rg "def validate_tokens_limit" -A 80

Repository: linagora/openrag

Length of output: 5301


🏁 Script executed:

# Find all callers of this function
rg "validate_tokens_limit" -B 2 -A 2

Repository: linagora/openrag

Length of output: 1313


🏁 Script executed:

# Check if there are any tests for this function
fd "test.*\.py|.*test\.py" | xargs rg "validate_tokens_limit" 2>/dev/null

Repository: linagora/openrag

Length of output: 44


🏁 Script executed:

# Check the overall structure to understand context
cat -n openrag/routers/openai.py | head -250 | tail -100

Repository: linagora/openrag

Length of output: 4588


🏁 Script executed:

# Get more context around the callers to see what happens after validation
rg "validate_tokens_limit" -B 5 -A 10

Repository: linagora/openrag

Length of output: 3247


🏁 Script executed:

# Look for any error handling patterns or configuration that might justify the current behavior
rg "HTTPException|raise_for_status|4096" openrag/routers/openai.py | head -20

Repository: linagora/openrag

Length of output: 469


🏁 Script executed:

# Check if there's documentation or comments explaining the design decision
rg "fail.*closed|bypass.*validation|exception.*handling" openrag/routers/openai.py

Repository: linagora/openrag

Length of output: 42


Fail-closed when token counting fails.

On exceptions, the function currently returns (True, ""), which bypasses validation and allows requests to proceed when token counting fails (e.g., unknown model, tokenizer errors). This undermines the safeguard—change to return (False, error_message) to fail closed and let callers reject the request with HTTP 413.

Recommended fix
    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
-       return True, ""
+       return False, "Unable to validate token limit; please reduce input or retry."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change: the full validate_tokens_limit function as shown above, with only the final except block changed to fail closed:

    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
        return False, "Unable to validate token limit; please reduce input or retry."
🧰 Tools
🪛 GitHub Check: lint (3.12)

[failure] 193-193: Ruff (W293)
openrag/routers/openai.py:193:1: W293 Blank line contains whitespace


[failure] 185-185: Ruff (W293)
openrag/routers/openai.py:185:1: W293 Blank line contains whitespace


[failure] 180-180: Ruff (W291)
openrag/routers/openai.py:180:48: W291 Trailing whitespace


[failure] 177-177: Ruff (W293)
openrag/routers/openai.py:177:1: W293 Blank line contains whitespace


[failure] 174-174: Ruff (W293)
openrag/routers/openai.py:174:1: W293 Blank line contains whitespace


[failure] 167-167: Ruff (W293)
openrag/routers/openai.py:167:1: W293 Blank line contains whitespace


[failure] 162-162: Ruff (W293)
openrag/routers/openai.py:162:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@openrag/routers/openai.py` around lines 157 - 229, The token validation
currently fails open in validate_tokens_limit by returning (True, "") on
exceptions; change the except block to fail closed by returning (False, f"Token
validation error: {e}") (include the exception text) and log the error (e.g.,
with logger.error or logger.warning) so callers can reject the request (HTTP
413); refer to validate_tokens_limit, ChatOpenAI instantiation, and the existing
logger usage to locate and update the except block.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch from 37777ef to ffbfe2a on January 21, 2026 at 08:50
AlaeMghirbi force-pushed the fix/api-token-limit-test branch from ffbfe2a to 7fb03c5 on January 21, 2026 at 08:56
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 274-276: Replace the redundant ternary that sets target_model_id
using is_direct_llm_model(request); always use the underlying configured LLM
model by assigning target_model_id = config.llm.get("model") before calling
get_max_model_tokens, so token validation via
get_max_model_tokens(target_model_id) and validate_tokens_limit(request,
max_tokens_allowed=...) always checks the actual LLM limits (refer to
target_model_id, is_direct_llm_model, config.llm.get("model"),
get_max_model_tokens, validate_tokens_limit, request, model_name to locate the
code).
♻️ Duplicate comments (1)
openrag/routers/openai.py (1)

221-223: Fail-open on exception undermines the validation safeguard.

The exception handler returns (True, ""), allowing requests to proceed when token counting fails. This defeats the PR's goal of preventing invalid requests from reaching the LLM. Change to fail-closed by returning (False, error_message).

🔒 Recommended fix
     except Exception as e:
         logger.warning("Error during token validation", error=str(e))
-        return True, ""
+        return False, f"Unable to validate token limit: {e}. Please reduce input or retry."
🧹 Nitpick comments (3)
openrag/routers/openai.py (3)

119-121: Consider caching or reusing the AsyncOpenAI client.

A new AsyncOpenAI client is instantiated on every call to get_max_model_tokens. For high-frequency requests, this could add overhead. Consider initializing the client once at module level or using a cached singleton pattern.

♻️ Suggested refactor
+# At module level, after config initialization
+_openai_client: AsyncOpenAI | None = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+    return _openai_client
+

 async def get_max_model_tokens(model_id: str | None) -> int:
     ...
     try:
-        client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+        client = _get_openai_client()
         resp = await client.models.list()

168-170: Consider reusing the ChatOpenAI instance for token counting.

Similar to the AsyncOpenAI client, a new ChatOpenAI instance is created on every validation call. For token counting purposes, a single cached instance would suffice.


176-176: Document the per-message token overhead constant.

The + 4 magic number approximates OpenAI's chat format overhead (role tokens, separators). Consider extracting this to a named constant with a comment explaining its purpose.

♻️ Suggested refactor
+# Approximate token overhead per message (role, separators, formatting)
+MESSAGE_TOKEN_OVERHEAD = 4
+
 def validate_tokens_limit(...):
     ...
-            message_tokens = sum(_length_function(m.content) + 4 for m in request.messages)
+            message_tokens = sum(_length_function(m.content) + MESSAGE_TOKEN_OVERHEAD for m in request.messages)

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/api_tests/api_run/scripts/run_api_tests_local.sh`:
- Around line 45-61: Add a guaranteed cleanup trap so containers are torn down
even if pytest fails: define a cleanup function that runs the existing cleanup
steps (cd "$COMPOSE_DIR" and docker compose down -v), register it with trap for
EXIT (and optionally ERR INT TERM) near the top of the script, and remove the
explicit cleanup call at the end of the file; reference the pytest invocation
(OPENRAG_API_URL=... python3 -m pytest ...) and the existing cleanup commands so
the trap executes the same teardown when the script exits unexpectedly.
🧹 Nitpick comments (2)
.github/workflows/api_tests.yml (1)

69-74: Redundant environment variable assignment.

OPENRAG_API_URL is set both in the env: block (line 71) and inline in the command (line 74). The inline assignment shadows the block-level one, making the env: block redundant here.

♻️ Suggested fix

Either keep only the env: block:

       - name: Run API tests
         env:
           OPENRAG_API_URL: http://localhost:8080
         run: |
           cd tests/api_tests
-          OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
+          python3 -m pytest . -v --timeout=120 --tb=short

Or remove the env: block entirely:

       - name: Run API tests
-        env:
-          OPENRAG_API_URL: http://localhost:8080
         run: |
           cd tests/api_tests
           OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
tests/api_tests/api_run/scripts/run_api_tests_local.sh (1)

48-56: Consider simplifying dependency check.

The heredoc pattern works but could be simplified:

♻️ Simpler alternative
-if ! python3 - <<'PY'
-import httpx
-import pytest  # noqa: F401
-import pytest_timeout  # noqa: F401
-PY
-then
+if ! python3 -c "import httpx, pytest, pytest_timeout" 2>/dev/null; then
   echo "Installing pytest + httpx + pytest-timeout..."
   pip install pytest httpx pytest-timeout
 fi

Comment on lines +45 to +61
echo "[3/4] Running API tests"
cd "$ROOT_DIR"
# Ensure test deps are present (mirror CI)
if ! python3 - <<'PY'
import httpx
import pytest # noqa: F401
import pytest_timeout # noqa: F401
PY
then
echo "Installing pytest + httpx + pytest-timeout..."
pip install pytest httpx pytest-timeout
fi
OPENRAG_API_URL=http://localhost:8080 python3 -m pytest ../ -v --timeout=120 --tb=short

echo "[4/4] Cleaning up"
cd "$COMPOSE_DIR"
docker compose down -v

⚠️ Potential issue | 🟡 Minor

Missing cleanup on test failure leaves containers running.

With set -e enabled, if pytest fails at line 57, the script exits immediately and the cleanup (lines 59-61) never runs. This differs from the CI workflow which uses if: always() for cleanup. Local developers will have stale containers after a failed test run.

🛠️ Suggested fix using a trap
 #!/usr/bin/env bash
 set -euo pipefail

 # Local replica of .github/workflows/api_tests.yml
 # Builds/starts the stack (mock-vllm included), waits for health, runs tests, then cleans up.

 ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
 COMPOSE_DIR="$ROOT_DIR"

 cd "$COMPOSE_DIR"

+cleanup() {
+  echo "Cleaning up..."
+  cd "$COMPOSE_DIR"
+  docker compose down -v
+}
+trap cleanup EXIT
+
 echo "[1/4] Building and starting services"
 docker compose build
 docker compose up -d

Then remove the explicit cleanup call at the end:

-echo "[4/4] Cleaning up"
-cd "$COMPOSE_DIR"
-docker compose down -v
-
 echo "Done."
🤖 Prompt for AI Agents
In `@tests/api_tests/api_run/scripts/run_api_tests_local.sh` around lines 45 - 61,
Add a guaranteed cleanup trap so containers are torn down even if pytest fails:
define a cleanup function that runs the existing cleanup steps (cd
"$COMPOSE_DIR" and docker compose down -v), register it with trap for EXIT (and
optionally ERR INT TERM) near the top of the script, and remove the explicit
cleanup call at the end of the file; reference the pytest invocation
(OPENRAG_API_URL=... python3 -m pytest ...) and the existing cleanup commands so
the trap executes the same teardown when the script exits unexpectedly.

                _length_function(m.content) + 4
                for m in request.messages
            )
            requested_tokens = request.max_tokens or 1024
Collaborator

What is this 1024 value?

Collaborator Author

1024 is the default value defined in the OpenAIChatCompletionRequest model here:

max_tokens: int | None = Field(1024)
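For context, the relevant part of the request model presumably looks like the following; only the max_tokens default is confirmed in this thread, the other fields are illustrative.

from pydantic import BaseModel, Field

class OpenAIChatCompletionRequest(BaseModel):
    model: str | None = None
    messages: list[dict]                  # simplified; the real model uses a typed message class
    max_tokens: int | None = Field(1024)  # the 1024 default discussed here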

Collaborator

I feel like this should either be in the config, or moved as a global constant. What do you think @Ahmath-Gadji ?
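A minimal sketch of the "global constant" option being discussed; the constant names are illustrative and not part of the PR.

# Defaults mirrored from the request models (Field(1024) for chat, 512 for plain completions).
DEFAULT_CHAT_MAX_TOKENS = 1024
DEFAULT_COMPLETION_MAX_TOKENS = 512

# In validate_tokens_limit:
#     requested_tokens = request.max_tokens or DEFAULT_CHAT_MAX_TOKENS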

        if max_tokens_allowed is None:
            max_tokens_allowed = config.llm.get("max_tokens", 4096)

        if isinstance(request, OpenAIChatCompletionRequest):
Collaborator

Why do you need to treat OpenAIChatCompletionRequest and OpenAICompletionRequest differently?

Collaborator Author

Yes, it's different: for OpenAIChatCompletionRequest we need to sum all message tokens and add 4 per message for chat formatting overhead, whereas for OpenAICompletionRequest we only need to count the tokens in the prompt.

Collaborator

@Ahmath-Gadji is it useful to support the /completions route? I feel like everything now uses /chat/completions.

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@openrag/components/utils.py`:
- Line 88: Remove the whitespace-only blank line in the utils module that is
triggering Ruff W293: edit the openrag.components.utils file and delete the
trailing whitespace-only line (ensure no trailing spaces remain on that line),
then save so the file contains no blank line with spaces; you can confirm by
running ruff or trimming trailing whitespace in your editor.

In `@openrag/routers/openai.py`:
- Around line 120-140: Move the call to get_openai_models(...) into the try
block so network/HTTP errors are caught and the fallback default_limit is
returned on failure; inside the try, call openai_models = await
get_openai_models(...) and iterate over openai_models.data (not openai_models)
when locating the model (the lookup using current_model.id should remain but
operate on items in .data), then proceed to model = model.model_dump() if
hasattr(model, 'model_dump') else model.dict(), extract max_model_len from
model.get("max_model_len") or model.get("model_extra", {}).get("max_model_len"),
log and return int(max_len) or return default_limit in the except path.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)

79-90: Cache the token counter to avoid per-call ChatOpenAI instantiation.

get_num_tokens() creates a new client every call; format_context can be hot. Cache the length function (or the client) once to reduce overhead and connection churn.

♻️ Suggested change
+from functools import lru_cache
@@
-def get_num_tokens():
+@lru_cache(maxsize=1)
+def get_num_tokens():
     llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function
+    return llm.get_num_tokens


llm = ChatOpenAI(**config.llm)
_length_function = llm.get_num_tokens


⚠️ Potential issue | 🟡 Minor

Remove whitespace-only line to satisfy Ruff W293.

CI is failing on this blank-line whitespace.

🧹 Minimal fix
-    
+
🧰 Tools
🪛 GitHub Actions: Linting

[error] 88-88: W293: Blank line contains whitespace.

🪛 GitHub Check: lint (3.12)

[failure] 88-88: Ruff (W293)
openrag/components/utils.py:88:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@openrag/components/utils.py` at line 88, Remove the whitespace-only blank
line in the utils module that is triggering Ruff W293: edit the
openrag.components.utils file and delete the trailing whitespace-only line
(ensure no trailing spaces remain on that line), then save so the file contains
no blank line with spaces; you can confirm by running ruff or trimming trailing
whitespace in your editor.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch 2 times, most recently from de4489a to 141f1e3 on January 27, 2026 at 14:56
coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openrag/components/utils.py (1)

1-145: Pipeline failure: Ruff format check would reformat this file.

The CI is failing due to formatting issues. Run ruff format openrag/components/utils.py to fix.

openrag/routers/openai.py (1)

1-440: Pipeline failure: Ruff format check would reformat this file.

CI is failing due to formatting issues. Run ruff format openrag/routers/openai.py to fix.

🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 161-164: The token counting code in the
OpenAIChatCompletionRequest branch can pass None to _length_function because
OpenAIMessage.content may be None; update the computation of message_tokens in
openai.py to guard against None (e.g., use (m.content or "") or skip None
messages) so _length_function always receives a str; adjust the expression that
sums message_tokens (referencing OpenAIChatCompletionRequest, OpenAIMessage, and
_length_function) and keep requested_tokens/total_tokens_needed logic unchanged.

In `@openrag/routers/utils.py`:
- Around line 238-242: The file has only one blank line between top-level
functions which violates PEP8/Ruff; add a second blank line after the end of the
async function get_openai_models so there are two blank lines separating it from
the next top-level function check_llm_model_availability, then run `ruff format
openrag/routers/utils.py` to auto-apply formatting fixes.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)

79-82: Consider caching the ChatOpenAI instance to avoid repeated instantiation.

get_num_tokens() creates a new ChatOpenAI instance on every call. Since this function is now called from format_context (and also used in validate_tokens_limit in openrag/routers/openai.py), each token counting operation instantiates a new client. This adds unnecessary overhead, especially for batch operations.

♻️ Suggested improvement using module-level caching
+_llm_instance = None
+
+def get_num_tokens():
+    global _llm_instance
+    if _llm_instance is None:
+        _llm_instance = ChatOpenAI(**config.llm)
+    return _llm_instance.get_num_tokens
-def get_num_tokens():
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function

Alternatively, consider using @functools.lru_cache if config is immutable during runtime.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch from 141f1e3 to cdd459a on January 27, 2026 at 15:07
paultranvan and others added 2 commits January 30, 2026 15:29
Lazy-init a module-level _cached_length_function so ChatOpenAI is only
instantiated once instead of on every call.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Return models_response.data (list) instead of SyncPage, and remove
the now-redundant .data access in check_llm_model_availability.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

"max_tokens": 100000,
}
response = api_client.post("/v1/chat/completions", json=payload)
assert response.status_code in [413, 404]
Collaborator

Why would the status code be 404? Asserting 413 is the expected result, right?

"prompt": very_long_prompt,
"max_tokens": 100000,
}
response = api_client.post("/v1/completions", json=payload)
Collaborator

I guess this is exactly the same test with just a path difference; it should be refactored.
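A minimal sketch of the parametrized version this thread converges on (a later commit mentions deduplication via pytest.mark.parametrize); the payload contents and fixture details are assumptions, not the PR's exact test.

import pytest

@pytest.mark.parametrize(
    "path, payload_key",
    [
        ("/v1/chat/completions", "messages"),
        ("/v1/completions", "prompt"),
    ],
)
def test_oversized_request_returns_413(api_client, path, payload_key):
    very_long_text = "word " * 50_000  # well past an 8192-token context window
    payload = {"model": "test-model", "max_tokens": 100000}
    if payload_key == "messages":
        payload["messages"] = [{"role": "user", "content": very_long_text}]
    else:
        payload["prompt"] = very_long_text
    response = api_client.post(path, json=payload)
    assert response.status_code == 413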

paultranvan and others added 3 commits January 30, 2026 16:41
Remove the None default and the dead fallback to 4096, since all
callers already pass the value from get_max_model_tokens().

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Log a warning but return True so the request proceeds normally when
token validation itself fails (e.g. tokenizer error), avoiding
blocking clients on temporary failures.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add openrag/test_token_validation.py with 10 unit tests covering:
  boundary conditions (under/exact/over) for both chat and completion,
  default max_tokens fallback, error message content, and graceful
  exception handling (skip validation on tokenizer failure).
- Fix integration tests: strict 413 assertion (was [413, 404]),
  deduplicate via pytest.mark.parametrize.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@openrag/routers/utils.py`:
- Around line 239-242: In get_openai_models, the AsyncOpenAI client is created
but never closed; change the function to use the async context manager form
(async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:) so the
client is deterministically cleaned up, then call await client.models.list()
inside that block and return models_response.data; this ensures AsyncOpenAI is
properly closed after use.
🧹 Nitpick comments (1)
openrag/test_token_validation.py (1)

14-25: Move Ray init/shutdown into a session fixture to avoid global side effects and resource leaks.

Initializing Ray at module import time leaves resources active for the entire test session with no cleanup mechanism. Using a session-scoped autouse fixture with explicit ray.shutdown() ensures proper resource management and follows Ray's recommended pytest patterns.

♻️ Suggested refactor
-# Prevent Ray from scanning the working directory (which may contain
-# permission-restricted folders like db/) and stub RagPipeline to
-# avoid heavy initialization during test collection.
-import ray  # noqa: E402
-
-if not ray.is_initialized():
-    ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
-
-if "components.pipeline" not in sys.modules:
-    _stub = ModuleType("components.pipeline")
-    _stub.RagPipeline = MagicMock()
-    sys.modules["components.pipeline"] = _stub
+import ray  # noqa: E402
+
+@pytest.fixture(scope="session", autouse=True)
+def _init_ray_and_stub_pipeline():
+    """Initialize Ray and stub RagPipeline before test session."""
+    if not ray.is_initialized():
+        ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
+    
+    if "components.pipeline" not in sys.modules:
+        _stub = ModuleType("components.pipeline")
+        _stub.RagPipeline = MagicMock()
+        sys.modules["components.pipeline"] = _stub
+    
+    yield
+    
+    if ray.is_initialized():
+        ray.shutdown()

Comment on lines +239 to +242
async def get_openai_models(base_url: str, api_key: str):
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    models_response = await client.models.list()
    return models_response.data

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

AsyncOpenAI Python client async context manager and close() cleanup requirements

💡 Result:

AsyncOpenAI does support an async context manager, and using it is the simplest way to ensure the underlying HTTP client/session is closed:

from openai import AsyncOpenAI

async with AsyncOpenAI(api_key=...) as client:
    resp = await client.responses.create(model="...", input="hi")

This pattern is shown in the official openai-python README (e.g., when using the aiohttp backend) and is intended to close resources when you exit the block. [1]

Do you have to call close()?

  • The library says it will close underlying HTTP connections when the client is garbage collected, and you can also close the client manually (or via a context manager) for deterministic cleanup. [1]
  • For the async client specifically, the close call is awaitable (i.e., await client.close()), per the client-class docs. [2]

Related gotcha: streaming responses must be closed

If you use .with_streaming_response..., you should use its context manager so the response is reliably closed:

with client.chat.completions.with_streaming_response.create(...) as response:
    ...

The README explicitly notes the context manager is required for reliable closing. [1]

Sources: [1] openai/openai-python README (Async usage / aiohttp example; “Managing HTTP resources”; streaming response closing note). [2] DeepWiki “Core Client Classes” (AsyncOpenAI: await client.close()).


Use an async context manager to clean up the AsyncOpenAI client.

The client is created per call but not closed, which can leak connections in long-lived processes. While the library will eventually close connections during garbage collection, deterministic cleanup via the async context manager is the recommended approach.

🔧 Suggested fix
 async def get_openai_models(base_url: str, api_key: str):
-    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
-    models_response = await client.models.list()
-    return models_response.data
+    async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:
+        models_response = await client.models.list()
+        return models_response.data
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def get_openai_models(base_url: str, api_key: str):
    async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:
        models_response = await client.models.list()
        return models_response.data
🤖 Prompt for AI Agents
In `@openrag/routers/utils.py` around lines 239 - 242, In get_openai_models, the
AsyncOpenAI client is created but never closed; change the function to use the
async context manager form (async with AsyncOpenAI(base_url=base_url,
api_key=api_key) as client:) so the client is deterministically cleaned up, then
call await client.models.list() inside that block and return
models_response.data; this ensures AsyncOpenAI is properly closed after use.
