
Conversation

AlaeMghirbi (Collaborator) commented Jan 20, 2026

This PR implements pre-flight validation for the completion route to ensure message consistency and context window compliance before reaching the LLM. By verifying role alternation and token limits locally, the system provides immediate feedback to the client and prevents "context window exceeded" errors. This approach enhances stability while saving significant GPU/CPU resources and costs by avoiding invalid inference requests.

Closes #198

Summary by CodeRabbit

  • New Features

    • Configurable LLM context size limit (default 8192 tokens).
    • Pre‑flight token‑limit validation for chat and completion endpoints; oversize requests return HTTP 413 with a clear message.
  • Improvements

    • Centralized token-counting and model-query utilities.
    • Clearer error responses when configured models aren’t available.
  • Tests

    • Added unit and API tests covering token validation, boundaries, and 413 behavior.
  • Chores

    • Updated local test orchestration and CI test invocation; minor test infra cleanup.



coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Adds a declarative llm_context.max_llm_context_size config and implements pre-flight token-limit checks for OpenAI-compatible /v1/chat/completions and /v1/completions endpoints by querying model limits (vLLM) with a config fallback; oversized requests return HTTP 413.
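For orientation, here is a minimal sketch of how such a pre-flight check could be wired into an endpoint. It reuses helper names introduced by this PR (get_max_model_tokens, validate_tokens_limit, OpenAIChatCompletionRequest), but the actual endpoint code is not shown in this conversation, so treat the wiring as an illustration rather than the PR's exact implementation.

from fastapi import APIRouter, HTTPException

router = APIRouter()

@router.post("/v1/chat/completions")
async def chat_completions(request: OpenAIChatCompletionRequest):
    # Resolve the limit from vLLM's /v1/models, falling back to the configured value.
    max_allowed = await get_max_model_tokens(request.model)
    is_valid, error = validate_tokens_limit(request, max_tokens_allowed=max_allowed)
    if not is_valid:
        # Reject before any inference work is scheduled on the GPU/CPU side.
        raise HTTPException(status_code=413, detail=error)
    # ... forward the request to the LLM backend as before ...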

Changes

• Configuration (.hydra_config/config.yaml)
  Added top-level llm_context.max_llm_context_size sourced from env max_llm_context_size, default 8192.
• API router / token validation (openrag/routers/openai.py)
  Added startup caching of model max tokens, get_max_model_tokens()/_fetch_max_model_tokens(), validate_tokens_limit() and check_tokens_limit(); integrated token checks into /v1/chat/completions and /v1/completions, raising 413 for oversized requests; imports get_num_tokens.
• Token utilities (openrag/components/utils.py, openrag/routers/utils.py)
  Added get_num_tokens() (cached token-count function) and async get_openai_models(base_url, api_key); refactored callers to use these helpers.
• Unit tests, token validation (openrag/test_token_validation.py)
  New tests for validate_tokens_limit() covering boundary, default, error-message and exception-handling scenarios with a patched token counter.
• Integration tests / CI (tests/api_tests/test_openai_compat.py, .github/workflows/api_tests.yml, tests/api_tests/api_run/*)
  Added a parameterized integration test asserting 413 when payloads exceed token limits; adjusted the workflow/test-run invocation to cd tests/api_tests && python3 -m pytest ...; added a local test runner script and cleaned comments in the docker-compose/mock files.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as /v1/chat/completions<br/>/v1/completions
    participant vLLM as vLLM /v1/models
    participant Config as Config (llm.max_tokens / llm_context.max_llm_context_size)
    participant Validator as validate_tokens_limit()
    participant TokenCounter as get_num_tokens()

    Client->>Endpoint: POST messages/prompt (+ optional model_id)
    Endpoint->>vLLM: GET /v1/models (or cached value)
    alt vLLM returns model info
        vLLM-->>Endpoint: model.max_tokens
    else vLLM unavailable or model missing
        Endpoint->>Config: read configured max tokens
        Config-->>Endpoint: fallback max_tokens
    end
    Endpoint->>Validator: validate_tokens_limit(request, max_tokens)
    Validator->>TokenCounter: count tokens (messages/prompt + params)
    TokenCounter-->>Validator: token_count
    alt token_count ≤ max_tokens
        Validator-->>Endpoint: valid
        Endpoint->>Client: proceed (200 / streamed response)
    else token_count > max_tokens
        Validator-->>Endpoint: invalid (error message)
        Endpoint-->>Client: 413 Payload Too Large (detail: exceeds maximum token limit)
    end
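As a concrete illustration of the flow above, an oversized request against a running instance should now be rejected before inference. This sketch uses httpx against the local test URL used elsewhere in this PR (http://localhost:8080); the model id and the exact detail text are placeholders, not values confirmed by the PR.

import httpx

payload = {
    "model": "test-model",  # placeholder model id
    "messages": [{"role": "user", "content": "word " * 50_000}],  # deliberately oversized
    "max_tokens": 100000,
}
resp = httpx.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.status_code)           # expected: 413 once the pre-flight check trips
print(resp.json().get("detail"))  # e.g. "Request exceeds maximum token limit. ..."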

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I count each token with a careful hop,
I ask the model where the top does stop,
If your prompt spills past the line,
I thump and return a gentle sign,
Now requests stay tidy — little hop, big bop.

🚥 Pre-merge checks | ✅ 1 | ❌ 4
❌ Failed checks (3 warnings, 1 inconclusive)
• Linked Issues check ⚠️ Warning: the PR implements token/context-window limit enforcement (#198), but the changes do not include validation for role alternation in the messages array, which is a key requirement of the linked issue. Resolution: add validation logic for message role alternation (after an optional system message: user/assistant/user/assistant/...) as specified in issue #198.
• Out of Scope Changes check ⚠️ Warning: several changes appear out of scope: refactoring the get_openai_models utility, cleanup of docker-compose comments, and removal of comments from mock_vllm.py are not directly required by issue #198's token-limit validation. Resolution: isolate token validation changes from the refactoring work; move utility changes and cleanup to separate PRs or justify their necessity for token validation.
• Docstring Coverage ⚠️ Warning: docstring coverage is 40.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
• Title check ❓ Inconclusive: the title "Fix/api token limit test" is only partially related to the changeset. The PR does address token limit validation, but the title emphasizes "test" when the primary change is token-aware pre-flight validation in the API router itself. Resolution: consider a more descriptive title such as "Implement pre-flight token validation for completion API".
✅ Passed checks (1 passed)
• Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.hydra_config/config.yaml:
- Around line 75-76: The env var name under llm_context -> max_llm_context_size
is using lowercase; update the oc.env reference to use the uppercase variable
MAX_LLM_CONTEXT_SIZE (i.e., change the expression
`${oc.decode:${oc.env:max_llm_context_size, 8192}}` to reference
`MAX_LLM_CONTEXT_SIZE`) so it matches the project's env naming convention and
prevents silent deployment failures.

In `@openrag/routers/openai.py`:
- Around line 10-13: Reformat the import block and remove trailing/blank
whitespace to satisfy Ruff (I001, W291, W293): group and order imports into
standard/third‑party/local sections and alphabetize within groups for the
imports that include Document, ChatOpenAI, AsyncOpenAI and the models.openai
imports, and remove any trailing spaces and extra blank lines in the
token-validation block (the new token validation code near the top that produced
blank-line/trailing-space warnings). Run ruff/format or a similar formatter to
apply the import grouping and whitespace fixes across the file (also address the
other reported occurrences around the token-validation area).
- Around line 157-229: The token validation currently fails open in
validate_tokens_limit by returning (True, "") on exceptions; change the except
block to fail closed by returning (False, f"Token validation error: {e}")
(include the exception text) and log the error (e.g., with logger.error or
logger.warning) so callers can reject the request (HTTP 413); refer to
validate_tokens_limit, ChatOpenAI instantiation, and the existing logger usage
to locate and update the except block.
🧹 Nitpick comments (1)
openrag/routers/openai.py (1)

115-154: Avoid per‑request /v1/models calls; reuse & close the client.

Each request constructs a new AsyncOpenAI client and calls /v1/models, which adds latency and can leak connections if the client isn’t closed. Consider a shared client + cached max_model_len (TTL) or an async with AsyncOpenAI(...) context to ensure cleanup.
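One possible shape of that suggestion, sketched under assumptions: a module-level client plus a small TTL cache keyed by model id. The helper names and the 300-second TTL are illustrative and not part of the PR; the max_model_len lookup mirrors the pattern discussed later in this review.

import time
from openai import AsyncOpenAI

_client: AsyncOpenAI | None = None
_max_len_cache: dict[str, tuple[int, float]] = {}  # model_id -> (max_model_len, fetched_at)
_TTL_SECONDS = 300

def _get_client(base_url: str, api_key: str) -> AsyncOpenAI:
    global _client
    if _client is None:
        _client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    return _client

async def get_cached_max_model_len(model_id: str, base_url: str, api_key: str, default: int) -> int:
    cached = _max_len_cache.get(model_id)
    if cached and time.monotonic() - cached[1] < _TTL_SECONDS:
        return cached[0]
    try:
        models = await _get_client(base_url, api_key).models.list()
        for m in models.data:
            if m.id == model_id:
                data = m.model_dump()
                max_len = data.get("max_model_len") or (data.get("model_extra") or {}).get("max_model_len")
                if max_len:
                    _max_len_cache[model_id] = (int(max_len), time.monotonic())
                    return int(max_len)
    except Exception:
        pass  # fall back to the configured default on any lookup failure
    return default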

Comment on lines +75 to +76
llm_context:
  max_llm_context_size: ${oc.decode:${oc.env:max_llm_context_size, 8192}}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"

echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100

echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yaml

Repository: linagora/openrag

Length of output: 176


🏁 Script executed:

#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"

echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100

echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yaml

Repository: linagora/openrag

Length of output: 10027


Use uppercase MAX_LLM_CONTEXT_SIZE for the environment variable name at line 76.

The environment variable reference uses lowercase max_llm_context_size, which breaks the uppercase convention used consistently throughout the config (e.g., LLM_SEMAPHORE, MAX_MODEL_LEN, EMBEDDER_BASE_URL, etc.). Deployments setting MAX_LLM_CONTEXT_SIZE will fail silently. Change to ${oc.decode:${oc.env:MAX_LLM_CONTEXT_SIZE, 8192}} to match the established convention.

🤖 Prompt for AI Agents
In @.hydra_config/config.yaml around lines 75 - 76, The env var name under
llm_context -> max_llm_context_size is using lowercase; update the oc.env
reference to use the uppercase variable MAX_LLM_CONTEXT_SIZE (i.e., change the
expression `${oc.decode:${oc.env:max_llm_context_size, 8192}}` to reference
`MAX_LLM_CONTEXT_SIZE`) so it matches the project's env naming convention and
prevents silent deployment failures.

Comment on lines 157 to 229
def validate_tokens_limit(
    request: OpenAIChatCompletionRequest | OpenAICompletionRequest,
    max_tokens_allowed: int | None = None,
) -> tuple[bool, str]:
    """Validate if the request respects the maximum token limit.

    Args:
        request: The OpenAI request object
        max_tokens_allowed: Maximum allowed tokens for the request.
            If None, retrieves from config.

    Returns:
        Tuple of (is_valid, error_message)
    """
    try:
        llm = ChatOpenAI(**config.llm)
        _length_function = llm.get_num_tokens

        if max_tokens_allowed is None:
            max_tokens_allowed = config.llm.get("max_tokens", 4096)

        if isinstance(request, OpenAIChatCompletionRequest):
            message_tokens = sum(
                _length_function(m.content) + 4
                for m in request.messages
            )
            requested_tokens = request.max_tokens or 1024
            total_tokens_needed = message_tokens + requested_tokens

            logger.debug(
                "Token validation for chat completion",
                message_tokens=message_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Messages: {message_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        elif isinstance(request, OpenAICompletionRequest):
            prompt_tokens = _length_function(request.prompt)
            requested_tokens = request.max_tokens or 512
            total_tokens_needed = prompt_tokens + requested_tokens

            logger.debug(
                "Token validation for completion",
                prompt_tokens=prompt_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Prompt: {prompt_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        return True, ""

    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
        return True, ""

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's confirm the file exists and check its structure
fd "openai.py" | grep -i routers

Repository: linagora/openrag

Length of output: 85


🏁 Script executed:

# Find the function and examine it
rg "def validate_tokens_limit" -A 80

Repository: linagora/openrag

Length of output: 5301


🏁 Script executed:

# Find all callers of this function
rg "validate_tokens_limit" -B 2 -A 2

Repository: linagora/openrag

Length of output: 1313


🏁 Script executed:

# Check if there are any tests for this function
fd "test.*\.py|.*test\.py" | xargs rg "validate_tokens_limit" 2>/dev/null

Repository: linagora/openrag

Length of output: 44


🏁 Script executed:

# Check the overall structure to understand context
cat -n openrag/routers/openai.py | head -250 | tail -100

Repository: linagora/openrag

Length of output: 4588


🏁 Script executed:

# Get more context around the callers to see what happens after validation
rg "validate_tokens_limit" -B 5 -A 10

Repository: linagora/openrag

Length of output: 3247


🏁 Script executed:

# Look for any error handling patterns or configuration that might justify the current behavior
rg "HTTPException|raise_for_status|4096" openrag/routers/openai.py | head -20

Repository: linagora/openrag

Length of output: 469


🏁 Script executed:

# Check if there's documentation or comments explaining the design decision
rg "fail.*closed|bypass.*validation|exception.*handling" openrag/routers/openai.py

Repository: linagora/openrag

Length of output: 42


Fail-closed when token counting fails.

On exceptions, the function currently returns (True, ""), which bypasses validation and allows requests to proceed when token counting fails (e.g., unknown model, tokenizer errors). This undermines the safeguard—change to return (False, error_message) to fail closed and let callers reject the request with HTTP 413.

Recommended fix
    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
-       return True, ""
+       return False, "Unable to validate token limit; please reduce input or retry."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change: the full validate_tokens_limit function as shown above, with only the final except block changed to fail closed:

    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
        return False, "Unable to validate token limit; please reduce input or retry."
🧰 Tools
🪛 GitHub Check: lint (3.12)

[failure] 193-193: Ruff (W293)
openrag/routers/openai.py:193:1: W293 Blank line contains whitespace


[failure] 185-185: Ruff (W293)
openrag/routers/openai.py:185:1: W293 Blank line contains whitespace


[failure] 180-180: Ruff (W291)
openrag/routers/openai.py:180:48: W291 Trailing whitespace


[failure] 177-177: Ruff (W293)
openrag/routers/openai.py:177:1: W293 Blank line contains whitespace


[failure] 174-174: Ruff (W293)
openrag/routers/openai.py:174:1: W293 Blank line contains whitespace


[failure] 167-167: Ruff (W293)
openrag/routers/openai.py:167:1: W293 Blank line contains whitespace


[failure] 162-162: Ruff (W293)
openrag/routers/openai.py:162:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@openrag/routers/openai.py` around lines 157 - 229, The token validation
currently fails open in validate_tokens_limit by returning (True, "") on
exceptions; change the except block to fail closed by returning (False, f"Token
validation error: {e}") (include the exception text) and log the error (e.g.,
with logger.error or logger.warning) so callers can reject the request (HTTP
413); refer to validate_tokens_limit, ChatOpenAI instantiation, and the existing
logger usage to locate and update the except block.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch from 37777ef to ffbfe2a on January 21, 2026 at 08:50
AlaeMghirbi force-pushed the fix/api-token-limit-test branch from ffbfe2a to 7fb03c5 on January 21, 2026 at 08:56
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 274-276: Replace the redundant ternary that sets target_model_id
using is_direct_llm_model(request); always use the underlying configured LLM
model by assigning target_model_id = config.llm.get("model") before calling
get_max_model_tokens, so token validation via
get_max_model_tokens(target_model_id) and validate_tokens_limit(request,
max_tokens_allowed=...) always checks the actual LLM limits (refer to
target_model_id, is_direct_llm_model, config.llm.get("model"),
get_max_model_tokens, validate_tokens_limit, request, model_name to locate the
code).
♻️ Duplicate comments (1)
openrag/routers/openai.py (1)

221-223: Fail-open on exception undermines the validation safeguard.

The exception handler returns (True, ""), allowing requests to proceed when token counting fails. This defeats the PR's goal of preventing invalid requests from reaching the LLM. Change to fail-closed by returning (False, error_message).

🔒 Recommended fix
     except Exception as e:
         logger.warning("Error during token validation", error=str(e))
-        return True, ""
+        return False, f"Unable to validate token limit: {e}. Please reduce input or retry."
🧹 Nitpick comments (3)
openrag/routers/openai.py (3)

119-121: Consider caching or reusing the AsyncOpenAI client.

A new AsyncOpenAI client is instantiated on every call to get_max_model_tokens. For high-frequency requests, this could add overhead. Consider initializing the client once at module level or using a cached singleton pattern.

♻️ Suggested refactor
+# At module level, after config initialization
+_openai_client: AsyncOpenAI | None = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+    return _openai_client
+

 async def get_max_model_tokens(model_id: str | None) -> int:
     ...
     try:
-        client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+        client = _get_openai_client()
         resp = await client.models.list()

168-170: Consider reusing the ChatOpenAI instance for token counting.

Similar to the AsyncOpenAI client, a new ChatOpenAI instance is created on every validation call. For token counting purposes, a single cached instance would suffice.


176-176: Document the per-message token overhead constant.

The + 4 magic number approximates OpenAI's chat format overhead (role tokens, separators). Consider extracting this to a named constant with a comment explaining its purpose.

♻️ Suggested refactor
+# Approximate token overhead per message (role, separators, formatting)
+MESSAGE_TOKEN_OVERHEAD = 4
+
 def validate_tokens_limit(...):
     ...
-            message_tokens = sum(_length_function(m.content) + 4 for m in request.messages)
+            message_tokens = sum(_length_function(m.content) + MESSAGE_TOKEN_OVERHEAD for m in request.messages)

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/api_tests/api_run/scripts/run_api_tests_local.sh`:
- Around line 45-61: Add a guaranteed cleanup trap so containers are torn down
even if pytest fails: define a cleanup function that runs the existing cleanup
steps (cd "$COMPOSE_DIR" and docker compose down -v), register it with trap for
EXIT (and optionally ERR INT TERM) near the top of the script, and remove the
explicit cleanup call at the end of the file; reference the pytest invocation
(OPENRAG_API_URL=... python3 -m pytest ...) and the existing cleanup commands so
the trap executes the same teardown when the script exits unexpectedly.
🧹 Nitpick comments (2)
.github/workflows/api_tests.yml (1)

69-74: Redundant environment variable assignment.

OPENRAG_API_URL is set both in the env: block (line 71) and inline in the command (line 74). The inline assignment shadows the block-level one, making the env: block redundant here.

♻️ Suggested fix

Either keep only the env: block:

       - name: Run API tests
         env:
           OPENRAG_API_URL: http://localhost:8080
         run: |
           cd tests/api_tests
-          OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
+          python3 -m pytest . -v --timeout=120 --tb=short

Or remove the env: block entirely:

       - name: Run API tests
-        env:
-          OPENRAG_API_URL: http://localhost:8080
         run: |
           cd tests/api_tests
           OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
tests/api_tests/api_run/scripts/run_api_tests_local.sh (1)

48-56: Consider simplifying dependency check.

The heredoc pattern works but could be simplified:

♻️ Simpler alternative
-if ! python3 - <<'PY'
-import httpx
-import pytest  # noqa: F401
-import pytest_timeout  # noqa: F401
-PY
-then
+if ! python3 -c "import httpx, pytest, pytest_timeout" 2>/dev/null; then
   echo "Installing pytest + httpx + pytest-timeout..."
   pip install pytest httpx pytest-timeout
 fi

Comment on lines +45 to +61
echo "[3/4] Running API tests"
cd "$ROOT_DIR"
# Ensure test deps are present (mirror CI)
if ! python3 - <<'PY'
import httpx
import pytest # noqa: F401
import pytest_timeout # noqa: F401
PY
then
echo "Installing pytest + httpx + pytest-timeout..."
pip install pytest httpx pytest-timeout
fi
OPENRAG_API_URL=http://localhost:8080 python3 -m pytest ../ -v --timeout=120 --tb=short

echo "[4/4] Cleaning up"
cd "$COMPOSE_DIR"
docker compose down -v

⚠️ Potential issue | 🟡 Minor

Missing cleanup on test failure leaves containers running.

With set -e enabled, if pytest fails at line 57, the script exits immediately and the cleanup (lines 59-61) never runs. This differs from the CI workflow which uses if: always() for cleanup. Local developers will have stale containers after a failed test run.

🛠️ Suggested fix using a trap
 #!/usr/bin/env bash
 set -euo pipefail

 # Local replica of .github/workflows/api_tests.yml
 # Builds/starts the stack (mock-vllm included), waits for health, runs tests, then cleans up.

 ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
 COMPOSE_DIR="$ROOT_DIR"

 cd "$COMPOSE_DIR"

+cleanup() {
+  echo "Cleaning up..."
+  cd "$COMPOSE_DIR"
+  docker compose down -v
+}
+trap cleanup EXIT
+
 echo "[1/4] Building and starting services"
 docker compose build
 docker compose up -d

Then remove the explicit cleanup call at the end:

-echo "[4/4] Cleaning up"
-cd "$COMPOSE_DIR"
-docker compose down -v
-
 echo "Done."
🤖 Prompt for AI Agents
In `@tests/api_tests/api_run/scripts/run_api_tests_local.sh` around lines 45 - 61,
Add a guaranteed cleanup trap so containers are torn down even if pytest fails:
define a cleanup function that runs the existing cleanup steps (cd
"$COMPOSE_DIR" and docker compose down -v), register it with trap for EXIT (and
optionally ERR INT TERM) near the top of the script, and remove the explicit
cleanup call at the end of the file; reference the pytest invocation
(OPENRAG_API_URL=... python3 -m pytest ...) and the existing cleanup commands so
the trap executes the same teardown when the script exits unexpectedly.

                _length_function(m.content) + 4
                for m in request.messages
            )
            requested_tokens = request.max_tokens or 1024
Collaborator

What is this 1024 value?

Collaborator Author

1024 is the default value defined in the OpenAIChatCompletionRequest model here:

max_tokens: int | None = Field(1024)
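For context, the relevant part of the request model presumably looks like the following; only the max_tokens default is confirmed in this thread, the other fields are illustrative.

from pydantic import BaseModel, Field

class OpenAIChatCompletionRequest(BaseModel):
    model: str | None = None
    messages: list[dict]                  # simplified; the real model uses a typed message class
    max_tokens: int | None = Field(1024)  # the 1024 default discussed here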

Collaborator

I feel like this should either be in the config, or moved as a global constant. What do you think @Ahmath-Gadji ?
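A minimal sketch of the "global constant" option being discussed; the constant names are illustrative and not part of the PR.

# Defaults mirrored from the request models (Field(1024) for chat, 512 for plain completions).
DEFAULT_CHAT_MAX_TOKENS = 1024
DEFAULT_COMPLETION_MAX_TOKENS = 512

# In validate_tokens_limit:
#     requested_tokens = request.max_tokens or DEFAULT_CHAT_MAX_TOKENS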

        if max_tokens_allowed is None:
            max_tokens_allowed = config.llm.get("max_tokens", 4096)

        if isinstance(request, OpenAIChatCompletionRequest):
Collaborator

Why do you need to treat OpenAIChatCompletionRequest and OpenAICompletionRequest differently?

Collaborator Author

Yes, it's different: for OpenAIChatCompletionRequest we need to sum all message tokens and add 4 per message for chat formatting overhead, whereas for OpenAICompletionRequest we only need to count the tokens in the prompt.

Collaborator

@Ahmath-Gadji is it useful to support the /completions route? I feel like everything now uses /chat/completions.

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@openrag/components/utils.py`:
- Line 88: Remove the whitespace-only blank line in the utils module that is
triggering Ruff W293: edit the openrag.components.utils file and delete the
trailing whitespace-only line (ensure no trailing spaces remain on that line),
then save so the file contains no blank line with spaces; you can confirm by
running ruff or trimming trailing whitespace in your editor.

In `@openrag/routers/openai.py`:
- Around line 120-140: Move the call to get_openai_models(...) into the try
block so network/HTTP errors are caught and the fallback default_limit is
returned on failure; inside the try, call openai_models = await
get_openai_models(...) and iterate over openai_models.data (not openai_models)
when locating the model (the lookup using current_model.id should remain but
operate on items in .data), then proceed to model = model.model_dump() if
hasattr(model, 'model_dump') else model.dict(), extract max_model_len from
model.get("max_model_len") or model.get("model_extra", {}).get("max_model_len"),
log and return int(max_len) or return default_limit in the except path.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)

79-90: Cache the token counter to avoid per-call ChatOpenAI instantiation.

get_num_tokens() creates a new client every call; format_context can be hot. Cache the length function (or the client) once to reduce overhead and connection churn.

♻️ Suggested change
+from functools import lru_cache
@@
-def get_num_tokens():
+@lru_cache(maxsize=1)
+def get_num_tokens():
     llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function
+    return llm.get_num_tokens


llm = ChatOpenAI(**config.llm)
_length_function = llm.get_num_tokens


⚠️ Potential issue | 🟡 Minor

Remove whitespace-only line to satisfy Ruff W293.

CI is failing on this blank-line whitespace.

🧹 Minimal fix
-    
+
🧰 Tools
🪛 GitHub Actions: Linting

[error] 88-88: W293: Blank line contains whitespace.

🪛 GitHub Check: lint (3.12)

[failure] 88-88: Ruff (W293)
openrag/components/utils.py:88:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@openrag/components/utils.py` at line 88, Remove the whitespace-only blank
line in the utils module that is triggering Ruff W293: edit the
openrag.components.utils file and delete the trailing whitespace-only line
(ensure no trailing spaces remain on that line), then save so the file contains
no blank line with spaces; you can confirm by running ruff or trimming trailing
whitespace in your editor.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch 2 times, most recently from de4489a to 141f1e3 on January 27, 2026 at 14:56
coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openrag/components/utils.py (1)

1-145: Pipeline failure: Ruff format check would reformat this file.

The CI is failing due to formatting issues. Run ruff format openrag/components/utils.py to fix.

openrag/routers/openai.py (1)

1-440: Pipeline failure: Ruff format check would reformat this file.

CI is failing due to formatting issues. Run ruff format openrag/routers/openai.py to fix.

🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 161-164: The token counting code in the
OpenAIChatCompletionRequest branch can pass None to _length_function because
OpenAIMessage.content may be None; update the computation of message_tokens in
openai.py to guard against None (e.g., use (m.content or "") or skip None
messages) so _length_function always receives a str; adjust the expression that
sums message_tokens (referencing OpenAIChatCompletionRequest, OpenAIMessage, and
_length_function) and keep requested_tokens/total_tokens_needed logic unchanged.

In `@openrag/routers/utils.py`:
- Around line 238-242: The file has only one blank line between top-level
functions which violates PEP8/Ruff; add a second blank line after the end of the
async function get_openai_models so there are two blank lines separating it from
the next top-level function check_llm_model_availability, then run `ruff format
openrag/routers/utils.py` to auto-apply formatting fixes.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)

79-82: Consider caching the ChatOpenAI instance to avoid repeated instantiation.

get_num_tokens() creates a new ChatOpenAI instance on every call. Since this function is now called from format_context (and also used in validate_tokens_limit in openrag/routers/openai.py), each token counting operation instantiates a new client. This adds unnecessary overhead, especially for batch operations.

♻️ Suggested improvement using module-level caching
+_llm_instance = None
+
+def get_num_tokens():
+    global _llm_instance
+    if _llm_instance is None:
+        _llm_instance = ChatOpenAI(**config.llm)
+    return _llm_instance.get_num_tokens
-def get_num_tokens():
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function

Alternatively, consider using @functools.lru_cache if config is immutable during runtime.

AlaeMghirbi force-pushed the fix/api-token-limit-test branch from 141f1e3 to cdd459a on January 27, 2026 at 15:07
paultranvan and others added 2 commits January 30, 2026 15:29
Lazy-init a module-level _cached_length_function so ChatOpenAI is only
instantiated once instead of on every call.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Return models_response.data (list) instead of SyncPage, and remove
the now-redundant .data access in check_llm_model_availability.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

"max_tokens": 100000,
}
response = api_client.post("/v1/chat/completions", json=payload)
assert response.status_code in [413, 404]
Collaborator

Why would the status code be 404? Asserting 413 is the expected result, right?

"prompt": very_long_prompt,
"max_tokens": 100000,
}
response = api_client.post("/v1/completions", json=payload)
Collaborator

I guess this is exactly the same test with just a path difference; it should be refactored.
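A minimal sketch of the parametrized version this thread converges on (a later commit mentions deduplication via pytest.mark.parametrize); the payload contents and fixture details are assumptions, not the PR's exact test.

import pytest

@pytest.mark.parametrize(
    "path, payload_key",
    [
        ("/v1/chat/completions", "messages"),
        ("/v1/completions", "prompt"),
    ],
)
def test_oversized_request_returns_413(api_client, path, payload_key):
    very_long_text = "word " * 50_000  # well past an 8192-token context window
    payload = {"model": "test-model", "max_tokens": 100000}
    if payload_key == "messages":
        payload["messages"] = [{"role": "user", "content": very_long_text}]
    else:
        payload["prompt"] = very_long_text
    response = api_client.post(path, json=payload)
    assert response.status_code == 413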

paultranvan and others added 3 commits January 30, 2026 16:41
Remove the None default and the dead fallback to 4096, since all
callers already pass the value from get_max_model_tokens().

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Log a warning but return True so the request proceeds normally when
token validation itself fails (e.g. tokenizer error), avoiding
blocking clients on temporary failures.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add openrag/test_token_validation.py with 10 unit tests covering:
  boundary conditions (under/exact/over) for both chat and completion,
  default max_tokens fallback, error message content, and graceful
  exception handling (skip validation on tokenizer failure).
- Fix integration tests: strict 413 assertion (was [413, 404]),
  deduplicate via pytest.mark.parametrize.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@openrag/routers/utils.py`:
- Around line 239-242: In get_openai_models, the AsyncOpenAI client is created
but never closed; change the function to use the async context manager form
(async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:) so the
client is deterministically cleaned up, then call await client.models.list()
inside that block and return models_response.data; this ensures AsyncOpenAI is
properly closed after use.
🧹 Nitpick comments (1)
openrag/test_token_validation.py (1)

14-25: Move Ray init/shutdown into a session fixture to avoid global side effects and resource leaks.

Initializing Ray at module import time leaves resources active for the entire test session with no cleanup mechanism. Using a session-scoped autouse fixture with explicit ray.shutdown() ensures proper resource management and follows Ray's recommended pytest patterns.

♻️ Suggested refactor
-# Prevent Ray from scanning the working directory (which may contain
-# permission-restricted folders like db/) and stub RagPipeline to
-# avoid heavy initialization during test collection.
-import ray  # noqa: E402
-
-if not ray.is_initialized():
-    ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
-
-if "components.pipeline" not in sys.modules:
-    _stub = ModuleType("components.pipeline")
-    _stub.RagPipeline = MagicMock()
-    sys.modules["components.pipeline"] = _stub
+import ray  # noqa: E402
+
+@pytest.fixture(scope="session", autouse=True)
+def _init_ray_and_stub_pipeline():
+    """Initialize Ray and stub RagPipeline before test session."""
+    if not ray.is_initialized():
+        ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
+    
+    if "components.pipeline" not in sys.modules:
+        _stub = ModuleType("components.pipeline")
+        _stub.RagPipeline = MagicMock()
+        sys.modules["components.pipeline"] = _stub
+    
+    yield
+    
+    if ray.is_initialized():
+        ray.shutdown()

Comment on lines +239 to +242
async def get_openai_models(base_url: str, api_key: str):
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    models_response = await client.models.list()
    return models_response.data

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

AsyncOpenAI Python client async context manager and close() cleanup requirements

💡 Result:

AsyncOpenAI does support an async context manager, and using it is the simplest way to ensure the underlying HTTP client/session is closed:

from openai import AsyncOpenAI

async with AsyncOpenAI(api_key=...) as client:
    resp = await client.responses.create(model="...", input="hi")

This pattern is shown in the official openai-python README (e.g., when using the aiohttp backend) and is intended to close resources when you exit the block. [1]

Do you have to call close()?

  • The library says it will close underlying HTTP connections when the client is garbage collected, and you can also close the client manually (or via a context manager) for deterministic cleanup. [1]
  • For the async client specifically, the close call is awaitable (i.e., await client.close()), per the client-class docs. [2]

Related gotcha: streaming responses must be closed

If you use .with_streaming_response..., you should use its context manager so the response is reliably closed:

with client.chat.completions.with_streaming_response.create(...) as response:
    ...

The README explicitly notes the context manager is required for reliable closing. [1]

Sources: [1] openai/openai-python README (Async usage / aiohttp example; “Managing HTTP resources”; streaming response closing note). [2] DeepWiki “Core Client Classes” (AsyncOpenAI: await client.close()).


Use an async context manager to clean up the AsyncOpenAI client.

The client is created per call but not closed, which can leak connections in long-lived processes. While the library will eventually close connections during garbage collection, deterministic cleanup via the async context manager is the recommended approach.

🔧 Suggested fix
 async def get_openai_models(base_url: str, api_key: str):
-    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
-    models_response = await client.models.list()
-    return models_response.data
+    async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:
+        models_response = await client.models.list()
+        return models_response.data
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def get_openai_models(base_url: str, api_key: str):
    async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:
        models_response = await client.models.list()
        return models_response.data
🤖 Prompt for AI Agents
In `@openrag/routers/utils.py` around lines 239 - 242, In get_openai_models, the
AsyncOpenAI client is created but never closed; change the function to use the
async context manager form (async with AsyncOpenAI(base_url=base_url,
api_key=api_key) as client:) so the client is deterministically cleaned up, then
call await client.models.list() inside that block and return
models_response.data; this ensures AsyncOpenAI is properly closed after use.
