Fix/api token limit test #220
base: dev
Conversation
📝 Walkthrough
Sequence Diagram
sequenceDiagram
participant Client
participant Endpoint as /v1/chat/completions<br/>/v1/completions
participant vLLM as vLLM /v1/models
participant Config as Config (llm.max_tokens / llm_context.max_llm_context_size)
participant Validator as validate_tokens_limit()
participant TokenCounter as get_num_tokens()
Client->>Endpoint: POST messages/prompt (+ optional model_id)
Endpoint->>vLLM: GET /v1/models (or cached value)
alt vLLM returns model info
vLLM-->>Endpoint: model.max_tokens
else vLLM unavailable or model missing
Endpoint->>Config: read configured max tokens
Config-->>Endpoint: fallback max_tokens
end
Endpoint->>Validator: validate_tokens_limit(request, max_tokens)
Validator->>TokenCounter: count tokens (messages/prompt + params)
TokenCounter-->>Validator: token_count
alt token_count ≤ max_tokens
Validator-->>Endpoint: valid
Endpoint->>Client: proceed (200 / streamed response)
else token_count > max_tokens
Validator-->>Endpoint: invalid (error message)
Endpoint-->>Client: 413 Payload Too Large (detail: exceeds maximum token limit)
end
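A minimal sketch of the endpoint side of this flow, assuming FastAPI and the helper names shown in the diagram (validate_tokens_limit, get_max_model_tokens); the handler signature and import paths are assumptions, not the PR's verbatim code:

```python
from fastapi import APIRouter, HTTPException

from models.openai import OpenAIChatCompletionRequest  # import path assumed
from routers.openai import get_max_model_tokens, validate_tokens_limit  # assumed

router = APIRouter()


@router.post("/v1/chat/completions")
async def chat_completions(request: OpenAIChatCompletionRequest):
    # Resolve the effective limit: vLLM's /v1/models metadata when reachable,
    # otherwise the configured fallback handled inside get_max_model_tokens().
    max_tokens_allowed = await get_max_model_tokens(request.model)

    is_valid, error_message = validate_tokens_limit(
        request, max_tokens_allowed=max_tokens_allowed
    )
    if not is_valid:
        # Reject before the request ever reaches the LLM backend.
        raise HTTPException(status_code=413, detail=error_message)

    # ... hand the validated request to the existing completion logic ...
```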
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 4 failed (3 warnings, 1 inconclusive)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.hydra_config/config.yaml:
- Around line 75-76: The env var name under llm_context -> max_llm_context_size
is using lowercase; update the oc.env reference to use the uppercase variable
MAX_LLM_CONTEXT_SIZE (i.e., change the expression
`${oc.decode:${oc.env:max_llm_context_size, 8192}}` to reference
`MAX_LLM_CONTEXT_SIZE`) so it matches the project's env naming convention and
prevents silent deployment failures.
In `@openrag/routers/openai.py`:
- Around line 10-13: Reformat the import block and remove trailing/blank
whitespace to satisfy Ruff (I001, W291, W293): group and order imports into
standard/third‑party/local sections and alphabetize within groups for the
imports that include Document, ChatOpenAI, AsyncOpenAI and the models.openai
imports, and remove any trailing spaces and extra blank lines in the
token-validation block (the new token validation code near the top that produced
blank-line/trailing-space warnings). Run ruff/format or a similar formatter to
apply the import grouping and whitespace fixes across the file (also address the
other reported occurrences around the token-validation area).
- Around line 157-229: The token validation currently fails open in
validate_tokens_limit by returning (True, "") on exceptions; change the except
block to fail closed by returning (False, f"Token validation error: {e}")
(include the exception text) and log the error (e.g., with logger.error or
logger.warning) so callers can reject the request (HTTP 413); refer to
validate_tokens_limit, ChatOpenAI instantiation, and the existing logger usage
to locate and update the except block.
🧹 Nitpick comments (1)
openrag/routers/openai.py (1)
115-154: Avoid per-request /v1/models calls; reuse & close the client.
Each request constructs a new AsyncOpenAI client and calls /v1/models, which adds latency and can leak connections if the client isn't closed. Consider a shared client + cached max_model_len (TTL) or an async with AsyncOpenAI(...) context to ensure cleanup.
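A possible shape for that reuse, sketched with a module-level client plus a short-lived cache of the model's maximum length (the helper names, TTL, and default limit are illustrative assumptions, not the PR's code):

```python
import time

from openai import AsyncOpenAI

_client: AsyncOpenAI | None = None
_max_len_cache: dict[str, tuple[int, float]] = {}  # model_id -> (max_len, fetched_at)
_CACHE_TTL_SECONDS = 300.0


def _get_client(base_url: str, api_key: str) -> AsyncOpenAI:
    # Reuse one client instead of constructing a new one per request.
    global _client
    if _client is None:
        _client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    return _client


async def get_max_model_tokens_cached(
    model_id: str, base_url: str, api_key: str, default_limit: int = 8192
) -> int:
    cached = _max_len_cache.get(model_id)
    if cached and time.monotonic() - cached[1] < _CACHE_TTL_SECONDS:
        return cached[0]
    try:
        models = await _get_client(base_url, api_key).models.list()
        for model in models.data:
            if model.id == model_id:
                # vLLM exposes max_model_len as an extra field on /v1/models.
                max_len = model.model_dump().get("max_model_len") or default_limit
                _max_len_cache[model_id] = (int(max_len), time.monotonic())
                return int(max_len)
    except Exception:
        pass  # fall through to the configured default on any lookup failure
    return default_limit
```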
llm_context:
  max_llm_context_size: ${oc.decode:${oc.env:max_llm_context_size, 8192}}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"
echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100
echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yamlRepository: linagora/openrag
Length of output: 176
🏁 Script executed:
#!/bin/bash
# Search for both lowercase and uppercase variants
echo "=== Searching for max_llm_context_size and MAX_LLM_CONTEXT_SIZE ==="
rg -n "max_llm_context_size|MAX_LLM_CONTEXT_SIZE"
echo ""
echo "=== Checking the config file for env var patterns ==="
cat -n .hydra_config/config.yaml | head -100
echo ""
echo "=== Searching for oc.env patterns to see convention ==="
rg -n "oc\.env:" .hydra_config/config.yamlRepository: linagora/openrag
Length of output: 10027
Use uppercase MAX_LLM_CONTEXT_SIZE for the environment variable name at line 76.
The environment variable reference uses lowercase max_llm_context_size, which breaks the uppercase convention used consistently throughout the config (e.g., LLM_SEMAPHORE, MAX_MODEL_LEN, EMBEDDER_BASE_URL, etc.). Deployments setting MAX_LLM_CONTEXT_SIZE will fail silently. Change to ${oc.decode:${oc.env:MAX_LLM_CONTEXT_SIZE, 8192}} to match the established convention.
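As a quick sanity check of the convention, OmegaConf's built-in oc.env / oc.decode resolvers can be exercised in isolation (a throwaway snippet, assuming omegaconf is installed; not part of the PR):

```python
import os

from omegaconf import OmegaConf

os.environ["MAX_LLM_CONTEXT_SIZE"] = "16384"
cfg = OmegaConf.create(
    "max_llm_context_size: ${oc.decode:${oc.env:MAX_LLM_CONTEXT_SIZE, 8192}}"
)
# Resolution happens on access; oc.decode turns the env string into an int,
# and the 8192 default applies when the variable is unset.
print(cfg.max_llm_context_size)  # 16384
```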
def validate_tokens_limit(
    request: OpenAIChatCompletionRequest | OpenAICompletionRequest,
    max_tokens_allowed: int | None = None,
) -> tuple[bool, str]:
    """Validate if the request respects the maximum token limit.

    Args:
        request: The OpenAI request object
        max_tokens_allowed: Maximum allowed tokens for the request.
            If None, retrieves from config.

    Returns:
        Tuple of (is_valid, error_message)
    """
    try:
        llm = ChatOpenAI(**config.llm)
        _length_function = llm.get_num_tokens

        if max_tokens_allowed is None:
            max_tokens_allowed = config.llm.get("max_tokens", 4096)

        if isinstance(request, OpenAIChatCompletionRequest):
            message_tokens = sum(
                _length_function(m.content) + 4
                for m in request.messages
            )
            requested_tokens = request.max_tokens or 1024
            total_tokens_needed = message_tokens + requested_tokens

            logger.debug(
                "Token validation for chat completion",
                message_tokens=message_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Messages: {message_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        elif isinstance(request, OpenAICompletionRequest):
            prompt_tokens = _length_function(request.prompt)
            requested_tokens = request.max_tokens or 512
            total_tokens_needed = prompt_tokens + requested_tokens

            logger.debug(
                "Token validation for completion",
                prompt_tokens=prompt_tokens,
                requested_tokens=requested_tokens,
                total_tokens=total_tokens_needed,
                max_allowed=max_tokens_allowed,
            )

            if total_tokens_needed > max_tokens_allowed:
                return False, (
                    f"Request exceeds maximum token limit. "
                    f"Prompt: {prompt_tokens} tokens + "
                    f"Requested output: {requested_tokens} tokens = "
                    f"{total_tokens_needed} tokens. "
                    f"Maximum allowed: {max_tokens_allowed} tokens."
                )

        return True, ""

    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
        return True, ""
🧩 Analysis chain
🏁 Script executed:
# First, let's confirm the file exists and check its structure
fd "openai.py" | grep -i routersRepository: linagora/openrag
Length of output: 85
🏁 Script executed:
# Find the function and examine it
rg "def validate_tokens_limit" -A 80Repository: linagora/openrag
Length of output: 5301
🏁 Script executed:
# Find all callers of this function
rg "validate_tokens_limit" -B 2 -A 2Repository: linagora/openrag
Length of output: 1313
🏁 Script executed:
# Check if there are any tests for this function
fd "test.*\.py|.*test\.py" | xargs rg "validate_tokens_limit" 2>/dev/nullRepository: linagora/openrag
Length of output: 44
🏁 Script executed:
# Check the overall structure to understand context
cat -n openrag/routers/openai.py | head -250 | tail -100
Repository: linagora/openrag
Length of output: 4588
🏁 Script executed:
# Get more context around the callers to see what happens after validation
rg "validate_tokens_limit" -B 5 -A 10Repository: linagora/openrag
Length of output: 3247
🏁 Script executed:
# Look for any error handling patterns or configuration that might justify the current behavior
rg "HTTPException|raise_for_status|4096" openrag/routers/openai.py | head -20Repository: linagora/openrag
Length of output: 469
🏁 Script executed:
# Check if there's documentation or comments explaining the design decision
rg "fail.*closed|bypass.*validation|exception.*handling" openrag/routers/openai.pyRepository: linagora/openrag
Length of output: 42
Fail-closed when token counting fails.
On exceptions, the function currently returns (True, ""), which bypasses validation and allows requests to proceed when token counting fails (e.g., unknown model, tokenizer errors). This undermines the safeguard—change to return (False, error_message) to fail closed and let callers reject the request with HTTP 413.
Recommended fix
except Exception as e:
logger.warning("Error during token validation", error=str(e))
- return True, ""
+ return False, "Unable to validate token limit; please reduce input or retry."
🧰 Tools
🪛 GitHub Check: lint (3.12)
[failure] 193-193: Ruff (W293)
openrag/routers/openai.py:193:1: W293 Blank line contains whitespace
[failure] 185-185: Ruff (W293)
openrag/routers/openai.py:185:1: W293 Blank line contains whitespace
[failure] 180-180: Ruff (W291)
openrag/routers/openai.py:180:48: W291 Trailing whitespace
[failure] 177-177: Ruff (W293)
openrag/routers/openai.py:177:1: W293 Blank line contains whitespace
[failure] 174-174: Ruff (W293)
openrag/routers/openai.py:174:1: W293 Blank line contains whitespace
[failure] 167-167: Ruff (W293)
openrag/routers/openai.py:167:1: W293 Blank line contains whitespace
[failure] 162-162: Ruff (W293)
openrag/routers/openai.py:162:1: W293 Blank line contains whitespace
37777ef to ffbfe2a (Compare)
ffbfe2a to 7fb03c5 (Compare)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 274-276: Replace the redundant ternary that sets target_model_id
using is_direct_llm_model(request); always use the underlying configured LLM
model by assigning target_model_id = config.llm.get("model") before calling
get_max_model_tokens, so token validation via
get_max_model_tokens(target_model_id) and validate_tokens_limit(request,
max_tokens_allowed=...) always checks the actual LLM limits (refer to
target_model_id, is_direct_llm_model, config.llm.get("model"),
get_max_model_tokens, validate_tokens_limit, request, model_name to locate the
code).
♻️ Duplicate comments (1)
openrag/routers/openai.py (1)
221-223: Fail-open on exception undermines the validation safeguard.
The exception handler returns (True, ""), allowing requests to proceed when token counting fails. This defeats the PR's goal of preventing invalid requests from reaching the LLM. Change to fail-closed by returning (False, error_message).
🔒 Recommended fix
    except Exception as e:
        logger.warning("Error during token validation", error=str(e))
-       return True, ""
+       return False, f"Unable to validate token limit: {e}. Please reduce input or retry."
🧹 Nitpick comments (3)
openrag/routers/openai.py (3)
119-121: Consider caching or reusing the AsyncOpenAI client.
A new AsyncOpenAI client is instantiated on every call to get_max_model_tokens. For high-frequency requests, this could add overhead. Consider initializing the client once at module level or using a cached singleton pattern.
♻️ Suggested refactor
+# At module level, after config initialization
+_openai_client: AsyncOpenAI | None = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+    return _openai_client
+
 async def get_max_model_tokens(model_id: str | None) -> int:
     ...
     try:
-        client = AsyncOpenAI(base_url=config.llm["base_url"], api_key=config.llm["api_key"])
+        client = _get_openai_client()
         resp = await client.models.list()
168-170: Consider reusing the ChatOpenAI instance for token counting.
Similar to the AsyncOpenAI client, a new ChatOpenAI instance is created on every validation call. For token counting purposes, a single cached instance would suffice.
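A sketch of what that reuse could look like, lazily caching the counter function itself (config stands for the module's existing Hydra config object; the ChatOpenAI import path and helper name are assumptions):

```python
from functools import lru_cache

from langchain_openai import ChatOpenAI  # import path assumed


@lru_cache(maxsize=1)
def _cached_token_counter():
    # Built once on first use; later calls reuse the same ChatOpenAI instance.
    return ChatOpenAI(**config.llm).get_num_tokens


# Inside validate_tokens_limit:
#     _length_function = _cached_token_counter()
```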
176-176: Document the per-message token overhead constant.
The + 4 magic number approximates OpenAI's chat format overhead (role tokens, separators). Consider extracting this to a named constant with a comment explaining its purpose.
♻️ Suggested refactor
+# Approximate token overhead per message (role, separators, formatting)
+MESSAGE_TOKEN_OVERHEAD = 4
+
 def validate_tokens_limit(...):
     ...
-    message_tokens = sum(_length_function(m.content) + 4 for m in request.messages)
+    message_tokens = sum(_length_function(m.content) + MESSAGE_TOKEN_OVERHEAD for m in request.messages)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/api_tests/api_run/scripts/run_api_tests_local.sh`:
- Around line 45-61: Add a guaranteed cleanup trap so containers are torn down
even if pytest fails: define a cleanup function that runs the existing cleanup
steps (cd "$COMPOSE_DIR" and docker compose down -v), register it with trap for
EXIT (and optionally ERR INT TERM) near the top of the script, and remove the
explicit cleanup call at the end of the file; reference the pytest invocation
(OPENRAG_API_URL=... python3 -m pytest ...) and the existing cleanup commands so
the trap executes the same teardown when the script exits unexpectedly.
🧹 Nitpick comments (2)
.github/workflows/api_tests.yml (1)
69-74: Redundant environment variable assignment.
OPENRAG_API_URL is set both in the env: block (line 71) and inline in the command (line 74). The inline assignment shadows the block-level one, making the env: block redundant here.
♻️ Suggested fix
Either keep only the env: block:
 - name: Run API tests
   env:
     OPENRAG_API_URL: http://localhost:8080
   run: |
     cd tests/api_tests
-    OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
+    python3 -m pytest . -v --timeout=120 --tb=short
Or remove the env: block entirely:
 - name: Run API tests
-  env:
-    OPENRAG_API_URL: http://localhost:8080
   run: |
     cd tests/api_tests
     OPENRAG_API_URL=http://localhost:8080 python3 -m pytest . -v --timeout=120 --tb=short
tests/api_tests/api_run/scripts/run_api_tests_local.sh (1)
48-56: Consider simplifying dependency check.
The heredoc pattern works but could be simplified:
♻️ Simpler alternative
-if ! python3 - <<'PY'
-import httpx
-import pytest  # noqa: F401
-import pytest_timeout  # noqa: F401
-PY
-then
+if ! python3 -c "import httpx, pytest, pytest_timeout" 2>/dev/null; then
   echo "Installing pytest + httpx + pytest-timeout..."
   pip install pytest httpx pytest-timeout
 fi
| echo "[3/4] Running API tests" | ||
| cd "$ROOT_DIR" | ||
| # Ensure test deps are present (mirror CI) | ||
| if ! python3 - <<'PY' | ||
| import httpx | ||
| import pytest # noqa: F401 | ||
| import pytest_timeout # noqa: F401 | ||
| PY | ||
| then | ||
| echo "Installing pytest + httpx + pytest-timeout..." | ||
| pip install pytest httpx pytest-timeout | ||
| fi | ||
| OPENRAG_API_URL=http://localhost:8080 python3 -m pytest ../ -v --timeout=120 --tb=short | ||
|
|
||
| echo "[4/4] Cleaning up" | ||
| cd "$COMPOSE_DIR" | ||
| docker compose down -v |
Missing cleanup on test failure leaves containers running.
With set -e enabled, if pytest fails at line 57, the script exits immediately and the cleanup (lines 59-61) never runs. This differs from the CI workflow which uses if: always() for cleanup. Local developers will have stale containers after a failed test run.
🛠️ Suggested fix using a trap
#!/usr/bin/env bash
set -euo pipefail
# Local replica of .github/workflows/api_tests.yml
# Builds/starts the stack (mock-vllm included), waits for health, runs tests, then cleans up.
ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
COMPOSE_DIR="$ROOT_DIR"
cd "$COMPOSE_DIR"
+cleanup() {
+ echo "Cleaning up..."
+ cd "$COMPOSE_DIR"
+ docker compose down -v
+}
+trap cleanup EXIT
+
echo "[1/4] Building and starting services"
docker compose build
docker compose up -d
Then remove the explicit cleanup call at the end:
-echo "[4/4] Cleaning up"
-cd "$COMPOSE_DIR"
-docker compose down -v
-
echo "Done."🤖 Prompt for AI Agents
    _length_function(m.content) + 4
    for m in request.messages
)
requested_tokens = request.max_tokens or 1024
What is this 1024 value?
1024 is the default value defined in the OpenAIChatCompletionRequest model here:
openrag/openrag/models/openai.py, line 22 (0b5f84c):
max_tokens: int | None = Field(1024)
I feel like this should either be in the config, or moved as a global constant. What do you think @Ahmath-Gadji?
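For illustration, the global-constant option could look roughly like this (the constant name and its placement are assumptions, not the PR's code):

```python
from pydantic import BaseModel, Field

# Single source of truth for the default completion budget (name is illustrative).
DEFAULT_MAX_COMPLETION_TOKENS = 1024


class OpenAIChatCompletionRequest(BaseModel):
    # ... other request fields omitted ...
    max_tokens: int | None = Field(DEFAULT_MAX_COMPLETION_TOKENS)


# In validate_tokens_limit, the literal 1024 would then become:
# requested_tokens = request.max_tokens or DEFAULT_MAX_COMPLETION_TOKENS
```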
if max_tokens_allowed is None:
    max_tokens_allowed = config.llm.get("max_tokens", 4096)

if isinstance(request, OpenAIChatCompletionRequest):
Why do you need different treatment for OpenAIChatCompletionRequest and OpenAICompletionRequest?
Yes, it's different. For OpenAIChatCompletionRequest we need to sum up all message tokens and add 4 to every message for message overhead. For OpenAICompletionRequest we just need to calculate the number of tokens in the prompt.
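Spelled out as code, the two paths differ only in how the input side is totalled (a sketch; the tokenizer callable and the overhead constant mirror the PR's _length_function and the + 4 discussed above, but the helper itself is illustrative):

```python
MESSAGE_TOKEN_OVERHEAD = 4  # approximate per-message formatting overhead


def count_input_tokens(request, length_function) -> int:
    # Chat requests: sum every message body plus a fixed per-message overhead.
    if hasattr(request, "messages"):
        return sum(
            length_function(m.content or "") + MESSAGE_TOKEN_OVERHEAD
            for m in request.messages
        )
    # Plain completion requests: only the prompt contributes.
    return length_function(request.prompt)
```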
@Ahmath-Gadji is it useful to support the /completions route? I feel like everything now uses /chat/completions?
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@openrag/components/utils.py`:
- Line 88: Remove the whitespace-only blank line in the utils module that is
triggering Ruff W293: edit the openrag.components.utils file and delete the
trailing whitespace-only line (ensure no trailing spaces remain on that line),
then save so the file contains no blank line with spaces; you can confirm by
running ruff or trimming trailing whitespace in your editor.
In `@openrag/routers/openai.py`:
- Around line 120-140: Move the call to get_openai_models(...) into the try
block so network/HTTP errors are caught and the fallback default_limit is
returned on failure; inside the try, call openai_models = await
get_openai_models(...) and iterate over openai_models.data (not openai_models)
when locating the model (the lookup using current_model.id should remain but
operate on items in .data), then proceed to model = model.model_dump() if
hasattr(model, 'model_dump') else model.dict(), extract max_model_len from
model.get("max_model_len") or model.get("model_extra", {}).get("max_model_len"),
log and return int(max_len) or return default_limit in the except path.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)
79-90: Cache the token counter to avoid per-call ChatOpenAI instantiation.
get_num_tokens() creates a new client every call; format_context can be hot. Cache the length function (or the client) once to reduce overhead and connection churn.
♻️ Suggested change
+from functools import lru_cache
@@
-def get_num_tokens():
+@lru_cache(maxsize=1)
+def get_num_tokens():
     llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function
+    return llm.get_num_tokens
openrag/components/utils.py (Outdated)

llm = ChatOpenAI(**config.llm)
_length_function = llm.get_num_tokens
Remove whitespace-only line to satisfy Ruff W293.
CI is failing on this blank-line whitespace.
🧹 Minimal fix: delete the stray whitespace so the blank line at line 88 is truly empty.
🧰 Tools
🪛 GitHub Actions: Linting
[error] 88-88: W293: Blank line contains whitespace.
🪛 GitHub Check: lint (3.12)
[failure] 88-88: Ruff (W293)
openrag/components/utils.py:88:1: W293 Blank line contains whitespace
de4489a to 141f1e3 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
openrag/components/utils.py (1)
1-145: Pipeline failure: Ruff format check would reformat this file.
The CI is failing due to formatting issues. Run ruff format openrag/components/utils.py to fix.
openrag/routers/openai.py (1)
1-440: Pipeline failure: Ruff format check would reformat this file.
CI is failing due to formatting issues. Run ruff format openrag/routers/openai.py to fix.
🤖 Fix all issues with AI agents
In `@openrag/routers/openai.py`:
- Around line 161-164: The token counting code in the
OpenAIChatCompletionRequest branch can pass None to _length_function because
OpenAIMessage.content may be None; update the computation of message_tokens in
openai.py to guard against None (e.g., use (m.content or "") or skip None
messages) so _length_function always receives a str; adjust the expression that
sums message_tokens (referencing OpenAIChatCompletionRequest, OpenAIMessage, and
_length_function) and keep requested_tokens/total_tokens_needed logic unchanged.
In `@openrag/routers/utils.py`:
- Around line 238-242: The file has only one blank line between top-level
functions which violates PEP8/Ruff; add a second blank line after the end of the
async function get_openai_models so there are two blank lines separating it from
the next top-level function check_llm_model_availability, then run `ruff format
openrag/routers/utils.py` to auto-apply formatting fixes.
🧹 Nitpick comments (1)
openrag/components/utils.py (1)
79-82: Consider caching the ChatOpenAI instance to avoid repeated instantiation.
get_num_tokens() creates a new ChatOpenAI instance on every call. Since this function is now called from format_context (and also used in validate_tokens_limit in openrag/routers/openai.py), each token counting operation instantiates a new client. This adds unnecessary overhead, especially for batch operations.
♻️ Suggested improvement using module-level caching
+_llm_instance = None
+
+def get_num_tokens():
+    global _llm_instance
+    if _llm_instance is None:
+        _llm_instance = ChatOpenAI(**config.llm)
+    return _llm_instance.get_num_tokens
-def get_num_tokens():
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
-    return _length_function
Alternatively, consider using @functools.lru_cache if config is immutable during runtime.
141f1e3 to cdd459a (Compare)
Lazy-init a module-level _cached_length_function so ChatOpenAI is only instantiated once instead of on every call. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Return models_response.data (list) instead of SyncPage, and remove the now-redundant .data access in check_llm_model_availability. Co-Authored-By: Claude Opus 4.5 <[email protected]>
| "max_tokens": 100000, | ||
| } | ||
| response = api_client.post("/v1/chat/completions", json=payload) | ||
| assert response.status_code in [413, 404] |
Why would the status code be 404? Asserting on 413 is the expected result, right?
| "prompt": very_long_prompt, | ||
| "max_tokens": 100000, | ||
| } | ||
| response = api_client.post("/v1/completions", json=payload) |
I guess this is exactly the same test with just a path difference; it should be refactored.
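One way that refactor could look, folding the two near-identical tests into a single parametrized one (the api_client fixture and payload fields follow the snippets quoted above; the exact payloads in the repo may differ):

```python
import pytest

LONG_TEXT = "word " * 20000  # deliberately oversized input


@pytest.mark.parametrize(
    "path, payload",
    [
        ("/v1/chat/completions", {"messages": [{"role": "user", "content": LONG_TEXT}], "max_tokens": 100000}),
        ("/v1/completions", {"prompt": LONG_TEXT, "max_tokens": 100000}),
    ],
)
def test_token_limit_exceeded(api_client, path, payload):
    response = api_client.post(path, json=payload)
    assert response.status_code == 413
```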
Remove the None default and the dead fallback to 4096, since all callers already pass the value from get_max_model_tokens(). Co-Authored-By: Claude Opus 4.5 <[email protected]>
Log a warning but return True so the request proceeds normally when token validation itself fails (e.g. tokenizer error), avoiding blocking clients on temporary failures. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add openrag/test_token_validation.py with 10 unit tests covering: boundary conditions (under/exact/over) for both chat and completion, default max_tokens fallback, error message content, and graceful exception handling (skip validation on tokenizer failure). - Fix integration tests: strict 413 assertion (was [413, 404]), deduplicate via pytest.mark.parametrize. Co-Authored-By: Claude Opus 4.5 <[email protected]>
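A condensed sketch of one of the boundary tests that commit describes, with the token counter monkeypatched so the test needs no model server (import paths, field names, and the patch target are assumptions based on the code shown in this PR):

```python
from unittest.mock import patch

from models.openai import OpenAIChatCompletionRequest  # import path assumed
from routers.openai import validate_tokens_limit       # import path assumed


def _fake_counter(text: str) -> int:
    # Deterministic stand-in: one token per whitespace-separated word.
    return len(text.split())


def test_chat_request_over_limit_is_rejected():
    request = OpenAIChatCompletionRequest(
        model="dummy-model",
        messages=[{"role": "user", "content": "one two three four five"}],
        max_tokens=100,  # requested output budget pushes the total over the cap
    )
    # Assumes the refactored code obtains its counter via get_num_tokens().
    with patch("routers.openai.get_num_tokens", return_value=_fake_counter):
        is_valid, error = validate_tokens_limit(request, max_tokens_allowed=50)
    assert is_valid is False
    assert "maximum token limit" in error
```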
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@openrag/routers/utils.py`:
- Around line 239-242: In get_openai_models, the AsyncOpenAI client is created
but never closed; change the function to use the async context manager form
(async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:) so the
client is deterministically cleaned up, then call await client.models.list()
inside that block and return models_response.data; this ensures AsyncOpenAI is
properly closed after use.
🧹 Nitpick comments (1)
openrag/test_token_validation.py (1)
14-25: Move Ray init/shutdown into a session fixture to avoid global side effects and resource leaks.
Initializing Ray at module import time leaves resources active for the entire test session with no cleanup mechanism. Using a session-scoped autouse fixture with explicit ray.shutdown() ensures proper resource management and follows Ray's recommended pytest patterns.
♻️ Suggested refactor
-# Prevent Ray from scanning the working directory (which may contain
-# permission-restricted folders like db/) and stub RagPipeline to
-# avoid heavy initialization during test collection.
-import ray  # noqa: E402
-
-if not ray.is_initialized():
-    ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
-
-if "components.pipeline" not in sys.modules:
-    _stub = ModuleType("components.pipeline")
-    _stub.RagPipeline = MagicMock()
-    sys.modules["components.pipeline"] = _stub
+import ray  # noqa: E402
+
+@pytest.fixture(scope="session", autouse=True)
+def _init_ray_and_stub_pipeline():
+    """Initialize Ray and stub RagPipeline before test session."""
+    if not ray.is_initialized():
+        ray.init(runtime_env={"working_dir": None}, ignore_reinit_error=True)
+
+    if "components.pipeline" not in sys.modules:
+        _stub = ModuleType("components.pipeline")
+        _stub.RagPipeline = MagicMock()
+        sys.modules["components.pipeline"] = _stub
+
+    yield
+
+    if ray.is_initialized():
+        ray.shutdown()
async def get_openai_models(base_url: str, api_key: str):
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    models_response = await client.models.list()
    return models_response.data
🧩 Analysis chain
🌐 Web query:
AsyncOpenAI Python client async context manager and close() cleanup requirements
💡 Result:
AsyncOpenAI does support an async context manager, and using it is the simplest way to ensure the underlying HTTP client/session is closed:
from openai import AsyncOpenAI
async with AsyncOpenAI(api_key=...) as client:
    resp = await client.responses.create(model="...", input="hi")
This pattern is shown in the official openai-python README (e.g., when using the aiohttp backend) and is intended to close resources when you exit the block. [1]
Do you have to call close()?
- The library says it will close underlying HTTP connections when the client is garbage collected, and you can also close the client manually (or via a context manager) for deterministic cleanup. [1]
- For the async client specifically, the close call is awaitable (i.e.,
await client.close()), per the client-class docs. [2]
Related gotcha: streaming responses must be closed
If you use .with_streaming_response..., you should use its context manager so the response is reliably closed:
with client.chat.completions.with_streaming_response.create(...) as response:
    ...
The README explicitly notes the context manager is required for reliable closing. [1]
Sources: [1] openai/openai-python README (Async usage / aiohttp example; “Managing HTTP resources”; streaming response closing note). [2] DeepWiki “Core Client Classes” (AsyncOpenAI: await client.close()).
Use an async context manager to clean up the AsyncOpenAI client.
The client is created per call but not closed, which can leak connections in long-lived processes. While the library will eventually close connections during garbage collection, deterministic cleanup via the async context manager is the recommended approach.
🔧 Suggested fix
async def get_openai_models(base_url: str, api_key: str):
- client = AsyncOpenAI(base_url=base_url, api_key=api_key)
- models_response = await client.models.list()
- return models_response.data
+ async with AsyncOpenAI(base_url=base_url, api_key=api_key) as client:
+ models_response = await client.models.list()
+ return models_response.data
This PR implements pre-flight validation for the completion route to ensure message consistency and context window compliance before reaching the LLM. By verifying role alternation and token limits locally, the system provides immediate feedback to the client and prevents "context window exceeded" errors. This approach enhances stability while saving significant GPU/CPU resources and costs by avoiding invalid inference requests.
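From a client's point of view the change is simply an early 413 instead of a backend-side failure; for example (URL, model name, and payload are illustrative):

```python
import httpx

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "word " * 50000}],  # far too long
    "max_tokens": 4096,
}
response = httpx.post("http://localhost:8080/v1/chat/completions", json=payload)
if response.status_code == 413:
    # The request never reached the LLM; the detail explains the budget overrun.
    print(response.json()["detail"])
```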
Closes #198
Summary by CodeRabbit
New Features
Improvements
Tests
Chores