
Conversation

@avirajsingh7 avirajsingh7 (Collaborator) commented Dec 10, 2025

Summary

This change refactors the evaluation run process to use a stored configuration instead of an inline configuration dictionary. It adds config_id, config_version, and model fields to the evaluation run table, simplifying the evaluation flow and improving data integrity.
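A rough sketch of how the request payload changes under this refactor; only config_id, config_version, and dataset_id come from this PR's description, every other field name and value below is illustrative:

# Before: the caller embedded the evaluation config inline (illustrative shape).
old_payload = {
    "dataset_id": "6a1d3c2e-...",                 # hypothetical dataset UUID
    "config": {"model": "gpt-4o", "temperature": 0.2},
}

# After: the caller references a stored config; the backend resolves it at runtime
# and records config_id, config_version, and the resolved model on the run.
new_payload = {
    "dataset_id": "6a1d3c2e-...",
    "config_id": "9b1e7f40-...",                  # UUID of the stored config
    "config_version": 3,                          # integer version, >= 1
}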

Checklist

Before submitting a pull request, please ensure you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Summary by CodeRabbit

  • New Features

    • Evaluation API now accepts stored configuration references (config ID + version) and records the resolved model for each run.
    • Added "openai" as a recognized provider; embedding batches now use text-embedding-3-large.
  • Bug Fixes

    • Improved config resolution and validation with clearer error responses for missing/invalid configs.
    • Provider validation tightened to allowed OpenAI providers.
  • Chores

    • Database migration to store config_id and config_version on evaluation runs.
  • Tests

    • Tests updated to use config references and cover config-not-found/access scenarios.



coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Evaluation endpoints, CRUD, models, and tests now use stored configs referenced by config_id + config_version; configs are resolved at runtime to extract the LLM model. EvaluationRun stores config_id, config_version, and model. DB migration replaces embedded JSON config with a FK to config. Batch/embedding/provider flows updated.

Changes

Cohort / File(s) and summaries:

  • Evaluation Route & Public API (backend/app/api/routes/evaluation.py): Endpoint signature changed to accept config_id: UUID and config_version: int; resolves the stored config via ConfigVersionCrud/resolve_config_blob; enforces provider == OPENAI; returns validation/resolution errors; logging updated.
  • Evaluation CRUD & Model Resolution (backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/__init__.py): create_evaluation_run now records config_id/config_version (embedded config removed). Added and exported resolve_model_from_config(session, eval_run) to fetch and validate the model from the stored config (sketched after this list).
  • Evaluation Data Model & Migration (backend/app/models/evaluation.py, backend/app/alembic/versions/041_add_config_in_evals_run_table.py): Replaced the config JSONB column with config_id: UUID and config_version: int, and added model: str to EvaluationRun; the migration adds an FK to config.id and drops the legacy config column (downgrade restores the JSONB column).
  • Batch Construction & LLM Config Types (backend/app/crud/evaluations/batch.py): config parameter now typed as KaapiLLMParams; JSONL bodies explicitly map model, input, and optional instructions, temperature, reasoning, tools; batch start serializes config via model_dump(exclude_none=True).
  • Processing & Embeddings (backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py): Processing resolves the model via resolve_model_from_config and passes it downstream; embeddings now hard-code "text-embedding-3-large" (dynamic model selection removed).
  • LLM Provider Registry (backend/app/services/llm/providers/registry.py): Added LLMProvider.OPENAI = "openai" and mapped it to OpenAIProvider; get_llm_provider accepts both OPENAI and OPENAI_NATIVE.
  • Tests & Test Helpers (backend/app/tests/api/routes/test_evaluation.py, backend/app/tests/crud/evaluations/test_processing.py, backend/app/tests/...): Tests updated to create and use test configs (create_test_config) and pass config_id/config_version; fixtures and assertions adjusted for config-not-found/config-not-accessible semantics; many calls updated to the new create_evaluation_run signature.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API as Evaluation API
  participant Config as ConfigVersionCrud
  participant Eval as Evaluation CRUD
  participant Batch as Batch Service
  participant LLM as OpenAI Provider

  Client->>API: POST /evaluate (config_id, config_version, dataset_id...)
  API->>Config: resolve_config_blob(config_id, config_version)
  Config-->>API: config blob (KaapiLLMParams) or error
  API->>API: validate provider == "openai"
  API->>Eval: create_evaluation_run(config_id, config_version, metadata)
  Eval-->>API: EvaluationRun (id, config_id, config_version)
  API->>Batch: start_evaluation_batch(eval_run, config.completion.params)
  Batch->>LLM: enqueue/call provider
  Batch-->>Eval: update run status / attach batch ids
  API-->>Client: 200 Created (EvaluationRun) or 4xx on error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #498: Overlaps on LLM config types and provider handling (Kaapi params, transform/resolve functions, provider registry changes).
  • #405: Broad evaluation refactor touching evaluation models and workflows; likely overlaps with config/model resolution edits.
  • #488: Related test-suite enhancements and helpers that intersect with evaluation testing and fixtures.

Suggested reviewers

  • kartpop
  • AkhileshNegi

Poem

I hopped through configs, one by one, 🐇
Replaced raw dicts with ids and versioned fun,
I sniffed the model, resolved its name,
Queued batches, saved runs — migration game,
A rabbit celebrates — tests green, deploy begun! 🎉

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.42%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Evaluation: Use Config Management' accurately describes the main change: refactoring evaluation runs to use stored configuration management instead of inline config dictionaries.




📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d7db45 and afec624.

📒 Files selected for processing (1)
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/tests/**/*.py : Use factory pattern for test fixtures in `backend/app/tests/`

Applied to files:

  • backend/app/tests/crud/evaluations/test_processing.py
🧬 Code graph analysis (1)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (2)
  • create_test_evaluation_dataset (341-376)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (7)
backend/app/tests/crud/evaluations/test_processing.py (7)

16-16: Factory import aligns with test data pattern.

Nice update to pull in create_test_config for config setup.


263-288: Config-backed eval run setup looks good.

Using create_test_config and passing config_id/config_version keeps the fixture aligned with the new stored-config flow.


408-415: Config creation wired correctly.

Creating a test config and passing config_id/config_version keeps the test consistent with the new API.


459-484: Embedding batch fixture aligned with config storage.

The new config setup and config_id/config_version use are consistent with the refactor.


609-634: Completed-batch test updated appropriately.

Config creation and config_id/config_version usage look correct here.


672-698: Failed-batch test updated appropriately.

Config creation and config_id/config_version usage look correct here.


774-799: Pending-evaluations test updated appropriately.

Config creation and config_id/config_version usage look correct here.




@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added the enhancement (New feature or request) and ready-for-review labels Dec 10, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.


codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 55.00000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines (patch coverage / lines missing):
  • backend/app/crud/evaluations/core.py: 35.71%, 9 missing ⚠️
  • backend/app/api/routes/evaluation.py: 41.66%, 7 missing ⚠️
  • backend/app/crud/evaluations/embeddings.py: 0.00%, 1 missing ⚠️
  • backend/app/crud/evaluations/processing.py: 50.00%, 1 missing ⚠️


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.
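For reference, a minimal sketch of just the nullable public fields; the base class, defaults, and the rest of the model are elided and may differ in the actual code:

from uuid import UUID

from sqlmodel import SQLModel


class EvaluationRunPublic(SQLModel):
    # Nullable so existing rows created before this migration still validate.
    config_id: UUID | None = None
    config_version: int | None = None
    model: str | None = None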

@Prajna1999 Prajna1999 (Collaborator) left a comment

lgtm

@avirajsingh7 avirajsingh7 (Collaborator, Author) commented:

Hold merge until the frontend is ready.

@Prajna1999 Prajna1999 (Collaborator) commented:

Good to go. Can be merged.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

8-16: Critical: Missing Depends import.

The pipeline failure indicates NameError: name 'Depends' is not defined at line 130. The import statement for Depends from fastapi is missing.

🐛 Add missing import
 from fastapi import (
     APIRouter,
     Body,
+    Depends,
     File,
     Form,
     HTTPException,
     Query,
     UploadFile,
 )
🤖 Fix all issues with AI agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py:
- Line 20: The migration functions upgrade() and downgrade() lack return type
hints; update both function definitions (upgrade and downgrade) to include
explicit return types (e.g., change "def upgrade():" and "def downgrade():" to
"def upgrade() -> None:" and "def downgrade() -> None:") so they conform to the
project's typing guidelines.
- Around line 47-56: The downgrade currently re-adds the "config" column on the
"evaluation_run" table using op.add_column with sa.Column(..., nullable=False)
which will fail if rows exist; update that op.add_column call in the downgrade
to use nullable=True (or alternatively add a server_default or a prior data
migration to populate values before setting non-nullable), ensuring the column
is created nullable during downgrade to avoid PostgreSQL errors.

In @backend/app/api/routes/evaluation.py:
- Around line 505-509: The code references a non-existent constant
LLMProvider.OPENAI in the evaluation config validation, causing AttributeError;
update the check in evaluation.py (the block that raises HTTPException) to
compare against the actual provider string "openai" (i.e., use
config.completion.provider != "openai") or alternatively add a new constant
OPENAI = "openai" to the LLMProvider class in
backend/app/services/llm/providers/registry.py so the symbol exists and matches
tests; pick one approach and ensure the error message and tests remain
consistent with the chosen value.

In @backend/app/crud/evaluations/core.py:
- Around line 308-349: resolve_model_from_config currently declares returning
str but assigns model = config.completion.params.get("model") which may be None;
update resolve_model_from_config to validate that model is present and a str
(e.g., if not model: raise ValueError(...) with context including eval_run.id,
config_id, config_version) before returning, or coerce/choose a safe default
only if intended; reference the resolve_model_from_config function and the model
variable from config.completion.params.get("model") when implementing the check.

In @backend/app/crud/evaluations/processing.py:
- Around line 257-263: resolve_model_from_config currently uses
config.get("model") which can return None despite its str return annotation and
docstring promise; modify resolve_model_from_config to validate the retrieved
value and raise ValueError if missing (or alternatively change the function
signature to return str | None and update callers), e.g., after fetching model =
config.get("model") check if model is truthy and raise ValueError("missing model
in config") to enforce the contract so callers like
resolve_model_from_config(session=session, eval_run=eval_run) always receive a
str or an explicit None-aware type is used consistently.
🧹 Nitpick comments (1)
backend/app/alembic/versions/041_add_config_in_evals_run_table.py (1)

1-60: Consider a multi-step migration strategy for safer deployment.

Given the destructive nature of this schema change (dropping the config column) and the PR status ("hold merge - until frontend is ready"), consider deploying this as a multi-phase migration (a sketch of a phase-1 migration follows the list below):

Phase 1: Add new columns without dropping old ones

  • Add config_id and config_version (nullable)
  • Add foreign key constraint
  • Deploy application code that writes to both old and new columns

Phase 2: Backfill existing data

  • Create a data migration script to populate config_id/config_version from existing config JSONB
  • Validate data integrity

Phase 3: Cut over

  • Deploy application code that only uses new columns
  • Monitor for issues

Phase 4: Cleanup

  • Drop the old config column in a subsequent migration

This approach provides:

  • Zero downtime deployment
  • Easy rollback at each phase
  • Data preservation and validation
  • Safer production deployment
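A minimal sketch of what a phase-1 (expand-only) migration could look like, assuming the column and FK shapes shown later in this review; revision identifiers are placeholders:

# Phase 1 sketch: add the new columns and FK without dropping the legacy config column.
import sqlalchemy as sa
from alembic import op

revision = "041"        # placeholder
down_revision = "040"   # placeholder


def upgrade() -> None:
    op.add_column(
        "evaluation_run",
        sa.Column("config_id", sa.Uuid(), nullable=True),
    )
    op.add_column(
        "evaluation_run",
        sa.Column("config_version", sa.Integer(), nullable=True),
    )
    op.create_foreign_key(
        "fk_evaluation_run_config_id", "evaluation_run", "config", ["config_id"], ["id"]
    )
    # Deliberately no op.drop_column("evaluation_run", "config") here; that would
    # happen in a later cleanup migration after backfill and cutover.


def downgrade() -> None:
    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")
    op.drop_column("evaluation_run", "config_version")
    op.drop_column("evaluation_run", "config_id")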
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eda7762 and 31d9523.

📒 Files selected for processing (8)
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/core.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/models/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/models/evaluation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/alembic/versions/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1

Files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/alembic/versions/*.py : Generate database migrations using `alembic revision --autogenerate -m "Description" --rev-id <number>` where rev-id is the latest existing revision ID + 1

Applied to files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (5)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (239-302)
backend/app/api/routes/evaluation.py (6)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
backend/app/services/llm/providers/registry.py (1)
  • LLMProvider (14-41)
backend/app/utils.py (4)
  • APIResponse (33-57)
  • get_langfuse_client (212-248)
  • get_openai_client (179-209)
  • load_description (393-398)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
backend/app/crud/evaluations/__init__.py (1)
backend/app/crud/evaluations/core.py (1)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/processing.py (2)
backend/app/crud/evaluations/core.py (2)
  • update_evaluation_run (154-206)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (21-164)
backend/app/crud/evaluations/core.py (3)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
🪛 GitHub Actions: Kaapi CI
backend/app/api/routes/evaluation.py

[error] 130-130: NameError: name 'Depends' is not defined.

🔇 Additional comments (4)
backend/app/crud/evaluations/__init__.py (1)

8-8: LGTM!

The new resolve_model_from_config function is correctly imported and exported for public use.

Also applies to: 43-43

backend/app/tests/api/routes/test_evaluation.py (1)

3-3: LGTM!

The test updates correctly reflect the shift from inline config dictionaries to stored config references. The use of create_test_config factory function aligns with the coding guidelines for test fixtures, and the error scenarios properly test config-not-found cases.

Also applies to: 10-10, 499-545, 728-803

backend/app/api/routes/evaluation.py (1)

492-509: Verify config resolution error handling covers all failure modes.

The config resolution logic handles errors from resolve_config_blob and validates the provider, but ensure that:

  1. Config version not found scenarios are properly handled
  2. Invalid/corrupted config blobs are caught
  3. The provider validation matches actual config schemas used in production
backend/app/crud/evaluations/core.py (1)

66-69: LGTM!

The logging statement correctly follows the coding guideline format with function context and includes the new config_id and config_version fields.

depends_on = None


def upgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Both upgrade() and downgrade() functions are missing return type hints.

As per coding guidelines, all functions should have type hints.

📝 Proposed fix
-def upgrade():
+def upgrade() -> None:
-def downgrade():
+def downgrade() -> None:

Also applies to: 45-45

🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py at line
20, The migration functions upgrade() and downgrade() lack return type hints;
update both function definitions (upgrade and downgrade) to include explicit
return types (e.g., change "def upgrade():" and "def downgrade():" to "def
upgrade() -> None:" and "def downgrade() -> None:") so they conform to the
project's typing guidelines.

Comment on lines +22 to +41
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_id",
            sa.Uuid(),
            nullable=True,
            comment="Reference to the stored config used",
        ),
    )
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_version",
            sa.Integer(),
            nullable=True,
            comment="Version of the config used",
        ),
    )
    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    op.drop_column("evaluation_run", "config")

⚠️ Potential issue | 🔴 Critical

Critical: Data loss and foreign key constraint naming issues.

This migration has several critical problems:

  1. Data loss: Line 41 drops the config column without migrating existing data to the new config_id/config_version columns. Any existing evaluation runs will lose their configuration data permanently.

  2. Foreign key constraint naming: Line 40 creates a foreign key with None as the constraint name, causing Alembic to auto-generate a name. However, the downgrade function (Line 57) also uses None to drop the constraint, which won't match the auto-generated name and will fail.

Required actions:

  1. Add a data migration step before dropping the config column. You'll need to:

    • Parse each existing config JSONB object
    • Look up or create corresponding config records with appropriate versions
    • Update config_id and config_version for each evaluation_run
    • Or, if data migration isn't feasible, add a comment explaining why data loss is acceptable
  2. Specify an explicit constraint name instead of None:

🔧 Proposed fix for FK constraint naming
-    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
+    op.create_foreign_key(
+        "fk_evaluation_run_config_id", 
+        "evaluation_run", 
+        "config", 
+        ["config_id"], 
+        ["id"]
+    )

And update the downgrade:

-    op.drop_constraint(None, "evaluation_run", type_="foreignkey")
+    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +47 to +56
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config",
            postgresql.JSONB(astext_type=sa.Text()),
            autoincrement=False,
            nullable=False,
            comment="Evaluation configuration (model, instructions, etc.)",
        ),
    )

⚠️ Potential issue | 🔴 Critical

Critical: Downgrade will fail with existing data.

The downgrade re-adds the config column with nullable=False (Line 53). If the evaluation_run table contains any records when downgrading, this operation will fail because PostgreSQL cannot add a non-nullable column to a table with existing rows without specifying a default value.

Either:

  1. Make the column nullable during downgrade: nullable=True
  2. Provide a server default value
  3. Add a data migration to populate the column before setting it non-nullable
🔧 Proposed fix (Option 1: Make nullable)
     op.add_column(
         "evaluation_run",
         sa.Column(
             "config",
             postgresql.JSONB(astext_type=sa.Text()),
             autoincrement=False,
-            nullable=False,
+            nullable=True,
             comment="Evaluation configuration (model, instructions, etc.)",
         ),
     )
🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py around
lines 47 - 56, The downgrade currently re-adds the "config" column on the
"evaluation_run" table using op.add_column with sa.Column(..., nullable=False)
which will fail if rows exist; update that op.add_column call in the downgrade
to use nullable=True (or alternatively add a server_default or a prior data
migration to populate values before setting non-nullable), ensuring the column
is created nullable during downgrade to avoid PostgreSQL errors.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

527-534: Fix type incompatibility in start_evaluation_batch call.

The function expects config: dict[str, Any], but config.completion.params could be either dict[str, Any] (NativeCompletionConfig) or KaapiLLMParams (KaapiCompletionConfig). Update the function signature to accept both types:

config: dict[str, Any] | KaapiLLMParams

Alternatively, convert config.completion.params to a dict before passing it if the function strictly requires a dict.
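A sketch of the normalization option, assuming the batch helper keeps a dict-based contract; the import path is inferred from the backend/app/models/llm/request.py layout and is not verified:

from typing import Any

from app.models.llm.request import KaapiLLMParams  # assumed import path


def as_batch_config(params: dict[str, Any] | KaapiLLMParams) -> dict[str, Any]:
    """Return a plain dict suitable for a dict-typed start_evaluation_batch."""
    if isinstance(params, KaapiLLMParams):
        # Matches the model_dump(exclude_none=True) serialization noted in the walkthrough.
        return params.model_dump(exclude_none=True)
    return params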

🤖 Fix all issues with AI agents
In @backend/app/api/routes/evaluation.py:
- Around line 520-524: The attribute access is inconsistent: replace
auth_context.organization.id and auth_context.project.id with the existing
variables organization_.id and project_.id to match the rest of the file; update
the call that currently passes config_id=config_id,
config_version=config_version, organization_id=auth_context.organization.id,
project_id=auth_context.project.id to use organization_.id and project_.id
(keeping the same parameter names) so all references are consistent with the
file’s use of organization_ and project_.
- Around line 493-495: Replace direct nullable attribute access to AuthContext
with the non-null asserting properties: update usages of auth_context.project.id
to auth_context.project_.id and auth_context.organization.id to
auth_context.organization_.id (e.g., where ConfigVersionCrud is instantiated as
config_version_crud = ConfigVersionCrud(session=_session, config_id=config_id,
project_id=auth_context.project.id) change to use project_); scan the rest of
this module for any other auth_context.project or auth_context.organization
usages and switch them to project_ and organization_ to avoid potential
AttributeError when those attributes are None.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31d9523 and 6b00e0f.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/api/routes/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/batch.py (1)

104-119: Filter out None values from request body to avoid API errors.

The body construction includes fields that may be None (e.g., instructions, temperature, reasoning). Passing explicit None values to the OpenAI API may cause errors, as some endpoints don't accept null for optional fields.

Proposed fix
         batch_request = {
             "custom_id": item["id"],
             "method": "POST",
             "url": "/v1/responses",
-            "body": {
-                # Use config as-is
-                "model": config.model,
-                "instructions": config.instructions,
-                "temperature": config.temperature,
-                "reasoning": {"effort": config.reasoning} if config.reasoning else None,
-                "tools": config.knowledge_base_ids if config.knowledge_base_ids else [],
-                "input": question,  # Add input from dataset
-            },
+            "body": {
+                k: v for k, v in {
+                    "model": config.model,
+                    "instructions": config.instructions,
+                    "temperature": config.temperature,
+                    "reasoning": {"effort": config.reasoning} if config.reasoning else None,
+                    "tools": config.knowledge_base_ids if config.knowledge_base_ids else None,
+                    "input": question,
+                }.items() if v is not None
+            },
         }
🤖 Fix all issues with AI agents
In @backend/app/crud/evaluations/batch.py:
- Around line 108-116: The body currently sets "tools" to
config.knowledge_base_ids (a list of IDs) but OpenAI expects tool objects;
update the body construction in batch.py to use the mapper or construct the
required tool object: call map_kaapi_to_openai_params(config) or transform
config.knowledge_base_ids into [{"type":"file_search","vector_store_ids":
config.knowledge_base_ids, "max_num_results": config.max_num_results or 20}] and
assign that to "tools" so the payload matches the OpenAI Responses API format
(refer to map_kaapi_to_openai_params in backend/app/services/llm/mappers.py and
the body assembly where "tools" is set).

In @backend/app/crud/evaluations/core.py:
- Around line 342-348: The code directly accesses config.completion.params.model
which can raise AttributeError if completion or params are missing; update the
block that follows the config resolution (around the use of eval_run and config)
to defensively validate these nested attributes: verify config has a non-None
completion and that completion has non-None params and a model (e.g., using
explicit checks or getattr with defaults), and if any are missing raise a
ValueError that includes eval_run.id, eval_run.config_id,
eval_run.config_version and a clear message about the missing field(s); ensure
subsequent uses of model assume it is valid.

In @backend/app/services/audio/speech_to_text.py:
- Around line 610-625: The __main__ test block contains developer-specific
hardcoded file paths and should be removed or guarded; delete or replace the if
__name__ == "__main__": section (which calls
transcribe_audio_with_indic_conformer with ai4b_file_path) or wrap it behind a
non-committed debug flag/CLI using argparse or environment variables, or move
the scenario into a proper test/integration test that uses fixtures or temp
files; ensure no absolute local paths remain and any manual-run examples are
documented separately rather than committed in the module.
- Around line 53-66: In the __init__ of the speech-to-text class fix the typo
and clarify the error: replace the incorrect attribute assignment self.provide =
provider with self.provider = provider, and change the API-key missing
ValueError message ("Missing OpenAI API Key for Client STT Client
initialization") to a provider-agnostic message like "Missing API key for
speech-to-text client initialization" so it no longer references only OpenAI;
keep the rest of the provider branching (provider == "openai" / provider ==
"gemini") and the unsupported-provider ValueError as-is.
- Around line 9-13: Remove the duplicate Literal import and modernize typing:
keep only the needed typing imports (e.g., BinaryIO) and remove the second
"Literal" import from the top of speech_to_text.py, then replace all usages of
List[...] and Dict[...] in function signatures, class BaseModel fields, and
return types (e.g., any occurrences in functions like transcribe_* or classes
that reference lists/dicts) with the Python 3.9+ built-in generics list[...] and
dict[...]; ensure annotations still import BinaryIO and, if Literal is required,
import it only once (or use typing.Literal) and run a quick lint/type-check to
confirm no remaining deprecated typing usages.
- Around line 76-92: The code opens audio_file when it's a string but never
closes it; change the handling around isinstance(audio_file, str) to open the
file with a context manager or track the opened file in a local variable and
ensure it's closed in a finally block so that the file handle is always released
even on exceptions; keep using self.openai_client.audio.transcriptions.create
for transcription and return the result, and update the logger calls
(logger.info and logger.error) to prefix messages with the function/method name
(e.g., the enclosing method name) per guidelines so logs include that context.

In @backend/app/services/llm/providers/registry.py:
- Around line 16-23: The registry currently includes the OPENAI constant but
get_llm_provider only recognizes OPENAI_NATIVE, causing a runtime ValueError;
fix by either removing the OPENAI entry from the _registry dict so only
OPENAI_NATIVE maps to OpenAIProvider, or update get_llm_provider to accept the
OPENAI symbol as an alias (handle provider_type == OPENAI by mapping it to
OpenAIProvider) and ensure the registry check and the ValueError logic align;
refer to OPENAI, OPENAI_NATIVE, _registry, and get_llm_provider when making the
change.
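A sketch of the alias option inside registry.py; the real get_llm_provider signature and return type may differ, and this only illustrates mapping both enum members to the same provider class:

# Sketch within registry.py; LLMProvider, BaseProvider, and OpenAIProvider are the
# names referenced elsewhere in this review.
_registry: dict[LLMProvider, type[BaseProvider]] = {
    LLMProvider.OPENAI: OpenAIProvider,          # "openai" (alias)
    LLMProvider.OPENAI_NATIVE: OpenAIProvider,   # existing entry
}


def get_llm_provider(provider_type: LLMProvider) -> type[BaseProvider]:
    provider_cls = _registry.get(provider_type)
    if provider_cls is None:
        raise ValueError(f"[get_llm_provider] Unsupported provider: {provider_type}")
    return provider_cls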
🧹 Nitpick comments (4)
backend/app/services/audio/speech_to_text.py (2)

316-323: Use tempfile module instead of hardcoded /tmp path.

The hardcoded /tmp path is not portable (e.g., Windows). Consider using tempfile.NamedTemporaryFile for cross-platform compatibility.

Proposed fix
+import tempfile
 ...
-    media_extention = (
+    media_extension = (
         file_data.media_type.split("/")[-1] if file_data.media_type else ".ogg"
     )
-    temp_file = f"/tmp/{uuid4()}.{media_extention}"
+    temp_file = tempfile.NamedTemporaryFile(
+        suffix=f".{media_extension}", delete=False
+    ).name

52-67: Add type hints to all function parameters and return values.

Per coding guidelines, all functions should have type hints. Several functions in this file are missing return type annotations. For example:

Example fixes
-def __init__(self, provider, api_key: str | None = None):
+def __init__(self, provider: str, api_key: str | None = None) -> None:

-def transcribe_with_openai(
+def transcribe_with_openai(
     self,
     audio_file: BinaryIO | str,
     model: str = "gpt-4o-transcribe",
     prompt: str | None = None,
-):
+) -> str:

-def transcribe_audio_with_chirp_v3(audio_file_path: str):
+def transcribe_audio_with_chirp_v3(audio_file_path: str) -> str | None:

-def transcribe_audio_with_indic_conformer(audio_file_path: str):
+def transcribe_audio_with_indic_conformer(audio_file_path: str) -> str:

Apply similar changes to other functions missing return type hints.

backend/app/crud/evaluations/core.py (2)

61-69: Add function name prefix to log messages.

Per coding guidelines, log messages should be prefixed with the function name in square brackets.

Proposed fix
     except Exception as e:
         session.rollback()
-        logger.error(f"Failed to create EvaluationRun: {e}", exc_info=True)
+        logger.error(f"[create_evaluation_run] Failed to create EvaluationRun: {e}", exc_info=True)
         raise

-    logger.info(
-        f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
+    logger.info(
+        f"[create_evaluation_run] Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
         f"config_id={config_id}, config_version={config_version}"
     )

308-311: Return type hint verified.

The signature already declares -> str:, so the earlier concern about a missing return type annotation does not apply here. The function implementation looks good overall.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b00e0f and ceb3970.

📒 Files selected for processing (4)
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/audio/speech_to_text.py
  • backend/app/services/llm/providers/registry.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/llm/providers/registry.py
  • backend/app/services/audio/speech_to_text.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/llm/providers/registry.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/audio/speech_to_text.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/services/llm/providers/registry.py
🧬 Code graph analysis (4)
backend/app/services/llm/providers/registry.py (2)
backend/app/services/llm/providers/base.py (1)
  • BaseProvider (13-63)
backend/app/services/llm/providers/openai.py (1)
  • OpenAIProvider (21-110)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/crud/evaluations/core.py (5)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (318-557)
backend/app/models/evaluation.py (1)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
backend/app/services/audio/speech_to_text.py (1)
backend/app/core/providers.py (1)
  • ProviderConfig (18-21)
🪛 Ruff (0.14.10)
backend/app/services/audio/speech_to_text.py

13-13: typing.List is deprecated, use list instead

(UP035)


13-13: typing.Dict is deprecated, use dict instead

(UP035)


13-13: Redefinition of unused Literal from line 9

Remove definition: Literal

(F811)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (2)
backend/app/crud/evaluations/batch.py (1)

63-65: LGTM on type change.

The signature change from dict[str, Any] to KaapiLLMParams provides better type safety and aligns with the PR's goal of using stored configurations.

backend/app/crud/evaluations/core.py (1)

18-27: LGTM on signature change.

The refactor from config: dict to config_id: UUID, config_version: int properly implements the PR objective of using stored configurations instead of raw config dictionaries.

Comment on lines 9 to 13
from typing import BinaryIO, Literal
from pydantic import BaseModel
from dotenv import load_dotenv
from uuid import uuid4
from typing import List, Literal, Dict

⚠️ Potential issue | 🟡 Minor

Fix duplicate import and use modern type hints.

Literal is imported twice (lines 9 and 13), and List/Dict from typing are deprecated in Python 3.9+.

Proposed fix
-from typing import BinaryIO, Literal
+from typing import BinaryIO
 from pydantic import BaseModel
 from dotenv import load_dotenv
 from uuid import uuid4
-from typing import List, Literal, Dict
+from typing import Literal

Then replace List[...] with list[...] and Dict[...] with dict[...] throughout the file.

🧰 Tools
🪛 Ruff (0.14.10)

13-13: typing.List is deprecated, use list instead

(UP035)


13-13: typing.Dict is deprecated, use dict instead

(UP035)


13-13: Redefinition of unused Literal from line 9

Remove definition: Literal

(F811)

🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 9 - 13, Remove the
duplicate Literal import and modernize typing: keep only the needed typing
imports (e.g., BinaryIO) and remove the second "Literal" import from the top of
speech_to_text.py, then replace all usages of List[...] and Dict[...] in
function signatures, class BaseModel fields, and return types (e.g., any
occurrences in functions like transcribe_* or classes that reference
lists/dicts) with the Python 3.9+ built-in generics list[...] and dict[...];
ensure annotations still import BinaryIO and, if Literal is required, import it
only once (or use typing.Literal) and run a quick lint/type-check to confirm no
remaining deprecated typing usages.

Comment on lines 53 to 66
def __init__(self, provider, api_key: str | None = None):
    if api_key is None:
        raise ValueError(
            "Missing OpenAI API Key for Client STT Client initialization"
        )
    self.provide = provider
    self.openai_client = None
    self.gemini_client = None
    if provider == "openai":
        self.openai_client = OpenAI(api_key=api_key)
    elif provider == "gemini":
        self.gemini_client = genai.Client(api_key=api_key)
    else:
        raise ValueError("This provider is not supported yet.")

⚠️ Potential issue | 🔴 Critical

Typo and misleading error message in __init__.

  1. Line 58: self.provide = provider should be self.provider = provider
  2. Lines 54-57: Error message mentions "OpenAI" but this class supports multiple providers
Proposed fix
 def __init__(self, provider, api_key: str | None = None):
     if api_key is None:
         raise ValueError(
-            "Missing OpenAI API Key for Client STT Client initialization"
+            f"Missing API Key for {provider} STT Client initialization"
         )
-    self.provide = provider
+    self.provider = provider
🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 53 - 66, In the
__init__ of the speech-to-text class fix the typo and clarify the error: replace
the incorrect attribute assignment self.provide = provider with self.provider =
provider, and change the API-key missing ValueError message ("Missing OpenAI API
Key for Client STT Client initialization") to a provider-agnostic message like
"Missing API key for speech-to-text client initialization" so it no longer
references only OpenAI; keep the rest of the provider branching (provider ==
"openai" / provider == "gemini") and the unsupported-provider ValueError as-is.

Comment on lines 76 to 92
    try:
        # Handle file path vs file object
        if isinstance(audio_file, str):
            audio_file = open(audio_file, "rb")

        transcription = self.openai_client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            # language="hi",
            response_format="text",
            prompt=prompt or DEFAULT_PROMPT,
        )
        logger.info(f"Successfully transcribed audio using OpenAI model: {model}")
        return transcription
    except Exception as e:
        logger.error(f"OpenAI transcription failed: {str(e)}", exc_info=True)
        raise

⚠️ Potential issue | 🟠 Major

Resource leak: file handle not closed when path is provided.

When audio_file is a string path, the file is opened at line 79 but never closed, causing a resource leak. Also, log messages should be prefixed with the function name per coding guidelines.

Proposed fix
 def transcribe_with_openai(
     self,
     audio_file: BinaryIO | str,
     model: str = "gpt-4o-transcribe",
     prompt: str | None = None,
 ):
     if self.openai_client is None:
         raise ValueError("OpenAI client is not initialized.")
+    should_close = False
     try:
         # Handle file path vs file object
         if isinstance(audio_file, str):
             audio_file = open(audio_file, "rb")
+            should_close = True

         transcription = self.openai_client.audio.transcriptions.create(
             model=model,
             file=audio_file,
             # language="hi",
             response_format="text",
             prompt=prompt or DEFAULT_PROMPT,
         )
-        logger.info(f"Successfully transcribed audio using OpenAI model: {model}")
+        logger.info(f"[transcribe_with_openai] Successfully transcribed audio using OpenAI model: {model}")
         return transcription
     except Exception as e:
-        logger.error(f"OpenAI transcription failed: {str(e)}", exc_info=True)
+        logger.error(f"[transcribe_with_openai] OpenAI transcription failed: {str(e)}", exc_info=True)
         raise
+    finally:
+        if should_close and hasattr(audio_file, 'close'):
+            audio_file.close()
🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 76 - 92: the code
opens audio_file when it's a string but never closes it; change the handling
around isinstance(audio_file, str) to open the file with a context manager or
track the opened file in a local variable and ensure it's closed in a finally
block so that the file handle is always released even on exceptions; keep using
self.openai_client.audio.transcriptions.create for transcription and return the
result, and update the logger calls (logger.info and logger.error) to prefix
messages with the function/method name (e.g., the enclosing method name) per
guidelines so logs include that context.
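
If the context-manager route is preferred over manual close tracking, the same method could be written roughly like this (a sketch only; the method signature, DEFAULT_PROMPT, and logger are taken from the snippet above, and the surrounding class context is assumed):

def transcribe_with_openai(
    self,
    audio_file: BinaryIO | str,
    model: str = "gpt-4o-transcribe",
    prompt: str | None = None,
):
    if self.openai_client is None:
        raise ValueError("OpenAI client is not initialized.")

    def _call(fh: BinaryIO):
        # Single call site shared by the path and file-object branches
        return self.openai_client.audio.transcriptions.create(
            model=model,
            file=fh,
            response_format="text",
            prompt=prompt or DEFAULT_PROMPT,
        )

    try:
        if isinstance(audio_file, str):
            # A path was given: the with-block closes the handle for us
            with open(audio_file, "rb") as fh:
                transcription = _call(fh)
        else:
            transcription = _call(audio_file)
        logger.info(f"[transcribe_with_openai] Successfully transcribed audio using OpenAI model: {model}")
        return transcription
    except Exception as e:
        logger.error(f"[transcribe_with_openai] OpenAI transcription failed: {e}", exc_info=True)
        raise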

Comment on lines 178 to 190
def transcribe_audio_with_indic_conformer(audio_file_path: str):
    indic_conformer_api_url = str(os.getenv("AI4B_STT_URL"))
    with open(audio_file_path, "rb") as file:
        audio_content = file.read()

    response = requests.post(
        url=indic_conformer_api_url,
        data={"language_code": "hi", "decoding_strategy": "ctc"},
        files={"audio_file": audio_content},
    )
    logger.info(response.json())
    transcription = response.json()["transcription"]
    return transcription

⚠️ Potential issue | 🟠 Major

Missing timeout and error handling for external API call.

The requests.post call has no timeout, which can cause the application to hang indefinitely if the AI4B service is unresponsive. Additionally, there's no error handling for HTTP errors or JSON parsing failures.

Proposed fix
 def transcribe_audio_with_indic_conformer(audio_file_path: str):
     indic_conformer_api_url = str(os.getenv("AI4B_STT_URL"))
+    if not indic_conformer_api_url:
+        raise ValueError("AI4B_STT_URL environment variable not set")
+    
     with open(audio_file_path, "rb") as file:
         audio_content = file.read()

     response = requests.post(
         url=indic_conformer_api_url,
         data={"language_code": "hi", "decoding_strategy": "ctc"},
         files={"audio_file": audio_content},
+        timeout=60,
     )
-    logger.info(response.json())
-    transcription = response.json()["transcription"]
+    response.raise_for_status()
+    result = response.json()
+    logger.info(f"[transcribe_audio_with_indic_conformer] Response received: {result}")
+    transcription = result["transcription"]
     return transcription
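
On the caller side, the added timeout and raise_for_status() make failures explicit; a small, hypothetical usage sketch of how they might be handled (safe_transcribe is not an existing function in the repo):

import requests

def safe_transcribe(audio_file_path: str) -> str | None:
    try:
        return transcribe_audio_with_indic_conformer(audio_file_path)
    except requests.exceptions.Timeout:
        logger.error("[safe_transcribe] AI4B STT request timed out")
    except requests.exceptions.HTTPError as e:
        logger.error(f"[safe_transcribe] AI4B STT returned an error status: {e}")
    except (KeyError, ValueError) as e:
        # Covers a missing 'transcription' key or an unparsable JSON body
        logger.error(f"[safe_transcribe] Unexpected AI4B STT response: {e}")
    return None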

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/app/crud/evaluations/batch.py (2)

63-91: Docstring is inconsistent with the new parameter type.

The config parameter is now KaapiLLMParams, but the docstring at line 77 still describes it as "Evaluation configuration dict with OpenAI Responses API parameters." Update the docstring to reflect the typed model and its actual fields.

📝 Suggested docstring update
     Args:
         dataset_items: List of dataset items from Langfuse
-        config: Evaluation configuration dict with OpenAI Responses API parameters.
-            This config is used as-is in the body, with only "input" being added
-            from the dataset. Config can include any fields like:
-            - model (required)
-            - instructions
-            - tools
-            - reasoning
-            - text
-            - temperature
-            - include
-            etc.
+        config: KaapiLLMParams containing LLM configuration parameters including
+            model, instructions, knowledge_base_ids, reasoning, temperature,
+            and max_num_results.

     Returns:
         List of dictionaries (JSONL data)

142-148: Docstring is inconsistent with the new parameter type.

The config parameter docstring at line 147 still describes it as "Evaluation configuration dict" but the type is now KaapiLLMParams.

📝 Suggested docstring update
     Args:
         langfuse: Configured Langfuse client
         openai_client: Configured OpenAI client
         session: Database session
         eval_run: EvaluationRun database object (with run_name, dataset_name, config)
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams containing LLM configuration parameters
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceb3970 and ebdda81.

📒 Files selected for processing (1)
  • backend/app/crud/evaluations/batch.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/batch.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (2)
backend/app/crud/evaluations/batch.py (2)

20-20: LGTM!

The import of KaapiLLMParams is appropriate and aligns with the refactoring to use typed configuration objects instead of raw dictionaries.


176-182: LGTM!

Good use of model_dump(exclude_none=True) to serialize the config without None values. This ensures clean storage of the evaluation configuration.
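
For readers less familiar with Pydantic v2, a quick standalone illustration of what exclude_none=True does (ExampleParams is a made-up model, not the project's KaapiLLMParams):

from pydantic import BaseModel

class ExampleParams(BaseModel):
    model: str
    instructions: str | None = None
    temperature: float | None = None

params = ExampleParams(model="gpt-4o", temperature=0.2)
print(params.model_dump())
# {'model': 'gpt-4o', 'instructions': None, 'temperature': 0.2}
print(params.model_dump(exclude_none=True))
# {'model': 'gpt-4o', 'temperature': 0.2}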

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/tests/api/routes/test_evaluation.py (4)

510-540: Critical: Corrupted method definition causing syntax error.

This method has a duplicated/malformed signature that's causing the CI pipeline to fail with a Black parsing error. Lines 510-517 start defining the method, but then lines 518-522 appear to be remnants of an old or conflicting definition with type hints.

The method body also has inconsistencies - it references sample_evaluation_config fixture in the old signature but never uses it.

🐛 Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
-        self, client, user_api_key_header, db, user_api_key
-    ):
-        """Test batch evaluation fails with invalid dataset_id."""
-        # Create a valid config to use
-        config = create_test_config(db, project_id=user_api_key.project.id)
-
-        # Try to start evaluation with non-existent dataset_id
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
-        sample_evaluation_config: dict[str, Any],
+        db: Session,
+        user_api_key: TestAuthContext,
     ) -> None:
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
+        # Create a valid config to use
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         response = client.post(
             "/api/v1/evaluations",
             json={
                 "experiment_name": "test_evaluation_run",
                 "dataset_id": 99999,  # Non-existent
                 "config_id": str(config.id),
                 "config_version": 1,
             },
             headers=user_api_key_header,
         )

845-854: Inconsistent: Test still uses deprecated config field.

This test (test_get_evaluation_run_without_trace_info) creates an EvaluationRun with the old config dict field instead of the new config_id/config_version pattern used in other updated tests. Based on the AI summary and related model changes, the schema now expects config_id and config_version fields.

🔧 Suggested fix for consistency
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

880-888: Inconsistent: Another test using deprecated config field.

Same issue as above - test_get_evaluation_run_resync_without_trace_info_fails still uses the old config dict pattern. Should be updated to use config_id/config_version for consistency with the refactored schema.

🔧 Suggested fix for consistency
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

577-654: Type mismatch: Tests pass dict but build_evaluation_jsonl expects KaapiLLMParams.

The function at backend/app/crud/evaluations/batch.py (lines 63-65) declares config: KaapiLLMParams, and its implementation (lines 110-122) accesses attributes directly: config.model, config.instructions, config.temperature, config.reasoning, config.knowledge_base_ids, config.max_num_results.

Plain dicts do not support attribute access, so these tests would fail at runtime with AttributeError: 'dict' object has no attribute 'model'. Construct KaapiLLMParams instances instead:

from app.models.llm import KaapiLLMParams

config = KaapiLLMParams(
    model="gpt-4o",
    temperature=0.2,
    instructions="You are a helpful assistant",
)
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

542-572: Missing type hints on method parameters.

Per coding guidelines, all function parameters and return values should have type hints. The client and user_api_key_header parameters are missing type annotations.

Also, the test docstring says "model is missing from config" but the test now validates config-not-found since it uses config_id/config_version. Consider updating the docstring to reflect the actual test behavior.

♻️ Suggested improvement
     def test_start_batch_evaluation_missing_model(
-        self, client: TestClient, user_api_key_header: dict[str, str]
+        self,
+        client: TestClient,
+        user_api_key_header: dict[str, str],
     ) -> None:
-        """Test batch evaluation fails when model is missing from config."""
-        # We don't need a real dataset for this test - the validation should happen
-        # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }
-
+        """Test batch evaluation fails when config_id references non-existent config."""
         response = client.post(
             "/api/v1/evaluations",
             json={
                 "experiment_name": "test_no_config",
                 "dataset_id": 1,  # Dummy ID, config validation happens first
                 "config_id": str(uuid4()),  # Non-existent config
                 "config_version": 1,
             },
             headers=user_api_key_header,
         )

Note: The invalid_config variable on lines 548-551 is now dead code and should be removed.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebdda81 and f9e77ff.

📒 Files selected for processing (1)
  • backend/app/tests/api/routes/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (1)
backend/app/tests/api/routes/test_evaluation.py (5)
backend/app/tests/crud/documents/test_doc_transformation_job.py (1)
  • crud (24-25)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (63-126)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/tests/utils/document.py (2)
  • project (65-66)
  • get (118-122)
🪛 GitHub Actions: Kaapi CI
backend/app/tests/api/routes/test_evaluation.py

[error] 519-519: Black formatting failed. Cannot parse: 519:26: client: TestClient,

🔇 Additional comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

4-4: LGTM: New imports support config-based testing.

The additions of uuid4 and create_test_config properly support the refactored test cases that now use stored configs with config_id/config_version instead of inline config dicts.

Also applies to: 12-12


742-758: LGTM: Correctly updated to use config_id/config_version.

The test now properly creates a test config and uses config_id and config_version fields when constructing the EvaluationRun, aligning with the new schema.


782-804: LGTM: Test updated to use config references.

Correctly creates a config and uses config_id/config_version fields in the EvaluationRun construction.


@AkhileshNegi AkhileshNegi changed the title from "Evaluation to Use Config Management" to "Evaluation: Use Config Management" on Jan 15, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

839-863: Test uses deprecated config dict field instead of config_id/config_version.

The EvaluationRun model no longer has a config dict field (per the model definition in relevant snippets showing config_id: UUID and config_version: int). This test creates an EvaluationRun with config={"model": "gpt-4o"} which is the old pattern and will likely fail.

     def test_get_evaluation_run_without_trace_info(
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
         db: Session,
         user_api_key: TestAuthContext,
         create_test_dataset: EvaluationDataset,
     ) -> None:
         """Test getting evaluation run without requesting trace info."""
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

865-902: Same issue: Test uses deprecated config dict field.

This test also creates EvaluationRun with config={"model": "gpt-4o"} which needs to be updated to use config_id and config_version.

     def test_get_evaluation_run_resync_without_trace_info_fails(
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
         db: Session,
         user_api_key: TestAuthContext,
         create_test_dataset: EvaluationDataset,
     ) -> None:
         """Test that resync_score=true requires get_trace_info=true."""
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

571-600: Replace dict config with KaapiLLMParams instantiation—test will fail at runtime.

The test passes a plain dictionary to build_evaluation_jsonl, which expects a KaapiLLMParams object. Since the function accesses config using attribute notation (e.g., config.model, config.temperature), the test will fail with an AttributeError at runtime when trying to access attributes on a dict. Use KaapiLLMParams(model="gpt-4o", temperature=0.2, instructions="You are a helpful assistant") instead, consistent with how other tests in the codebase instantiate this model. This also aligns with the factory pattern requirement for test fixtures.

🤖 Fix all issues with AI agents
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 510-517: The test method
test_start_batch_evaluation_invalid_dataset_id contains a duplicated docstring —
remove the redundant docstring so the method has a single descriptive docstring
(keep one of the two identical strings and delete the other) to eliminate the
editing artifact.
🧹 Nitpick comments (4)
backend/app/crud/evaluations/core.py (2)

308-358: Return type should be str not str | None.

The function raises ValueError in all cases where the model would be missing (lines 353-357), so it can never return None. The return type annotation should be str to accurately reflect the contract.

 def resolve_model_from_config(
     session: Session,
     eval_run: EvaluationRun,
-) -> str | None:
+) -> str:

63-69: Log messages should include function name prefix.

Per coding guidelines, log messages should be prefixed with the function name in square brackets. The log statements here and in other CRUD functions (list_evaluation_runs, get_evaluation_run_by_id, update_evaluation_run) don't follow this pattern.

Example fix for this function
     except Exception as e:
         session.rollback()
-        logger.error(f"Failed to create EvaluationRun: {e}", exc_info=True)
+        logger.error(f"[create_evaluation_run] Failed to create EvaluationRun: {e}", exc_info=True)
         raise

-    logger.info(
-        f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-        f"config_id={config_id}, config_version={config_version}"
-    )
+    logger.info(
+        f"[create_evaluation_run] Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
+        f"config_id={config_id}, config_version={config_version}"
+    )

As per coding guidelines, prefix all log messages with the function name in square brackets.

backend/app/api/routes/evaluation.py (1)

501-510: Add defensive None check for config before accessing .completion.

The elif on line 506 accesses config.completion.provider but only checks if error:. If resolve_config_blob returns (None, None) unexpectedly, this would raise AttributeError.

Suggested fix
     if error:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+    if config is None:
+        raise HTTPException(
+            status_code=400,
+            detail="Config resolution returned empty config",
+        )
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
backend/app/tests/api/routes/test_evaluation.py (1)

536-565: Test name and docstring are now misleading.

The test test_start_batch_evaluation_missing_model with docstring "Test batch evaluation fails when model is missing from config" no longer tests that scenario. It now tests that a non-existent config_id returns an error. Consider renaming to test_start_batch_evaluation_invalid_config and updating the docstring.

-    def test_start_batch_evaluation_missing_model(
+    def test_start_batch_evaluation_invalid_config(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
-        """Test batch evaluation fails when model is missing from config."""
+        """Test batch evaluation fails when config_id does not exist."""
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9e77ff and dddee70.

📒 Files selected for processing (5)
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/llm/providers/registry.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/services/llm/providers/registry.py
  • backend/app/crud/evaluations/batch.py
🧰 Additional context used
📓 Path-based instructions (3)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/core.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (63-133)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (3)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
🔇 Additional comments (7)
backend/app/crud/evaluations/core.py (2)

348-357: Good defensive validation for nested config access.

The defensive checks for config.completion, config.completion.params, and model address the concerns raised in previous reviews. This ensures proper error messages when the config structure is incomplete.
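
As a rough illustration of the pattern (not the repository's actual resolve_model_from_config; the completion.params.model shape is inferred from this review thread):

def _extract_model(config: object) -> str:
    # Fail with a specific message at each level of nesting
    completion = getattr(config, "completion", None)
    if completion is None:
        raise ValueError("Config has no 'completion' section")
    params = getattr(completion, "params", None)
    if params is None:
        raise ValueError("Config 'completion' has no 'params'")
    model = getattr(params, "model", None)
    if not model:
        raise ValueError("Config does not specify a model")
    return model  # return type can be plain str: every failure path raises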


18-71: LGTM: Function signature update for config management integration.

The create_evaluation_run function has been properly updated to accept config_id: UUID and config_version: int instead of a config dict. Type hints are complete and the docstring reflects the new parameters.

backend/app/api/routes/evaluation.py (3)

446-534: LGTM: Evaluation endpoint updated for config management.

The evaluate function properly:

  • Accepts config_id: UUID and config_version: int as required body parameters
  • Resolves the config via ConfigVersionCrud and resolve_config_blob
  • Validates the provider is OpenAI
  • Passes config.completion.params to start_evaluation_batch, which aligns with the KaapiLLMParams type expected by build_evaluation_jsonl

512-512: Log message follows the coding guidelines.

Good use of function name prefix in the log message: [evaluate] Successfully resolved config from config management.


493-510: LLMProvider.OPENAI constant exists and is properly implemented.

The constant LLMProvider.OPENAI = "openai" is defined at backend/app/services/llm/providers/registry.py:16 and is already integrated into the provider registry. It is also used in the get_llm_provider function at line 66.

Additionally, the None-checking for config at line 506 is safe. The resolve_config_blob function returns tuple[ConfigBlob | None, str | None] where the contract guarantees either (config_blob, None) on success or (None, error_message) on failure. Since line 506 is only reached after the if error: check at line 501 (which raises on any error), config cannot be None at that point.

Likely an incorrect or invalid review comment.
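
For anyone unfamiliar with the (value, error) convention described above, a generic, self-contained sketch of how such a contract is produced and consumed (names are illustrative only, not the repository's):

def resolve_blob(blob_id: str) -> tuple[dict | None, str | None]:
    # Contract: (value, None) on success, (None, error_message) on failure
    store = {"abc": {"provider": "openai", "model": "gpt-4o"}}
    if blob_id not in store:
        return None, f"Config '{blob_id}' not found"
    return store[blob_id], None

config, error = resolve_blob("abc")
if error:
    raise ValueError(error)
# Past the error check, config is guaranteed to be non-None
print(config["model"])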

backend/app/tests/api/routes/test_evaluation.py (2)

510-534: LGTM: Test updated to use config management pattern.

The test correctly creates a test config using create_test_config and passes config_id and config_version in the API payload, aligning with the new endpoint contract.


736-752: LGTM: Test properly updated for config management.

The test correctly creates a test config and uses config_id and config_version when constructing the EvaluationRun.


Comment on lines 510 to 517
    def test_start_batch_evaluation_invalid_dataset_id(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
        sample_evaluation_config: dict[str, Any],
    ) -> None:
        self, client, user_api_key_header, db, user_api_key
    ):
        """Test batch evaluation fails with invalid dataset_id."""
        # Create a valid config to use
        config = create_test_config(db, project_id=user_api_key.project.id)

        """Test batch evaluation fails with invalid/non-existent dataset_id."""

⚠️ Potential issue | 🟡 Minor

Duplicate docstring in test method.

Lines 513 and 517 both contain docstrings for the same test method. This appears to be an editing artifact - remove one of them.

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

         """Test batch evaluation fails with invalid/non-existent dataset_id."""

Should be:

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` around lines 510 - 517: the
test method test_start_batch_evaluation_invalid_dataset_id contains a duplicated
docstring — remove the redundant docstring so the method has a single
descriptive docstring (keep one of the two identical strings and delete the
other) to eliminate the editing artifact.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/app/tests/api/routes/test_evaluation.py (6)

551-580: Remove unused invalid_config variable.

The invalid_config dict at lines 557-560 is defined but never used. The test now uses uuid4() for a non-existent config_id instead.

🧹 Suggested fix
     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
         # We don't need a real dataset for this test - the validation should happen
         # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }

         response = client.post(

601-631: Tests will fail: build_evaluation_jsonl now expects KaapiLLMParams, not dict.

All tests in TestBatchEvaluationJSONLBuilding pass dict configs to build_evaluation_jsonl, but the function signature changed to accept KaapiLLMParams. These tests will raise AttributeError when accessing config.model, config.instructions, etc.

🔧 Suggested fix for test_build_batch_jsonl_basic
+from app.models.llm.request import KaapiLLMParams
+
 class TestBatchEvaluationJSONLBuilding:
     """Test JSONL building logic for batch evaluation."""

     def test_build_batch_jsonl_basic(self) -> None:
         """Test basic JSONL building with minimal config."""
         dataset_items = [
             {
                 "id": "item1",
                 "input": {"question": "What is 2+2?"},
                 "expected_output": {"answer": "4"},
                 "metadata": {},
             }
         ]

-        config = {
-            "model": "gpt-4o",
-            "temperature": 0.2,
-            "instructions": "You are a helpful assistant",
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o",
+            temperature=0.2,
+            instructions="You are a helpful assistant",
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

Similar changes are needed for all other tests in this class.


632-659: Test uses tools key which doesn't exist in KaapiLLMParams.

This test assumes a tools field in the config, but KaapiLLMParams uses knowledge_base_ids instead. The function internally converts knowledge_base_ids to the tools format for OpenAI. Update to use KaapiLLMParams with knowledge_base_ids.

🔧 Suggested fix
     def test_build_batch_jsonl_with_tools(self) -> None:
         """Test JSONL building with tools configuration."""
         dataset_items = [
             {
                 "id": "item1",
                 "input": {"question": "Search the docs"},
                 "expected_output": {"answer": "Answer from docs"},
                 "metadata": {},
             }
         ]

-        config = {
-            "model": "gpt-4o-mini",
-            "instructions": "Search documents",
-            "tools": [
-                {
-                    "type": "file_search",
-                    "vector_store_ids": ["vs_abc123"],
-                }
-            ],
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o-mini",
+            instructions="Search documents",
+            knowledge_base_ids=["vs_abc123"],
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

         assert len(jsonl_data) == 1
         request = jsonl_data[0]
         assert request["body"]["tools"][0]["type"] == "file_search"
         assert "vs_abc123" in request["body"]["tools"][0]["vector_store_ids"]

661-710: Update remaining tests to use KaapiLLMParams.

test_build_batch_jsonl_minimal_config and test_build_batch_jsonl_skips_empty_questions also pass dict configs and need similar updates.

🔧 Suggested fixes
     def test_build_batch_jsonl_minimal_config(self) -> None:
         """Test JSONL building with minimal config (only model required)."""
         dataset_items = [...]

-        config = {"model": "gpt-4o"}  # Only model provided
+        config = KaapiLLMParams(model="gpt-4o")

         jsonl_data = build_evaluation_jsonl(dataset_items, config)
     def test_build_batch_jsonl_skips_empty_questions(self) -> None:
         """Test that items with empty questions are skipped."""
         dataset_items = [...]

-        config = {"model": "gpt-4o", "instructions": "Test"}
+        config = KaapiLLMParams(model="gpt-4o", instructions="Test")

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

712-736: Update test_build_batch_jsonl_multiple_items to use KaapiLLMParams.

🔧 Suggested fix
     def test_build_batch_jsonl_multiple_items(self) -> None:
         """Test JSONL building with multiple items."""
         dataset_items = [...]

-        config = {
-            "model": "gpt-4o",
-            "instructions": "Answer questions",
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o",
+            instructions="Answer questions",
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

869-877: Update tests to use config_id and config_version instead of config dict.

The EvaluationRun model does not have a config field—it has config_id (UUID) and config_version (int). The test initializes EvaluationRun with config={"model": "gpt-4o"}, which will be rejected by the model. Update lines 869-877 and the similar test at lines 904-912 to use config_id and config_version to match the pattern in lines 766-778 and 806-827.

♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

525-549: Duplicate docstring in test method.

Lines 528 and 532 both contain docstrings for the same test method.

🔧 Suggested fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
🧹 Nitpick comments (4)
backend/app/crud/evaluations/__init__.py (1)

40-70: Inconsistency: save_score and fetch_trace_scores_from_langfuse imported but not in __all__.

Two functions are imported at the module level (lines 9 and 29) but omitted from __all__:

  • save_score (from core)
  • fetch_trace_scores_from_langfuse (from langfuse)

If these are intended as internal-only, consider removing them from the top-level imports to avoid confusion. Otherwise, add them to __all__ for consistency.

Option A: Add missing exports to `__all__`
 __all__ = [
     # Core
     "create_evaluation_run",
     "get_evaluation_run_by_id",
     "list_evaluation_runs",
     "resolve_model_from_config",
+    "save_score",
     # Cron
     "process_all_pending_evaluations",
     "process_all_pending_evaluations_sync",
     ...
     # Langfuse
     "create_langfuse_dataset_run",
+    "fetch_trace_scores_from_langfuse",
     "update_traces_with_cosine_scores",
     "upload_dataset_to_langfuse",
 ]
Option B: Remove unused top-level imports
 from app.crud.evaluations.core import (
     create_evaluation_run,
     get_evaluation_run_by_id,
     list_evaluation_runs,
     resolve_model_from_config,
-    save_score,
 )
 ...
 from app.crud.evaluations.langfuse import (
     create_langfuse_dataset_run,
-    fetch_trace_scores_from_langfuse,
     update_traces_with_cosine_scores,
     upload_dataset_to_langfuse,
 )
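
Whichever option is chosen, a small guard test can keep __all__ honest going forward (a generic sketch, not existing project code):

import importlib

def test_all_exports_are_defined() -> None:
    module = importlib.import_module("app.crud.evaluations")
    missing = [name for name in module.__all__ if not hasattr(module, name)]
    assert not missing, f"__all__ lists names that are not defined: {missing}"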
backend/app/crud/evaluations/embeddings.py (1)

365-366: Stale comment: model is hard-coded, not read from config.

The comment on line 365 states "Get embedding model from config" but the code now hard-codes the value directly. Consider updating the comment to reflect the intentional hard-coding, or remove the comment entirely since the assignment is self-explanatory.

📝 Suggested comment fix
-        # Get embedding model from config (default: text-embedding-3-large)
+        # Use fixed embedding model for similarity scoring
         embedding_model = "text-embedding-3-large"
backend/app/crud/evaluations/batch.py (2)

74-86: Docstring references fields not present in KaapiLLMParams.

The docstring mentions config fields like tools, text, and include, but KaapiLLMParams (per the relevant snippet) only has: model, instructions, knowledge_base_ids, reasoning, temperature, and max_num_results. Update the docstring to reflect the actual KaapiLLMParams structure.

📝 Suggested docstring update
     Args:
         dataset_items: List of dataset items from Langfuse
-        config: Evaluation configuration dict with OpenAI Responses API parameters.
-            This config is used as-is in the body, with only "input" being added
-            from the dataset. Config can include any fields like:
-            - model (required)
-            - instructions
-            - tools
-            - reasoning
-            - text
-            - temperature
-            - include
-            etc.
+        config: KaapiLLMParams containing evaluation configuration:
+            - model (required): Model identifier
+            - instructions: System instructions
+            - knowledge_base_ids: Vector store IDs for file_search tools
+            - reasoning: Reasoning effort level
+            - temperature: Sampling temperature
+            - max_num_results: Max results for file search

     Returns:
         List of dictionaries (JSONL data)

148-153: Update docstring to reflect KaapiLLMParams type.

The docstring still describes config as "Evaluation configuration dict with llm, instructions, vector_store_ids" but the parameter is now KaapiLLMParams.

📝 Suggested fix
         session: Database session
         eval_run: EvaluationRun database object (with run_name, dataset_name, config)
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams containing model, instructions, and other LLM parameters
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dddee70 and 355d281.

📒 Files selected for processing (5)
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/crud/evaluations/processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/__init__.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (2)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (4)
backend/app/crud/evaluations/__init__.py (1)

4-10: LGTM on the new resolve_model_from_config import.

The new function is properly imported from core and correctly added to __all__, making it part of the module's public API.

backend/app/crud/evaluations/batch.py (1)

103-128: LGTM! Body building logic correctly handles None values.

The conditional field inclusion properly addresses the previous review feedback about not sending None values to the OpenAI API. Fields are only added when they have valid values.
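
The approved pattern boils down to adding optional keys only when they carry a value; a generic sketch (field names mirror the ones discussed in this thread, not the exact implementation):

def build_body(
    model: str,
    question: str,
    instructions: str | None = None,
    temperature: float | None = None,
) -> dict:
    body: dict = {"model": model, "input": question}
    # Optional fields are included only when set, so the JSON sent
    # to the API never contains explicit nulls
    if instructions is not None:
        body["instructions"] = instructions
    if temperature is not None:
        body["temperature"] = temperature
    return body

assert "temperature" not in build_body("gpt-4o", "What is 2+2?")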

backend/app/tests/api/routes/test_evaluation.py (2)

766-778: LGTM! Test correctly uses config_id and config_version.

The test properly creates a config using create_test_config and references it via config_id and config_version on the EvaluationRun, aligning with the new config management pattern.


806-827: LGTM! Consistent config reference pattern.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/tests/api/routes/test_evaluation.py (4)

644-660: Type mismatch: build_evaluation_jsonl expects KaapiLLMParams, not dict.

The function signature is build_evaluation_jsonl(dataset_items: list[dict[str, Any]], config: KaapiLLMParams). This test passes a plain dict, which will cause a runtime error when the function accesses config.model, config.instructions, etc. as attributes.

Proposed fix using KaapiLLMParams
-        config = {
-            "model": "gpt-4o-mini",
-            "instructions": "Search documents",
-            "tools": [
-                {
-                    "type": "file_search",
-                    "vector_store_ids": ["vs_abc123"],
-                }
-            ],
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o-mini",
+            instructions="Search documents",
+            knowledge_base_ids=["vs_abc123"],
+        )

Note: Based on the build_evaluation_jsonl implementation, tools are constructed from knowledge_base_ids, not passed directly. You may need to adjust the test assertions accordingly.


870-882: EvaluationRun no longer has a config field - use config_id and config_version.

Based on the EvaluationRun model, the config dict field has been replaced with config_id (UUID) and config_version (int). This test will fail at runtime because it's setting a non-existent field.

Proposed fix
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

905-917: Same issue: config field doesn't exist on EvaluationRun.

This test also uses the deprecated config dict field instead of config_id/config_version.

Proposed fix
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

656-660: Test setup and assertions need updating for KaapiLLMParams structure.

The test setup passes a plain dict with pre-formed tools, but build_evaluation_jsonl expects a KaapiLLMParams object with knowledge_base_ids field. Update the config to use knowledge_base_ids instead of pre-formed tools, and add assertions for the max_num_results field that is automatically included in the generated tools array (defaults to 20 if not specified).

Example correction for test setup:
config = KaapiLLMParams(
    model="gpt-4o-mini",
    instructions="Search documents",
    knowledge_base_ids=["vs_abc123"],
)

jsonl_data = build_evaluation_jsonl(dataset_items, config)

assert len(jsonl_data) == 1
request = jsonl_data[0]
assert request["body"]["tools"][0]["type"] == "file_search"
assert "vs_abc123" in request["body"]["tools"][0]["vector_store_ids"]
assert request["body"]["tools"][0]["max_num_results"] == 20
♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

526-533: Duplicate docstring in test method.

Lines 529 and 533 both contain docstrings for the same test method. This appears to be an editing artifact from the refactor - remove one of them.

Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

         """Test batch evaluation fails with invalid/non-existent dataset_id."""

Should become:

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

12-15: Duplicate import statement from test_data.

The file imports from app.tests.utils.test_data twice: create_test_config on line 13 and create_test_evaluation_dataset on line 15, with an unrelated import in between. Consolidate them into a single import statement.

Proposed fix
 from app.models import EvaluationDataset, EvaluationRun
 from app.models.llm.request import KaapiLLMParams
-from app.tests.utils.test_data import create_test_config
 from app.tests.utils.auth import TestAuthContext
-from app.tests.utils.test_data import create_test_evaluation_dataset
+from app.tests.utils.test_data import create_test_config, create_test_evaluation_dataset
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 355d281 and 9c71896.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (1)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-132)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (11)
backend/app/tests/crud/evaluations/test_processing.py (6)

279-295: LGTM! Test fixture correctly updated to use config_id and config_version.

The eval_run_with_batch fixture properly creates an evaluation run with the new config reference fields. The hardcoded UUID is acceptable for test purposes.


401-410: LGTM! Consistent config reference usage in test.

The test for processing evaluation without batch_job_id correctly uses the same config_id/config_version pattern.


469-485: LGTM! Embedding batch test fixture updated correctly.

The eval_run_with_embedding_batch fixture follows the same pattern as other fixtures.


618-632: LGTM! Check and process evaluation test updated.

The test correctly creates an evaluation run with config references for testing completed batch processing.


681-695: LGTM! Failed batch test fixture updated.

Consistent with other test fixtures in the file.


781-795: LGTM! Poll pending evaluations test updated.

The final test fixture correctly uses config_id/config_version for testing poll functionality.

backend/app/tests/api/routes/test_evaluation.py (5)

602-631: LGTM! Basic JSONL building test correctly uses KaapiLLMParams.

The test properly constructs a KaapiLLMParams instance and validates the output structure.


662-680: LGTM! Minimal config test uses KaapiLLMParams correctly.

Good coverage of the minimal required configuration.


682-711: LGTM! Empty questions skip test uses KaapiLLMParams correctly.

Proper testing of edge case handling.


713-738: LGTM! Multiple items test uses KaapiLLMParams correctly.

Good coverage of batch processing multiple items.


758-796: LGTM! Trace info tests properly updated to use config references.

Both test_get_evaluation_run_trace_info_not_completed and test_get_evaluation_run_trace_info_completed correctly create a test config and use config_id/config_version.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/crud/evaluations/test_processing.py`:
- Around lines 356-359: Remove the debug print statements that dump the result and
the mock call counts (the four print(...) lines referencing result,
mock_download.call_count, mock_fetch_dataset.call_count, and
mock_create_langfuse.call_count) from the test; the existing assertions already
verify the behavior, so delete those prints and keep the assertions unchanged to
avoid noisy test output.
♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

526-533: Duplicate docstring still present.

Lines 529 and 533 both contain docstrings for the same test method. This was flagged in a previous review and remains unresolved.

Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
+        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
 
-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
🧹 Nitpick comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

558-561: Remove unused variable invalid_config.

This dictionary is defined but never used. It appears to be a leftover from when the test passed an inline config instead of referencing a stored config via config_id.

Proposed fix
     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
-        # We don't need a real dataset for this test - the validation should happen
-        # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }
-
         response = client.post(

633-661: Commented test contains syntax errors.

When this test is uncommented, it will fail because:

  1. Lines 645-654 use dict literal syntax inside KaapiLLMParams() constructor instead of keyword arguments
  2. KaapiLLMParams doesn't have a tools parameter—it uses knowledge_base_ids which maps to file_search tools internally (per build_evaluation_jsonl)

If you plan to re-enable this test, consider:

config = KaapiLLMParams(
    model="gpt-4o-mini",
    instructions="Search documents",
    knowledge_base_ids=["vs_abc123"],  # This becomes file_search tool
)

871-879: Consider migrating remaining tests to use stored config.

This test still uses the embedded config={"model": "gpt-4o"} pattern while other tests in this PR have been updated to use config_id/config_version. Per the AI summary, this is pending migration work.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4d1896 and 3620fee.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (2)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
backend/app/tests/api/routes/test_evaluation.py (4)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-132)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (14)
backend/app/tests/crud/evaluations/test_processing.py (7)

16-16: LGTM!

The import of create_test_config follows the factory pattern for test fixtures as per coding guidelines.


259-296: LGTM!

The fixture correctly uses the new config_id/config_version pattern and follows the factory approach with proper type hints.


401-430: LGTM!

Test correctly updated to use the new config management pattern.


454-492: LGTM!

Fixture correctly updated to use config management pattern consistent with other test fixtures.


594-657: LGTM!

Test correctly updated to use the new config management pattern with proper fixture setup.


659-721: LGTM!

Test correctly updated for the failed batch scenario with new config management pattern.


759-821: LGTM!

Test correctly updated to use the new config management pattern for polling pending evaluations.

backend/app/tests/api/routes/test_evaluation.py (7)

4-15: LGTM!

Imports are correctly added and all are utilized in the test file.


613-631: LGTM!

Test correctly migrated from dict to KaapiLLMParams. Assertions properly verify the JSONL structure.


674-681: LGTM!

Minimal config test correctly uses KaapiLLMParams with only the required model parameter.


706-712: LGTM!

Test correctly validates that items with empty or missing questions are filtered out.


726-738: LGTM!

Test properly validates batch processing of multiple items with KaapiLLMParams.


768-784: LGTM!

Test correctly migrated to use stored config reference via config_id and config_version. The factory pattern with create_test_config aligns with coding guidelines.


808-827: LGTM!

Test correctly uses stored config reference and properly validates the completed evaluation response structure.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +356 to +359
print(f"Result: {result}")
print(f"Download called: {mock_download.call_count}")
print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
print(f"Create langfuse called: {mock_create_langfuse.call_count}")

⚠️ Potential issue | 🟡 Minor

Remove debug print statements.

These debug print statements should be removed before merging. The assertions below already validate the expected behavior, making these statements unnecessary test artifacts that pollute output.

🧹 Suggested fix
-        print(f"Result: {result}")
-        print(f"Download called: {mock_download.call_count}")
-        print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
-        print(f"Create langfuse called: {mock_create_langfuse.call_count}")
-
         assert result is not None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"Result: {result}")
print(f"Download called: {mock_download.call_count}")
print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
print(f"Create langfuse called: {mock_create_langfuse.call_count}")
assert result is not None
🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_processing.py` around lines 356-359,
remove the four debug print statements (the one printing the result and the three
referencing mock_download.call_count, mock_fetch_dataset.call_count, and
mock_create_langfuse.call_count) from the test; the existing assertions already
verify the behavior, so delete those prints and keep the assertions unchanged to
avoid noisy test output.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

552-581: Remove unused invalid_config variable.

The invalid_config dict defined at lines 558-561 is never used; it appears to be a leftover from the previous implementation that passed config as a dict. The test now correctly uses config_id with a non-existent UUID.

     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
         # We don't need a real dataset for this test - the validation should happen
         # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }

         response = client.post(
♻️ Duplicate comments (2)
backend/app/tests/api/routes/test_evaluation.py (1)

526-541: Duplicate docstring in test method.

Lines 529 and 533 both contain docstrings. Line 533 appears to be misplaced; it should be removed, since the docstring at line 529 already describes the test purpose.

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
backend/app/tests/crud/evaluations/test_processing.py (1)

356-359: Remove debug print statements.

These debug print statements should be removed before merging. The assertions below already validate the expected behavior, making these statements unnecessary test artifacts that pollute output.

         result = await process_completed_evaluation(
             eval_run=eval_run_with_batch,
             session=db,
             openai_client=mock_openai,
             langfuse=mock_langfuse,
         )
-        print(f"Result: {result}")
-        print(f"Download called: {mock_download.call_count}")
-        print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
-        print(f"Create langfuse called: {mock_create_langfuse.call_count}")

         assert result is not None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3620fee and 8d7db45.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧬 Code graph analysis (1)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (17)
backend/app/tests/api/routes/test_evaluation.py (9)

4-4: LGTM!

Import of uuid4 is correctly added for generating non-existent config IDs in tests.


12-13: LGTM!

Imports for KaapiLLMParams and create_test_config are properly added to support the config-based evaluation flow.


613-617: LGTM!

Correctly migrated to use KaapiLLMParams for typed configuration in JSONL building tests.


644-648: LGTM!

Properly uses KaapiLLMParams with knowledge_base_ids for testing tools configuration.


668-668: LGTM!

Minimal config test correctly uses KaapiLLMParams with only the required model parameter.


700-700: LGTM!

Test correctly uses KaapiLLMParams for the empty questions skipping test.


720-723: LGTM!

Multiple items test properly migrated to use KaapiLLMParams.


762-770: LGTM!

Test correctly creates a config via create_test_config and uses config_id/config_version when constructing the EvaluationRun instance.


802-810: LGTM!

Properly updated to use config-based approach consistent with the model changes.

backend/app/tests/crud/evaluations/test_processing.py (8)

16-16: LGTM!

Import of create_test_config is correctly added alongside existing create_test_evaluation_dataset.


263-263: LGTM!

Correctly creates a test config with use_kaapi_schema=True to match the evaluation run requirements.


280-289: LGTM!

The create_evaluation_run call correctly uses config_id and config_version parameters matching the updated function signature from core.py.


406-416: LGTM!

Test correctly creates config and uses config_id/config_version for the no batch job ID test case.


454-485: LGTM!

Fixture properly creates a config and uses the config-based approach for the embedding batch evaluation run.


607-635: LGTM!

Test correctly migrated to use config creation and config_id/config_version parameters.


670-699: LGTM!

Failed evaluation test properly uses the config-based pattern.


772-800: LGTM!

Poll pending evaluations test correctly creates config and uses config_id/config_version.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Labels

enhancement (New feature or request), ready-for-review


Development

Successfully merging this pull request may close these issues.

Add config management in Evals

5 participants