
Conversation

@avirajsingh7 avirajsingh7 (Collaborator) commented Dec 10, 2025

Summary

This change refactors the evaluation run process to use a stored configuration instead of an inline configuration dictionary. It adds config_id, config_version, and model fields to the evaluation run table, simplifying the evaluation flow and improving data integrity.
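A rough sketch of how the request payload changes under this refactor; only config_id, config_version, and dataset_id come from this PR's description, every other field name and value below is illustrative:

# Before: the caller embedded the evaluation config inline (illustrative shape).
old_payload = {
    "dataset_id": "6a1d3c2e-...",                 # hypothetical dataset UUID
    "config": {"model": "gpt-4o", "temperature": 0.2},
}

# After: the caller references a stored config; the backend resolves it at runtime
# and records config_id, config_version, and the resolved model on the run.
new_payload = {
    "dataset_id": "6a1d3c2e-...",
    "config_id": "9b1e7f40-...",                  # UUID of the stored config
    "config_version": 3,                          # integer version, >= 1
}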

Checklist

Before submitting a pull request, please ensure you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Summary by CodeRabbit

  • New Features

    • Evaluation API now accepts stored configuration references (config ID + version) and records the resolved model for each run.
    • Added "openai" as a recognized provider; embedding batches now use text-embedding-3-large.
  • Bug Fixes

    • Improved config resolution and validation with clearer error responses for missing/invalid configs.
    • Provider validation tightened to allowed OpenAI providers.
  • Chores

    • Database migration to store config_id and config_version on evaluation runs.
  • Tests

    • Tests updated to use config references and cover config-not-found/access scenarios.



coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Evaluation endpoints, CRUD, models, and tests now use stored configs referenced by config_id + config_version; configs are resolved at runtime to extract the LLM model. EvaluationRun stores config_id, config_version, and model. DB migration replaces embedded JSON config with a FK to config. Batch/embedding/provider flows updated.

Changes

Cohort / File(s) and summaries:

  • Evaluation Route & Public API (backend/app/api/routes/evaluation.py): Endpoint signature changed to accept config_id: UUID and config_version: int; resolves the stored config via ConfigVersionCrud/resolve_config_blob; enforces provider == OPENAI; returns validation/resolution errors; logging updated.
  • Evaluation CRUD & Model Resolution (backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/__init__.py): create_evaluation_run now records config_id/config_version (embedded config removed). Added and exported resolve_model_from_config(session, eval_run) to fetch and validate the model from the stored config (sketched after this list).
  • Evaluation Data Model & Migration (backend/app/models/evaluation.py, backend/app/alembic/versions/041_add_config_in_evals_run_table.py): Replaced the config JSONB column with config_id: UUID and config_version: int, and added model: str to EvaluationRun; the migration adds an FK to config.id and drops the legacy config column (downgrade restores the JSONB column).
  • Batch Construction & LLM Config Types (backend/app/crud/evaluations/batch.py): config parameter now typed as KaapiLLMParams; JSONL bodies explicitly map model, input, and optional instructions, temperature, reasoning, tools; batch start serializes config via model_dump(exclude_none=True).
  • Processing & Embeddings (backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py): Processing resolves the model via resolve_model_from_config and passes it downstream; embeddings now hard-code "text-embedding-3-large" (dynamic model selection removed).
  • LLM Provider Registry (backend/app/services/llm/providers/registry.py): Added LLMProvider.OPENAI = "openai" and mapped it to OpenAIProvider; get_llm_provider accepts both OPENAI and OPENAI_NATIVE.
  • Tests & Test Helpers (backend/app/tests/api/routes/test_evaluation.py, backend/app/tests/crud/evaluations/test_processing.py, backend/app/tests/...): Tests updated to create and use test configs (create_test_config) and pass config_id/config_version; fixtures and assertions adjusted for config-not-found/config-not-accessible semantics; many calls updated to the new create_evaluation_run signature.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API as Evaluation API
  participant Config as ConfigVersionCrud
  participant Eval as Evaluation CRUD
  participant Batch as Batch Service
  participant LLM as OpenAI Provider

  Client->>API: POST /evaluate (config_id, config_version, dataset_id...)
  API->>Config: resolve_config_blob(config_id, config_version)
  Config-->>API: config blob (KaapiLLMParams) or error
  API->>API: validate provider == "openai"
  API->>Eval: create_evaluation_run(config_id, config_version, metadata)
  Eval-->>API: EvaluationRun (id, config_id, config_version)
  API->>Batch: start_evaluation_batch(eval_run, config.completion.params)
  Batch->>LLM: enqueue/call provider
  Batch-->>Eval: update run status / attach batch ids
  API-->>Client: 200 Created (EvaluationRun) or 4xx on error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #498: Overlaps on LLM config types and provider handling (Kaapi params, transform/resolve functions, provider registry changes).
  • #405: Broad evaluation refactor touching evaluation models and workflows; likely overlaps with config/model resolution edits.
  • #488: Related test-suite enhancements and helpers that intersect with evaluation testing and fixtures.

Suggested reviewers

  • kartpop
  • AkhileshNegi

Poem

I hopped through configs, one by one, 🐇
Replaced raw dicts with ids and versioned fun,
I sniffed the model, resolved its name,
Queued batches, saved runs — migration game,
A rabbit celebrates — tests green, deploy begun! 🎉

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.42%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Evaluation: Use Config Management' accurately describes the main change: refactoring evaluation runs to use stored configuration management instead of inline config dictionaries.




📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d7db45 and afec624.

📒 Files selected for processing (1)
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/tests/**/*.py : Use factory pattern for test fixtures in `backend/app/tests/`

Applied to files:

  • backend/app/tests/crud/evaluations/test_processing.py
🧬 Code graph analysis (1)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (2)
  • create_test_evaluation_dataset (341-376)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (7)
backend/app/tests/crud/evaluations/test_processing.py (7)

16-16: Factory import aligns with test data pattern.

Nice update to pull in create_test_config for config setup.


263-288: Config-backed eval run setup looks good.

Using create_test_config and passing config_id/config_version keeps the fixture aligned with the new stored-config flow.


408-415: Config creation wired correctly.

Creating a test config and passing config_id/config_version keeps the test consistent with the new API.


459-484: Embedding batch fixture aligned with config storage.

The new config setup and config_id/config_version use are consistent with the refactor.


609-634: Completed-batch test updated appropriately.

Config creation and config_id/config_version usage look correct here.


672-698: Failed-batch test updated appropriately.

Config creation and config_id/config_version usage look correct here.


774-799: Pending-evaluations test updated appropriately.

Config creation and config_id/config_version usage look correct here.




@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added the enhancement (New feature or request) and ready-for-review labels Dec 10, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.


codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 55.00000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines (patch coverage / lines missing):
  • backend/app/crud/evaluations/core.py: 35.71%, 9 missing ⚠️
  • backend/app/api/routes/evaluation.py: 41.66%, 7 missing ⚠️
  • backend/app/crud/evaluations/embeddings.py: 0.00%, 1 missing ⚠️
  • backend/app/crud/evaluations/processing.py: 50.00%, 1 missing ⚠️


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.
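For reference, a minimal sketch of just the nullable public fields; the base class, defaults, and the rest of the model are elided and may differ in the actual code:

from uuid import UUID

from sqlmodel import SQLModel


class EvaluationRunPublic(SQLModel):
    # Nullable so existing rows created before this migration still validate.
    config_id: UUID | None = None
    config_version: int | None = None
    model: str | None = None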

@Prajna1999 Prajna1999 (Collaborator) left a comment

lgtm

@avirajsingh7 avirajsingh7 (Collaborator, Author) commented:

Hold merge until the frontend is ready.

@Prajna1999 Prajna1999 (Collaborator) commented:

Good to go. Can be merged.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

8-16: Critical: Missing Depends import.

The pipeline failure indicates NameError: name 'Depends' is not defined at line 130. The import statement for Depends from fastapi is missing.

🐛 Add missing import
 from fastapi import (
     APIRouter,
     Body,
+    Depends,
     File,
     Form,
     HTTPException,
     Query,
     UploadFile,
 )
🤖 Fix all issues with AI agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py:
- Line 20: The migration functions upgrade() and downgrade() lack return type
hints; update both function definitions (upgrade and downgrade) to include
explicit return types (e.g., change "def upgrade():" and "def downgrade():" to
"def upgrade() -> None:" and "def downgrade() -> None:") so they conform to the
project's typing guidelines.
- Around line 47-56: The downgrade currently re-adds the "config" column on the
"evaluation_run" table using op.add_column with sa.Column(..., nullable=False)
which will fail if rows exist; update that op.add_column call in the downgrade
to use nullable=True (or alternatively add a server_default or a prior data
migration to populate values before setting non-nullable), ensuring the column
is created nullable during downgrade to avoid PostgreSQL errors.

In @backend/app/api/routes/evaluation.py:
- Around line 505-509: The code references a non-existent constant
LLMProvider.OPENAI in the evaluation config validation, causing AttributeError;
update the check in evaluation.py (the block that raises HTTPException) to
compare against the actual provider string "openai" (i.e., use
config.completion.provider != "openai") or alternatively add a new constant
OPENAI = "openai" to the LLMProvider class in
backend/app/services/llm/providers/registry.py so the symbol exists and matches
tests; pick one approach and ensure the error message and tests remain
consistent with the chosen value.

In @backend/app/crud/evaluations/core.py:
- Around line 308-349: resolve_model_from_config currently declares returning
str but assigns model = config.completion.params.get("model") which may be None;
update resolve_model_from_config to validate that model is present and a str
(e.g., if not model: raise ValueError(...) with context including eval_run.id,
config_id, config_version) before returning, or coerce/choose a safe default
only if intended; reference the resolve_model_from_config function and the model
variable from config.completion.params.get("model") when implementing the check.

In @backend/app/crud/evaluations/processing.py:
- Around line 257-263: resolve_model_from_config currently uses
config.get("model") which can return None despite its str return annotation and
docstring promise; modify resolve_model_from_config to validate the retrieved
value and raise ValueError if missing (or alternatively change the function
signature to return str | None and update callers), e.g., after fetching model =
config.get("model") check if model is truthy and raise ValueError("missing model
in config") to enforce the contract so callers like
resolve_model_from_config(session=session, eval_run=eval_run) always receive a
str or an explicit None-aware type is used consistently.
🧹 Nitpick comments (1)
backend/app/alembic/versions/041_add_config_in_evals_run_table.py (1)

1-60: Consider a multi-step migration strategy for safer deployment.

Given the destructive nature of this schema change (dropping the config column) and the PR status ("hold merge - until frontend is ready"), consider deploying this as a multi-phase migration (a sketch of a phase-1 migration follows the list below):

Phase 1: Add new columns without dropping old ones

  • Add config_id and config_version (nullable)
  • Add foreign key constraint
  • Deploy application code that writes to both old and new columns

Phase 2: Backfill existing data

  • Create a data migration script to populate config_id/config_version from existing config JSONB
  • Validate data integrity

Phase 3: Cut over

  • Deploy application code that only uses new columns
  • Monitor for issues

Phase 4: Cleanup

  • Drop the old config column in a subsequent migration

This approach provides:

  • Zero downtime deployment
  • Easy rollback at each phase
  • Data preservation and validation
  • Safer production deployment
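A minimal sketch of what a phase-1 (expand-only) migration could look like, assuming the column and FK shapes shown later in this review; revision identifiers are placeholders:

# Phase 1 sketch: add the new columns and FK without dropping the legacy config column.
import sqlalchemy as sa
from alembic import op

revision = "041"        # placeholder
down_revision = "040"   # placeholder


def upgrade() -> None:
    op.add_column(
        "evaluation_run",
        sa.Column("config_id", sa.Uuid(), nullable=True),
    )
    op.add_column(
        "evaluation_run",
        sa.Column("config_version", sa.Integer(), nullable=True),
    )
    op.create_foreign_key(
        "fk_evaluation_run_config_id", "evaluation_run", "config", ["config_id"], ["id"]
    )
    # Deliberately no op.drop_column("evaluation_run", "config") here; that would
    # happen in a later cleanup migration after backfill and cutover.


def downgrade() -> None:
    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")
    op.drop_column("evaluation_run", "config_version")
    op.drop_column("evaluation_run", "config_id")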
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eda7762 and 31d9523.

📒 Files selected for processing (8)
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/core.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/models/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/models/evaluation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/alembic/versions/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1

Files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/alembic/versions/*.py : Generate database migrations using `alembic revision --autogenerate -m "Description" --rev-id <number>` where rev-id is the latest existing revision ID + 1

Applied to files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (5)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (239-302)
backend/app/api/routes/evaluation.py (6)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
backend/app/services/llm/providers/registry.py (1)
  • LLMProvider (14-41)
backend/app/utils.py (4)
  • APIResponse (33-57)
  • get_langfuse_client (212-248)
  • get_openai_client (179-209)
  • load_description (393-398)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
backend/app/crud/evaluations/__init__.py (1)
backend/app/crud/evaluations/core.py (1)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/processing.py (2)
backend/app/crud/evaluations/core.py (2)
  • update_evaluation_run (154-206)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (21-164)
backend/app/crud/evaluations/core.py (3)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
🪛 GitHub Actions: Kaapi CI
backend/app/api/routes/evaluation.py

[error] 130-130: NameError: name 'Depends' is not defined.

🔇 Additional comments (4)
backend/app/crud/evaluations/__init__.py (1)

8-8: LGTM!

The new resolve_model_from_config function is correctly imported and exported for public use.

Also applies to: 43-43

backend/app/tests/api/routes/test_evaluation.py (1)

3-3: LGTM!

The test updates correctly reflect the shift from inline config dictionaries to stored config references. The use of create_test_config factory function aligns with the coding guidelines for test fixtures, and the error scenarios properly test config-not-found cases.

Also applies to: 10-10, 499-545, 728-803

backend/app/api/routes/evaluation.py (1)

492-509: Verify config resolution error handling covers all failure modes.

The config resolution logic handles errors from resolve_config_blob and validates the provider, but ensure that:

  1. Config version not found scenarios are properly handled
  2. Invalid/corrupted config blobs are caught
  3. The provider validation matches actual config schemas used in production
backend/app/crud/evaluations/core.py (1)

66-69: LGTM!

The logging statement correctly follows the coding guideline format with function context and includes the new config_id and config_version fields.

depends_on = None


def upgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Both upgrade() and downgrade() functions are missing return type hints.

As per coding guidelines, all functions should have type hints.

📝 Proposed fix
-def upgrade():
+def upgrade() -> None:
-def downgrade():
+def downgrade() -> None:

Also applies to: 45-45

🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py at line
20, The migration functions upgrade() and downgrade() lack return type hints;
update both function definitions (upgrade and downgrade) to include explicit
return types (e.g., change "def upgrade():" and "def downgrade():" to "def
upgrade() -> None:" and "def downgrade() -> None:") so they conform to the
project's typing guidelines.

Comment on lines +22 to +41
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_id",
            sa.Uuid(),
            nullable=True,
            comment="Reference to the stored config used",
        ),
    )
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_version",
            sa.Integer(),
            nullable=True,
            comment="Version of the config used",
        ),
    )
    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    op.drop_column("evaluation_run", "config")

⚠️ Potential issue | 🔴 Critical

Critical: Data loss and foreign key constraint naming issues.

This migration has several critical problems:

  1. Data loss: Line 41 drops the config column without migrating existing data to the new config_id/config_version columns. Any existing evaluation runs will lose their configuration data permanently.

  2. Foreign key constraint naming: Line 40 creates a foreign key with None as the constraint name, causing Alembic to auto-generate a name. However, the downgrade function (Line 57) also uses None to drop the constraint, which won't match the auto-generated name and will fail.

Required actions:

  1. Add a data migration step before dropping the config column. You'll need to:

    • Parse each existing config JSONB object
    • Look up or create corresponding config records with appropriate versions
    • Update config_id and config_version for each evaluation_run
    • Or, if data migration isn't feasible, add a comment explaining why data loss is acceptable
  2. Specify an explicit constraint name instead of None:

🔧 Proposed fix for FK constraint naming
-    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
+    op.create_foreign_key(
+        "fk_evaluation_run_config_id", 
+        "evaluation_run", 
+        "config", 
+        ["config_id"], 
+        ["id"]
+    )

And update the downgrade:

-    op.drop_constraint(None, "evaluation_run", type_="foreignkey")
+    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +47 to +56
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config",
            postgresql.JSONB(astext_type=sa.Text()),
            autoincrement=False,
            nullable=False,
            comment="Evaluation configuration (model, instructions, etc.)",
        ),
    )

⚠️ Potential issue | 🔴 Critical

Critical: Downgrade will fail with existing data.

The downgrade re-adds the config column with nullable=False (Line 53). If the evaluation_run table contains any records when downgrading, this operation will fail because PostgreSQL cannot add a non-nullable column to a table with existing rows without specifying a default value.

Either:

  1. Make the column nullable during downgrade: nullable=True
  2. Provide a server default value
  3. Add a data migration to populate the column before setting it non-nullable
🔧 Proposed fix (Option 1: Make nullable)
     op.add_column(
         "evaluation_run",
         sa.Column(
             "config",
             postgresql.JSONB(astext_type=sa.Text()),
             autoincrement=False,
-            nullable=False,
+            nullable=True,
             comment="Evaluation configuration (model, instructions, etc.)",
         ),
     )
🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py around
lines 47 - 56, The downgrade currently re-adds the "config" column on the
"evaluation_run" table using op.add_column with sa.Column(..., nullable=False)
which will fail if rows exist; update that op.add_column call in the downgrade
to use nullable=True (or alternatively add a server_default or a prior data
migration to populate values before setting non-nullable), ensuring the column
is created nullable during downgrade to avoid PostgreSQL errors.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

527-534: Fix type incompatibility in start_evaluation_batch call.

The function expects config: dict[str, Any], but config.completion.params could be either dict[str, Any] (NativeCompletionConfig) or KaapiLLMParams (KaapiCompletionConfig). Update the function signature to accept both types:

config: dict[str, Any] | KaapiLLMParams

Alternatively, convert config.completion.params to a dict before passing it if the function strictly requires a dict.
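A sketch of the normalization option, assuming the batch helper keeps a dict-based contract; the import path is inferred from the backend/app/models/llm/request.py layout and is not verified:

from typing import Any

from app.models.llm.request import KaapiLLMParams  # assumed import path


def as_batch_config(params: dict[str, Any] | KaapiLLMParams) -> dict[str, Any]:
    """Return a plain dict suitable for a dict-typed start_evaluation_batch."""
    if isinstance(params, KaapiLLMParams):
        # Matches the model_dump(exclude_none=True) serialization noted in the walkthrough.
        return params.model_dump(exclude_none=True)
    return params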

🤖 Fix all issues with AI agents
In @backend/app/api/routes/evaluation.py:
- Around line 520-524: The attribute access is inconsistent: replace
auth_context.organization.id and auth_context.project.id with the existing
variables organization_.id and project_.id to match the rest of the file; update
the call that currently passes config_id=config_id,
config_version=config_version, organization_id=auth_context.organization.id,
project_id=auth_context.project.id to use organization_.id and project_.id
(keeping the same parameter names) so all references are consistent with the
file’s use of organization_ and project_.
- Around line 493-495: Replace direct nullable attribute access to AuthContext
with the non-null asserting properties: update usages of auth_context.project.id
to auth_context.project_.id and auth_context.organization.id to
auth_context.organization_.id (e.g., where ConfigVersionCrud is instantiated as
config_version_crud = ConfigVersionCrud(session=_session, config_id=config_id,
project_id=auth_context.project.id) change to use project_); scan the rest of
this module for any other auth_context.project or auth_context.organization
usages and switch them to project_ and organization_ to avoid potential
AttributeError when those attributes are None.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31d9523 and 6b00e0f.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/api/routes/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/batch.py (1)

104-119: Filter out None values from request body to avoid API errors.

The body construction includes fields that may be None (e.g., instructions, temperature, reasoning). Passing explicit None values to the OpenAI API may cause errors, as some endpoints don't accept null for optional fields.

Proposed fix
         batch_request = {
             "custom_id": item["id"],
             "method": "POST",
             "url": "/v1/responses",
-            "body": {
-                # Use config as-is
-                "model": config.model,
-                "instructions": config.instructions,
-                "temperature": config.temperature,
-                "reasoning": {"effort": config.reasoning} if config.reasoning else None,
-                "tools": config.knowledge_base_ids if config.knowledge_base_ids else [],
-                "input": question,  # Add input from dataset
-            },
+            "body": {
+                k: v for k, v in {
+                    "model": config.model,
+                    "instructions": config.instructions,
+                    "temperature": config.temperature,
+                    "reasoning": {"effort": config.reasoning} if config.reasoning else None,
+                    "tools": config.knowledge_base_ids if config.knowledge_base_ids else None,
+                    "input": question,
+                }.items() if v is not None
+            },
         }
🤖 Fix all issues with AI agents
In @backend/app/crud/evaluations/batch.py:
- Around line 108-116: The body currently sets "tools" to
config.knowledge_base_ids (a list of IDs) but OpenAI expects tool objects;
update the body construction in batch.py to use the mapper or construct the
required tool object: call map_kaapi_to_openai_params(config) or transform
config.knowledge_base_ids into [{"type":"file_search","vector_store_ids":
config.knowledge_base_ids, "max_num_results": config.max_num_results or 20}] and
assign that to "tools" so the payload matches the OpenAI Responses API format
(refer to map_kaapi_to_openai_params in backend/app/services/llm/mappers.py and
the body assembly where "tools" is set).

In @backend/app/crud/evaluations/core.py:
- Around line 342-348: The code directly accesses config.completion.params.model
which can raise AttributeError if completion or params are missing; update the
block that follows the config resolution (around the use of eval_run and config)
to defensively validate these nested attributes: verify config has a non-None
completion and that completion has non-None params and a model (e.g., using
explicit checks or getattr with defaults), and if any are missing raise a
ValueError that includes eval_run.id, eval_run.config_id,
eval_run.config_version and a clear message about the missing field(s); ensure
subsequent uses of model assume it is valid.

In @backend/app/services/audio/speech_to_text.py:
- Around line 610-625: The __main__ test block contains developer-specific
hardcoded file paths and should be removed or guarded; delete or replace the if
__name__ == "__main__": section (which calls
transcribe_audio_with_indic_conformer with ai4b_file_path) or wrap it behind a
non-committed debug flag/CLI using argparse or environment variables, or move
the scenario into a proper test/integration test that uses fixtures or temp
files; ensure no absolute local paths remain and any manual-run examples are
documented separately rather than committed in the module.
- Around line 53-66: In the __init__ of the speech-to-text class fix the typo
and clarify the error: replace the incorrect attribute assignment self.provide =
provider with self.provider = provider, and change the API-key missing
ValueError message ("Missing OpenAI API Key for Client STT Client
initialization") to a provider-agnostic message like "Missing API key for
speech-to-text client initialization" so it no longer references only OpenAI;
keep the rest of the provider branching (provider == "openai" / provider ==
"gemini") and the unsupported-provider ValueError as-is.
- Around line 9-13: Remove the duplicate Literal import and modernize typing:
keep only the needed typing imports (e.g., BinaryIO) and remove the second
"Literal" import from the top of speech_to_text.py, then replace all usages of
List[...] and Dict[...] in function signatures, class BaseModel fields, and
return types (e.g., any occurrences in functions like transcribe_* or classes
that reference lists/dicts) with the Python 3.9+ built-in generics list[...] and
dict[...]; ensure annotations still import BinaryIO and, if Literal is required,
import it only once (or use typing.Literal) and run a quick lint/type-check to
confirm no remaining deprecated typing usages.
- Around line 76-92: The code opens audio_file when it's a string but never
closes it; change the handling around isinstance(audio_file, str) to open the
file with a context manager or track the opened file in a local variable and
ensure it's closed in a finally block so that the file handle is always released
even on exceptions; keep using self.openai_client.audio.transcriptions.create
for transcription and return the result, and update the logger calls
(logger.info and logger.error) to prefix messages with the function/method name
(e.g., the enclosing method name) per guidelines so logs include that context.

In @backend/app/services/llm/providers/registry.py:
- Around line 16-23: The registry currently includes the OPENAI constant but
get_llm_provider only recognizes OPENAI_NATIVE, causing a runtime ValueError;
fix by either removing the OPENAI entry from the _registry dict so only
OPENAI_NATIVE maps to OpenAIProvider, or update get_llm_provider to accept the
OPENAI symbol as an alias (handle provider_type == OPENAI by mapping it to
OpenAIProvider) and ensure the registry check and the ValueError logic align;
refer to OPENAI, OPENAI_NATIVE, _registry, and get_llm_provider when making the
change.
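A sketch of the alias option inside registry.py; the real get_llm_provider signature and return type may differ, and this only illustrates mapping both enum members to the same provider class:

# Sketch within registry.py; LLMProvider, BaseProvider, and OpenAIProvider are the
# names referenced elsewhere in this review.
_registry: dict[LLMProvider, type[BaseProvider]] = {
    LLMProvider.OPENAI: OpenAIProvider,          # "openai" (alias)
    LLMProvider.OPENAI_NATIVE: OpenAIProvider,   # existing entry
}


def get_llm_provider(provider_type: LLMProvider) -> type[BaseProvider]:
    provider_cls = _registry.get(provider_type)
    if provider_cls is None:
        raise ValueError(f"[get_llm_provider] Unsupported provider: {provider_type}")
    return provider_cls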
🧹 Nitpick comments (4)
backend/app/services/audio/speech_to_text.py (2)

316-323: Use tempfile module instead of hardcoded /tmp path.

The hardcoded /tmp path is not portable (e.g., Windows). Consider using tempfile.NamedTemporaryFile for cross-platform compatibility.

Proposed fix
+import tempfile
 ...
-    media_extention = (
+    media_extension = (
         file_data.media_type.split("/")[-1] if file_data.media_type else ".ogg"
     )
-    temp_file = f"/tmp/{uuid4()}.{media_extention}"
+    temp_file = tempfile.NamedTemporaryFile(
+        suffix=f".{media_extension}", delete=False
+    ).name

52-67: Add type hints to all function parameters and return values.

Per coding guidelines, all functions should have type hints. Several functions in this file are missing return type annotations. For example:

Example fixes
-def __init__(self, provider, api_key: str | None = None):
+def __init__(self, provider: str, api_key: str | None = None) -> None:

-def transcribe_with_openai(
+def transcribe_with_openai(
     self,
     audio_file: BinaryIO | str,
     model: str = "gpt-4o-transcribe",
     prompt: str | None = None,
-):
+) -> str:

-def transcribe_audio_with_chirp_v3(audio_file_path: str):
+def transcribe_audio_with_chirp_v3(audio_file_path: str) -> str | None:

-def transcribe_audio_with_indic_conformer(audio_file_path: str):
+def transcribe_audio_with_indic_conformer(audio_file_path: str) -> str:

Apply similar changes to other functions missing return type hints.

backend/app/crud/evaluations/core.py (2)

61-69: Add function name prefix to log messages.

Per coding guidelines, log messages should be prefixed with the function name in square brackets.

Proposed fix
     except Exception as e:
         session.rollback()
-        logger.error(f"Failed to create EvaluationRun: {e}", exc_info=True)
+        logger.error(f"[create_evaluation_run] Failed to create EvaluationRun: {e}", exc_info=True)
         raise

-    logger.info(
-        f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
+    logger.info(
+        f"[create_evaluation_run] Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
         f"config_id={config_id}, config_version={config_version}"
     )

308-311: Return type hint verified.

The signature already declares -> str:, so the earlier concern about a missing return type annotation does not apply here. The function implementation looks good overall.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b00e0f and ceb3970.

📒 Files selected for processing (4)
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/audio/speech_to_text.py
  • backend/app/services/llm/providers/registry.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/llm/providers/registry.py
  • backend/app/services/audio/speech_to_text.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/llm/providers/registry.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/audio/speech_to_text.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/services/llm/providers/registry.py
🧬 Code graph analysis (4)
backend/app/services/llm/providers/registry.py (2)
backend/app/services/llm/providers/base.py (1)
  • BaseProvider (13-63)
backend/app/services/llm/providers/openai.py (1)
  • OpenAIProvider (21-110)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/crud/evaluations/core.py (5)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (318-557)
backend/app/models/evaluation.py (1)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
backend/app/services/audio/speech_to_text.py (1)
backend/app/core/providers.py (1)
  • ProviderConfig (18-21)
🪛 Ruff (0.14.10)
backend/app/services/audio/speech_to_text.py

13-13: typing.List is deprecated, use list instead

(UP035)


13-13: typing.Dict is deprecated, use dict instead

(UP035)


13-13: Redefinition of unused Literal from line 9

Remove definition: Literal

(F811)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (2)
backend/app/crud/evaluations/batch.py (1)

63-65: LGTM on type change.

The signature change from dict[str, Any] to KaapiLLMParams provides better type safety and aligns with the PR's goal of using stored configurations.

backend/app/crud/evaluations/core.py (1)

18-27: LGTM on signature change.

The refactor from config: dict to config_id: UUID, config_version: int properly implements the PR objective of using stored configurations instead of raw config dictionaries.

Comment on lines 9 to 13
from typing import BinaryIO, Literal
from pydantic import BaseModel
from dotenv import load_dotenv
from uuid import uuid4
from typing import List, Literal, Dict

⚠️ Potential issue | 🟡 Minor

Fix duplicate import and use modern type hints.

Literal is imported twice (lines 9 and 13), and List/Dict from typing are deprecated in Python 3.9+.

Proposed fix
-from typing import BinaryIO, Literal
+from typing import BinaryIO
 from pydantic import BaseModel
 from dotenv import load_dotenv
 from uuid import uuid4
-from typing import List, Literal, Dict
+from typing import Literal

Then replace List[...] with list[...] and Dict[...] with dict[...] throughout the file.

🧰 Tools
🪛 Ruff (0.14.10)

13-13: typing.List is deprecated, use list instead

(UP035)


13-13: typing.Dict is deprecated, use dict instead

(UP035)


13-13: Redefinition of unused Literal from line 9

Remove definition: Literal

(F811)

🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 9 - 13, Remove the
duplicate Literal import and modernize typing: keep only the needed typing
imports (e.g., BinaryIO) and remove the second "Literal" import from the top of
speech_to_text.py, then replace all usages of List[...] and Dict[...] in
function signatures, class BaseModel fields, and return types (e.g., any
occurrences in functions like transcribe_* or classes that reference
lists/dicts) with the Python 3.9+ built-in generics list[...] and dict[...];
ensure annotations still import BinaryIO and, if Literal is required, import it
only once (or use typing.Literal) and run a quick lint/type-check to confirm no
remaining deprecated typing usages.

Comment on lines 53 to 66
def __init__(self, provider, api_key: str | None = None):
    if api_key is None:
        raise ValueError(
            "Missing OpenAI API Key for Client STT Client initialization"
        )
    self.provide = provider
    self.openai_client = None
    self.gemini_client = None
    if provider == "openai":
        self.openai_client = OpenAI(api_key=api_key)
    elif provider == "gemini":
        self.gemini_client = genai.Client(api_key=api_key)
    else:
        raise ValueError("This provider is not supported yet.")

⚠️ Potential issue | 🔴 Critical

Typo and misleading error message in __init__.

  1. Line 58: self.provide = provider should be self.provider = provider
  2. Lines 54-57: Error message mentions "OpenAI" but this class supports multiple providers
Proposed fix
 def __init__(self, provider, api_key: str | None = None):
     if api_key is None:
         raise ValueError(
-            "Missing OpenAI API Key for Client STT Client initialization"
+            f"Missing API Key for {provider} STT Client initialization"
         )
-    self.provide = provider
+    self.provider = provider
🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 53 - 66, In the
__init__ of the speech-to-text class fix the typo and clarify the error: replace
the incorrect attribute assignment self.provide = provider with self.provider =
provider, and change the API-key missing ValueError message ("Missing OpenAI API
Key for Client STT Client initialization") to a provider-agnostic message like
"Missing API key for speech-to-text client initialization" so it no longer
references only OpenAI; keep the rest of the provider branching (provider ==
"openai" / provider == "gemini") and the unsupported-provider ValueError as-is.

Comment on lines 76 to 92
    try:
        # Handle file path vs file object
        if isinstance(audio_file, str):
            audio_file = open(audio_file, "rb")

        transcription = self.openai_client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            # language="hi",
            response_format="text",
            prompt=prompt or DEFAULT_PROMPT,
        )
        logger.info(f"Successfully transcribed audio using OpenAI model: {model}")
        return transcription
    except Exception as e:
        logger.error(f"OpenAI transcription failed: {str(e)}", exc_info=True)
        raise

⚠️ Potential issue | 🟠 Major

Resource leak: file handle not closed when path is provided.

When audio_file is a string path, the file is opened at line 79 but never closed, causing a resource leak. Also, log messages should be prefixed with the function name per coding guidelines.

Proposed fix
 def transcribe_with_openai(
     self,
     audio_file: BinaryIO | str,
     model: str = "gpt-4o-transcribe",
     prompt: str | None = None,
 ):
     if self.openai_client is None:
         raise ValueError("OpenAI client is not initialized.")
+    should_close = False
     try:
         # Handle file path vs file object
         if isinstance(audio_file, str):
             audio_file = open(audio_file, "rb")
+            should_close = True

         transcription = self.openai_client.audio.transcriptions.create(
             model=model,
             file=audio_file,
             # language="hi",
             response_format="text",
             prompt=prompt or DEFAULT_PROMPT,
         )
-        logger.info(f"Successfully transcribed audio using OpenAI model: {model}")
+        logger.info(f"[transcribe_with_openai] Successfully transcribed audio using OpenAI model: {model}")
         return transcription
     except Exception as e:
-        logger.error(f"OpenAI transcription failed: {str(e)}", exc_info=True)
+        logger.error(f"[transcribe_with_openai] OpenAI transcription failed: {str(e)}", exc_info=True)
         raise
+    finally:
+        if should_close and hasattr(audio_file, 'close'):
+            audio_file.close()
🤖 Prompt for AI Agents
In @backend/app/services/audio/speech_to_text.py around lines 76 - 92: the code
opens audio_file when it's a string but never closes it; change the handling
around isinstance(audio_file, str) to open the file with a context manager or
track the opened file in a local variable and ensure it's closed in a finally
block so that the file handle is always released even on exceptions; keep using
self.openai_client.audio.transcriptions.create for transcription and return the
result, and update the logger calls (logger.info and logger.error) to prefix
messages with the function/method name (e.g., the enclosing method name) per
guidelines so logs include that context.
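
If the context-manager route is preferred over manual close tracking, the same method could be written roughly like this (a sketch only; the method signature, DEFAULT_PROMPT, and logger are taken from the snippet above, and the surrounding class context is assumed):

def transcribe_with_openai(
    self,
    audio_file: BinaryIO | str,
    model: str = "gpt-4o-transcribe",
    prompt: str | None = None,
):
    if self.openai_client is None:
        raise ValueError("OpenAI client is not initialized.")

    def _call(fh: BinaryIO):
        # Single call site shared by the path and file-object branches
        return self.openai_client.audio.transcriptions.create(
            model=model,
            file=fh,
            response_format="text",
            prompt=prompt or DEFAULT_PROMPT,
        )

    try:
        if isinstance(audio_file, str):
            # A path was given: the with-block closes the handle for us
            with open(audio_file, "rb") as fh:
                transcription = _call(fh)
        else:
            transcription = _call(audio_file)
        logger.info(f"[transcribe_with_openai] Successfully transcribed audio using OpenAI model: {model}")
        return transcription
    except Exception as e:
        logger.error(f"[transcribe_with_openai] OpenAI transcription failed: {e}", exc_info=True)
        raise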

Comment on lines 178 to 190
def transcribe_audio_with_indic_conformer(audio_file_path: str):
    indic_conformer_api_url = str(os.getenv("AI4B_STT_URL"))
    with open(audio_file_path, "rb") as file:
        audio_content = file.read()

    response = requests.post(
        url=indic_conformer_api_url,
        data={"language_code": "hi", "decoding_strategy": "ctc"},
        files={"audio_file": audio_content},
    )
    logger.info(response.json())
    transcription = response.json()["transcription"]
    return transcription

⚠️ Potential issue | 🟠 Major

Missing timeout and error handling for external API call.

The requests.post call has no timeout, which can cause the application to hang indefinitely if the AI4B service is unresponsive. Additionally, there's no error handling for HTTP errors or JSON parsing failures.

Proposed fix
 def transcribe_audio_with_indic_conformer(audio_file_path: str):
     indic_conformer_api_url = str(os.getenv("AI4B_STT_URL"))
+    if not indic_conformer_api_url:
+        raise ValueError("AI4B_STT_URL environment variable not set")
+    
     with open(audio_file_path, "rb") as file:
         audio_content = file.read()

     response = requests.post(
         url=indic_conformer_api_url,
         data={"language_code": "hi", "decoding_strategy": "ctc"},
         files={"audio_file": audio_content},
+        timeout=60,
     )
-    logger.info(response.json())
-    transcription = response.json()["transcription"]
+    response.raise_for_status()
+    result = response.json()
+    logger.info(f"[transcribe_audio_with_indic_conformer] Response received: {result}")
+    transcription = result["transcription"]
     return transcription
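
On the caller side, the added timeout and raise_for_status() make failures explicit; a small, hypothetical usage sketch of how they might be handled (safe_transcribe is not an existing function in the repo):

import requests

def safe_transcribe(audio_file_path: str) -> str | None:
    try:
        return transcribe_audio_with_indic_conformer(audio_file_path)
    except requests.exceptions.Timeout:
        logger.error("[safe_transcribe] AI4B STT request timed out")
    except requests.exceptions.HTTPError as e:
        logger.error(f"[safe_transcribe] AI4B STT returned an error status: {e}")
    except (KeyError, ValueError) as e:
        # Covers a missing 'transcription' key or an unparsable JSON body
        logger.error(f"[safe_transcribe] Unexpected AI4B STT response: {e}")
    return None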

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/app/crud/evaluations/batch.py (2)

63-91: Docstring is inconsistent with the new parameter type.

The config parameter is now KaapiLLMParams, but the docstring at line 77 still describes it as "Evaluation configuration dict with OpenAI Responses API parameters." Update the docstring to reflect the typed model and its actual fields.

📝 Suggested docstring update
     Args:
         dataset_items: List of dataset items from Langfuse
-        config: Evaluation configuration dict with OpenAI Responses API parameters.
-            This config is used as-is in the body, with only "input" being added
-            from the dataset. Config can include any fields like:
-            - model (required)
-            - instructions
-            - tools
-            - reasoning
-            - text
-            - temperature
-            - include
-            etc.
+        config: KaapiLLMParams containing LLM configuration parameters including
+            model, instructions, knowledge_base_ids, reasoning, temperature,
+            and max_num_results.

     Returns:
         List of dictionaries (JSONL data)

142-148: Docstring is inconsistent with the new parameter type.

The config parameter docstring at line 147 still describes it as "Evaluation configuration dict" but the type is now KaapiLLMParams.

📝 Suggested docstring update
     Args:
         langfuse: Configured Langfuse client
         openai_client: Configured OpenAI client
         session: Database session
         eval_run: EvaluationRun database object (with run_name, dataset_name, config)
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams containing LLM configuration parameters
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceb3970 and ebdda81.

📒 Files selected for processing (1)
  • backend/app/crud/evaluations/batch.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/batch.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (2)
backend/app/crud/evaluations/batch.py (2)

20-20: LGTM!

The import of KaapiLLMParams is appropriate and aligns with the refactoring to use typed configuration objects instead of raw dictionaries.


176-182: LGTM!

Good use of model_dump(exclude_none=True) to serialize the config without None values. This ensures clean storage of the evaluation configuration.
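
For readers less familiar with Pydantic v2, a quick standalone illustration of what exclude_none=True does (ExampleParams is a made-up model, not the project's KaapiLLMParams):

from pydantic import BaseModel

class ExampleParams(BaseModel):
    model: str
    instructions: str | None = None
    temperature: float | None = None

params = ExampleParams(model="gpt-4o", temperature=0.2)
print(params.model_dump())
# {'model': 'gpt-4o', 'instructions': None, 'temperature': 0.2}
print(params.model_dump(exclude_none=True))
# {'model': 'gpt-4o', 'temperature': 0.2}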

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/tests/api/routes/test_evaluation.py (4)

510-540: Critical: Corrupted method definition causing syntax error.

This method has a duplicated/malformed signature that's causing the CI pipeline to fail with a Black parsing error. Lines 510-517 start defining the method, but then lines 518-522 appear to be remnants of an old or conflicting definition with type hints.

The method body also has inconsistencies - it references sample_evaluation_config fixture in the old signature but never uses it.

🐛 Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
-        self, client, user_api_key_header, db, user_api_key
-    ):
-        """Test batch evaluation fails with invalid dataset_id."""
-        # Create a valid config to use
-        config = create_test_config(db, project_id=user_api_key.project.id)
-
-        # Try to start evaluation with non-existent dataset_id
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
-        sample_evaluation_config: dict[str, Any],
+        db: Session,
+        user_api_key: TestAuthContext,
     ) -> None:
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
+        # Create a valid config to use
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         response = client.post(
             "/api/v1/evaluations",
             json={
                 "experiment_name": "test_evaluation_run",
                 "dataset_id": 99999,  # Non-existent
                 "config_id": str(config.id),
                 "config_version": 1,
             },
             headers=user_api_key_header,
         )

845-854: Inconsistent: Test still uses deprecated config field.

This test (test_get_evaluation_run_without_trace_info) creates an EvaluationRun with the old config dict field instead of the new config_id/config_version pattern used in other updated tests. Based on the AI summary and related model changes, the schema now expects config_id and config_version fields.

🔧 Suggested fix for consistency
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

880-888: Inconsistent: Another test using deprecated config field.

Same issue as above - test_get_evaluation_run_resync_without_trace_info_fails still uses the old config dict pattern. Should be updated to use config_id/config_version for consistency with the refactored schema.

🔧 Suggested fix for consistency
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

577-654: Type mismatch: Tests pass dict but build_evaluation_jsonl expects KaapiLLMParams.

The function at backend/app/crud/evaluations/batch.py (lines 63-65) declares config: KaapiLLMParams, and its implementation (lines 110-122) accesses attributes directly: config.model, config.instructions, config.temperature, config.reasoning, config.knowledge_base_ids, config.max_num_results.

Plain dicts do not support attribute access, so these tests would fail at runtime with AttributeError: 'dict' object has no attribute 'model'. Construct KaapiLLMParams instances instead:

from app.models.llm import KaapiLLMParams

config = KaapiLLMParams(
    model="gpt-4o",
    temperature=0.2,
    instructions="You are a helpful assistant",
)
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

542-572: Missing type hints on method parameters.

Per coding guidelines, all function parameters and return values should have type hints. The client and user_api_key_header parameters are missing type annotations.

Also, the test docstring says "model is missing from config" but the test now validates config-not-found since it uses config_id/config_version. Consider updating the docstring to reflect the actual test behavior.

♻️ Suggested improvement
     def test_start_batch_evaluation_missing_model(
-        self, client: TestClient, user_api_key_header: dict[str, str]
+        self,
+        client: TestClient,
+        user_api_key_header: dict[str, str],
     ) -> None:
-        """Test batch evaluation fails when model is missing from config."""
-        # We don't need a real dataset for this test - the validation should happen
-        # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }
-
+        """Test batch evaluation fails when config_id references non-existent config."""
         response = client.post(
             "/api/v1/evaluations",
             json={
                 "experiment_name": "test_no_config",
                 "dataset_id": 1,  # Dummy ID, config validation happens first
                 "config_id": str(uuid4()),  # Non-existent config
                 "config_version": 1,
             },
             headers=user_api_key_header,
         )

Note: The invalid_config variable on lines 548-551 is now dead code and should be removed.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebdda81 and f9e77ff.

📒 Files selected for processing (1)
  • backend/app/tests/api/routes/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (1)
backend/app/tests/api/routes/test_evaluation.py (5)
backend/app/tests/crud/documents/test_doc_transformation_job.py (1)
  • crud (24-25)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (63-126)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/tests/utils/document.py (2)
  • project (65-66)
  • get (118-122)
🪛 GitHub Actions: Kaapi CI
backend/app/tests/api/routes/test_evaluation.py

[error] 519-519: Black formatting failed. Cannot parse: 519:26: client: TestClient,

🔇 Additional comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

4-4: LGTM: New imports support config-based testing.

The additions of uuid4 and create_test_config properly support the refactored test cases that now use stored configs with config_id/config_version instead of inline config dicts.

Also applies to: 12-12


742-758: LGTM: Correctly updated to use config_id/config_version.

The test now properly creates a test config and uses config_id and config_version fields when constructing the EvaluationRun, aligning with the new schema.


782-804: LGTM: Test updated to use config references.

Correctly creates a config and uses config_id/config_version fields in the EvaluationRun construction.


@AkhileshNegi AkhileshNegi changed the title from "Evaluation to Use Config Management" to "Evaluation: Use Config Management" on Jan 15, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

839-863: Test uses deprecated config dict field instead of config_id/config_version.

The EvaluationRun model no longer has a config dict field (per the model definition in relevant snippets showing config_id: UUID and config_version: int). This test creates an EvaluationRun with config={"model": "gpt-4o"} which is the old pattern and will likely fail.

     def test_get_evaluation_run_without_trace_info(
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
         db: Session,
         user_api_key: TestAuthContext,
         create_test_dataset: EvaluationDataset,
     ) -> None:
         """Test getting evaluation run without requesting trace info."""
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

865-902: Same issue: Test uses deprecated config dict field.

This test also creates EvaluationRun with config={"model": "gpt-4o"} which needs to be updated to use config_id and config_version.

     def test_get_evaluation_run_resync_without_trace_info_fails(
         self,
         client: TestClient,
         user_api_key_header: dict[str, str],
         db: Session,
         user_api_key: TestAuthContext,
         create_test_dataset: EvaluationDataset,
     ) -> None:
         """Test that resync_score=true requires get_trace_info=true."""
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

571-600: Replace dict config with KaapiLLMParams instantiation—test will fail at runtime.

The test passes a plain dictionary to build_evaluation_jsonl, which expects a KaapiLLMParams object. Since the function accesses config using attribute notation (e.g., config.model, config.temperature), the test will fail with an AttributeError at runtime when trying to access attributes on a dict. Use KaapiLLMParams(model="gpt-4o", temperature=0.2, instructions="You are a helpful assistant") instead, consistent with how other tests in the codebase instantiate this model. This also aligns with the factory pattern requirement for test fixtures.

🤖 Fix all issues with AI agents
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 510-517: The test method
test_start_batch_evaluation_invalid_dataset_id contains a duplicated docstring —
remove the redundant docstring so the method has a single descriptive docstring
(keep one of the two identical strings and delete the other) to eliminate the
editing artifact.
🧹 Nitpick comments (4)
backend/app/crud/evaluations/core.py (2)

308-358: Return type should be str not str | None.

The function raises ValueError in all cases where the model would be missing (lines 353-357), so it can never return None. The return type annotation should be str to accurately reflect the contract.

 def resolve_model_from_config(
     session: Session,
     eval_run: EvaluationRun,
-) -> str | None:
+) -> str:

63-69: Log messages should include function name prefix.

Per coding guidelines, log messages should be prefixed with the function name in square brackets. The log statements here and in other CRUD functions (list_evaluation_runs, get_evaluation_run_by_id, update_evaluation_run) don't follow this pattern.

Example fix for this function
     except Exception as e:
         session.rollback()
-        logger.error(f"Failed to create EvaluationRun: {e}", exc_info=True)
+        logger.error(f"[create_evaluation_run] Failed to create EvaluationRun: {e}", exc_info=True)
         raise

-    logger.info(
-        f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-        f"config_id={config_id}, config_version={config_version}"
-    )
+    logger.info(
+        f"[create_evaluation_run] Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
+        f"config_id={config_id}, config_version={config_version}"
+    )

As per coding guidelines, prefix all log messages with the function name in square brackets.

backend/app/api/routes/evaluation.py (1)

501-510: Add defensive None check for config before accessing .completion.

The elif on line 506 accesses config.completion.provider but only checks if error:. If resolve_config_blob returns (None, None) unexpectedly, this would raise AttributeError.

Suggested fix
     if error:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+    if config is None:
+        raise HTTPException(
+            status_code=400,
+            detail="Config resolution returned empty config",
+        )
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
backend/app/tests/api/routes/test_evaluation.py (1)

536-565: Test name and docstring are now misleading.

The test test_start_batch_evaluation_missing_model with docstring "Test batch evaluation fails when model is missing from config" no longer tests that scenario. It now tests that a non-existent config_id returns an error. Consider renaming to test_start_batch_evaluation_invalid_config and updating the docstring.

-    def test_start_batch_evaluation_missing_model(
+    def test_start_batch_evaluation_invalid_config(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
-        """Test batch evaluation fails when model is missing from config."""
+        """Test batch evaluation fails when config_id does not exist."""
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9e77ff and dddee70.

📒 Files selected for processing (5)
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/core.py
  • backend/app/services/llm/providers/registry.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/services/llm/providers/registry.py
  • backend/app/crud/evaluations/batch.py
🧰 Additional context used
📓 Path-based instructions (3)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/core.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (63-133)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (3)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
🔇 Additional comments (7)
backend/app/crud/evaluations/core.py (2)

348-357: Good defensive validation for nested config access.

The defensive checks for config.completion, config.completion.params, and model address the concerns raised in previous reviews. This ensures proper error messages when the config structure is incomplete.
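
As a rough illustration of the pattern (not the repository's actual resolve_model_from_config; the completion.params.model shape is inferred from this review thread):

def _extract_model(config: object) -> str:
    # Fail with a specific message at each level of nesting
    completion = getattr(config, "completion", None)
    if completion is None:
        raise ValueError("Config has no 'completion' section")
    params = getattr(completion, "params", None)
    if params is None:
        raise ValueError("Config 'completion' has no 'params'")
    model = getattr(params, "model", None)
    if not model:
        raise ValueError("Config does not specify a model")
    return model  # return type can be plain str: every failure path raises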


18-71: LGTM: Function signature update for config management integration.

The create_evaluation_run function has been properly updated to accept config_id: UUID and config_version: int instead of a config dict. Type hints are complete and the docstring reflects the new parameters.

backend/app/api/routes/evaluation.py (3)

446-534: LGTM: Evaluation endpoint updated for config management.

The evaluate function properly:

  • Accepts config_id: UUID and config_version: int as required body parameters
  • Resolves the config via ConfigVersionCrud and resolve_config_blob
  • Validates the provider is OpenAI
  • Passes config.completion.params to start_evaluation_batch, which aligns with the KaapiLLMParams type expected by build_evaluation_jsonl

512-512: Log message follows the coding guidelines.

Good use of function name prefix in the log message: [evaluate] Successfully resolved config from config management.


493-510: LLMProvider.OPENAI constant exists and is properly implemented.

The constant LLMProvider.OPENAI = "openai" is defined at backend/app/services/llm/providers/registry.py:16 and is already integrated into the provider registry. It is also used in the get_llm_provider function at line 66.

Additionally, the None-checking for config at line 506 is safe. The resolve_config_blob function returns tuple[ConfigBlob | None, str | None] where the contract guarantees either (config_blob, None) on success or (None, error_message) on failure. Since line 506 is only reached after the if error: check at line 501 (which raises on any error), config cannot be None at that point.

Likely an incorrect or invalid review comment.
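
For anyone unfamiliar with the (value, error) convention described above, a generic, self-contained sketch of how such a contract is produced and consumed (names are illustrative only, not the repository's):

def resolve_blob(blob_id: str) -> tuple[dict | None, str | None]:
    # Contract: (value, None) on success, (None, error_message) on failure
    store = {"abc": {"provider": "openai", "model": "gpt-4o"}}
    if blob_id not in store:
        return None, f"Config '{blob_id}' not found"
    return store[blob_id], None

config, error = resolve_blob("abc")
if error:
    raise ValueError(error)
# Past the error check, config is guaranteed to be non-None
print(config["model"])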

backend/app/tests/api/routes/test_evaluation.py (2)

510-534: LGTM: Test updated to use config management pattern.

The test correctly creates a test config using create_test_config and passes config_id and config_version in the API payload, aligning with the new endpoint contract.


736-752: LGTM: Test properly updated for config management.

The test correctly creates a test config and uses config_id and config_version when constructing the EvaluationRun.


Comment on lines 510 to 517
    def test_start_batch_evaluation_invalid_dataset_id(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
        sample_evaluation_config: dict[str, Any],
    ) -> None:
        self, client, user_api_key_header, db, user_api_key
    ):
        """Test batch evaluation fails with invalid dataset_id."""
        # Create a valid config to use
        config = create_test_config(db, project_id=user_api_key.project.id)

        """Test batch evaluation fails with invalid/non-existent dataset_id."""

⚠️ Potential issue | 🟡 Minor

Duplicate docstring in test method.

Lines 513 and 517 both contain docstrings for the same test method. This appears to be an editing artifact - remove one of them.

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

         """Test batch evaluation fails with invalid/non-existent dataset_id."""

Should be:

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` around lines 510 - 517: the
test method test_start_batch_evaluation_invalid_dataset_id contains a duplicated
docstring — remove the redundant docstring so the method has a single
descriptive docstring (keep one of the two identical strings and delete the
other) to eliminate the editing artifact.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/app/tests/api/routes/test_evaluation.py (6)

551-580: Remove unused invalid_config variable.

The invalid_config dict at lines 557-560 is defined but never used. The test now uses uuid4() for a non-existent config_id instead.

🧹 Suggested fix
     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
         # We don't need a real dataset for this test - the validation should happen
         # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }

         response = client.post(

601-631: Tests will fail: build_evaluation_jsonl now expects KaapiLLMParams, not dict.

All tests in TestBatchEvaluationJSONLBuilding pass dict configs to build_evaluation_jsonl, but the function signature changed to accept KaapiLLMParams. These tests will raise AttributeError when accessing config.model, config.instructions, etc.

🔧 Suggested fix for test_build_batch_jsonl_basic
+from app.models.llm.request import KaapiLLMParams
+
 class TestBatchEvaluationJSONLBuilding:
     """Test JSONL building logic for batch evaluation."""

     def test_build_batch_jsonl_basic(self) -> None:
         """Test basic JSONL building with minimal config."""
         dataset_items = [
             {
                 "id": "item1",
                 "input": {"question": "What is 2+2?"},
                 "expected_output": {"answer": "4"},
                 "metadata": {},
             }
         ]

-        config = {
-            "model": "gpt-4o",
-            "temperature": 0.2,
-            "instructions": "You are a helpful assistant",
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o",
+            temperature=0.2,
+            instructions="You are a helpful assistant",
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

Similar changes are needed for all other tests in this class.


632-659: Test uses tools key which doesn't exist in KaapiLLMParams.

This test assumes a tools field in the config, but KaapiLLMParams uses knowledge_base_ids instead. The function internally converts knowledge_base_ids to the tools format for OpenAI. Update to use KaapiLLMParams with knowledge_base_ids.

🔧 Suggested fix
     def test_build_batch_jsonl_with_tools(self) -> None:
         """Test JSONL building with tools configuration."""
         dataset_items = [
             {
                 "id": "item1",
                 "input": {"question": "Search the docs"},
                 "expected_output": {"answer": "Answer from docs"},
                 "metadata": {},
             }
         ]

-        config = {
-            "model": "gpt-4o-mini",
-            "instructions": "Search documents",
-            "tools": [
-                {
-                    "type": "file_search",
-                    "vector_store_ids": ["vs_abc123"],
-                }
-            ],
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o-mini",
+            instructions="Search documents",
+            knowledge_base_ids=["vs_abc123"],
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

         assert len(jsonl_data) == 1
         request = jsonl_data[0]
         assert request["body"]["tools"][0]["type"] == "file_search"
         assert "vs_abc123" in request["body"]["tools"][0]["vector_store_ids"]

661-710: Update remaining tests to use KaapiLLMParams.

test_build_batch_jsonl_minimal_config and test_build_batch_jsonl_skips_empty_questions also pass dict configs and need similar updates.

🔧 Suggested fixes
     def test_build_batch_jsonl_minimal_config(self) -> None:
         """Test JSONL building with minimal config (only model required)."""
         dataset_items = [...]

-        config = {"model": "gpt-4o"}  # Only model provided
+        config = KaapiLLMParams(model="gpt-4o")

         jsonl_data = build_evaluation_jsonl(dataset_items, config)
     def test_build_batch_jsonl_skips_empty_questions(self) -> None:
         """Test that items with empty questions are skipped."""
         dataset_items = [...]

-        config = {"model": "gpt-4o", "instructions": "Test"}
+        config = KaapiLLMParams(model="gpt-4o", instructions="Test")

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

712-736: Update test_build_batch_jsonl_multiple_items to use KaapiLLMParams.

🔧 Suggested fix
     def test_build_batch_jsonl_multiple_items(self) -> None:
         """Test JSONL building with multiple items."""
         dataset_items = [...]

-        config = {
-            "model": "gpt-4o",
-            "instructions": "Answer questions",
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o",
+            instructions="Answer questions",
+        )

         jsonl_data = build_evaluation_jsonl(dataset_items, config)

869-877: Update tests to use config_id and config_version instead of config dict.

The EvaluationRun model does not have a config field—it has config_id (UUID) and config_version (int). The test initializes EvaluationRun with config={"model": "gpt-4o"}, which will be rejected by the model. Update lines 869-877 and the similar test at lines 904-912 to use config_id and config_version to match the pattern in lines 766-778 and 806-827.

♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

525-549: Duplicate docstring in test method.

Lines 528 and 532 both contain docstrings for the same test method.

🔧 Suggested fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
🧹 Nitpick comments (4)
backend/app/crud/evaluations/__init__.py (1)

40-70: Inconsistency: save_score and fetch_trace_scores_from_langfuse imported but not in __all__.

Two functions are imported at the module level (lines 9 and 29) but omitted from __all__:

  • save_score (from core)
  • fetch_trace_scores_from_langfuse (from langfuse)

If these are intended as internal-only, consider removing them from the top-level imports to avoid confusion. Otherwise, add them to __all__ for consistency.

Option A: Add missing exports to `__all__`
 __all__ = [
     # Core
     "create_evaluation_run",
     "get_evaluation_run_by_id",
     "list_evaluation_runs",
     "resolve_model_from_config",
+    "save_score",
     # Cron
     "process_all_pending_evaluations",
     "process_all_pending_evaluations_sync",
     ...
     # Langfuse
     "create_langfuse_dataset_run",
+    "fetch_trace_scores_from_langfuse",
     "update_traces_with_cosine_scores",
     "upload_dataset_to_langfuse",
 ]
Option B: Remove unused top-level imports
 from app.crud.evaluations.core import (
     create_evaluation_run,
     get_evaluation_run_by_id,
     list_evaluation_runs,
     resolve_model_from_config,
-    save_score,
 )
 ...
 from app.crud.evaluations.langfuse import (
     create_langfuse_dataset_run,
-    fetch_trace_scores_from_langfuse,
     update_traces_with_cosine_scores,
     upload_dataset_to_langfuse,
 )
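
Whichever option is chosen, a small guard test can keep __all__ honest going forward (a generic sketch, not existing project code):

import importlib

def test_all_exports_are_defined() -> None:
    module = importlib.import_module("app.crud.evaluations")
    missing = [name for name in module.__all__ if not hasattr(module, name)]
    assert not missing, f"__all__ lists names that are not defined: {missing}"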
backend/app/crud/evaluations/embeddings.py (1)

365-366: Stale comment: model is hard-coded, not read from config.

The comment on line 365 states "Get embedding model from config" but the code now hard-codes the value directly. Consider updating the comment to reflect the intentional hard-coding, or remove the comment entirely since the assignment is self-explanatory.

📝 Suggested comment fix
-        # Get embedding model from config (default: text-embedding-3-large)
+        # Use fixed embedding model for similarity scoring
         embedding_model = "text-embedding-3-large"
backend/app/crud/evaluations/batch.py (2)

74-86: Docstring references fields not present in KaapiLLMParams.

The docstring mentions config fields like tools, text, and include, but KaapiLLMParams (per the relevant snippet) only has: model, instructions, knowledge_base_ids, reasoning, temperature, and max_num_results. Update the docstring to reflect the actual KaapiLLMParams structure.

📝 Suggested docstring update
     Args:
         dataset_items: List of dataset items from Langfuse
-        config: Evaluation configuration dict with OpenAI Responses API parameters.
-            This config is used as-is in the body, with only "input" being added
-            from the dataset. Config can include any fields like:
-            - model (required)
-            - instructions
-            - tools
-            - reasoning
-            - text
-            - temperature
-            - include
-            etc.
+        config: KaapiLLMParams containing evaluation configuration:
+            - model (required): Model identifier
+            - instructions: System instructions
+            - knowledge_base_ids: Vector store IDs for file_search tools
+            - reasoning: Reasoning effort level
+            - temperature: Sampling temperature
+            - max_num_results: Max results for file search

     Returns:
         List of dictionaries (JSONL data)

148-153: Update docstring to reflect KaapiLLMParams type.

The docstring still describes config as "Evaluation configuration dict with llm, instructions, vector_store_ids" but the parameter is now KaapiLLMParams.

📝 Suggested fix
         session: Database session
         eval_run: EvaluationRun database object (with run_name, dataset_name, config)
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams containing model, instructions, and other LLM parameters
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dddee70 and 355d281.

📒 Files selected for processing (5)
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/crud/evaluations/processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/__init__.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (2)
backend/app/crud/evaluations/batch.py (1)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (4)
backend/app/crud/evaluations/__init__.py (1)

4-10: LGTM on the new resolve_model_from_config import.

The new function is properly imported from core and correctly added to __all__, making it part of the module's public API.

backend/app/crud/evaluations/batch.py (1)

103-128: LGTM! Body building logic correctly handles None values.

The conditional field inclusion properly addresses the previous review feedback about not sending None values to the OpenAI API. Fields are only added when they have valid values.
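
The approved pattern boils down to adding optional keys only when they carry a value; a generic sketch (field names mirror the ones discussed in this thread, not the exact implementation):

def build_body(
    model: str,
    question: str,
    instructions: str | None = None,
    temperature: float | None = None,
) -> dict:
    body: dict = {"model": model, "input": question}
    # Optional fields are included only when set, so the JSON sent
    # to the API never contains explicit nulls
    if instructions is not None:
        body["instructions"] = instructions
    if temperature is not None:
        body["temperature"] = temperature
    return body

assert "temperature" not in build_body("gpt-4o", "What is 2+2?")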

backend/app/tests/api/routes/test_evaluation.py (2)

766-778: LGTM! Test correctly uses config_id and config_version.

The test properly creates a config using create_test_config and references it via config_id and config_version on the EvaluationRun, aligning with the new config management pattern.


806-827: LGTM! Consistent config reference pattern.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/tests/api/routes/test_evaluation.py (4)

644-660: Type mismatch: build_evaluation_jsonl expects KaapiLLMParams, not dict.

The function signature is build_evaluation_jsonl(dataset_items: list[dict[str, Any]], config: KaapiLLMParams). This test passes a plain dict, which will cause a runtime error when the function accesses config.model, config.instructions, etc. as attributes.

Proposed fix using KaapiLLMParams
-        config = {
-            "model": "gpt-4o-mini",
-            "instructions": "Search documents",
-            "tools": [
-                {
-                    "type": "file_search",
-                    "vector_store_ids": ["vs_abc123"],
-                }
-            ],
-        }
+        config = KaapiLLMParams(
+            model="gpt-4o-mini",
+            instructions="Search documents",
+            knowledge_base_ids=["vs_abc123"],
+        )

Note: Based on the build_evaluation_jsonl implementation, tools are constructed from knowledge_base_ids, not passed directly. You may need to adjust the test assertions accordingly.


870-882: EvaluationRun no longer has a config field - use config_id and config_version.

Based on the EvaluationRun model, the config dict field has been replaced with config_id (UUID) and config_version (int). This test will fail at runtime because it's setting a non-existent field.

Proposed fix
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_simple_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

905-917: Same issue: config field doesn't exist on EvaluationRun.

This test also uses the deprecated config dict field instead of config_id/config_version.

Proposed fix
+        # Create a config for the evaluation run
+        config = create_test_config(db, project_id=user_api_key.project.id)
+
         eval_run = EvaluationRun(
             run_name="test_run",
             dataset_name=create_test_dataset.name,
             dataset_id=create_test_dataset.id,
-            config={"model": "gpt-4o"},
+            config_id=config.id,
+            config_version=1,
             status="completed",
             total_items=3,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
         )

656-660: Test setup and assertions need updating for KaapiLLMParams structure.

The test setup passes a plain dict with pre-formed tools, but build_evaluation_jsonl expects a KaapiLLMParams object with knowledge_base_ids field. Update the config to use knowledge_base_ids instead of pre-formed tools, and add assertions for the max_num_results field that is automatically included in the generated tools array (defaults to 20 if not specified).

Example correction for test setup:
config = KaapiLLMParams(
    model="gpt-4o-mini",
    instructions="Search documents",
    knowledge_base_ids=["vs_abc123"],
)

jsonl_data = build_evaluation_jsonl(dataset_items, config)

assert len(jsonl_data) == 1
request = jsonl_data[0]
assert request["body"]["tools"][0]["type"] == "file_search"
assert "vs_abc123" in request["body"]["tools"][0]["vector_store_ids"]
assert request["body"]["tools"][0]["max_num_results"] == 20
♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

526-533: Duplicate docstring in test method.

Lines 529 and 533 both contain docstrings for the same test method. This appears to be an editing artifact from the refactor - remove one of them.

Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

         """Test batch evaluation fails with invalid/non-existent dataset_id."""

Should become:

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

12-15: Duplicate import statement from test_data.

The file imports from app.tests.utils.test_data twice: create_test_config on line 13 and create_test_evaluation_dataset on line 15, with an unrelated import in between. Consolidate them into a single import statement.

Proposed fix
 from app.models import EvaluationDataset, EvaluationRun
 from app.models.llm.request import KaapiLLMParams
-from app.tests.utils.test_data import create_test_config
 from app.tests.utils.auth import TestAuthContext
-from app.tests.utils.test_data import create_test_evaluation_dataset
+from app.tests.utils.test_data import create_test_config, create_test_evaluation_dataset
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 355d281 and 9c71896.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (1)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-132)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (11)
backend/app/tests/crud/evaluations/test_processing.py (6)

279-295: LGTM! Test fixture correctly updated to use config_id and config_version.

The eval_run_with_batch fixture properly creates an evaluation run with the new config reference fields. The hardcoded UUID is acceptable for test purposes.


401-410: LGTM! Consistent config reference usage in test.

The test for processing evaluation without batch_job_id correctly uses the same config_id/config_version pattern.


469-485: LGTM! Embedding batch test fixture updated correctly.

The eval_run_with_embedding_batch fixture follows the same pattern as other fixtures.


618-632: LGTM! Check and process evaluation test updated.

The test correctly creates an evaluation run with config references for testing completed batch processing.


681-695: LGTM! Failed batch test fixture updated.

Consistent with other test fixtures in the file.


781-795: LGTM! Poll pending evaluations test updated.

The final test fixture correctly uses config_id/config_version for testing poll functionality.

backend/app/tests/api/routes/test_evaluation.py (5)

602-631: LGTM! Basic JSONL building test correctly uses KaapiLLMParams.

The test properly constructs a KaapiLLMParams instance and validates the output structure.


662-680: LGTM! Minimal config test uses KaapiLLMParams correctly.

Good coverage of the minimal required configuration.


682-711: LGTM! Empty questions skip test uses KaapiLLMParams correctly.

Proper testing of edge case handling.


713-738: LGTM! Multiple items test uses KaapiLLMParams correctly.

Good coverage of batch processing multiple items.


758-796: LGTM! Trace info tests properly updated to use config references.

Both test_get_evaluation_run_trace_info_not_completed and test_get_evaluation_run_trace_info_completed correctly create a test config and use config_id/config_version.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/crud/evaluations/test_processing.py`:
- Around lines 356-359: Remove the debug print statements that dump the result and
the mock call counts (the four print(...) lines referencing result,
mock_download.call_count, mock_fetch_dataset.call_count, and
mock_create_langfuse.call_count) from the test; the existing assertions already
verify the behavior, so delete those prints and keep the assertions unchanged to
avoid noisy test output.
♻️ Duplicate comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

526-533: Duplicate docstring still present.

Lines 529 and 533 both contain docstrings for the same test method. This was flagged in a previous review and remains unresolved.

Proposed fix
     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
-        """Test batch evaluation fails with invalid dataset_id."""
+        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)
 
-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
🧹 Nitpick comments (3)
backend/app/tests/api/routes/test_evaluation.py (3)

558-561: Remove unused variable invalid_config.

This dictionary is defined but never used. It appears to be a leftover from when the test passed an inline config instead of referencing a stored config via config_id.

Proposed fix
     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
-        # We don't need a real dataset for this test - the validation should happen
-        # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }
-
         response = client.post(

633-661: Commented test contains syntax errors.

When this test is uncommented, it will fail because:

  1. Lines 645-654 use dict literal syntax inside KaapiLLMParams() constructor instead of keyword arguments
  2. KaapiLLMParams doesn't have a tools parameter—it uses knowledge_base_ids which maps to file_search tools internally (per build_evaluation_jsonl)

If you plan to re-enable this test, consider:

config = KaapiLLMParams(
    model="gpt-4o-mini",
    instructions="Search documents",
    knowledge_base_ids=["vs_abc123"],  # This becomes file_search tool
)

871-879: Consider migrating remaining tests to use stored config.

This test still uses the embedded config={"model": "gpt-4o"} pattern while other tests in this PR have been updated to use config_id/config_version. Per the AI summary, this is pending migration work.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4d1896 and 3620fee.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (2)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
backend/app/tests/api/routes/test_evaluation.py (4)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-132)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/models/llm/request.py (1)
  • KaapiLLMParams (8-41)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (14)
backend/app/tests/crud/evaluations/test_processing.py (7)

16-16: LGTM!

The import of create_test_config follows the factory pattern for test fixtures as per coding guidelines.


259-296: LGTM!

The fixture correctly uses the new config_id/config_version pattern and follows the factory approach with proper type hints.


401-430: LGTM!

Test correctly updated to use the new config management pattern.


454-492: LGTM!

Fixture correctly updated to use config management pattern consistent with other test fixtures.


594-657: LGTM!

Test correctly updated to use the new config management pattern with proper fixture setup.


659-721: LGTM!

Test correctly updated for the failed batch scenario with new config management pattern.


759-821: LGTM!

Test correctly updated to use the new config management pattern for polling pending evaluations.

backend/app/tests/api/routes/test_evaluation.py (7)

4-15: LGTM!

Imports are correctly added and all are utilized in the test file.


613-631: LGTM!

Test correctly migrated from dict to KaapiLLMParams. Assertions properly verify the JSONL structure.


674-681: LGTM!

Minimal config test correctly uses KaapiLLMParams with only the required model parameter.


706-712: LGTM!

Test correctly validates that items with empty or missing questions are filtered out.


726-738: LGTM!

Test properly validates batch processing of multiple items with KaapiLLMParams.


768-784: LGTM!

Test correctly migrated to use stored config reference via config_id and config_version. The factory pattern with create_test_config aligns with coding guidelines.


808-827: LGTM!

Test correctly uses stored config reference and properly validates the completed evaluation response structure.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +356 to +359
print(f"Result: {result}")
print(f"Download called: {mock_download.call_count}")
print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
print(f"Create langfuse called: {mock_create_langfuse.call_count}")

⚠️ Potential issue | 🟡 Minor

Remove debug print statements.

These debug print statements should be removed before merging. The assertions below already validate the expected behavior, making these statements unnecessary test artifacts that pollute output.

🧹 Suggested fix
-        print(f"Result: {result}")
-        print(f"Download called: {mock_download.call_count}")
-        print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
-        print(f"Create langfuse called: {mock_create_langfuse.call_count}")
-
         assert result is not None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"Result: {result}")
print(f"Download called: {mock_download.call_count}")
print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
print(f"Create langfuse called: {mock_create_langfuse.call_count}")
assert result is not None
🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_processing.py` around lines 356-359,
remove the four debug print statements (the one printing the result and the three
referencing mock_download.call_count, mock_fetch_dataset.call_count, and
mock_create_langfuse.call_count) from the test; the existing assertions already
verify the behavior, so delete those prints and keep the assertions unchanged to
avoid noisy test output.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

552-581: Remove unused invalid_config variable.

The invalid_config dict defined at lines 558-561 is never used; it appears to be a leftover from the previous implementation that passed config as a dict. The test now correctly uses config_id with a non-existent UUID.

     def test_start_batch_evaluation_missing_model(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails when model is missing from config."""
         # We don't need a real dataset for this test - the validation should happen
         # before dataset lookup. Use any dataset_id and expect config validation error
-        invalid_config = {
-            "instructions": "You are a helpful assistant",
-            "temperature": 0.5,
-        }

         response = client.post(
♻️ Duplicate comments (2)
backend/app/tests/api/routes/test_evaluation.py (1)

526-541: Duplicate docstring in test method.

Lines 529 and 533 both contain docstrings. Line 533 appears to be misplaced; it should be removed, since the docstring at line 529 already describes the test purpose.

     def test_start_batch_evaluation_invalid_dataset_id(
         self, client, user_api_key_header, db, user_api_key
     ):
         """Test batch evaluation fails with invalid dataset_id."""
         # Create a valid config to use
         config = create_test_config(db, project_id=user_api_key.project.id)

-        """Test batch evaluation fails with invalid/non-existent dataset_id."""
         response = client.post(
backend/app/tests/crud/evaluations/test_processing.py (1)

356-359: Remove debug print statements.

These debug print statements should be removed before merging. The assertions below already validate the expected behavior, making these statements unnecessary test artifacts that pollute output.

         result = await process_completed_evaluation(
             eval_run=eval_run_with_batch,
             session=db,
             openai_client=mock_openai,
             langfuse=mock_langfuse,
         )
-        print(f"Result: {result}")
-        print(f"Download called: {mock_download.call_count}")
-        print(f"Fetch dataset called: {mock_fetch_dataset.call_count}")
-        print(f"Create langfuse called: {mock_create_langfuse.call_count}")

         assert result is not None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3620fee and 8d7db45.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧬 Code graph analysis (1)
backend/app/tests/crud/evaluations/test_processing.py (2)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (238-301)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (17)
backend/app/tests/api/routes/test_evaluation.py (9)

4-4: LGTM!

Import of uuid4 is correctly added for generating non-existent config IDs in tests.


12-13: LGTM!

Imports for KaapiLLMParams and create_test_config are properly added to support the config-based evaluation flow.


613-617: LGTM!

Correctly migrated to use KaapiLLMParams for typed configuration in JSONL building tests.


644-648: LGTM!

Properly uses KaapiLLMParams with knowledge_base_ids for testing tools configuration.


668-668: LGTM!

Minimal config test correctly uses KaapiLLMParams with only the required model parameter.


700-700: LGTM!

Test correctly uses KaapiLLMParams for the empty questions skipping test.


720-723: LGTM!

Multiple items test properly migrated to use KaapiLLMParams.


762-770: LGTM!

Test correctly creates a config via create_test_config and uses config_id/config_version when constructing the EvaluationRun instance.


802-810: LGTM!

Properly updated to use config-based approach consistent with the model changes.

backend/app/tests/crud/evaluations/test_processing.py (8)

16-16: LGTM!

Import of create_test_config is correctly added alongside existing create_test_evaluation_dataset.


263-263: LGTM!

Correctly creates a test config with use_kaapi_schema=True to match the evaluation run requirements.


280-289: LGTM!

The create_evaluation_run call correctly uses config_id and config_version parameters matching the updated function signature from core.py.


406-416: LGTM!

Test correctly creates config and uses config_id/config_version for the no batch job ID test case.


454-485: LGTM!

Fixture properly creates a config and uses the config-based approach for the embedding batch evaluation run.


607-635: LGTM!

Test correctly migrated to use config creation and config_id/config_version parameters.


670-699: LGTM!

Failed evaluation test properly uses the config-based pattern.


772-800: LGTM!

Poll pending evaluations test correctly creates config and uses config_id/config_version.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Labels

enhancement (New feature or request), ready-for-review


Development

Successfully merging this pull request may close these issues.

Add config management in Evals

5 participants