
Conversation


@AkhileshNegi AkhileshNegi commented Dec 17, 2025

Summary

Target issue is #492

Checklist

Before submitting a pull request, please ensure that you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensured that it is tested and has test cases.

Notes

Summary by CodeRabbit

  • New Features

    • Dataset management APIs: upload CSV with duplication, list, retrieve, delete.
    • Evaluation run APIs: start runs, list runs, view run status with optional trace/score sync.
    • Batch status polling to track background job progress.
  • Improvements

    • Server-side CSV validation and dataset-name sanitization.
    • Service-driven evaluation flow with centralized score retrieval and caching.
    • Router/package consolidation for a clearer API surface.
  • Tests

    • Updated tests for new routes (trailing-slash) and added auth checks.

✏️ Tip: You can customize this high-level summary in your review settings.

@AkhileshNegi AkhileshNegi self-assigned this Dec 17, 2025

coderabbitai bot commented Dec 17, 2025

📝 Walkthrough

Walkthrough

Split evaluation functionality into an evaluations package: moved dataset upload, CSV validation, evaluation orchestration, and score retrieval into service modules; added batch polling and job re-exports; updated routers, CRUD signatures, and tests to use the new service-driven flow.

Changes

  • API package & routers (backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/evaluations/dataset.py, backend/app/api/routes/evaluations/evaluation.py, backend/app/api/main.py): New /evaluations router and sub-routers; dataset/evaluation endpoints moved into the package and now delegate to services, with docstrings and response mappers; main router wiring updated.
  • Evaluation services & validators (backend/app/services/evaluations/__init__.py, backend/app/services/evaluations/dataset.py, backend/app/services/evaluations/evaluation.py, backend/app/services/evaluations/validators.py): New service layer: upload_dataset, start_evaluation, get_evaluation_with_scores, build_evaluation_config; CSV validators, sanitization and parsing utilities; package re-exports.
  • Dataset CRUD changes (backend/app/crud/evaluations/dataset.py): delete_dataset now accepts a pre-fetched EvaluationDataset and returns `str | None` (None on success, an error string otherwise).
  • Batch core & polling (backend/app/core/batch/__init__.py, backend/app/core/batch/operations.py, backend/app/core/batch/polling.py): Expanded core batch public API; extracted polling logic into new polling.py and removed the polling function from operations.py; poll_batch_status implemented to compare provider status and persist DB updates.
  • CRUD / re-exports & import consolidation (backend/app/crud/job/__init__.py, backend/app/crud/job/batch.py, backend/app/crud/evaluations/__init__.py, backend/app/crud/evaluations/*): Added re-export layers for job/batch CRUD; consolidated imports to use the app.core.batch and app.crud.job re-exports; added save_score and fetch_trace_scores_from_langfuse re-exports; removed the explicit __all__ in one evaluations init.
  • Evaluation-related CRUD/processing changes (backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py, backend/app/crud/evaluations/batch.py): Adjusted imports to the centralized app.core.batch and app.crud.job; minor logging/message tweaks and a signature punctuation change.
  • Services -> API tests & auth adjustments (backend/app/tests/api/routes/test_evaluation.py, backend/app/tests/api/test_auth_failures.py, other evaluation tests): Tests updated to patch the new service paths, use trailing-slash endpoint URLs, and add auth-required assertions for the dataset upload and batch evaluation endpoints.
  • Tests for CRUD changes (backend/app/tests/crud/evaluations/test_dataset.py, backend/app/tests/crud/evaluations/test_processing.py): Tests adapted to the new delete_dataset contract (object param + None/error string) and updated poll_batch_status patch targets.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as Evaluations API
    participant Validator as Validators
    participant Service as Dataset Service
    participant ObjectStore as Object Store
    participant Langfuse as Langfuse
    participant DB as Database

    Client->>API: POST /evaluations/datasets/ (CSV)
    API->>Validator: validate_csv_file(file)
    Validator-->>API: csv_content (bytes)
    API->>Service: upload_dataset(csv_content, name, ...)
    Service->>Service: sanitize_dataset_name / parse_csv_items
    Service->>ObjectStore: upload CSV (optional)
    ObjectStore-->>Service: object_store_url / error
    Service->>Langfuse: upload dataset
    Langfuse-->>Service: langfuse_dataset_id
    Service->>DB: create EvaluationDataset record
    DB-->>Service: EvaluationDataset
    Service-->>API: EvaluationDataset
    API-->>Client: DatasetUploadResponse
sequenceDiagram
    participant Client
    participant API as Evaluations API
    participant Service as Evaluation Service
    participant DB as Database
    participant Langfuse as Langfuse
    participant Provider as Batch Provider

    Client->>API: POST /evaluations/ (dataset_id, config, ...)
    API->>Service: start_evaluation(dataset_id, config, ...)
    Service->>DB: fetch EvaluationDataset
    DB-->>Service: EvaluationDataset
    Service->>Service: build_evaluation_config (merge assistant defaults)
    Service->>DB: create EvaluationRun record
    DB-->>Service: EvaluationRun
    Service->>Langfuse: initialize evaluation run
    Langfuse-->>Service: langfuse_run_id
    Service->>Provider: start batch job
    Provider-->>Service: provider_batch_id
    Service->>DB: update EvaluationRun with provider ids
    Service-->>API: EvaluationRunPublic
    API-->>Client: EvaluationRunPublic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

ready-for-review

Suggested reviewers

  • Prajna1999
  • kartpop
  • avirajsingh7

Poem

🐇 I hopped through routes and services bright,
CSV rows counted by validator light;
Datasets uploaded, runs queued to zoom,
Polls tick until the results resume—
A rabbit cheers as new modules take flight!

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 inconclusive
❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title "Evaluation: Refactor" is vague and overly generic, using non-descriptive terms that don't convey the actual scope or primary focus of the substantial changes. Resolution: replace it with a more specific title that highlights a key aspect of the refactor, such as "Evaluation: Refactor to service-driven architecture" or "Evaluation: Migrate dataset and run endpoints to services".
✅ Passed checks (2)
  • Description Check (✅ Passed): Check skipped: CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@AkhileshNegi AkhileshNegi added the enhancement New feature or request label Dec 17, 2025
@AkhileshNegi AkhileshNegi marked this pull request as ready for review December 18, 2025 06:53

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

52-82: Fix event loop blocking: async endpoint calling sync I/O.

The endpoint is async def (required for await validate_csv_file), but line 72 calls the synchronous upload_dataset service, which performs blocking I/O operations (database queries, Langfuse API calls, object store uploads). This blocks FastAPI's event loop and degrades concurrency.

🔎 Recommended fixes (choose one):

Option 1 (simplest): Run the sync service in a thread pool

+from fastapi.concurrency import run_in_threadpool
+
 async def upload_dataset_endpoint(
     ...
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)
 
     # Upload dataset using service
-    dataset = upload_dataset(
+    dataset = await run_in_threadpool(
+        upload_dataset,
         session=_session,
         csv_content=csv_content,
         dataset_name=dataset_name,
         description=description,
         duplication_factor=duplication_factor,
         organization_id=auth_context.organization.id,
         project_id=auth_context.project.id,
     )
 
     return APIResponse.success_response(data=_dataset_to_response(dataset))

Option 2: Make the service async (requires refactoring service layer)

This would require making upload_dataset and its dependencies async, which is a larger refactor but provides better async integration.

Note: The other endpoints (evaluate, get_evaluation_run_status) are correctly defined as sync (def), so FastAPI automatically runs them in a thread pool.

🧹 Nitpick comments (5)
backend/app/services/evaluation/dataset.py (1)

131-138: Consider sanitizing exception details in client-facing error message.

Exposing raw exception messages ({e}) could leak internal implementation details (stack traces, connection strings, etc.) to API clients.

🔎 Apply this diff to provide a generic error message:
     except Exception as e:
         logger.error(
             f"[upload_dataset] Failed to upload dataset to Langfuse | {e}",
             exc_info=True,
         )
         raise HTTPException(
-            status_code=500, detail=f"Failed to upload dataset to Langfuse: {e}"
+            status_code=500, detail="Failed to upload dataset to Langfuse. Please check your Langfuse credentials and try again."
         )
backend/app/services/evaluation/evaluation.py (3)

23-28: Consider more specific type hints for config parameter.

Using dict without type parameters reduces type safety. As per coding guidelines requiring type hints, consider using dict[str, Any] for clarity.

🔎 Apply this diff:
+from typing import Any
+
 def build_evaluation_config(
     session: Session,
-    config: dict,
+    config: dict[str, Any],
     assistant_id: str | None,
     project_id: int,
-) -> dict:
+) -> dict[str, Any]:

101-109: Consider more specific type hints for config parameter here as well.

Same suggestion as build_evaluation_config for consistency.


313-319: Consider sanitizing exception details in error message.

Similar to the dataset service, exposing raw exception messages could leak internal details.

🔎 Apply this diff:
     except Exception as e:
         logger.error(
             f"[get_evaluation_with_scores] Failed to fetch trace info | "
             f"evaluation_id={evaluation_id} | error={e}",
             exc_info=True,
         )
-        return eval_run, f"Failed to fetch trace info from Langfuse: {str(e)}"
+        return eval_run, "Failed to fetch trace info from Langfuse. Please try again later."
backend/app/api/routes/evaluation.py (1)

67-67: Consider enriching docstrings with parameter and behavior details.

The added docstrings are present but minimal, mostly restating the function names. For route handlers that delegate to services, consider documenting:

  • Key parameter constraints or validation rules
  • Important response scenarios (e.g., 404 vs 422)
  • Side effects or async behavior notes

Example for upload_dataset_endpoint:

"""
Upload an evaluation dataset from a CSV file.

Validates CSV format and content, uploads to Langfuse and optional object store,
and persists metadata. Runs synchronously in a thread pool to avoid blocking.
"""

This is optional since the Swagger descriptions are loaded from external markdown files.

Also applies to: 96-96, 124-124, 156-156, 200-200, 225-225, 269-269

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d26401b and d990add.

📒 Files selected for processing (7)
  • backend/app/api/routes/evaluation.py (10 hunks)
  • backend/app/crud/evaluations/__init__.py (4 hunks)
  • backend/app/services/evaluation/__init__.py (1 hunks)
  • backend/app/services/evaluation/dataset.py (1 hunks)
  • backend/app/services/evaluation/evaluation.py (1 hunks)
  • backend/app/services/evaluation/validators.py (1 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/evaluation/validators.py
  • backend/app/services/evaluation/dataset.py
  • backend/app/services/evaluation/__init__.py
  • backend/app/services/evaluation/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/evaluation/validators.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/services/evaluation/dataset.py
  • backend/app/services/evaluation/__init__.py
  • backend/app/services/evaluation/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/api/routes/evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/__init__.py (2)
backend/app/crud/evaluations/core.py (1)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
backend/app/services/evaluation/__init__.py (3)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/evaluation.py (2)
  • build_evaluation_config (23-98)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/validators.py (2)
  • parse_csv_items (118-174)
  • sanitize_dataset_name (23-68)
backend/app/services/evaluation/evaluation.py (7)
backend/app/crud/assistants.py (1)
  • get_assistant_by_id (19-30)
backend/app/crud/evaluations/core.py (3)
  • create_evaluation_run (14-61)
  • get_evaluation_run_by_id (103-141)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
backend/app/crud/evaluations/dataset.py (1)
  • get_dataset_by_id (107-140)
backend/app/crud/evaluations/batch.py (1)
  • start_evaluation_batch (118-212)
backend/app/models/evaluation.py (1)
  • EvaluationRun (170-319)
backend/app/utils.py (2)
  • get_langfuse_client (212-248)
  • get_openai_client (179-209)
backend/app/api/routes/evaluation.py (3)
backend/app/services/evaluation/evaluation.py (1)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/validators.py (1)
  • validate_csv_file (71-115)
🔇 Additional comments (29)
backend/app/tests/api/routes/test_evaluation.py (7)

56-64: LGTM! Mock paths correctly updated to reflect service layer refactoring.

The patch targets are properly updated from app.api.routes.evaluation.* to app.services.evaluation.dataset.*, aligning with the new service-oriented architecture.


142-150: Mock paths consistently updated for empty rows test.


188-196: Mock paths consistently updated for default duplication test.


229-237: Mock paths consistently updated for custom duplication test.


270-278: Mock paths consistently updated for description test.


362-370: Mock paths consistently updated for boundary minimum test.


407-410: Mock paths consistently updated for Langfuse configuration failure test.

backend/app/crud/evaluations/__init__.py (3)

4-9: LGTM! Properly exports save_score from core module.

The new export aligns with the service layer's need to persist evaluation scores.


26-31: LGTM! Properly exports fetch_trace_scores_from_langfuse from langfuse module.

This enables the evaluation service to retrieve trace scores via the CRUD layer's public API.


39-70: LGTM! __all__ correctly updated with new exports.

Both save_score and fetch_trace_scores_from_langfuse are properly listed in the appropriate sections.

backend/app/services/evaluation/validators.py (4)

13-20: LGTM! Well-defined security constants.

Clear file upload limits with appropriate MIME type allowances including text/plain for compatibility.


23-68: LGTM! Well-documented sanitization with comprehensive rules.

The function handles edge cases properly, includes clear examples in the docstring, and follows coding guidelines with type hints.


100-115: LGTM! Thorough file size validation.

Properly seeks to end for size check and resets position before reading.
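
For reference, the seek-based size check being described could look roughly like this (a sketch only; MAX_FILE_SIZE and the error wording are illustrative, not the constants in validators.py):

from fastapi import HTTPException, UploadFile

MAX_FILE_SIZE = 10 * 1024 * 1024  # illustrative limit, not the project's actual constant

async def check_csv_size_sketch(file: UploadFile) -> bytes:
    # Seek to the end to measure the upload without reading it into memory.
    file.file.seek(0, 2)
    size = file.file.tell()
    # Reset the cursor so the subsequent read starts at the beginning.
    file.file.seek(0)
    if size > MAX_FILE_SIZE:
        raise HTTPException(status_code=422, detail="CSV file too large")
    return await file.read()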


118-174: LGTM! Robust CSV parsing with case-insensitive headers.

Good approach to normalize headers while preserving original column names for row access. Properly filters empty rows and re-raises HTTPException without wrapping.
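
As an illustration of that header-normalization approach (a sketch, not the repository's parse_csv_items; the required question/answer columns are inferred from the empty-row filtering discussed in this review):

import csv
import io

from fastapi import HTTPException

def parse_csv_items_sketch(csv_content: bytes) -> list[dict[str, str]]:
    reader = csv.DictReader(io.StringIO(csv_content.decode("utf-8")))
    # Map lowercased header names back to the original column names for row access.
    header_map = {name.strip().lower(): name for name in (reader.fieldnames or [])}
    missing = {"question", "answer"} - header_map.keys()
    if missing:
        raise HTTPException(
            status_code=422,
            detail=f"CSV is missing required columns: {sorted(missing)}",
        )
    items: list[dict[str, str]] = []
    for row in reader:
        question = (row.get(header_map["question"]) or "").strip()
        answer = (row.get(header_map["answer"]) or "").strip()
        # Skip rows where either required field is empty.
        if question and answer:
            items.append({"question": question, "answer": answer})
    return items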

backend/app/services/evaluation/dataset.py (4)

24-32: LGTM! Well-typed function signature.

All parameters have proper type hints including the union type for optional description. As per coding guidelines, type hints are correctly applied.


58-74: LGTM! Proper sanitization with informative logging.

Logging follows the [function_name] prefix convention as per coding guidelines.


86-108: LGTM! Graceful degradation for object store failures.

Non-fatal handling allows the workflow to continue when object storage is unavailable, which is appropriate for optional storage backends.
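
The non-fatal pattern being praised here, in general form (the storage call is a placeholder, not the project's actual cloud storage API):

import logging

logger = logging.getLogger(__name__)

def upload_csv_to_store_sketch(storage, key: str, csv_content: bytes) -> str | None:
    """Return the stored object's URL, or None when object storage is unavailable."""
    try:
        return storage.put(key, csv_content)  # placeholder call; the real client differs
    except Exception as e:
        # Non-fatal: log and continue so the Langfuse upload and DB record still happen.
        logger.warning(f"[upload_csv_to_store_sketch] Object store upload failed | {e}")
        return None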


140-163: LGTM! Clean metadata construction and database persistence.

The metadata structure is well-organized and the final logging confirms successful creation.

backend/app/services/evaluation/__init__.py (1)

1-32: LGTM! Clean public API surface definition.

Well-organized exports with clear categorization. The __all__ list provides a clear contract for the evaluation services package.

backend/app/services/evaluation/evaluation.py (8)

47-86: LGTM! Robust assistant config merging logic.

Properly handles the merge precedence where provided config values override assistant values, and conditionally adds tools when vector stores are available.


88-98: LGTM! Proper validation for direct config usage.

Clear error message when model is missing and no assistant_id is provided.


142-170: LGTM! Thorough dataset validation with clear error messages.

Properly validates both dataset existence and Langfuse ID presence before proceeding.


179-200: LGTM! Clean client initialization and evaluation run creation.

Follows the documented workflow steps clearly.


229-236: LGTM! Well-typed return signature.

The tuple[EvaluationRun | None, str | None] return type clearly communicates the possible outcomes.


276-286: LGTM! Proper caching logic for trace scores.

Correctly checks both completion status and existing cached scores before fetching from Langfuse.


321-329: LGTM! Clean score persistence flow.

Uses the separate-session save_score function correctly to persist scores after the primary fetch operation.


219-226: The error handling behavior is correct and intentional. start_evaluation_batch properly sets status = "failed" with an error_message before re-raising the exception. The exception is then caught in start_evaluation, and session.refresh(eval_run) retrieves the updated failure state from the database. The approach is sound—error details are persisted before exception propagation, ensuring the evaluation run maintains failure context when returned to the caller.

backend/app/api/routes/evaluation.py (2)

1-23: LGTM: Clean service-oriented refactoring.

The refactoring to delegate business logic to service modules improves separation of concerns and maintainability. The route handlers are now thin wrappers focused on HTTP concerns.


294-295: Graceful degradation pattern is intentional and well-designed.

Lines 294-295 correctly use APIResponse.failure_response(), which is explicitly designed to support returning data alongside errors. The failure_response() method signature includes an optional data: Optional[T] = None parameter, and the response is constructed with both success=False, error=error_message, and data=data. This pattern is appropriate for cases where the evaluation run exists but additional enrichment (like trace info) fails, allowing clients to receive partial data along with error context.
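
To make the shape of that contract concrete, a generic stand-in (not the project's actual APIResponse in app/utils.py):

from typing import Generic, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class APIResponseSketch(BaseModel, Generic[T]):
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def failure_response(cls, error: str, data: Optional[T] = None) -> "APIResponseSketch[T]":
        # Partial data can travel alongside the error message.
        return cls(success=False, error=error, data=data)

# e.g. the run exists but trace enrichment failed:
# APIResponseSketch.failure_response(error="Failed to fetch trace info", data=run_public)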


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

51-81: Add logging to endpoint and address blocking I/O pattern.

Two issues found:

  1. Missing logging: This endpoint lacks logging despite other endpoints in the file logging operations with [function_name] prefix. Per coding guidelines, add a log statement prefixed with [upload_dataset_endpoint].

  2. Blocking I/O in async context: upload_dataset is a synchronous function that performs blocking I/O operations (database writes, object store uploads, Langfuse API calls). Calling it directly in an async endpoint blocks the event loop. Consider wrapping it with asyncio.to_thread() or refactoring the service function to be async.

Suggested fix for logging:

async def upload_dataset_endpoint(
    ...
) -> APIResponse[DatasetUploadResponse]:
    """Upload an evaluation dataset."""
+   logger.info(
+       f"[upload_dataset_endpoint] Uploading dataset | "
+       f"name={dataset_name} | org_id={auth_context.organization.id} | "
+       f"project_id={auth_context.project.id}"
+   )
    # Validate and read CSV file
    csv_content = await validate_csv_file(file)

    # Upload dataset using service
    dataset = upload_dataset(
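
For the blocking-I/O point above, the asyncio.to_thread wrapping mentioned earlier would look roughly like this inside the endpoint (same arguments the route already passes; shown as a sketch, not a definitive fix):

import asyncio

# after: csv_content = await validate_csv_file(file)
dataset = await asyncio.to_thread(
    upload_dataset,  # the sync service runs in a worker thread instead of on the event loop
    session=_session,
    csv_content=csv_content,
    dataset_name=dataset_name,
    description=description,
    duplication_factor=duplication_factor,
    organization_id=auth_context.organization.id,
    project_id=auth_context.project.id,
)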
🧹 Nitpick comments (4)
backend/app/api/routes/evaluation.py (4)

33-43: Add type hint for dataset parameter.

The dataset parameter is missing a type hint. As per coding guidelines, all function parameters should have type hints.

🔎 Suggested fix:
-def _dataset_to_response(dataset) -> DatasetUploadResponse:
+def _dataset_to_response(dataset: "Dataset") -> DatasetUploadResponse:

You'll need to import the Dataset model or use the appropriate type from your models. If you want to avoid a circular import, you can use a string annotation or TYPE_CHECKING block.


89-110: Add logging for list operation.

This endpoint lacks logging. Per coding guidelines, consider adding a log statement prefixed with [list_datasets_endpoint] to track dataset listing operations.

🔎 Suggested addition:
 def list_datasets_endpoint(
     ...
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets_endpoint] Listing datasets | "
+        f"org_id={auth_context.organization.id} | "
+        f"project_id={auth_context.project.id} | limit={limit} | offset={offset}"
+    )
     # Enforce maximum limit
     if limit > 100:
         limit = 100

186-210: Add logging for evaluation start.

This endpoint lacks logging for the evaluation start operation. Starting an evaluation is a significant action that should be logged for observability and debugging purposes.

🔎 Suggested addition:
 def evaluate(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Start an evaluation run."""
+    logger.info(
+        f"[evaluate] Starting evaluation | dataset_id={dataset_id} | "
+        f"experiment_name={experiment_name} | assistant_id={assistant_id} | "
+        f"org_id={auth_context.organization.id} | project_id={auth_context.project.id}"
+    )
     eval_run = start_evaluation(

247-296: Add logging for evaluation status fetch.

This endpoint lacks logging. Consider adding a log statement prefixed with [get_evaluation_run_status] for consistency with other endpoints like get_dataset.

🔎 Suggested addition:
 def get_evaluation_run_status(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Get evaluation run status with optional trace info."""
+    logger.info(
+        f"[get_evaluation_run_status] Fetching evaluation | id={evaluation_id} | "
+        f"get_trace_info={get_trace_info} | resync_score={resync_score} | "
+        f"org_id={auth_context.organization.id}"
+    )
     if resync_score and not get_trace_info:
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d990add and 8e5ba77.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🔇 Additional comments (2)
backend/app/api/routes/evaluation.py (2)

113-142: LGTM!

Proper logging with [get_dataset] prefix, complete type hints, and appropriate error handling for the 404 case.


145-178: LGTM!

Proper logging with [delete_dataset] prefix at entry and exit. Type hints are complete.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/__init__.py (1)

1-37: Add an explicit __all__ list to declare the public API.

Per PEP 8, modules should explicitly declare the names in their public API using the __all__ attribute. This module currently imports 21+ functions from six submodules without defining __all__, which means all imported names become part of the public API by default. Using __all__ allows you to explicitly specify the public interface of your packages and modules and prevents accidental usage of objects that shouldn't be used from outside the module. Define an __all__ list after the module docstring containing the functions intended as part of the public API.
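
A minimal sketch of such a declaration (the two names shown are the ones added in this PR; the full export list is for the maintainers to decide):

"""CRUD helpers for evaluations."""

from app.crud.evaluations.core import save_score
from app.crud.evaluations.langfuse import fetch_trace_scores_from_langfuse

__all__ = [
    "save_score",
    "fetch_trace_scores_from_langfuse",
    # ... plus the other public re-exports from the remaining submodules
]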

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e5ba77 and 7ba9b43.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/__init__.py (2 hunks)
  • backend/app/services/evaluation/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/evaluation/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/__init__.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/__init__.py (2)
backend/app/crud/evaluations/core.py (1)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
🔇 Additional comments (2)
backend/app/crud/evaluations/__init__.py (2)

8-8: LGTM!

The import of save_score is correctly added and maintains alphabetical ordering with other imports from the core module.


28-28: LGTM!

The import of fetch_trace_scores_from_langfuse is correctly added and maintains alphabetical ordering with other imports from the langfuse module.


codecov bot commented Dec 18, 2025


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/api/routes/evaluation.py (3)

5-36: Critical: Missing imports causing pipeline failure.

The code references require_permission and Permission on lines 60, 99, 129, 162, 199, 232, and 262, but these are not imported. This is causing the pytest collection to fail with NameError: name 'require_permission' is not defined.

🔎 Add missing imports
 from app.api.deps import AuthContextDep, SessionDep
+from app.api.deps import require_permission, Permission
 from app.crud.evaluations import (

43-53: Add type hint for dataset parameter.

The dataset parameter is missing a type hint, which violates the coding guideline requiring type hints on all function parameters.

Based on coding guidelines, all function parameters must have type hints.

🔎 Add type hint
-def _dataset_to_response(dataset) -> DatasetUploadResponse:
+def _dataset_to_response(dataset: EvaluationDataset) -> DatasetUploadResponse:
     """Convert a dataset model to a DatasetUploadResponse."""

Note: You may need to import EvaluationDataset from the appropriate module if not already imported.


101-122: Unify inconsistent auth_context attribute access patterns across all endpoints.

This function uses auth_context.organization_.id and auth_context.project_.id (non-optional property accessors), while other endpoints like upload_dataset_endpoint (lines 88-89) and start_evaluation (lines 221-222) use auth_context.organization.id and auth_context.project.id (optional attribute accessors). All endpoints have REQUIRE_PROJECT permission dependencies, ensuring these values are never None; however, the codebase should consistently use the non-optional property accessors (organization_ and project_) for semantic clarity and fail-fast error handling.

Update lines 114-115 and other inconsistent endpoints to use the underscore versions for uniformity.

♻️ Duplicate comments (3)
backend/app/api/routes/evaluation.py (3)

131-155: Verify inconsistent auth_context attribute access.

Same issue as in list_datasets_endpoint: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 139-140 and 146-147, while refactored endpoints use the non-underscore version. This inconsistency needs to be resolved.


164-192: Verify inconsistent auth_context attribute access.

Same issue: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 172-173, while refactored endpoints use the non-underscore version.


234-255: Verify inconsistent auth_context attribute access.

Same issue: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 243-244 and 250-251, while refactored endpoints use the non-underscore version.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba9b43 and 09c4f4a.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (1)
backend/app/api/routes/evaluation.py (4)
backend/app/services/evaluation/evaluation.py (1)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/validators.py (1)
  • validate_csv_file (71-115)
backend/app/utils.py (3)
  • APIResponse (33-57)
  • success_response (40-43)
  • failure_response (46-57)
🪛 GitHub Actions: Kaapi CI
backend/app/api/routes/evaluation.py

[error] 60-60: NameError: name 'require_permission' is not defined. This occurred during pytest collection while running: 'coverage run --source=app -m pytest'.

🔇 Additional comments (3)
backend/app/api/routes/evaluation.py (3)

62-92: LGTM: Clean service delegation.

The refactored endpoint properly delegates CSV validation and dataset upload to the service layer, maintaining clear separation of concerns. The error handling is appropriately managed by the service functions.


201-225: LGTM: Simplified evaluation orchestration.

The endpoint correctly delegates evaluation orchestration to start_evaluation, simplifying the route handler and moving business logic to the service layer as intended by this refactor.


264-312: LGTM: Proper error handling for evaluation scores.

The endpoint correctly handles the tuple return from get_evaluation_with_scores and appropriately uses failure_response when an error is present. This maintains a consistent API response structure while delegating the complex Langfuse interaction to the service layer.

…ech4DevAI/kaapi-backend into enhancement/evaluation-refactor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
backend/app/crud/evaluations/embeddings.py (1)

83-85: Missing function name prefix in log messages.

Per coding guidelines, log messages should be prefixed with the function name in square brackets. The current format "Building embedding JSONL..." should be "[build_embedding_jsonl] Building embedding JSONL...".

♻️ Suggested logging format
     logger.info(
-        f"Building embedding JSONL for {len(results)} items with model {embedding_model}"
+        f"[build_embedding_jsonl] Building JSONL | items={len(results)} | model={embedding_model}"
     )

Similar updates needed for lines 95, 101, 106, and 127.

backend/app/services/evaluations/validators.py (1)

84-90: Consider handling None filename edge case.

UploadFile.filename can potentially be None in some edge cases. Path(None) would raise a TypeError.

♻️ Suggested defensive check
     # Security validation: Check file extension
+    if not file.filename:
+        raise HTTPException(
+            status_code=422,
+            detail="File must have a filename",
+        )
     file_ext = Path(file.filename).suffix.lower()
backend/app/api/routes/evaluations/dataset.py (1)

94-115: Add type hints for query parameters.

Per coding guidelines, all function parameters should have type hints. The limit and offset parameters are missing explicit type annotations in the function signature (they have defaults but no type hints).

Suggested fix
 def list_datasets_endpoint(
     _session: SessionDep,
     auth_context: AuthContextDep,
-    limit: int = 50,
-    offset: int = 0,
+    limit: int = Query(default=50, ge=1, le=100, description="Maximum number of datasets to return"),
+    offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
 ) -> APIResponse[list[DatasetUploadResponse]]:

This also adds validation constraints, making the explicit if limit > 100 check on line 102-103 unnecessary.

backend/app/api/routes/evaluations/evaluation.py (2)

37-61: Add logging for the evaluate endpoint.

Per coding guidelines, log messages should be prefixed with the function name. The evaluate function has no logging, unlike the other endpoints in this file.

Suggested addition
 def evaluate(
     _session: SessionDep,
     auth_context: AuthContextDep,
     dataset_id: int = Body(..., description="ID of the evaluation dataset"),
     experiment_name: str = Body(
         ..., description="Name for this evaluation experiment/run"
     ),
     config: dict = Body(default_factory=dict, description="Evaluation configuration"),
     assistant_id: str
     | None = Body(
         None, description="Optional assistant ID to fetch configuration from"
     ),
 ) -> APIResponse[EvaluationRunPublic]:
     """Start an evaluation run."""
+    logger.info(
+        f"[evaluate] Starting evaluation | dataset_id={dataset_id} | "
+        f"experiment_name={experiment_name} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
     eval_run = start_evaluation(

70-91: Consider adding validation constraints for pagination parameters.

Similar to the dataset listing endpoint, limit and offset lack validation constraints. Consider using Query with ge/le constraints for consistency across endpoints.

Suggested fix
 def list_evaluation_runs(
     _session: SessionDep,
     auth_context: AuthContextDep,
-    limit: int = 50,
-    offset: int = 0,
+    limit: int = Query(default=50, ge=1, le=100, description="Maximum number of runs to return"),
+    offset: int = Query(default=0, ge=0, description="Number of runs to skip"),
 ) -> APIResponse[list[EvaluationRunPublic]]:
backend/app/services/evaluations/evaluation.py (1)

23-98: Consider using TypedDict for the config parameter and return type.

The config: dict and -> dict type hints are loose. Using a TypedDict would provide better type safety and documentation for the expected structure.

Additionally, the logging at line 60-64 truncates instructions to 50 chars which is good, but consider using mask_string if instructions could contain sensitive data (as per coding guidelines for sensitive values).
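
For instance, a TypedDict along these lines (field names are illustrative, drawn from the model/instructions/tools merging discussed in this review, not the actual schema):

from typing import Any, TypedDict

class EvaluationConfig(TypedDict, total=False):
    # Illustrative fields only; align with whatever build_evaluation_config actually merges.
    model: str
    instructions: str
    tools: list[dict[str, Any]]

# build_evaluation_config(session, config: EvaluationConfig, assistant_id, project_id) -> EvaluationConfig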

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09c4f4a and d128375.

📒 Files selected for processing (17)
  • backend/app/api/main.py
  • backend/app/api/routes/evaluations/__init__.py
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/api/routes/evaluations/evaluation.py
  • backend/app/core/batch/__init__.py
  • backend/app/core/batch/operations.py
  • backend/app/core/batch/polling.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/job/__init__.py
  • backend/app/crud/job/batch.py
  • backend/app/crud/job/job.py
  • backend/app/services/evaluations/__init__.py
  • backend/app/services/evaluations/dataset.py
  • backend/app/services/evaluations/evaluation.py
  • backend/app/services/evaluations/validators.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/job/batch.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/core/batch/operations.py
  • backend/app/core/batch/polling.py
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/services/evaluations/__init__.py
  • backend/app/services/evaluations/validators.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/api/routes/evaluations/evaluation.py
  • backend/app/services/evaluations/evaluation.py
  • backend/app/api/main.py
  • backend/app/core/batch/__init__.py
  • backend/app/services/evaluations/dataset.py
  • backend/app/api/routes/evaluations/__init__.py
  • backend/app/crud/job/__init__.py
  • backend/app/crud/evaluations/embeddings.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/api/routes/evaluations/evaluation.py
  • backend/app/api/main.py
  • backend/app/api/routes/evaluations/__init__.py
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/evaluations/__init__.py
  • backend/app/services/evaluations/validators.py
  • backend/app/services/evaluations/evaluation.py
  • backend/app/services/evaluations/dataset.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/crud/*.py : Use CRUD pattern for database access operations located in `backend/app/crud/`

Applied to files:

  • backend/app/crud/job/__init__.py
🧬 Code graph analysis (11)
backend/app/crud/job/batch.py (2)
backend/app/core/batch/operations.py (4)
  • download_batch_results (86-112)
  • process_completed_batch (115-158)
  • start_batch_job (17-83)
  • upload_batch_results_to_object_store (161-188)
backend/app/core/batch/polling.py (1)
  • poll_batch_status (15-53)
backend/app/crud/evaluations/batch.py (2)
backend/app/core/batch/openai.py (1)
  • OpenAIBatchProvider (14-254)
backend/app/core/batch/operations.py (1)
  • start_batch_job (17-83)
backend/app/core/batch/operations.py (1)
backend/app/crud/job/job.py (2)
  • create_batch_job (13-56)
  • update_batch_job (76-120)
backend/app/core/batch/polling.py (3)
backend/app/core/batch/base.py (1)
  • BatchProvider (7-105)
backend/app/crud/job/job.py (1)
  • update_batch_job (76-120)
backend/app/models/batch_job.py (2)
  • BatchJob (15-136)
  • BatchJobUpdate (156-165)
backend/app/api/routes/evaluations/dataset.py (4)
backend/app/crud/evaluations/dataset.py (2)
  • get_dataset_by_id (107-140)
  • list_datasets (175-210)
backend/app/models/evaluation.py (2)
  • DatasetUploadResponse (25-44)
  • EvaluationDataset (73-167)
backend/app/api/permissions.py (2)
  • Permission (10-15)
  • require_permission (45-70)
backend/app/utils.py (2)
  • APIResponse (33-57)
  • success_response (40-43)
backend/app/services/evaluations/__init__.py (3)
backend/app/services/evaluations/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluations/evaluation.py (3)
  • build_evaluation_config (23-98)
  • get_evaluation_with_scores (229-329)
  • start_evaluation (101-226)
backend/app/services/evaluations/validators.py (3)
  • parse_csv_items (118-174)
  • sanitize_dataset_name (23-68)
  • validate_csv_file (71-115)
backend/app/crud/evaluations/processing.py (5)
backend/app/core/batch/openai.py (2)
  • OpenAIBatchProvider (14-254)
  • download_batch_results (143-191)
backend/app/core/batch/operations.py (1)
  • download_batch_results (86-112)
backend/app/core/batch/base.py (1)
  • download_batch_results (55-72)
backend/app/core/batch/polling.py (1)
  • poll_batch_status (15-53)
backend/app/crud/job/job.py (1)
  • get_batch_job (59-73)
backend/app/core/batch/__init__.py (3)
backend/app/core/batch/openai.py (2)
  • OpenAIBatchProvider (14-254)
  • download_batch_results (143-191)
backend/app/core/batch/operations.py (4)
  • download_batch_results (86-112)
  • process_completed_batch (115-158)
  • start_batch_job (17-83)
  • upload_batch_results_to_object_store (161-188)
backend/app/core/batch/polling.py (1)
  • poll_batch_status (15-53)
backend/app/services/evaluations/dataset.py (6)
backend/app/core/cloud/storage.py (1)
  • get_cloud_storage (261-278)
backend/app/crud/evaluations/dataset.py (1)
  • create_evaluation_dataset (31-104)
backend/app/crud/evaluations/langfuse.py (1)
  • upload_dataset_to_langfuse (224-315)
backend/app/models/evaluation.py (1)
  • EvaluationDataset (73-167)
backend/app/services/evaluations/validators.py (2)
  • parse_csv_items (118-174)
  • sanitize_dataset_name (23-68)
backend/app/utils.py (1)
  • get_langfuse_client (212-248)
backend/app/crud/job/__init__.py (1)
backend/app/crud/job/job.py (6)
  • create_batch_job (13-56)
  • delete_batch_job (197-222)
  • get_batch_job (59-73)
  • get_batch_jobs_by_ids (123-149)
  • get_batches_by_type (152-194)
  • update_batch_job (76-120)
backend/app/crud/evaluations/embeddings.py (2)
backend/app/core/batch/openai.py (1)
  • OpenAIBatchProvider (14-254)
backend/app/core/batch/operations.py (1)
  • start_batch_job (17-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (34)
backend/app/core/batch/operations.py (2)

11-11: LGTM: Import path updated correctly.

The import path has been updated to reflect the new CRUD structure where batch job operations are now under app.crud.job.job. This aligns with the broader refactor that consolidates batch-related CRUD operations.


17-188: Type hints and logging look good.

All functions have complete type hints on parameters and return values, and log messages properly use the [function_name] prefix format.

backend/app/crud/job/__init__.py (1)

1-24: Well-structured public API aggregator.

This module correctly re-exports batch job CRUD operations and provides clear guidance to users about where to import batch operations (from app.core.batch). The separation between CRUD operations (here) and batch operations (in app.core.batch) follows good architectural practices.

backend/app/crud/evaluations/batch.py (2)

17-17: LGTM: Import consolidation improves maintainability.

The consolidated import from app.core.batch simplifies the code and aligns with the refactor that centralizes batch operations. This makes it easier to maintain and understand the dependency structure.


23-211: Code follows all coding guidelines.

All functions have complete type hints and proper log message formatting with function names in square brackets. The evaluation batch orchestration logic is well-structured and delegates appropriately to the generic batch infrastructure.

backend/app/api/main.py (1)

23-23: Router refactor is consistent.

The change from evaluation to evaluations aligns with the refactor that reorganizes the evaluation routes from a single file into a package structure. The router wiring correctly includes the new evaluations.router.

Also applies to: 40-40

backend/app/api/routes/evaluations/__init__.py (1)

1-13: Router structure follows best practices.

The router configuration properly organizes evaluation endpoints by domain with nested dataset routes under /evaluations/datasets and direct evaluation routes under /evaluations. This provides a clean, hierarchical API structure.

Verified that endpoint implementations in dataset.py and evaluation.py properly load Swagger descriptions from external markdown files using load_description(). All endpoint functions include complete type hints on parameters and return types, and all logger calls use the correct [function_name] prefix format as specified in the coding guidelines.

backend/app/crud/job/batch.py (1)

1-22: Clean re-export layer for batch operations.

This module correctly consolidates batch-related CRUD operations from the core layer, providing a convenient import surface for consumers. The __all__ list properly documents the public API.

backend/app/crud/evaluations/processing.py (6)

22-39: Import consolidation aligns with the refactor.

The updated imports from app.core.batch and app.crud.job correctly centralize batch utilities, improving code organization and reducing import fragmentation.


46-177: Well-structured parsing with proper error handling.

The function correctly:

  • Uses type hints for parameters and return value
  • Follows the [function_name] logging prefix convention
  • Gracefully handles parsing errors with continue statements
  • Creates a lookup map for O(1) dataset item access

180-317: Comprehensive evaluation processing with good observability.

The async function properly:

  • Declares all parameter and return type hints
  • Uses consistent log prefixes with org/project/eval context
  • Handles embedding batch failures gracefully without failing the entire evaluation
  • Updates evaluation status appropriately on errors

320-441: Robust embedding batch processing with trace updates.

The function correctly handles the complete embedding workflow including Langfuse trace updates. The decision to mark as "completed" with an error message (rather than "failed") when embedding processing fails is a reasonable design choice to avoid blocking the overall evaluation.


444-643: Comprehensive batch status checking with dual-batch handling.

The function properly orchestrates both embedding and main response batch processing, with appropriate status transitions and error handling throughout.


646-807: Well-organized batch polling with project-level credential isolation.

The function correctly groups evaluations by project to handle per-project API credentials, with comprehensive error handling that persists failure status to the database for each failed run.

backend/app/crud/evaluations/embeddings.py (6)

18-18: Import consolidation aligns with the batch refactor.

The centralized import from app.core.batch correctly sources both OpenAIBatchProvider and start_batch_job, matching the new package structure.


32-46: Simple validation function with proper type hints.

The function correctly validates embedding models against the allowed list and raises a descriptive ValueError for invalid models.


131-217: Robust embedding result parsing with proper error handling.

The function correctly parses embedding pairs by index and handles various error conditions gracefully. Type hints are complete.


220-251: Correct cosine similarity implementation with edge case handling.

The function properly handles zero vectors to avoid division by zero, and the numpy-based implementation is efficient.
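
For reference, a standard numpy formulation of that guard (a generic sketch, not the repository's exact function):

import numpy as np

def cosine_similarity_sketch(a: list[float], b: list[float]) -> float:
    va, vb = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    norm_a, norm_b = np.linalg.norm(va), np.linalg.norm(vb)
    # Guard against division by zero when either embedding is a zero vector.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(va, vb) / (norm_a * norm_b))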


254-256: Type hints and trailing comma added.

The function signature update maintains good formatting practices.


333-433: Well-orchestrated embedding batch creation.

The function correctly:

  • Validates and falls back to default embedding model
  • Uses the centralized start_batch_job infrastructure
  • Links the embedding batch job to the evaluation run
backend/app/core/batch/polling.py (1)

15-53: Clean polling implementation with proper logging.

The function correctly:

  • Uses type hints for all parameters and return value
  • Follows the [function_name] logging prefix convention
  • Only updates the database when status actually changes (optimization)
  • Propagates optional fields like provider_output_file_id and error_message
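
The change-detection optimization noted above, sketched generically (the provider call, field names, and parameter types here are hypothetical stand-ins, not the actual BatchProvider or update_batch_job signatures):

import logging
from typing import Any

logger = logging.getLogger(__name__)

def poll_batch_status_sketch(session: Any, provider: Any, batch_job: Any) -> str:
    """Compare the provider's status with the stored one and persist only real changes."""
    remote = provider.get_batch(batch_job.provider_batch_id)  # hypothetical provider call
    if remote.status == batch_job.status:
        # No change: skip the DB write.
        return batch_job.status
    logger.info(
        f"[poll_batch_status_sketch] Status change | {batch_job.status} -> {remote.status}"
    )
    batch_job.status = remote.status
    # Propagate optional fields when the provider reports them.
    batch_job.provider_output_file_id = getattr(remote, "output_file_id", None)
    batch_job.error_message = getattr(remote, "error_message", None)
    session.add(batch_job)
    session.commit()
    return remote.status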
backend/app/core/batch/__init__.py (1)

1-21: Well-organized public API surface for batch operations.

The package init correctly consolidates all batch-related utilities into a single import surface, making it easy for consumers to access BatchProvider, OpenAIBatchProvider, and all batch operations from one import.

backend/app/services/evaluations/dataset.py (1)

24-163: Well-structured service layer for dataset upload orchestration.

The function correctly:

  • Uses complete type hints for all parameters and return value
  • Follows the [function_name] logging prefix convention throughout
  • Implements a resilient workflow that continues without object store if upload fails
  • Properly sequences the steps: sanitize → parse → upload to storage → upload to Langfuse → create DB record
  • Uses appropriate HTTP status codes (422 for validation, 500 for server errors)

As per coding guidelines, business logic is correctly implemented in the services layer.

backend/app/services/evaluations/validators.py (2)

23-68: Robust dataset name sanitization for Langfuse compatibility.

The function correctly handles:

  • Case normalization
  • Space/hyphen replacement
  • Special character removal
  • Underscore collapse
  • Empty name validation

The docstring examples clearly document expected behavior.
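
A rough sketch of sanitization rules like those listed above (the regexes are illustrative; the actual rules are the ones documented in the function's docstring):

import re

from fastapi import HTTPException

def sanitize_dataset_name_sketch(name: str) -> str:
    sanitized = name.strip().lower()
    # Replace runs of spaces and hyphens with a single underscore.
    sanitized = re.sub(r"[\s\-]+", "_", sanitized)
    # Drop any remaining special characters.
    sanitized = re.sub(r"[^a-z0-9_]", "", sanitized)
    # Collapse consecutive underscores and trim them from the ends.
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    if not sanitized:
        raise HTTPException(status_code=422, detail="Dataset name is empty after sanitization")
    return sanitized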


118-174: Robust CSV parsing with case-insensitive header matching.

The function correctly:

  • Uses type hints for parameters and return value
  • Handles case-insensitive header matching
  • Filters out rows with empty question or answer
  • Provides clear error messages with column information
  • Follows the logging prefix convention
backend/app/api/routes/evaluations/dataset.py (4)

36-46: LGTM! Helper function is well-structured.

The conversion function correctly maps EvaluationDataset fields to DatasetUploadResponse, with sensible defaults for missing metadata keys.


49-85: LGTM! Upload endpoint follows best practices.

The endpoint correctly validates CSV content, delegates to the service layer, and uses project-level permissions. The form parameters with validation constraints (ge=1, le=5 for duplication_factor) are appropriate.


118-148: LGTM! Get endpoint is well-implemented.

The endpoint correctly logs access with the function name prefix, handles the not-found case with a 404, and uses the helper function for response conversion.


151-185: LGTM! Delete endpoint handles error cases appropriately.

The endpoint properly distinguishes between "not found" (404) and other errors (400) based on the message content, and logs both the attempt and success.

backend/app/services/evaluations/__init__.py (1)

1-16: LGTM! Clean package exports.

The __init__.py properly consolidates the public API surface for the evaluations services package, making imports cleaner for consumers (e.g., from app.services.evaluations import upload_dataset).

backend/app/api/routes/evaluations/evaluation.py (1)

100-148: LGTM! Status endpoint is well-implemented.

The endpoint correctly:

  • Validates the resync_score / get_trace_info dependency
  • Returns appropriate 404 for missing evaluations
  • Uses failure_response when there's an error but data is still available
backend/app/services/evaluations/evaluation.py (3)

101-226: LGTM! Evaluation start flow is well-orchestrated.

The function:

  • Validates dataset existence and Langfuse ID requirement
  • Builds config via the dedicated function
  • Creates the run record before starting the batch (good for error tracking)
  • Handles batch failures gracefully by refreshing and returning the run

The logging follows the [function_name] prefix convention correctly.


283-286: Potential issue: cache check may fail if score is an empty dict.

The check eval_run.score is not None and "traces" in eval_run.score evaluates to False when score is {} (not None, but with no "traces" key), so such a run would fall through to a Langfuse fetch; this edge case seems unlikely given the flow.

Consider whether an empty score dict (without traces key) could ever be persisted. If so, the condition correctly handles it by fetching from Langfuse.


229-329: LGTM! Score fetching with caching is well-designed.

The function properly:

  • Returns early for non-completed evaluations with a helpful message
  • Uses cached scores when available (unless resync_score is true)
  • Handles both ValueError and general exceptions distinctly
  • Captures necessary data before external calls to avoid session issues

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/crud/evaluations/embeddings.py (4)

49-128: Apply the log prefix guideline to all log messages in this function.

Line 84 correctly follows the coding guideline by prefixing the log message with [build_embedding_jsonl]. However, the log messages at lines 95, 101, 106, and 127 are missing the function name prefix.

As per coding guidelines: "Prefix all log messages with the function name in square brackets"

♻️ Apply consistent log prefixes
-            logger.warning("Skipping result with no item_id")
+            logger.warning("[build_embedding_jsonl] Skipping result with no item_id")

-            logger.warning(f"Skipping item {item_id} - no trace_id found")
+            logger.warning(f"[build_embedding_jsonl] Skipping item {item_id} - no trace_id found")

-            logger.warning(f"Skipping item {item_id} - empty output or ground_truth")
+            logger.warning(f"[build_embedding_jsonl] Skipping item {item_id} - empty output or ground_truth")

-    logger.info(f"Built {len(jsonl_data)} embedding JSONL lines")
+    logger.info(f"[build_embedding_jsonl] Built {len(jsonl_data)} embedding JSONL lines")

Based on coding guidelines.


131-217: Apply the log prefix guideline to all log messages in this function.

All log messages in this function are missing the required function name prefix. As per coding guidelines: "Prefix all log messages with the function name in square brackets"

♻️ Apply consistent log prefixes
-    logger.info(f"Parsing embedding results from {len(raw_results)} lines")
+    logger.info(f"[parse_embedding_results] Parsing embedding results from {len(raw_results)} lines")

-                logger.warning(f"Line {line_num}: No custom_id found, skipping")
+                logger.warning(f"[parse_embedding_results] Line {line_num}: No custom_id found, skipping")

-                logger.error(f"Trace {trace_id} had error: {error_msg}")
+                logger.error(f"[parse_embedding_results] Trace {trace_id} had error: {error_msg}")

-                logger.warning(
-                    f"Trace {trace_id}: Expected 2 embeddings, got {len(embedding_data)}"
-                )
+                logger.warning(
+                    f"[parse_embedding_results] Trace {trace_id}: Expected 2 embeddings, got {len(embedding_data)}"
+                )

-                logger.warning(
-                    f"Trace {trace_id}: Missing embeddings (output={output_embedding is not None}, "
-                    f"ground_truth={ground_truth_embedding is not None})"
-                )
+                logger.warning(
+                    f"[parse_embedding_results] Trace {trace_id}: Missing embeddings (output={output_embedding is not None}, "
+                    f"ground_truth={ground_truth_embedding is not None})"
+                )

-            logger.error(f"Line {line_num}: Unexpected error: {e}", exc_info=True)
+            logger.error(f"[parse_embedding_results] Line {line_num}: Unexpected error: {e}", exc_info=True)

-    logger.info(
-        f"Parsed {len(embedding_pairs)} embedding pairs from {len(raw_results)} lines"
-    )
+    logger.info(
+        f"[parse_embedding_results] Parsed {len(embedding_pairs)} embedding pairs from {len(raw_results)} lines"
+    )

Based on coding guidelines.


254-330: Apply the log prefix guideline to all log messages in this function.

All log messages in this function are missing the required function name prefix. As per coding guidelines: "Prefix all log messages with the function name in square brackets"

♻️ Apply consistent log prefixes
-    logger.info(f"Calculating similarity for {len(embedding_pairs)} pairs")
+    logger.info(f"[calculate_average_similarity] Calculating similarity for {len(embedding_pairs)} pairs")

-            logger.error(
-                f"Error calculating similarity for trace {pair.get('trace_id')}: {e}"
-            )
+            logger.error(
+                f"[calculate_average_similarity] Error calculating similarity for trace {pair.get('trace_id')}: {e}"
+            )

-        logger.warning("No valid similarities calculated")
+        logger.warning("[calculate_average_similarity] No valid similarities calculated")

-    logger.info(
-        f"Calculated similarity stats: avg={stats['cosine_similarity_avg']:.3f}, "
-        f"std={stats['cosine_similarity_std']:.3f}"
-    )
+    logger.info(
+        f"[calculate_average_similarity] Calculated similarity stats: avg={stats['cosine_similarity_avg']:.3f}, "
+        f"std={stats['cosine_similarity_std']:.3f}"
+    )

Based on coding guidelines.


333-433: Apply the log prefix guideline to all log messages in this function.

All log messages in this function are missing the required function name prefix. As per coding guidelines: "Prefix all log messages with the function name in square brackets"

♻️ Apply consistent log prefixes
-        logger.info(f"Starting embedding batch for evaluation run {eval_run.id}")
+        logger.info(f"[start_embedding_batch] Starting embedding batch for evaluation run {eval_run.id}")

-            logger.warning(
-                f"Invalid embedding model '{embedding_model}' in config: {e}. "
-                f"Falling back to text-embedding-3-large"
-            )
+            logger.warning(
+                f"[start_embedding_batch] Invalid embedding model '{embedding_model}' in config: {e}. "
+                f"Falling back to text-embedding-3-large"
+            )

-        logger.info(
-            f"Successfully started embedding batch: batch_job_id={batch_job.id}, "
-            f"provider_batch_id={batch_job.provider_batch_id} "
-            f"for evaluation run {eval_run.id} with {batch_job.total_items} items"
-        )
+        logger.info(
+            f"[start_embedding_batch] Successfully started embedding batch: batch_job_id={batch_job.id}, "
+            f"provider_batch_id={batch_job.provider_batch_id} "
+            f"for evaluation run {eval_run.id} with {batch_job.total_items} items"
+        )

-        logger.error(f"Failed to start embedding batch: {e}", exc_info=True)
+        logger.error(f"[start_embedding_batch] Failed to start embedding batch: {e}", exc_info=True)

Based on coding guidelines.

🧹 Nitpick comments (4)
backend/app/api/routes/evaluations/dataset.py (4)

55-85: Add structured logging for the upload operation.

The get_dataset and delete_dataset endpoints include logging, but this upload endpoint does not. Consider adding logging after the upload completes to maintain consistency and aid in debugging.

📝 Suggested logging addition

Add logging after line 83:

    dataset = upload_dataset(
        session=_session,
        csv_content=csv_content,
        dataset_name=dataset_name,
        description=description,
        duplication_factor=duplication_factor,
        organization_id=auth_context.organization_.id,
        project_id=auth_context.project_.id,
    )
    
    logger.info(
        f"[upload_dataset_endpoint] Successfully uploaded dataset | "
        f"id={dataset.id} | name={dataset_name} | "
        f"org_id={auth_context.organization_.id} | "
        f"project_id={auth_context.project_.id}"
    )

    return APIResponse.success_response(data=_dataset_to_response(dataset))

Based on coding guidelines requiring log message prefixes with function names.


94-117: Add structured logging for consistency.

Like the upload endpoint, this lacks logging while get_dataset and delete_dataset include it. Adding a log statement would improve observability for list operations.

📝 Suggested logging addition

Add logging after line 113:

    datasets = list_datasets(
        session=_session,
        organization_id=auth_context.organization_.id,
        project_id=auth_context.project_.id,
        limit=limit,
        offset=offset,
    )
    
    logger.info(
        f"[list_datasets_endpoint] Retrieved datasets | "
        f"count={len(datasets)} | limit={limit} | offset={offset} | "
        f"org_id={auth_context.organization_.id} | "
        f"project_id={auth_context.project_.id}"
    )

    return APIResponse.success_response(
        data=[_dataset_to_response(dataset) for dataset in datasets]
    )

Based on coding guidelines requiring log message prefixes with function names.


104-105: Remove redundant limit check.

The Query parameter on line 98 already enforces le=100, making this manual check unnecessary.

♻️ Proposed simplification
     offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
-    # Enforce maximum limit
-    if limit > 100:
-        limit = 100
-
     datasets = list_datasets(
         session=_session,

178-182: Consider more robust error handling.

The string matching on line 179 (if "not found" in message.lower()) is fragile. If the CRUD layer changes its error message format, this logic could break.

Consider one of these approaches:

  • Have delete_dataset_crud raise specific exception types (e.g., DatasetNotFoundError) instead of returning success/message tuples
  • Return an enum or error code alongside the message for more reliable classification
  • Document the expected message formats in the CRUD layer

For now, the current implementation works, but a future refactor would improve maintainability.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d128375 and 696dba6.

📒 Files selected for processing (3)
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/services/evaluations/validators.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/evaluations/validators.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/crud/evaluations/embeddings.py
🧬 Code graph analysis (2)
backend/app/api/routes/evaluations/dataset.py (6)
backend/app/crud/evaluations/dataset.py (2)
  • get_dataset_by_id (107-140)
  • list_datasets (175-210)
backend/app/models/evaluation.py (2)
  • DatasetUploadResponse (25-44)
  • EvaluationDataset (73-167)
backend/app/api/permissions.py (2)
  • Permission (10-15)
  • require_permission (45-70)
backend/app/services/evaluations/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluations/validators.py (1)
  • validate_csv_file (71-120)
backend/app/utils.py (2)
  • APIResponse (33-57)
  • success_response (40-43)
backend/app/crud/evaluations/embeddings.py (2)
backend/app/core/batch/openai.py (1)
  • OpenAIBatchProvider (14-254)
backend/app/core/batch/operations.py (1)
  • start_batch_job (17-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (3)
backend/app/api/routes/evaluations/dataset.py (2)

36-46: LGTM! Clean conversion helper.

The function properly converts the database model to the response schema with safe dictionary access using defaults.


120-150: LGTM! Well-structured endpoint.

Proper logging, error handling, and type hints. The 404 response for missing datasets is appropriate.

backend/app/crud/evaluations/embeddings.py (1)

18-18: The centralized batch imports are properly re-exported.

The refactoring correctly imports OpenAIBatchProvider and start_batch_job from app.core.batch, which properly re-exports both symbols from their respective implementation files (openai.py and operations.py) and includes them in the module's __all__ declaration.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @backend/app/api/routes/evaluations/dataset.py:
- Around line 94-113: The endpoint list_datasets_endpoint lacks structured
logging; add a log call prefixed with "[list_datasets_endpoint]" before/after
calling list_datasets that records relevant fields (organization id, project id,
limit, offset, and number of datasets returned) using the same logger used
elsewhere (e.g., processLogger or module logger); place the log near the call to
list_datasets and ensure it executes whether the call succeeds or fails (include
error logging on exception), leaving the existing return via
APIResponse.success_response and conversion via _dataset_to_response unchanged.
- Around line 55-85: Add structured logging to upload_dataset_endpoint: log an
info entry prefixed with "[upload_dataset_endpoint]" before calling
upload_dataset and another on success (including dataset.id/name) and log an
error with context on exceptions. Use the existing logger used by
get_dataset/delete_dataset (match their format), include key fields like
auth_context.organization_.id, auth_context.project_.id, dataset_name,
duplication_factor, and any validation errors from validate_csv_file; ensure
logs reference function name in square brackets and do not leak sensitive data.
- Line 72: Add structured info-level logging to the upload_dataset_endpoint:
insert a logger.info call prefixed with "[upload_dataset_endpoint]" (matching
get_dataset/delete_dataset) to log the upload operation after validating the
file (near the call to validate_csv_file and variable csv_content). Use the same
message pattern as other endpoints (function name in square brackets and a
concise description of the action) so logs are consistent with the existing
logger usage.
🧹 Nitpick comments (1)
backend/app/api/routes/evaluations/dataset.py (1)

174-178: Consider more robust error detection.

The error type detection relies on string matching ("not found" in message.lower()), which is fragile. If the CRUD layer changes its error message format, the HTTP status code mapping will break.

💡 More robust alternatives

Consider one of these approaches:

Option 1: Have delete_dataset_crud raise specific exceptions instead of returning (bool, str):

# In CRUD layer
raise DatasetNotFoundError(f"Dataset {dataset_id} not found")
raise DatasetDeleteError(f"Failed to delete: {reason}")

# In API layer
try:
    delete_dataset_crud(...)
except DatasetNotFoundError as e:
    raise HTTPException(status_code=404, detail=str(e))
except DatasetDeleteError as e:
    raise HTTPException(status_code=400, detail=str(e))

Option 2: Return an error code enum alongside the message:

# CRUD returns: (success: bool, message: str, error_code: ErrorCode | None)
success, message, error_code = delete_dataset_crud(...)
if not success:
    status = 404 if error_code == ErrorCode.NOT_FOUND else 400
    raise HTTPException(status_code=status, detail=message)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 696dba6 and 684a6ef.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluations/dataset.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluations/dataset.py
🧬 Code graph analysis (1)
backend/app/api/routes/evaluations/dataset.py (7)
backend/app/crud/evaluations/dataset.py (2)
  • get_dataset_by_id (107-140)
  • list_datasets (175-210)
backend/app/models/evaluation.py (2)
  • DatasetUploadResponse (25-44)
  • EvaluationDataset (73-167)
backend/app/api/permissions.py (2)
  • Permission (10-15)
  • require_permission (45-70)
backend/app/services/evaluations/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluations/validators.py (1)
  • validate_csv_file (71-120)
backend/app/utils.py (2)
  • APIResponse (33-57)
  • success_response (40-43)
backend/app/models/auth.py (2)
  • organization_ (25-29)
  • project_ (32-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (3)
backend/app/api/routes/evaluations/dataset.py (3)

116-146: LGTM! Proper logging and error handling.

The endpoint correctly implements structured logging with the function name prefix, appropriate 404 error handling, and has complete type hints.


161-165: LGTM! Proper structured logging.

The endpoint correctly implements structured logging with the function name prefix as required by the coding guidelines.

Also applies to: 180-180


36-46: Remove the review comment - the concern is unfounded.

The dataset_metadata field in the EvaluationDataset model is guaranteed to never be None. The field is declared as dataset_metadata: dict[str, Any] with default_factory=dict and a database-level constraint of nullable=False. The code in _dataset_to_response is safe and correct.

Likely an incorrect or invalid review comment.

Comment on lines 55 to 85
async def upload_dataset_endpoint(
    _session: SessionDep,
    auth_context: AuthContextDep,
    file: UploadFile = File(
        ..., description="CSV file with 'question' and 'answer' columns"
    ),
    dataset_name: str = Form(..., description="Name for the dataset"),
    description: str | None = Form(None, description="Optional dataset description"),
    duplication_factor: int = Form(
        default=1,
        ge=1,
        le=5,
        description="Number of times to duplicate each item (min: 1, max: 5)",
    ),
) -> APIResponse[DatasetUploadResponse]:
    """Upload an evaluation dataset."""
    # Validate and read CSV file
    csv_content = await validate_csv_file(file)

    # Upload dataset using service
    dataset = upload_dataset(
        session=_session,
        csv_content=csv_content,
        dataset_name=dataset_name,
        description=description,
        duplication_factor=duplication_factor,
        organization_id=auth_context.organization_.id,
        project_id=auth_context.project_.id,
    )

    return APIResponse.success_response(data=_dataset_to_response(dataset))

⚠️ Potential issue | 🟠 Major

Add structured logging for the upload operation.

According to the coding guidelines, all operations should include structured logging with the function name prefixed in square brackets. The get_dataset and delete_dataset endpoints follow this pattern, but this endpoint is missing logging.

📝 Suggested logging additions
 async def upload_dataset_endpoint(
     _session: SessionDep,
     auth_context: AuthContextDep,
     file: UploadFile = File(
         ..., description="CSV file with 'question' and 'answer' columns"
     ),
     dataset_name: str = Form(..., description="Name for the dataset"),
     description: str | None = Form(None, description="Optional dataset description"),
     duplication_factor: int = Form(
         default=1,
         ge=1,
         le=5,
         description="Number of times to duplicate each item (min: 1, max: 5)",
     ),
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset_endpoint] Uploading dataset | "
+        f"dataset_name={dataset_name} | "
+        f"duplication_factor={duplication_factor} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)
 
     # Upload dataset using service
     dataset = upload_dataset(
         session=_session,
         csv_content=csv_content,
         dataset_name=dataset_name,
         description=description,
         duplication_factor=duplication_factor,
         organization_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
     )
 
+    logger.info(
+        f"[upload_dataset_endpoint] Successfully uploaded dataset | "
+        f"dataset_id={dataset.id} | "
+        f"total_items={dataset.dataset_metadata.get('total_items_count', 0)}"
+    )
+
     return APIResponse.success_response(data=_dataset_to_response(dataset))

As per coding guidelines, all log messages should be prefixed with the function name in square brackets.

🤖 Prompt for AI Agents
In @backend/app/api/routes/evaluations/dataset.py around lines 55 - 85, Add
structured logging to upload_dataset_endpoint: log an info entry prefixed with
"[upload_dataset_endpoint]" before calling upload_dataset and another on success
(including dataset.id/name) and log an error with context on exceptions. Use the
existing logger used by get_dataset/delete_dataset (match their format), include
key fields like auth_context.organization_.id, auth_context.project_.id,
dataset_name, duplication_factor, and any validation errors from
validate_csv_file; ensure logs reference function name in square brackets and do
not leak sensitive data.

) -> APIResponse[DatasetUploadResponse]:
    """Upload an evaluation dataset."""
    # Validate and read CSV file
    csv_content = await validate_csv_file(file)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for validate_csv_file function definition with async keyword
ast-grep --pattern $'async def validate_csv_file($$$):
  $$$'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Search for validate_csv_file function definition without async
ast-grep --pattern $'def validate_csv_file($$$):
  $$$'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Find the validators module and examine validate_csv_file
fd -t f "validators.py" backend/app/services/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 121


🏁 Script executed:

#!/bin/bash
# Get the full endpoint function to check for logging and other issues
fd -t f "dataset.py" backend/app/api/routes/evaluations/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 120


🏁 Script executed:

#!/bin/bash
# Check the validators file to find validate_csv_file definition
wc -l backend/app/services/evaluations/validators.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 125


🏁 Script executed:

#!/bin/bash
# Read the validators file
cat -n backend/app/services/evaluations/validators.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 6580


🏁 Script executed:

#!/bin/bash
# Check the dataset.py file size
wc -l backend/app/api/routes/evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 124


🏁 Script executed:

#!/bin/bash
# Read the dataset.py file
cat -n backend/app/api/routes/evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 7149


Add structured logging to the endpoint.

The upload_dataset_endpoint function is missing structured logging. Other endpoints in the same file (get_dataset, delete_dataset) follow the coding guideline of prefixing log messages with the function name in square brackets. Add logger.info(f"[upload_dataset_endpoint] ...") to log the upload operation, matching the pattern used elsewhere in this file.

🤖 Prompt for AI Agents
In @backend/app/api/routes/evaluations/dataset.py at line 72, Add structured
info-level logging to the upload_dataset_endpoint: insert a logger.info call
prefixed with "[upload_dataset_endpoint]" (matching get_dataset/delete_dataset)
to log the upload operation after validating the file (near the call to
validate_csv_file and variable csv_content). Use the same message pattern as
other endpoints (function name in square brackets and a concise description of
the action) so logs are consistent with the existing logger usage.

Comment on lines 94 to 113
def list_datasets_endpoint(
    _session: SessionDep,
    auth_context: AuthContextDep,
    limit: int = Query(
        default=50, ge=1, le=100, description="Maximum number of datasets to return"
    ),
    offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
) -> APIResponse[list[DatasetUploadResponse]]:
    """List evaluation datasets."""
    datasets = list_datasets(
        session=_session,
        organization_id=auth_context.organization_.id,
        project_id=auth_context.project_.id,
        limit=limit,
        offset=offset,
    )

    return APIResponse.success_response(
        data=[_dataset_to_response(dataset) for dataset in datasets]
    )

⚠️ Potential issue | 🟠 Major

Add structured logging for the list operation.

According to the coding guidelines, all operations should include structured logging with the function name prefixed in square brackets. This endpoint is missing logging, while get_dataset and delete_dataset follow the correct pattern.

📝 Suggested logging additions
 def list_datasets_endpoint(
     _session: SessionDep,
     auth_context: AuthContextDep,
     limit: int = Query(
         default=50, ge=1, le=100, description="Maximum number of datasets to return"
     ),
     offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets_endpoint] Listing datasets | "
+        f"limit={limit} | offset={offset} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     datasets = list_datasets(
         session=_session,
         organization_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
         limit=limit,
         offset=offset,
     )
 
+    logger.info(
+        f"[list_datasets_endpoint] Retrieved {len(datasets)} dataset(s)"
+    )
+
     return APIResponse.success_response(
         data=[_dataset_to_response(dataset) for dataset in datasets]
     )

As per coding guidelines, all log messages should be prefixed with the function name in square brackets.

🤖 Prompt for AI Agents
In @backend/app/api/routes/evaluations/dataset.py around lines 94 - 113, The
endpoint list_datasets_endpoint lacks structured logging; add a log call
prefixed with "[list_datasets_endpoint]" before/after calling list_datasets that
records relevant fields (organization id, project id, limit, offset, and number
of datasets returned) using the same logger used elsewhere (e.g., processLogger
or module logger); place the log near the call to list_datasets and ensure it
executes whether the call succeeds or fails (include error logging on
exception), leaving the existing return via APIResponse.success_response and
conversion via _dataset_to_response unchanged.

router = APIRouter(prefix="/evaluations", tags=["evaluation"])

# Include dataset routes under /evaluations/datasets
router.include_router(dataset.router, prefix="/datasets")
Collaborator

What was the need for a specific __init__ file for the evaluation router, since we haven't followed this pattern till now?


poll_batch_status(session=session, provider=provider, batch_job=batch_job)

# Refresh batch_job to get the updated provider_status
Collaborator

remove comments that are not needed

Collaborator Author

we need this comment

)

__all__ = [
    # CRUD operations
Collaborator

this comment is not needed here either

organization_id: Organization ID
project_id: Project ID
Returns:
Collaborator

returns and raises not needed in the function doc

Collaborator Author

didn't get this, can you elaborate

Returns:
CSV content as bytes if valid
Raises:
Collaborator

This Raises part isn't needed in the function docstring either.

"/api/v1/evaluations/",
json={
    "experiment_name": "test_evaluation_run",
    "dataset_id": 1,
Collaborator

the test coverage of this PR is really low

    response_model=APIResponse[DatasetUploadResponse],
    dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
)
def get_dataset(
Collaborator

Either this router function could be renamed to list_one_dataset_by_id for consistency with the getAllDatasets endpoint above, or vice versa. IMO the get_dataset convention makes a bit more semantic sense than list_dataset.

        project_id=auth_context.project_.id,
    )

    return APIResponse.success_response(data=eval_run)
Collaborator

What happens if the evaluation run creation fails (something like a 422 error or rate limiting)? Is proper error handling taken care of inside the start_evaluation function?

Collaborator Author

fixed it

    response_model=APIResponse[EvaluationRunPublic],
    dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
)
def get_evaluation_run_status(
Collaborator

Likewise, keep semantic consistency with the get-all endpoint.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
backend/app/api/routes/evaluations/dataset.py (2)

56-86: Consider adding logging for observability.

Unlike get_dataset and delete_dataset, this endpoint lacks logging. For consistency and to aid debugging/monitoring, consider adding a log statement.

♻️ Suggested logging addition
 async def upload_dataset_endpoint(
     _session: SessionDep,
     auth_context: AuthContextDep,
     file: UploadFile = File(
         ..., description="CSV file with 'question' and 'answer' columns"
     ),
     dataset_name: str = Form(..., description="Name for the dataset"),
     description: str | None = Form(None, description="Optional dataset description"),
     duplication_factor: int = Form(
         default=1,
         ge=1,
         le=5,
         description="Number of times to duplicate each item (min: 1, max: 5)",
     ),
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset_endpoint] Uploading dataset | name={dataset_name} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)

95-114: Consider adding logging for consistency with other endpoints.

This endpoint lacks logging while get_dataset and delete_dataset have it. Adding a log statement would improve observability and maintain consistency. As per coding guidelines, prefix log messages with the function name.

♻️ Suggested logging addition
 def list_datasets_endpoint(
     _session: SessionDep,
     auth_context: AuthContextDep,
     limit: int = Query(
         default=50, ge=1, le=100, description="Maximum number of datasets to return"
     ),
     offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets_endpoint] Listing datasets | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id} | "
+        f"limit={limit} | offset={offset}"
+    )
     datasets = list_datasets(
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b197a60 and 2d1d8c4.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluations/dataset.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluations/dataset.py
🧠 Learnings (1)
📚 Learning: 2025-12-23T03:57:06.891Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 488
File: backend/app/tests/seed_data/seed_data.py:344-354
Timestamp: 2025-12-23T03:57:06.891Z
Learning: In backend/app/tests/seed_data/seed_data.py, log messages should use the file name as the prefix (e.g., `[tests.seed_data]`) rather than the function name prefix. This file-based logging prefix convention is preferred for test seed data modules.

Applied to files:

  • backend/app/api/routes/evaluations/dataset.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (3)
backend/app/api/routes/evaluations/dataset.py (3)

117-147: LGTM!

Proper logging with function name prefix, appropriate 404 handling, and type hints are all in place.


150-184: LGTM with a minor observation.

Logging and error handling are well implemented. The string-matching approach for distinguishing 404 vs 400 errors (line 176) is a bit fragile—consider having the CRUD layer return a status code or enum instead if this becomes problematic.

The APIResponse[dict] return type is loose; a dedicated response model could improve type safety, but this is optional.


37-47: No changes required. The dataset_metadata field cannot be None because it is typed as dict[str, Any] (not Optional) with default_factory=dict and the database column is marked nullable=False. Calling .get() on it is safe and will not raise an AttributeError. The function has appropriate type hints.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
backend/app/services/evaluations/evaluation.py (3)

59-63: Consider masking potentially sensitive instructions content.

Per coding guidelines, sensitive values in logs should use mask_string(). Assistant instructions might contain sensitive information. Consider either removing the instructions preview from the log or masking it.

♻️ Suggested change
         logger.info(
             f"[build_evaluation_config] Found assistant in DB | id={assistant.id} | "
-            f"model={assistant.model} | instructions="
-            f"{assistant.instructions[:50] if assistant.instructions else 'None'}..."
+            f"model={assistant.model} | has_instructions={assistant.instructions is not None}"
         )

76-82: Simplify the truthiness check.

The len(vector_store_ids) > 0 check is redundant when vector_store_ids is already being checked for truthiness. A non-empty list is truthy in Python.

♻️ Suggested simplification
-        if vector_store_ids and len(vector_store_ids) > 0:
+        if vector_store_ids:
             merged_config["tools"] = [
                 {
                     "type": "file_search",
                     "vector_store_ids": vector_store_ids,
                 }
             ]

263-264: Consider logging when evaluation run is not found.

When eval_run is None, the function silently returns (None, None). Adding a log statement would help with debugging and monitoring when callers request non-existent or inaccessible evaluation runs.

♻️ Suggested change
     if not eval_run:
+        logger.warning(
+            f"[get_evaluation_with_scores] Evaluation run not found | "
+            f"evaluation_id={evaluation_id} | org_id={organization_id} | project_id={project_id}"
+        )
         return None, None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d1d8c4 and f1213dc.

📒 Files selected for processing (1)
  • backend/app/services/evaluations/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/evaluations/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/evaluations/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (3)
backend/app/services/evaluations/evaluation.py (3)

1-20: LGTM!

Imports are well-organized with clear separation between FastAPI/SQLModel framework imports, CRUD operations, models, and utilities. Logger setup follows standard Python conventions.


312-317: No action needed – save_score intentionally creates its own session.

The function is correctly designed to manage its own database session, as documented in the docstring: "This function creates its own database session to persist the score, allowing it to be called after releasing the request's main session." This is an intentional architectural pattern, not a bug. The function properly implements type hints and logging guidelines.

Likely an incorrect or invalid review comment.
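A minimal sketch of that own-session pattern, using a placeholder engine; the real function loads the run by id and writes the score before committing:

from sqlmodel import Session, create_engine

# Placeholder engine for illustration; the app would use its configured engine.
engine = create_engine("sqlite://")


def save_score_sketch(evaluation_run_id: int, score: dict) -> None:
    """Persist a score using a dedicated session, so this can run after the
    request's own session has already been released."""
    with Session(engine) as session:
        # In the real function the run is loaded by id and its score column
        # updated here; elided in this sketch.
        session.commit()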


196-219: Exception handling in start_evaluation_batch is correctly implemented.

The CRUD function properly updates eval_run.status to "failed" and persists the change to the database before re-raising the exception. The service catches this exception, refreshes the run object from the database, and returns it with the failure status intact. This pattern correctly allows callers to detect failures by inspecting the returned run's status.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @backend/app/services/evaluations/validators.py:
- Around line 142-146: The HTTPException raised in validators.py builds the
detail string by concatenating two f-strings but omits punctuation/spacing
between the clauses; update the detail message in the raise HTTPException block
(the f-string that references csv_reader.fieldnames) to include proper
punctuation and spacing (e.g., add a period or colon and a space before "Found
columns:") so the error reads correctly.
- Around line 92-97: The validator uses file.content_type directly and may
report "got: None" when the header is missing; update the check around
content_type (from file.content_type) to explicitly handle a None value before
comparing to ALLOWED_MIME_TYPES and raise a 422 with a clearer message (e.g.,
"Missing Content-Type header" or similar) instead of including None, ensuring
the HTTPException raised by this validator communicates the missing header case
separately from an invalid MIME type.
🧹 Nitpick comments (1)
backend/app/api/routes/evaluations/dataset.py (1)

175-179: String-based error detection is fragile.

The logic if "not found" in message.lower() couples the HTTP status code to the exact wording of the error message from the CRUD layer. If the message changes, this could return incorrect status codes.

Consider having the CRUD function return a more structured error indicator (e.g., an enum or specific exception type) rather than relying on string matching.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1213dc and cc20f0b.

📒 Files selected for processing (2)
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/services/evaluations/validators.py
🧰 Additional context used
📓 Path-based instructions (3)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/services/evaluations/validators.py
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/evaluations/validators.py
🧠 Learnings (1)
📚 Learning: 2025-12-23T03:57:06.891Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 488
File: backend/app/tests/seed_data/seed_data.py:344-354
Timestamp: 2025-12-23T03:57:06.891Z
Learning: In backend/app/tests/seed_data/seed_data.py, log messages should use the file name as the prefix (e.g., `[tests.seed_data]`) rather than the function name prefix. This file-based logging prefix convention is preferred for test seed data modules.

Applied to files:

  • backend/app/api/routes/evaluations/dataset.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (9)
backend/app/services/evaluations/validators.py (3)

1-20: LGTM!

Well-structured imports and clearly defined constants for file validation constraints.


22-64: LGTM!

Well-documented sanitization function with clear rules and examples. The validation logic correctly handles edge cases and raises an appropriate ValueError for empty results.


115-169: LGTM!

Solid CSV parsing implementation with proper case-insensitive header matching, empty row filtering, and appropriate error handling. The logging follows the coding guidelines with the [parse_csv_items] prefix.

backend/app/api/routes/evaluations/dataset.py (6)

1-34: LGTM!

Clean imports and router setup. Good separation between CRUD operations and service layer imports.


37-47: LGTM!

Clean helper function with proper type hints and safe dictionary access with defaults.


50-86: LGTM!

Well-structured upload endpoint with proper validation, form parameters with constraints, and permission checks. Uses load_description as per coding guidelines.


89-114: LGTM!

Clean list endpoint with proper pagination parameters and constraints.


117-147: LGTM!

Well-implemented get endpoint with proper logging following the [function_name] format guideline, and appropriate 404 handling.


150-184: LGTM overall!

Delete endpoint with proper logging following the coding guidelines and appropriate permission checks.

Comment on lines +92 to +97
    content_type = file.content_type
    if content_type not in ALLOWED_MIME_TYPES:
        raise HTTPException(
            status_code=422,
            detail=f"Invalid content type. Expected CSV, got: {content_type}",
        )

⚠️ Potential issue | 🟡 Minor

Handle potential None content type.

file.content_type can be None if the client doesn't send a Content-Type header. The current check content_type not in ALLOWED_MIME_TYPES would raise an HTTPException with "got: None", but it's cleaner to handle this explicitly.

Suggested improvement
     content_type = file.content_type
+    if not content_type:
+        raise HTTPException(
+            status_code=422,
+            detail="Content-Type header is missing",
+        )
     if content_type not in ALLOWED_MIME_TYPES:
         raise HTTPException(
             status_code=422,
             detail=f"Invalid content type. Expected CSV, got: {content_type}",
         )
🤖 Prompt for AI Agents
In @backend/app/services/evaluations/validators.py around lines 92 - 97, The
validator uses file.content_type directly and may report "got: None" when the
header is missing; update the check around content_type (from file.content_type)
to explicitly handle a None value before comparing to ALLOWED_MIME_TYPES and
raise a 422 with a clearer message (e.g., "Missing Content-Type header" or
similar) instead of including None, ensuring the HTTPException raised by this
validator communicates the missing header case separately from an invalid MIME
type.

Comment on lines +142 to +146
            raise HTTPException(
                status_code=422,
                detail=f"CSV must contain 'question' and 'answer' columns "
                f"Found columns: {csv_reader.fieldnames}",
            )

⚠️ Potential issue | 🟡 Minor

Minor: Missing separator in error message.

The error detail message is missing punctuation between the two parts.

Suggested fix
             raise HTTPException(
                 status_code=422,
-                detail=f"CSV must contain 'question' and 'answer' columns "
-                f"Found columns: {csv_reader.fieldnames}",
+                detail=f"CSV must contain 'question' and 'answer' columns. "
+                f"Found columns: {csv_reader.fieldnames}",
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            raise HTTPException(
-                status_code=422,
-                detail=f"CSV must contain 'question' and 'answer' columns "
-                f"Found columns: {csv_reader.fieldnames}",
-            )
+            raise HTTPException(
+                status_code=422,
+                detail=f"CSV must contain 'question' and 'answer' columns. "
+                f"Found columns: {csv_reader.fieldnames}",
+            )
🤖 Prompt for AI Agents
In @backend/app/services/evaluations/validators.py around lines 142 - 146, The
HTTPException raised in validators.py builds the detail string by concatenating
two f-strings but omits punctuation/spacing between the clauses; update the
detail message in the raise HTTPException block (the f-string that references
csv_reader.fieldnames) to include proper punctuation and spacing (e.g., add a
period or colon and a space before "Found columns:") so the error reads
correctly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 495-508: The test file has Black formatting issues; run the
formatter (e.g., `black .` or `pre-commit run --all-files`) and commit the
changes so tests like test_upload_without_authentication in
backend/app/tests/api/routes/test_evaluation.py conform to Black style; ensure
imports, line breaks, and indentation in that file are corrected and re-run the
tests/CI to verify the formatting error is resolved.
- Around line 495-508: The test function test_upload_without_authentication is
missing type hints on its parameters and return value; update its signature to
add appropriate annotations (e.g., annotate client with the test client type
used in your suite and valid_csv_content with its type such as str or IO) and
add a -> None return type so the function and parameters conform to the project
typing guidelines (adjust concrete types to match your test fixtures' types).
- Around line 576-589: The test function
test_start_batch_evaluation_without_authentication is missing type hints for its
parameters and return type; add appropriate annotations (e.g., client:
FlaskClient or TestClient and sample_evaluation_config: dict) and a return type
of None to the function signature to match the style used in
test_upload_without_authentication so linters and type checkers are satisfied.
🧹 Nitpick comments (2)
backend/app/api/routes/evaluations/evaluation.py (2)

37-67: Consider adding logging for observability.

Unlike list_evaluation_runs, this endpoint doesn't log incoming requests. Adding a log statement would improve traceability and align with the pattern established elsewhere in this file.

♻️ Suggested logging
 def evaluate(
     _session: SessionDep,
     auth_context: AuthContextDep,
     dataset_id: int = Body(..., description="ID of the evaluation dataset"),
     experiment_name: str = Body(
         ..., description="Name for this evaluation experiment/run"
     ),
     config: dict = Body(default_factory=dict, description="Evaluation configuration"),
     assistant_id: str
     | None = Body(
         None, description="Optional assistant ID to fetch configuration from"
     ),
 ) -> APIResponse[EvaluationRunPublic]:
     """Start an evaluation run."""
+    logger.info(
+        f"[evaluate] Starting evaluation | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id} | "
+        f"dataset_id={dataset_id} | experiment_name={experiment_name}"
+    )
     eval_run = start_evaluation(

As per coding guidelines, log messages should be prefixed with the function name in square brackets.


76-81: Add Query() wrapper with validation bounds for pagination parameters.

The limit and offset parameters lack explicit Query() declarations, so they carry no descriptions or validation bounds in the OpenAPI documentation. Additionally, an unbounded limit could allow expensive queries.

♻️ Suggested improvement
 def list_evaluation_runs(
     _session: SessionDep,
     auth_context: AuthContextDep,
-    limit: int = 50,
-    offset: int = 0,
+    limit: int = Query(default=50, ge=1, le=100, description="Maximum number of runs to return"),
+    offset: int = Query(default=0, ge=0, description="Number of runs to skip"),
 ) -> APIResponse[list[EvaluationRunPublic]]:
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc20f0b and a94e559.

📒 Files selected for processing (2)
  • backend/app/api/routes/evaluations/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/api/routes/evaluations/evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/evaluation.py
🧠 Learnings (1)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Implement automatic coverage reporting for tests

Applied to files:

  • backend/app/tests/api/routes/test_evaluation.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (1)
backend/app/tests/conftest.py (1)
  • client (65-68)
backend/app/api/routes/evaluations/evaluation.py (5)
backend/app/models/evaluation.py (1)
  • EvaluationRunPublic (336-354)
backend/app/api/permissions.py (2)
  • Permission (10-15)
  • require_permission (45-70)
backend/app/services/evaluations/evaluation.py (2)
  • get_evaluation_with_scores (222-319)
  • start_evaluation (100-219)
backend/app/utils.py (4)
  • APIResponse (33-57)
  • load_description (393-398)
  • failure_response (46-57)
  • success_response (40-43)
backend/app/models/auth.py (2)
  • organization_ (25-29)
  • project_ (32-36)
🪛 GitHub Actions: Kaapi CI
backend/app/tests/api/routes/test_evaluation.py

[error] Black formatting failed. The hook reformatted 1 file (backend/app/tests/api/routes/test_evaluation.py). Please run 'pre-commit run --all-files' or 'black .' to fix formatting.

🔇 Additional comments (4)
backend/app/api/routes/evaluations/evaluation.py (2)

1-28: LGTM!

Imports are well-organized and the logger is properly initialized. The router setup follows standard FastAPI patterns.


106-154: LGTM with optional logging suggestion.

The endpoint correctly validates that resync_score=true requires get_trace_info=true, properly handles the not-found case with a 404, and returns partial data with an error message when score fetching fails. The separation of concerns between "not found" (404) and "found but with error" (failure_response with data) is appropriate.

Consider adding a log statement for consistency with list_evaluation_runs:

logger.info(
    f"[get_evaluation_run_status] Fetching evaluation | "
    f"evaluation_id={evaluation_id} | get_trace_info={get_trace_info}"
)
backend/app/tests/api/routes/test_evaluation.py (2)

62-109: LGTM!

The patch targets correctly reflect the service layer migration (from app.api.routes.evaluation.* to app.services.evaluations.dataset.*), and the trailing slash on the endpoint URL aligns with the router restructuring.


732-918: LGTM!

The TestGetEvaluationRunStatus class provides good coverage for the various scenarios:

  • Incomplete evaluation with trace info request
  • Completed evaluation with cached scores
  • Not found case
  • Without trace info
  • Invalid resync_score without get_trace_info

The test fixtures correctly use the factory pattern via create_test_evaluation_dataset.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +495 to +508
def test_upload_without_authentication(self, client, valid_csv_content):
    """Test uploading without authentication."""
    filename, file_obj = create_csv_file(valid_csv_content)

    response = client.post(
        "/api/v1/evaluations/datasets/",
        files={"file": (filename, file_obj, "text/csv")},
        data={
            "dataset_name": "test_dataset",
            "duplication_factor": 5,
        },
    )

    assert response.status_code == 401  # Unauthorized

⚠️ Potential issue | 🔴 Critical

Fix Black formatting to unblock CI.

The pipeline failed due to Black formatting issues in this file. Run pre-commit run --all-files or black . to fix the formatting before merging.

🧰 Tools
🪛 GitHub Actions: Kaapi CI

[error] Black formatting failed. The hook reformatted 1 file (backend/app/tests/api/routes/test_evaluation.py). Please run 'pre-commit run --all-files' or 'black .' to fix formatting.

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` around lines 495 - 508, The
test file has Black formatting issues; run the formatter (e.g., `black .` or
`pre-commit run --all-files`) and commit the changes so tests like
test_upload_without_authentication in
backend/app/tests/api/routes/test_evaluation.py conform to Black style; ensure
imports, line breaks, and indentation in that file are corrected and re-run the
tests/CI to verify the formatting error is resolved.

⚠️ Potential issue | 🟡 Minor

Add missing type hints.

The test function is missing type hints for parameters and return type, which violates the coding guidelines and may be contributing to the Black formatting failure.

🔧 Proposed fix
-    def test_upload_without_authentication(self, client, valid_csv_content):
+    def test_upload_without_authentication(
+        self, client: TestClient, valid_csv_content: str
+    ) -> None:
         """Test uploading without authentication."""

As per coding guidelines, always add type hints to all function parameters and return values in Python code.
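
One small assumption in the proposed fix: TestClient must already be importable in this test module. If it is not, an import along these lines would need to be added near the top of the file (placement is illustrative):

from fastapi.testclient import TestClient

The same applies to Any from typing, which the analogous fix further down uses for sample_evaluation_config.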

🧰 Tools
🪛 GitHub Actions: Kaapi CI

[error] Black formatting failed. The hook reformatted 1 file (backend/app/tests/api/routes/test_evaluation.py). Please run 'pre-commit run --all-files' or 'black .' to fix formatting.

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` around lines 495 - 508, The
test function test_upload_without_authentication is missing type hints on its
parameters and return value; update its signature to add appropriate annotations
(e.g., annotate client with the test client type used in your suite and
valid_csv_content with its type such as str or IO) and add a -> None return type
so the function and parameters conform to the project typing guidelines (adjust
concrete types to match your test fixtures' types).

Comment on lines +576 to +589
def test_start_batch_evaluation_without_authentication(
    self, client, sample_evaluation_config
):
    """Test batch evaluation requires authentication."""
    response = client.post(
        "/api/v1/evaluations/",
        json={
            "experiment_name": "test_evaluation_run",
            "dataset_id": 1,
            "config": sample_evaluation_config,
        },
    )

    assert response.status_code == 401  # Unauthorized

⚠️ Potential issue | 🟡 Minor

Add missing type hints.

Same issue as test_upload_without_authentication - missing type hints for parameters and return type.

🔧 Proposed fix
     def test_start_batch_evaluation_without_authentication(
-        self, client, sample_evaluation_config
-    ):
+        self, client: TestClient, sample_evaluation_config: dict[str, Any]
+    ) -> None:
         """Test batch evaluation requires authentication."""
🧰 Tools
🪛 GitHub Actions: Kaapi CI

[error] Black formatting failed. The hook reformatted 1 file (backend/app/tests/api/routes/test_evaluation.py). Please run 'pre-commit run --all-files' or 'black .' to fix formatting.

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` around lines 576 - 589, The
test function test_start_batch_evaluation_without_authentication is missing type
hints for its parameters and return type; add appropriate annotations (e.g.,
client: TestClient and sample_evaluation_config: dict[str, Any]) and a
return type of None to the function signature to match the style used in
test_upload_without_authentication so linters and type checkers are satisfied.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/api/routes/evaluations/dataset.py`:
- Around line 180-191: The endpoint accesses dataset.name after the CRUD
delete/commit which can hit a detached SQLAlchemy instance; extract dataset.name
into a local variable before calling delete_dataset_crud and use that variable
in the success message and log. Specifically, read dataset_name = dataset.name
prior to calling delete_dataset_crud(session=_session, dataset=dataset) and then
use dataset_name in logger.info and the APIResponse.success_response message
instead of dataset.name.
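
A minimal sketch of that ordering inside the endpoint (the call shapes follow the prompt above; the exact log and message text are illustrative):

# Capture the attribute while the instance is still attached to the session
dataset_name = dataset.name

delete_dataset_crud(session=_session, dataset=dataset)

# Safe: uses the local copy rather than the possibly detached instance
logger.info(f"[delete_dataset] Deleted dataset | dataset_name={dataset_name}")
return APIResponse.success_response(
    data=f"Dataset '{dataset_name}' deleted successfully"
)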
♻️ Duplicate comments (2)
backend/app/api/routes/evaluations/dataset.py (2)

56-86: Add structured logging to upload_dataset endpoint.

This endpoint is missing structured logging, which is inconsistent with get_dataset and delete_dataset in the same file. As per coding guidelines, log messages should be prefixed with the function name in square brackets.

📝 Suggested logging additions
 async def upload_dataset(
     _session: SessionDep,
     auth_context: AuthContextDep,
     file: UploadFile = File(
         ..., description="CSV file with 'question' and 'answer' columns"
     ),
     dataset_name: str = Form(..., description="Name for the dataset"),
     description: str | None = Form(None, description="Optional dataset description"),
     duplication_factor: int = Form(
         default=1,
         ge=1,
         le=5,
         description="Number of times to duplicate each item (min: 1, max: 5)",
     ),
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset] Uploading dataset | "
+        f"dataset_name={dataset_name} | "
+        f"duplication_factor={duplication_factor} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)

     # Upload dataset using service
     dataset = upload_evaluation_dataset(
         session=_session,
         csv_content=csv_content,
         dataset_name=dataset_name,
         description=description,
         duplication_factor=duplication_factor,
         organization_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
     )

+    logger.info(
+        f"[upload_dataset] Successfully uploaded dataset | "
+        f"dataset_id={dataset.id} | dataset_name={dataset.name}"
+    )
+
     return APIResponse.success_response(data=_dataset_to_response(dataset))

95-114: Add structured logging to list_datasets endpoint.

This endpoint is missing structured logging for consistency with other endpoints in this file.

📝 Suggested logging additions
 def list_datasets(
     _session: SessionDep,
     auth_context: AuthContextDep,
     limit: int = Query(
         default=50, ge=1, le=100, description="Maximum number of datasets to return"
     ),
     offset: int = Query(default=0, ge=0, description="Number of datasets to skip"),
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets] Listing datasets | "
+        f"limit={limit} | offset={offset} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     datasets = list_evaluation_datasets(
         session=_session,
         organization_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
         limit=limit,
         offset=offset,
     )

+    logger.info(f"[list_datasets] Retrieved {len(datasets)} dataset(s)")
+
     return APIResponse.success_response(
         data=[_dataset_to_response(dataset) for dataset in datasets]
     )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a94e559 and cf88d15.

📒 Files selected for processing (2)
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/crud/evaluations/dataset.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/dataset.py
  • backend/app/api/routes/evaluations/dataset.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
🧠 Learnings (1)
📚 Learning: 2025-12-23T03:57:06.891Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 488
File: backend/app/tests/seed_data/seed_data.py:344-354
Timestamp: 2025-12-23T03:57:06.891Z
Learning: In backend/app/tests/seed_data/seed_data.py, log messages should use the file name as the prefix (e.g., `[tests.seed_data]`) rather than the function name prefix. This file-based logging prefix convention is preferred for test seed data modules.

Applied to files:

  • backend/app/api/routes/evaluations/dataset.py
🧬 Code graph analysis (2)
backend/app/crud/evaluations/dataset.py (2)
backend/app/api/routes/evaluations/dataset.py (1)
  • delete_dataset (156-191)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (73-167)
  • EvaluationRun (170-319)
backend/app/api/routes/evaluations/dataset.py (6)
backend/app/api/permissions.py (2)
  • Permission (10-15)
  • require_permission (45-70)
backend/app/crud/evaluations/dataset.py (3)
  • get_dataset_by_id (107-140)
  • list_datasets (175-210)
  • delete_dataset (323-370)
backend/app/models/evaluation.py (2)
  • DatasetUploadResponse (25-44)
  • EvaluationDataset (73-167)
backend/app/services/evaluations/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluations/validators.py (1)
  • validate_csv_file (67-112)
backend/app/utils.py (2)
  • APIResponse (33-57)
  • success_response (40-43)
🔇 Additional comments (4)
backend/app/crud/evaluations/dataset.py (1)

323-370: LGTM! Clean refactor of delete_dataset signature.

The refactored signature accepting a pre-fetched EvaluationDataset object is cleaner and follows the separation of concerns principle—lookup/validation is now handled by the caller. Good practice extracting dataset_id, dataset_name, etc. before session.delete() to avoid accessing the deleted object's attributes in the success log.

backend/app/api/routes/evaluations/dataset.py (3)

1-34: LGTM! Clean module setup.

Imports are properly organized, logger is initialized correctly, and required dependencies for permission checks and API responses are all present.


37-47: LGTM! Clean mapper function.

Type hints are present, and using .get() with defaults provides defensive handling for potentially missing metadata keys.


117-147: LGTM! Well-structured endpoint.

Proper type hints, structured logging with function name prefix, permission checks, and appropriate 404 handling when dataset is not found.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@AkhileshNegi AkhileshNegi merged commit ef56025 into main Jan 15, 2026
1 of 3 checks passed
@AkhileshNegi AkhileshNegi deleted the enhancement/evaluation-refactor branch January 15, 2026 09:06
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/tests/crud/evaluations/test_processing.py (1)

584-645: Configure or remove unused mock_poll parameter.

The mock_poll parameter from the poll_batch_status patch is neither configured with a return value nor verified. Since check_and_process_evaluation calls poll_batch_status on the batch job (even when already completed), the mock should either be configured with an expected return value or verified with assert_called_once() to ensure it's called. Alternatively, if intentionally not testing this dependency, remove the patch decorator to keep the test focused.
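
A generic, runnable sketch of the first option (the function and its call shape below are placeholders, not the project's real signatures):

from unittest.mock import MagicMock

# Stand-in for check_and_process_evaluation: it delegates polling to a dependency.
def process_batch(job: dict, poll) -> str:
    return poll(job)

def test_process_batch_polls_provider() -> None:
    mock_poll = MagicMock(return_value="completed")  # configure the expected status
    result = process_batch({"id": 1}, mock_poll)
    mock_poll.assert_called_once()  # verify the dependency was exercised
    assert result == "completed"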

backend/app/tests/api/test_auth_failures.py (1)

34-36: Fix trailing slash in evaluation list endpoints to match route definitions.

Lines 34-35 should use /evaluations/ (with trailing slash) instead of /evaluations. The evaluation router is mounted at /evaluations with routes defined as / for list operations, resulting in the final path /evaluations/. This matches the pattern used in the evaluation route tests (test_evaluation.py) and is consistent with the dataset endpoint format on lines 28-29. Without the trailing slash, these requests will return 404 errors.
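
A small, self-contained illustration of why the trailing slash matters (the router name and prefixes mirror the comment; the handler is a stub):

from fastapi import APIRouter, FastAPI
from fastapi.testclient import TestClient

evaluations_router = APIRouter(prefix="/evaluations")

@evaluations_router.get("/")  # list route defined at "/"
def list_evaluations() -> list[str]:
    return []

app = FastAPI()
app.include_router(evaluations_router, prefix="/api/v1")

client = TestClient(app)
assert client.get("/api/v1/evaluations/").status_code == 200
# Without the trailing slash the path does not match the route directly;
# depending on redirect_slashes, Starlette answers with a redirect or a 404.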

♻️ Duplicate comments (4)
backend/app/api/routes/evaluations/dataset.py (2)

56-86: Add structured logging for the upload operation.

The upload_dataset endpoint is missing structured logging. Other endpoints in the same file (get_dataset, delete_dataset) follow the coding guideline of prefixing log messages with the function name in square brackets.

📝 Suggested logging additions
 async def upload_dataset(
     ...
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset] Uploading dataset | "
+        f"dataset_name={dataset_name} | "
+        f"duplication_factor={duplication_factor} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)
     ...
+    logger.info(
+        f"[upload_dataset] Successfully uploaded dataset | "
+        f"dataset_id={dataset.id}"
+    )
+
     return APIResponse.success_response(data=_dataset_to_response(dataset))

As per coding guidelines, all log messages should be prefixed with the function name in square brackets.


95-114: Add structured logging for the list operation.

The list_datasets endpoint is missing structured logging, while get_dataset and delete_dataset follow the correct pattern.

📝 Suggested logging additions
 def list_datasets(
     ...
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets] Listing datasets | "
+        f"limit={limit} | offset={offset} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     datasets = list_evaluation_datasets(...)
 
+    logger.info(f"[list_datasets] Retrieved {len(datasets)} dataset(s)")
+
     return APIResponse.success_response(...)

As per coding guidelines, all log messages should be prefixed with the function name in square brackets.

backend/app/tests/api/routes/test_evaluation.py (2)

495-508: Add missing type hints.

The test function is missing type hints for parameters and return type, which violates the coding guidelines.

🔧 Proposed fix
-    def test_upload_without_authentication(self, client, valid_csv_content):
+    def test_upload_without_authentication(
+        self, client: TestClient, valid_csv_content: str
+    ) -> None:
         """Test uploading without authentication."""

As per coding guidelines, always add type hints to all function parameters and return values in Python code.


577-590: Add missing type hints.

Same issue as test_upload_without_authentication - missing type hints for parameters and return type.

🔧 Proposed fix
     def test_start_batch_evaluation_without_authentication(
-        self, client, sample_evaluation_config
-    ):
+        self, client: TestClient, sample_evaluation_config: dict[str, Any]
+    ) -> None:
         """Test batch evaluation requires authentication."""
🧹 Nitpick comments (2)
backend/app/tests/crud/evaluations/test_processing.py (2)

647-707: Same issue: mock_poll is unconfigured and unverified.

Consistent with the previous test, mock_poll is injected but not used. Either configure it to simulate poll_batch_status behavior and add assertions, or remove the patch if it's not needed for the "failed batch" scenario.


24-26: Consider adding return type hints to test methods.

Per coding guidelines, all functions should have type hints on return values. Test methods should include -> None:

def test_parse_evaluation_output_basic(self) -> None:

This applies to all test methods in the file. This is a minor consistency improvement.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf88d15 and c6b09aa.

📒 Files selected for processing (5)
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/api/test_auth_failures.py
  • backend/app/tests/crud/evaluations/test_dataset.py
  • backend/app/tests/crud/evaluations/test_processing.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_dataset.py
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/test_auth_failures.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_dataset.py
  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/api/test_auth_failures.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluations/dataset.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Implement automatic coverage reporting for tests

Applied to files:

  • backend/app/tests/api/routes/test_evaluation.py
📚 Learning: 2025-12-23T03:57:06.891Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 488
File: backend/app/tests/seed_data/seed_data.py:344-354
Timestamp: 2025-12-23T03:57:06.891Z
Learning: In backend/app/tests/seed_data/seed_data.py, log messages should use the file name as the prefix (e.g., `[tests.seed_data]`) rather than the function name prefix. This file-based logging prefix convention is preferred for test seed data modules.

Applied to files:

  • backend/app/api/routes/evaluations/dataset.py
🧬 Code graph analysis (3)
backend/app/tests/api/routes/test_evaluation.py (1)
backend/app/tests/conftest.py (1)
  • client (65-68)
backend/app/tests/crud/evaluations/test_dataset.py (2)
backend/app/api/routes/evaluations/dataset.py (1)
  • delete_dataset (156-191)
backend/app/crud/evaluations/dataset.py (2)
  • delete_dataset (323-370)
  • get_dataset_by_id (107-140)
backend/app/api/routes/evaluations/dataset.py (4)
backend/app/crud/evaluations/dataset.py (3)
  • get_dataset_by_id (107-140)
  • list_datasets (175-210)
  • delete_dataset (323-370)
backend/app/models/evaluation.py (2)
  • DatasetUploadResponse (25-44)
  • EvaluationDataset (73-167)
backend/app/services/evaluations/dataset.py (1)
  • upload_dataset (24-163)
backend/app/utils.py (3)
  • APIResponse (33-57)
  • load_description (393-398)
  • success_response (40-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (13)
backend/app/tests/crud/evaluations/test_processing.py (1)

8-14: LGTM on the import updates.

The imports correctly reference the new app.crud.evaluations.processing module path, aligning with the refactored structure.

backend/app/tests/api/test_auth_failures.py (1)

28-29: LGTM! Trailing slashes added for datasets endpoints.

The changes correctly align the protected endpoints with the new route structure for /evaluations/datasets/.

backend/app/tests/api/routes/test_evaluation.py (3)

62-72: LGTM! Patch targets correctly updated to the new service layer paths.

The mock patches now target app.services.evaluations.dataset.* which aligns with the refactored module structure.


82-84: LGTM!

Endpoint path correctly updated with trailing slash.


530-538: LGTM!

Endpoint path correctly updated with trailing slash for the batch evaluation POST request.

backend/app/tests/crud/evaluations/test_dataset.py (4)

402-416: LGTM!

The test correctly implements the new delete_dataset API contract: captures dataset_id before deletion, passes the dataset object to delete_dataset, and verifies the dataset is removed by checking get_dataset_by_id returns None.


418-434: LGTM!

The test correctly validates the new pattern where the caller must fetch the dataset first. For non-existent datasets, get_dataset_by_id returns None, and the caller handles the "not found" case.


451-467: LGTM!

The test correctly validates that datasets cannot be fetched (and thus deleted) using the wrong organization ID. The access control check happens at the fetch level.


499-513: LGTM!

The test correctly validates that datasets with associated evaluation runs cannot be deleted. The error message check appropriately uses flexible matching for the expected phrases.

backend/app/api/routes/evaluations/dataset.py (4)

1-34: LGTM!

Imports are well-organized following PEP8 guidelines, and the logger is properly configured.


37-47: LGTM!

Clean helper function with proper type hints and defensive defaults for metadata fields.


123-147: LGTM!

Well-implemented endpoint with proper structured logging, error handling, and type hints.


156-191: LGTM!

Well-implemented endpoint with proper structured logging and error handling. Good practice extracting dataset_name before deletion (line 179) to avoid accessing attributes on a detached instance.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@AkhileshNegi AkhileshNegi linked an issue Jan 16, 2026 that may be closed by this pull request

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Restructuring code

4 participants