-
Notifications
You must be signed in to change notification settings - Fork 70
LCORE-1179: Update metadata extraction from chunks #1073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughRemoved TextContentItem metadata parsing from Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Tip 🧪 Unit Test Generation v2 is now available!We have significantly improved our unit test generation capabilities. To enable: Add this to your reviews:
finishing_touches:
unit_tests:
enabled: trueTry it out by using the Have feedback? Share your thoughts on our Discord thread! Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/endpoints/query_v2.py (1)
547-588: Include doc_id in de-duplication to avoid dropping distinct documents.
If two documents share title/URL (or URL is missing) but have different doc_id values, the current(final_url, doc_title)key will collapse them, which undermines the new doc_id linkage.✅ Suggested fix
- # Use a set to track unique documents by (doc_url, doc_title) tuple - seen_docs: set[tuple[Optional[str], Optional[str]]] = set() + # Use a set to track unique documents by (doc_id, doc_url, doc_title) tuple + seen_docs: set[tuple[Optional[str], Optional[str], Optional[str]]] = set() @@ - if (final_url, doc_title) not in seen_docs: + unique_key = (doc_id, final_url, doc_title) + if unique_key not in seen_docs: documents.append( ReferencedDocument( doc_url=final_url, doc_title=doc_title, doc_id=doc_id ) ) - seen_docs.add((final_url, doc_title)) + seen_docs.add(unique_key)
| ) | ||
| seen_docs.add((final_url, filename)) | ||
|
|
||
| # 2. Parse from message content annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for dropping this section?
Check also OpenAIResponseOutputMessageContent type that can be part of OpenAIResponseMessage content. Isn't this also relevant? More specifically, OpenAIResponseAnnotations object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already get the references to the chunks used for the response.
Annotations is useful if we want to determine exactly which files the model has used to give the answer (determined by the model) and at which position in the text.
Personally I think that we can use this data, but right now it is not needed. Because the scope of referenced_docs is to keep data of all chunks retrieved and used as input for reference. The data that we get from citation is just redundant since it is a small subset of these docs.
I think we can make use of citations in a separate feature, after agreement across our customers since it adds a new functionality.
asimurka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
2dd8162 to
88ed424
Compare
- now extracting doc_title, doc_url, doc_id from the "attributes" in response - added doc_id to keep a link between rag_chunks and referenced_docs - removed chunk metadata extraction from results which are not "type": "file_search_call" - citations could be added in future, needs discussion with other teams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/openapi.md`:
- Around line 5324-5328: The ReferencedDocument markdown table pipes are
misaligned with the header row causing MD060; fix by aligning the pipe
separators vertically and ensuring the header separator row matches the number
of columns and header styles—update the table containing the fields doc_url,
doc_title, and doc_id so the header row "| Field | Type | Description |" and the
separator row (the row of dashes) have matching columns and properly placed
pipes, and then ensure each data row (the lines with doc_url, doc_title, doc_id)
uses the same pipe alignment and column count.
In `@src/app/endpoints/query_v2.py`:
- Around line 561-566: The code sets attributes = result.get("attributes", {})
or uses getattr(result, "attributes", {}), but if the key/attribute exists with
value None the default isn't applied and attributes becomes None causing
attributes.get(...) to raise AttributeError; change the assignment so attributes
is always a dict by applying the "or {}" pattern after both retrievals (i.e.,
ensure attributes = result.get("attributes", {}) or {} and attributes =
getattr(result, "attributes", {}) or {}), referencing the variables result and
attributes and the subsequent attributes.get(...) usage to locate where to apply
the fix.
- Around line 575-586: The current filter only appends ReferencedDocument when
doc_title or doc_url exists, which drops entries that only have doc_id; change
the if-condition to check any of doc_title, doc_url, or doc_id so documents with
only a document_id are preserved. Keep the existing final_url normalization
(treat empty string as None), use the same dedupe tuple (final_url, doc_title,
doc_id) against seen_docs, and append ReferencedDocument(doc_url=final_url,
doc_title=doc_title, doc_id=doc_id) when that tuple is not in seen_docs.
🧹 Nitpick comments (1)
src/app/endpoints/query_v2.py (1)
503-507: Consider using doc_id asRAGChunk.source(with filename fallback).If clients are expected to join
rag_chunkswithreferenced_documentsviadoc_id,sourceshould carry that ID.🔄 Suggested linkage-friendly source
- rag_chunk = RAGChunk( - content=result.text, source=result.filename, score=result.score - ) + attributes = getattr(result, "attributes", None) or {} + doc_id = None + if isinstance(attributes, dict): + doc_id = attributes.get("document_id") or attributes.get("doc_id") + source = doc_id or result.filename + rag_chunk = RAGChunk(content=result.text, source=source, score=result.score)
| doc_title = attributes.get("title") | ||
| doc_id = attributes.get("document_id") or attributes.get("doc_id") | ||
|
|
||
| # If we have at least a filename or url | ||
| if filename or doc_url: | ||
| if doc_title or doc_url: | ||
| # Treat empty string as None for URL to satisfy Optional[AnyUrl] | ||
| final_url = doc_url if doc_url else None | ||
| if (final_url, filename) not in seen_docs: | ||
| if (final_url, doc_title, doc_id) not in seen_docs: | ||
| documents.append( | ||
| ReferencedDocument(doc_url=final_url, doc_title=filename) | ||
| ) | ||
| seen_docs.add((final_url, filename)) | ||
|
|
||
| # 2. Parse from message content annotations | ||
| elif item_type == "message": | ||
| content = getattr(output_item, "content", None) | ||
| if isinstance(content, list): | ||
| for part in content: | ||
| # Skip if part is a string or doesn't have annotations | ||
| if isinstance(part, str): | ||
| continue | ||
|
|
||
| annotations = getattr(part, "annotations", []) or [] | ||
| for annotation in annotations: | ||
| # Handle both object and dict access for annotations | ||
| if isinstance(annotation, dict): | ||
| anno_type = annotation.get("type") | ||
| anno_url = annotation.get("url") | ||
| anno_title = annotation.get("title") or annotation.get( | ||
| "filename" | ||
| ) | ||
| else: | ||
| anno_type = getattr(annotation, "type", None) | ||
| anno_url = getattr(annotation, "url", None) | ||
| anno_title = getattr(annotation, "title", None) or getattr( | ||
| annotation, "filename", None | ||
| ReferencedDocument( | ||
| doc_url=final_url, doc_title=doc_title, doc_id=doc_id | ||
| ) | ||
|
|
||
| if anno_type == "url_citation": | ||
| # Treat empty string as None | ||
| final_url = anno_url if anno_url else None | ||
| if (final_url, anno_title) not in seen_docs: | ||
| documents.append( | ||
| ReferencedDocument( | ||
| doc_url=final_url, doc_title=anno_title | ||
| ) | ||
| ) | ||
| seen_docs.add((final_url, anno_title)) | ||
|
|
||
| elif anno_type == "file_citation": | ||
| if (None, anno_title) not in seen_docs: | ||
| documents.append( | ||
| ReferencedDocument( | ||
| doc_url=None, doc_title=anno_title | ||
| ) | ||
| ) | ||
| seen_docs.add((None, anno_title)) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual code around lines 575-586 in query_v2.py
cat -n src/app/endpoints/query_v2.py | sed -n '570,590p'Repository: lightspeed-core/lightspeed-stack
Length of output: 1147
🏁 Script executed:
# Let's also check the ReferencedDocument class definition to understand its structure
rg -A 10 'class ReferencedDocument' src/Repository: lightspeed-core/lightspeed-stack
Length of output: 708
🏁 Script executed:
# Let's see more context around this code section to understand the full logic
cat -n src/app/endpoints/query_v2.py | sed -n '550,600p'Repository: lightspeed-core/lightspeed-stack
Length of output: 2444
🏁 Script executed:
# Get full ReferencedDocument class definition
cat -n src/models/responses.py | grep -A 20 'class ReferencedDocument'Repository: lightspeed-core/lightspeed-stack
Length of output: 840
🏁 Script executed:
# Search for where ReferencedDocument is used to understand the intent
rg 'ReferencedDocument' src/ --type py -B 2 -A 2Repository: lightspeed-core/lightspeed-stack
Length of output: 10260
🏁 Script executed:
# Check if there are any tests that show expected behavior with doc_id-only documents
fd 'test.*query' src/ --type py | head -5Repository: lightspeed-core/lightspeed-stack
Length of output: 245
Include doc_id-only documents to prevent data loss.
Documents with only document_id but no title or url are currently filtered out. Since doc_id is explicitly extracted and used in deduplication, it should be preserved. Update the condition to include doc_id:
Suggested fix
- if doc_title or doc_url:
+ if doc_title or doc_url or doc_id:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| doc_title = attributes.get("title") | |
| doc_id = attributes.get("document_id") or attributes.get("doc_id") | |
| # If we have at least a filename or url | |
| if filename or doc_url: | |
| if doc_title or doc_url: | |
| # Treat empty string as None for URL to satisfy Optional[AnyUrl] | |
| final_url = doc_url if doc_url else None | |
| if (final_url, filename) not in seen_docs: | |
| if (final_url, doc_title, doc_id) not in seen_docs: | |
| documents.append( | |
| ReferencedDocument(doc_url=final_url, doc_title=filename) | |
| ) | |
| seen_docs.add((final_url, filename)) | |
| # 2. Parse from message content annotations | |
| elif item_type == "message": | |
| content = getattr(output_item, "content", None) | |
| if isinstance(content, list): | |
| for part in content: | |
| # Skip if part is a string or doesn't have annotations | |
| if isinstance(part, str): | |
| continue | |
| annotations = getattr(part, "annotations", []) or [] | |
| for annotation in annotations: | |
| # Handle both object and dict access for annotations | |
| if isinstance(annotation, dict): | |
| anno_type = annotation.get("type") | |
| anno_url = annotation.get("url") | |
| anno_title = annotation.get("title") or annotation.get( | |
| "filename" | |
| ) | |
| else: | |
| anno_type = getattr(annotation, "type", None) | |
| anno_url = getattr(annotation, "url", None) | |
| anno_title = getattr(annotation, "title", None) or getattr( | |
| annotation, "filename", None | |
| ReferencedDocument( | |
| doc_url=final_url, doc_title=doc_title, doc_id=doc_id | |
| ) | |
| if anno_type == "url_citation": | |
| # Treat empty string as None | |
| final_url = anno_url if anno_url else None | |
| if (final_url, anno_title) not in seen_docs: | |
| documents.append( | |
| ReferencedDocument( | |
| doc_url=final_url, doc_title=anno_title | |
| ) | |
| ) | |
| seen_docs.add((final_url, anno_title)) | |
| elif anno_type == "file_citation": | |
| if (None, anno_title) not in seen_docs: | |
| documents.append( | |
| ReferencedDocument( | |
| doc_url=None, doc_title=anno_title | |
| ) | |
| ) | |
| seen_docs.add((None, anno_title)) | |
| ) | |
| doc_title = attributes.get("title") | |
| doc_id = attributes.get("document_id") or attributes.get("doc_id") | |
| if doc_title or doc_url or doc_id: | |
| # Treat empty string as None for URL to satisfy Optional[AnyUrl] | |
| final_url = doc_url if doc_url else None | |
| if (final_url, doc_title, doc_id) not in seen_docs: | |
| documents.append( | |
| ReferencedDocument( | |
| doc_url=final_url, doc_title=doc_title, doc_id=doc_id | |
| ) | |
| ) |
🤖 Prompt for AI Agents
In `@src/app/endpoints/query_v2.py` around lines 575 - 586, The current filter
only appends ReferencedDocument when doc_title or doc_url exists, which drops
entries that only have doc_id; change the if-condition to check any of
doc_title, doc_url, or doc_id so documents with only a document_id are
preserved. Keep the existing final_url normalization (treat empty string as
None), use the same dedupe tuple (final_url, doc_title, doc_id) against
seen_docs, and append ReferencedDocument(doc_url=final_url, doc_title=doc_title,
doc_id=doc_id) when that tuple is not in seen_docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
referenced_docs are used because we need to show in the frontend what document have been referenced, IMO it does not make sense to return empty documents without TItle / Url
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@are-ces, that makes perfect sense! You're right that a document reference with only an ID wouldn't provide meaningful information for the frontend to display. The current condition ensures that only documents with at least a title or URL—something actionable and human-readable—are returned to the UI. Thank you for the clarification!
✏️ Learnings added
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:34.749Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/openapi.json (1)
4287-4312:⚠️ Potential issue | 🟠 MajorEnsure unique operationId values for GET vs POST.
Line 4293 and Line 4311 both usehandle_a2a_jsonrpc_a2a_post, which can break client codegen (operationId must be unique). Consider distinct IDs per method.🔧 Suggested fix
- "operationId": "handle_a2a_jsonrpc_a2a_post", + "operationId": "handle_a2a_jsonrpc_a2a_get",
🧹 Nitpick comments (2)
src/llama_stack_configuration.py (1)
310-347: Avoid in-place mutation ofls_configinenrich_byok_rag.
Returning a new mapping reduces side‑effects and aligns with the project guideline.♻️ Proposed refactor to avoid in-place mutation
-def enrich_byok_rag(ls_config: dict[str, Any], byok_rag: list[dict[str, Any]]) -> None: +def enrich_byok_rag(ls_config: dict[str, Any], byok_rag: list[dict[str, Any]]) -> dict[str, Any]: """Enrich Llama Stack config with BYOK RAG settings. Args: - ls_config: Llama Stack configuration dict (modified in place) + ls_config: Llama Stack configuration dict byok_rag: List of BYOK RAG configurations + Returns: + dict[str, Any]: Enriched configuration mapping. """ - if len(byok_rag) == 0: - logger.info("BYOK RAG is not configured: skipping") - return + if len(byok_rag) == 0: + logger.info("BYOK RAG is not configured: skipping") + return ls_config + + output = {**ls_config} logger.info("Enriching Llama Stack config with BYOK RAG") # Add storage backends - if "storage" not in ls_config: - ls_config["storage"] = {} - ls_config["storage"]["backends"] = construct_storage_backends_section( - ls_config, byok_rag - ) + storage = {**output.get("storage", {})} + storage["backends"] = construct_storage_backends_section(output, byok_rag) + output["storage"] = storage # Add vector_io providers - if "providers" not in ls_config: - ls_config["providers"] = {} - ls_config["providers"]["vector_io"] = construct_vector_io_providers_section( - ls_config, byok_rag - ) + providers = {**output.get("providers", {})} + providers["vector_io"] = construct_vector_io_providers_section(output, byok_rag) + output["providers"] = providers # Add registered vector stores - if "registered_resources" not in ls_config: - ls_config["registered_resources"] = {} - ls_config["registered_resources"]["vector_stores"] = ( - construct_vector_stores_section(ls_config, byok_rag) - ) - - # Add embedding models - ls_config["registered_resources"]["models"] = construct_models_section( - ls_config, byok_rag - ) + registered_resources = {**output.get("registered_resources", {})} + registered_resources["vector_stores"] = construct_vector_stores_section( + output, byok_rag + ) + registered_resources["models"] = construct_models_section(output, byok_rag) + output["registered_resources"] = registered_resources + + return output🔁 Caller update
- enrich_byok_rag(ls_config, config.get("byok_rag", [])) + ls_config = enrich_byok_rag(ls_config, config.get("byok_rag", []))As per coding guidelines: Avoid in-place parameter modification anti-patterns; instead return new data structures.
docs/openapi.json (1)
8434-8451: Clarify whendoc_idwill actually appear.
The schema allowsdoc_idalone, but the API filters out entries lackingdoc_title/doc_url. A short note will prevent consumer confusion.📝 Suggested doc clarification
- "description": "Model representing a document referenced in generating a response.\n\nAttributes:\n doc_url: Url to the referenced doc.\n doc_title: Title of the referenced doc.\n doc_id: ID of the referenced doc." + "description": "Model representing a document referenced in generating a response.\n\nAttributes:\n doc_url: Url to the referenced doc.\n doc_title: Title of the referenced doc.\n doc_id: ID of the referenced doc.\n\nNote: referenced documents are only emitted when doc_title or doc_url is present; doc_id-only entries are filtered."Based on learnings: In
src/app/endpoints/query_v2.py, theparse_referenced_documents_from_responses_apifunction filters referenced documents to include only those withdoc_titleordoc_urlbecause these documents are displayed in the frontend. Documents with onlydoc_idare intentionally excluded as they wouldn't provide useful information to end users.
asimurka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update LGTM
Description
new:
Important change
There is an additional field in the
referenced_documentsfield:doc_id- which points to theOpenAI Filesobject created by llama-stack has been added to the attributes ofreferenced_documents.E.g.
The reason is to keep a link between
referenced_documentsandrag_chunksin the query endpoint.Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
I have registered a vector store, called
/v1/queryand/v1/streaming_queryendpoints, making sure that thereferenced_docsfield is well structured.E.g.
Summary by CodeRabbit
Improvements
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.