-
Notifications
You must be signed in to change notification settings - Fork 351
feat: Add OCR text selection with refine functionality #674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
Caution Review failedThe pull request is closed. WalkthroughThis PR adds OCR (Optical Character Recognition) functionality to the image viewer. It introduces a new TextOverlay component for displaying OCR results, an OCRService using Tesseract.js for text extraction, integrates a Ctrl+T keyboard shortcut to toggle OCR mode, and adds tesseract.js as a dependency to enable OCR capabilities. Changes
Sequence DiagramsequenceDiagram
actor User
participant ImageViewer
participant OCRService
participant Tesseract as Tesseract.js Worker
participant TextOverlay
participant Clipboard
User->>ImageViewer: Press Ctrl+T
ImageViewer->>ImageViewer: Toggle isOCRActive state
ImageViewer->>OCRService: recognize(imagePath)
ImageViewer->>ImageViewer: Set isOCRLoading = true
OCRService->>OCRService: Check/initialize worker
OCRService->>Tesseract: Initialize worker (eng, PSM.AUTO)
Tesseract-->>OCRService: Worker ready
OCRService->>Tesseract: Process image
Tesseract-->>OCRService: OCR result (Page with lines/bbox)
OCRService-->>ImageViewer: Return ocrData
ImageViewer->>ImageViewer: Set isOCRLoading = false
ImageViewer->>ImageViewer: Store ocrData in state
ImageViewer->>TextOverlay: Render with ocrData + scale
TextOverlay->>TextOverlay: Position text overlays by bbox
User->>TextOverlay: Select text + Press Ctrl+C
TextOverlay->>Clipboard: Write selected text
Clipboard-->>TextOverlay: Success
TextOverlay->>TextOverlay: Show copy feedback
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (7)
backend/app/utils/ocr.py (2)
27-34: Lazy import of pytesseract is appropriate for optional dependency handling.The pattern of importing inside the function enables graceful degradation when pytesseract isn't installed. However, consider catching
ImportErrorspecifically rather than bareExceptionfor clarity.try: import pytesseract - except Exception: + except ImportError: pytesseract = None
55-62: Silent failure on coordinate parsing may hide data corruption issues.When
int()conversion fails for bounding box coordinates, the word is silently skipped. Consider logging a warning to help diagnose OCR data issues.try: left = int(data.get("left", [])[i]) top = int(data.get("top", [])[i]) w = int(data.get("width", [])[i]) h = int(data.get("height", [])[i]) - except Exception: + except (ValueError, TypeError, IndexError) as e: + logger.debug(f"Skipping word at index {i} due to invalid bbox data: {e}") continuebackend/app/routes/images.py (2)
136-167: Consider adding a response model for consistency and documentation.Other endpoints in this file use Pydantic response models (e.g.,
GetAllImagesResponse). Adding one for the OCR endpoint would improve API documentation and type safety.class OCRWordData(BaseModel): text: str left: int top: int width: int height: int class OCRResponse(BaseModel): success: bool image_id: str full_text: str words: List[OCRWordData] image_width: int image_height: int @router.get("/{image_id}/ocr", response_model=OCRResponse) def get_image_ocr(image_id: str): ...
165-167: Avoid exposing raw exception details in production responses.Including
{e}directly in the error detail could leak internal implementation details. Consider using a generic message while logging the full error.except Exception as e: logger.error(f"Error in OCR route for {image_id}: {e}") - raise HTTPException(status_code=500, detail=f"OCR failed: {e}") + raise HTTPException(status_code=500, detail="OCR processing failed")backend/app/database/images.py (1)
123-160: Differentiate DB errors from “not found” indb_get_image_by_idThe implementation is correct, but by catching all exceptions and returning
None, callers cannot distinguish “record not found” from an underlying DB error. If the route mapsNoneto HTTP 404, genuine DB failures would also surface as 404 instead of 500.Consider either:
- Letting unexpected exceptions propagate so the route can return a 500, or
- Returning a structured result (e.g.
{ record: ..., error: ... }) so the caller can handle “not found” vs “error” separately.docs/backend/backend_python/openapi.json (1)
929-969: Align OCR endpoint OpenAPI responses and schema with backend behaviorThe new
/images/{image_id}/ocrpath only declares 200/422 with an empty{}schema, while the backend route can return 404 (not found), 503 (OCR unavailable), and 500 (unexpected error), and has a well-defined JSON shape (success,image_id,full_text,words,image_width,image_height).To keep docs and generated clients accurate, consider:
- Adding explicit 404/503/500 responses mirroring the route behavior.
- Defining a concrete response schema for the successful payload instead of
{}(similar toGetAllImagesResponse/other image schemas).frontend/src/components/Media/ImageViewer.tsx (1)
145-239: Optional: Treat simple click as clearing or re‑computing selectionWith the current pointer logic, a plain click (no movement) in selection mode sets a zero‑size
selRectbut never schedules therequestAnimationFrameupdate, soselectedTextandselectedCountremain from the previous selection while the selection rectangle jumps to the new location.If you’d prefer clicks to clear or recompute selection, you could:
- Reset
selectedText/selectedCountinhandlePointerDown, or- Trigger the same rect + word‑intersection update path once on pointer up when
width/heightare below a small threshold.This is a UX polish only; the core functionality works as‑is.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
backend/app/database/images.py(1 hunks)backend/app/routes/images.py(2 hunks)backend/app/utils/ocr.py(1 hunks)backend/requirements.txt(1 hunks)docs/backend/backend_python/openapi.json(1 hunks)frontend/src/api/apiEndpoints.ts(1 hunks)frontend/src/components/Media/ImageViewer.tsx(4 hunks)frontend/src/components/Media/MediaView.tsx(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
frontend/src/components/Media/ImageViewer.tsx (2)
frontend/src/config/Backend.ts (1)
BACKEND_URL(1-1)frontend/src/api/apiEndpoints.ts (1)
imagesEndpoints(1-5)
backend/app/database/images.py (1)
backend/app/utils/images.py (1)
image_util_parse_metadata(496-513)
backend/app/routes/images.py (2)
backend/app/database/images.py (1)
db_get_image_by_id(123-159)backend/app/utils/ocr.py (2)
image_ocr(12-73)OCRUnavailableError(8-9)
🔇 Additional comments (5)
backend/app/utils/ocr.py (1)
64-66: Simple space-joined reconstruction loses original text structure.The
full_textconcatenation with single spaces ignores line breaks and paragraph structure from the OCR output. If preserving formatting matters for downstream use, consider using pytesseract's block/line-level data to reconstruct text more accurately. For basic copy-paste scenarios, this is acceptable.frontend/src/api/apiEndpoints.ts (1)
4-4: LGTM!The new endpoint follows the established pattern for dynamic endpoints in this file and correctly constructs the OCR API path.
backend/app/routes/images.py (1)
9-10: LGTM!Imports are correctly added for the new OCR functionality.
frontend/src/components/Media/MediaView.tsx (1)
172-180: LGTM!The
imageIdprop is correctly passed toImageViewerto enable OCR functionality. The optional chaining oncurrentImage?.idis defensive given the early return guard at line 142, but it's a safe pattern.backend/requirements.txt (1)
74-74: Dependency addition looks correct.The pytesseract>=0.3.10 constraint is appropriate and allows current versions up to the latest release (0.3.13). This requires Tesseract OCR to be installed at the system level. Ensure installation documentation is updated to reflect this system dependency requirement.
| useEffect(() => { | ||
| let cancelled = false; | ||
| const loadOcr = async (imgId?: string | null) => { | ||
| if (!imgId) return; | ||
| try { | ||
| const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`); | ||
| if (!resp.ok) { | ||
| return; | ||
| } | ||
| const data = await resp.json(); | ||
| if (cancelled) return; | ||
| if (data && data.words && data.image_width && data.image_height) { | ||
| setOcrWords(data.words); | ||
| setOcrImageSize({ w: data.image_width, h: data.image_height }); | ||
| // prepare normalized cache | ||
| normalizedWordsRef.current = data.words.map((w: any) => ({ | ||
| text: w.text, | ||
| l: (w.left / data.image_width) * 100, | ||
| t: (w.top / data.image_height) * 100, | ||
| w: (w.width / data.image_width) * 100, | ||
| h: (w.height / data.image_height) * 100, | ||
| })); | ||
| } | ||
| } catch (e) { | ||
| // ignore | ||
| } | ||
| }; | ||
|
|
||
| loadOcr(imageId ?? null); | ||
|
|
||
| return () => { | ||
| cancelled = true; | ||
| }; | ||
| }, [imageId]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clear OCR + selection state when imageId changes or OCR fetch fails
Right now, if imageId changes or the OCR request returns a non‑OK status / unexpected payload, ocrWords, ocrImageSize, and normalizedWordsRef.current are left as‑is. That means the new image can still show and operate on OCR boxes from the previous image, which is misleading for selection and copy.
Consider resetting OCR and selection state whenever imageId changes and when the response is not usable, e.g.:
useEffect(() => {
let cancelled = false;
- const loadOcr = async (imgId?: string | null) => {
- if (!imgId) return;
+ // reset previous OCR/selection
+ setOcrWords(null);
+ setOcrImageSize(null);
+ normalizedWordsRef.current = null;
+ setSelRect(null);
+ setSelectedText(null);
+ setSelectedCount(0);
+ setCopied(false);
+
+ const loadOcr = async (imgId?: string | null) => {
+ if (!imgId) return;
try {
const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
if (!resp.ok) {
- return;
+ return; // leave OCR cleared
}
const data = await resp.json();
if (cancelled) return;
if (data && data.words && data.image_width && data.image_height) {
setOcrWords(data.words);
setOcrImageSize({ w: data.image_width, h: data.image_height });
normalizedWordsRef.current = data.words.map((w: any) => ({
text: w.text,
l: (w.left / data.image_width) * 100,
t: (w.top / data.image_height) * 100,
w: (w.width / data.image_width) * 100,
h: (w.height / data.image_height) * 100,
}));
}
} catch (e) {
// ignore
}
};
loadOcr(imageId ?? null);This keeps the overlay and selection in sync with the currently displayed image and avoids stale OCR data.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| useEffect(() => { | |
| let cancelled = false; | |
| const loadOcr = async (imgId?: string | null) => { | |
| if (!imgId) return; | |
| try { | |
| const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`); | |
| if (!resp.ok) { | |
| return; | |
| } | |
| const data = await resp.json(); | |
| if (cancelled) return; | |
| if (data && data.words && data.image_width && data.image_height) { | |
| setOcrWords(data.words); | |
| setOcrImageSize({ w: data.image_width, h: data.image_height }); | |
| // prepare normalized cache | |
| normalizedWordsRef.current = data.words.map((w: any) => ({ | |
| text: w.text, | |
| l: (w.left / data.image_width) * 100, | |
| t: (w.top / data.image_height) * 100, | |
| w: (w.width / data.image_width) * 100, | |
| h: (w.height / data.image_height) * 100, | |
| })); | |
| } | |
| } catch (e) { | |
| // ignore | |
| } | |
| }; | |
| loadOcr(imageId ?? null); | |
| return () => { | |
| cancelled = true; | |
| }; | |
| }, [imageId]); | |
| useEffect(() => { | |
| let cancelled = false; | |
| // reset previous OCR/selection | |
| setOcrWords(null); | |
| setOcrImageSize(null); | |
| normalizedWordsRef.current = null; | |
| setSelRect(null); | |
| setSelectedText(null); | |
| setSelectedCount(0); | |
| setCopied(false); | |
| const loadOcr = async (imgId?: string | null) => { | |
| if (!imgId) return; | |
| try { | |
| const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`); | |
| if (!resp.ok) { | |
| return; // leave OCR cleared | |
| } | |
| const data = await resp.json(); | |
| if (cancelled) return; | |
| if (data && data.words && data.image_width && data.image_height) { | |
| setOcrWords(data.words); | |
| setOcrImageSize({ w: data.image_width, h: data.image_height }); | |
| // prepare normalized cache | |
| normalizedWordsRef.current = data.words.map((w: any) => ({ | |
| text: w.text, | |
| l: (w.left / data.image_width) * 100, | |
| t: (w.top / data.image_height) * 100, | |
| w: (w.width / data.image_width) * 100, | |
| h: (w.height / data.image_height) * 100, | |
| })); | |
| } | |
| } catch (e) { | |
| // ignore | |
| } | |
| }; | |
| loadOcr(imageId ?? null); | |
| return () => { | |
| cancelled = true; | |
| }; | |
| }, [imageId]); |
🤖 Prompt for AI Agents
frontend/src/components/Media/ImageViewer.tsx around lines 74-107: the OCR state
and normalized cache are left from a previous image when imageId changes or when
the fetch fails/returns an unusable payload; to fix, immediately clear
OCR-related state whenever imageId changes (call setOcrWords([]),
setOcrImageSize(undefined/null) and set normalizedWordsRef.current = []) before
starting loadOcr, and also clear that same OCR state plus any selection/copy
state (e.g. selected word(s) / selection bounds — whatever state you use for
selection) inside the branches where resp.ok is false, the payload is invalid,
and inside the catch handler so stale OCR boxes and selections are never shown
for the new image.
e24dca2 to
24bbb6b
Compare
Fixes #675
Description
This PR adds on-demand OCR text selection functionality to PictoPy, allowing users to select, copy, and refine text from images.
Features Added
Technical Changes
Backend
db_get_image_by_idinimages.pyocr.py) with pytesseract integrationGET /images/{image_id}/ocrFrontend
ImageViewer.tsxwith OCR overlay and selection logicapiEndpoints.tsMediaView.tsxto pass image IDsTesting Instructions
Ctrl+Tto enable text selectionDependencies
pytesseractto backend requirementsNotes
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.