feat: Add OCR text selection with refine functionality #674

Aryan-Shan · 2025-11-30T08:20:41Z

Fixes #675

Description

This PR adds on-demand OCR text selection functionality to PictoPy, allowing users to select, copy, and refine text from images.

Features Added

OCR Integration: Backend OCR using pytesseract to extract text and word bounding boxes
Text Overlay: Frontend overlay with selectable text that scales with image zoom/pan
Right-click Context Menu: Copy selected text and refine selection options
Refine Panel: Inline controls to nudge/expand selection boundaries
Toggleable Selection: Ctrl+T to enable/disable text selection mode
Performance Optimizations: Cached OCR data and rAF-throttled pointer events

Technical Changes

Backend

Added db_get_image_by_id in images.py
New OCR utility (ocr.py) with pytesseract integration
New endpoint GET /images/{image_id}/ocr

Frontend

Updated ImageViewer.tsx with OCR overlay and selection logic
Added API endpoints in apiEndpoints.ts
Enhanced MediaView.tsx to pass image IDs

Testing Instructions

Start backend and frontend servers
Open an image with text content
Press Ctrl+T to enable text selection
Click and drag to select text from the overlay
Use right-click menu or toolbar to copy/refine selection
Test refine panel controls to adjust selection boundaries

Dependencies

Requires Tesseract OCR installed on system
Added pytesseract to backend requirements
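
Because the feature needs both the `pytesseract` package and a system-level Tesseract binary, a small check can report which of the two is missing before OCR is attempted. The following is a minimal sketch using only the standard library; the function name `ocr_dependencies_status` is hypothetical and not part of this PR.

```python
import importlib.util
import shutil


def ocr_dependencies_status() -> dict:
    """Report whether the Python wrapper and the Tesseract binary are present."""
    return {
        # True when the pytesseract package is importable in this environment.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # True when a `tesseract` executable is found on PATH.
        "tesseract_binary": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    print(ocr_dependencies_status())
```

Note that the PATH lookup is only a heuristic: if `pytesseract.pytesseract.tesseract_cmd` is set to an explicit path, the binary can be usable even when it is not on PATH.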

Notes

Selection overlay uses light purple theme for better visibility
Performance improvements for large OCR datasets
Handles pointer events and interruptions gracefully

Summary by CodeRabbit

New Features
- Introduced optical character recognition (OCR) capabilities to easily extract and analyze text directly from images in the image viewer.
- Press Ctrl+T to toggle OCR and display all recognized text as an interactive, selectable overlay.
- Copy selected text using the Ctrl+C keyboard shortcut with on-screen visual feedback confirming each successful copy.


github-actions · 2025-11-30T08:21:00Z

⚠️ No issue was linked in the PR description.
Please make sure to link an issue (e.g., 'Fixes #issue_number')

coderabbitai · 2025-11-30T08:21:02Z

Caution

Review failed

The pull request is closed.

Walkthrough

This PR adds OCR (Optical Character Recognition) functionality to the image viewer. It introduces a new TextOverlay component for displaying OCR results, an OCRService using Tesseract.js for text extraction, integrates a Ctrl+T keyboard shortcut to toggle OCR mode, and adds tesseract.js as a dependency to enable OCR capabilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OCR Service Integration `frontend/src/services/OCRService.ts` | New service module providing lazy-initialized Tesseract.js worker for OCR operations. Exposes singleton `ocrService` with `recognize(imagePath)` method and worker cleanup via `terminate()`. Configured for English language with automatic PSM. Includes error handling for worker initialization and OCR processing. |
| ImageViewer Enhancement `frontend/src/components/Media/ImageViewer.tsx` | Extended with OCR state management (isOCRActive, ocrData, isOCRLoading, imageScale). Added Ctrl+T keyboard shortcut to toggle OCR. Integrated OCRService calls with result storage and error handling. Added image scaling computation effect. UI now renders TextOverlay when OCR data is present and shows processing indicators. Reset behavior extended to clear OCR state on imagePath or resetSignal changes. |
| Text Overlay Component `frontend/src/components/Media/TextOverlay.tsx` | New React component that renders selectable text overlays from OCR results. Displays lines positioned by bounding boxes and scaled accordingly. Implements Ctrl+C keyboard handler for copying selected text to clipboard with user feedback. Includes fade-in animation, hover highlighting, and selection styling. Early returns when no OCR data present. |
| Dependency Addition `frontend/package.json` | Added `tesseract.js` version `^5.1.0` as new dependency to support OCR functionality. |

Sequence Diagram

sequenceDiagram
    actor User
    participant ImageViewer
    participant OCRService
    participant Tesseract as Tesseract.js Worker
    participant TextOverlay
    participant Clipboard

    User->>ImageViewer: Press Ctrl+T
    ImageViewer->>ImageViewer: Toggle isOCRActive state
    ImageViewer->>OCRService: recognize(imagePath)
    ImageViewer->>ImageViewer: Set isOCRLoading = true
    
    OCRService->>OCRService: Check/initialize worker
    OCRService->>Tesseract: Initialize worker (eng, PSM.AUTO)
    Tesseract-->>OCRService: Worker ready
    OCRService->>Tesseract: Process image
    Tesseract-->>OCRService: OCR result (Page with lines/bbox)
    OCRService-->>ImageViewer: Return ocrData
    
    ImageViewer->>ImageViewer: Set isOCRLoading = false
    ImageViewer->>ImageViewer: Store ocrData in state
    ImageViewer->>TextOverlay: Render with ocrData + scale
    TextOverlay->>TextOverlay: Position text overlays by bbox
    
    User->>TextOverlay: Select text + Press Ctrl+C
    TextOverlay->>Clipboard: Write selected text
    Clipboard-->>TextOverlay: Success
    TextOverlay->>TextOverlay: Show copy feedback

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

OCRService initialization complexity: Lazy-loaded worker with concurrent promise handling and state cleanup requires careful verification of edge cases (initialization races, cleanup timing)
State management in ImageViewer: Multiple interacting state variables (isOCRActive, ocrData, isOCRLoading, imageScale) with reset behavior across effects—verify effect dependencies and state consistency
TextOverlay positioning and rendering: Bounding-box calculations, scaling factors, and absolutely-positioned overlay elements need verification for correctness across image dimensions
Keyboard event handling: Two separate keyboard shortcuts (Ctrl+T in ImageViewer, Ctrl+C in TextOverlay) with event propagation and global handlers—ensure no conflicts or unintended side effects
Integration with external library: Tesseract.js is a new dependency with async worker operations; verify proper error handling and resource cleanup

Possibly related PRs

PR #530: Established the TransformWrapper-based rendering pattern in ImageViewer.tsx that this PR extends with OCR state, lifecycle management, and TextOverlay integration.

Suggested labels

enhancement, UI, frontend

Suggested reviewers

rahulharpal1603

Poem

🐰 With Ctrl+T, the text takes flight,
OCR brings words to gleaming light,
Select and copy, clear as day,
Tesseract whispers: "What shall we say?"
Images speak now, in every way!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately describes the main feature: OCR text selection with refine functionality, which aligns with the core changes across backend (OCR utility, endpoint) and frontend (selection overlay, refine panel). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24dca2 and 24bbb6b.

⛔ Files ignored due to path filters (1)

frontend/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (4)

frontend/package.json (1 hunks)
frontend/src/components/Media/ImageViewer.tsx (3 hunks)
frontend/src/components/Media/TextOverlay.tsx (1 hunks)
frontend/src/services/OCRService.ts (1 hunks)


github-actions · 2025-11-30T08:22:50Z

⚠️ No issue was linked in the PR description.
Please make sure to link an issue (e.g., 'Fixes #issue_number')

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (7)

backend/app/utils/ocr.py (2)
27-34: Lazy import of pytesseract is appropriate for optional dependency handling.

The pattern of importing inside the function enables graceful degradation when pytesseract isn't installed. However, consider catching ImportError specifically rather than bare Exception for clarity.
     try:
         import pytesseract
-    except Exception:
+    except ImportError:
         pytesseract = None
55-62: Silent failure on coordinate parsing may hide data corruption issues.

When int() conversion fails for bounding box coordinates, the word is silently skipped. Consider logging a warning to help diagnose OCR data issues.
                 try:
                     left = int(data.get("left", [])[i])
                     top = int(data.get("top", [])[i])
                     w = int(data.get("width", [])[i])
                     h = int(data.get("height", [])[i])
-                except Exception:
+                except (ValueError, TypeError, IndexError) as e:
+                    logger.debug(f"Skipping word at index {i} due to invalid bbox data: {e}")
                     continue
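
To make the suggestion above concrete, here is a self-contained sketch of the parsing loop with narrow exception handling and a logged skip. It assumes `data` has the parallel-list shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`; the function name `extract_word_boxes` and the output keys are illustrative, not the PR's actual code.

```python
import logging

logger = logging.getLogger(__name__)

BBOX_KEYS = ("left", "top", "width", "height")


def extract_word_boxes(data: dict) -> list:
    """Turn a Tesseract-style dict of parallel lists into word boxes."""
    words = []
    for i, raw in enumerate(data.get("text", [])):
        text = str(raw).strip()
        if not text:
            # Tesseract emits empty text for page/block/line-level rows.
            continue
        try:
            box = {key: int(data[key][i]) for key in BBOX_KEYS}
        except (KeyError, IndexError, ValueError, TypeError) as exc:
            # Log instead of failing silently, then skip only this word.
            logger.warning("Skipping word %d (%r): invalid bbox data: %s", i, text, exc)
            continue
        words.append({"text": text, **box})
    return words
```

Because the function only touches plain dicts and lists, it can be exercised in unit tests without Tesseract installed.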
backend/app/routes/images.py (2)
136-167: Consider adding a response model for consistency and documentation.

Other endpoints in this file use Pydantic response models (e.g., GetAllImagesResponse). Adding one for the OCR endpoint would improve API documentation and type safety.
class OCRWordData(BaseModel):
    text: str
    left: int
    top: int
    width: int
    height: int

class OCRResponse(BaseModel):
    success: bool
    image_id: str
    full_text: str
    words: List[OCRWordData]
    image_width: int
    image_height: int

@router.get("/{image_id}/ocr", response_model=OCRResponse)
def get_image_ocr(image_id: str):
    ...
165-167: Avoid exposing raw exception details in production responses.

Including {e} directly in the error detail could leak internal implementation details. Consider using a generic message while logging the full error.
     except Exception as e:
         logger.error(f"Error in OCR route for {image_id}: {e}")
-        raise HTTPException(status_code=500, detail=f"OCR failed: {e}")
+        raise HTTPException(status_code=500, detail="OCR processing failed")
backend/app/database/images.py (1)

123-160: Differentiate DB errors from “not found” in db_get_image_by_id

The implementation is correct, but by catching all exceptions and returning None, callers cannot distinguish “record not found” from an underlying DB error. If the route maps None to HTTP 404, genuine DB failures would also surface as 404 instead of 500.

Consider either:

Letting unexpected exceptions propagate so the route can return a 500, or

Returning a structured result (e.g. { record: ..., error: ... }) so the caller can handle “not found” vs “error” separately.
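
The first option could look roughly like the sketch below: return None only when no row matches, and let database errors propagate so the route can answer with a 500. The table name `images`, its `id` and `path` columns, and the function signature are assumptions for illustration; the real `db_get_image_by_id` may differ.

```python
import sqlite3
from typing import Optional


def get_image_by_id(conn: sqlite3.Connection, image_id: str) -> Optional[dict]:
    """Return the image row as a dict, or None when no row matches.

    sqlite3.Error is deliberately not caught here, so a caller can
    distinguish "not found" (None) from a genuine database failure.
    """
    cursor = conn.execute(
        "SELECT id, path FROM images WHERE id = ?", (image_id,)
    )
    row = cursor.fetchone()
    if row is None:
        return None
    return {"id": row[0], "path": row[1]}
```

With this shape, the route maps None to 404 and wraps the call in its own `except sqlite3.Error` (or a generic handler) to produce a 500.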

docs/backend/backend_python/openapi.json (1)

929-969: Align OCR endpoint OpenAPI responses and schema with backend behavior

The new /images/{image_id}/ocr path only declares 200/422 with an empty {} schema, while the backend route can return 404 (not found), 503 (OCR unavailable), and 500 (unexpected error), and has a well-defined JSON shape (success, image_id, full_text, words, image_width, image_height).

To keep docs and generated clients accurate, consider:

Adding explicit 404/503/500 responses mirroring the route behavior.

Defining a concrete response schema for the successful payload instead of {} (similar to GetAllImagesResponse/other image schemas).
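
One way to document the extra status codes is FastAPI's `responses` parameter on the route decorator, which takes a mapping from status code to response metadata. The mapping below is a hypothetical sketch; the name `OCR_ERROR_RESPONSES` and the description strings are not from this PR.

```python
# Hypothetical mapping of the non-200 outcomes described above.
# It could be passed to the route decorator, for example:
#   @router.get("/{image_id}/ocr", responses=OCR_ERROR_RESPONSES)
OCR_ERROR_RESPONSES = {
    404: {"description": "Image not found"},
    503: {"description": "OCR unavailable on this server"},
    500: {"description": "Unexpected error during OCR processing"},
}
```

Regenerating `openapi.json` after adding the mapping and a response model should then bring the docs in line with the route's behavior.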

frontend/src/components/Media/ImageViewer.tsx (1)

145-239: Optional: Treat simple click as clearing or re‑computing selection

With the current pointer logic, a plain click (no movement) in selection mode sets a zero‑size selRect but never schedules the requestAnimationFrame update, so selectedText and selectedCount remain from the previous selection while the selection rectangle jumps to the new location.

If you’d prefer clicks to clear or recompute selection, you could:

Reset selectedText/selectedCount in handlePointerDown, or

Trigger the same rect + word‑intersection update path once on pointer up when width/height are below a small threshold.

This is a UX polish only; the core functionality works as‑is.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c37d8df and e24dca2.

📒 Files selected for processing (8)

backend/app/database/images.py (1 hunks)
backend/app/routes/images.py (2 hunks)
backend/app/utils/ocr.py (1 hunks)
backend/requirements.txt (1 hunks)
docs/backend/backend_python/openapi.json (1 hunks)
frontend/src/api/apiEndpoints.ts (1 hunks)
frontend/src/components/Media/ImageViewer.tsx (4 hunks)
frontend/src/components/Media/MediaView.tsx (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

frontend/src/components/Media/ImageViewer.tsx (2)

frontend/src/config/Backend.ts (1)

BACKEND_URL (1-1)

frontend/src/api/apiEndpoints.ts (1)

imagesEndpoints (1-5)

backend/app/database/images.py (1)

backend/app/utils/images.py (1)

image_util_parse_metadata (496-513)

backend/app/routes/images.py (2)

backend/app/database/images.py (1)

db_get_image_by_id (123-159)

backend/app/utils/ocr.py (2)

image_ocr (12-73)

OCRUnavailableError (8-9)

🔇 Additional comments (5)

backend/app/utils/ocr.py (1)

64-66: Simple space-joined reconstruction loses original text structure.

The full_text concatenation with single spaces ignores line breaks and paragraph structure from the OCR output. If preserving formatting matters for downstream use, consider using pytesseract's block/line-level data to reconstruct text more accurately. For basic copy-paste scenarios, this is acceptable.
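
If preserving line structure turns out to matter, the line-level fields in Tesseract's output can be used to group words. The sketch below assumes the dict shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, including its `block_num`, `par_num`, and `line_num` lists, and assumes the entries arrive in reading order; `rebuild_text` is an illustrative name, not code from this PR.

```python
from itertools import groupby


def rebuild_text(data: dict) -> str:
    """Join words with spaces within a line and newlines between lines."""
    entries = []
    for i, raw in enumerate(data.get("text", [])):
        text = str(raw).strip()
        if text:
            # (block, paragraph, line) identifies one text line on the page.
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            entries.append((key, text))
    # groupby merges consecutive entries that share the same line key.
    lines = [
        " ".join(word for _, word in group)
        for _, group in groupby(entries, key=lambda entry: entry[0])
    ]
    return "\n".join(lines)
```

Paragraph breaks could be added the same way by emitting a blank line whenever the `(block_num, par_num)` prefix changes.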

frontend/src/api/apiEndpoints.ts (1)

4-4: LGTM!

The new endpoint follows the established pattern for dynamic endpoints in this file and correctly constructs the OCR API path.

backend/app/routes/images.py (1)

9-10: LGTM!

Imports are correctly added for the new OCR functionality.

frontend/src/components/Media/MediaView.tsx (1)

172-180: LGTM!

The imageId prop is correctly passed to ImageViewer to enable OCR functionality. The optional chaining on currentImage?.id is defensive given the early return guard at line 142, but it's a safe pattern.

backend/requirements.txt (1)

74-74: Dependency addition looks correct.

The pytesseract>=0.3.10 constraint is appropriate and allows current versions up to the latest release (0.3.13). This requires Tesseract OCR to be installed at the system level. Ensure installation documentation is updated to reflect this system dependency requirement.

backend/app/routes/images.py

coderabbitai · 2025-11-30T08:26:51Z

frontend/src/components/Media/ImageViewer.tsx

+    useEffect(() => {
+      let cancelled = false;
+      const loadOcr = async (imgId?: string | null) => {
+        if (!imgId) return;
+        try {
+          const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
+          if (!resp.ok) {
+            return;
+          }
+          const data = await resp.json();
+          if (cancelled) return;
+          if (data && data.words && data.image_width && data.image_height) {
+            setOcrWords(data.words);
+            setOcrImageSize({ w: data.image_width, h: data.image_height });
+            // prepare normalized cache
+            normalizedWordsRef.current = data.words.map((w: any) => ({
+              text: w.text,
+              l: (w.left / data.image_width) * 100,
+              t: (w.top / data.image_height) * 100,
+              w: (w.width / data.image_width) * 100,
+              h: (w.height / data.image_height) * 100,
+            }));
+          }
+        } catch (e) {
+          // ignore
+        }
+      };
+
+      loadOcr(imageId ?? null);
+
+      return () => {
+        cancelled = true;
+      };
+    }, [imageId]);


⚠️ Potential issue | 🟠 Major

Clear OCR + selection state when imageId changes or OCR fetch fails

Right now, if imageId changes or the OCR request returns a non‑OK status / unexpected payload, ocrWords, ocrImageSize, and normalizedWordsRef.current are left as‑is. That means the new image can still show and operate on OCR boxes from the previous image, which is misleading for selection and copy.

Consider resetting OCR and selection state whenever imageId changes and when the response is not usable, e.g.:

     useEffect(() => {
       let cancelled = false;
-      const loadOcr = async (imgId?: string | null) => {
-        if (!imgId) return;
+      // reset previous OCR/selection
+      setOcrWords(null);
+      setOcrImageSize(null);
+      normalizedWordsRef.current = null;
+      setSelRect(null);
+      setSelectedText(null);
+      setSelectedCount(0);
+      setCopied(false);
+
+      const loadOcr = async (imgId?: string | null) => {
+        if (!imgId) return;
         try {
           const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
           if (!resp.ok) {
-            return;
+            return; // leave OCR cleared
           }
           const data = await resp.json();
           if (cancelled) return;
           if (data && data.words && data.image_width && data.image_height) {
             setOcrWords(data.words);
             setOcrImageSize({ w: data.image_width, h: data.image_height });
             normalizedWordsRef.current = data.words.map((w: any) => ({
               text: w.text,
               l: (w.left / data.image_width) * 100,
               t: (w.top / data.image_height) * 100,
               w: (w.width / data.image_width) * 100,
               h: (w.height / data.image_height) * 100,
             }));
           }
         } catch (e) {
           // ignore
         }
       };

       loadOcr(imageId ?? null);

This keeps the overlay and selection in sync with the currently displayed image and avoids stale OCR data.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    useEffect(() => {
-      let cancelled = false;
-      const loadOcr = async (imgId?: string | null) => {
-        if (!imgId) return;
-        try {
-          const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
-          if (!resp.ok) {
-            return;
-          }
-          const data = await resp.json();
-          if (cancelled) return;
-          if (data && data.words && data.image_width && data.image_height) {
-            setOcrWords(data.words);
-            setOcrImageSize({ w: data.image_width, h: data.image_height });
-            // prepare normalized cache
-            normalizedWordsRef.current = data.words.map((w: any) => ({
-              text: w.text,
-              l: (w.left / data.image_width) * 100,
-              t: (w.top / data.image_height) * 100,
-              w: (w.width / data.image_width) * 100,
-              h: (w.height / data.image_height) * 100,
-            }));
-          }
-        } catch (e) {
-          // ignore
-        }
-      };
-      loadOcr(imageId ?? null);
-      return () => {
-        cancelled = true;
-      };
-    }, [imageId]);
+    useEffect(() => {
+      let cancelled = false;
+      // reset previous OCR/selection
+      setOcrWords(null);
+      setOcrImageSize(null);
+      normalizedWordsRef.current = null;
+      setSelRect(null);
+      setSelectedText(null);
+      setSelectedCount(0);
+      setCopied(false);
+      const loadOcr = async (imgId?: string | null) => {
+        if (!imgId) return;
+        try {
+          const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
+          if (!resp.ok) {
+            return; // leave OCR cleared
+          }
+          const data = await resp.json();
+          if (cancelled) return;
+          if (data && data.words && data.image_width && data.image_height) {
+            setOcrWords(data.words);
+            setOcrImageSize({ w: data.image_width, h: data.image_height });
+            // prepare normalized cache
+            normalizedWordsRef.current = data.words.map((w: any) => ({
+              text: w.text,
+              l: (w.left / data.image_width) * 100,
+              t: (w.top / data.image_height) * 100,
+              w: (w.width / data.image_width) * 100,
+              h: (w.height / data.image_height) * 100,
+            }));
+          }
+        } catch (e) {
+          // ignore
+        }
+      };
+      loadOcr(imageId ?? null);
+      return () => {
+        cancelled = true;
+      };
+    }, [imageId]);

🤖 Prompt for AI Agents

frontend/src/components/Media/ImageViewer.tsx around lines 74-107: the OCR state and normalized cache are left from a previous image when imageId changes or when the fetch fails/returns an unusable payload; to fix, immediately clear OCR-related state whenever imageId changes (call setOcrWords([]), setOcrImageSize(undefined/null) and set normalizedWordsRef.current = []) before starting loadOcr, and also clear that same OCR state plus any selection/copy state (e.g. selected word(s) / selection bounds — whatever state you use for selection) inside the branches where resp.ok is false, the payload is invalid, and inside the catch handler so stale OCR boxes and selections are never shown for the new image.

frontend/src/components/Media/ImageViewer.tsx

coderabbitai bot reviewed Nov 30, 2025


Add OCR text selection feature

24bbb6b

Aryan-Shan force-pushed the feature/ocr-text-selection branch from e24dca2 to 24bbb6b Compare December 5, 2025 07:45

Aryan-Shan closed this Dec 5, 2025
