@Aryan-Shan Aryan-Shan commented Nov 30, 2025

Fixes #675

Description

This PR adds on-demand OCR text selection functionality to PictoPy, allowing users to select, copy, and refine text from images.

Features Added

  • OCR Integration: Backend OCR using pytesseract to extract text and word bounding boxes
  • Text Overlay: Frontend overlay with selectable text that scales with image zoom/pan
  • Right-click Context Menu: Copy selected text and refine selection options
  • Refine Panel: Inline controls to nudge/expand selection boundaries
  • Toggleable Selection: Ctrl+T to enable/disable text selection mode
  • Performance Optimizations: Cached OCR data and rAF-throttled pointer events

Technical Changes

Backend

  • Added db_get_image_by_id in images.py
  • New OCR utility (ocr.py) with pytesseract integration
  • New endpoint GET /images/{image_id}/ocr

Frontend

  • Updated ImageViewer.tsx with OCR overlay and selection logic
  • Added API endpoints in apiEndpoints.ts
  • Enhanced MediaView.tsx to pass image IDs

Testing Instructions

  1. Start backend and frontend servers
  2. Open an image with text content
  3. Press Ctrl+T to enable text selection
  4. Click and drag to select text from the overlay
  5. Use right-click menu or toolbar to copy/refine selection
  6. Test refine panel controls to adjust selection boundaries

Dependencies

  • Requires Tesseract OCR installed on system
  • Added pytesseract to backend requirements

Notes

  • Selection overlay uses light purple theme for better visibility
  • Performance improvements for large OCR datasets
  • Handles pointer events and interruptions gracefully

Summary by CodeRabbit

  • New Features
    • Introduced optical character recognition (OCR) capabilities to easily extract and analyze text directly from images in the image viewer.
    • Press Ctrl+T to toggle OCR and display all recognized text as an interactive, selectable overlay.
    • Copy selected text using the Ctrl+C keyboard shortcut with on-screen visual feedback confirming each successful copy.


@github-actions
Contributor

⚠️ No issue was linked in the PR description.
Please make sure to link an issue (e.g., 'Fixes #issue_number')

@coderabbitai
Contributor

coderabbitai bot commented Nov 30, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This PR adds OCR (Optical Character Recognition) functionality to the image viewer. It introduces a new TextOverlay component for displaying OCR results, an OCRService using Tesseract.js for text extraction, integrates a Ctrl+T keyboard shortcut to toggle OCR mode, and adds tesseract.js as a dependency to enable OCR capabilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OCR Service Integration: frontend/src/services/OCRService.ts | New service module providing lazy-initialized Tesseract.js worker for OCR operations. Exposes singleton ocrService with recognize(imagePath) method and worker cleanup via terminate(). Configured for English language with automatic PSM. Includes error handling for worker initialization and OCR processing. |
| ImageViewer Enhancement: frontend/src/components/Media/ImageViewer.tsx | Extended with OCR state management (isOCRActive, ocrData, isOCRLoading, imageScale). Added Ctrl+T keyboard shortcut to toggle OCR. Integrated OCRService calls with result storage and error handling. Added image scaling computation effect. UI now renders TextOverlay when OCR data is present and shows processing indicators. Reset behavior extended to clear OCR state on imagePath or resetSignal changes. |
| Text Overlay Component: frontend/src/components/Media/TextOverlay.tsx | New React component that renders selectable text overlays from OCR results. Displays lines positioned by bounding boxes and scaled accordingly. Implements Ctrl+C keyboard handler for copying selected text to clipboard with user feedback. Includes fade-in animation, hover highlighting, and selection styling. Early returns when no OCR data present. |
| Dependency Addition: frontend/package.json | Added tesseract.js version ^5.1.0 as new dependency to support OCR functionality. |

Sequence Diagram

sequenceDiagram
    actor User
    participant ImageViewer
    participant OCRService
    participant Tesseract as Tesseract.js Worker
    participant TextOverlay
    participant Clipboard

    User->>ImageViewer: Press Ctrl+T
    ImageViewer->>ImageViewer: Toggle isOCRActive state
    ImageViewer->>OCRService: recognize(imagePath)
    ImageViewer->>ImageViewer: Set isOCRLoading = true
    
    OCRService->>OCRService: Check/initialize worker
    OCRService->>Tesseract: Initialize worker (eng, PSM.AUTO)
    Tesseract-->>OCRService: Worker ready
    OCRService->>Tesseract: Process image
    Tesseract-->>OCRService: OCR result (Page with lines/bbox)
    OCRService-->>ImageViewer: Return ocrData
    
    ImageViewer->>ImageViewer: Set isOCRLoading = false
    ImageViewer->>ImageViewer: Store ocrData in state
    ImageViewer->>TextOverlay: Render with ocrData + scale
    TextOverlay->>TextOverlay: Position text overlays by bbox
    
    User->>TextOverlay: Select text + Press Ctrl+C
    TextOverlay->>Clipboard: Write selected text
    Clipboard-->>TextOverlay: Success
    TextOverlay->>TextOverlay: Show copy feedback

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

  • OCRService initialization complexity: Lazy-loaded worker with concurrent promise handling and state cleanup requires careful verification of edge cases (initialization races, cleanup timing)
  • State management in ImageViewer: Multiple interacting state variables (isOCRActive, ocrData, isOCRLoading, imageScale) with reset behavior across effects—verify effect dependencies and state consistency
  • TextOverlay positioning and rendering: Bounding-box calculations, scaling factors, and absolutely-positioned overlay elements need verification for correctness across image dimensions
  • Keyboard event handling: Two separate keyboard shortcuts (Ctrl+T in ImageViewer, Ctrl+C in TextOverlay) with event propagation and global handlers—ensure no conflicts or unintended side effects
  • Integration with external library: Tesseract.js is a new dependency with async worker operations; verify proper error handling and resource cleanup

Possibly related PRs

  • PR #530: Established the TransformWrapper-based rendering pattern in ImageViewer.tsx that this PR extends with OCR state, lifecycle management, and TextOverlay integration.

Suggested labels

enhancement, UI, frontend

Suggested reviewers

  • rahulharpal1603


Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately describes the main feature: OCR text selection with refine functionality, which aligns with the core changes across backend (OCR utility, endpoint) and frontend (selection overlay, refine panel). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24dca2 and 24bbb6b.

⛔ Files ignored due to path filters (1)
  • frontend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • frontend/package.json (1 hunks)
  • frontend/src/components/Media/ImageViewer.tsx (3 hunks)
  • frontend/src/components/Media/TextOverlay.tsx (1 hunks)
  • frontend/src/services/OCRService.ts (1 hunks)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (7)
backend/app/utils/ocr.py (2)

27-34: Lazy import of pytesseract is appropriate for optional dependency handling.

The pattern of importing inside the function enables graceful degradation when pytesseract isn't installed. However, consider catching ImportError specifically rather than the broad Exception, so that unrelated failures during import are not silently treated as "pytesseract unavailable".

     try:
         import pytesseract
-    except Exception:
+    except ImportError:
         pytesseract = None

55-62: Silent failure on coordinate parsing may hide data corruption issues.

When int() conversion fails for bounding box coordinates, the word is silently skipped. Consider logging a warning to help diagnose OCR data issues.

                 try:
                     left = int(data.get("left", [])[i])
                     top = int(data.get("top", [])[i])
                     w = int(data.get("width", [])[i])
                     h = int(data.get("height", [])[i])
-                except Exception:
+                except (ValueError, TypeError, IndexError) as e:
+                    logger.debug(f"Skipping word at index {i} due to invalid bbox data: {e}")
                     continue
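The narrowed exception handling above can be sketched in isolation. This is a minimal illustration, not the PR's actual ocr.py: the function name extract_word_boxes is hypothetical, and the input is assumed to be a dictionary shaped like the output of pytesseract's image_to_data with dictionary output (parallel lists under "text", "left", "top", "width", "height").

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def extract_word_boxes(data: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Collect word boxes from parallel lists, skipping malformed entries.

    `data` is assumed to look like pytesseract's dict output; the function
    name and return shape are illustrative only.
    """
    words: List[Dict[str, Any]] = []
    for i, raw in enumerate(data.get("text", [])):
        text = str(raw).strip()
        if not text:
            continue  # empty strings are layout rows, not words
        try:
            left = int(data["left"][i])
            top = int(data["top"][i])
            width = int(data["width"][i])
            height = int(data["height"][i])
        except (KeyError, ValueError, TypeError, IndexError) as exc:
            # Only the failures expected from bad bbox data are swallowed,
            # and each skip leaves a trace in the debug log.
            logger.debug("Skipping word at index %d: invalid bbox data: %s", i, exc)
            continue
        words.append(
            {"text": text, "left": left, "top": top, "width": width, "height": height}
        )
    return words
```

Any other exception type (for example a programming error inside the loop) would propagate instead of being hidden, which is the point of the narrower except clause.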
backend/app/routes/images.py (2)

136-167: Consider adding a response model for consistency and documentation.

Other endpoints in this file use Pydantic response models (e.g., GetAllImagesResponse). Adding one for the OCR endpoint would improve API documentation and type safety.

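from typing import List

from pydantic import BaseModel

# Shape of a single recognized word and its bounding box (pixels).
class OCRWordData(BaseModel):
    text: str
    left: int
    top: int
    width: int
    height: int

# Shape of the full OCR payload returned by the endpoint.
class OCRResponse(BaseModel):
    success: bool
    image_id: str
    full_text: str
    words: List[OCRWordData]
    image_width: int
    image_height: int

# `router` is the existing APIRouter in this file.
@router.get("/{image_id}/ocr", response_model=OCRResponse)
def get_image_ocr(image_id: str):
    ...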

165-167: Avoid exposing raw exception details in production responses.

Including {e} directly in the error detail could leak internal implementation details. Consider using a generic message while logging the full error.

     except Exception as e:
         logger.error(f"Error in OCR route for {image_id}: {e}")
-        raise HTTPException(status_code=500, detail=f"OCR failed: {e}")
+        raise HTTPException(status_code=500, detail="OCR processing failed")
backend/app/database/images.py (1)

123-160: Differentiate DB errors from “not found” in db_get_image_by_id

The implementation is correct, but by catching all exceptions and returning None, callers cannot distinguish “record not found” from an underlying DB error. If the route maps None to HTTP 404, genuine DB failures would also surface as 404 instead of 500.

Consider either:

  • Letting unexpected exceptions propagate so the route can return a 500, or
  • Returning a structured result (e.g. { record: ..., error: ... }) so the caller can handle “not found” vs “error” separately.
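A minimal sketch of the first option, letting unexpected errors propagate, is shown here. It uses the standard-library sqlite3 module with a hypothetical two-column images table; the function name get_image_by_id and the schema are assumptions for illustration and do not reproduce the PR's db_get_image_by_id.

```python
import sqlite3
from typing import Any, Dict, Optional


def get_image_by_id(conn: sqlite3.Connection, image_id: str) -> Optional[Dict[str, Any]]:
    """Return the image row as a dict, or None when no row matches.

    Database failures are deliberately NOT caught here: sqlite3.Error
    propagates so the caller can answer 500 rather than 404.
    The `images(id, path)` schema is hypothetical.
    """
    cur = conn.execute("SELECT id, path FROM images WHERE id = ?", (image_id,))
    row = cur.fetchone()
    if row is None:
        return None  # genuine "not found"
    return {"id": row[0], "path": row[1]}
```

With this shape, a route handler can map None to 404 and wrap the call in its own try/except to turn sqlite3.Error into a 500, keeping the two cases distinct.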
docs/backend/backend_python/openapi.json (1)

929-969: Align OCR endpoint OpenAPI responses and schema with backend behavior

The new /images/{image_id}/ocr path only declares 200/422 with an empty {} schema, while the backend route can return 404 (not found), 503 (OCR unavailable), and 500 (unexpected error), and has a well-defined JSON shape (success, image_id, full_text, words, image_width, image_height).

To keep docs and generated clients accurate, consider:

  • Adding explicit 404/503/500 responses mirroring the route behavior.
  • Defining a concrete response schema for the successful payload instead of {} (similar to GetAllImagesResponse/other image schemas).
frontend/src/components/Media/ImageViewer.tsx (1)

145-239: Optional: Treat simple click as clearing or re‑computing selection

With the current pointer logic, a plain click (no movement) in selection mode sets a zero‑size selRect but never schedules the requestAnimationFrame update, so selectedText and selectedCount remain from the previous selection while the selection rectangle jumps to the new location.

If you’d prefer clicks to clear or recompute selection, you could:

  • Reset selectedText/selectedCount in handlePointerDown, or
  • Trigger the same rect + word‑intersection update path once on pointer up when width/height are below a small threshold.

This is a UX polish only; the core functionality works as‑is.
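The second option can be illustrated with the underlying geometry. The sketch below is in Python for brevity (the component itself is TypeScript) and is not the PR's code: select_words, intersects, and the min_size threshold are hypothetical names, while the l/t/w/h word keys mirror the normalized percentage cache used in ImageViewer.tsx.

```python
from typing import Dict, List, Tuple

# (left, top, width, height), expressed as percentages of the image size
Rect = Tuple[float, float, float, float]


def intersects(a: Rect, b: Rect) -> bool:
    """Axis-aligned rectangle overlap test (touching edges do not count)."""
    al, at, aw, ah = a
    bl, bt, bw, bh = b
    return al < bl + bw and bl < al + aw and at < bt + bh and bt < at + ah


def select_words(sel: Rect, words: List[Dict], min_size: float = 0.5) -> List[str]:
    """Return the text of every word box overlapping the selection.

    A selection smaller than `min_size` in both dimensions is treated as a
    plain click and yields an empty selection, so stale text is cleared.
    """
    _, _, width, height = sel
    if width < min_size and height < min_size:
        return []
    return [
        w["text"]
        for w in words
        if intersects(sel, (w["l"], w["t"], w["w"], w["h"]))
    ]
```

Running the same function once on pointer up means a click and a drag go through one code path, so selectedText and selectedCount can never lag behind the visible rectangle.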

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c37d8df and e24dca2.

📒 Files selected for processing (8)
  • backend/app/database/images.py (1 hunks)
  • backend/app/routes/images.py (2 hunks)
  • backend/app/utils/ocr.py (1 hunks)
  • backend/requirements.txt (1 hunks)
  • docs/backend/backend_python/openapi.json (1 hunks)
  • frontend/src/api/apiEndpoints.ts (1 hunks)
  • frontend/src/components/Media/ImageViewer.tsx (4 hunks)
  • frontend/src/components/Media/MediaView.tsx (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
frontend/src/components/Media/ImageViewer.tsx (2)
frontend/src/config/Backend.ts (1)
  • BACKEND_URL (1-1)
frontend/src/api/apiEndpoints.ts (1)
  • imagesEndpoints (1-5)
backend/app/database/images.py (1)
backend/app/utils/images.py (1)
  • image_util_parse_metadata (496-513)
backend/app/routes/images.py (2)
backend/app/database/images.py (1)
  • db_get_image_by_id (123-159)
backend/app/utils/ocr.py (2)
  • image_ocr (12-73)
  • OCRUnavailableError (8-9)
🔇 Additional comments (5)
backend/app/utils/ocr.py (1)

64-66: Simple space-joined reconstruction loses original text structure.

The full_text concatenation with single spaces ignores line breaks and paragraph structure from the OCR output. If preserving formatting matters for downstream use, consider using pytesseract's block/line-level data to reconstruct text more accurately. For basic copy-paste scenarios, this is acceptable.
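One way to preserve line and paragraph breaks is to group words by the layout indices that Tesseract reports. The sketch below assumes a dictionary shaped like pytesseract's image_to_data dictionary output, which includes block_num, par_num, and line_num lists alongside text; the function name rebuild_text is hypothetical and this is not the PR's implementation.

```python
from typing import Any, Dict, List, Tuple


def rebuild_text(data: Dict[str, List[Any]]) -> str:
    """Join words per line, lines with newlines, paragraphs with blank lines.

    Relies on the input already being in reading order, which is how
    Tesseract emits its rows; dict insertion order keeps that order.
    """
    lines: Dict[Tuple[int, int, int], List[str]] = {}
    for i, raw in enumerate(data.get("text", [])):
        word = str(raw).strip()
        if not word:
            continue  # skip block/paragraph/line rows with empty text
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(word)

    parts: List[str] = []
    prev_par = None
    for (block, par, _line), words in lines.items():
        if prev_par is not None and (block, par) != prev_par:
            parts.append("")  # blank line between paragraphs
        parts.append(" ".join(words))
        prev_par = (block, par)
    return "\n".join(parts)
```

For plain copy-paste the space-joined string remains simpler; this variant only matters when downstream consumers care about the original layout.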

frontend/src/api/apiEndpoints.ts (1)

4-4: LGTM!

The new endpoint follows the established pattern for dynamic endpoints in this file and correctly constructs the OCR API path.

backend/app/routes/images.py (1)

9-10: LGTM!

Imports are correctly added for the new OCR functionality.

frontend/src/components/Media/MediaView.tsx (1)

172-180: LGTM!

The imageId prop is correctly passed to ImageViewer to enable OCR functionality. The optional chaining on currentImage?.id is defensive given the early return guard at line 142, but it's a safe pattern.

backend/requirements.txt (1)

74-74: Dependency addition looks correct.

The pytesseract>=0.3.10 constraint is appropriate and allows current versions up to the latest release (0.3.13). This requires Tesseract OCR to be installed at the system level. Ensure installation documentation is updated to reflect this system dependency requirement.

Comment on lines 74 to 107
useEffect(() => {
  let cancelled = false;
  const loadOcr = async (imgId?: string | null) => {
    if (!imgId) return;
    try {
      const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
      if (!resp.ok) {
        return;
      }
      const data = await resp.json();
      if (cancelled) return;
      if (data && data.words && data.image_width && data.image_height) {
        setOcrWords(data.words);
        setOcrImageSize({ w: data.image_width, h: data.image_height });
        // prepare normalized cache
        normalizedWordsRef.current = data.words.map((w: any) => ({
          text: w.text,
          l: (w.left / data.image_width) * 100,
          t: (w.top / data.image_height) * 100,
          w: (w.width / data.image_width) * 100,
          h: (w.height / data.image_height) * 100,
        }));
      }
    } catch (e) {
      // ignore
    }
  };

  loadOcr(imageId ?? null);

  return () => {
    cancelled = true;
  };
}, [imageId]);

⚠️ Potential issue | 🟠 Major

Clear OCR + selection state when imageId changes or OCR fetch fails

Right now, if imageId changes or the OCR request returns a non‑OK status / unexpected payload, ocrWords, ocrImageSize, and normalizedWordsRef.current are left as‑is. That means the new image can still show and operate on OCR boxes from the previous image, which is misleading for selection and copy.

Consider resetting OCR and selection state whenever imageId changes and when the response is not usable, e.g.:

 useEffect(() => {
   let cancelled = false;
-  const loadOcr = async (imgId?: string | null) => {
-    if (!imgId) return;
+  // reset previous OCR/selection
+  setOcrWords(null);
+  setOcrImageSize(null);
+  normalizedWordsRef.current = null;
+  setSelRect(null);
+  setSelectedText(null);
+  setSelectedCount(0);
+  setCopied(false);
+
+  const loadOcr = async (imgId?: string | null) => {
+    if (!imgId) return;
     try {
       const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
       if (!resp.ok) {
-        return;
+        return; // leave OCR cleared
       }
       const data = await resp.json();
       if (cancelled) return;
       if (data && data.words && data.image_width && data.image_height) {
         setOcrWords(data.words);
         setOcrImageSize({ w: data.image_width, h: data.image_height });
         normalizedWordsRef.current = data.words.map((w: any) => ({
           text: w.text,
           l: (w.left / data.image_width) * 100,
           t: (w.top / data.image_height) * 100,
           w: (w.width / data.image_width) * 100,
           h: (w.height / data.image_height) * 100,
         }));
       }
     } catch (e) {
       // ignore
     }
   };
 
   loadOcr(imageId ?? null);

This keeps the overlay and selection in sync with the currently displayed image and avoids stale OCR data.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
useEffect(() => {
  let cancelled = false;
  // reset previous OCR/selection
  setOcrWords(null);
  setOcrImageSize(null);
  normalizedWordsRef.current = null;
  setSelRect(null);
  setSelectedText(null);
  setSelectedCount(0);
  setCopied(false);
  const loadOcr = async (imgId?: string | null) => {
    if (!imgId) return;
    try {
      const resp = await fetch(`${BACKEND_URL}${imagesEndpoints.getImageOcr(imgId)}`);
      if (!resp.ok) {
        return; // leave OCR cleared
      }
      const data = await resp.json();
      if (cancelled) return;
      if (data && data.words && data.image_width && data.image_height) {
        setOcrWords(data.words);
        setOcrImageSize({ w: data.image_width, h: data.image_height });
        // prepare normalized cache
        normalizedWordsRef.current = data.words.map((w: any) => ({
          text: w.text,
          l: (w.left / data.image_width) * 100,
          t: (w.top / data.image_height) * 100,
          w: (w.width / data.image_width) * 100,
          h: (w.height / data.image_height) * 100,
        }));
      }
    } catch (e) {
      // ignore
    }
  };
  loadOcr(imageId ?? null);
  return () => {
    cancelled = true;
  };
}, [imageId]);
🤖 Prompt for AI Agents
frontend/src/components/Media/ImageViewer.tsx around lines 74-107: the OCR state
and normalized cache are left from a previous image when imageId changes or when
the fetch fails/returns an unusable payload; to fix, immediately clear
OCR-related state whenever imageId changes (call setOcrWords([]),
setOcrImageSize(undefined/null) and set normalizedWordsRef.current = []) before
starting loadOcr, and also clear that same OCR state plus any selection/copy
state (e.g. selected word(s) / selection bounds — whatever state you use for
selection) inside the branches where resp.ok is false, the payload is invalid,
and inside the catch handler so stale OCR boxes and selections are never shown
for the new image.

@Aryan-Shan Aryan-Shan force-pushed the feature/ocr-text-selection branch from e24dca2 to 24bbb6b Compare December 5, 2025 07:45
@Aryan-Shan Aryan-Shan closed this Dec 5, 2025
Development

Successfully merging this pull request may close these issues.

Feature Request: OCR Text Selection
