Optical Character Recognition (OCR) Tool

A command-line tool for detecting and extracting text from images using computer vision and OCR technology.

Features

- ✅ Text Detection - Detect whether images contain text

✅ Character Bounds - Find bounding boxes around characters and text regions

✅ Text Extraction - Extract text from images using Tesseract OCR

✅ Full OCR Pipeline - Complete processing with detailed results

✅ Multiple Output Formats - Text and JSON output

✅ Multi-language Support - Support for 100+ languages via Tesseract

✅ Detailed Analysis - Word-level and line-level confidence scores

✅ Bounding Box Visualization - Annotate images with detected regions

✅ Batch Processing Ready - Process multiple images programmatically

Installation

Prerequisites

This tool requires Tesseract OCR to be installed on your system:

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows: Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

Python Dependencies

Install required Python packages:

pip install -r requirements.txt

Or manually:

pip install opencv-python pytesseract Pillow numpy

Verify Installation

Check Tesseract is installed
tesseract --version
Make the script executable (Linux/macOS)
chmod +x ocr.py
Test the tool
python3 ocr.py --help

Usage

The tool provides four main commands: detect, bounds, extract, and process.

1. Text Detection

Detect whether an image contains text:

python3 ocr.py detect image.png

Output:

Text detected: Yes
Confidence: 85.0%

Use Case: Quickly filter images to find those containing text.

2. Character Boundaries

Find and visualize bounding boxes around text regions:

Find boundaries and show count
python3 ocr.py bounds image.png
Save annotated image with boxes drawn
python3 ocr.py bounds image.png --output annotated.png
List all bounding box coordinates
python3 ocr.py bounds image.png --list

Output:

Detected 45 character/text regions
Bounding boxes saved to annotated.pngBounding boxes (x, y, width, height):
  1. (23, 45, 12, 18)
  2. (38, 45, 15, 18)
  ...

Use Case: Debug text detection, analyze character positions, verify detection quality.

3. Text Extraction

Extract text from an image:

Extract and print to console
python3 ocr.py extract image.png
Save to file
python3 ocr.py extract image.png --output extracted.txt
Extract text in different language (French)
python3 ocr.py extract image.png --lang fra

Output:

Extracted text:
----------------------------------------
Hello World
This is a test image
with multiple lines of text
----------------------------------------

Supported Languages:

eng - English (default)

fra - French

deu - German

spa - Spanish

chi_sim - Chinese Simplified

And 100+ more...

To see available languages: tesseract --list-langs

4. Full OCR Processing

Complete OCR pipeline with detailed results:

Human-readable output


python3 ocr.py process image.png
Detailed text output
python3 ocr.py process image.png --detailed
JSON output
python3 ocr.py process image.png --format json
Detailed JSON with word/line breakdown
python3 ocr.py process image.png --format json --detailed
Save results to file
python3 ocr.py process image.png --format json -o results.json

Human-Readable Output:

Text: Hello World This is a test image...
Confidence: 89.5%
Language: eng
Lines: 3
Words: 12

JSON Output (--detailed):

{
  "image": "image.png",
  "text": "Hello World\nThis is a test image\nwith multiple lines of text",
  "confidence": 0.895,
  "language": "eng",
  "lines": 3,
  "words": 12,
  "line_details": [
    {
      "text": "Hello World",
      "confidence": 0.95,
      "words": [
        {
          "text": "Hello",
          "confidence": 0.96,
          "bbox": [23, 45, 52, 18]
        },
        {
          "text": "World",
          "confidence": 0.94,
          "bbox": [78, 45, 58, 18]
        }
      ]
    }
  ],
  "word_details": [...]
}

Use Cases:

Extract text for further processing

Analyze OCR quality with confidence scores

Integrate with other tools using JSON output

Archive text content from images

Command Reference

Global Options

-v, --verbose    Enable verbose output for debugging


-h, --help       Show help message

detect - Text Detection

usage: ocr.py detect [-h] image positional arguments: image Path to image file

options: -h, --help show this help message and exit

Returns: Exit code 0 if text detected, 1 otherwise.

bounds - Character Boundaries

usage: ocr.py bounds [-h] [-o OUTPUT] [-l] image positional arguments: image Path to image file

options: -h, --help show this help message and exit -o OUTPUT, --output OUTPUT Save annotated image to file -l, --list List all bounding box coordinates

extract - Text Extraction

usage: ocr.py extract [-h] [-o OUTPUT] [--lang LANG] image positional arguments: image Path to image file

options: -h, --help show this help message and exit -o OUTPUT, --output OUTPUT Save extracted text to file --lang LANG Language for OCR (default: eng)

process - Full OCR Pipeline

usage: ocr.py process [-h] [-f {text,json}] [-o OUTPUT] [-d] [--lang LANG] image
positional arguments:
  image                 Path to image fileoptions:
  -h, --help            show this help message and exit
  -f {text,json}, --format {text,json}
                        Output format (default: text)
  -o OUTPUT, --output OUTPUT
                        Save results to file
  -d, --detailed        Include detailed word/line information
  --lang LANG          Language for OCR (default: eng)

How It Works

Architecture

The OCR tool consists of several components:

1. Image Loading: Uses OpenCV to load and process images 2. Text Detection: Uses edge detection and morphological operations to find text regions 3. Character Boundary Detection: Uses adaptive thresholding and contour detection 4. Text Recognition: Uses Tesseract OCR engine for character recognition 5. Post-Processing: Formats results and calculates confidence scores

Detection Pipeline

Input Image
    ↓
Grayscale Conversion
    ↓
Edge Detection (Canny)
    ↓
Morphological Operations (Dilation)
    ↓
Contour Detection
    ↓
Filter by Size & Aspect Ratio
    ↓
Text Regions Identified

Recognition Pipeline

Input Image
    ↓
RGB Conversion
    ↓
Tesseract OCR Engine
    ↓
Character Recognition
    ↓
Confidence Scoring
    ↓
Text Output with Metadata

Implementation Details

Text Detection Algorithm

1. Grayscale Conversion: Simplifies image to single channel 2. Edge Detection: Canny algorithm detects character edges 3. Dilation: Connects nearby text components 4. Contour Analysis: Identifies potential text regions 5. Filtering: Removes noise based on size and aspect ratio

Character Boundary Detection

1. Adaptive Thresholding: Handles varying lighting conditions 2. Morphological Operations: Connects broken characters 3. Contour Detection: Finds individual character boundaries 4. Size Filtering: Removes noise and non-text regions 5. Sorting: Orders regions top-to-bottom, left-to-right

OCR Recognition

Uses Tesseract OCR, an open-source engine developed by Google:

LSTM neural network-based recognition

Support for 100+ languages

Confidence scoring per word

Handles various fonts and sizes

Can be fine-tuned with custom training data

Examples

Example 1: Document Scanning

Extract text from a scanned document:

python3 ocr.py process document.png --format json -o document.json

Use the JSON output to index documents or build a search system.

Example 2: Screenshot Text Extraction

Extract text from a screenshot:

python3 ocr.py extract screenshot.png --output screenshot.txt

Example 3: Quality Analysis

Check OCR quality before processing:

python3 ocr.py bounds image.png --output check.png

Review check.png to see if text regions are detected correctly.

Example 4: Multi-Language Document

Extract text from a French document:

python3 ocr.py extract french_doc.png --lang fra

Example 5: Batch Processing

Process multiple images using a shell script:

#!/bin/bash
for img in images/*.png; do
    python3 ocr.py process "$img" --format json -o "${img%.png}.json"
done

Testing

Manual Testing

Test with sample images containing:

- Printed Text: Books, documents, signs

Screenshots: Terminal output, code, UI text

Natural Scenes: Street signs, storefronts

Mixed Content: Text with images, logos

Challenging Cases: Low resolution, rotated, blurry

Expected Results

Good Quality (Confidence > 90%):

Clean printed text

High resolution scans

Good contrast

Horizontal text

Medium Quality (Confidence 70-90%):

Screenshots with anti-aliasing

Slightly blurry images

Moderate background noise

Mixed fonts and sizes

Poor Quality (Confidence < 70%):

Very low resolution

Severe blur or distortion

Complex backgrounds

Heavily stylized fonts

Handwritten text

Creating Test Images

You can create test images programmatically:

from PIL import Image, ImageDraw, ImageFont
Create image with text


img = Image.new('RGB', (400, 100), color='white')
d = ImageDraw.Draw(img)
d.text((10, 10), "Hello World", fill='black')
img.save('test.png')

Performance Considerations

Speed

- Detection: Fast (~10-50ms for 1MP image)

Boundary Finding: Fast (~20-100ms)

OCR Recognition: Slower (~100-1000ms depending on image complexity)

Optimization Tips

1. Resize Large Images: Scale down to 2-4MP for faster processing

2. Preprocess: Convert to grayscale, increase contrast 3. Batch Processing: Process multiple images in parallel 4. Use detect First: Skip OCR if no text detected 5. Limit Search Areas: Crop to text regions before OCR

Memory Usage

- Typical: 50-200MB for single image

Large images: Up to 500MB-1GB

Batch processing: Memory usage is sequential (one image at a time)

Limitations

Current Limitations

1. Handwriting: Poor accuracy with handwritten text (use specialized models)

2. Rotation: Works best with upright text (< 15° rotation) 3. Distortion: Struggles with perspective distortion or curved text 4. Artistic Fonts: Limited support for highly stylized fonts 5. Background: Complex backgrounds reduce accuracy 6. Resolution: Requires minimum ~300 DPI for best results

Known Issues

- May detect non-text patterns as text (false positives)

Confidence scores are estimates, not guarantees

Very small text (< 10pt) may not be detected

Mixed languages in same image may reduce accuracy

Troubleshooting

"Tesseract not found" Error

Problem: pytesseract can't find Tesseract installation

Solution:

Linux/macOS - Tesseract usually in PATH
Windows - Set path manually in ocr.py:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Poor OCR Accuracy

Problem: Text extraction has many errors

Solutions: 1. Preprocess image: increase contrast, sharpen 2. Ensure minimum 300 DPI resolution 3. Try different Tesseract PSM modes (Page Segmentation Modes) 4. Check if correct language is specified 5. Fine-tune Tesseract with custom training data

No Text Detected

Problem: Tool reports no text when text is clearly visible

Solutions: 1. Adjust detection thresholds in code 2. Try different preprocessing (blur, threshold) 3. Check image format and color space 4. Increase image resolution 5. Use --verbose flag to see what's happening

Slow Performance

Problem: Processing takes too long

Solutions: 1. Resize large images before processing 2. Use detect command first to skip empty images 3. Crop to regions of interest 4. Disable detailed mode if not needed 5. Consider GPU-accelerated OCR alternatives

Future Enhancements

Possible improvements:

1. GPU Acceleration: Use CUDA for faster processing 2. Deep Learning OCR: Integrate EasyOCR or PaddleOCR 3. Rotation Correction: Auto-detect and correct text rotation 4. Layout Analysis: Preserve document structure, tables, columns 5. Handwriting Support: Add handwriting recognition models 6. PDF Support: Extract text from PDF documents 7. Video OCR: Process video frames for real-time text extraction 8. Spell Checking: Post-process with spell correction 9. Web Interface: Add REST API and web UI 10. Confidence Thresholds: Filter low-confidence results