# Optical Character Recognition (OCR) Tool
A command-line tool for detecting and extracting text from images using computer vision and OCR technology.
## Features
- ✅ Text Detection - Detect whether images contain text
## Installation

### Prerequisites

This tool requires Tesseract OCR to be installed on your system:

**macOS:**

```bash
brew install tesseract
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

**Windows:** Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

### Python Dependencies

Install required Python packages:

```bash
pip install -r requirements.txt
```

Or manually:

```bash
pip install opencv-python pytesseract Pillow numpy
```
### Verify Installation

```bash
# Check Tesseract is installed
tesseract --version

# Make the script executable (Linux/macOS)
chmod +x ocr.py

# Test the tool
python3 ocr.py --help
```
## Usage

The tool provides four main commands: `detect`, `bounds`, `extract`, and `process`.

### 1. Text Detection

Detect whether an image contains text:

```bash
python3 ocr.py detect image.png
```

Output:

```
Text detected: Yes
Confidence: 85.0%
```

**Use Case:** Quickly filter images to find those containing text.
### 2. Character Boundaries

Find and visualize bounding boxes around text regions:

```bash
# Find boundaries and show count
python3 ocr.py bounds image.png

# Save annotated image with boxes drawn
python3 ocr.py bounds image.png --output annotated.png

# List all bounding box coordinates
python3 ocr.py bounds image.png --list
```

Output:

```
Detected 45 character/text regions
Bounding boxes saved to annotated.png
Bounding boxes (x, y, width, height):
  1. (23, 45, 12, 18)
  2. (38, 45, 15, 18)
  ...
```

**Use Case:** Debug text detection, analyze character positions, and verify detection quality.
### 3. Text Extraction

Extract text from an image:

```bash
# Extract and print to console
python3 ocr.py extract image.png

# Save to file
python3 ocr.py extract image.png --output extracted.txt

# Extract text in a different language (French)
python3 ocr.py extract image.png --lang fra
```

Output:

```
Extracted text:
----------------------------------------
Hello World
This is a test image
with multiple lines of text
----------------------------------------
```

**Supported Languages:**

- `eng` - English (default)
- `fra` - French
- `deu` - German
- `spa` - Spanish
- `chi_sim` - Chinese Simplified

To see available languages: `tesseract --list-langs`
### 4. Full OCR Processing

Complete OCR pipeline with detailed results:

```bash
# Human-readable output
python3 ocr.py process image.png

# Detailed text output
python3 ocr.py process image.png --detailed

# JSON output
python3 ocr.py process image.png --format json

# Detailed JSON with word/line breakdown
python3 ocr.py process image.png --format json --detailed

# Save results to file
python3 ocr.py process image.png --format json -o results.json
```

**Human-Readable Output:**

```
Text: Hello World This is a test image...
Confidence: 89.5%
Language: eng
Lines: 3
Words: 12
```
**JSON Output (`--detailed`):**

```json
{
  "image": "image.png",
  "text": "Hello World\nThis is a test image\nwith multiple lines of text",
  "confidence": 0.895,
  "language": "eng",
  "lines": 3,
  "words": 12,
  "line_details": [
    {
      "text": "Hello World",
      "confidence": 0.95,
      "words": [
        {
          "text": "Hello",
          "confidence": 0.96,
          "bbox": [23, 45, 52, 18]
        },
        {
          "text": "World",
          "confidence": 0.94,
          "bbox": [78, 45, 58, 18]
        }
      ]
    }
  ],
  "word_details": [...]
}
```
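Downstream code can post-process this JSON directly. A sketch (field names taken from the sample above) that keeps only high-confidence words:

```python
import json

def high_confidence_words(result, threshold=0.9):
    """Collect words whose per-word confidence meets the threshold."""
    return [word["text"]
            for line in result.get("line_details", [])
            for word in line["words"]
            if word["confidence"] >= threshold]

# Detailed JSON in the shape emitted by `--format json --detailed`
raw = '''{
  "text": "Hello World",
  "confidence": 0.895,
  "line_details": [{
    "text": "Hello World",
    "confidence": 0.95,
    "words": [
      {"text": "Hello", "confidence": 0.96, "bbox": [23, 45, 52, 18]},
      {"text": "World", "confidence": 0.94, "bbox": [78, 45, 58, 18]}
    ]
  }]
}'''

result = json.loads(raw)
print(high_confidence_words(result, threshold=0.95))  # ['Hello']
```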
**Use Cases:** Document indexing, building search systems, automated text extraction.
## Command Reference

### Global Options

```
-v, --verbose    Enable verbose output for debugging
```

### detect - Text Detection

```
usage: ocr.py detect [-h] image

positional arguments:
  image       Path to image file

options:
  -h, --help  show this help message and exit
```

**Returns:** Exit code 0 if text detected, 1 otherwise.
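The exit code makes `detect` usable as a filter in shell scripts. A sketch, assuming an `images/` directory of PNGs:

```bash
#!/bin/bash
# Extract text only from images that actually contain text,
# using detect's exit code (0 = text detected, 1 = none).
for img in images/*.png; do
    if python3 ocr.py detect "$img" > /dev/null 2>&1; then
        python3 ocr.py extract "$img" --output "${img%.png}.txt"
    fi
done
```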
### bounds - Character Boundaries

```
usage: ocr.py bounds [-h] [-o OUTPUT] [-l] image

positional arguments:
  image                 Path to image file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Save annotated image to file
  -l, --list            List all bounding box coordinates
```
### extract - Text Extraction

```
usage: ocr.py extract [-h] [-o OUTPUT] [--lang LANG] image

positional arguments:
  image                 Path to image file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Save extracted text to file
  --lang LANG           Language for OCR (default: eng)
```
### process - Full OCR Pipeline

```
usage: ocr.py process [-h] [-f {text,json}] [-o OUTPUT] [-d] [--lang LANG] image

positional arguments:
  image                 Path to image file

options:
  -h, --help            show this help message and exit
  -f {text,json}, --format {text,json}
                        Output format (default: text)
  -o OUTPUT, --output OUTPUT
                        Save results to file
  -d, --detailed        Include detailed word/line information
  --lang LANG           Language for OCR (default: eng)
```
## How It Works

### Architecture

The OCR tool consists of several components:

1. **Image Loading**: Uses OpenCV to load and process images
2. **Text Detection**: Uses edge detection and morphological operations to find text regions
3. **Character Boundary Detection**: Uses adaptive thresholding and contour detection
4. **Text Recognition**: Uses the Tesseract OCR engine for character recognition
5. **Post-Processing**: Formats results and calculates confidence scores
### Detection Pipeline

```
Input Image
    ↓
Grayscale Conversion
    ↓
Edge Detection (Canny)
    ↓
Morphological Operations (Dilation)
    ↓
Contour Detection
    ↓
Filter by Size & Aspect Ratio
    ↓
Text Regions Identified
```
### Recognition Pipeline

```
Input Image
    ↓
RGB Conversion
    ↓
Tesseract OCR Engine
    ↓
Character Recognition
    ↓
Confidence Scoring
    ↓
Text Output with Metadata
```
## Implementation Details

### Text Detection Algorithm

1. **Grayscale Conversion**: Simplifies the image to a single channel
2. **Edge Detection**: The Canny algorithm detects character edges
3. **Dilation**: Connects nearby text components
4. **Contour Analysis**: Identifies potential text regions
5. **Filtering**: Removes noise based on size and aspect ratio

### Character Boundary Detection

1. **Adaptive Thresholding**: Handles varying lighting conditions
2. **Morphological Operations**: Connects broken characters
3. **Contour Detection**: Finds individual character boundaries
4. **Size Filtering**: Removes noise and non-text regions
5. **Sorting**: Orders regions top-to-bottom, left-to-right
### OCR Recognition

Text recognition uses Tesseract, an open-source OCR engine originally developed at Hewlett-Packard and later maintained and sponsored by Google.
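Tesseract's word-level output (via pytesseract's `image_to_data`) can be aggregated into the text-plus-confidence results shown earlier. A sketch, assuming pytesseract's `Output.DICT` layout (parallel `text`/`conf` lists, with `conf == -1` for structural rows):

```python
def summarize(data):
    """Aggregate Tesseract word-level output into text + mean confidence.

    `data` follows pytesseract's Output.DICT layout: parallel lists under
    'text' and 'conf', where conf is -1 for non-word structural entries.
    """
    words = [(t, float(c)) for t, c in zip(data["text"], data["conf"])
             if t.strip() and float(c) >= 0]
    if not words:
        return {"text": "", "confidence": 0.0, "words": 0}
    mean_conf = sum(c for _, c in words) / len(words)
    return {"text": " ".join(t for t, _ in words),
            "confidence": round(mean_conf / 100.0, 3),  # rescale to 0-1
            "words": len(words)}

def recognize(image_path, lang="eng"):
    """The real OCR call (requires Tesseract and pytesseract installed)."""
    import pytesseract
    from PIL import Image
    data = pytesseract.image_to_data(Image.open(image_path), lang=lang,
                                     output_type=pytesseract.Output.DICT)
    return summarize(data)

# summarize() works on canned data in the same shape, no Tesseract needed:
sample = {"text": ["", "Hello", "World"], "conf": [-1, 96, 94]}
print(summarize(sample))  # {'text': 'Hello World', 'confidence': 0.95, 'words': 2}
```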
## Examples

### Example 1: Document Scanning

Extract text from a scanned document:

```bash
python3 ocr.py process document.png --format json -o document.json
```

Use the JSON output to index documents or build a search system.

### Example 2: Screenshot Text Extraction

Extract text from a screenshot:

```bash
python3 ocr.py extract screenshot.png --output screenshot.txt
```

### Example 3: Quality Analysis

Check OCR quality before processing:

```bash
python3 ocr.py bounds image.png --output check.png
```

Review `check.png` to see if text regions are detected correctly.

### Example 4: Multi-Language Document

Extract text from a French document:

```bash
python3 ocr.py extract french_doc.png --lang fra
```

### Example 5: Batch Processing

Process multiple images using a shell script:

```bash
#!/bin/bash
for img in images/*.png; do
    python3 ocr.py process "$img" --format json -o "${img%.png}.json"
done
```
## Testing

### Manual Testing

Test with sample images containing:

- **Printed Text**: Books, documents, signs

### Expected Results

- **Good Quality** (Confidence > 90%)
- **Medium Quality** (Confidence 70-90%)
- **Poor Quality** (Confidence < 70%)

### Creating Test Images

You can create test images programmatically:
```python
from PIL import Image, ImageDraw, ImageFont

# Create an image with text
img = Image.new("RGB", (400, 100), color="white")
draw = ImageDraw.Draw(img)
draw.text((20, 40), "Hello World", fill="black", font=ImageFont.load_default())
img.save("test_image.png")
```
## Performance Considerations

### Speed

- **Detection**: Fast (~10-50ms for a 1MP image)

### Optimization Tips

1. **Resize Large Images**: Scale down to 2-4MP for faster processing

### Memory Usage

- **Typical**: 50-200MB for a single image
## Limitations

### Current Limitations

1. **Handwriting**: Poor accuracy with handwritten text (use specialized models)

### Known Issues

- May detect non-text patterns as text (false positives)
## Troubleshooting

### "Tesseract not found" Error

**Problem:** pytesseract can't find the Tesseract installation.

**Solution:** On Linux/macOS, Tesseract is usually on the `PATH`. On Windows, set the path manually in `ocr.py`:

```python
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```
### Poor OCR Accuracy

**Problem:** Text extraction has many errors.

**Solutions:**

1. Preprocess the image: increase contrast, sharpen
2. Ensure a minimum resolution of 300 DPI
3. Try different Tesseract PSM (Page Segmentation) modes
4. Check that the correct language is specified
5. Fine-tune Tesseract with custom training data
### No Text Detected

**Problem:** The tool reports no text when text is clearly visible.

**Solutions:**

1. Adjust detection thresholds in the code
2. Try different preprocessing (blur, threshold)
3. Check the image format and color space
4. Increase the image resolution
5. Use the `--verbose` flag to see what's happening
### Slow Performance

**Problem:** Processing takes too long.

**Solutions:**

1. Resize large images before processing
2. Use the `detect` command first to skip empty images
3. Crop to regions of interest
4. Disable detailed mode if not needed
5. Consider GPU-accelerated OCR alternatives
## Future Enhancements

Possible improvements:

1. **GPU Acceleration**: Use CUDA for faster processing
2. **Deep Learning OCR**: Integrate EasyOCR or PaddleOCR
3. **Rotation Correction**: Auto-detect and correct text rotation
4. **Layout Analysis**: Preserve document structure, tables, columns
5. **Handwriting Support**: Add handwriting recognition models
6. **PDF Support**: Extract text from PDF documents
7. **Video OCR**: Process video frames for real-time text extraction
8. **Spell Checking**: Post-process with spell correction
9. **Web Interface**: Add a REST API and web UI
10. **Confidence Thresholds**: Filter low-confidence results
## Technical Stack

- **Python 3**: Core language

## Resources

### Documentation

### Related Projects

### Learning Resources

## License

MIT License - See LICENSE file for details.

## Contributing

Contributions welcome! See Future Enhancements above for areas to improve.

## Author

Created as part of the Coding Challenges series.

## Acknowledgments

- Google Tesseract OCR team