Metadata-Version: 2.4
Name: abstract_ocr
Version: 0.0.1.67
Summary: A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, PaddleOCR, and ordered OCR assembly.
Author: putkoff
Author-email: partners@abstractendeavors.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: abstract_utilities==0.2.2.780
Requires-Dist: abstract_pdfs==0.0.29
Requires-Dist: paddleocr==3.5.0
Requires-Dist: paddlepaddle==3.2.2
Requires-Dist: paddlex==3.5.2
Requires-Dist: pdf2image==1.17.0
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: PyMuPDF==1.27.2
Requires-Dist: pypdfium2==5.6.0
Requires-Dist: pdfplumber==0.11.9
Requires-Dist: pdfminer.six==20251230
Requires-Dist: python-docx==1.2.0
Requires-Dist: openpyxl==3.1.5
Requires-Dist: opencv-python-headless==4.13.0.92
Requires-Dist: pillow==12.1.1
Requires-Dist: numpy==2.3.5
Requires-Dist: scikit-image==0.26.0
Requires-Dist: spacy==3.8.11
Requires-Dist: nltk==3.9.4
Requires-Dist: beautifulsoup4==4.14.3
Requires-Dist: lxml==6.0.2
Requires-Dist: requests==2.32.5
Requires-Dist: tqdm==4.67.3
Requires-Dist: PyYAML==6.0.2
Requires-Dist: packaging
Requires-Dist: typing_extensions==4.15.0
Requires-Dist: moviepy==1.0.3
Provides-Extra: easyocr
Requires-Dist: easyocr==1.7.2; extra == "easyocr"
Provides-Extra: tesseract
Requires-Dist: pytesseract==0.3.13; extra == "tesseract"
Provides-Extra: abstract-hugpy
Requires-Dist: abstract_hugpy; extra == "abstract-hugpy"
Provides-Extra: all
Requires-Dist: easyocr==1.7.2; extra == "all"
Requires-Dist: pytesseract==0.3.13; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

## Part of the Abstract Media Intelligence Platform

This module provides layout-aware OCR as part of a larger media processing system.

abstract_ocr focuses on extraction:
- multi-engine OCR (Tesseract / EasyOCR / PaddleOCR)
- column detection and region segmentation
- structured, position-aware text output

Full system: https://github.com/AbstractEndeavors/abstract-media-intelligence

---

## **abstract_ocr / layout_ocr — Layout-Aware OCR Pipeline**

A structured OCR pipeline designed for **layout-aware text extraction from complex documents**, combining preprocessing, column detection, region classification, and ordered OCR assembly.

Built to handle:

* multi-column PDFs
* mixed-content layouts (text, figures, captions)
* noisy or scanned documents
* large-scale document ingestion pipelines

---

## 🔹 What This System Is

This is not a simple OCR wrapper — it is a **typed, multi-stage processing pipeline**:

* transforms raw images into structured page representations
* detects document layout (columns, headers, regions)
* classifies content blocks (text, figures, captions)
* applies OCR at the region level
* reconstructs output in correct reading order

The system is designed for **deterministic, reproducible extraction** rather than heuristic text scraping.

---

## Pipeline Overview

```text
PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    └─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    └─ gallery index pages
    ↓
Static Site Output (SEO-ready)
```

The OCR stage itself processes each page image as follows:

```mermaid
flowchart TD
    A[Input Image / Page Image]
    B[Preprocess\nDenoise + Binarize]
    C[Layout Detection\nColumns + Header Cutoff]
    D[Region Classification\nText / Figure / Caption]
    E[Region OCR\nCrop + Tesseract]
    F[Fallback OCR\nColumn-level OCR]
    G[Reading Order Assembly]
    H[Structured OCRResult\nBlocks + Raw Text + Layout]

    A --> B --> C --> D --> E --> G --> H
    D -->|No usable regions| F --> G
```

---

## 🔹 Core Capabilities

* **Layout Detection**

  * Column detection via vertical projection valleys
  * Header segmentation via density scanning
  * Multi-column classification (single / dual / mixed)

* **Region Classification**

  * Connected-component analysis
  * Density-based classification (text vs figure vs caption)
  * Column-aware region assignment

* **Region-Level OCR**

  * OCR applied per detected block (not full-page)
  * Adaptive Tesseract configuration by region type
  * Automatic fallback to column-level OCR when detection fails

* **Reading Order Reconstruction**

  * Column-aware ordering
  * Top-to-bottom sequencing within columns
  * Header/body/caption prioritization

* **Typed Pipeline Execution**

  * All steps validated via explicit input/output types
  * Registry-driven execution model
  * No implicit coupling between pipeline stages
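
To illustrate column detection via vertical projection valleys, the sketch below counts ink per pixel column and splits the page wherever a near-empty run is wide enough. It is a minimal pure-Python illustration, not this package's implementation: `detect_columns`, `valley_ratio`, and `min_gap` are hypothetical names, and a production version would more likely operate on NumPy or OpenCV arrays.

```python
from typing import List, Tuple


def detect_columns(
    binary: List[List[int]],
    valley_ratio: float = 0.02,  # hypothetical threshold, as a fraction of page height
    min_gap: int = 3,            # hypothetical minimum valley width in pixels
) -> List[Tuple[int, int]]:
    """Return (start_x, end_x) column spans, end exclusive.

    `binary` holds rows of 0/1 pixels where 1 means ink.
    """
    if not binary or not binary[0]:
        return []
    height, width = len(binary), len(binary[0])

    # Vertical projection: ink count at each x position.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    threshold = valley_ratio * height

    columns: List[Tuple[int, int]] = []
    start = None  # x where the current column began
    gap = 0       # length of the current run of valley pixels
    for x, count in enumerate(profile):
        if count > threshold:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The valley is wide enough: close the column at its last ink pixel.
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        columns.append((start, width - gap))
    return columns


# Two ink bands at x = 2..7 and x = 12..17 on a 20 x 10 page.
page = [[1 if 2 <= x < 8 or 12 <= x < 18 else 0 for x in range(20)]
        for _ in range(10)]
print(detect_columns(page))  # [(2, 8), (12, 18)]
```

Requiring a minimum valley width keeps narrow gaps, such as the space between two words, from splitting a column.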

---

## 🔹 Architecture

The pipeline is built around a **step registry + type-safe execution chain**:

* Each step declares:

  * input type
  * output type
* The pipeline validates compatibility before execution
* Execution is explicit, deterministic, and observable

Example chain:

```python
["preprocess", "detect_layout", "ocr_regions"]
```

Each step is independently replaceable and composable.
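
A step registry with type checking between stages could look roughly like the sketch below. Only the step names and the four type names come from this README; `Step`, `register`, `validate_chain`, `run_chain`, and the stub step bodies are hypothetical and do not reflect the package's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Placeholder payload types; the real classes carry image and layout data.
class PageImage: ...
class PreprocessedImage: ...
class LayoutDetection: ...
class OCRResult: ...


@dataclass(frozen=True)
class Step:
    name: str
    input_type: type
    output_type: type
    fn: Callable[[Any], Any]


REGISTRY: Dict[str, Step] = {}


def register(name: str, input_type: type, output_type: type):
    """Decorator that records a step together with its declared types."""
    def decorator(fn):
        REGISTRY[name] = Step(name, input_type, output_type, fn)
        return fn
    return decorator


@register("preprocess", PageImage, PreprocessedImage)
def preprocess(page):
    return PreprocessedImage()  # stub body


@register("detect_layout", PreprocessedImage, LayoutDetection)
def detect_layout(image):
    return LayoutDetection()  # stub body


@register("ocr_regions", LayoutDetection, OCRResult)
def ocr_regions(layout):
    return OCRResult()  # stub body


def validate_chain(names: List[str]) -> List[Step]:
    """Check that each step's output type feeds the next step's input type."""
    steps = [REGISTRY[name] for name in names]  # KeyError on unknown step
    for prev, nxt in zip(steps, steps[1:]):
        if not issubclass(prev.output_type, nxt.input_type):
            raise TypeError(
                f"{prev.name} -> {nxt.name}: {prev.output_type.__name__} "
                f"does not satisfy {nxt.input_type.__name__}"
            )
    return steps


def run_chain(names: List[str], value: Any) -> Any:
    steps = validate_chain(names)  # validate before running anything
    if steps and not isinstance(value, steps[0].input_type):
        raise TypeError(f"{steps[0].name} expects {steps[0].input_type.__name__}")
    for step in steps:
        value = step.fn(value)
    return value


result = run_chain(["preprocess", "detect_layout", "ocr_regions"], PageImage())
print(type(result).__name__)  # OCRResult
```

In this sketch, a chain such as `["preprocess", "ocr_regions"]` is rejected with a `TypeError` before any step runs, because `PreprocessedImage` is not a `LayoutDetection`.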

---

## 🔹 Key Design Decisions

### **Typed Data Flow**

All intermediate results are structured dataclasses:

* `PageImage`
* `PreprocessedImage`
* `LayoutDetection`
* `OCRResult`

Avoiding ad-hoc dictionaries ensures:

* traceability
* consistency
* debuggability
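
For a sense of what such a typed flow can look like, here is a hedged sketch using frozen dataclasses. The four class names come from the list above; every field, plus the helper `TextBlock` type, is an assumption made for illustration and may not match the package's real definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed convention


@dataclass(frozen=True)
class PageImage:
    page_number: int
    width: int
    height: int


@dataclass(frozen=True)
class PreprocessedImage:
    source: PageImage
    denoised: bool
    binarized: bool


@dataclass(frozen=True)
class LayoutDetection:
    source: PreprocessedImage
    columns: Tuple[Tuple[int, int], ...]  # (start_x, end_x) per column
    header_cutoff: Optional[int] = None   # y below which the body starts


@dataclass(frozen=True)
class TextBlock:  # hypothetical helper type
    kind: str     # "text", "figure", or "caption"
    bbox: BBox
    text: str


@dataclass(frozen=True)
class OCRResult:
    layout: LayoutDetection
    blocks: Tuple[TextBlock, ...]

    @property
    def raw_text(self) -> str:
        # Blocks are assumed to be stored in reading order already.
        return "\n".join(block.text for block in self.blocks)


page = PageImage(page_number=1, width=600, height=800)
pre = PreprocessedImage(source=page, denoised=True, binarized=True)
layout = LayoutDetection(source=pre, columns=((0, 300), (300, 600)), header_cutoff=80)
result = OCRResult(
    layout=layout,
    blocks=(
        TextBlock("text", (0, 80, 300, 200), "Left column."),
        TextBlock("text", (300, 80, 600, 200), "Right column."),
    ),
)
print(result.raw_text)
```

Because each result keeps a reference to the value it was derived from, any block in the final output can be traced back to its page and preprocessing settings.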

---

### **Layout-First OCR**

OCR is applied **after structure is understood**, not before.

This prevents:

* column interleaving
* incorrect reading order
* misclassification of content

---

### **Fallback Over Failure**

If region detection fails:

* the system falls back to column-level OCR
* the output is still usable
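
A minimal sketch of this fallback, assuming the OCR engine is passed in as a callable that takes a bounding box and returns text. `OCROutput`, `ocr_with_fallback`, and `ocr_crop` are hypothetical names, and the stand-in engine below does no real OCR.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed convention


@dataclass(frozen=True)
class OCROutput:
    texts: Tuple[str, ...]
    mode: str  # "region" or "column", so callers can see which path ran


def ocr_with_fallback(
    regions: Sequence[Box],
    columns: Sequence[Box],
    ocr_crop: Callable[[Box], str],
) -> OCROutput:
    """OCR each detected region, or whole columns when no regions exist."""
    if regions:
        return OCROutput(tuple(ocr_crop(box) for box in regions), "region")
    # No usable regions: degrade to column-level OCR instead of failing.
    return OCROutput(tuple(ocr_crop(box) for box in columns), "column")


def fake_ocr(box: Box) -> str:
    # Stand-in for a real engine call on the cropped image.
    return f"text@{box[0]},{box[1]}"


columns = [(0, 0, 300, 800), (300, 0, 600, 800)]
print(ocr_with_fallback([], columns, fake_ocr).mode)                    # column
print(ocr_with_fallback([(0, 80, 300, 200)], columns, fake_ocr).mode)  # region
```

Recording the mode in the result keeps the fallback visible to downstream consumers rather than hiding it.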

---

### **Determinism Over Heuristics**

* explicit thresholds (config-driven)
* no hidden behavior
* reproducible results across runs
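
One way to keep thresholds explicit is an immutable, validated config object, sketched below. `LayoutConfig`, `load_config`, the field names, and the default values are all assumptions for illustration; they are not the package's actual settings.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class LayoutConfig:
    valley_ratio: float = 0.02    # hypothetical: max ink fraction for a valley
    min_gap_px: int = 20          # hypothetical: min valley width between columns
    header_density: float = 0.10  # hypothetical: row density marking the header end

    def __post_init__(self):
        # Reject out-of-range values early so bad configs never reach the pipeline.
        if not 0.0 <= self.valley_ratio <= 1.0:
            raise ValueError("valley_ratio must be in [0, 1]")
        if self.min_gap_px < 1:
            raise ValueError("min_gap_px must be >= 1")
        if not 0.0 <= self.header_density <= 1.0:
            raise ValueError("header_density must be in [0, 1]")


def load_config(overrides: Dict[str, Any]) -> LayoutConfig:
    """Build a config from overrides, refusing unknown keys."""
    known = {f.name for f in fields(LayoutConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return LayoutConfig(**overrides)


config = load_config({"min_gap_px": 30})
print(config)
# LayoutConfig(valley_ratio=0.02, min_gap_px=30, header_density=0.1)
```

Refusing unknown keys means a misspelled setting fails loudly instead of silently leaving a default in place.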

---

## 🔹 Why This Exists

Traditional OCR pipelines often:

* ignore layout
* operate on full pages
* produce inconsistent reading order
* fail silently on complex documents

This system:

* understands document structure
* isolates regions before OCR
* enforces reading order
* produces structured outputs suitable for downstream systems

---

## 🔹 Example Use Cases

* PDF → structured text extraction
* research document ingestion pipelines
* financial filings parsing
* multi-column article extraction
* preprocessing for NLP / LLM pipelines
* search indexing and document analysis

---

## 🔹 Integration Context

This module is designed to plug into:

* document ingestion systems
* OCR + NLP pipelines (e.g. abstract_hugpy)
* search and indexing systems
* large-scale document processing workflows

---

## 🔹 Design Philosophy

* **Structure before extraction**
* **Determinism over convenience**
* **Typed pipelines over implicit flows**
* **Fallback over failure**

---
