top of page

Build a Scanned PDF to Structured JSON Pipeline in Python (End-to-End)

Build a Scanned PDF to Structured JSON Pipeline in Python (End-to-End)


Converting a scanned PDF into clean, structured JSON is one of the most common — and most underestimated — problems in document AI.


Most tutorials show you how to read a text-based PDF with PyPDF2 in 10 lines of code. That's not what this is. Scanned PDFs are images. The text isn't embedded — it's pixels. Extracting structured data from them requires a real pipeline: preprocessing, OCR, layout analysis, data structuring, validation, and an API layer to serve it all in production.


This guide builds that pipeline from scratch, end-to-end — with full working Python code. By the end you'll have a FastAPI service that accepts a scanned PDF, runs it through a production-grade OCR pipeline, and returns clean structured JSON.


We've deployed this exact architecture for engineering clients processing P&IDs, equipment datasheets, and scanned technical documents.


The live demo runs at 👉 docprocessing360.com


What "Structured JSON" Actually Means

Before writing any code, define what you're building toward.

A raw OCR dump looks like this — flat, unordered, useless for downstream systems:




{
  "raw_text": "FIC-201 Flow Indicating Controller 6\"-P-1042 Centrifugal Pump P-101A..."
}



Structured JSON looks like this — typed, organised, queryable:




{
  "document_id": "ENG-DOC-2024-001",
  "document_type": "equipment_datasheet",
  "extraction_confidence": 0.92,
  "extracted_at": "2025-05-17T10:30:00Z",
  "fields": {
    "equipment_tag": { "value": "P-101A", "confidence": 0.97, "bbox": [120, 340, 180, 360] },
    "equipment_type": { "value": "Centrifugal Pump", "confidence": 0.94, "bbox": [200, 340, 380, 360] },
    "service": { "value": "Crude Feed Pump", "confidence": 0.91, "bbox": [120, 365, 320, 385] },
    "design_pressure": { "value": "150 PSI", "confidence": 0.89, "bbox": [120, 390, 220, 410] },
    "design_temp": { "value": "250°F", "confidence": 0.93, "bbox": [240, 390, 320, 410] }
  },
  "tables": [
    {
      "table_id": "nozzle_schedule",
      "rows": [
        { "nozzle": "N1", "size": "6\"", "rating": "150#", "service": "Suction" },
        { "nozzle": "N2", "size": "4\"", "rating": "150#", "service": "Discharge" }
      ]
    }
  ]
}



Every field has a value, a confidence score, and a bounding box. This is what production systems need.



Pipeline Architecture

The full pipeline has six stages:



Scanned PDF Input
      ↓
[1] PDF → Image Conversion
      ↓
[2] Image Preprocessing (OpenCV)
      ↓
[3] OCR Engine (Tesseract / AWS Textract)
      ↓
[4] Layout Analysis & Region Detection
      ↓
[5] Field Extraction & Table Parsing
      ↓
[6] JSON Assembly & Confidence Scoring
      ↓
FastAPI Endpoint → Structured JSON Output



Each stage has a distinct responsibility. Building them as separate functions makes the pipeline testable, replaceable, and debuggable.



Environment Setup



pip install pymupdf opencv-python pytesseract pillow \
            boto3 pydantic fastapi uvicorn python-multipart \
            numpy pdfplumber


Install Tesseract system dependency:



# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows — download installer from:
# https://github.com/UB-Mannheim/tesseract/wiki




Project structure:



pdf_pipeline/
├── main.py            # FastAPI app
├── pipeline/
│   ├── __init__.py
│   ├── converter.py   # PDF → image
│   ├── preprocessor.py # OpenCV preprocessing
│   ├── ocr.py         # OCR engine
│   ├── extractor.py   # Field + table extraction
│   ├── assembler.py   # JSON assembly
│   └── validator.py   # Output validation
├── models/
│   └── schemas.py     # Pydantic models
└── config.py          # Settings





Stage 1 — PDF to Image Conversion

Scanned PDFs are image containers. The first step is rendering each page as a high-resolution image.




# pipeline/converter.py

import fitz  # PyMuPDF
import numpy as np
from PIL import Image
from pathlib import Path

def pdf_to_images(pdf_bytes: bytes, dpi: int = 300) -> list[np.ndarray]:
    """
    Convert scanned PDF pages to high-resolution numpy images.
    300 DPI is the minimum for reliable OCR on engineering documents.
    Use 400+ DPI for documents with very small text (instrument tags).
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    images = []

    for page_num in range(len(doc)):
        page = doc[page_num]

        # Scale matrix for target DPI (default PDF is 72 DPI)
        zoom = dpi / 72
        matrix = fitz.Matrix(zoom, zoom)

        # Render page to pixmap
        pixmap = page.get_pixmap(matrix=matrix, alpha=False)

        # Convert to numpy array for OpenCV processing
        img_array = np.frombuffer(pixmap.samples, dtype=np.uint8)
        img_array = img_array.reshape(pixmap.height, pixmap.width, pixmap.n)

        images.append(img_array)

    doc.close()
    return images



Why 300 DPI minimum? Engineering documents contain instrument tags as small as 6pt font. At 72 DPI (default PDF rendering), characters become unrecognisable blobs. At 300 DPI, character edges are sharp enough for Tesseract to distinguish FIC-101A from FIC-101B.



Stage 2 — Image Preprocessing

Raw scanned images have noise, skew, low contrast, and uneven lighting. Preprocessing dramatically improves OCR accuracy — often by 15–25 percentage points on poor-quality scans.



# pipeline/preprocessor.py

import cv2
import numpy as np

def preprocess(img: np.ndarray, doc_type: str = "general") -> np.ndarray:
    """Full preprocessing pipeline for scanned document images."""

    img = _convert_to_grayscale(img)
    img = _deskew(img)
    img = _denoise(img)
    img = _binarize(img, doc_type)
    img = _remove_borders(img)

    return img


def _convert_to_grayscale(img: np.ndarray) -> np.ndarray:
    if len(img.shape) == 3:
        return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return img


def _deskew(img: np.ndarray) -> np.ndarray:
    """Correct document rotation using Hough line detection."""
    edges = cv2.Canny(img, 50, 150, apertureSize=3)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=200)

    if lines is None:
        return img

    angles = []
    for rho, theta in lines[:, 0]:
        angle = (theta - np.pi / 2) * 180 / np.pi
        if abs(angle) < 10:  # Only correct small skews
            angles.append(angle)

    if not angles:
        return img

    median_angle = np.median(angles)
    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)


def _denoise(img: np.ndarray) -> np.ndarray:
    """Remove scan noise while preserving text edges."""
    return cv2.fastNlMeansDenoising(img, h=10, templateWindowSize=7,
                                     searchWindowSize=21)


def _binarize(img: np.ndarray, doc_type: str) -> np.ndarray:
    """
    Convert to clean black-and-white.
    Engineering docs use adaptive thresholding for uneven lighting.
    """
    if doc_type == "engineering":
        # Adaptive threshold handles shadows and uneven scan quality
        return cv2.adaptiveThreshold(
            img, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blockSize=11,
            C=2
        )
    else:
        # Otsu's method for standard documents with even lighting
        _, binary = cv2.threshold(img, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return binary


def _remove_borders(img: np.ndarray) -> np.ndarray:
    """Remove black border artifacts common in scanned documents."""
    contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL,
                                    cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return img

    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)

    # Only crop if meaningful content region found
    margin = 10
    if w > img.shape[1] * 0.5 and h > img.shape[0] * 0.5:
        return img[max(0, y-margin):y+h+margin, max(0, x-margin):x+w+margin]

    return img



Stage 3 — OCR Engine

Two options depending on your deployment:



Option A — Tesseract (on-premise, free)



# pipeline/ocr.py — Tesseract implementation

import pytesseract
import numpy as np
from dataclasses import dataclass

@dataclass
class OCRWord:
    text: str
    confidence: float
    bbox: tuple  # (x, y, w, h)
    page: int

def run_tesseract(img: np.ndarray, page_num: int = 0) -> list[OCRWord]:
    """
    Run Tesseract OCR and return word-level results with confidence + bounding boxes.
    PSM 6 = single uniform block of text (best for engineering documents).
    PSM 11 = sparse text, use for P&IDs with scattered labels.
    """
    config = "--psm 6 --oem 3 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_./:() "

    data = pytesseract.image_to_data(
        img,
        config=config,
        output_type=pytesseract.Output.DICT
    )

    words = []
    for i in range(len(data['text'])):
        text = data['text'][i].strip()
        conf = int(data['conf'][i])

        if text and conf > 30:  # Filter noise
            words.append(OCRWord(
                text=text,
                confidence=conf / 100,
                bbox=(data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i]),
                page=page_num
            ))

    return words




Option B — AWS Textract (cloud, production-grade)



# pipeline/ocr.py — AWS Textract implementation

import boto3
from dataclasses import dataclass

textract = boto3.client('textract', region_name='us-east-1')

def run_textract(pdf_bytes: bytes) -> dict:
    """
    Run AWS Textract for production-grade OCR with table detection.
    Returns full Textract response including TABLES and FORMS.
    """
    response = textract.analyze_document(
        Document={'Bytes': pdf_bytes},
        FeatureTypes=['TABLES', 'FORMS']
    )
    return response


def parse_textract_words(response: dict, page_num: int = 0) -> list[OCRWord]:
    """Extract word-level blocks from Textract response."""
    words = []
    for block in response['Blocks']:
        if block['BlockType'] == 'WORD':
            geo = block['Geometry']['BoundingBox']
            words.append(OCRWord(
                text=block['Text'],
                confidence=block['Confidence'] / 100,
                bbox=(geo['Left'], geo['Top'], geo['Width'], geo['Height']),
                page=page_num
            ))
    return words


Which to use: Tesseract for on-premise / budget-sensitive deployments. AWS Textract for production at scale, especially when table extraction is required.



Stage 4 — Layout Analysis & Region Detection

Before extracting fields, identify which region of the document contains what type of content. Engineering documents have distinct zones: title block, main body, notes, revision table.




# pipeline/extractor.py

import re
from dataclasses import dataclass, field

@dataclass
class DocumentRegion:
    region_type: str  # "title_block", "main_body", "notes", "table"
    bbox: tuple
    words: list

def detect_regions(words: list, img_height: int, img_width: int) -> list[DocumentRegion]:
    """
    Heuristic region detection for engineering documents.
    Title block is typically bottom-right on P&IDs, top on datasheets.
    """
    regions = []

    # Title block: bottom 20% of document, right 30%
    title_block_words = [
        w for w in words
        if w.bbox[1] > img_height * 0.80
        and w.bbox[0] > img_width * 0.70
    ]
    if title_block_words:
        regions.append(DocumentRegion(
            region_type="title_block",
            bbox=(int(img_width * 0.70), int(img_height * 0.80),
                  img_width, img_height),
            words=title_block_words
        ))

    # Main body: everything else excluding title block and notes
    main_body_words = [
        w for w in words
        if w not in title_block_words
        and w.bbox[1] < img_height * 0.80
    ]
    if main_body_words:
        regions.append(DocumentRegion(
            region_type="main_body",
            bbox=(0, 0, img_width, int(img_height * 0.80)),
            words=main_body_words
        ))

    return regions




Stage 5 — Field Extraction & Table Parsing

This is where the structured data comes out. Two sub-problems: key-value field extraction and table extraction.



Key-Value Field Extraction



# pipeline/extractor.py — continued

# Define field patterns for engineering documents
ENGINEERING_FIELD_PATTERNS = {
    "equipment_tag": r'\b[A-Z]{1,3}-\d{3}[A-Z]?\b',
    "line_number": r'\b\d{1,2}"-[A-Z]{1,3}-\d{4}-[A-Z0-9]{2,4}\b',
    "instrument_tag": r'\b[A-Z]{2,4}-\d{3}[A-Z]?\b',
    "pressure_value": r'\b\d+\.?\d*\s*(PSI|BAR|kPa|MPa)\b',
    "temperature_value": r'\b\d+\.?\d*\s*(°F|°C|F|C)\b',
    "flow_rate": r'\b\d+\.?\d*\s*(GPM|m3\/hr|MMSCFD|bpd)\b',
}

def extract_fields(words: list, patterns: dict = None) -> dict:
    """
    Extract structured fields from OCR word list using regex patterns.
    Returns dict of field_name -> {value, confidence, bbox}.
    """
    if patterns is None:
        patterns = ENGINEERING_FIELD_PATTERNS

    full_text = " ".join([w.text for w in words])
    extracted = {}

    for field_name, pattern in patterns.items():
        matches = re.findall(pattern, full_text, re.IGNORECASE)
        if matches:
            # Find the word(s) that produced this match
            match_value = matches[0]
            matching_words = [
                w for w in words
                if w.text in match_value or match_value in w.text
            ]

            avg_confidence = (
                sum(w.confidence for w in matching_words) / len(matching_words)
                if matching_words else 0.7
            )

            first_match = matching_words[0] if matching_words else None

            extracted[field_name] = {
                "value": match_value,
                "confidence": round(avg_confidence, 3),
                "bbox": first_match.bbox if first_match else None,
                "all_matches": matches  # Keep all instances found
            }

    return extracted




Table Extraction from Textract Response



def extract_tables_from_textract(response: dict) -> list[dict]:
    """
    Parse Textract TABLE blocks into clean list-of-dicts format.
    Handles merged cells and multi-row headers.
    """
    blocks = response['Blocks']
    block_map = {b['Id']: b for b in blocks}

    tables = []

    for block in blocks:
        if block['BlockType'] != 'TABLE':
            continue

        # Get all cells for this table
        cells = {}
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                for cell_id in rel['Ids']:
                    cell = block_map.get(cell_id)
                    if cell and cell['BlockType'] == 'CELL':
                        row = cell['RowIndex']
                        col = cell['ColumnIndex']
                        cells[(row, col)] = _get_cell_text(cell, block_map)

        if not cells:
            continue

        max_row = max(r for r, c in cells.keys())
        max_col = max(c for r, c in cells.keys())

        # First row = headers
        headers = [cells.get((1, c), f"col_{c}") for c in range(1, max_col + 1)]

        # Remaining rows = data
        rows = []
        for r in range(2, max_row + 1):
            row_data = {}
            for c, header in enumerate(headers, start=1):
                row_data[header] = cells.get((r, c), "")
            rows.append(row_data)

        tables.append({
            "headers": headers,
            "rows": rows,
            "row_count": len(rows),
            "col_count": max_col
        })

    return tables


def _get_cell_text(cell_block: dict, block_map: dict) -> str:
    """Get concatenated text from a Textract CELL block."""
    texts = []
    for rel in cell_block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for word_id in rel['Ids']:
                word_block = block_map.get(word_id)
                if word_block and word_block['BlockType'] == 'WORD':
                    texts.append(word_block['Text'])
    return " ".join(texts)



Stage 6 — JSON Assembly & Confidence Scoring

Assemble all extracted data into a clean, validated JSON output with document-level confidence scoring.




# pipeline/assembler.py

from datetime import datetime, timezone
import uuid

def assemble_output(
    document_id: str,
    document_type: str,
    fields: dict,
    tables: list,
    page_count: int,
    processing_time_ms: float
) -> dict:
    """
    Assemble all extracted data into a structured JSON document.
    Calculates overall document confidence from field-level scores.
    """

    # Calculate overall confidence
    field_confidences = [
        v['confidence']
        for v in fields.values()
        if isinstance(v, dict) and 'confidence' in v
    ]
    overall_confidence = (
        round(sum(field_confidences) / len(field_confidences), 3)
        if field_confidences else 0.0
    )

    # Determine extraction quality tier
    if overall_confidence >= 0.90:
        quality = "high"
        requires_review = False
    elif overall_confidence >= 0.70:
        quality = "medium"
        requires_review = True
    else:
        quality = "low"
        requires_review = True

    return {
        "document_id": document_id or str(uuid.uuid4()),
        "document_type": document_type,
        "extraction_metadata": {
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "page_count": page_count,
            "processing_time_ms": round(processing_time_ms, 2),
            "overall_confidence": overall_confidence,
            "quality_tier": quality,
            "requires_human_review": requires_review
        },
        "fields": fields,
        "tables": tables,
        "field_count": len(fields),
        "table_count": len(tables)
    }




Pydantic Models for Validation

Validate every output before it leaves the pipeline. This prevents malformed data from reaching downstream systems.




# models/schemas.py

from pydantic import BaseModel, Field
from typing import Optional, Any
from datetime import datetime

class ExtractedField(BaseModel):
    value: str
    confidence: float = Field(ge=0.0, le=1.0)
    bbox: Optional[tuple] = None
    all_matches: Optional[list[str]] = None

class TableData(BaseModel):
    headers: list[str]
    rows: list[dict[str, Any]]
    row_count: int
    col_count: int

class ExtractionMetadata(BaseModel):
    extracted_at: str
    page_count: int
    processing_time_ms: float
    overall_confidence: float = Field(ge=0.0, le=1.0)
    quality_tier: str  # "high", "medium", "low"
    requires_human_review: bool

class ExtractionResult(BaseModel):
    document_id: str
    document_type: str
    extraction_metadata: ExtractionMetadata
    fields: dict[str, ExtractedField]
    tables: list[TableData]
    field_count: int
    table_count: int


FastAPI Production Endpoint

Tie all six stages together into a single API endpoint:




# main.py

from fastapi import FastAPI, UploadFile, File, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
import time
import logging

from pipeline.converter import pdf_to_images
from pipeline.preprocessor import preprocess
from pipeline.ocr import run_tesseract, run_textract, parse_textract_words
from pipeline.extractor import extract_fields, extract_tables_from_textract
from pipeline.assembler import assemble_output
from models.schemas import ExtractionResult
from config import settings

app = FastAPI(
    title="Document Extraction API",
    description="Scanned PDF to Structured JSON pipeline",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"]
)

logger = logging.getLogger(__name__)


@app.post("/extract", response_model=ExtractionResult)
async def extract_document(
    file: UploadFile = File(...),
    document_type: str = "engineering",
    use_textract: bool = False,
    document_id: str = None
):
    """
    Extract structured JSON from a scanned PDF.

    - **file**: Scanned PDF file
    - **document_type**: "engineering", "invoice", "general"
    - **use_textract**: Use AWS Textract (True) or Tesseract (False)
    - **document_id**: Optional document identifier
    """
    if not file.filename.endswith('.pdf'):
        raise HTTPException(status_code=400, detail="Only PDF files accepted")

    if file.size > 50 * 1024 * 1024:  # 50MB limit
        raise HTTPException(status_code=413, detail="File too large (max 50MB)")

    start_time = time.time()

    try:
        pdf_bytes = await file.read()

        # Stage 1: Convert PDF to images
        logger.info(f"Converting PDF: {file.filename}")
        images = pdf_to_images(pdf_bytes, dpi=300)

        all_fields = {}
        all_tables = []

        if use_textract:
            # AWS Textract path — single call handles all pages
            response = run_textract(pdf_bytes)
            words = parse_textract_words(response)
            all_fields = extract_fields(words)
            all_tables = extract_tables_from_textract(response)

        else:
            # Tesseract path — process page by page
            for page_num, img in enumerate(images):
                # Stage 2: Preprocess
                processed = preprocess(img, doc_type=document_type)

                # Stage 3: OCR
                words = run_tesseract(processed, page_num=page_num)

                # Stage 5: Extract fields per page
                page_fields = extract_fields(words)
                all_fields.update(page_fields)

        # Stage 6: Assemble output
        processing_ms = (time.time() - start_time) * 1000
        result = assemble_output(
            document_id=document_id,
            document_type=document_type,
            fields=all_fields,
            tables=all_tables,
            page_count=len(images),
            processing_time_ms=processing_ms
        )

        logger.info(
            f"Extraction complete: {len(all_fields)} fields, "
            f"{len(all_tables)} tables, "
            f"confidence={result['extraction_metadata']['overall_confidence']}, "
            f"time={processing_ms:.0f}ms"
        )

        return result

    except Exception as e:
        logger.error(f"Extraction failed: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Extraction failed: {str(e)}")


@app.get("/health")
def health():
    return {"status": "healthy", "version": "1.0.0"}




Run the API:


uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Test with curl:


curl -X POST "http://localhost:8000/extract" \
  -H "accept: application/json" \
  -F "file=@engineering_doc.pdf" \
  -F "document_type=engineering" \
  -F "use_textract=false"


Sample Output

A real response from the pipeline on an engineering equipment datasheet:




{
  "document_id": "3f7a9b2c-1d4e-4f8a-b2c3-9d7e1f3a5c6b",
  "document_type": "engineering",
  "extraction_metadata": {
    "extracted_at": "2025-05-17T10:30:00Z",
    "page_count": 2,
    "processing_time_ms": 1842.5,
    "overall_confidence": 0.913,
    "quality_tier": "high",
    "requires_human_review": false
  },
  "fields": {
    "equipment_tag": {
      "value": "P-101A",
      "confidence": 0.97,
      "bbox": [120, 340, 60, 20],
      "all_matches": ["P-101A", "P-101B"]
    },
    "line_number": {
      "value": "6\"-P-1042-A1A",
      "confidence": 0.91,
      "bbox": [200, 580, 140, 18],
      "all_matches": ["6\"-P-1042-A1A"]
    },
    "pressure_value": {
      "value": "150 PSI",
      "confidence": 0.94,
      "bbox": [400, 420, 80, 18],
      "all_matches": ["150 PSI", "75 PSI"]
    },
    "temperature_value": {
      "value": "250°F",
      "confidence": 0.92,
      "bbox": [500, 420, 60, 18],
      "all_matches": ["250°F"]
    }
  },
  "tables": [
    {
      "headers": ["Nozzle", "Size", "Rating", "Service"],
      "rows": [
        {"Nozzle": "N1", "Size": "6\"", "Rating": "150#", "Service": "Suction"},
        {"Nozzle": "N2", "Size": "4\"", "Rating": "150#", "Service": "Discharge"},
        {"Nozzle": "N3", "Size": "2\"", "Rating": "150#", "Service": "Drain"}
      ],
      "row_count": 3,
      "col_count": 4
    }
  ],
  "field_count": 4,
  "table_count": 1
}




Production Tips


1. Dockerise the pipeline



FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]



2. Add async queue for large files

For PDFs larger than 5 pages, use Celery + Redis to process asynchronously:



from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0")

@celery_app.task
def process_pdf_async(pdf_bytes: bytes, document_type: str) -> dict:
    # Full pipeline runs in background
    ...

3. Cache preprocessed images

Preprocessing is expensive. Cache results by document hash:


import hashlib

def get_doc_hash(pdf_bytes: bytes) -> str:
    return hashlib.sha256(pdf_bytes).hexdigest()


4. Confidence-based routing

Route low-confidence extractions to human review automatically:


if result['extraction_metadata']['overall_confidence'] < 0.75:
    send_to_review_queue(result)
else:
    send_to_downstream_system(result)



Accuracy Benchmarks

Based on our production deployments across engineering document types:

Document Type

Tesseract

AWS Textract

Azure Doc Intel

Clean digital PDF

88%

96%

95%

300 DPI scanned

82%

93%

93%

150 DPI legacy scan

68%

84%

85%

Engineering datasheet

79%

91%

92%

P&ID title block

74%

88%

90%


Tesseract is sufficient for prototyping and on-premise deployments where cloud APIs are restricted. For production accuracy requirements above 90%, use Textract or Azure.




Live Demo

This exact pipeline — preprocessing + OCR + field extraction + table parsing + structured JSON output — runs live at:



Upload any scanned engineering PDF and get structured JSON back in seconds, with per-field confidence scores and bounding box coordinates.




What This Pipeline Doesn't Cover

This pipeline handles text and tables from scanned PDFs. For engineering documents it does not:


  • Detect P&ID symbols (valves, instruments, equipment) — that requires a custom YOLOv8 computer vision model

  • Understand line connections in P&IDs — requires graph extraction on top of object detection

  • Handle handwritten annotations reliably — needs a separate handwriting recognition model


If you're building a complete P&ID digitisation system, the pipeline above is the OCR + table layer. The symbol detection layer sits on top of it. We cover that in: P&ID Symbol Detection with YOLOv8 and PyTorch →




Build It With Codersarts

We've deployed this pipeline for 10+ engineering clients — from a standalone FastAPI service to a fully integrated document intelligence platform with active learning and human review workflows.




Tags: scanned PDF to JSON python, PDF data extraction pipeline, OCR pipeline FastAPI, PyMuPDF OCR python, AWS Textract python pipeline, Tesseract python production, structured data extraction PDF, document intelligence pipeline, engineering document extraction python

Comments


bottom of page