OCR Knowledge Graph
Python Neo4j Tesseract FastAPI
The problem
Enterprise documents live in scanned PDFs with no queryable structure. Teams spend hours manually extracting relationships that should be answerable in seconds.
Architecture
PDF → Tesseract OCR → spaCy NER → relationship extractor → Neo4j
                                                             ↑
                                                    FastAPI query layer
Overview
This project builds a pipeline that takes scanned PDFs, runs OCR via Tesseract, extracts named entities with spaCy and the relationships between them, and writes the resulting knowledge graph into Neo4j. A FastAPI layer exposes the graph for queries.
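The stages above can be sketched as one function with the OCR and NER steps passed in as callables. This is a minimal illustration, not the project's actual code: `extract_edges`, the `MENTIONS` and `RELATED_TO` relation names, and the same-page co-occurrence rule are all assumptions, since the write-up does not describe how the relationship extractor works. In a real run, `ocr` would probably wrap `pytesseract.image_to_string` and `ner` would return the texts of spaCy's `doc.ents`.

```python
from itertools import combinations
from typing import Callable, Iterable

# (source, relation, target); relation names are hypothetical
Edge = tuple[str, str, str]


def extract_edges(
    doc_id: str,
    pages: Iterable[object],
    ocr: Callable[[object], str],
    ner: Callable[[str], list[str]],
) -> set[Edge]:
    """Turn the pages of one document into graph edges.

    Assumption: two entities are treated as related when they appear
    on the same page. The real extractor may use a different rule.
    """
    edges: set[Edge] = set()
    for page in pages:
        text = ocr(page)                   # page image -> raw text
        entities = sorted(set(ner(text)))  # raw text -> unique entity names
        for name in entities:
            edges.add((doc_id, "MENTIONS", name))
        for left, right in combinations(entities, 2):
            edges.add((left, "RELATED_TO", right))
    return edges


if __name__ == "__main__":
    # Stand-ins so the sketch runs without Tesseract or spaCy installed.
    fake_ocr = lambda page: page
    fake_ner = lambda text: [w for w in text.split() if w.istitle()]
    for edge in sorted(extract_edges("doc-1", ["Acme hired Bob"], fake_ocr, fake_ner)):
        print(edge)
```

Injecting the two stages keeps the graph-building logic testable on plain strings, and each resulting edge maps onto a single `MERGE` statement when written to Neo4j.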
Key decisions
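Why Neo4j over a relational DB? Entity relationships are naturally sparse, and multi-hop queries (find all documents mentioning entities related to X within 2 hops) map directly onto graph traversals. In Neo4j each hop follows stored relationship pointers, so the cost grows with the size of the neighborhood visited rather than with the total number of entities. In SQL the same query needs one self-join on an edge table per hop, which gets harder to write and to optimize as the hop count grows.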
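As a rough illustration of the 2-hop query mentioned above, here is how it might look in Cypher when issued through the official Neo4j Python driver. The `Entity` and `Document` labels, the `RELATED_TO` and `MENTIONS` relationship types, and the `name` and `title` properties are assumed for the example; the project's real schema is not shown in this write-up.

```python
def related_documents_query(max_hops: int = 2) -> str:
    """Build a Cypher query for documents mentioning entities near a start entity.

    Cypher does not accept a parameter for the bounds of a variable-length
    pattern, so the hop count is validated and interpolated into the text.
    """
    if not isinstance(max_hops, int) or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an integer between 1 and 5")
    return (
        "MATCH (start:Entity {name: $name})"
        f"-[:RELATED_TO*1..{max_hops}]-(other:Entity) "
        "MATCH (doc:Document)-[:MENTIONS]->(other) "
        "RETURN DISTINCT doc.title AS title"
    )


def find_related_documents(driver, name: str, max_hops: int = 2) -> list[str]:
    """Run the query; `driver` is expected to be a neo4j.Driver instance."""
    query = related_documents_query(max_hops)
    with driver.session() as session:
        result = session.run(query, name=name)
        return [record["title"] for record in result]
```

The entity name travels as a query parameter, so only the validated hop count is ever formatted into the query string.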
Lessons learned
- OCR confidence scores must gate graph ingestion: garbage in, garbage out.
- Neo4j's APOC library cut custom graph traversal code by roughly 60%.
- FastAPI async endpoints became essential once document volume crossed 10k pages.
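The confidence gate from the first lesson could look something like the sketch below. It assumes the OCR step returns the dictionary that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` produces, with parallel `text` and `conf` lists. The function names and both thresholds (60 and 0.8) are illustrative choices, not values taken from the project.

```python
def confident_words(data: dict, min_conf: float = 60.0) -> list[str]:
    """Keep only the words Tesseract recognized with enough confidence."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Layout-only boxes have empty text and a confidence of -1.
        # Depending on the pytesseract version, conf may be a str or a number.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def page_passes_gate(
    data: dict, min_conf: float = 60.0, min_ratio: float = 0.8
) -> bool:
    """Admit a page only if enough of its words clear the confidence bar."""
    recognized = [text for text in data["text"] if text.strip()]
    if not recognized:
        return False  # blank or unreadable page: nothing worth ingesting
    return len(confident_words(data, min_conf)) / len(recognized) >= min_ratio


if __name__ == "__main__":
    sample = {"text": ["Acme", "", "C0rp", "Ltd"], "conf": ["96", "-1", "31", 88.5]}
    print(confident_words(sample))   # words that clear the default threshold
    print(page_passes_gate(sample))  # whether the page as a whole is admitted
```

Gating at the page level, rather than only dropping individual words, keeps badly scanned pages from contributing partial and misleading entities to the graph.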