OCR Knowledge Graph

Overview

This project builds a pipeline that takes scanned PDFs, runs OCR via Tesseract, extracts named entities with spaCy, derives relationships between them, and writes the resulting knowledge graph into Neo4j. A FastAPI layer serves queries over the graph.

Architecture

PDF → Tesseract OCR → spaCy NER → relationship extractor → Neo4j
                                                          ↑
                                              FastAPI query layer
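
The diagram does not say how the relationship extractor decides that two entities are related. As a minimal sketch, assuming plain sentence-level co-occurrence (two entities are linked when they appear in the same sentence), the stage between NER and Neo4j could look like the following. The `Entity` class and `cooccurrence_edges` function are hypothetical names used for illustration, not part of the project.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for one NER hit."""
    text: str   # surface form, e.g. "Acme Corp"
    label: str  # NER label, e.g. "ORG"


def cooccurrence_edges(
    sentences: Iterable[list[Entity]],
) -> dict[tuple[Entity, Entity], int]:
    """Count how often each unordered entity pair shares a sentence.

    Assumption: co-occurrence is the relationship signal. A real
    extractor might use dependency parses or patterns instead.
    """
    counts: dict[tuple[Entity, Entity], int] = {}
    for entities in sentences:
        # Deduplicate within the sentence and sort so that each pair
        # always appears in the same order, whatever the input order.
        unique = sorted(set(entities), key=lambda e: (e.text, e.label))
        for a, b in combinations(unique, 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


if __name__ == "__main__":
    acme = Entity("Acme Corp", "ORG")
    jane = Entity("Jane Doe", "PERSON")
    for pair, weight in cooccurrence_edges([[acme, jane], [jane, acme]]).items():
        print(pair[0].text, "<->", pair[1].text, "weight", weight)
```

Each resulting pair would become one edge in the graph, with the count available as an edge weight. Sorting the pair keeps the edge undirected, so the same two entities never produce two separate edges.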

Key decisions

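Why Neo4j over a relational DB? Entity relationships are naturally sparse, and multi-hop queries (for example, find all documents mentioning entities related to X within 2 hops) map directly onto graph traversals. A native graph store follows stored adjacency links, so the cost of a traversal depends mainly on the size of the neighborhood visited rather than on the total number of entities. In SQL, each hop is another self-join on the relationship table, which becomes harder to write and to optimize as the hop count grows.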
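
To illustrate the traversal argument, here is a small in-memory sketch of a bounded-hop lookup. It is not how Neo4j is implemented; it only shows that a breadth-first walk over an adjacency map touches the nodes near the start and nothing else. The function name `within_hops` and the sample graph are assumptions made for the example.

```python
from collections import deque


def within_hops(
    adjacency: dict[str, set[str]], start: str, max_hops: int
) -> set[str]:
    """Return every node reachable from `start` in at most `max_hops` edges.

    The start node itself is excluded from the result.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop limit reached, do not expand this node
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(start)
    return seen


if __name__ == "__main__":
    # Hypothetical undirected graph: X - A - B - C
    graph = {
        "X": {"A"},
        "A": {"X", "B"},
        "B": {"A", "C"},
        "C": {"B"},
    }
    print(sorted(within_hops(graph, "X", 2)))
```

Only nodes within the hop limit are ever visited, so adding unrelated entities elsewhere in the graph does not change the work done. In Cypher the same bound is usually written as a variable-length pattern such as `-[*1..2]-`.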
