Adarsh V.

OCR Knowledge Graph

Python Neo4j Tesseract FastAPI

The problem

Enterprise documents live in scanned PDFs with no queryable structure. Teams spend hours manually extracting relationships that should be answerable in seconds.

Architecture

PDF → Tesseract OCR → spaCy NER → relationship extractor → Neo4j
                                                          ↑
                                              FastAPI query layer

Overview

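This project builds a pipeline that takes scanned PDFs, runs OCR via Tesseract, extracts named entities with spaCy, derives relationships between those entities, and writes the resulting knowledge graph into Neo4j. A FastAPI layer sits on top of the graph for queries.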
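
The page does not say how the relationship extractor works, so the following is only a stand-in sketch. It assumes the NER step yields one list of entity names per sentence and treats two entities as related when they share a sentence. The function name `cooccurrence_edges` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count how often two entities appear in the same sentence.

    `sentences` is a list of entity-name lists, one list per sentence
    (assumed to come from an NER pass). Returns a Counter keyed by
    alphabetically sorted (entity_a, entity_b) pairs.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate so a repeated mention counts once per sentence.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Made-up sample input: entities found in three sentences.
sentences = [
    ["Acme Corp", "Jane Doe"],
    ["Acme Corp", "Jane Doe", "Berlin"],
    ["Berlin"],
]
edges = cooccurrence_edges(sentences)
```

Each key in `edges` would become a candidate relationship in the graph, with the count available as an edge weight. A real extractor would likely use dependency parses or patterns rather than raw co-occurrence.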

Key decisions

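Why Neo4j over a relational DB? The entity relationships are sparse, and the core queries are multi-hop (for example, find all documents mentioning entities related to X within 2 hops). In a graph database each hop follows stored relationships directly, so the cost of a traversal scales with the size of the neighborhood visited rather than with the total size of the dataset. The same query in SQL needs one self-join on the relationship table per hop, which becomes harder to write and to optimize as the hop count grows.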
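
As a rough illustration of the 2-hop query described above, the sketch below builds a Cypher statement using a variable-length relationship pattern. The schema is an assumption, not the project's actual model: the labels `Entity` and `Document`, the relationship types `RELATED_TO` and `MENTIONS`, and the properties `name` and `title` are all hypothetical. The hop bound is written into the pattern text, so it is validated as a positive integer before being interpolated, while the entity name stays a query parameter.

```python
def related_documents_query(max_hops=2):
    """Return a Cypher query for documents mentioning entities near $name.

    Labels, relationship types, and property names are hypothetical.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be a positive integer")
    return (
        "MATCH (x:Entity {name: $name})"
        f"-[:RELATED_TO*1..{max_hops}]-(e:Entity) "
        "MATCH (d:Document)-[:MENTIONS]->(e) "
        "RETURN DISTINCT d.title AS title"
    )


query = related_documents_query(2)
params = {"name": "Acme Corp"}  # made-up entity name
```

The resulting `query` and `params` could be passed to a Neo4j driver session. Changing the hop count only changes the bound in the pattern, whereas the SQL equivalent would need another self-join.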

Lessons learned

  • OCR confidence scores must gate graph ingestion: garbage in, garbage out.
  • Neo4j's APOC library cuts custom graph traversal code by ~60%.
  • FastAPI async endpoints were essential once document volume crossed 10k pages.
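
The confidence gate from the first lesson can be sketched as a small filter. This is a minimal example under stated assumptions: the input is a dictionary with parallel `text` and `conf` lists, modeled on the dictionary output of pytesseract's `image_to_data`, and the threshold of 60 is an arbitrary placeholder rather than the value the project uses. The name `gate_by_confidence` is hypothetical.

```python
def gate_by_confidence(ocr, min_conf=60.0):
    """Keep only non-empty words whose OCR confidence meets the threshold.

    `ocr` is assumed to hold parallel "text" and "conf" lists. Confidence
    values may arrive as ints, floats, or numeric strings, and -1 marks
    boxes that contain no recognized word.
    """
    kept = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        word = text.strip()
        if word and float(conf) >= min_conf:
            kept.append(word)
    return kept


# Made-up OCR output: one low-confidence word and one empty layout box.
sample = {
    "text": ["Acme", "C0rp", "", "Berlin"],
    "conf": [96, 41, -1, "88.5"],
}
accepted = gate_by_confidence(sample)
```

Only the words in `accepted` would be passed on to entity extraction, so a misread token such as "C0rp" never becomes a node in the graph.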