Contract Extractor

AI-powered PDF contract extraction with a human-in-the-loop review workflow. Operators upload hotel contract PDFs, the system extracts structured data via a Gemini-based extraction worker, and a form-based UI lets reviewers edit the extracted data before it enters the booking pipeline.

Stack

Backend
FastAPI (Python) + MongoDB + GridFS for PDF storage. JWT auth with bcrypt-hashed passwords. Slack notifications for alerts and daily summaries.
AI
extraction-worker v0.8.5 (Gemini-based, distributed via Bitbucket). Schema-first against the ContractExtraction Pydantic model. Self-consistency monkey-patch for the upstream consensus bug.
Frontend
React + TypeScript + Vite. Inline rate-table editing, audit clauses, inclusive extras, taxes & fees, signature capture, and a review queue with flag / dismiss / delete controls.

Review queue flow

Upload (PDF)
Drag-and-drop or API upload
Extract (Gemini worker)
PDF → JSON via schema-targeted extraction
Store (MongoDB)
Dedup by content_hash
Review (operator)
Edit form, flag, dismiss, or delete
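
The Store step deduplicates by content_hash. The following is a minimal sketch of that idea, not the project's actual code: it assumes the hash is a SHA-256 digest of the raw PDF bytes, and it uses a hypothetical in-memory class (InMemoryContractStore) in place of the MongoDB collection.

```python
import hashlib
from typing import Dict, Tuple


def content_hash(pdf_bytes: bytes) -> str:
    """Return a hex SHA-256 digest of the raw PDF bytes (assumed hash choice)."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class InMemoryContractStore:
    """Hypothetical stand-in for the MongoDB collection, keyed by content_hash."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, dict] = {}

    def store_once(self, pdf_bytes: bytes, extraction: dict) -> Tuple[str, bool]:
        """Store the extraction unless identical bytes were stored before.

        Returns (content_hash, created); created is False for a duplicate.
        """
        digest = content_hash(pdf_bytes)
        if digest in self._by_hash:
            # Same bytes already stored: keep the existing record untouched.
            return digest, False
        self._by_hash[digest] = {"content_hash": digest, "extraction": extraction}
        return digest, True
```

With a real MongoDB collection, a unique index on the content_hash field is one common way to enforce the same guarantee at the database level.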

Key modules

db/mongo.py
CRUD, GridFS PDF storage, deduplication, review-queue queries.
routers/contract_form.py
Upload, get, update, progress-save, review queue, PDF serve endpoints.
routers/auth.py
Login (POST /api/token), JWT issuance, user CRUD (admin-only).
auth/handler.py
JWT creation, password hashing, bootstrap user.
service/self_consistency_patch.py
Monkey-patch for the extraction-worker consensus bug.
utils/contract_utility.py
Date / currency / room-type normalization helpers.
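
As an illustration of the normalization helpers in utils/contract_utility.py, here is a hedged sketch using only the standard library. The function names, accepted date formats, and currency table are all assumptions made for the example, not the module's real contents. Both helpers return None for anything they cannot resolve unambiguously, in line with the conservative-extraction principle.

```python
from datetime import date, datetime
from typing import Optional

# Assumed input formats (English month names). Ambiguous numeric forms
# such as 03/04/2025 are deliberately not listed.
_DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d %b %Y", "%B %d, %Y")

# Assumed lookup tables; "$" is omitted because it maps to several currencies.
_CURRENCY_SYMBOLS = {"€": "EUR", "£": "GBP", "₹": "INR"}
_KNOWN_CODES = {"EUR", "GBP", "INR", "USD", "AED"}


def normalize_date(raw: Optional[str]) -> Optional[date]:
    """Parse a date string in one of the assumed formats, else return None."""
    if not raw:
        return None
    text = raw.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    return None  # unknown or ambiguous: do not guess


def normalize_currency(raw: Optional[str]) -> Optional[str]:
    """Map a symbol or code to an ISO 4217 code from the tables, else None."""
    if not raw:
        return None
    text = raw.strip()
    if text in _CURRENCY_SYMBOLS:
        return _CURRENCY_SYMBOLS[text]
    code = text.upper()
    return code if code in _KNOWN_CODES else None
```

For example, normalize_date("03/04/2025") returns None because the day/month order cannot be determined from the string alone.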

Design principles

01
Conservative extraction
The system prefers null / 0 over guessing. When the model is uncertain about a field, it leaves that field empty rather than risk silent data corruption downstream.
02
Schema-first
The ContractExtraction Pydantic schema is the canonical shape. Every extraction run targets this schema, so the output structure does not vary between runs.
03
Service-clean boundary
Extraction logic (AI, self-consistency) is isolated from the API / router layer. Routers only orchestrate; they never call the model directly.
04
Alert dedup & cooldown
Slack notifications are deduplicated and held back by a cooldown, so repeated extractions of the same content do not cause alert fatigue.
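
One way to picture the alert dedup and cooldown is a small gate that remembers when each alert key was last sent. This is a sketch under assumptions: the class name AlertCooldown, the use of a per-key timestamp, and keying alerts by something like content_hash are illustrative choices, and the actual Slack call is left out.

```python
import time
from typing import Callable, Dict


class AlertCooldown:
    """Hypothetical gate: allow one alert per key within each cooldown window."""

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        """Return True if an alert for this key may be sent now."""
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self._cooldown:
            # Still inside the window: suppress, and do not extend the window.
            return False
        self._last_sent[alert_key] = now
        return True
```

A caller would check should_send(key) before posting to Slack. Suppressed alerts do not reset the timer here, so a steady stream of repeats still produces one alert per window rather than none.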