(03) — DOCUMENT INTELLIGENCE / FINANCIAL STATEMENT EXTRACTION

Ledger

A standalone document intelligence API that turns unstructured bank statements into structured financial data with 14 bank templates and a 4-tier extraction cascade.

Python · FastAPI · Gemini AI · Terraform · AWS Lambda · React
In production · 2025
01

Why this exists

Ledger started as a subdirectory inside YieldStream. The underwriting platform needed structured financial data from bank statements before it could score merchants against lenders. I built the extraction pipeline inline, tightly coupled to the monolith.

Two things forced the extraction. First, bank statement parsing is not specific to MCA underwriting. Any fintech product that touches bank data — lending, accounting, bookkeeping, fraud detection — needs the same capability. Second, parsing is bursty and CPU-heavy (OCR, image processing, PDF manipulation), while the rest of YieldStream is steady and database-heavy. Coupling them meant one workload's spike could starve the other.

I extracted it into a standalone FastAPI service with its own test suite, its own Docker image, and its own Terraform-managed infrastructure on AWS Lambda. Ledger is now a product, not a feature.

02

The architecture

The core design principle: try the cheapest, fastest extraction first and escalate only when confidence is too low. Every document passes through a quality gate, then enters a 4-tier extraction cascade. Each tier is independently timed and logged. If a tier extracts fewer than 100 characters, the orchestrator escalates to the next.

Extraction cascade with auto-escalation
PDF input (any format) → quality gate (blur / DPI / skew)
→ T1: pdfplumber (~200ms · 90% of text PDFs) — escalates on <100 chars
→ T2: PyMuPDF (~300ms · encrypted / corrupted) — escalates on <100 chars
→ T3: Tesseract OCR (5-10s · scanned images) — escalates on <100 chars
→ T4: LlamaParse (~30s · cloud fallback)
→ Structured output: text + tables, tier logs, confidence scores, template match

Tier 1: pdfplumber. Handles ~90% of text-based PDFs in under 200ms. Extracts text with layout awareness and pulls structured tables — critical for bank statements where transaction data lives in columns, not paragraphs.

Tier 2: PyMuPDF. Catches encrypted and corrupted PDFs that pdfplumber can't open. Similar speed, different PDF parsing engine.

Tier 3: Tesseract OCR. For scanned documents and phone photos. 5-10 seconds per page. The quality gate pre-screens for blur, skew, and contrast to avoid wasting OCR compute on unsalvageable inputs.

Tier 4: LlamaParse. Cloud fallback with 30,000 free pages per month. Only reached when local extraction fails entirely. Returns markdown output that the downstream parser can still consume.
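The orchestrator iterates over a `TIERS` table of extractor callables, each returning `(text, page_count)`. A minimal sketch of that structure, with stub functions standing in for the real pdfplumber / PyMuPDF / Tesseract / LlamaParse wrappers (the stubs and `first_sufficient` helper are illustrative, not the production code):

```python
from pathlib import Path

# Stub extractors; the real tiers wrap pdfplumber, PyMuPDF, Tesseract,
# and LlamaParse. Each returns (extracted_text, page_count).
def _pdfplumber_stub(path: Path):
    return "", 0          # simulates a PDF with no usable text layer

def _pymupdf_stub(path: Path):
    return "x" * 150, 3   # simulates a successful parse

# (callable, tier_name, tier_order), cheapest and fastest first.
TIERS = [
    (_pdfplumber_stub, "pdfplumber", 1),
    (_pymupdf_stub, "pymupdf", 2),
]

def first_sufficient(path: Path, threshold: int = 100):
    """Walk the cascade; escalate while extracted text is under threshold."""
    for tier_fn, tier_name, tier_order in TIERS:
        text, pages = tier_fn(path)
        if len(text) >= threshold:
            return tier_name, text
    return "llamaparse", ""  # cloud fallback when all local tiers fail
```

Because the first stub yields no text, the walk escalates past pdfplumber and settles on the PyMuPDF tier.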

orchestrator.py · Python
import time

def extract(file_path: str | Path) -> ExtractionResult:
    """Try extractors in order until one produces sufficient text.

    Cascade: pdfplumber -> PyMuPDF -> OCR -> LlamaParse
    Each tier attempt is recorded for fallback analytics.
    """
    threshold = settings.min_text_threshold
    result = ExtractionResult()

    for tier_fn, tier_name, tier_order in TIERS:
        attempt = TierAttempt(tier=tier_name, tier_order=tier_order)
        start = time.time()
        succeeded = False
        try:
            text, pages = tier_fn(file_path)
            attempt.text_char_count = len(text)
            if len(text) >= threshold:
                attempt.status = "success"
                result.text = text
                result.method = tier_name
                succeeded = True
        except Exception as e:
            attempt.failure_reason = str(e)
        # Record timing and append the attempt even for the winning tier,
        # so fallback analytics see every tier that ran.
        attempt.processing_time_ms = _elapsed_ms(start)
        result.tier_attempts.append(attempt)
        if succeeded:
            break

    return result

03

The template system

Raw text extraction is step one. Step two is understanding what that text means — and every bank formats statements differently. Chase puts deposits under "Deposits and Additions." BofA calls them "Credits." Wells Fargo uses a transaction table with running balances. PNC splits statements across multiple pages with no clear section headers.

I built an abstract BankTemplate base class with four methods: matches() (confidence score 0-1 that the document belongs to this bank), extract_summary() (account holder, period dates, balances), extract_transactions() (individual line items with categories), and compute_derived_metrics() (ADB, negative balance days, largest transactions).
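The four-method contract can be sketched as an abstract base class. The method names come from the description above; the `ChaseTemplate` markers and return shapes are illustrative assumptions, not the production implementation:

```python
from abc import ABC, abstractmethod

class BankTemplate(ABC):
    """Contract every bank parser implements."""

    @abstractmethod
    def matches(self, text: str) -> float:
        """Confidence (0-1) that this statement belongs to this bank."""

    @abstractmethod
    def extract_summary(self, text: str) -> dict:
        """Account holder, period dates, balances."""

    @abstractmethod
    def extract_transactions(self, text: str) -> list[dict]:
        """Individual line items with categories."""

    @abstractmethod
    def compute_derived_metrics(self, transactions: list[dict]) -> dict:
        """ADB, negative balance days, largest transactions."""

class ChaseTemplate(BankTemplate):
    # Illustrative markers; the real template keys on more signals.
    MARKERS = ("deposits and additions", "jpmorgan chase")

    def matches(self, text: str) -> float:
        hits = sum(m in text.lower() for m in self.MARKERS)
        return hits / len(self.MARKERS)

    def extract_summary(self, text): return {}
    def extract_transactions(self, text): return []
    def compute_derived_metrics(self, txns): return {}
```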

14 bank templates are registered today: Chase, Bank of America, Wells Fargo, TD, PNC, US Bank, Capital One, Regions, Truist, Citizens, Fifth Third, BMO Harris, Navy Federal, and a generic fallback. Each self-registers at import time via the template registry.

The registry runs all templates against incoming text and picks the highest confidence match above 0.5. Below that threshold, it falls back to the generic parser. Every match is recorded in a 30-day rolling history (thread-safe, in-memory) for monitoring template coverage and drift. The /templates/bank endpoint exposes this data so I can see which banks are hitting the generic fallback most often and prioritize new templates accordingly.

Transaction categorization is deterministic. Credits are classified as revenue by default, unless keyword patterns match transfer signals (Zelle, Venmo, "owner deposit") or MCA payment patterns (ACH debits matching known lender names like Yellowstone, Credibly, OnDeck). This distinction matters — underwriters need to separate operating revenue from transfers and existing debt payments.
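The deterministic rule set above can be sketched as keyword matching over the transaction description. The pattern lists here are abbreviated examples from the text, and the sign convention (credits positive, debits negative) is an assumption:

```python
import re

# Illustrative subsets; the production keyword lists are larger.
TRANSFER_PATTERNS = (r"\bzelle\b", r"\bvenmo\b", r"owner deposit")
MCA_LENDER_PATTERNS = (r"yellowstone", r"credibly", r"ondeck")

def categorize(description: str, amount: float) -> str:
    """Classify one transaction. Assumes credits > 0, debits < 0."""
    desc = description.lower()
    if amount > 0:
        if any(re.search(p, desc) for p in TRANSFER_PATTERNS):
            return "transfer"      # Zelle / Venmo / owner deposits, not revenue
        return "revenue"           # credits default to operating revenue
    if any(re.search(p, desc) for p in MCA_LENDER_PATTERNS):
        return "mca_payment"       # ACH debit matching a known lender
    return "expense"
```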

04

Enrichment and validation

Structured extraction gives you numbers. Enrichment gives you signal. After parsing, Ledger optionally passes the extracted data through Gemini 2.0 Flash to compute 25+ financial intelligence metrics:

  • Monthly revenue average, trend (growing / stable / declining), volatility, best and worst months
  • Average daily balance, lowest recorded balance, ending balance trend
  • NSF/overdraft counts across 30, 60, and 90-day windows
  • Active MCA positions detected from recurring ACH debits, stacking burden percentage, debt service coverage ratio
  • Lien flags (IRS, tax levy, garnishment keywords), transfer anomalies, and a generated underwriting summary

Gemini is rate-limited to 15 RPM via a token-bucket implementation. If enrichment is disabled or rate-limited, Ledger still returns the full structural extraction — AI is additive, never blocking.
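A token bucket for a requests-per-minute cap refills continuously and rejects (rather than blocks) when empty, which is what lets enrichment degrade gracefully. A minimal single-threaded sketch, not the production implementation:

```python
import time

class TokenBucket:
    """Non-blocking token bucket: try_acquire() returns False when empty."""

    def __init__(self, rate_per_min: float = 15, capacity: int = 15):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = rate_per_min / 60.0
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

When `try_acquire()` returns False, the caller skips enrichment and returns the structural extraction alone.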

Every extraction includes an arithmetic validation step: beginning balance + deposits - withdrawals should equal ending balance. The check uses Decimal precision with a $0.01 tolerance. A failed balance check is a strong signal that either the OCR misread a number or the template missed a transaction block.
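The check itself is a one-liner once amounts are parsed as `Decimal` (float arithmetic would defeat the one-cent tolerance). A sketch of the rule as stated:

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # one cent

def balance_reconciles(beginning: str, deposits: str,
                       withdrawals: str, ending: str) -> bool:
    """beginning + deposits - withdrawals should equal ending, within $0.01."""
    expected = Decimal(beginning) + Decimal(deposits) - Decimal(withdrawals)
    return abs(expected - Decimal(ending)) <= TOLERANCE
```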

Confidence scoring runs at every stage. Text quality (characters per page), table extraction success, and template match confidence are aggregated into an overall score. If it drops below 0.85, the extraction is flagged needs_human_review and queued for manual inspection via the review endpoint.
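Aggregation can be sketched as a weighted average of the per-stage scores against the 0.85 review threshold. The weights here are illustrative assumptions, not the production values:

```python
REVIEW_THRESHOLD = 0.85

def overall_confidence(text_quality: float, table_success: float,
                       template_confidence: float,
                       weights=(0.4, 0.2, 0.4)) -> tuple[float, bool]:
    """Combine stage scores (each 0-1) into one score and a review flag.

    Weights are assumed for illustration; they must sum to 1.
    """
    stages = (text_quality, table_success, template_confidence)
    score = sum(w * s for w, s in zip(weights, stages))
    return score, score < REVIEW_THRESHOLD  # (score, needs_human_review)
```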

05

Infrastructure

Ledger runs on AWS Lambda behind API Gateway v2, deployed as a Docker container image. The entire infrastructure is defined in Terraform — ECR repository, IAM roles, Lambda function, API Gateway, and CloudWatch log groups with 14-day retention. One terraform apply stands up or tears down the whole stack.

Lambda was a deliberate choice over a persistent server. Document parsing is bursty — high concurrency during business hours, near-zero at night. Lambda scales to zero when idle and handles burst concurrency without provisioning. The 512 MB memory allocation and 60-second timeout cover even the slowest OCR extractions.

The service is stateless by design. No database, no file system persistence. The review queue and template match history live in-memory for the Lambda execution context. For persistent storage, the calling service (YieldStream) consumes the ParseResponse and writes to its own database. This keeps Ledger operationally simple — there's nothing to back up, nothing to migrate.

A React frontend (Vite + TypeScript) provides an interactive testing interface: drag-and-drop upload, tabbed result views (parsed data, raw text, tables, confidence breakdown, tier logs), and a review queue for flagged extractions. It exists for internal use and demos, not end-user-facing.

06

Where it stands

Ledger processes bank statements from 14 banks with template-specific accuracy, falls back gracefully for unknown formats, and enriches every extraction with 25+ underwriting metrics. The test suite generates synthetic PDFs via reportlab — no brittle binary fixtures — and covers extraction, classification, bank parsing, enrichment, and quality gate logic.

The architecture is designed to improve passively. Every extraction against an unknown bank accumulates in the template match history. When a bank hits 5+ extractions without a dedicated template, the system surfaces it. Adding a new template means writing one Python class that implements four methods and registering it at import time. The registry handles everything else.

14 bank templates · Self-registering at import
4 extraction tiers · Auto-escalation cascade
25+ financial metrics · Gemini AI enrichment
11 document types · Classified with confidence scoring