

12. Automated Data Extraction from Documents (OCR + LLM)

Description

This workflow extracts structured data from documents such as invoices or receipts. When a new file appears in a specific Google Drive folder, the workflow sends it to an OCR service to extract the text, then passes that text to a Large Language Model (LLM), which parses out the key fields. The structured result is then saved to a database.

Architecture

graph TD
    A[New File in Google Drive] --> B{OCR Service};
    B --> C{AI Agent LLM Parser};
    C --> D[Save to Database];
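
The data flow in the diagram can be sketched as three functions chained in order. This is only an illustration of the architecture, not the n8n workflow itself: `run_pipeline`, `fake_ocr`, and `fake_parse` are hypothetical names, and the stand-in stages replace the real OCR and LLM calls so the example runs offline.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def run_pipeline(
    file_bytes: bytes,
    ocr: Callable[[bytes], str],
    parse: Callable[[str], Record],
    save: Callable[[Record], None],
) -> Record:
    """Run one document through OCR -> LLM parser -> database, in that order."""
    text = ocr(file_bytes)   # stage B: file bytes to raw text
    record = parse(text)     # stage C: raw text to structured fields
    save(record)             # stage D: persist the structured record
    return record


# Stand-in stages (assumptions) so the sketch runs without external services.
def fake_ocr(file_bytes: bytes) -> str:
    return file_bytes.decode("utf-8")


def fake_parse(text: str) -> Record:
    # A real parser would call an LLM; this one only splits "key: value" lines.
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


saved: List[Record] = []
result = run_pipeline(
    b"vendor: Acme\ntotalAmount: 12.50", fake_ocr, fake_parse, saved.append
)
```

Passing the stages in as callables mirrors how each n8n node can be swapped independently, for example replacing the database node without touching the OCR step.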

Setup

  1. Import Workflow: Import the workflow.json file into your n8n instance.
  2. Configure Trigger:
    • In the "New Invoice File" node, create or select your Google Drive credentials.
    • Specify the Folder ID of the Google Drive folder you want to monitor for new files.
  3. Configure OCR:
    • The "Scan with OCR" node calls the ocr.space API, which offers a free tier.
    • Get a free API key from https://ocr.space/ and paste it into the apiKey field.
  4. Configure LLM:
    • In the "Extract Invoice Data" node, configure the URL and authentication for your LLM API.
    • The prompt is specifically designed to parse text from an invoice and extract fields like invoiceNumber, vendor, totalAmount, and dueDate. You can customize this for your specific document type.
  5. Configure Database:
    • The "Save to Database" node is a placeholder. Replace it with the node for your desired database (e.g., PostgreSQL, MySQL, Airtable).
    • Create or select your database credentials.
    • Map the columns in your database table to the structured data extracted by the LLM.
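
Steps 4 and 5 can be illustrated with a minimal sketch that validates the LLM's reply and maps the four fields onto table columns. It assumes the LLM is prompted to answer with a JSON object, and it uses an in-memory SQLite table as a stand-in for whichever database you choose. The function names, the table layout, and the sample reply are all hypothetical.

```python
import json
import sqlite3
from typing import Any, Dict

# Field names taken from the prompt described in step 4.
FIELDS = ("invoiceNumber", "vendor", "totalAmount", "dueDate")


def parse_llm_reply(reply: str) -> Dict[str, Any]:
    """Pull the JSON object out of an LLM reply and keep only the expected fields."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(reply[start : end + 1])
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"LLM reply is missing fields: {missing}")
    record = {name: data[name] for name in FIELDS}
    # Accept the amount as a number or a plain numeric string.
    record["totalAmount"] = float(record["totalAmount"])
    return record


def save_invoice(conn: sqlite3.Connection, record: Dict[str, Any]) -> None:
    """Map the extracted fields onto the columns of a hypothetical invoices table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "invoice_number TEXT PRIMARY KEY, vendor TEXT, "
        "total_amount REAL, due_date TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO invoices "
        "VALUES (:invoiceNumber, :vendor, :totalAmount, :dueDate)",
        record,
    )
    conn.commit()


# Made-up sample reply, including the kind of preamble an LLM may add.
reply = (
    "Here is the data:\n"
    '{"invoiceNumber": "INV-001", "vendor": "Acme", '
    '"totalAmount": "1250.00", "dueDate": "2024-07-01"}'
)
conn = sqlite3.connect(":memory:")
record = parse_llm_reply(reply)
save_invoice(conn, record)
rows = conn.execute("SELECT * FROM invoices").fetchall()
```

Rejecting replies with missing fields before the insert keeps incomplete rows out of the table. If you customize the prompt for another document type, `FIELDS` and the column mapping would need to change together.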

Execution

  • The workflow polls the Google Drive folder on a schedule. You can adjust the polling frequency in the trigger node's settings.
  • To run it, drop a new invoice (PDF or image) into the monitored Google Drive folder. The workflow picks it up on the next poll and processes it automatically.
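
To make the polling behavior concrete, the sketch below shows one way a poller can pick out files it has not seen before. It does not describe how the n8n trigger is implemented. It assumes each listing entry carries the `id`, `name`, and `mimeType` fields of a Google Drive file, and the MIME-type filter is an assumption based on the "PDF or image" note above. `select_new_files` and `is_supported` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set


def is_supported(mime_type: str) -> bool:
    """Accept PDFs and any image type."""
    return mime_type == "application/pdf" or mime_type.startswith("image/")


def select_new_files(
    listing: Iterable[Dict[str, str]], seen_ids: Set[str]
) -> List[Dict[str, str]]:
    """Return supported files not returned by earlier polls, marking them as seen."""
    fresh = []
    for item in listing:
        if item["id"] in seen_ids or not is_supported(item["mimeType"]):
            continue
        seen_ids.add(item["id"])
        fresh.append(item)
    return fresh


# Two simulated polls of the same folder; the second sees one extra file.
seen: Set[str] = set()
first_poll = [
    {"id": "a1", "name": "invoice.pdf", "mimeType": "application/pdf"},
    {"id": "b2", "name": "notes.txt", "mimeType": "text/plain"},
]
second_poll = first_poll + [
    {"id": "c3", "name": "receipt.png", "mimeType": "image/png"}
]

batch_one = select_new_files(first_poll, seen)
batch_two = select_new_files(second_poll, seen)
```

Here `batch_one` contains only `invoice.pdf` and `batch_two` contains only `receipt.png`, so each file is processed once even though the folder is listed repeatedly.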
