PDF Reader
The PDF Reader skill extracts clean, structured text from PDF documents, a format whose layout-oriented encoding defeats most basic text extraction. It handles multi-column layouts, headers and footers (detected and stripped), tables (converted to Markdown or JSON), footnotes, page numbers, and embedded hyperlinks.
Input sources are flexible: local file paths, URLs pointing to PDFs (auto-downloaded), and base64-encoded PDF data passed in from other skills. The skill processes documents of up to 500 pages in a single call and chunks the output of large documents to stay within context limits. OCR mode activates automatically for scanned or image-based PDFs whose text is not machine-readable, using Tesseract under the hood.
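As a rough illustration of the input handling and chunking described above, here is a minimal Python sketch. It is not the skill's implementation: the function names `classify_source` and `chunk_pages` and the `max_chars` budget are hypothetical, and the base64 check simply assumes the decoded data begins with the standard `%PDF-` file signature.

```python
import base64
import binascii
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this signature


def classify_source(source: str) -> str:
    """Return 'url', 'base64', or 'path' for a PDF input string (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    try:
        # Decode only the first 16 characters (12 bytes) and look for the signature.
        # validate=True rejects characters outside the base64 alphabet.
        head = base64.b64decode(source[:16], validate=True)
        if head.startswith(PDF_MAGIC):
            return "base64"
    except (binascii.Error, ValueError):
        pass
    return "path"


def chunk_pages(pages: list[str], max_chars: int) -> list[list[str]]:
    """Group consecutive pages into chunks of at most max_chars characters.

    A single page longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for page in pages:
        if current and size + len(page) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks
```

Chunking on page boundaries keeps each chunk readable on its own; a real context budget would more likely be counted in tokens than in characters.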
Structured extraction modes let you target specific elements: extract all tables as JSON arrays, pull only the executive summary section, list all email addresses and phone numbers found in a contract, or extract financial figures from a balance sheet. Template matching applies the same extraction schema to a batch of similar documents, which is useful for processing dozens of invoices, contracts, or reports with a consistent structure.
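To make the idea of targeted extraction and a reusable schema concrete, here is a small Python sketch that works on text already extracted from a PDF. It is an illustration under stated assumptions, not the skill's own API: `extract_contacts`, `apply_schema`, the regular expressions, and the sample invoice fields are all hypothetical, and the patterns are deliberately simple rather than exhaustive.

```python
import re

# Deliberately simple patterns; real-world contact formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text: str) -> dict[str, list[str]]:
    """Collect unique email addresses and phone numbers from extracted text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    # Normalize phone numbers by keeping only digits and a leading plus sign.
    phones = sorted({re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


def apply_schema(
    schema: dict[str, re.Pattern], documents: dict[str, str]
) -> dict[str, dict[str, list[str]]]:
    """Apply one field-to-pattern schema to every document in a batch."""
    return {
        name: {field: pattern.findall(text) for field, pattern in schema.items()}
        for name, text in documents.items()
    }


# Hypothetical schema for a batch of similarly structured invoices.
invoice_schema = {
    "invoice_no": re.compile(r"Invoice\s+#(\d+)"),
    "total": re.compile(r"Total:\s+\$([\d,]+\.\d{2})"),
}
```

Because the schema is plain data (field name mapped to pattern), the same dictionary can be reused across a whole batch, which is the essence of template matching.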
Common use cases: summarize lengthy legal contracts, extract data from insurance forms, parse financial statements, process academic papers, and analyze government reports. Combine with Summarize for quick digests or with SQL Runner to load extracted tabular data into a database. Accountants, lawyers, and researchers across industries depend on PDF Reader to get at data locked inside documents.
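Loading extracted tabular data into a database can be sketched with Python's standard `sqlite3` module. This is only an assumption about what such a step might look like: it does not use SQL Runner, and the `load_table` helper, the table name, and the sample rows are hypothetical stand-ins for a table exported as a JSON array.

```python
import json
import sqlite3


def quote_ident(name: str) -> str:
    """Quote an SQL identifier; identifiers cannot be bound as parameters."""
    return '"' + name.replace('"', '""') + '"'


def load_table(conn: sqlite3.Connection, table: str, rows: list[dict]) -> int:
    """Create the table if needed and insert the rows; return the row count."""
    if not rows:
        return 0
    columns = list(rows[0])  # column names come from the first row's keys
    col_sql = ", ".join(quote_ident(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} ({col_sql})")
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} ({col_sql}) VALUES ({placeholders})",
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


# Hypothetical sample: a balance-sheet table extracted as a JSON array.
extracted = json.loads(
    '[{"account": "Cash", "amount": 1200.5}, {"account": "Inventory", "amount": 830.0}]'
)
conn = sqlite3.connect(":memory:")
inserted = load_table(conn, "balance_sheet", extracted)
total = conn.execute('SELECT SUM("amount") FROM "balance_sheet"').fetchone()[0]
```

Values are passed as bound parameters, while table and column names are quoted explicitly, since extracted headers may contain spaces or punctuation.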
Installation
clawhub install pdf-reader