extractembeddedpdftext

📊 Project Details

  • Primary Language: Python
  • Languages Used: Python, C, PowerShell, Go Template, Shell, HTML
  • License: MIT License
  • Created: January 21, 2026
  • Last Updated: January 21, 2026

📝 About

A simple Python tool that extracts the text embedded in PDF files. It performs no OCR: only text actually stored in the PDF is extracted, so scanned pages without a text layer yield no text.

Features

  • Fast text extraction using PyMuPDF (fitz)
  • Cross-platform: Windows and Linux binaries included
  • Simple command-line interface
  • Writes output to a file or to stdout
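
For readers curious how embedded-text extraction with PyMuPDF can look, here is a minimal sketch. It is not the project's actual source: the function names `extract_text` and `join_pages` are hypothetical, and joining pages with a newline after trimming trailing whitespace is an assumption made for illustration.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str], separator: str = "\n") -> str:
    """Join per-page text into one string, trimming trailing whitespace per page."""
    return separator.join(text.rstrip() for text in page_texts)


def extract_text(pdf_path: str) -> str:
    """Return the text embedded in the PDF at pdf_path (no OCR is performed)."""
    import fitz  # PyMuPDF; imported lazily so join_pages works without it

    with fitz.open(pdf_path) as doc:
        # page.get_text() returns the page's embedded text as plain text.
        # join_pages consumes the generator while the document is still open.
        return join_pages(page.get_text() for page in doc)
```

With this approach, a page that has no text layer (for example a scanned image) simply contributes an empty string, because nothing is recognized optically.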

Download Binaries

Grab the pre-compiled binary for your platform from the Releases page.

  • Windows: extract_pdf_text.exe
  • Linux: extract_pdf_text

Usage

# Extract text to <pdf>.txt
extract_pdf_text document.pdf

# Extract to specific output file
extract_pdf_text document.pdf -o output.txt

# Print to stdout
extract_pdf_text document.pdf --stdout
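
The three invocations above map naturally onto a small `argparse` parser. The following is a sketch under stated assumptions, not the tool's real argument handling: `build_parser` and `resolve_output` are hypothetical names, treating `-o` and `--stdout` as mutually exclusive is an assumption, and so is reading `<pdf>.txt` as "the input name with its suffix replaced by `.txt`".

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="extract_pdf_text",
        description="Extract embedded text from a PDF file (no OCR).",
    )
    parser.add_argument("pdf", type=Path, help="input PDF file")
    # Assumption: -o and --stdout are alternatives, so only one may be given.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("-o", dest="output", type=Path, help="output text file")
    target.add_argument("--stdout", action="store_true", help="print text to stdout")
    return parser


def resolve_output(args: argparse.Namespace) -> Optional[Path]:
    """Return the output path, or None when the text should go to stdout."""
    if args.stdout:
        return None
    if args.output is not None:
        return args.output
    # Assumption: the default output is the input path with a .txt suffix.
    return args.pdf.with_suffix(".txt")
```

For example, `resolve_output(build_parser().parse_args(["document.pdf", "-o", "output.txt"]))` returns `Path("output.txt")`, while passing `--stdout` returns `None`.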

Building from Source

# Install dependencies
pip install -r requirements.txt

# Run directly
python extract_pdf_text.py document.pdf

Compilation

Linux

pip install pyinstaller
pyinstaller --onefile --name extract_pdf_text extract_pdf_text.py

Windows (PowerShell)

pip install pyinstaller
pyinstaller --onefile --name extract_pdf_text extract_pdf_text.py

Requirements

  • Python 3.8+
  • PyMuPDF 1.24.12
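
If you run from source, a quick check of the environment against these requirements can save some debugging. This is a hedged sketch, not part of the project: the helper names are hypothetical, and treating `1.24.12` as an exact pin (rather than a minimum) is an assumption.

```python
import sys
from importlib import metadata
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 8)          # from the requirements list
PYMUPDF_VERSION = "1.24.12"  # from the requirements list; assumed to be an exact pin


def python_ok(version: Optional[Tuple[int, ...]] = None) -> bool:
    """Return True if the given (or running) Python version is at least 3.8."""
    if version is None:
        version = tuple(sys.version_info[:3])
    return version >= MIN_PYTHON


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def environment_problems() -> List[str]:
    """Collect human-readable mismatches between the environment and the requirements."""
    problems = []
    if not python_ok():
        problems.append("Python 3.8 or newer is required")
    found = installed_version("PyMuPDF")
    if found is None:
        problems.append("PyMuPDF is not installed")
    elif found != PYMUPDF_VERSION:
        problems.append(f"PyMuPDF {found} found, requirements list {PYMUPDF_VERSION}")
    return problems
```

Calling `environment_problems()` returns an empty list when everything matches, and one message per mismatch otherwise.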

License

MIT