PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents.
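
As a rough illustration of the extraction use case, the sketch below reads the plain text of every page in a PDF. It assumes PyMuPDF is installed (pip install pymupdf) and a recent release that can be imported as pymupdf (older releases use the name fitz). The helper names extract_page_texts and page_word_counts are hypothetical, chosen for this example only.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the plain text of each page in the document at pdf_path."""
    # Imported lazily so the rest of this module loads without PyMuPDF.
    import pymupdf  # third-party; assumed installed via `pip install pymupdf`

    with pymupdf.open(pdf_path) as doc:
        # Page.get_text() with no arguments returns the page's plain text.
        return [page.get_text() for page in doc]


def page_word_counts(page_texts: list[str]) -> list[int]:
    """Hypothetical helper: count whitespace-separated words on each page."""
    return [len(text.split()) for text in page_texts]
```

Calling page_word_counts(extract_page_texts("input.pdf")) would give one word count per page, where "input.pdf" is a placeholder path.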

PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output. It also supports multi-column pages, image and vector graphics extraction (with references included in the Markdown text), and page chunking.
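
A minimal sketch of Markdown extraction and page chunking follows. It assumes PyMuPDF4LLM is installed (pip install pymupdf4llm) and relies on its to_markdown function and page_chunks option. The wrapper names pdf_to_markdown, pdf_to_page_chunks, and chunk_texts are hypothetical, and the "text" key read from each chunk reflects the chunk layout as understood at the time of writing, so check it against the version you use.

```python
def pdf_to_markdown(pdf_path: str) -> str:
    """Return the whole document as a single Markdown string."""
    # Imported lazily so the module loads without PyMuPDF4LLM present.
    import pymupdf4llm  # third-party; assumed installed

    return pymupdf4llm.to_markdown(pdf_path)


def pdf_to_page_chunks(pdf_path: str) -> list[dict]:
    """Return one dictionary per page instead of a single string."""
    import pymupdf4llm  # third-party; assumed installed

    return pymupdf4llm.to_markdown(pdf_path, page_chunks=True)


def chunk_texts(chunks: list[dict]) -> list[str]:
    """Hypothetical helper: pull the Markdown text out of each page chunk.

    Assumes each chunk stores its Markdown under the "text" key.
    """
    return [chunk["text"] for chunk in chunks]
```

For example, chunk_texts(pdf_to_page_chunks("input.pdf")) would yield one Markdown string per page, ready to be embedded or indexed separately; "input.pdf" is a placeholder path.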