PyPDF2

PyPDF2 is a free, open-source, pure-Python library for reading, manipulating, and writing PDF documents. It cannot lay out new content from scratch or edit existing content at a fine-grained level (for example, changing text within a paragraph). Its strength is page-level work: splitting, merging, rotating, cropping, and encrypting pages that already exist.

Key Capabilities:
- Reading PDFs: Open and read existing PDF files.
- Extracting Text: Extract text content from specific pages or the entire document.
- Splitting PDFs: Divide a multi-page PDF into multiple smaller PDF files.
- Merging PDFs: Combine several PDF files into a single document.
- Rotating Pages: Change the orientation of individual pages.
- Adding Watermarks/Stamps: Overlay content (like another PDF page) onto existing pages.
- Encrypting/Decrypting: Secure PDF files with passwords or remove existing encryption.
- Cropping Pages: Adjust the visible area of pages.
- Adding Blank Pages: Insert empty pages into a PDF.

How it Works:
PyPDF2 operates on a page-by-page basis. You typically create a `PdfReader` object to read an existing PDF, access its pages, and then use a `PdfWriter` object to build a new PDF document by adding pages (potentially modified) from the reader or other sources.

Installation:
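Install PyPDF2 with pip: `pip install PyPDF2`. Note that PyPDF2 is no longer developed under that name; the project continues as `pypdf`, so new projects may prefer `pip install pypdf`.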

Use Cases:
- Automating report generation by merging various PDF sections.
- Extracting data (text) from standardized PDF forms.
- Archiving documents by splitting large PDFs.
- Securing sensitive PDF documents.

Limitations:
PyPDF2 is not designed for:
- Creating PDFs from scratch (you can only assemble existing pages or blank ones).
- Low-level content editing (e.g., changing a specific word or image on a page). For such tasks, libraries like ReportLab (for creation) or commercial SDKs might be more suitable.

Example Code

import PyPDF2
import os

# --- Step 0: Prerequisites ---
# This example expects two existing files in the same directory as the script:
#   'document1.pdf' with at least two pages
#   'document2.pdf' with at least one page
# If you don't have them, create two short text files and print them to PDF.
# Each step below reports a message instead of crashing when a file is missing.

# --- Step 1: Read a PDF and extract text ---
try:
    reader = PyPDF2.PdfReader('document1.pdf')
    print(f"Number of pages in document1.pdf: {len(reader.pages)}")

    # Extract text from the first page (index 0)
    if len(reader.pages) > 0:
        first_page = reader.pages[0]
        text = first_page.extract_text()
        print("\n--- Text from page 1 of document1.pdf ---")
        print(text[:200])  # Print the first 200 characters
        print("------------------------------------------")

    # Extract text from the second page (index 1)
    if len(reader.pages) > 1:
        second_page_text = reader.pages[1].extract_text()
        print("\n--- Text from page 2 of document1.pdf ---")
        print(second_page_text[:200])  # Print the first 200 characters
        print("------------------------------------------")

except FileNotFoundError:
    print("Error: 'document1.pdf' not found. Please create it for the example.")
except Exception as e:
    print(f"An error occurred while reading document1.pdf: {e}")

# --- Step 2: Merge multiple PDF files ---
print("\n--- Merging PDFs ---")
try:
    merger = PyPDF2.PdfMerger()

    if os.path.exists('document1.pdf'):
        merger.append('document1.pdf')
    else:
        print("Warning: 'document1.pdf' not found for merging.")

    if os.path.exists('document2.pdf'):
        merger.append('document2.pdf')
    else:
        print("Warning: 'document2.pdf' not found for merging.")

    if os.path.exists('document1.pdf') or os.path.exists('document2.pdf'):
        with open('merged_document.pdf', 'wb') as output_pdf:
            merger.write(output_pdf)
        print("Successfully merged documents into 'merged_document.pdf'")
    else:
        print("No documents found to merge. Skipping merge operation.")

    merger.close()

except Exception as e:
    print(f"An error occurred during merging: {e}")


# --- Step 3: Rotate a specific page and save as a new PDF ---
print("\n--- Rotating a page ---")
try:
    if os.path.exists('document1.pdf'):
        reader_rotate = PyPDF2.PdfReader('document1.pdf')
        writer_rotate = PyPDF2.PdfWriter()

        if len(reader_rotate.pages) > 0:
            # Get the first page
            page = reader_rotate.pages[0]

            # Rotate the page by 90 degrees clockwise
            page.rotate(90)

            # Add the rotated page to the writer
            writer_rotate.add_page(page)

            # Add the remaining pages from the original document unchanged
            for i in range(1, len(reader_rotate.pages)):
                writer_rotate.add_page(reader_rotate.pages[i])

            with open('rotated_document.pdf', 'wb') as output_pdf:
                writer_rotate.write(output_pdf)
            print("Successfully rotated the first page of document1.pdf and saved as 'rotated_document.pdf'")
        else:
            print("document1.pdf has no pages to rotate.")
    else:
        print("Error: 'document1.pdf' not found for rotation example.")
except Exception as e:
    print(f"An error occurred during rotation: {e}")
