Extract PDF to Text File With Python

Today we are going to create a PDF to Text extractor.

As always, we first need to install and import the package. We are going to use the PyPDF2 package. It does more than we need here (text extraction): it can also be used to generate, decrypt, and merge PDF files.

To install this package, run the following command in the terminal.

pip install PyPDF2

Then use it in a script:

# importing required modules
import PyPDF2

# opening example.pdf in binary mode; the with block closes the file for us
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object (PdfReader replaces the deprecated PdfFileReader)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # printing the number of pages in the pdf file
    print(len(pdfReader.pages))

    # getting the first page as a PageObject
    pageObj = pdfReader.pages[0]

    # extracting text from the page with the extract_text() method
    text = pageObj.extract_text()
    print(text)

# saving the extracted text to a .txt file
with open('output_file.txt', 'w', encoding='utf-8') as the_file:
    the_file.write(text)
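
The script above only reads the first page. As a rough sketch of how the same idea could be extended to every page, the example below loops over all pages and writes them to one text file. It assumes PyPDF2 2.0 or later (the PdfReader, pages and extract_text names); the helper names join_page_texts and pdf_to_text_file and the "--- Page N ---" separator are made up for illustration and are not part of PyPDF2.

```python
from pathlib import Path


def join_page_texts(pages):
    """Concatenate the text of every page under a simple page header.

    `pages` is any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 reader.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        # extract_text() may give back nothing useful (e.g. scanned pages),
        # so fall back to an empty string
        text = page.extract_text() or ""
        chunks.append(f"--- Page {number} ---\n{text.strip()}\n")
    return "\n".join(chunks)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract all pages of pdf_path and write the result to txt_path."""
    # imported here so join_page_texts can be used without the package
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # pages are read from the open file, so extract inside the with block
        text = join_page_texts(reader.pages)

    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)
```

Calling pdf_to_text_file('example.pdf', 'output_file.txt') should then write the whole document to output_file.txt. Note that text extraction depends on how the PDF was produced; scanned pages without a text layer will typically come out empty.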
