Today we are going to create a PDF-to-text extractor.
As always, we first need to install and import the package. We are going to use the PyPDF2 package. It covers what we want (text extraction) and does more than we need: it can also be used to generate, decrypt, and merge PDF files.
To install this package, run the following command in the terminal.
pip install PyPDF2
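PyPDF2 renamed parts of its API across major releases (for example, `PdfFileReader` became `PdfReader`, and the old name was removed in 3.0), so it can be worth confirming which version pip actually installed. Here is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name chosen for illustration, not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed PyPDF2 version, or None if the install did not succeed.
    print(installed_version("PyPDF2"))
```

If this prints `None`, the package is not visible to the Python interpreter you are running, which usually means pip installed it into a different environment.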
# importing required modules
import PyPDF2
# opening example.pdf in binary read mode and saving the file object as pdfFileObj
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# printing the number of pages in the pdf file
print(len(pdfReader.pages))
# getting the first page as a PageObject (pages are indexed from 0)
pageObj = pdfReader.pages[0]
# extracting text from the page with the extract_text() method
print(pageObj.extract_text())
# saving the extracted text to a .txt file
with open('output_file.txt', 'w') as the_file:
    the_file.write(pageObj.extract_text())
# closing the pdf file object
pdfFileObj.close()
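The script above handles only the first page. To extract a whole document, one option is to loop over every page and join the results. The sketch below assumes a PyPDF2 release that provides `PdfReader`, `reader.pages`, and `extract_text()` (the names used in PyPDF2 2.x and 3.x). The function names `extract_all_text` and `pdf_to_text_file` are hypothetical helpers for illustration, and writing the output as UTF-8 is an assumption about the desired encoding.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, with pages separated by a blank line."""
    # Defensive: treat a missing result as an empty string so join() never fails.
    page_texts = [(page.extract_text() or "") for page in reader.pages]
    return "\n\n".join(page_texts)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of all pages in pdf_path and write it to txt_path."""
    import PyPDF2  # third-party package, installed with: pip install PyPDF2

    # The with-blocks close both files automatically, even if extraction fails.
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = extract_all_text(reader)

    with open(txt_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
```

Calling `pdf_to_text_file('example.pdf', 'output_file.txt')` would then produce the same kind of output file as before, but covering all pages. Note that scanned PDFs that contain only images will yield little or no text, since no OCR is performed.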