
Extract Text from PDF using PHP - How Can PHP Extract Images from PDF File Data and Search PDF Content using PHP PDFToText Class - PHP PDF to Text package blog


Author:

Updated on: 2017-03-18

Posted on:

Package: PHP PDF to Text

Extracting text from individual pages or whole PDF document files in PHP is easy using the PdfToText class.

This article is the first of a series about the challenges of processing the PDF file format. It shows how the PdfToText class can be used to extract text and images from PDF files.




Contents

How Can PHP Extract Data from PDF?

Installation

Generating PDF files

Getting Started

How Can PHP Extract Text from PDF?

How Can PHP Extract Images from PDF?

Documentation

How to Contribute to the Development of the PdfToText class?

Known Issues

Download the PdfToText class


How Can PHP Extract Data from PDF?

Extracting text from PDF files can be a tedious task for a developer. If you have ever opened a PDF file in a text editor such as Notepad++ just to search for some text that you know is in it, chances are that you found nothing but binary data!

This is due to the open nature of the PDF file format: the basic elements of a PDF file are objects, usually identified by a unique object number and a revision id.

Objects can contain anything: font definitions, character substitution tables and, of course, text data. Most of these objects are compressed with the Flate method (the zlib/deflate algorithm also used by gzip), and possibly encrypted. You can also expect even more complicated things under the hood.
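
To make this concrete, here is a minimal sketch of what "an object with compressed text data" looks like and how its contents can be recovered. It does not use the PdfToText class and is not how the class works internally: the object is built by hand, the helper decode_first_flate_stream() is a hypothetical name, and the parsing is deliberately naive (no nested dictionaries, no encryption). It assumes the PHP zlib extension is available.

```php
<?php
// Illustrative only: a hand-built, PDF-like object, not a complete PDF file.
$content    = "BT /F1 12 Tf (Hello, world) Tj ET";
$compressed = gzcompress($content);   // zlib format, the one the FlateDecode filter expects
$object     = "4 0 obj\n<< /Length " . strlen($compressed) . " /Filter /FlateDecode >>\n"
            . "stream\n" . $compressed . "\nendstream\nendobj\n";

/**
 * Hypothetical helper: decode the first Flate-compressed stream found in $data.
 * Returns null when no such stream exists or when decompression fails.
 */
function decode_first_flate_stream(string $data): ?string
{
    // Match "<object number> <revision id> obj << dictionary >> stream<EOL>"
    $pattern = '/\d+ \d+ obj\s*<<(.*?)>>\s*stream\r?\n/s';
    if (!preg_match($pattern, $data, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    $dictionary = $m[1][0];
    if (strpos($dictionary, '/FlateDecode') === false
        || !preg_match('/\/Length\s+(\d+)/', $dictionary, $length)) {
        return null;
    }

    // The stream bytes start right after the matched header
    $start = $m[0][1] + strlen($m[0][0]);
    $bytes = substr($data, $start, (int) $length[1]);
    $text  = @gzuncompress($bytes);

    return $text === false ? null : $text;
}

echo decode_first_flate_stream($object), "\n";
```

The decoded stream is still not plain text: it contains drawing instructions in which the text is embedded, which is part of what makes extraction difficult.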

This article explains how the PHP PDF To Text class can help you to extract text from almost any PDF file.

It will be followed by a series of articles explaining various parts of the PDF file format that are of interest during the text extraction process.

Installation

Talking about an installation process would be a little bit pretentious: just extract the PdfToText.phpclass file from the .zip archive to your preferred includes directory.

You may also install it using the composer tool from the PHP Classes composer repository.

A future version may include additional and completely optional satellite data files, but that's another story which will be the subject of another article...

Generating PDF files

Before you start working with the PdfToText class, you will of course need a few sample PDF files. If you do not have any at hand, a few are provided in the PdfToText .zip package, under the examples directory.

If you are using the Windows operating system, the following virtual printer drivers can help you generate PDF files (the list is not exhaustive):

  • Microsoft Print to PDF: the native solution from Microsoft. If not installed on your system, you can have a look here. Note that it may sometimes generate weird results.
  • PdfCreator : a free virtual printer. The free version contains some ads.
  • PrimoPdf : another free virtual PDF printer.
  • Pdf Architect 4: Another product from PdfForge, which is not free. However, it includes a free virtual PDF printer driver very similar to PdfCreator (if not identical, apart from the name).
  • Pdf Pro 10 : A paid solution for editing PDF files. It includes a free virtual printer driver that has many interesting features, such as an elaborate printer spooler for managing files printed on servers.
  • PdFill Image Writer : A free virtual printer. You can also purchase a PDF editor for less than $20.
  • And, of course, Adobe Acrobat DC.

You can also use Microsoft Word (2007 or later), OpenOffice and LibreOffice to save your documents as PDF files.

Getting Started

Although the PDF file format is really versatile, the PdfToText class has been designed to hide the complexity of the underlying data from you and to provide a simple interface.

The simplest PHP script that processes a PDF file given as a command-line argument and echoes its text contents to the standard output looks like this:

<?php
    require( 'path/to/PdfToText.phpclass' );

    // Load the file named on the command line and print its text
    $pdf = new PdfToText( $argv[1] );
    echo $pdf->Text;


Once you have loaded a PDF file, its text contents are accessible through the Text property. The filename supplied to the class constructor is optional: you can omit it and call the Load() method later to extract the file contents.

This allows you to specify additional options or set special properties before loading the actual PDF contents. The following example will extract images from your PDF file by setting the Options property before calling the Load() method:

<?php
    require( 'path/to/PdfToText.phpclass' );

    $pdf = new PdfToText( );
    // Set options first, then load the file
    $pdf->Options = PdfToText::PDFOPT_DECODE_IMAGE_DATA;
    $pdf->Load( $argv[1] );
    echo $pdf->Text;

Note that this second approach will allow you to reuse the same object (with the same options) for processing different PDF files.
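
As a sketch of that reuse, the function below loads several files one after the other with a single object. The name extract_all() is hypothetical, and the code relies only on the Load() method and the Text property described above. It assumes that Text reflects the most recently loaded file, which the article implies but does not state.

```php
<?php
/**
 * Hypothetical helper: extract text from several files with one reader.
 * $pdf is any object exposing Load( $filename ) and a Text property,
 * as the PdfToText class does according to this article.
 */
function extract_all(object $pdf, array $filenames): array
{
    $texts = [];

    foreach ($filenames as $filename) {
        $pdf->Load($filename);            // options already set on $pdf apply to each file
        $texts[$filename] = $pdf->Text;   // assumed to hold the text of the file just loaded
    }

    return $texts;
}

// Possible usage with the real class (not run here):
//   require( 'path/to/PdfToText.phpclass' );
//   $pdf   = new PdfToText( );
//   $texts = extract_all( $pdf, array_slice( $argv, 1 ) );
```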

How Can PHP Extract Text from PDF?

You can retrieve the contents of individual pages through the Pages array property which, like the Text property, is available once the PDF file contents have been loaded.

The Pages property is an associative array whose keys are page numbers and whose values are the page contents.

A sample script that displays the contents of each page of a PDF file looks like this:

<?php
    require( 'path/to/PdfToText.phpclass' );

    $pdf = new PdfToText( $argv[1] );

    // Pages maps page numbers to page contents
    foreach( $pdf->Pages as $page_number => $page_contents )
        echo "Contents of page #$page_number :\n$page_contents\n";
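
Because Pages is a plain array of strings, searching the content of a PDF file comes down to ordinary string functions. The sketch below uses a hypothetical helper, find_pages_containing(), and stand-in data in place of $pdf->Pages so that it runs without the class.

```php
<?php
/**
 * Hypothetical helper: return the page numbers whose text contains $needle
 * (case-insensitive for ASCII letters). $pages has the same shape as the
 * Pages property: page number => page contents.
 */
function find_pages_containing(array $pages, string $needle): array
{
    $matches = [];

    foreach ($pages as $page_number => $page_contents) {
        if ($needle !== '' && stripos($page_contents, $needle) !== false) {
            $matches[] = $page_number;
        }
    }

    return $matches;
}

// Stand-in data; with the class you would pass $pdf->Pages instead.
$pages = [
    1 => "Invoice 2017-001\nTotal due: 42 EUR",
    2 => "Terms and conditions",
];

print_r(find_pages_containing($pages, 'total'));
```

For text containing non-ASCII letters, mb_stripos() from the mbstring extension is probably a better choice than stripos().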

How Can PHP Extract Images from PDF?

The PDF file format supports several types of image contents. In its current version (1.2.46), the PdfToText class is only able to process images encoded in the JPEG format.

Retrieving image contents is as simple as specifying a special option as the second parameter of the class constructor:

<?php
    require( 'path/to/PdfToText.phpclass' );

    // Decode JPEG images while the constructor loads the file
    $pdf = new PdfToText( $argv[1],
        PdfToText::PDFOPT_DECODE_IMAGE_DATA );

Or, if you prefer deferred loading:

<?php
    require( 'path/to/PdfToText.phpclass' );

    $pdf = new PdfToText( );
    $pdf->Options = PdfToText::PDFOPT_DECODE_IMAGE_DATA;
    $pdf->Load( $argv[1] );

Once loaded, image contents will be available through the Images array property, which is an array of image resources that have been created for each JPEG image encountered in your PDF file.

There is another option, PdfToText :: PDFOPT_GET_IMAGE_DATA, which simply loads raw image data into the ImageData array property. This way, you may have more elements in the ImageData property than in Images, since the PdfToText class currently supports only JPEG images.

Note that specifying the PDFOPT_DECODE_IMAGE_DATA flag automatically sets the PDFOPT_GET_IMAGE_DATA one.
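
To see why ImageData can hold more entries than Images, it can help to count how many raw entries are actually JPEG data. The sketch below assumes that each ImageData element is a raw byte string, which the article suggests but does not spell out. Both function names are hypothetical, and stand-in data is used so that the code runs without the class.

```php
<?php
/**
 * Hypothetical helper: true when $bytes starts with the JPEG signature
 * (the FF D8 start-of-image marker, followed by the FF byte of the next marker).
 */
function looks_like_jpeg(string $bytes): bool
{
    return strncmp($bytes, "\xFF\xD8\xFF", 3) === 0;
}

/**
 * Hypothetical helper: count JPEG and non-JPEG entries in a list of raw image byte strings.
 */
function count_by_kind(array $raw_images): array
{
    $counts = ['jpeg' => 0, 'other' => 0];

    foreach ($raw_images as $bytes) {
        $counts[looks_like_jpeg($bytes) ? 'jpeg' : 'other']++;
    }

    return $counts;
}

// Stand-in data: one JPEG-like header and one arbitrary blob.
// With the class, and under the assumption above, you would pass $pdf->ImageData.
print_r(count_by_kind(["\xFF\xD8\xFF\xE0" . "rest-of-file", "\x00\x01\x02"]));
```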

Documentation

The complete documentation of the format is available at the Adobe PDF Reference version 1.7 page.

If you are enthusiastic enough to read the 1300 pages of this document, keep in mind that Adobe also provides a generous set of technical notes addressing specific topics not completely covered by the specification. Some of these technical notes are more than 200 pages long.

How to Contribute to the Development of the PdfToText class?

There are so many ways to write the same page contents using the Adobe Postscript-like language that sometimes you may get strange results. Should this be the case, please feel free to contact me on this package support forum.

You can also have a look at my GitHub repository, and even submit pull requests. I also have a Web site dedicated to this class.

However, if you have an issue while processing one of your PDF files and really don't want to go through the code to understand what is happening, you can reach me directly by email at christian.vigh@wuthering-bytes.com. Just send me the faulty PDF file as an attachment together with a short description of the issue, and I will be happy to try to solve your problem.

Known Issues

The following is a list of known issues. I'm still working on them, and fixes should be included in future versions:

  • RTL languages, such as Arabic, Hebrew or Syriac, are not correctly processed: they are extracted from left to right
  • Only JPEG images are currently supported
  • There is currently no support for password-protected files (note that I'm not intending to develop a password cracker, just a feature that allows you to extract text contents from a password-encrypted PDF file, if you supply the correct password)
  • Digitally signed files are not currently supported
  • Text contents may sometimes show badly translated characters. The reason will be explained in the next articles of this series
  • The extracted text contents may not exactly reflect text positioning on the page. This is especially true regarding PDF files that contain data in tabular format. Again, this issue will be fixed in a future release and explained in one of the future articles about this class.
  • CID fonts (Adobe internal fonts, mainly used by eastern languages and developed before the Unicode effort took place) are not yet supported. This will be the subject of another article.

Download the PdfToText class

This article explained the basic usage of the PdfToText class. It presented a few features of the class, gave some basic examples on how to use it, and listed its current development state.

More articles will follow, diving into the internals of the PDF file format and explaining how the PdfToText class tries to handle them. The next article will lead you into a general overview of a PDF file layout (at least, the parts of it that are of interest to us when dealing with text extraction).

If you liked this article, please feel free to share it with other developers. If you have questions, post a comment here.



