I was trying to extract text from a PDF when I noticed that large blocks of it were missing. After fiddling around with your code (very nice, by the way; it saved me the grand annoyance of learning the PDF format's internals), I realized the issue was that you rely on newlines around the "obj" tokens in the PDF, which aren't actually that reliable.
To be more exact, I changed this code:
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $objects);
$objects = @$objects[1];
to this:
preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
$objects = @$objects[1];
The latter captures more objects, but still removes the excess spacing. I'm not sure whether this was some weirdness due to my particular PDFs, but if this code is useful, feel free to use it.
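To illustrate the difference between the two patterns, here is a minimal sketch run against a made-up two-object snippet. The sample data and the variable names ($strict, $relaxed, $strictObjects, $relaxedObjects) are assumptions for illustration only, not part of the class; the trim() call stands in for whatever whitespace cleanup the real code does.

```php
<?php
// Hypothetical sample data: the first object has newlines around its
// "obj"/"endobj" tokens, the second is separated by spaces only.
$infile = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        . "2 0 obj << /Length 5 >> endobj\n";

// Original pattern: "obj" and "endobj" must each be followed by \n, \r
// (or a literal "|", since "|" inside a character class is not alternation).
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $strict);

// Relaxed pattern: no assumption about what follows the tokens.
preg_match_all("#obj(.*)endobj#ismU", $infile, $relaxed);

// Index 1 holds the text captured by (.*); trim() drops surrounding whitespace.
$strictObjects  = array_map('trim', $strict[1]);
$relaxedObjects = array_map('trim', $relaxed[1]);

// Prints: strict=1 relaxed=2
echo "strict=", count($strictObjects), " relaxed=", count($relaxedObjects), "\n";
```

One caveat with the relaxed pattern: because of the "i" modifier, "obj" can also match inside other tokens such as /XObject, so it may pick up some spurious matches in real files.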
Chris Li - 2011-06-21 14:27:21 - In reply to message 1 from John Thomas
I checked and tested the PDF2TEXT code. It works, but it does not output line breaks: all the text runs together. Is there any way to keep the line breaks in the output file? Which code do I need to add?
Tony Wilson - 2011-06-26 19:35:10 - In reply to message 5 from Chris Li
At first, this seemed to solve my problem of extracting text (to load into a database for searching purposes). However, I cannot release it as part of my project because I cannot always extract the text reliably (chunks are missing).
This is a real shame as it seems to be almost what I needed. Are there any updates scheduled?
arron wall - 2013-11-22 03:17:42 - In reply to message 1 from John Thomas
I once tried to extract text from PDF files with the help of the following code:
YiiGoImaging PDF = new YiiGoImaging();
public void PdfProcessorExtractTextPage();
PDFInputFile = (@"C:/1.pdf");
PDFPageNumberStart = "0";
PDFPageNumberStop = "4";
PDFOutputFile = OutputFormat.txt;
PDFOutputFile = (@"C:/extract.txt");
PDF.PdfProcessorExtractText(@"C:/1.pdf", "0", "4", @"C:/extract.txt");
You can check its tutorial page here: yiigo.com/guides/csharp/how-to-extr ...
I hope it helps. Good luck.
lee charles - 2016-02-20 05:20:22 - In reply to message 7 from arron wall
Thanks for sharing this code. But I wonder whether I need a third-party PDF text extraction toolkit (like http://www.pqscan.com/extract-text/) to help me extract text from PDF files. If so, it would be better if it offered a free trial package for users to check. I will try it later and send you feedback.