I was trying to extract text from a PDF when I noticed large blocks of it were missing. After fiddling around with your code (very nice by the way, saved me the grand annoyance of learning the PDF format's internals), I realized the issue was that you rely on newlines around the "obj" tokens in the PDF, and those newlines aren't always present.
To be more exact, I changed this code:
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $objects);
$objects = @$objects[1];
to this:
preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
$objects = @$objects[1];
The latter captures more objects, but still removes the excess spacing. I'm not sure if this was some weirdness due to my particular PDFs, but if this code is useful, feel free to use it.
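To make the difference concrete, here is a minimal sketch that runs both patterns against a made-up two-object string. The sample input and the variable names ($strict, $relaxed, $strictObjects, $relaxedObjects) are assumptions for illustration only; a real PDF is far messier, so treat this as a demonstration of the regex behaviour rather than a fix for every file.

```php
<?php
// Hypothetical miniature PDF body (not a valid PDF): the first object has
// line breaks around the obj/endobj keywords, the second has none.
$infile = "1 0 obj\n<< /Length 5 >>\nendobj\n"
        . "2 0 obj<< /Type /Page >>endobj\n";

// Original pattern: a line break must follow both "obj" and "endobj".
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $strict);

// Relaxed pattern: no line break required around the keywords.
preg_match_all("#obj(.*)endobj#ismU", $infile, $relaxed);

// Index 1 holds the text captured by the (.*) group for each match.
$strictObjects  = $strict[1];
$relaxedObjects = $relaxed[1];

echo count($strictObjects), " object(s) found by the original pattern\n";
echo count($relaxedObjects), " object(s) found by the relaxed pattern\n";
```

On this input the original pattern skips the second object because nothing separates "obj" from "<<", which matches the kind of missing text described above.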
|2010-12-18 07:34:25 - In reply to message 1 from John Thomas|
Thanks. I will test it on some PDFs and, if it works, include it in the original.
|2010-12-22 13:37:26 - In reply to message 2 from joeri|
I still have the issue that some of the text in the PDF I want to convert is missing. Any idea what can be done to solve this?
|2011-06-21 14:27:21 - In reply to message 1 from John Thomas|
I checked and tested the PDF2TEXT code. It works, but it does not output line breaks; all the text runs together. Is there any way to keep the line breaks in the output file? What code do I need to add?
Thanks in advance.
|2011-06-23 16:14:59 - In reply to message 1 from John Thomas|
Do you have any new improvements on this project?
I have a similar project and see some text not captured by the current
|2011-06-26 19:35:10 - In reply to message 5 from Chris Li|
At first, this seemed to solve my problem of extracting text (to load into a database for searching), but I cannot release it as part of my project because I cannot always extract the text reliably (chunks are missing).
This is a real shame as it seems to be almost what I needed. Are there any updates scheduled?
|2013-11-22 03:17:42 - In reply to message 1 from John Thomas|
I once tried to extract text from PDF files with the help of the following code:
YiiGoImaging PDF = new YiiGoImaging();
public void PdfProcessorExtractTextPage();
PDFInputFile = (@"C:/1.pdf");
PDFPageNumberStart = "0";
PDFPageNumberStop = "4";
PDFOutputFile = OutputFormat.txt;
PDFOutputFile = (@"C:/extract.txt");
PDF.PdfProcessorExtractText(@"C:/1.pdf", "0", "4", @"C:/extract.txt");
You can check its tutorial page here:
I hope it helps. Good luck.