Fix on excluded text

Subject:	Fix on excluded text
Summary:	A fix on a bug I encountered
Messages:	8
Author:	John Thomas
Date:	2010-10-12 16:08:32
Update:	2013-11-22 03:17:42

1. Fix on excluded text

John Thomas - 2010-10-12 16:08:32

I was trying to extract text from a PDF when I noticed that large blocks of it were missing. After fiddling around with your code (very nice, by the way; it saved me the grand annoyance of learning the PDF format's internals), I realized the issue is that you rely on newlines around the "obj" tokens in the PDF, and those aren't actually that reliable.
To be more exact, I changed this code:
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $objects);
$objects = @$objects[1];

To:
preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
$objects = @$objects[1];
$objects = array_map('ltrim', $objects);

The latter captures more objects but still removes the excess leading whitespace. I'm not sure whether this was some weirdness specific to my particular PDFs, but if this code is useful, feel free to use it.
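
To illustrate the difference between the two patterns, here is a minimal sketch. The sample string in $infile is made up for the demonstration (it is not a real PDF), and the variable names $strict and $relaxed are hypothetical. Note that array_map() returns a new array, so its result has to be assigned.

```php
<?php
// Hypothetical sample data: the second object has no line break after "obj".
$infile = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        . "2 0 obj<< /Length 5 >>endobj\n";

// Original pattern: needs a line break after "obj" and after "endobj".
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $strict);

// Relaxed pattern: no line-break requirement around the tokens.
preg_match_all("#obj(.*)endobj#ismU", $infile, $relaxed);

// Strip the leading whitespace that the relaxed pattern now captures.
$objects = array_map('ltrim', $relaxed[1]);

echo count($strict[1]), " vs ", count($objects), "\n"; // prints "1 vs 2"
```

One caveat with the relaxed pattern: it matches any "obj" ... "endobj" pair, so it may also pick up those letters inside stream data. I have not checked how often that happens in practice.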

2. Re: Fix on excluded text

joeri - 2010-12-18 07:34:25 - In reply to message 1 from John Thomas

Thomas,

Thanks. I will test it on some PDFs and, if it works, include it in the original.

3. Re: Fix on excluded text

Rene Hart - 2010-12-22 13:37:26 - In reply to message 2 from joeri

Some of the text in the PDF I want to convert is still missing. Any idea what can be done to solve this?

4. Re: Fix on excluded text

Chris Li - 2011-06-21 14:27:21 - In reply to message 1 from John Thomas

I checked and tested the PDF2TEXT code. It works, but it does not output line breaks: all the text runs together. Is there any way to keep the line breaks in the output file? Which code do I need to add?

Thanks in advance.

5. Re: Fix on excluded text

Chris Li - 2011-06-23 16:14:59 - In reply to message 1 from John Thomas

Do you have any new improvements on this project?
I have a similar project and see some text that the current source code does not capture.

Chris

6. Re: Fix on excluded text

Tony Wilson - 2011-06-26 19:35:10 - In reply to message 5 from Chris Li

At first, this seemed to solve my problem of extracting text (to load into a database for searching). However, I cannot release it as part of my project because I cannot always extract the text reliably (chunks are missing).
This is a real shame, as it seems to be almost what I needed. Are there any updates scheduled?

7. Re: Fix on excluded text

arron wall - 2013-11-22 03:17:42 - In reply to message 1 from John Thomas

I once tried to extract text from PDF files with the help of the following code:
using YiiGo.Imaging.Basic;
using YiiGo.Imaging.Basic.Core;
using YiiGo.Imaging.Basic.Codec;
using YiiGo.Imaging.PDF;

YiiGoImaging PDF = new YiiGoImaging();

public void PdfProcessorExtractTextPage();
{
PDFInputFile = (@"C:/1.pdf");
PDFPageNumberStart = "0";
PDFPageNumberStop = "4";
PDFOutputFile = OutputFormat.txt;
PDFOutputFile = (@"C:/extract.txt");
};
PDF.PdfProcessorExtractText(@"C:/1.pdf", "0", "4", @"C:/extract.txt");
You can check its tutorial page here:
yiigo.com/guides/csharp/how-to-extr ...
I hope it helps. Good luck.

Best regards,
Arron

8. Re: Fix on excluded text

lee charles - 2016-02-20 05:20:22 - In reply to message 7 from arron wall

Hi, Arron.
Thanks for sharing this code. But I wonder whether I need a third-party PDF text extraction toolkit (like http://www.pqscan.com/extract-text/ ) to help me extract text from PDF files. If so, it would be better if it offered a free trial package for users to check. I will try it later and send you feedback.

Best regards,
Pan
