Fix on excluded text

Subject: Fix on excluded text
Summary: A fix on a bug I encountered
Messages: 7
Author: John Thomas
Date: 2010-10-12 16:08:32
Update: 2013-11-22 03:17:42
 

  1. Fix on excluded text
John Thomas
2010-10-12 16:08:32
I was trying to extract text from a PDF when I noticed large blocks of it were missing. After fiddling around with your code (very nice by the way, it saved me the grand annoyance of learning the PDF format's internals), I realized the issue was that you rely on newlines around the "obj" tokens in the PDF, and those aren't actually that reliable.
To be more exact, I changed this code:
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $objects);
$objects = @$objects[1];

To:
preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
$objects = @$objects[1];
$objects = array_map('ltrim', $objects);

The latter captures more objects, but still removes the excess spacing. I'm not sure if this was some weirdness due to my particular pdfs, but if this code is useful, feel free to use it.
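
As a rough illustration of the difference, here is a minimal sketch that runs both patterns over a made-up sample string. The sample is hypothetical (it is not taken from a real PDF) and simply assumes a generator that writes one object with spaces instead of line breaks after "obj" and "endobj".

```php
<?php
// Hypothetical sample data: object 2 has no line break after "obj"
// or "endobj", while objects 1 and 3 do.
$infile = "1 0 obj\n<< /Length 5 >>\nendobj\n"
        . "2 0 obj << /Type /Page >> endobj 3 0 obj\n(text)\nendobj\n";

// Original pattern: requires a line break after "obj" and "endobj".
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $strict);

// Relaxed pattern: no line-break requirement, so object 2 is captured too.
preg_match_all("#obj(.*)endobj#ismU", $infile, $relaxed);

// array_map() returns a new array, so the result has to be assigned.
$objects = array_map('ltrim', $relaxed[1]);

echo count($strict[1]), " vs ", count($objects), "\n"; // prints "2 vs 3"
```

On this sample the original pattern skips the second object entirely, which matches the kind of missing blocks described above.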

  2. Re: Fix on excluded text
joeri
2010-12-18 07:34:25 - In reply to message 1 from John Thomas
Thomas,

Thanks. I will test it on some PDFs and, if it works, include it in the original.


  3. Re: Fix on excluded text
Rene Hart
2010-12-22 13:37:26 - In reply to message 2 from joeri
I still have the issue that some of the text in the PDF I want to convert is missing. Any idea what can be done to solve this?

  4. Re: Fix on excluded text
Chris Li
2011-06-21 14:27:21 - In reply to message 1 from John Thomas
I checked and tested the PDF2TEXT code. It works, but it does not output line breaks: all of the text runs together. Is there any way to keep the line breaks in the output file? What code do I need to add?

Thanks in advance.

  5. Re: Fix on excluded text
Chris Li
2011-06-23 16:14:59 - In reply to message 1 from John Thomas
Do you have any new improvements to this project?
I have a similar project and see that some text is not captured by the current source code.

Chris

  6. Re: Fix on excluded text
Tony Wilson
2011-06-26 19:35:10 - In reply to message 5 from Chris Li
At first, this seemed to solve my problem of extracting text (to load into a database for search purposes). However, I cannot release it as part of my project because I cannot always extract the text reliably (chunks are missing).
This is a real shame as it seems to be almost what I needed. Are there any updates scheduled?

  7. Re: Fix on excluded text
arron wall
2013-11-22 03:17:42 - In reply to message 1 from John Thomas
I once tried to extract text from PDF files with the help of the following code:
using YiiGo.Imaging.Basic;
using YiiGo.Imaging.Basic.Core;
using YiiGo.Imaging.Basic.Codec;
using YiiGo.Imaging.PDF;

YiiGoImaging PDF = new YiiGoImaging();

public void PdfProcessorExtractTextPage()
{
    PDFInputFile = @"C:/1.pdf";
    PDFPageNumberStart = "0";
    PDFPageNumberStop = "4";
    PDFOutputFile = OutputFormat.txt;
    PDFOutputFile = @"C:/extract.txt";
}
PDF.PdfProcessorExtractText(@"C:/1.pdf", "0", "4", @"C:/extract.txt");
You can check its tutorial page here:
http://www.yiigo.com/guides/csharp/how-to-extract-pdf-text.s ...
I hope it helps. Good luck.



Best regards,
Arron