EPrints Technical Mailing List Archive

Message: #00448



[EP-tech] Re: Garbage indexing some pdf


Hi Paolo,

   I took a quick look at the sample you provided, and it looks like the character mapping is missing for the content text. The symptom depends on how the text is extracted. If you export the PDF to text via Acrobat or equivalent, a hex editor shows that every character in the output file is mapped to ASCII 0x2e. With a vanilla run of pdftotext (e.g. pdftotext test.pdf test.txt), the characters are mapped to ASCII 0x20. With UTF-8 output from pdftotext (the command run by the indexer is roughly pdftotext -enc UTF-8 test.pdf test_utf.txt), you get the byte sequence "ef 80 bd" for each character.
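   As a rough illustration of how such output could be flagged before it reaches the index: the bytes "ef 80 bd" decode in UTF-8 to U+F03D, which lies in the Unicode Private Use Area. The sketch below is not part of EPrints; the function name private_use_ratio and any cut-off you apply to its result are assumptions made for this example only.

```python
def private_use_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that fall in the
    Basic Multilingual Plane Private Use Area (U+E000..U+F8FF)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    pua = sum(1 for ch in visible if 0xE000 <= ord(ch) <= 0xF8FF)
    return pua / len(visible)


# Five copies of the byte sequence reported for the sample PDF.
decoded = (bytes.fromhex("ef80bd") * 5).decode("utf-8")

print(hex(ord(decoded[0])))        # 0xf03d
print(private_use_ratio(decoded))  # 1.0
```

   A document whose extracted text scores close to 1.0 most likely lacks a usable character mapping, so it could be skipped or reported rather than indexed.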

   It may be possible to reconstruct the mapping information after the fact, but I'm not aware of a mechanism to perform that operation. It also appears that this might have been done purposely when the PDF was generated: most tellingly, the licensing / attribution information at the end of the file is mapped properly.

P.S. Thank you for the note / query on testing the indexed word lengths. It has alerted us to a potential issue in our repository (and possibly others'?) whereby multiple words are indexed as a single cluster because they are not tokenized on non-breaking space ('&nbsp;') characters.
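To make the tokenization issue concrete, here is a minimal sketch, assuming the non-breaking space shows up either as the literal entity '&nbsp;' or as the character U+00A0. It is not the EPrints indexer code, and the function name tokenize is hypothetical.

```python
import re
from typing import List

NBSP = "\u00a0"  # Unicode NO-BREAK SPACE


def tokenize(text: str) -> List[str]:
    """Split text into words, treating non-breaking spaces as separators."""
    # Normalize both the HTML entity and the raw character to a plain space.
    text = text.replace("&nbsp;", " ").replace(NBSP, " ")
    return [token for token in re.split(r"\s+", text) if token]


sample = "open" + NBSP + "access&nbsp;repository software"

# Splitting on the ASCII space only leaves the first three words clustered.
naive_tokens = sample.split(" ")

print(len(naive_tokens))  # 2
print(tokenize(sample))   # ['open', 'access', 'repository', 'software']
```

Normalizing before splitting keeps the rest of the tokenizer unchanged, which seems the least intrusive way to stop over-long "words" from being indexed.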

Cheers,
Casey

________________________________________
From: eprints-tech-bounces@ecs.soton.ac.uk [eprints-tech-bounces@ecs.soton.ac.uk] on behalf of Paolo Tealdi [paolo.tealdi@polito.it]
Sent: Friday, April 27, 2012 5:50 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Re: Garbage indexing some pdf

On 04/26/2012 03:31 PM, Paolo Tealdi wrote:
> On 04/26/2012 02:02 PM, Manojlovich, Slavko wrote:
>> Hi
>> Would you please provide an example of a PDF in your repository which demonstrates this problem?
>> Thanks
>> Slavko Manojlovich
>> Associate University Librarian (IT)
>> Memorial University of Newfoundland
>> St. John's, Newfoundland
>> Canada
>> email: slavko@mun..ca
>>
>>
> Hi.
>
> unfortunately, no public pdf has this problem.
>
> Best regards,
> Paolo Tealdi
>
>

Hi all,
I've managed to create a partial document as an example. You can find it
at: http://www.biblio.polito.it/esempio.pdf
This file will be deleted tomorrow. Try to simply cut & paste from it.

Best regards,
Paolo Tealdi


--
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail
