EPrints Technical Mailing List Archive

Message: #00439

[EP-tech] Garbage indexing some pdf

To: "<eprints-tech@ecs.soton.ac.uk>" <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Garbage indexing some pdf
From: Paolo Tealdi <paolo.tealdi@polito.it>
Date: Thu, 26 Apr 2012 12:39:05 +0200

Dear all,

we found that some PDFs can't be parsed by pdftotext, which produces documents completely full of garbage. It's not a permission problem (PDF blocked for copy & paste); more probably it is an encoding problem: opening them with Acrobat/xpdf, I can see fonts with encoding type "embedded".

Has anybody come across this type of PDF file? Has anybody resolved it?

We can find them simply with this select:

select *, length(word) from eprint__rindex where length(word) > 35
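
For anyone who wants to try the query without touching a live repository, here is a minimal sketch that runs the same select against a throwaway in-memory SQLite table. Only the table name eprint__rindex and the word column come from the message above; the eprintid and field columns, the sample rows and the find_long_words helper are assumptions made purely for illustration, and SQLite is only a stand-in for whatever database the repository really uses.

```python
import sqlite3


def find_long_words(conn, min_length=35):
    """Return (eprintid, word, length) rows whose indexed word exceeds min_length."""
    cur = conn.execute(
        "SELECT eprintid, word, LENGTH(word) AS word_length "
        "FROM eprint__rindex "
        "WHERE LENGTH(word) > ? "
        "ORDER BY word_length DESC",
        (min_length,),
    )
    return cur.fetchall()


# Stand-in table: the column layout is assumed, not taken from a real EPrints schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eprint__rindex (eprintid INTEGER, field TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO eprint__rindex VALUES (?, ?, ?)",
    [
        (1, "documents", "repository"),  # ordinary word, should not match
        (2, "documents", "x" * 40),      # 40-character run, typical of garbage output
    ],
)

suspects = find_long_words(conn)
for eprintid, word, word_length in suspects:
    print(eprintid, word_length, word[:20])
```

Grouping the result by eprintid would give the list of records whose PDFs deserve a second look.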

Attached is a screenshot with some of the garbage eprint__rindex records ...

With this select you'll also find PDFs using high Unicode characters: a whole world will open up to you :-D
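
One possible explanation for the Unicode matches, assuming the index lives in MySQL: there LENGTH() counts bytes while CHAR_LENGTH() counts characters, so a word made of multi-byte UTF-8 characters can exceed 35 bytes without being 35 characters long. The sketch below only illustrates that difference; the classify helper and its labels are hypothetical and not part of EPrints.

```python
def classify(word, threshold=35):
    """Label a word by comparing its character count with its UTF-8 byte count."""
    char_length = len(word)                   # what CHAR_LENGTH() would report
    byte_length = len(word.encode("utf-8"))   # what a byte-based LENGTH() would report
    if char_length > threshold:
        return "long"        # long by any measure, likely garbage
    if byte_length > threshold:
        return "multibyte"   # short in characters, long only in bytes
    return "ok"


print(classify("repository"))
print(classify("\u03b1" * 20))  # 20 Greek alphas: 20 characters, 40 bytes
print(classify("x" * 40))
```

If that assumption holds, replacing length(word) with char_length(word) in the select should leave mostly the real garbage.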


Best regards,
Paolo Tealdi

--
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail

Attachment: example.odt
Description: application/vnd.oasis.opendocument.text
