EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #00440
[EP-tech] Re: Garbage indexing some pdf
- To: <eprints-tech@ecs.soton.ac.uk>
- Subject: [EP-tech] Re: Garbage indexing some pdf
- From: "Manojlovich, Slavko" <slavko@mun.ca>
- Date: Thu, 26 Apr 2012 09:32:11 -0230
Hi,

Would you please provide an example of a PDF in your repository which demonstrates this problem?

Thanks

Slavko Manojlovich
Associate University Librarian (IT)
Memorial University of Newfoundland
St. John's, Newfoundland
Canada
email: slavko@mun.ca

________________________________
From: eprints-tech-bounces@ecs.soton.ac.uk on behalf of Paolo Tealdi
Sent: Thu 4/26/2012 8:09 AM
To: <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Garbage indexing some pdf

Dear all,

we found that some PDFs can't be parsed by the pdftotext function, which produces documents completely full of garbage. It's not a permission problem (the PDFs are not blocked for copy & paste); more probably it is an encoding problem: opening them with acrobat/xpdf, I can see fonts with encoding type "embedded".

Has anybody come across this type of PDF file? Has anybody resolved it?

We can find them simply with this select:

    select *,length(word) from eprint__rindex where length(word) > 35

Attached is a screenshot with some of the garbled eprint__rindex records ...

With this select you'll also find PDFs using high Unicode characters: a whole world will open up to you :-D

Best regards,
Paolo Tealdi

--
Ing. Paolo Tealdi
Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi, 24 - 10129 Torino - ITALY
Skype : tealdi.paolo

Please consider your environmental responsibility before printing this e-mail

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2012.php
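
The length check from the quoted message can be tried out in isolation. The following is only a rough sketch, not EPrints code: it assumes a simplified eprint__rindex table with hypothetical columns (field, word, eprintid) and runs against an in-memory SQLite database instead of the repository's real database. The helper name find_long_words and the sample rows are made up for illustration; only the table name, the word column and the threshold of 35 come from the message.

```python
import sqlite3


def find_long_words(conn, min_length=35):
    """Return (word, length) rows whose indexed word is longer than min_length.

    Mirrors the select from the message:
        select *,length(word) from eprint__rindex where length(word) > 35
    """
    cur = conn.execute(
        "SELECT word, LENGTH(word) AS word_length "
        "FROM eprint__rindex "
        "WHERE LENGTH(word) > ? "
        "ORDER BY word_length DESC",
        (min_length,),
    )
    return cur.fetchall()


# Stand-in database: the column layout here is an assumption, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eprint__rindex (field TEXT, word TEXT, eprintid INTEGER)"
)
conn.executemany(
    "INSERT INTO eprint__rindex VALUES (?, ?, ?)",
    [
        ("documents", "repository", 1),
        ("documents", "qzxv" * 10, 2),  # 40 characters, stands in for a garbled token
        ("documents", "indexing", 3),
    ],
)

suspects = find_long_words(conn)
for word, length in suspects:
    print(length, word)
```

As the message notes, a pure length threshold also flags legitimate long tokens, so the result is a list of candidates to inspect rather than a definitive list of broken PDFs.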
- References:
- [EP-tech] Garbage indexing some pdf
- From: Paolo Tealdi <paolo.tealdi@polito.it>