EPrints Technical Mailing List Archive
Message: #00969
[EP-tech] Solution: Errors while indexing PDF/A files
- To: <eprints-tech@ecs.soton.ac.uk>
- Subject: [EP-tech] Solution: Errors while indexing PDF/A files
- From: <rchilliard@mun.ca>
- Date: Fri, 24 Aug 2012 13:54:53 +0000
Hi All,
Just solved an issue which had been cropping up with our repository and thought I'd pass along the solution at which we've arrived. Our setup is Ubuntu 10.04 running EPrints 3.3.7, though the issue will likely apply to most Linux-based installs.

When re-running indexing on our eprints via the console, we noted a large number of errors as the indexer progressed through documents, computing full text index info, e.g.:

    eprints@samplreposerver:~/bin$ ./epadmin reindex samplerepo eprint
    You are about to reindex "eprint" in the samplerepo repository.
    This can take some time.
    Number of records in set: 141
    Continue [y/n] ? yes
    Error: Illegal entry in bfchar block in ToUnicode CMap
    Error: Illegal entry in bfchar block in ToUnicode CMap
    Error: Bad annotation destination
    ...

We narrowed the issue down to the combination of the text extraction tool used for PDF files (pdftotext, a component of xpdf) and the particular formatting of the large number of PDF/A files in our repository. The root issue is that, as of version 3.02 of xpdf, abbreviated character codes for Unicode characters in the <00xx> range are considered invalid within CMaps, despite being in agreement with the PDF/A format generally.

The solution, as far as we've determined, is simply to upgrade to the new version of xpdf (3.03, very recently released on 2012-08-15), which addresses the issue by permitting these characters in CMaps and eliminating the (false) error messages. Unfortunately, xpdf 3.03 is not yet available via the package manager for most Linux releases, so it must be installed from the tarball (available at http://www.foolabs.com/xpdf/download.html).

Hopefully this may prove some help to others -- though if you haven't been handling PDF/A files, you mightn't note the error at all.
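For anyone wanting to check a server before reindexing, here is a minimal sketch (not part of EPrints or xpdf) that compares the installed pdftotext against 3.03. It assumes that `pdftotext -v` prints a banner of the form "pdftotext version X.YY", which is how the xpdf builds I'm aware of behave; the function names are made up for illustration. Note that Poppler also ships a pdftotext with its own unrelated version numbering, so the comparison is only meaningful for xpdf's own build.

```python
import re
import subprocess

# Assumption: xpdf 3.03 is the first release with the CMap fix described above.
FIXED_VERSION = (3, 3)


def parse_version(banner):
    """Return (major, minor) from a 'pdftotext version X.YY' banner, or None."""
    match = re.search(r"pdftotext version (\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_cmap_fix(banner):
    """True if the banner reports version 3.03 or later (xpdf numbering only)."""
    version = parse_version(banner)
    return version is not None and version >= FIXED_VERSION


def installed_banner(binary="pdftotext"):
    """Run '<binary> -v' and return whatever it printed, or '' if it cannot run."""
    try:
        result = subprocess.run([binary, "-v"], capture_output=True, text=True)
    except OSError:
        return ""
    # The banner may go to stderr, so collect both streams.
    return result.stdout + result.stderr


if __name__ == "__main__":
    banner = installed_banner()
    if parse_version(banner) is None:
        print("Could not determine the pdftotext version.")
    elif has_cmap_fix(banner):
        print("pdftotext reports 3.03 or later.")
    else:
        print("pdftotext is older than 3.03; PDF/A files may trigger CMap errors.")
```

If the script reports an older version, the upgrade described above is the thing to try; it does not itself inspect any PDF files.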
Cheers,
Casey

Casey Hilliard
PC Consultant, Health Sciences Library / QE2 Systems, Memorial University
Phone: 709-777-2387 (HSL)
Phone: 709-864-6267 (QE2)

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2011.php
This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2012.php