EPrints Technical Mailing List Archive

Message: #00458

[EP-tech] Search Index Troubles

To: <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Search Index Troubles
From: <rchilliard@mun.ca>
Date: Mon, 30 Apr 2012 18:38:21 +0000

Hi All,

Over the last few days, we've been sorting out a few kinks with fulltext searching / index creation on our local EPrints repository, and I thought I'd pass along the notes in the hope that they might help others. We noticed the issues after running the query posted by Paolo Tealdi a few days back, which looks for malformed content in the eprint index table:

select *,length(word) from eprint__rindex where length(word) > 35

In our local results we noted a number of 'word' values corresponding to eprints with PDF documents, in which a series of valid words were strung together with assorted Unicode characters interspersed.
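One way to see which characters are interspersed in such a 'word' value is to list its non-alphanumeric code points by name. The following is only a minimal Python sketch, not part of EPrints; the function name describe_odd_chars and the sample string are made up for illustration.

```python
import unicodedata


def describe_odd_chars(word):
    """Return (code point, Unicode name) for each distinct non-alphanumeric character."""
    seen = []
    for ch in word:
        if not ch.isalnum() and ch not in seen:
            seen.append(ch)
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in seen]


# Hypothetical sample resembling an overlong 'word' value from the index table.
sample = "health\u00a0sciences\u2028library\u2029memorial"

for code, name in describe_odd_chars(sample):
    print(code, name)
```

Running this over a handful of the overlong values should be enough to tell whether the same few characters are responsible each time.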

The troublesome Unicode characters were inserted during the export from PDF to text, which EPrints calls to generate the source fulltext to be indexed (called as '$(pdftotext) -enc UTF-8 -layout $(SOURCE) $(TARGET)'). Owing to the '-layout' argument, many spaces, line endings and paragraph endings were converted to Unicode formatting characters not handled by the default tokenizer (e.g. space to 'NO-BREAK SPACE' - "chr(0xa0)", line ending to 'LINE SEPARATOR' - "\x{2028}" and paragraph ending to 'PARAGRAPH SEPARATOR' - "\x{2029}").

These are easy to identify and add to the list of delimiters. However, it seems that the list of delimiters ('FREETEXT_SEPERATOR_CHARS') is defined in both ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl and ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, and only the latter appears to have any effect. (The former may be orphaned code specific to our repository.)
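To illustrate why the extra delimiters matter, here is a rough Python sketch of delimiter-based splitting. It is not the EPrints tokenizer (which is Perl), and BASE_SEPARATORS is an assumed stand-in, not the actual contents of FREETEXT_SEPERATOR_CHARS; it only shows the effect of adding the three characters to a separator list.

```python
import re

# Assumed stand-in for a delimiter list; NOT the real FREETEXT_SEPERATOR_CHARS.
BASE_SEPARATORS = " \t\n\r.,;:!?()[]\"'"

# The three characters we saw in the pdftotext output.
EXTRA_SEPARATORS = "\u00a0\u2028\u2029"


def tokenize(text, separators):
    """Split text on runs of any separator character, dropping empty tokens."""
    pattern = "[" + re.escape(separators) + "]+"
    return [token for token in re.split(pattern, text) if token]


# Hypothetical sample text with the Unicode characters in place of whitespace.
text = "health\u00a0sciences\u2028library\u2029memorial university"

before = tokenize(text, BASE_SEPARATORS)
after = tokenize(text, BASE_SEPARATORS + EXTRA_SEPARATORS)

print(len(before), "tokens without the extra separators")
print(len(after), "tokens with the extra separators")
```

With only the base list, the first four words stay glued together as one long token, which is the kind of value the length query turns up.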

Also of note: in our case, resetting the indexed values seemed to require reloading the config (restarting Apache and the indexer, to pick up the updated Tokenizer.pm) and emptying the eprint__rindex table, all before finally running epadmin erase_fulltext_index. For anyone whose search is misbehaving, hopefully this is of some help. Any warnings, criticisms or comments welcome!

NB: as our config could differ significantly from others out there, it would be best to try the above on a non-critical / test repository first if it is of interest to you.

Cheers,

Casey

Casey Hilliard

PC Consultant,

Health Sciences Library / QE2 Systems,

Memorial University

Phone: 709-777-2387 (HSL)

Phone: 709-864-6267 (QE2)

Follow-Ups:
- [EP-tech] Re: Search Index Troubles
  - From: Tim Brody <tdb2@ecs.soton.ac.uk>
