EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #00464
[EP-tech] Re: Search Index Troubles
- To: eprints-tech@ecs.soton.ac.uk
- Subject: [EP-tech] Re: Search Index Troubles
- From: Tim Brody <tdb2@ecs.soton.ac.uk>
- Date: Tue, 01 May 2012 15:56:51 +0100
Hi,

Can you try this change:
http://trac.eprints.org/eprints/changeset/7669

(\w matches any Unicode word character, such as a letter or digit)

/Tim.

On Mon, 2012-04-30 at 18:38 +0000, rchilliard@mun.ca wrote:
> Hi All,
>
> Over the last few days, we've been sorting out a few kinks with
> fulltext searching / index creation on our local EPrints repository,
> and I thought I'd pass along the notes in the hope that they might
> help others. The issues came to light when we ran the query posted by
> Paolo Tealdi a few days back, which looks for malformed content in the
> eprint index table:
>
>   select *,length(word) from eprint__rindex where length(word) > 35
>
> In our local results we noted a number of 'word' values, belonging to
> eprints with PDF documents, in which series of valid words were strung
> together with assorted Unicode characters interspersed.
>
> The troublesome Unicode characters were inserted during the export
> from PDF to text, which EPrints calls to generate the source fulltext
> to be indexed (called as '$(pdftotext) -enc UTF-8 -layout $(SOURCE)
> $(TARGET)'). Owing to the '-layout' argument, many spaces, line
> endings and paragraph endings were converted to Unicode formatting
> characters not handled by the default tokenizer, e.g.:
>
>   - space to 'NO-BREAK SPACE' - "chr(0xa0)"
>   - line ending to 'LINE SEPARATOR' - "\x{2028}"
>   - paragraph ending to 'PARAGRAPH SEPARATOR' - "\x{2029}"
>
> These are easy to identify and add to the list of delimiters. However,
> it seems that the list of delimiters ('FREETEXT_SEPERATOR_CHARS') is
> defined in both ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl
> and ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, only the latter of
> which appears to have any effect. (The former may be orphaned code
> specific to our repository.)
>
> Also of note: in our case, resetting the indexed values seemed to
> require reloading the config (restarting Apache and the indexer, to
> pick up the changes to Tokenizer.pm) and emptying the eprint__rindex
> table, all before finally running epadmin erase_fulltext_index. To
> anyone whose search is misbehaving, hopefully this is of some help -
> any warnings, criticisms or comments welcome!
>
> NB: as our config could differ significantly from others out there, it
> might be best to test the above on a non-critical / test repository if
> it is of interest to you.
>
> Cheers,
> Casey
>
> Casey Hilliard
> PC Consultant,
> Health Sciences Library / QE2 Systems,
> Memorial University
> Phone: 709-777-2387 (HSL)
> Phone: 709-864-6267 (QE2)
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
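
The tokenizer problem described in the quoted message can be reproduced outside EPrints. The following is only a minimal Python sketch of the general idea, not EPrints code (EPrints is written in Perl) and not the content of the changeset linked above. The separator string, the function names and the sample text are all hypothetical; the real delimiter list lives in FREETEXT_SEPERATOR_CHARS and is not reproduced here. It shows how a fixed separator list that lacks U+00A0, U+2028 and U+2029 glues words into one long token, and how either extending the list or splitting on non-word characters avoids that.

```python
import re

# Hypothetical ASCII-only separator list, for illustration only.
# This is NOT the actual content of FREETEXT_SEPERATOR_CHARS in EPrints.
ASCII_SEPARATORS = " \t\r\n.,;:!?()[]\"'"

# The three characters named in the thread:
# NO-BREAK SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR.
UNICODE_LAYOUT_SEPARATORS = "\u00a0\u2028\u2029"


def split_on(text, separators):
    """Split text on runs of the given separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    return [token for token in re.split(pattern, text) if token]


def split_on_non_word(text):
    """Split text on runs of characters that are not Unicode word characters."""
    return [token for token in re.split(r"\W+", text) if token]


# Sample text imitating 'pdftotext -layout' output with layout characters
# where plain spaces and line breaks would normally be.
sample = "full\u00a0text\u2028index\u2029creation on EPrints"

naive = split_on(sample, ASCII_SEPARATORS)
extended = split_on(sample, ASCII_SEPARATORS + UNICODE_LAYOUT_SEPARATORS)
by_word_class = split_on_non_word(sample)

# The longest token is what the length(word) query above would surface.
print(max(len(token) for token in naive))
print(extended)
print(by_word_class)
```

With the ASCII-only list, the first four words come out as a single token, which is the kind of over-long 'word' value the length(word) query finds. Splitting on non-word characters has the advantage that separators nobody thought to list are handled too, at the cost of also splitting on characters such as hyphens.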
- References:
- [EP-tech] Search Index Troubles
- From: <rchilliard@mun.ca>