EPrints Technical Mailing List Archive
Message: #09629
[EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)
- From: Martin Brändle <martin.braendle@uzh.ch>
- Date: Mon, 19 Feb 2024 12:15:53 +0000
Dear all,
We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm.
Characters outside the Latin-1 range (Unicode code point > 0x00ff) are not translated correctly when the words for the reverse index are created, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.
The reverse index (eprint__rindex) for an author name containing such a character now holds a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, its reverse index entry changes from Bzdusek back to Bzdušek.
If we search for Bzdušek, the records are not found.
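One way to narrow this down is to check whether the string that reaches the mapping step is a decoded character string or still raw UTF-8 bytes. The following is only a minimal sketch of that check, not the actual Tokenizer code: %CHAR_MAPPING is a hypothetical two-entry stand-in for $EPrints::Index::FREETEXT_CHAR_MAPPING, and fold_chars is an illustrative helper that does not exist in EPrints. Whether a bytes/characters mismatch is really the cause here is an assumption that still has to be verified.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for $EPrints::Index::FREETEXT_CHAR_MAPPING;
# the real map is much larger. Keys are single characters.
my %CHAR_MAPPING = (
    "\x{00e4}" => 'a',    # a with diaeresis, code point <= 0x00ff
    "\x{0161}" => 's',    # s with caron,     code point >  0x00ff
);

# Illustrative helper (not part of EPrints): replace every character
# that has an entry in the map and leave all others untouched.
sub fold_chars {
    my ($text) = @_;
    return join '',
        map { exists $CHAR_MAPPING{$_} ? $CHAR_MAPPING{$_} : $_ }
        split //, $text;
}

my $chars = "Bzdu\x{0161}ek";           # decoded character string
my $bytes = encode('UTF-8', $chars);    # the same name as raw UTF-8 bytes

# The character string is folded; the byte string is not, because the
# two bytes 0xC5 0xA1 are looked up separately and neither is a map key.
printf "character string folds to: %s\n", fold_chars($chars);
printf "byte string folded:        %s\n",
    fold_chars($bytes) eq 'Bzdusek' ? 'yes' : 'no';
printf "utf8 flag: chars=%d bytes=%d\n",
    utf8::is_utf8($chars) ? 1 : 0,
    utf8::is_utf8($bytes) ? 1 : 0;
```

If the value handed to the tokenizer behaves like $bytes in this sketch, the mapping cannot match and the unfolded form ends up in the index. In that case, decoding the value before it is tokenized would be the place to look.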
We assume the problem has existed since we upgraded to RHEL 8 and perl 5.26.3.
BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different:
https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm
https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm
We have tried both versions, to no avail.
Have others observed similar problems with perl 5.26 or higher? As far as I have seen from perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).
Kind regards,
Martin
--
Dr. Martin Brändle