EPrints Technical Mailing List Archive
Message: #09688
Re: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- From: Martin Brändle <martin.braendle@uzh.ch>
- Date: Wed, 27 Mar 2024 09:03:37 +0000
Dear all,

After two full days of debugging and trying out many variants (and getting more gray hair), we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is addressed. The lookup behaves completely erratically: sometimes š is translated to s, sometimes not. It is as if the hash sometimes did not exist. We observe the problem with characters whose Unicode code point is > 0x00ff (non-ASCII characters outside the Latin-1 range, such as š at U+0161). A “use 5.8.0” might remedy this by falling back to perl's old Unicode implementation, but we have not tried that.

Instead, we have now applied a solution that we already use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, and which is much simpler and failsafe: Text::Unidecode.

This library splits the code point of a character into its upper and lower byte and uses these two integer values to address the transliteration tables, which are arrays, not hashes.

Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all. It is also possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode

Also, I see that Text::Unidecode was part of the EPrints 3.3 package (but has been removed in EPrints 3.4).

Kind regards,
Martin

--
Dr. Martin Brändle
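
The array-based lookup described in the message can be illustrated with a minimal sketch in plain Perl. It is not the Text::Unidecode source and not EPrints code: the table @TABLE, its handful of entries, and the function name transliterate_sketch are hypothetical and exist only to show how the upper and lower byte of a code point can serve as array indices instead of hash keys.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny transliteration table (an assumption for
# illustration). The outer array is indexed by the upper byte of the code
# point, the inner array by the lower byte. Text::Unidecode ships far more
# complete tables.
my @TABLE;
$TABLE[0x00][$_] = chr($_) for 0x00 .. 0x7f;   # ASCII maps to itself
$TABLE[0x00][0xe4] = 'a';                      # U+00E4 (a with diaeresis)
$TABLE[0x00][0xfc] = 'u';                      # U+00FC (u with diaeresis)
$TABLE[0x01][0x0d] = 'c';                      # U+010D (c with caron)
$TABLE[0x01][0x60] = 'S';                      # U+0160 (S with caron)
$TABLE[0x01][0x61] = 's';                      # U+0161 (s with caron)

# Transliterate a decoded (character, not byte) string to ASCII.
# Characters without a table entry are replaced by '?'.
sub transliterate_sketch {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        my $cp    = ord($char);
        my $block = $TABLE[ $cp >> 8 ];                  # upper byte selects the block
        my $repl  = defined $block ? $block->[ $cp & 0xff ] : undef;  # lower byte selects the entry
        $out .= defined $repl ? $repl : '?';
    }
    return $out;
}

print transliterate_sketch("Ko\x{161}ice"), "\n";   # prints "Kosice"
```

With Text::Unidecode installed, the exported function unidecode() should give the same result for this input, for example unidecode("Ko\x{161}ice"), without any table maintenance on the EPrints side.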