EPrints Technical Mailing List Archive

Message: #09688

Re: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)

To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)
From: Martin Brändle <martin.braendle@uzh.ch>
Date: Wed, 27 Mar 2024 09:03:37 +0000

CAUTION: This e-mail originated outside the University of Southampton.

Dear all,

Here is a follow-up:

After two full days of debugging, trying out many variants, and acquiring more gray hair, we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is accessed.

The lookup behaves completely erratically: sometimes š is translated to s, sometimes not. It is as if the hash sometimes did not exist.

The problem is observed with characters whose Unicode code point is > 0x00ff, i.e. characters beyond Latin-1 (š, for example, is U+0161).

A “use 5.8.0” might remedy this by falling back on Perl's old Unicode implementation, but we have not tried it.

However, we have now applied a solution that is much simpler and more failsafe, and that we already use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins: Text::Unidecode.

This library splits the code point of each non-ASCII character into its upper and lower byte and uses these two integer values as indexes into the transliteration tables, which are arrays, not hashes.
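The table lookup described above can be sketched as follows. This is only an illustration of the idea, written in Python for brevity; it is not the code of Text::Unidecode. The names TABLES, BLOCK_01 and transliterate are hypothetical, and the table holds only a handful of entries, whereas the real module ships one full table per block of 256 code points.

```python
# Illustrative sketch only: a tiny stand-in for array-based transliteration.
# TABLES, BLOCK_01 and transliterate() are hypothetical names.

# One 256-entry list per "upper byte" (code point >> 8).
# Only block 0x01 (Latin Extended-A) is filled in, with a few sample entries.
BLOCK_01 = [None] * 256
BLOCK_01[0x60] = "S"  # U+0160 Š
BLOCK_01[0x61] = "s"  # U+0161 š
BLOCK_01[0x0D] = "c"  # U+010D č
BLOCK_01[0x7E] = "z"  # U+017E ž

# Outer list indexed by the upper byte; most blocks are left empty here.
TABLES = [None] * 256
TABLES[0x01] = BLOCK_01


def transliterate(text: str, unknown: str = "?") -> str:
    """Replace non-ASCII characters using integer-indexed tables."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII passes through unchanged.
            out.append(ch)
            continue
        hi, lo = cp >> 8, cp & 0xFF
        block = TABLES[hi] if hi < len(TABLES) else None
        repl = block[lo] if block is not None else None
        # Characters without a table entry become the placeholder.
        out.append(repl if repl is not None else unknown)
    return "".join(out)


print(transliterate("Bzdušek"))  # Bzdusek
```

Because both lookups are plain integer indexes derived from the code point, the result does not depend on how a hash key for š happens to be encoded.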

Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is no longer necessary.

Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode

I also see that Text::Unidecode is part of the EPrints 3.3 package (but was removed in EPrints 3.4).

Kind regards,

Martin

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Date: Monday, 19 February 2024 at 13:16
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)

Dear all,

We have detected an indexing problem in perl_lib/EPrints/Index/Tokenizer.pm.

Characters beyond Latin-1 (Unicode code point > 0x00ff) are not translated correctly when the words for the reverse index are created, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.

The reverse index (eprint__rindex) for one of the author names containing a special character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, its reverse index entry is reverted from Bzdusek to Bzdušek.

If we search for Bzdušek, the records are not found.
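The effect of such a mixed reverse index can be modelled with a small sketch. It is a simplified, hypothetical model written in Python, not EPrints code: CHAR_MAPPING, tokenize, index_record and search are made-up names, and the mapping_applied flag merely stands in for the unexplained cases in which the character mapping is skipped.

```python
# Hypothetical, simplified model of the symptom; not EPrints code.
CHAR_MAPPING = {"š": "s"}  # stand-in for a single entry of the mapping


def tokenize(name, mapping_applied=True):
    """Lower-case a name and, if the mapping is applied, fold mapped characters."""
    word = name.lower()
    if mapping_applied:
        word = "".join(CHAR_MAPPING.get(ch, ch) for ch in word)
    return word


# Reverse index: word -> set of record ids.
rindex = {}


def index_record(recid, name, mapping_applied):
    rindex.setdefault(tokenize(name, mapping_applied), set()).add(recid)


def search(term, mapping_applied=True):
    return rindex.get(tokenize(term, mapping_applied), set())


# Record 1 is indexed with the mapping applied, record 2 without it.
index_record(1, "Bzdušek", mapping_applied=True)
index_record(2, "Bzdušek", mapping_applied=False)

print(sorted(rindex))                            # ['bzdusek', 'bzdušek']
print(search("Bzdušek"))                         # {1}
print(search("Bzdušek", mapping_applied=False))  # {2}
```

In this model no single query form returns both records, which is the kind of inconsistency a mixed index produces.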

We assume that this problem has existed since we upgraded to RHEL 8 and Perl 5.26.3.

BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different:

https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm

https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm

We have tried both versions, to no avail.

Have others observed similar problems with Perl 5.26 or higher? As far as I have seen from the Perl documentation, Unicode support has changed (e.g. the encoding pragma has been deprecated and its default mode is no longer supported).

Kind regards,

Martin

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

mail: martin.braendle@uzh.ch
phone: +41 44 63 56705
https://orcid.org/0000-0002-7752-6567
https://www.zi.uzh.ch
