EPrints Technical Mailing List Archive
Message: #06900
Re: [EP-tech] Problem with searching for names starting with Ö
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] Problem with searching for names starting with Ö
- From: Christer Enkvist <christer.enkvist@slu.se>
- Date: Tue, 24 Oct 2017 14:37:56 +0000
Hi,

Querying “with the wrong keyboard” has always been an issue when non-English characters are involved. I agree that the modern approach of simply dropping accents and everything outside 7-bit ASCII solves most problems, because otherwise you need to know how the letter sounds, which is far from obvious.

For example, we translate the Swedish characters “å” and “ä” both to “a”, and “ö” to “o”. For combined characters/ligatures like the Danish “æ” we substitute both characters, in this case “ae”, as we feel this approach is the most intuitive. We get a few false positives when querying, but this is very seldom an issue.

I have simply changed Tokenizer.pm (and keep a copy in our re-installation routines to replace the file whenever we set up a new machine or the like).

I don’t know of any other way.

/Christer

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Liam Green-Hughes

Hi everyone,

We've run into an issue with searching for names containing certain characters and how they are handled by the Tokenizer.pm module (https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm). I notice in FREETEXT_CHAR_MAPPING that characters are substituted when indexing takes place or when search terms are entered. Many of the substitutions make sense, but some others seem to be done on a phonetic basis. Strangely, this isn't an issue on the simple search form, but if a name is entered in the "Creator" field of the advanced search, some strange things can happen.

For example (names have been changed), if an author exists on the system with the surname "Öl", no results are returned if I search for "Ol", but they are returned if I enter "Öl" or, more surprisingly, "Oel" (thanks to the substitution made). I understand that in many languages letters such as these are considered to be entirely different characters, but when people search using an English-language keyboard they tend to just drop the accents. This has led to a situation where results were not returned in the expected manner.

Has anyone else encountered this problem? I can change the behaviour by changing the mappings in Tokenizer.pm, but that means modifying core code. It also doesn't look to be easily overridable.

I am very interested to hear any thoughts about how to approach this!

Thanks
Liam
Library Systems Developer
University of Kent
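
The difference between the two folding strategies discussed in this thread can be sketched roughly as follows. This is an illustration in Python, not the Perl code in Tokenizer.pm: the table names `PHONETIC` and `ACCENT_DROPPING` and the helpers `fold` and `matches` are hypothetical, and the phonetic table contains only the "ö" to "oe" substitution implied by the "Öl"/"Oel" example, not the real contents of FREETEXT_CHAR_MAPPING.

```python
import unicodedata

# Hypothetical tables for illustration only; they are NOT copied from
# FREETEXT_CHAR_MAPPING in EPrints' Tokenizer.pm.
PHONETIC = {"ö": "oe"}  # the substitution implied by "Öl" matching "Oel"
ACCENT_DROPPING = {"å": "a", "ä": "a", "ö": "o", "æ": "ae"}  # as described in the reply


def fold(text, mapping):
    """Normalise to NFC, lowercase, then replace every mapped character."""
    normalised = unicodedata.normalize("NFC", text).lower()
    return "".join(mapping.get(ch, ch) for ch in normalised)


def matches(query, indexed, mapping):
    """A query matches when both sides fold to the same string."""
    return fold(query, mapping) == fold(indexed, mapping)


# Phonetic folding: the accent-free query misses the name.
print(matches("Ol", "Öl", PHONETIC))          # False
print(matches("Oel", "Öl", PHONETIC))         # True

# Accent-dropping folding: the accent-free query now finds the name.
print(matches("Ol", "Öl", ACCENT_DROPPING))   # True
print(matches("Oel", "Öl", ACCENT_DROPPING))  # False
```

Because the same folding is applied at indexing time and at query time, typing "Öl" matches under either table. The price of the accent-dropping table is the occasional false positive mentioned above: distinct words such as "får" and "far" fold to the same token.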
- References:
- [EP-tech] Problem with searching for names starting with Ö
- From: Liam Green-Hughes <L.E.Green-Hughes@kent.ac.uk>
- [EP-tech] Problem with searching for names starting with Ö