EPrints Technical Mailing List Archive


Message: #02081



[EP-tech] Re: advanced search doesn't work with utf-8 characters


I think you may have come across the same problem that is described in
this thread:

http://www.eprints.org/tech.php/thread-17424.html

Maybe you can try Tim's patch and see if that works for you?

tommy

On 7/5/13 6:43 AM, "Dobrica Pavlinusic" <dpavlin@rot13.org> wrote:

>I have a problem with UTF-8 characters in advanced search. None of the
>queries that contain non-ASCII characters (Croatian has a few of them:
>šđčćž) produce any results.
>
>I have read through the wiki and this mailing list and figured out that
>$EPrints::Index::FREETEXT_CHAR_MAPPING might be to blame. I added
>mappings for our characters, but it didn't help (it would be nice to have
>full support for all characters without needing to edit the EPrints source).
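To illustrate the kind of mapping described above, here is a minimal Python sketch of a character folding table. It is not the contents of $EPrints::Index::FREETEXT_CHAR_MAPPING; the names CROATIAN_FOLD and fold are hypothetical. It also shows one possible reason (an assumption, not confirmed for EPrints) why adding mappings might not help: a table keyed on characters never matches if the input reaches it as mis-decoded UTF-8 bytes.

```python
# Hypothetical folding table for the Croatian letters mentioned above.
# This is an illustration only, not EPrints' real mapping.
CROATIAN_FOLD = str.maketrans({
    "š": "s", "đ": "d", "č": "c", "ć": "c", "ž": "z",
    "Š": "S", "Đ": "D", "Č": "C", "Ć": "C", "Ž": "Z",
})


def fold(text):
    """Replace Croatian diacritic letters with ASCII look-alikes."""
    return text.translate(CROATIAN_FOLD)


# Properly decoded text is folded as expected.
folded = fold("Agić")

# Assumption for illustration: the same UTF-8 bytes read as Latin-1.
# "ć" becomes two unrelated characters, so no table entry matches.
mis_decoded = "Agić".encode("utf-8").decode("latin-1")
unchanged = fold(mis_decoded)
```

Here folded is "Agic", while unchanged is identical to mis_decoded: the mapping is silently skipped when the text was not decoded to characters first.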
>
>Digging through the EPrints source code, I noticed that my queries
>are split on UTF-8 characters. If I uncomment the line in EPrints::Search
>that calls $self->get_conditions->describe, I can see the following
>behaviour:
>
>1. search query: "Agić" (UTF-8 character is the last char)
>
>AND(
>        =($archive.metadata_visibility,"show") ... eprint,
>        =($archive.eprint_status,"archive") ... eprint,
>        index($archive.creators_name,"agi") ... eprint__rindex
>)
>
>As you can see, the UTF-8 character gets dropped, so the search doesn't
>produce any results. I checked the eprint__rindex table and it does
>contain "agić".
>
>2. search query: "Bolanča" (UTF-8 character is the next-to-last char)
>
>AND(
>        =($archive.metadata_visibility,"show") ... eprint,
>        =($archive.eprint_status,"archive") ... eprint,
>        AND(
>                grep($archive.creators_name,"%[bolan]%[a]%-%") ... eprint__index_grep,
>                AndSubQuery(
>                        index($archive.creators_name,"bolan") ... eprint__rindex,
>                        index($archive.creators_name,"a") ... eprint__rindex
>                )
>        )
>)
>
>This is even worse, because the search query is split into two terms at
>the UTF-8 character.
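The two results above are what you would get if the query were tokenized as raw bytes with an ASCII-only word pattern. The Python sketch below reproduces that pattern; it is an assumption about the cause, not EPrints' actual tokenizer, and the function names are hypothetical.

```python
import re

# Assumed byte-level tokenizer: keeps only runs of ASCII letters and digits.
ASCII_WORD = re.compile(rb"[a-z0-9]+")

# Character-level tokenizer: \w matches Unicode letters in str patterns.
UNICODE_WORD = re.compile(r"\w+")


def split_terms_bytes(query):
    """Tokenize the undecoded UTF-8 bytes of the query."""
    raw = query.encode("utf-8").lower()  # bytes.lower() only touches ASCII
    return [m.decode("ascii") for m in ASCII_WORD.findall(raw)]


def split_terms_text(query):
    """Tokenize the query as decoded characters."""
    return UNICODE_WORD.findall(query.lower())


print(split_terms_bytes("Agić"))     # ['agi']
print(split_terms_bytes("Bolanča"))  # ['bolan', 'a']
print(split_terms_text("Agić"))      # ['agić']
print(split_terms_text("Bolanča"))   # ['bolanča']
```

In the byte-level version the two bytes that encode "ć" or "č" are not word characters, so they act as a separator: a trailing one is dropped and one in the middle splits the term in two. If this is what is happening, the query string is probably not being decoded to characters before it is tokenized.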
>
>I spent the last three days inserting warns here and there in the source
>code in an effort to find out where this splitting is happening, but I
>have hit a brick wall with this problem.
>
>I would appreciate any info or pointers on how to resolve this problem.
>
>-- 
>Dobrica Pavlinusic               2share!2flame
>dpavlin@rot13.org
>Unix addict. Internet consultant.
>http://www.rot13.org/~dpavlin