EPrints Technical Mailing List Archive

Message: #02082



[EP-tech] Re: advanced search doesn't work with utf-8 characters


On Mon, Jul 08, 2013 at 04:23:28PM +0000, Tommy Ingulfsen wrote:
> I think you may have come across the same problem that is described in
> this thread:
> 
> http://www.eprints.org/tech.php/thread-17424.html
> 
> Maybe you can try Tim's patch and see if that works for you?

Thank you very much for the pointer to the thread; Tim's patch does indeed
fix this problem.

Since it is already part of the next release, is there an ETA for 3.3.12?
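
A minimal sketch of the symptom described in the quoted report below. It
assumes, without confirmation from this thread, that the query terms are
being split by something that only recognises ASCII letters and digits as
word characters. The two functions are hypothetical and are not EPrints
code; they only show how such a split would turn "Agić" into "agi" and
"Bolanča" into "bolan" plus "a", while a Unicode-aware split keeps each
name as one term.

```python
import re

# Hypothetical tokenizers for illustration only; this is NOT EPrints code.
# Assumption: the faulty behaviour comes from an ASCII-only word pattern.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
# In Python 3, \w on str patterns matches Unicode letters and digits.
UNICODE_WORD = re.compile(r"\w+")


def split_ascii(query):
    """Lowercase the query and keep only runs of ASCII letters and digits."""
    return ASCII_WORD.findall(query.lower())


def split_unicode(query):
    """Lowercase the query and keep runs of Unicode word characters."""
    return UNICODE_WORD.findall(query.lower())


if __name__ == "__main__":
    # Print both splits side by side for the two queries from the report.
    for query in ("Agić", "Bolanča"):
        print(query, split_ascii(query), split_unicode(query))
```

With precomposed characters, the ASCII-only split gives the same terms as
the describe output in the report, which is why this looks like a
plausible cause. The actual fix is whatever Tim's patch changes.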

> On 7/5/13 6:43 AM, "Dobrica Pavlinusic" <dpavlin@rot13.org> wrote:
> 
> >I have a problem with utf-8 characters in advanced search. None of the
> >queries which contain utf-8 characters (in Croatia we have a few of them:
> >šđčćž) produce any results.
> >
> >I have read through the wiki and this mailing list and figured out that
> >$EPrints::Index::FREETEXT_CHAR_MAPPING might be to blame. I added
> >mappings for our characters but it didn't help (it would be nice to have
> >full support for all characters without needing to edit the EPrints
> >source).
> >
> >Digging through the EPrints source code, I noticed that my queries
> >are split on utf-8 characters. If I uncomment the line in EPrints::Search
> >with $self->get_conditions->describe I can see the following behaviour:
> >
> >1. search query: "Agić" (utf-8 as last char)
> >
> >AND(
> >        =($archive.metadata_visibility,"show") ... eprint,
> >        =($archive.eprint_status,"archive") ... eprint,
> >        index($archive.creators_name,"agi") ... eprint__rindex
> >)
> >
> >As you can see, the utf-8 character gets dropped and this doesn't produce
> >any results. I did check the eprint__rindex table and I do have "agić" in
> >there.
> >
> >2. search query: "Bolanča" (utf-8 as next-to-last char)
> >
> >AND(
> >        =($archive.metadata_visibility,"show") ... eprint,
> >        =($archive.eprint_status,"archive") ... eprint,
> >        AND(
> >                grep($archive.creators_name,"%[bolan]%[a]%-%") ... eprint__index_grep,
> >                AndSubQuery(
> >                        index($archive.creators_name,"bolan") ... eprint__rindex,
> >                        index($archive.creators_name,"a") ... eprint__rindex
> >                )
> >        )
> >)
> >
> >This is even worse, because it splits the search query into two queries
> >at the utf-8 character.
> >
> >I spent the last three days inserting warns here and there in the source
> >code in an effort to find out where this splitting is happening, but I
> >have hit a brick wall with this problem.
> >
> >I would appreciate any info or pointers on how to resolve this problem.
> >
> >-- 
> >Dobrica Pavlinusic               2share!2flame
> >dpavlin@rot13.org
> >Unix addict. Internet consultant.
> >http://www.rot13.org/~dpavlin
> >
> >*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> >*** Archive: http://www.eprints.org/tech.php/
> >*** EPrints community wiki: http://wiki.eprints.org/
> 
> 

-- 
Dobrica Pavlinusic               2share!2flame            dpavlin@rot13.org
Unix addict. Internet consultant.             http://www.rot13.org/~dpavlin