EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #04149
< Previous (by date) | Next (by date) > | < Previous (in thread) | Next (in thread) > | Messages - Most Recent First | Threads - Most Recent First
[EP-tech] Re: Antwort: Use of truncation in advanced searches
- To: eprints-tech@ecs.soton.ac.uk
- Subject: [EP-tech] Re: Antwort: Use of truncation in advanced searches
- From: Gilles Fournié <gilles.fournie@cirad.fr>
- Date: Tue, 21 Apr 2015 14:37:05 +0200
Hi, Thanks to both of you. I have found the reason of the slowness with truncation. It comes from the "COLLATE utf8_general_ci" used with LIKE. If I run the query used by EPrints, it runs for 5 minutes : mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN eprint__ordervalues_fr ON eprint.eprintid = eprint__ordervalues_fr.eprintid , -> eprint__rindex AS eprint__rindex WHERE eprint.eprintid = eprint__rindex.eprintid AND( -> eprint.metadata_visibility = 'show' AND eprint.eprint_status = 'archive' AND( -> eprint__rindex.field = 'title' AND eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR eprint__rindex.field = 'titre_parallele_titre' AND eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR eprint__rindex.field = 'autre_titre_titre' AND eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' -> ) -> ) GROUP BY eprint.eprintid , -> eprint__ordervalues_fr.date , -> eprint__ordervalues_fr.creators_name , -> eprint__ordervalues_fr.title ORDER BY eprint__ordervalues_fr.date DESC , -> eprint__ordervalues_fr.creators_name ASC , -> eprint__ordervalues_fr.title ASC; +----------+ | eprintid | +----------+ | 513588 | | 477759 | | 476746 | | 426892 | +----------+ 4 rows in set (5 min 33.61 sec) But without the COLLATE part, it returns immediatly... mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN eprint__ordervalues_fr ON eprint.eprintid = eprint__ordervalues_fr.eprintid , -> eprint__rindex AS eprint__rindex WHERE eprint.eprintid = eprint__rindex.eprintid AND( -> eprint.metadata_visibility = 'show' AND eprint.eprint_status = 'archive' AND( -> eprint__rindex.field = 'title' AND eprint__rindex.word LIKE 'thermograph%' OR -> eprint__rindex.field = 'titre_parallele_titre' AND eprint__rindex.word LIKE 'thermograph%' OR -> eprint__rindex.field = 'autre_titre_titre' AND eprint__rindex.word LIKE 'thermograph%' -> ) -> ) GROUP BY eprint.eprintid , -> eprint__ordervalues_fr.date , -> eprint__ordervalues_fr.creators_name , -> eprint__ordervalues_fr.title ORDER BY eprint__ordervalues_fr.date DESC , -> eprint__ordervalues_fr.creators_name ASC , -> eprint__ordervalues_fr.title ASC -> ; +----------+ | eprintid | +----------+ | 513588 | | 477759 | | 476746 | | 426892 | +----------+ 4 rows in set (0.00 sec)Just for testing, I have tried modifying EPrints/Database/mysql.pm (function sql_LIKE) :
* changing o return " COLLATE utf8_general_ci LIKE "; * to o return ' LIKE '; Now, queries from advanced search run fast even with truncations. However, I fear that this modification has side effects ! What do you think about it ? Is it safe to go on without COLLATE utf8_general_ci ? (Note: the eprint__rindex table is defined with this same collation) Thanks for your advices, Regards, Gilles Le 21/04/2015 11:41, John Salter a écrit :
Hi Giles, One thing we did on our platform was the change described here:https://github.com/eprints/eprints/issues/120 (see also: http://www.eprints.org/tech.php/18677.html).The description of this issue is around simple searches (when you’re not using Xapian), but I don’t know if it would have any effect on the advanced searches too.In some circumstances, MySQL takes longer to work out the best way to run a query than it would take just running the query inefficiently. This change makes it try less iterations to find the ‘best’ way to run the query.Might be worth a try? Cheers, John*From:*eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] *On Behalf Of *martin.braendle@id.uzh.ch*Sent:* 21 April 2015 10:19 *To:* eprints-tech@ecs.soton.ac.uk *Subject:* [EP-tech] Antwort: Use of truncation in advanced searches Hi Gilles,our repo has about 80'000 records and 56% fulltext, so is comparable to yours.Advanced search of thermograph* in title: immediate (1-2 seconds)documents (full text): 20-30 seconds. The mysql daemon goes up to 70-100% CPU load.Quick search (Xapian): title:thermograph* : immediate thermograph* : immediateWe recommend in our help page (http://www.zora.uzh.ch/help/) that Quick Search should be the tool of choice and only for very precise searches Advanced Search should be used.From a recent debug session (on another issue) I know that EPrints translates behind the scenes an advanced search query into a series of dozens of complicated SQL statements. It might be that for certain cases these are not optimized.If it were that simple asselect distinct ei.eprintid from eprint__rindex ei, eprint e where ei.field='documents' and ei.word like 'thermograph%' and e.eprint_status='archive' and e.eprintid=ei.eprintid;then that query would be answered in a fraction of a second. But it isn't, and can't be, and EPrints software engineers surely have put a lot of effort into the EPrints database engine part to cover all possible situations.Best regards, Martin -- Dr. Martin Brändle Zentrale Informatik Universität Zürich Winterthurerstr. 190 CH-8057 ZürichInactive hide details for Gilles Fournié ---21/04/2015 10:22:27---Hi, I have a question about right-hand truncation in advancedGilles Fournié ---21/04/2015 10:22:27---Hi, I have a question about right-hand truncation in advanced searches.Von: Gilles Fournié <gilles.fournie@cirad.fr <mailto:gilles.fournie@cirad.fr>>An: eprints-tech@ecs.soton.ac.uk <mailto:eprints-tech@ecs.soton.ac.uk> Datum: 21/04/2015 10:22 Betreff: [EP-tech] Use of truncation in advanced searchesGesendet von: eprints-tech-bounces@ecs.soton.ac.uk <mailto:eprints-tech-bounces@ecs.soton.ac.uk>------------------------------------------------------------------------ Hi, I have a question about right-hand truncation in advanced searches. If we search for (in title for example) : thermography the search runs for 1 to 3 seconds before returning results. If we extend our search to : thermography thermographie the search time is about the same. But if we try to use a wildcard : thermograph* the search takes a very long time (counts in minutes) ! Does somebody have experienced such delays ? Any clues about what we can do to solve this problem ? (our archive contains ~ 91000 eprints) Best regards, GF *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech *** Archive: http://www.eprints.org/tech.php/ *** EPrints community wiki: http://wiki.eprints.org/ *** EPrints developers Forum: http://forum.eprints.org/ *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech *** Archive: http://www.eprints.org/tech.php/ *** EPrints community wiki: http://wiki.eprints.org/ *** EPrints developers Forum: http://forum.eprints.org/
- References:
- [EP-tech] Use of truncation in advanced searches
- From: Gilles Fournié <gilles.fournie@cirad.fr>
- [EP-tech] Antwort: Use of truncation in advanced searches
- From: martin.braendle@id.uzh.ch
- [EP-tech] Re: Antwort: Use of truncation in advanced searches
- From: John Salter <J.Salter@leeds.ac.uk>
- [EP-tech] Use of truncation in advanced searches
- Prev by Date: [EP-tech] Re: upgrading IRStats to IRStats2
- Next by Date: [EP-tech] Shibboleth/SAML and ePrints 3.3.x
- Previous by thread: [EP-tech] Re: Antwort: Use of truncation in advanced searches
- Next by thread: [EP-tech] Shibboleth/SAML and ePrints 3.3.x
- Index(es):