EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #06556
Re: [EP-tech] Plural words in search results
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] Plural words in search results
- From: John Salter <J.Salter@leeds.ac.uk>
- Date: Fri, 2 Jun 2017 07:38:14 +0000
The 'extract_words' function does this. Anything in all caps it treats as an acronym - and doesn't strip trailing 's' from the end.

I'm not sure if there's a sensible way round this - unless you want to somehow treat all keywords as non-acronyms (and lowercase them all before indexing)?

Cheers,
John

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Matthew Brady

Hi John,
Thanks for the feedback… I think I was involved in the discussions years ago about the simple search etc… This particular issue appears to be more subtle.

It removes the trailing ‘s’ from the keywords (unless it is all caps).

+----------+--------------+
| Keyword  | Indexed term |
+----------+--------------+
| platypus | platypu      |
| Platypus | platypu      |
| PLATYPUS | platypus     |
+----------+--------------+

It removes the trailing ‘s’ from the search term. But if the keywords are entered in all caps, it doesn’t remove the ‘s’. Then the search fails for that item (as it has removed the ‘s’ from the search term).

Search expression: ‘platypus’
Search performed using: ‘platypu’

+----------+--------------+-------+
| Keyword  | Indexed term | Match |
+----------+--------------+-------+
| platypus | platypu      | Yes   |
| Platypus | platypu      | Yes   |
| PLATYPUS | platypus     | No    |
+----------+--------------+-------+

I will dig into it, and let you know if there is more patching required ;)

Cheers
Matt.

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of John Salter

Hi Matt,

I've looked into a similar issue in the past - and I think it was discussed on the tech list a few years ago. I had added a fix (we're on 3.3.10 too) - which was recently discovered to break things in a more subtle way.

If I remember the full story, it goes something like this:

The 'simple' search field is broken in vanilla EPrints 3.3.10 - as it doesn't strip out short words. The fix for this initially was to run the search terms through the $c->{extract_words} function (in cfg.d/indexer.pl). This seemed to resolve the issue (we'd been running it like this for a few years), but we discovered that for a search field looking at multiple metafield types (e.g. a text field and a name field), if the search term ended in -ss it wouldn't find anything.

My current fix is: https://gist.github.com/jesusbagpuss/e096430c825d34a2ef1de671e8a7dfda

Both are 'patch' files (they overwrite methods in the core EPrints modules - we try to keep these things separated - but you could just take the methods and edit the files they're patching directly). There are two files - one resolves an issue with apostrophes in names (which may or may not affect you).

The issue you report is slightly different to the one we found - but I think the cause might be very similar - the stripping of a trailing 's' is applied during indexing, but the same is not applied when searching.

Hope that gets you somewhere - some of this stuff is fairly recent in my mind (fixing the fix took a bit of tracing through the modules) - there may be more useful stuff I have in my head!

Cheers,
John

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Matthew Brady

Hi All,

One of our users came across a problem when performing some keyword searches… and assumed it was a case problem, since the all uppercase words in their testing weren’t returning in the result set.

After testing, I have a preliminary diagnosis (we are running 3.3.10 if it makes a difference).

It appears the index process is removing the ‘s’ off the end of the word (unless the word is all caps). When performing a search, the system removes the ‘s’ from the search term, and performs the search… in our case this returns 2 of 3 test records.

When I took the last two letters off each eprint’s keywords, and then performed a search, it returned all three records in the results.

+----------+----------+---------------+--------------------------+
|  <- details from eprint__rindex ->  |<- eprint.keywords field->|
+----------+----------+---------------+--------------------------+
| eprintid | field    | word          | keywords                 |
+----------+----------+---------------+--------------------------+
|    29533 | keywords | ornithorhynch | ornithorhynch            |
|    29534 | keywords | ornithorhynch | Ornithorhynch            |
|    29535 | keywords | ornithorhynch | ORNITHORHYNCH            |
+----------+----------+---------------+--------------------------+

The plural determination holds true for the humble Platypus as well :(

+----------+----------+-----------------+---------------------------+
| eprintid | field    | word            | keywords                  |
+----------+----------+-----------------+---------------------------+
|    29533 | keywords | ornithorhynchu  | ornithorhynchus, platypus |
|    29533 | keywords | platypu         | ornithorhynchus, platypus |
|    29534 | keywords | ornithorhynchu  | Ornithorhynchus, Platypus |
|    29534 | keywords | platypu         | Ornithorhynchus, Platypus |
|    29535 | keywords | ornithorhynchus | ORNITHORHYNCHUS, PLATYPUS |
|    29535 | keywords | platypus        | ORNITHORHYNCHUS, PLATYPUS |
+----------+----------+-----------------+---------------------------+

Cheers
Matt.
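
The mismatch described in this thread can be modelled with a short sketch. This is illustrative Python, not the EPrints Perl code: the names index_term, search_term and matches are hypothetical, and the rules are inferred only from the tables in this thread (lowercase the word and strip one trailing 's', unless the keyword was entered in all caps). The thread only reports a lowercase search term, so how other search input is handled is an assumption.

```python
def index_term(keyword: str) -> str:
    """Model of the indexed term for a keyword, as reported in the thread.

    Assumption: an all-caps keyword is treated as an acronym, so it is
    lowercased but keeps its trailing 's'.
    """
    if keyword.isupper():
        return keyword.lower()
    term = keyword.lower()
    return term[:-1] if term.endswith("s") else term


def search_term(query: str) -> str:
    """Model of the term actually searched for.

    Assumption: the trailing 's' is always stripped from the search term,
    which is what the thread reports for the query 'platypus'.
    """
    term = query.lower()
    return term[:-1] if term.endswith("s") else term


def matches(keyword: str, query: str) -> bool:
    """True when the indexed form of the keyword equals the searched term."""
    return index_term(keyword) == search_term(query)


if __name__ == "__main__":
    for keyword in ("platypus", "Platypus", "PLATYPUS"):
        print(keyword, index_term(keyword), matches(keyword, "platypus"))

    # Workaround suggested in the thread: lowercase keywords before indexing.
    print(matches("PLATYPUS".lower(), "platypus"))
```

Under these assumptions, lowercasing the keyword before indexing makes the all-caps record match again, at the cost of no longer treating any keyword as an acronym.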
_____________________________________________________________
This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.
The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.
The University of Southern Queensland is a registered provider of education with the Australian Government.
(CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
- References:
- [EP-tech] Plural words in search results
- From: Matthew Brady <Matthew.Brady@usq.edu.au>
- Re: [EP-tech] Plural words in search results
- From: John Salter <J.Salter@leeds.ac.uk>
- Re: [EP-tech] Plural words in search results
- From: Matthew Brady <Matthew.Brady@usq.edu.au>
- [EP-tech] Plural words in search results