EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09792



Re: [EP-tech] Ask about search result and reindex


CAUTION: This e-mail originated outside the University of Southampton.

Hi,

 

for Xapian, there are instructions and a script to check and repair indexing on

 

https://github.com/eprintsug/repairXapianIndex

 

As you may know, we don’t have Xapian in use anymore due to various reasons, see https://www.eprints.org/eptech/msg08667.html

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of David R Newman <drn@ecs.soton.ac.uk>
Date: Thursday, 25 July 2024 at 10:11
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>, Agung Prasetyo W. <prazetyo@gmail.com>
Subject: Re: [EP-tech] Ask about search result and reindex

Hi Agung,

I am not sure whether the search you are doing is a database index search or a Xapian index search.  It would be useful if you could provide a link to your search page, so I can go and take a look.  If you are using Xapian search, then the advice I am giving below is not applicable.

The numbers that come back in that result from the script I provided are only those where at least one of the fields (title, abstract or creators_name) has no terms indexed despite those fields not being empty in the record itself.  Typically my experience has been that an item does not get indexed at all or the indexing dies half way through, so the script is relatively effective at spotting where records are not indexed properly.  However, this script is not a perfect check for this.  So there is a good chance that some of the eprint IDs you get back may be indexed for keywords but maybe not creators.

The point of the results I get back is that I can then look at the database's eprint__rindex table and see what is wrong.  In the case of 1101, I would run the following database query:

SELECT COUNT(*), field FROM eprint__rindex WHERE eprintid = "1101" GROUP BY field;

If I got no results back, I would know that the item is not indexed at all but I might discover that for example the creators_name field is not indexed.  Either way the next thing I would probably do is run epadmin reindex on this item and see if the results of the above database query had changed afterwards.
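The check-reindex-recheck step above can be sketched as a small shell helper. This is only an illustration: the function name rindex_count_query is hypothetical and not part of EPrints, and it simply prints the query quoted above for a given eprint ID so that the same query can be run before and after epadmin reindex.

```shell
# Hypothetical helper (not part of EPrints): print the per-field index
# count query from the message above for a single eprint ID.
rindex_count_query() {
    case $1 in
        # Reject empty or non-numeric input so nothing unexpected ends up in the SQL.
        ''|*[!0-9]*) echo "eprint ID must be numeric: $1" >&2; return 1 ;;
    esac
    printf 'SELECT COUNT(*), field FROM eprint__rindex WHERE eprintid = %s GROUP BY field;\n' "$1"
}
```

The output could then be piped to the mysql client, for example rindex_count_query 1101 | mysql -u USERNAME -pPASSWORD DBNAME, where USERNAME, PASSWORD and DBNAME are placeholders for the values in your archive's database configuration.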

For items that you cannot find in search when you use an appropriate search term that should find them, it is worth doing a similar eprint__rindex query:

SELECT * FROM eprint__rindex WHERE eprintid = "6789" AND field = "creators_name";

This would then show whether the names "Rahmat" and "Setyo" are indexed against this field.  One issue is that they may have been stemmed to shorter words but I don't think that will be the case here.  If you find both these names in the result, then the index is not the issue.

Next, I would check the metadata_visibility field.  You can do this by looking at the eprint view page, which under the Details tab lists all the stages of the workflow and also an "Other defined fields" section.  If "Metadata Visibility" does not say "Always Show", then the item will never appear in search results.  If the issue is the metadata visibility, then you would need to figure out why that has happened, but that can be quite complex, so I will not go into detail unless this is confirmed as the issue.
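The same check can also be made directly against the database. The sketch below is an assumption-laden illustration: visibility_query is a hypothetical helper name, and it assumes the field is stored as a metadata_visibility column on the eprint table, which you should confirm against your own schema. It only prints a query for the given eprint IDs.

```shell
# Hypothetical helper (not part of EPrints): print a query showing the
# metadata_visibility value for one or more eprint IDs.
# Assumes an "eprint" table with "eprintid" and "metadata_visibility" columns.
visibility_query() {
    ids=
    for id in "$@"; do
        case $id in
            ''|*[!0-9]*) echo "skipping non-numeric ID: $id" >&2; continue ;;
        esac
        # Append to the comma-separated list, adding ", " only between entries.
        ids=${ids:+$ids, }$id
    done
    # Nothing valid was supplied, so there is no query to print.
    [ -n "$ids" ] || return 1
    printf 'SELECT eprintid, metadata_visibility FROM eprint WHERE eprintid IN (%s);\n' "$ids"
}
```

For example, visibility_query 6789 1101 prints a single query covering both items, which can be pasted into a database client.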

Regards

David Newman

On 25/07/2024 8:39 am, Agung Prasetyo W. wrote:

Hi David,

 

I ran the script you just provided, and the results are as follows.
eprints@repo16-04:~$ cat archives/[archive_ID]/var/eprint_rindex_unindexed.txt
1101
7238
7417
7488

However, when I searched for one of the keywords in item 1101, it did appear in the search results. Meanwhile, regarding what I asked yesterday, there were several items that I was looking for that did not appear in the search, for example item ID 6789, whose author is named "Rahmat Setyo". When I searched from search it did not appear, but when searching by author, the item with the author is available.

Next, I ran the "epadmin reindex" command on item ID 6789. When the epadmin reindex process was complete, when I searched for the author "Rahmat Setyo", the data appeared in the search results.

 

Thank you.

 

Regards,

Agung PW

 

On Wed, 24 Jul 2024 at 23:34, David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi Agung,

I have made some improvements to the script at:

http://files.eprints.org/3065/

Here are the installation/usage instructions:

Download the Bash script and run as follows to check that all eprint records in the live archive have titles, abstracts and creators indexed (if they exist for that record):

  ./find_eprint_rindex_unindexed

If your EPrints installation's archives are not under /opt/eprints3/archives then specify with -p flag:

  ./find_eprint_rindex_unindexed -p /usr/share/eprints/archives

If you want to check a specific archive rather than the first one the script finds then specify -a flag:

  ./find_eprint_rindex_unindexed -a my_archive

Results are output to the following file, or run with the -v flag to output to the screen:

  EPRINTS_PATH/archives/ARCHIVE_ID/var/eprint_rindex_unindexed.txt

If you have un-indexed results you want to ignore you can provide a new line separated list of these in:

  EPRINTS_PATH/archives/ARCHIVE_ID/var/ignore_eprint_rindex_unindexed.txt
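Once the results file exists, its IDs can be turned into reindex commands. The following is a minimal sketch, not part of the script above: reindex_commands is a hypothetical helper that reads one eprint ID per line from standard input and prints an "epadmin reindex ARCHIVE_ID eprint ID" command for each, in the form used elsewhere in this thread. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Hypothetical helper: print one "epadmin reindex" command per eprint ID
# read from standard input. Blank and non-numeric lines are skipped.
# Usage: reindex_commands ARCHIVE_ID < eprint_rindex_unindexed.txt
reindex_commands() {
    archive=$1
    # The "|| [ -n ... ]" part handles a final line with no trailing newline.
    while IFS= read -r id || [ -n "$id" ]; do
        case $id in
            ''|*[!0-9]*) continue ;;
        esac
        printf 'epadmin reindex %s eprint %s\n' "$archive" "$id"
    done
}
```

After checking the printed commands, they could be piped to sh as the EPrints user, assuming epadmin is on that user's PATH (otherwise substitute the full path to it).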

Regards

 

David Newman

 

On 24/07/2024 12:05, David R Newman wrote:

Did you specify the ARCHIVE ID as a parameter in the command:

./find_eprint_rindex_unindexed ARCHIVE_ID

Did you make sure you updated EP_PATH in the script to match your EPrints path if this is not /opt/eprints3?

Did you update USER_PASS to the username and password for your EPrints database?  The default assumes that the root user can access the database without the need for a password.  You will probably need to change:

USER_PASS="-u root"

To something like:

USER_PASS="-u USERNAME -pPASSWORD"

Where USERNAME is $c->{dbuser} and PASSWORD is $c->{dbpass} in your archive's cfg/cfg.d/database.pl.  

I could probably improve the script to get it to pull these out by default when looking up the database name, which it already does by grabbing dbname from this file.
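One way such a lookup might work is sketched below. This is not how the script at files.eprints.org does it; db_setting is a hypothetical helper, and it assumes each setting in database.pl is a single-line assignment of a quoted string, such as $c->{dbuser} = 'eprints';. Values that themselves contain quote characters would not be handled.

```shell
# Hypothetical helper: print the quoted value assigned to $c->{KEY}
# in a database.pl-style file (first match only).
# Usage: db_setting KEY FILE
db_setting() {
    key=$1
    file=$2
    # Match lines like:  $c->{dbuser} = 'value';  or  $c->{dbuser} = "value";
    sed -n "s/^[[:space:]]*\\\$c->{$key}[[:space:]]*=[[:space:]]*['\"]\\([^'\"]*\\)['\"].*/\\1/p" "$file" | head -n 1
}
```

With CFG set to the path of your archive's cfg/cfg.d/database.pl, USER_PASS could then be built as USER_PASS="-u $(db_setting dbuser "$CFG") -p$(db_setting dbpass "$CFG")".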

Regards

David Newman

On 24/07/2024 11:49, Agung Prasetyo W. wrote:

Hi David,

 

How do I know whether we use the EPrints database or Xapian? After I ran your script, it showed nothing. When I opened the file /var/eprint_rindex_unindexed.txt, it showed the following:

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

 

Is my step wrong ??

 

Thank you.

 

Regards,

Agung PW

 

 

 

 

 

On Wed, 24 Jul 2024 at 17:22, David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi Agung,

If you are using the database table (i.e. eprint__rindex), then I wrote the following (rather hacky) Bash script to test this:

https://files.eprints.org/3065/

The script will ignore items whose metadata visibility is not set to show.  It is worth manually checking your database for items you expect to be able to find in search but cannot, to see if the metadata_visibility field has been changed.  If you create new versions of items, this will automatically set the current (now old) version to hide.  (This is a far from ideal situation, but it is quite difficult to determine a better way to ensure users only find the latest versions, especially when the "New Version" button gets used in the wrong circumstances.)

If you are using a Xapian index (typically used for simple search), then I did write a different script for this, but it is a lot more complex to deploy.

Regards

David Newman

On 24/07/2024 10:51, Agung Prasetyo W. wrote:

Hi,

Sometimes there are items that don't appear when I do a search, even though they are in the repository. But after I run the command: epadmin reindex [archive_id] eprint [item_id]
these items do appear in search results.

Is there a way to find out the item IDs that have not been indexed so that we can reindex the item IDs?

 

Thank you.

 

Regards,

Agung Prasetyo W.



*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/
 


