EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #09578
Re: [EP-tech] Indexing - cleanup indexed terms after mass deletions
- To: <eprints-tech@ecs.soton.ac.uk>, Matthew Kerwin <matthew@kerwin.net.au>
- Subject: Re: [EP-tech] Indexing - cleanup indexed terms after mass deletions
- From: David R Newman <drn@ecs.soton.ac.uk>
- Date: Tue, 30 Jan 2024 09:21:13 +0000
Hi Matt,
Batch edit is sometimes a law unto itself. I think the following script will allow you to delete index entries for a data object in any dataset:
#!/usr/bin/perl -w
############################################################################
#
# Remove Data Object from Index and Ordervalues
#
# Usage: ./remove_index <ARCHIVE_ID> <DATASET_ID> <DATAOBJ_ID>
#
############################################################################
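use strict;
use FindBin;
# Assumes this script lives in the archive's bin directory.
use lib "$FindBin::Bin/../../../perl_lib";
use EPrints;
die "Usage: $0 <ARCHIVE_ID> <DATASET_ID> <DATAOBJ_ID>\n" unless @ARGV == 3;
my( $repoid, $datasetid, $dataobjid ) = @ARGV;
my $session = EPrints::Session->new( 1, $repoid, 1 );
die "Failed to load archive: $repoid\n" unless defined $session;
my $dataset = $session->dataset( $datasetid );
die "Unknown dataset: $datasetid\n" unless defined $dataset;
# Remove the object's entries from the index tables, then its ordervalues rows.
EPrints::Index::remove_all( $session, $dataset, $dataobjid );
EPrints::Index::delete_ordervalues( $session, $dataset, $dataobjid );
$session->terminate;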
This script assumes it has been added to the bin directory of your archive; if it is elsewhere, you may need to adjust the FindBin-relative path to perl_lib. Currently the script can only remove the index entries for one data object at a time, but it could easily be modified to iterate through a list.
EPrints::Index::remove_all removes all of the data object's indexed fields from the DATASET__rindex and DATASET__index_grep tables. EPrints::Index::delete_ordervalues removes the data object's records from the DATASET__ordervalues_LANG tables. This script will not touch the DATASET__index table, but more recently (at least since 3.4, if not earlier) this table has not been used.
I would advise you to stop the EPrints indexer before running this script, although in theory, if you have InnoDB tables, it should be able to cope with multiple processes modifying the index tables.
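As a rough sketch of the list variant mentioned above (not tested against a live repository), the following reads one data object ID per line from standard input and applies the same two calls to each. The script name remove_index_list and the helper read_ids are made up for illustration; the EPrints calls are the ones used in the single-object script.

```perl
#!/usr/bin/perl -w
# Sketch: remove index and ordervalues entries for a list of data objects.
# Usage: ./remove_index_list <ARCHIVE_ID> <DATASET_ID> < ids.txt
use strict;
use warnings;
use FindBin;
use lib "$FindBin::Bin/../../../perl_lib";

# Hypothetical helper: return the IDs found one per line on $fh,
# ignoring blank lines and anything that is not a plain integer.
sub read_ids
{
    my( $fh ) = @_;
    my @ids;
    while( my $line = <$fh> )
    {
        push @ids, $1 if $line =~ /^\s*(\d+)\s*$/;
    }
    return @ids;
}

sub main
{
    my( $repoid, $datasetid ) = @_;
    require EPrints;    # loaded here so read_ids can be reused without EPrints
    my $session = EPrints::Session->new( 1, $repoid, 1 );
    die "Failed to load archive: $repoid\n" unless defined $session;
    my $dataset = $session->dataset( $datasetid );
    die "Unknown dataset: $datasetid\n" unless defined $dataset;
    foreach my $dataobjid ( read_ids( \*STDIN ) )
    {
        # Same two calls as the single-object script.
        EPrints::Index::remove_all( $session, $dataset, $dataobjid );
        EPrints::Index::delete_ordervalues( $session, $dataset, $dataobjid );
    }
    $session->terminate;
}

if( @ARGV == 2 ) { main( @ARGV ) }
else { warn "Usage: $0 <ARCHIVE_ID> <DATASET_ID> < ids.txt\n" }
```

Reading the IDs from standard input keeps the script independent of how the list was produced, for example a text file of deleted eprint IDs.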
Regards
David Newman
On 30/01/2024 3:46 am, Matthew Kerwin wrote:
> CAUTION: This e-mail originated outside the University of Southampton.
> Hi Matt,
> On Tue, 30 Jan 2024 at 09:31, Matthew Brady <Matthew.Brady@unisq.edu.au> wrote:
>> Hi All,
>> Our original repo, houses traditional outputs (Articles, conference papers etc.) as well as Theses…
>> We have split the Theses into a dedicated repo, cloning the original system (metadata and files), and then removed the non-theses (search->batch edit->remove all records).
>> I have noticed that there are entries in the various database index tables, referring to eprints that are no longer in the system…
>> I have run epadmin reindex over ‘<repo> eprint’ and ‘<repo> document’, but the indexed values persist…
>> e.g. eprint__index contains a fieldword = ‘title:elephant’ with ids = ‘:12345:’ but there is no eprint 12345 in the system any longer.
>> I thought the permanent removal of the non-theses items would have cleaned up the index tables as process occurred?
>> Any thoughts appreciated.
>> Cheers,
>> Matt
> In this particular case, is the 'title:elephant' associated with any of your theses, or _only_ with deleted records? Because if it's the latter, then the row is orphaned – it has no inward referential links – so any reindexing task that is built around "foreach(eprint)" rather than "foreach(tablerow)" won't even see the row in question, so won't know to clean it up.
> We should probably have a look at the remove/delete routines and see how deep they go into cleaning up index tables, filesystem directories, view pages, etc. Off the top of my head I don't know at all, I'm afraid. I assume "not very deep."
> For what it's worth, in moments of questionable judgement I have purged our repository's various _index, _rindex, and _orderval tables and triggered the appropriate reindexing/reordering tasks manually. It doesn't seem to have caused any problems after the fact.
> Cheers
> --
> Matthew Kerwin
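The "foreach(tablerow)" idea in the quoted message can be sketched as a check driven by the index table rather than by the eprints. The script below is only an illustration under stated assumptions: it uses plain DBI instead of the EPrints API, it assumes eprint__rindex has an eprintid column (worth confirming against your own schema), and the script name and the helper orphaned_ids are made up.

```perl
#!/usr/bin/perl
# Sketch: list IDs that appear in an index table but not in the eprint table.
# Usage: ./find_orphans <DSN> <USER> <PASSWORD>
use strict;
use warnings;

# Return the IDs in @$indexed that are absent from @$live,
# de-duplicated and sorted numerically.
sub orphaned_ids
{
    my( $indexed, $live ) = @_;
    my %is_live = map { $_ => 1 } @$live;
    my %seen;
    return sort { $a <=> $b }
        grep { !$is_live{$_} && !$seen{$_}++ } @$indexed;
}

sub main
{
    my( $dsn, $user, $pass ) = @_;
    require DBI;
    my $dbh = DBI->connect( $dsn, $user, $pass, { RaiseError => 1 } );
    # ASSUMPTION: eprint__rindex is keyed by an eprintid column.
    my $indexed = $dbh->selectcol_arrayref(
        "SELECT DISTINCT eprintid FROM eprint__rindex" );
    my $live = $dbh->selectcol_arrayref( "SELECT eprintid FROM eprint" );
    print "$_\n" for orphaned_ids( $indexed, $live );
    $dbh->disconnect;
}

if( @ARGV == 3 ) { main( @ARGV ) }
else { warn "Usage: $0 <DSN> <USER> <PASSWORD>\n" }
```

The output is one orphaned ID per line, which could then be passed one at a time as the DATAOBJ_ID argument of the remove_index script above. Taking the password on the command line is for brevity only.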