EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09987



RE: [EP-tech] Bulk import plugin


CAUTION: This e-mail originated outside the University of Southampton.

Thanks for this, Yuri

 

Yes, ultimately it comes down to field-mapping. But the way it is implemented is also quite important, to prevent error messages and aborted runs.

 

For example,

  1. I am editing RIS.pm in detail to associate RIS fields with Eprints fields – that is what you are highlighting
  2. I am dealing with how the RIS file is presented from Windows, to prevent spurious fields from stopping the process. (EndNote sneakily inserts an ID number in the data export that is not in the templates but is a default. It is the ID of the record in EndNote, but EPrints thinks it is the ID of a former copy of the record in the EPrints repository, cannot find it, and so the import fails.) Also, Windows adds a BOM and CR/LF end-of-line confusions that can easily be checked for and removed in RIS.pm, which I have added there, so RIS.pm doesn’t mind whether the file originates from Windows or Unix.
  3. I am editing /cfg/cfg.d/eprint_fields_default.pl to set defaults for mandatory fields that do not exist in my raw data, but their value is clear and constant, due to the nature of the data (refereed, published, and so on). Perhaps these would better be dealt with in RIS.pm, I am not sure.
  4. I spent a lot of time trying to find out how to edit the workflow in Eprints user interface to achieve this import from there, before I realised that a command line in SSH would be much easier. I would like to set it up as a workflow, but I have not got that far into the system yet.
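
The clean-up described in item 2 was done inside RIS.pm; the same idea can also be sketched as a pre-processing step on the command line. This is only an illustration, not the code added to RIS.pm. It assumes GNU sed, and it assumes the unwanted EndNote record number is exported under the RIS "ID" tag; clean_ris is a hypothetical helper name.

```shell
# clean_ris FILE
# Hypothetical helper: prints a normalised copy of an RIS export.
# Assumes GNU sed (for the \xEF and \r escapes) and that the EndNote
# record number appears under the RIS "ID" tag.
clean_ris() {
    LC_ALL=C sed \
        -e '1s/^\xEF\xBB\xBF//' \
        -e 's/\r$//' \
        -e '/^ID  - /d' \
        "$1"
}

# Usage (file names are placeholders):
#   clean_ris export_from_endnote.ris > export_clean.ris
```

The three expressions strip a UTF-8 BOM from the first line, drop the carriage return from CR/LF line endings, and delete the ID lines. The original file is left untouched.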

 

So, yes, there is a bit more to it than merely mapping fields from one format to another. However, if we could make this into a simple mapping exercise so that there was a datafile with the mapping in it, based on a list of EPrints fields, that would make life a lot easier as BibTeX, RIS and the rest develop and change in their own cycles of development (as David points out). EndNote has this sorted quite well, being able to present any field mapping alongside its own list of fields. I realise that this is not really a prime mission of EPrints, which could be why it feels like it is not as clear cut as it might otherwise be. However, in University repositories, what happens when an academic presents a long list of their papers to be included in the repository? What format would they be asked for, I wonder?
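
To make the "datafile with the mapping in it" idea concrete, here is a rough sketch of what such a file and a reader for it might look like. Nothing like this exists in EPrints as far as this thread shows: the file layout (one "RISTAG fieldname" pair per line), the helper name map_ris_tags and the example field names are all assumptions made for illustration.

```shell
# map_ris_tags MAPFILE RISFILE
# Hypothetical helper: MAPFILE holds lines such as "TI title".
# For every RIS line whose two-letter tag is listed in MAPFILE, print
# "fieldname: value"; lines with unmapped tags are skipped.
map_ris_tags() {
    awk '
        NR == FNR { map[$1] = $2; next }    # first file: load the mapping
        {
            tag = substr($0, 1, 2)          # RIS lines look like "TI  - value"
            if (tag in map)
                print map[tag] ": " substr($0, 7)
        }
    ' "$1" "$2"
}

# Usage (file names are placeholders):
#   map_ris_tags ris_to_eprints.map export_clean.ris
```

Keeping the pairs in a plain file means a change to the RIS side only needs a one-line edit, which is the convenience being asked for; a real import plugin would still have to handle multi-valued and compound fields.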

 

Best wishes

 

Will

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of Yuri
Sent: 17 February 2025 11:18
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Bulk import plugin

 


What people usually want here is to customize the mapping (or transposition). So the Import plugins could stay in core (*) but support customization of the mapping function, and people who need different use cases would just have to customize it in the repository cfg.d directory ($c->{ris2eprints_mapper} = sub ...). In my experience, the import/export format support itself is rarely customized. I think many EPrints Import/Export plugins define a mapping function that just does the mapping.

That said, copying a plugin, renaming it and changing something is quite simple. It would be good to be able to copy the plugin into the repository cfg.d dir, so that the repository has its own plugin implementation (with no need to modify the EPrints code itself).

(*) I see I/E plugins in core as "these formats are best practice in the EPrints world"

On 17/02/25 11:47, David R Newman wrote:

Hi Will,

Import plugins really are something that would probably be best maintained outside of the main codebase, as the transposition they provide changes on a different schedule to EPrints releases, whenever the specifications for BibTeX, EndNote, RIS, etc. change.  That is something for our development team to consider.

Inevitably for 20,000 items the import will take some time.  It sounds like you did not do this in screen.  Therefore, if you want this to keep running you will need to type Ctrl+z and then bg %1 on the command line before you can log out and still let the import continue. (It is probably %1, but the number is whatever appears inside the square brackets next to Stopped when you press Ctrl+z.)
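
For a future run, an alternative to Ctrl+z and bg is to start the import detached from the outset. The sketch below uses nohup rather than screen or job control, so it is a different technique from the one described above; run_detached is a hypothetical helper name and the log file name is a placeholder.

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background under nohup so a
# hangup at logout does not stop it, sends its output to LOGFILE and
# prints the process ID so it can be checked on later.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}

# Usage, with the placeholders used elsewhere in this thread:
#   run_detached import.log EPRINTS_PATH/bin/import ARCHIVE_ID eprint BibTeX metadata.bib
#   tail -f import.log
```

Running the import inside screen gives a similar result and also lets you reattach to the session to watch progress.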

Regards

David Newman

On 17/02/2025 09:14, Will Hughes wrote:

David

 

Wow, that was a struggle but I got it working! Thanks for the pointers. I ended up using RIS format from EndNote as it seems more transparent to me and I'm used to it. The RIS.pm plugin provided in the installation was flaky and threw lots of errors. I've edited a lot and it might be worth uploading it for future users as the existing one is probably better suited to an older version. Shall I send it through to you when I've tidied it up?

 

The next step is to find a tidy way to deal with default values for some fields, and changes to some required fields to make them optional so that I don't need them in the data.

 

I see what you mean about the amount of time it takes to ingest a batch of items this way. It is remarkably slow!  But still much better than processing them by hand. Do I understand correctly that, if I set it to work and then log out of my user account on the server, it will just keep cooking in the background? That would be cool.

 

Best wishes 

 

Will

____


From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Will Hughes <w.p.hughes@reading.ac.uk>
Sent: Saturday, February 15, 2025 8:09:59 PM
To: David R Newman <drn@ecs.soton.ac.uk>; eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Bulk import plugin

 

Excellent, thank you! I shall try this tomorrow and let you know 

 

Thanks again

 

Best wishes 

 

Will

____


From: David R Newman <drn@ecs.soton.ac.uk>
Sent: Saturday, February 15, 2025 7:54:27 PM
To: Will Hughes <w.p.hughes@reading.ac.uk>; eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Bulk import plugin

 

Hi Will,

You can do a similar bulk import for EndNote or BibTeX by modifying the import command I provided in my earlier email:

EPRINTS_PATH/bin/import ARCHIVE_ID eprint BibTeX metadata.bib

EPRINTS_PATH/bin/import ARCHIVE_ID eprint EndNote metadata.enl

As I said before, you will need to add the skip_buffer setting to a config file and set a --user argument in the command to import as a specific user. It may be worth reviewing which BibTeX/EndNote attributes are supported by the versions you have of the following files (assuming you are running the EPrints 3.4.x series):

EPRINTS_PATH/flavours/pub_lib/plugins/EPrints/Plugin/Import/EndNote.pm
EPRINTS_PATH/flavours/pub_lib/plugins/EPrints/Plugin/Import/BibTeX.pm

under their "sub convert_input" functions.
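
Putting the pieces from this thread together (the skip_buffer setting added temporarily, plus a --user argument), a run might be wrapped roughly as follows. This is a sketch only: bulk_import is a hypothetical helper, the username is a placeholder, and the location of cfg.d under EPRINTS_PATH/archives/ARCHIVE_ID/ is an assumption about the installation layout.

```shell
# bulk_import EPRINTS_PATH ARCHIVE_ID USERNAME PLUGIN FILE
# Hypothetical wrapper: writes a temporary z_skip_buffer.pl, runs
# bin/import as the given user, then removes the file again.
# Assumes the archive config lives in EPRINTS_PATH/archives/ARCHIVE_ID/cfg/cfg.d/
# and that no file called z_skip_buffer.pl already exists there.
bulk_import() {
    cfg="$1/archives/$2/cfg/cfg.d/z_skip_buffer.pl"
    printf '%s\n' '$c->{skip_buffer} = 1;' > "$cfg"
    "$1/bin/import" "$2" --user "$3" eprint "$4" "$5"
    status=$?
    rm -f "$cfg"            # the setting was only meant to be temporary
    return "$status"
}

# Usage (all values are placeholders):
#   bulk_import EPRINTS_PATH ARCHIVE_ID import_user BibTeX metadata.bib
```

Removing the file afterwards means later deposits go through the review buffer as usual.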

Regards

David Newman

On 15/02/2025 7:37 pm, Will Hughes wrote:

Hi, David

 

Thanks for the quick response. No, I'm not moving them from one EPrints repository to another. I am moving them from an entirely different source. Currently, I have everything in EndNote and can export as BibTeX successfully. And this is just metadata, not a repository as such. I am providing a metadata database with URLs to the original papers and theses. This is a specialist subject database for a research association, not an institutional repository.

 

Sorry, I should have mentioned this in the original question!

 

Best wishes 

 

Will

____


From: David R Newman <drn@ecs.soton.ac.uk>
Sent: Saturday, February 15, 2025 7:00:18 PM
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>; Will Hughes <w.p.hughes@reading.ac.uk>
Subject: Re: [EP-tech] Bulk import plugin

 

Hi Will,

I am going to assume this is 20,000 records currently in an EPrints repository you want to transfer to a new/different EPrints repository.  If that is not the case please let me know what format you currently have for these records you want to import.

Exporting the existing records from your old EPrints repository should entail carrying out an (admin menu) EPrint search (for presumably all items in the live archive) and then an export as "EP3 XML with Files Embedded".  If you have big files (e.g. videos), as long as all the files you want to import are currently publicly accessible on the old EPrints repository, you can choose the EP3 XML export.

Importing is most easily/efficiently done from the (SSH) command line of the new EPrints repository server.  First, copy the export file generated above to that server.  Next, you need to run the following command to import the records (substituting EPRINTS_PATH, ARCHIVE_ID and OLD_ARCHIVE_ID as appropriate):

EPRINTS_PATH/bin/import ARCHIVE_ID --enable-file-imports --enable-web-imports eprint XML export_OLD_ARCHIVE_ID_XMLFiles.xml

However, these will be imported into the review buffer rather than the live archive, so you need to (temporarily) add the following to a configuration file in your new archive's cfg/cfg.d/ directory (e.g. z_skip_buffer.pl):

$c->{skip_buffer} = 1;

For more information about the import command see:

https://wiki.eprints.org/w/API:bin/import

In particular, you may want to set a user to import these records.  I would advise creating a special user for this, as having 20,000 records under a user account you regularly use to manage deposits will make the Manage Deposits page less responsive: it has to evaluate all 20,000 records to determine which to show on the first page.

Regards

David Newman

On 15/02/2025 6:13 pm, Will Hughes wrote:

Hi

 

With a new installation I am finding my way around the software. I am looking for the functionality to import records in bulk, straight to the repository.

 

I understand that there is or was a plugin for bulk import, but I cannot find it anywhere. What I want to do is to bring in 20,000 records in a way that makes them immediately live. Is there a plugin that can be fired up from the website, or is this a command line interface kind of thing?

 

Any suggestions welcome

 

Thanks

 

Best wishes

 

Will   

 

Will Hughes

Emeritus Professor of Construction Management and Economics

School of the Built Environment     

University of Reading, PO Box 219, Whiteknights

Reading, RG6 6DF, UK

 



*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/
 

 

 


