EPrints Technical Mailing List Archive

Message: #07656

[EP-tech] Antwort: Thesis Bulk Upload/Import

To: <eprints-tech@ecs.soton.ac.uk>, James Kerwin <jkerwin2101@gmail.com>
Subject: [EP-tech] Antwort: Thesis Bulk Upload/Import
From: <martin.braendle@id.uzh.ch>
Date: Thu, 17 Jan 2019 15:02:21 +0100

Hi James,

We recently imported about 3000 metadata records and PDFs from the Swiss National Licences programme into our repository and attached a further 2000 or so PDFs to existing metadata records.
I'm currently working on importing about 4000 e-theses (metadata + PDF), to be followed by 60,000 metadata records of print theses of the University of Zurich (going back into the 19th century) from UZH's library system Aleph. This will increase the current size of our repo by 50%.

1) The biggest pro of having all documents in one repo is findability: you don't want users to have to search several times in different repos.
The con is that if one does not have the full text for some records (as with the print theses above), the overall full-text and OA ratios may be diluted.

2) This was answered by David Newman. Be aware that the code by Neugebauer and Han for ingesting documents is not up to date and did not work on an EPrints 3.3 repository; I had to learn that the hard way. If you need code samples, let me know.

3) There may be no such thing as a preferred or ideal format. You have to work with what you get from the data provider. In our case, this meant writing our own import scripts and plug-ins. There may also be data quality issues, which means one has to do a thorough data analysis before the import and massive data massaging during it (if you have XML data, XSLT 2.0 is your friend because of its strong grouping and sorting facilities). And one has to be prepared to implement error handling for all kinds of errors that can be caused by wrong, incomplete or missing data.
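
As an illustration of the kind of content analysis described above, here is a minimal Python sketch. It is a stand-in for the XSLT 2.0 grouping and sorting, not the actual UZH code. It assumes the records have already been parsed into dictionaries; the field names and sample values are made up.

```python
from collections import Counter, defaultdict


def analyse_field_values(records):
    """Return {field: [(value, count), ...]} with values sorted alphabetically."""
    values_by_field = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            # Collapse whitespace only, so that real inconsistencies stay visible.
            cleaned = " ".join((value or "").split())
            values_by_field[field][cleaned or "<missing>"] += 1
    return {
        field: sorted(counter.items(), key=lambda item: (item[0].casefold(), item[0]))
        for field, counter in values_by_field.items()
    }


# Hypothetical sample data showing spelling variants and a missing value.
sample = [
    {"language": "ger", "faculty": "Medizinische Fakultät"},
    {"language": "Ger", "faculty": ""},
    {"language": "ger", "faculty": "Med. Fakultät"},
]

report = analyse_field_values(sample)
for field, rows in report.items():
    print(field)
    for value, count in rows:
        print(f"  {count:4d}  {value}")
```

Listing the distinct values next to each other makes spelling variants, abbreviations and empty fields easy to spot before deciding on a mapping.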

In the case of National Licenses, this involved:
- getting CSV files from the data provider
- 1 script and 2 import plug-ins (NationalLicense, DOI)
- filtering out wrong records, because the provider did insufficient affiliation matching and there were also records from ETH Zurich (instead of University of Zurich)
- extracting the DOIs, then doing a duplicate match or an import via the DOI plug-in, to which a separate handler had to be passed
- guessing the Dewey classification based on the ISSN of the journal where the article was published, using our journal database
- fetching the abstracts from a separate URL - the abstracts were not stored in the CSV and sometimes are not available via Crossref
- adding missing fields that are not available in the metadata (e.g. publication status, subject, OA status, copyright, and so on)
- downloading the PDFs and attaching them to the eprint, setting language, format, content, embargo and security, and making thumbnails on the fly
- printing a report of the import (success and failures, detected duplicates)
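
The filtering, DOI extraction, duplicate matching and reporting steps above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the NationalLicense or DOI plug-in: the column names (title, affiliation, doi_url), the sample rows and the set of already known DOIs are all hypothetical, and the DOI pattern is a loose one rather than a complete grammar.

```python
import csv
import io
import re

# Loose pattern for modern DOIs (an assumption, not a full DOI grammar).
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def extract_doi(text):
    """Return the first DOI found in text, lower-cased, or None."""
    match = DOI_PATTERN.search(text or "")
    if match is None:
        return None
    # Strip trailing punctuation that often clings to DOIs in free text.
    return match.group(0).rstrip(".,;)").lower()


def triage(csv_text, known_dois, wanted_affiliation):
    """Sort CSV rows into buckets and return them as a report dictionary."""
    report = {"import": [], "duplicate": [], "wrong_affiliation": [], "no_doi": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = row.get("title") or "<untitled>"
        affiliation = row.get("affiliation") or ""
        if wanted_affiliation.casefold() not in affiliation.casefold():
            report["wrong_affiliation"].append(title)
            continue
        doi = extract_doi(row.get("doi_url"))
        if doi is None:
            report["no_doi"].append(title)
        elif doi in known_dois:
            report["duplicate"].append(doi)
        else:
            report["import"].append(doi)
    return report


# Made-up sample rows; 10.1000 is used here only as a placeholder prefix.
sample_csv = "\n".join([
    "title,affiliation,doi_url",
    "Paper A,University of Zurich,https://doi.org/10.1000/ABC.1",
    "Paper B,ETH Zurich,https://doi.org/10.1000/abc.2",
    "Paper C,University of Zurich,https://doi.org/10.1000/abc.3",
    "Paper D,University of Zurich,",
])

result = triage(sample_csv, known_dois={"10.1000/abc.3"},
                wanted_affiliation="University of Zurich")
for bucket, items in result.items():
    print(f"{bucket}: {len(items)} {items}")
```

Keeping every row in exactly one bucket means the final report accounts for all input records, including those that were skipped.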

In the case of the e-theses:
- getting a combined MARCXML/ADAM XML file from the provider
- inserting a separate XML element per MARC record into the file that groups a MARC record (M) with its associated ADAM records (A). The file carried the implicit assumption that the ADAM records immediately following a MARC record belong to that MARC record. However, this relationship is not parsable as such (there is no schema). So I went from a structure like Root{M A A A M A M A A M A A A M A M A A A ...} to something like Root{Doc(M A A A) Doc(M A) Doc(M A A) Doc(M A A A) Doc(M A) Doc(M A A A) ...}
- doing a tag analysis of both M and A using XSLT, then deciding on the mapping to EPrints fields
- doing a content analysis of each tag using XSLT by grouping the content and sorting it alphabetically. This revealed the whole data nightmare: inconsistent cataloging due to three different cataloging rulesets that were applied over time, escaped words because of old cataloging rules for indexing, missing data, typos, unusable additional phrases, inconsistent cataloging of author names in different fields (in 100_a: family, given; in 245_c: given family, the latter being impossible to parse correctly because of compound family names), and surprises such as a thesis having several authors of which only the first is recorded in 100_a
- 1 script, 1 import plug-in (AlephMarc), 1 config file for mapping MARC --> eprint metadata
- extracting the metadata and data massaging
- downloading the PDF of the full-text ADAM record and attaching it to the eprint, setting language, format, content, embargo and security, and making thumbnails on the fly
- downloading the PDF of the ADAM record for the abstract, doing a pdftotext conversion, extracting the abstract and removing title and author information from it
- doing a pdftotext conversion of the full text's cover page and trying to guess the faculty, which is often not available in the metadata but is a required field in the UZH repo
- flagging problems in a special eprint field for the review team
- printing a report of the import (success and failures, detected duplicates)
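
The regrouping step from Root{M A A A M A ...} to Root{Doc(M A A A) Doc(M A) ...} can be sketched as follows. This is a minimal Python illustration, not the script used at UZH; the element names marc, adam and doc are placeholders, since the real file's element names are not given here.

```python
import xml.etree.ElementTree as ET


def group_records(root, marc_tag="marc", adam_tag="adam", group_tag="doc"):
    """Wrap each MARC record and the ADAM records that follow it in one group element."""
    grouped = ET.Element(root.tag, root.attrib)
    current = None
    for child in root:
        if child.tag == marc_tag:
            # A MARC record always starts a new group.
            current = ET.SubElement(grouped, group_tag)
            current.append(child)
        elif child.tag == adam_tag:
            if current is None:
                raise ValueError("ADAM record found before any MARC record")
            current.append(child)
        else:
            raise ValueError(f"Unexpected element <{child.tag}>")
    return grouped


# Hypothetical flat input following the pattern M A A M A.
flat = ET.fromstring(
    "<root><marc id='1'/><adam/><adam/><marc id='2'/><adam/></root>"
)
grouped = group_records(flat)
print(ET.tostring(grouped, encoding="unicode"))
```

Once the grouping is explicit in the XML, each doc element can be processed on its own, and stray elements are reported as errors instead of being silently attached to the wrong record.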

Best regards,

Martin

--
Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Stampfenbachstr. 73
CH-8006 Zürich

mail: martin.braendle@id.uzh.ch
phone: +41 44 63 56705
fax: +41 44 63 54505
http://www.zi.uzh.ch

From: "James Kerwin via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
To: <eprints-tech@ecs.soton.ac.uk>
Date: 17.01.2019 11:21
Subject: [EP-tech] Thesis Bulk Upload/Import
Sent by: eprints-tech-bounces@ecs.soton.ac.uk

Hi All,

The University I work at is currently exploring options for digitising our collection of theses, with an aim of them going into the institutional repository and I have some questions if anybody could lend me some of their experience and opinions.

1) I've noticed some organisations have a separate instance of EPrints for theses. We currently put each thesis into the institutional repository along with all other types of item. Is there a benefit to separating them out?

2) Does EPrints facilitate any sort of bulk upload of Documents and EPrint record creation? I've had a quick look around and found the following from Tomasz Neugebauer and Bin Han:

https://www.researchgate.net/publication/291251891_Batch_Ingesting_into_EPrints_Digital_Repository_Software

I'm curious to see if this is still relevant (it's very thorough) or if there are any other methods or potential pitfalls to avoid.

3) Following on from Q2, is there a preferred/ideal format of metadata? The article makes it clear that many different formats are supported, but again I'm wondering if there are any pros and cons to any particular format.

The digitising won't be complete for some time so I'm taking the opportunity to get ahead of it and be ready.

Thanks,
James
