EPrints Technical Mailing List Archive
Message: #08311
[EP-tech] Announcing eprints2archives
- To: eprints-tech@ecs.soton.ac.uk
- Subject: [EP-tech] Announcing eprints2archives
- From: "Michael Hucka" <mhucka@library.caltech.edu>
- Date: Thu, 03 Sep 2020 11:35:57 -0700
Greetings,

eprints2archives is a new program to archive the web pages of an EPrints server in public web archiving sites such as the Internet Archive (https://archive.org/web/). It contacts an EPrints server, obtains the list of documents it serves (optionally filtered by criteria such as modification date), determines the document URLs, extracts additional URLs by scraping pages under the "/view" section of the public site, and finally sends the collected URLs to web archives. Use cases include archiving a server's content ahead of migration to another system and preserving content in independent third-party archives.
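As a rough illustration of the steps described above (not the program's actual code), the sketch below filters a list of records by modification date, gathers their URLs without duplicates, and builds a request URL for each one. The record layout and the function names `select_urls` and `wayback_save_request` are hypothetical, and the "https://web.archive.org/save/" prefix is assumed here to be the Internet Archive's "Save Page Now" address.

```python
from datetime import datetime

def select_urls(records, modified_since=None):
    """Return de-duplicated URLs from records, in first-seen order.

    Each record is assumed (for illustration only) to be a dict with a
    'lastmod' datetime and a 'urls' list holding the record page, its
    documents, and any pages scraped from under /view.
    """
    seen = set()
    urls = []
    for rec in records:
        # Optional filter: skip records last modified before the cutoff.
        if modified_since is not None and rec["lastmod"] < modified_since:
            continue
        for url in rec["urls"]:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

def wayback_save_request(url):
    """Build the request URL that asks the archive to capture `url`.

    Assumption: captures are requested by appending the target URL to
    this prefix.
    """
    return "https://web.archive.org/save/" + url

if __name__ == "__main__":
    example = [
        {"lastmod": datetime(2020, 1, 5),
         "urls": ["https://example.org/1/"]},
        {"lastmod": datetime(2020, 9, 1),
         "urls": ["https://example.org/2/",
                  "https://example.org/view/year/2020.html"]},
    ]
    for u in select_urls(example, modified_since=datetime(2020, 8, 1)):
        print(wayback_save_request(u))
```

Keeping the selection step separate from the sending step means the list of URLs can be inspected before anything is submitted to an archive.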
The program is written in Python 3 and works over a network using an EPrints server's REST API and normal HTTP. eprints2archives can work with EPrints servers that require logins as well as those that allow anonymous access. It uses parallel threads by default, transparently handles rate limits, and robustly deals with network errors. Currently, it can send contents to the Internet Archive and Archive.Today; more destination archives may be added in the future.
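To give a sense of what sending URLs in parallel while coping with rate limits and network errors can look like, here is a minimal sketch. It is not taken from eprints2archives: the `RateLimited` exception, the `send_with_retry` and `send_all` helpers, and the exponential back-off schedule are all assumptions made for illustration, and `send` stands in for whatever function actually submits one URL to an archive.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimited(Exception):
    """Hypothetical signal that the destination asked us to slow down."""

def send_with_retry(send, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send(url), retrying on rate limits and connection errors.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised once max_tries attempts are used up.
    """
    for attempt in range(max_tries):
        try:
            return send(url)
        except (RateLimited, ConnectionError):
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

def send_all(send, urls, threads=4, **retry_options):
    """Send every URL using a pool of worker threads.

    Results are returned in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(
            lambda u: send_with_retry(send, u, **retry_options), urls))
```

Passing `sleep` as a parameter is a convenience for testing: a test can substitute a function that records the delays instead of actually waiting.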
You can install eprints2archives from PyPI or GitHub. For more information, please visit
https://github.com/caltechlibrary/eprints2archives

Please report problems using the issue tracking system, which you can find at the GitHub link above.
Best regards,
MH

--
Mike Hucka, Ph.D. -- mhucka@caltech.edu -- http://www.cds.caltech.edu/~mhucka
California Institute of Technology