EPrints Technical Mailing List Archive

Message: #06385



Re: [EP-tech] Scripted XML download?


Sorry, just noticed you said you don't have shell access to the server.

You could download from the browse views:

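The export links on those view pages normally follow EPrints' standard export-view URL pattern (an assumption here, so check a year page under http://researchonline.lshtm.ac.uk/view/year/ for the exact link); a single year's XML can then be fetched like this:

wget -O 2015.xml 'http://researchonline.lshtm.ac.uk/cgi/exportview/year/2015/XML/2015.xml'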

You can iterate up through the years, but I wouldn't recommend parallelising this: make sure you've finished downloading 2015's items before you take on 2016's.  I downloaded 2015's and it only took around 5 minutes.  If you want a robust process, you could scrape this page: http://researchonline.lshtm.ac.uk/view/year/ and verify that you've downloaded the correct number of items in the XML:

grep 'eprint id=' 2015.xml | wc -l

...will almost certainly give you the number of eprints without having to parse a large XML file.
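Putting the download and the check together, a minimal serial loop might look like this (a sketch assuming the export-view URL pattern above and standard shell tools; adjust the year range to your repository):

for y in $(seq 2000 2016); do
    wget -O "$y.xml" "http://researchonline.lshtm.ac.uk/cgi/exportview/year/$y/XML/$y.xml"
    echo "$y: $(grep 'eprint id=' "$y.xml" | wc -l) items"
done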

If you have enough access to the machine to configure more browse views, then you may be able to set up a view that is 'unlinked' (meaning it exists, but the repository doesn't link to it from the /view page), which would streamline this further.


Have you considered using the OAI interface?  It won't give you EPrints XML, but it does let you download just the items that have changed.
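EPrints serves OAI-PMH from /cgi/oai2, so a harvest of everything changed since a given date looks like the request below (oai_dc is shown because every OAI server supports it; large result sets are paged via resumption tokens):

wget -O changed.xml 'http://researchonline.lshtm.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc&from=2017-03-20'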


--
Adam Field

On 27 Mar 2017, at 22:39, Adam Field <af05v@ecs.soton.ac.uk> wrote:

Have you tried a command-line export?  Even if it takes a while, as long as it doesn't consume too many system resources, your repository will stay nice and snappy.  You could, for example, trigger it to run at 1am, write the export to a location in your html directory, and then wget it a day later (just in case it runs long).  You could speed up the wget by zipping the output.

the command would be:

<eprints_root>/bin/export <repositoryid> archive XML | gzip > <eprints_root>/archives/<archive_id>/html/en/eprint_archive.xml.gzip
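To run that at 1am as suggested, a crontab entry along these lines would do (same placeholders as the command above):

0 1 * * * <eprints_root>/bin/export <repositoryid> archive XML | gzip > <eprints_root>/archives/<archive_id>/html/en/eprint_archive.xml.gzip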

wget would be:

wget -O - <base_url>/eprint_archive.xml.gzip | gunzip > eprint_archive.xml


Note that there shouldn't be any security issues, because the archive dataset contains only the live items, which should all be publicly visible anyway.  Also, be careful that you aren't downloading the file at the same time you're regenerating it.
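One way to avoid that race (a sketch, not in the original command; writing <html_dir> for <eprints_root>/archives/<archive_id>/html/en) is to export to a temporary file and rename it into place only once the export has finished, since the rename is atomic when both paths are on the same filesystem:

<eprints_root>/bin/export <repositoryid> archive XML | gzip > <html_dir>/eprint_archive.xml.gzip.tmp \
    && mv <html_dir>/eprint_archive.xml.gzip.tmp <html_dir>/eprint_archive.xml.gzip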

Lastly, the above was typed directly into the email -- your mileage may vary both with syntax and conceptual errors.


--
Adam Field

On 27 Mar 2017, at 14:51, Andy Reid <Andy.Reid@lshtm.ac.uk> wrote:

Hi,

I do some checking, analysis and visualisation of our repository in a third-party package, and I have it set up to ingest EPrints XML.  I’d like to update this once a week or so, but if I download it all in one big go it takes about 3 hours and 1.5GB, and tends to fail halfway through.  I have been doing it manually one year at a time, but that means 17 separate manual search-and-download operations, each taking ten minutes or so.  I don’t have shell access to the server, so I can’t script it on the command line.

I have looked at the search page, but after a search the download form references a cached search id, so I can’t just copy the URL from the download form.

Can anyone give me a template for a URL that would work in a single pass with wget or libwww, which I could then cron to fetch the EPrints XML?  Obviously I’d also need to be able to authenticate…

Andy Reid

Research Information Manager

Executive Office, Room G40a

London School of Hygiene and Tropical Medicine

Keppel St, LONDON, WC1E 7HT

0207-927-2618 (Internal/Teleworker x2618)

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/
