EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #06196
< Previous (by date) | Next (by date) > | < Previous (in thread) | Next (in thread) > | Messages - Most Recent First | Threads - Most Recent First
Re: [EP-tech] Migrating output data - most efficient way
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] Migrating output data - most efficient way
- From: Andrew Beeken <anbeeken@lincoln.ac.uk>
- Date: Tue, 24 Jan 2017 14:22:51 +0000
Hi Dan, Thanks for this remarkably thorough write up! One thing I wasn’t aware of was the export function; I was grabbing the XML from the front end but this should obviously fill in any gaps. I am, however,
having problems with it! Using the syntax: export <repositoryid> archive XML > <filename>.xml gives me a file, however the file is never populated. I don’t know if this is an issue in our repository or if there is a patch anywhere? No errors are thrown on the command line so I’m at a loss!
Stumbled at the first hurdle! Andrew From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Stieneke, Dan I’ve done migrations of this sort several times. I never did a database dump, but instead exported/imported XML, on the premise when doing a database copy, mysql would only tell me about database
errors, but an export/import would give eprints a chance to tell me about eprints errors. On the most recent iteration, moving from 3.2.4 to 3.3.15, I ran into two problems migrating the actual eprints: 1)
Exporting was fine, but the import threw a ton of errors. Reversing the .XML file (so newest items were imported first, oldest last) worked. To import in reverse order:
a.
b.
edit
list.txt to
i. Remove top line, which is “eprints”
ii. Remove any cancelled items. You have to make at least one import attempt to figure out
what these are.
c.
tac < list.txt > revlist.txt - reverse the list (tac
is cat in reverse)
d.
tr '\n' ' ' < revlist.txt > revlistspaces.txt (go from one eprintID per
line to a list of space-separated eprintIDs)
e.
~/bin/export ARCHIVEID archive XML $(cat revlistspaces.txt) > REV.xml
i. This depends on pulling the
.PDF files in from the webserver. Use
XMLFiles instead of
XML to get the whole thing Base64-encoded in the
.XML file
2)
Thesis info changed somewhere between those two versions
a.
Old phd info:
<thesis_type>phd</thesis_type>
b.
Equivalent new phd info:
<thesis_type>postdoctoral</thesis_type><thesis_name>phd</thesis_name>
c.
Fixit file :
phdfix.sh
#!/bin/bash sed 's!<thesis_type>phd</thesis_type>!<thesis_type>postdoctoral</thesis_type><thesis_name>phd</thesis_name>!' < $1 > $2
d.
./phdfix.sh REV.xml REV_phdfix.xml (names based on the reversal done
above)
e.
~/bin/epadmin erase_data ARCHIVEID
Had some difficulty with
erase_eprints, apparently some 'leftovers'. This blows away & rebuilds the entire database.
f.
~/bin/import_subjects ARCHIVEID
g.
~/bin/import ARCHIVEID --enable-import-fields --enable-web-imports --update --force --verbose archive XML REV_phdfix.xml
i. --update overwrites
older eprints w/the same ID instead of appending them to the end of the collection.
Once done, verify:
#!/bin/bash # clean up encoding changes dos2unix $1 # name change sed -i 's!OLDURL!NEWURL!g' $1 # double to single quote sed -i "s/\"/'/g" $1 # this appears in different spots in old & new formats sed -i "s! xmlns='http://eprints.org/ep2/data/2.0'!!" $1 # new format uses FQDN sed -i 's!<uri>/id/document/!<uri>http://FQDN/id/document/!' $1 # these ALL change - import increments them sed -i '/<rev_number>.*<\/rev_number>/d' $1 # shows up in different spots than new format sed -i '/<mime_type>application\/pdf<\/mime_type>/d' $1 # quite a few of these empties show in the old format, not the new sed -i '/<id><\/id>/d' $1 sed -i '/<reported_by><\/reported_by>/d' $1 sed -i '/<resolved_by><\/resolved_by>/d' $1 sed -i '/<comment><\/comment>/d' $1 # this is an entire multi-line section that lives in the old, but not in the new perl -i -0pe 's/<copies>.*?<\/copies>/DELETEME/sg' $1 # can't easily delete; this replaces with literal DELETEME # Encoding of abstract has changed. Not always visibly obvious how (CRLFs, I think) #perl -i -0pe 's/<abstract>.*?<\/abstract>/DELETEME/sg' $1 # can't easily delete; this replaces with literal DELETEME sed -i '/DELETEME/d' $1 # thesis type changes between the two versions sed -i '/<thesis_type>phd<\/thesis_type>/d' $1 # old version encodes dashes; new version uses dashes sed -i 's/%2D/-/g' $1 # old version encodes this; new version ignores??? sed -i 's/ //g' $1
b.
FILE
sedsNew.sh
#!/bin/bash # clean up encoding changes dos2unix $1 # these ALL change - import increments them sed -i '/<rev_number>.*<\/rev_number>/d' $1 # shows up in different spots than new format sed -i '/<mime_type>application\/pdf<\/mime_type>/d' $1 # this appears in different spots in old & new formats sed -i "s! xmlns='http://eprints.org/ep2/data/2.0'!!" $1 # these are now encoded; old format was not sed -i "s/"/\'/g" $1 # these are now encoded; old format was not sed -i "s/'/\'/g" $1 # thesis type changes between the two versions sed -i '/<thesis_type>phd<\/thesis_type>/d' $1 sed -i '/<thesis_type>postdoctoral<\/thesis_type>/d' $1 sed -i '/<thesis_name>phd<\/thesis_name>/d' $1 # Encoding of abstract has changed. Not always visibly obvious how (CRLFs, I think) perl -i -0pe 's/<abstract>.*?<\/abstract>/DELETEME/sg' $1 # can't easily delete; this replaces with DELETEME sed -i '/DELETEME/d' $1
A.
OLDbox:
echo 'SELECT CONCAT("update eprint set lastmod_year=",lastmod_year,", lastmod_month=",lastmod_month,", lastmod_day=",lastmod_day,", lastmod_hour=",lastmod_hour,", lastmod_minute=",lastmod_minute,",
lastmod_second=",lastmod_second," WHERE eprintid=",eprintid,";") FROM eprint;' | mysql -u root -p ARCHIVEID > lastmod.sql
How to migrate the users: *** BEGIN FILE: MigrateUsers.sh *** #/bin/bash OLDBOX=IP_OR_FQDN OLDMYSQLPWD=PASSWORDGOESHERE NEWMYSQLPWD=PASSWORDGOESHERE ssh root@${OLDBOX} "sudo -u eprints /usr/share/eprints3/bin/export ARCHIVEID user XML" > oldboxusers.xml # HEREDOC to avoid quote escapement torture # grep to remove the motd lines that get put in there ssh root@${OLDBOX} <<HEREDOC | grep "^update user set password.*" > oldboxpasswords.sql mysql -u root -p${OLDMYSQLPWD} ARCHIVEID --skip-column-names -e 'select concat("update user set password=\"", password, "\" where username=\"", username, "\";") from
user;' HEREDOC sudo -u eprints /usr/share/eprints3/bin/import ARCHIVEID user XML oldboxusers.xml mysql -u root -p${NEWMYSQLPWD} ARCHIVEID < oldboxpasswords.sql rm oldboxpasswords.sql rm oldboxusers.xml *** END FILE: MigrateUsers.sh *** Sorry for some of the weird outline-autonumbering / indentation – all of this was copied / pasted from my documentation wiki. Good luck, Dan Stieneke IT Specialist USDA - ARS - NWISRL 3793 N 3600 E Kimberly, ID 83341 208/423-6519 From:
eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Andrew Beeken Hello all! So, my quest to get the Lincoln repository onto a more standard EPrints install continues! Thanks for help with Ubuntu versions etc; I’ve now got that knowledge together and can start to look at the migration. I know I’ve asked this in
the past, but my scope has changed slightly. Now, instead of looking at recreating our live repository like for like, I’m just looking at migrating the output data into a fresh EPrints installation and working from there. I know I realistically have two options – importing from an EPrints XML dump off the main site or through a database dump. I’m assuming that in both cases I’ll need to bring over essential field definition, phrase and workflow files to
avoid causing issues, and I’m assuming that, ideally, the database route would be best to make sure that any “in review” or hidden items are also brought across. Main qustions: ·
Am I right in this assumption? ·
Can I JUST bring out a subset of tables for the deposit data or do I have to do a full database export? I’d ideally prefer to not do the latter as there are customisations to the user table in our live EPrints that I’m trying to avoid replicating in
this new environment. ·
Are there any gotchas I should be aware of? Thanks, as always, in advance! Andrew
|
- References:
- [EP-tech] Migrating output data - most efficient way
- From: Andrew Beeken <anbeeken@lincoln.ac.uk>
- Re: [EP-tech] Migrating output data - most efficient way
- From: "Stieneke, Dan" <Dan.Stieneke@ARS.USDA.GOV>
- [EP-tech] Migrating output data - most efficient way
- Prev by Date: Re: [EP-tech] Migrating output data - most efficient way
- Next by Date: Re: [EP-tech] Migrating output data - most efficient way
- Previous by thread: Re: [EP-tech] Migrating output data - most efficient way
- Next by thread: Re: [EP-tech] Migrating output data - most efficient way
- Index(es):