I spent the last few days migrating our EPrints repository to version 3. This seems like a dull and easy job, but it had its own challenges:
- the MySQL database had latin1 encoding, which didn't play well with utf-8 encoded characters from EPrints 2, in effect producing utf-8 strings which were encoded multiple times (and differently for different data)
- we also had a table with additional works collected after our EPrints 2 installation died, so it had to be imported somehow
In essence, EPrints is a set of Perl scripts which convert an XML archive into a database and web interface. So, how hard can it be?
For a start, take a look at the utf8-fix.pl script, which tries to convert all combinations of Croatian characters back to utf-8. Creating the mapping was not easy. If you look at the end of the script, you will see a verification step which tries to find utf-8 strings not covered by the mapping and dumps them out. To make it work, I used test-driven methodology (sic!) with fix.sh as a small runner script which does one conversion, shows a diff against the previous run (lines with errors disappearing from the log is good) and opens vi to edit the files directly.
Re-read the last sentence once more. I spent two days before I streamlined this workflow to the point where I could really finish the conversion, so it's useful to keep that in mind if you are writing some kind of data munging software.
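To illustrate the kind of damage involved, here is a minimal Python sketch (not the mapping-based approach of utf8-fix.pl) that peels off one layer of "utf-8 bytes read as latin1" per round, followed by a check for leftovers in the spirit of the verification step. The function names fix_mojibake and find_suspicious are made up for this example, and it assumes plain ISO-8859-1 behaviour, which may not match what MySQL's latin1 did with every byte.

```python
import re

def fix_mojibake(text: str, max_rounds: int = 5) -> str:
    """Undo repeated 'utf-8 bytes decoded as latin-1' damage, one layer per round."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the text contains characters outside latin-1 (so it is
            # already real Unicode) or the bytes are not valid utf-8: stop.
            break
        if candidate == text:
            # Pure ASCII round-trips unchanged; nothing left to fix.
            break
        text = candidate
    return text

# A utf-8 lead byte (0xC2-0xC5 covers Latin-1 Supplement and Latin Extended-A,
# which includes the Croatian letters) followed by a continuation byte, both
# seen as latin-1 characters, is a strong hint of remaining damage.
SUSPICIOUS = re.compile("[\u00c2-\u00c5][\u0080-\u00bf]")

def find_suspicious(lines):
    """Return the lines that still look multiply encoded."""
    return [line for line in lines if SUSPICIOUS.search(line)]
```

This only works when a whole string was damaged the same number of times. When a single string mixes correctly encoded and multiply encoded characters, the round-trip fails for the whole string, which is one reason an explicit mapping can be the more practical tool.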
In the process, I also stripped Croatian characters from PDF filenames, creating symlinks to unaccented versions and passing the generated XML through unaccent-file.pl from fix.sh.
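Stripping the accents can be sketched in Python roughly like this. It is an illustration, not a port of unaccent-file.pl: the names unaccent and link_unaccented are hypothetical, and the symlink direction (an unaccented name pointing at the original file) is an assumption. Note that đ and Đ have no Unicode decomposition, so they need an explicit mapping.

```python
import os
import unicodedata

# d-with-stroke does not decompose into "d" plus a combining mark,
# so it has to be translated explicitly.
_NO_DECOMPOSITION = str.maketrans({"đ": "d", "Đ": "D"})

def unaccent(name: str) -> str:
    """Return name with Croatian (and other Latin) diacritics removed."""
    decomposed = unicodedata.normalize("NFKD", name.translate(_NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def link_unaccented(path: str) -> str:
    """Create a symlink with an unaccented name next to the original file.

    Returns the path that should be used from then on. If the name has no
    accents, nothing is created and the original path is returned.
    """
    directory, filename = os.path.split(path)
    plain = unaccent(filename)
    if plain == filename:
        return path
    link = os.path.join(directory, plain)
    if not os.path.lexists(link):
        # Relative target, so the link survives moving the whole directory.
        os.symlink(filename, link)
    return link
```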
The second part of the problem was converting a tab-delimited file into EPrints XML to import the new documents. However, it's (again) not as easy as it seems, since the data had only partial filenames which had to be matched to real files somewhere on a share. So, I decided to split this problem in the following way:
- files.txt is a list of available files generated by find /mnt/share -print
- ep-xml.xml is a template for a single document which uses <!-- "variable" --> to denote places where I need to insert custom data
- finally, tsv2xp-xml.pl (which should really be named tsv2ep-xml.pl, but I made a typo) is a script which reads both files (together with the TSV) and emits XML for EPrints
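The three pieces fit together roughly as in the following Python sketch. It is not the actual tsv2xp-xml.pl: the column names (title, file), the element names in the example template and the exact placeholder syntax (a comment containing a bare variable name) are all assumptions made for illustration, and real EPrints XML looks different.

```python
import csv
import io
import re
from xml.sax.saxutils import escape

def find_file(partial, available):
    """Return the single path whose file name contains partial, else None."""
    matches = [p for p in available if partial in p.rsplit("/", 1)[-1]]
    # Refuse to guess when nothing or more than one file matches.
    return matches[0] if len(matches) == 1 else None

def fill_template(template, values):
    """Replace <!-- name --> comments with the XML-escaped value for name."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)  # leave unknown placeholders untouched
        return escape(values[name])
    return re.sub(r"<!--\s*(\w+)\s*-->", substitute, template)

def convert(tsv_text, template, available):
    """Return (xml_records, unmatched_partial_names) for a TSV with a header row."""
    records, unmatched = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        path = find_file(row["file"], available)
        if path is None:
            unmatched.append(row["file"])
            continue
        records.append(fill_template(template, dict(row, file=path)))
    return records, unmatched
```

Collecting unmatched names instead of failing on the first one fits the same iterative workflow as before: run, look at what is left over, fix the data and run again.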