EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #03347



[EP-tech] Re: Injecting gigabyte-scale files into EPrints archive - impossible?


There's no official documentation for the toolbox; it really ought to be documented better.

Can't you just use import with these options:

    --enable-import-ids
            By default import will generate a new eprintid, or userid, for
            each record. This option tells it to use the id specified in the
            imported data. This is generally used for importing into a new
            repository from an old one.

    --enable-file-imports
            Allow the imported data to import files from the local
            filesystem. This can obviously be seen as a security hole if you
            don't trust the data you are importing. This sets the
            "enable_file_imports" configuration option for this session
            only.

after you've exported the eprints, modified the document section, and reimported them?
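
Something like this round trip, perhaps (an untested sketch; the
repository id, paths, and the XML plugin name are placeholders and may
differ on your installation):

    ~eprints/bin/export archive eprint XML > eprints.xml
    # edit the <file> entry of the record in question so that its
    # <url> points at file:///path/to/the/big/file, then:
    ~eprints/bin/import archive eprint XML eprints.xml \
        --enable-import-ids --enable-file-imports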

Another option is to use a Perl library for efficient file handling and change the code where it does:

    join("", <STDIN>)




On 01/08/2014 11:25, Florian Heß wrote:
Hello developers and users,

Again, I'm sorry I have to consult you about a problem we've run into
and couldn't solve ourselves.

We need to attach a big file to a document, i.e. one of 3 GB in size.
We limited web uploads to 100 MB in the web server configuration in
order to keep control of large file uploads. To get bigger files into
the archive we successfully use the following command:

/usr/bin/perl ~eprints/bin/toolbox $repo addFile \
     --document $docid --filename $filename < /path/to/existing/file

(Besides, is there a convenient way of getting the document id? It is
rather tedious to upload a placeholder file so we can manually seek
out and grab a doc id with the Firebug extension; after running the
command, we open the eprint's file dialog in the document metadata to
switch the main file and delete the placeholder.)
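
I imagine something along these lines could list the document ids via
the Perl API, though I haven't tried it (the repository id and eprint
id are placeholders):

    use strict;
    use warnings;
    use EPrints;

    # list document ids (and main filenames) of eprint 123
    my $session = EPrints::Session->new( 1, "archive" );
    my $eprint = $session->dataset( "eprint" )->dataobj( 123 );
    foreach my $doc ( $eprint->get_all_documents )
    {
        print $doc->id, "\t", ( $doc->value( "main" ) || "" ), "\n";
    }
    $session->terminate;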

I narrowed this method down to a line of code in
EPrints::Toolbox::get_data() that I doubt scales to these dimensions
(given the memory of our hardware):

      join("", <STDIN>)

builds, in EPrints 3.3.10, a monstrous Perl scalar that surely gets
expanded and moved around in memory over and over to fit. I wonder
whether there is a way I can move the file to the expected place myself
and adjust the file record in the EPrints database. I tried this
already, but in the end I still got the tiny placeholder file when
downloading. I deleted the file on the console (rm), but then EPrints
threw "couldn't read file contents", so somewhere things were still
arranged for the old file. The browser does, though, display the right
filename in the modal dialog offering to save the file or open it with
a program.
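
If the Perl API is viable here, perhaps add_stored_file() with a
filehandle would let the storage layer stream the data rather than
hold it all in memory; an untested sketch, where the repository id,
document id, filename, and path are placeholders:

    use strict;
    use warnings;
    use EPrints;

    my $session = EPrints::Session->new( 1, "archive" );
    my $doc = $session->dataset( "document" )->dataobj( 456 );

    # hand the storage layer a filehandle so it can stream the
    # data instead of slurping it into one scalar
    my $path = "/path/to/existing/file";
    open( my $fh, "<:raw", $path ) or die "open failed: $!";
    $doc->add_stored_file( "bigfile.zip", $fh, -s $path );
    close $fh;

    $session->terminate;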

The toolbox command had been running for more than two hours, gorging
swap space like there was no tomorrow, before we killed it. It
consumed 2% CPU on average, and its status flag was "D" most of the
time (man ps: "uninterruptible sleep (usually IO)"). It appeared to me
it was constantly swapping.

Today I tried the toolbox addDocument command, which doesn't seem to
save me any work after all; it just requires XML data. And with
<url>file:///path/of/file/to/import</url>, it runs out of disk space
again while "downloading" that URL into /tmp.
I wish I could pass the path of a file to be copied directly; isn't
that possible somehow?
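
For illustration, the document XML in question would look roughly like
this (a sketch only; the eprintid, format, filename, and path are
placeholders, and the exact schema may differ by version):

    <?xml version="1.0" encoding="utf-8"?>
    <document>
      <eprintid>123</eprintid>
      <format>application/zip</format>
      <files>
        <file>
          <filename>bigfile.zip</filename>
          <url>file:///path/of/file/to/import</url>
        </file>
      </files>
    </document>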


Kind regards
Florian