EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #07412



Re: [EP-tech] Multiple Uploaded Files in One Directory


Hi Alan and John,

Thank you both for the advice, it's really helpful.

I'll go with Alan's solution in the immediate term and John, if you have the time, some more info on your solution would be brilliant.

From what I can see, this doesn't happen very often, but I'd much prefer if it didn't happen at all!

Thanks,
James

On Wed, Aug 15, 2018 at 12:45 PM, John Salter <J.Salter@leeds.ac.uk> wrote:

Hi James,

Welcome to EPrints :o)

 

When EPrints resolves a URL, it uses the eprintid and pos to get the document data object via

EPrints::DataObj::Document::doc_with_eprintid_and_pos

 

Normally there would only be one object returned - and the document that 'works' is the first one returned by the above call.
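
As a rough illustration of why only one document 'works' (plain Perl, not the real EPrints API; the record layout and the first_doc_at helper are made up for this example), a lookup that returns the first record matching an eprintid and pos will never hand back a second record at the same pos:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for rows of the 'document' table.
my @docs = (
    { docid => 10, eprintid => 2, pos => 1, main => 'A.pdf' },
    { docid => 11, eprintid => 2, pos => 1, main => 'B.pdf' },
    { docid => 12, eprintid => 2, pos => 2, main => 'C.pdf' },
);

# Return the first record matching eprintid and pos, or undef if none.
sub first_doc_at
{
    my( $docs, $eprintid, $pos ) = @_;
    foreach my $doc ( @{$docs} )
    {
        return $doc if $doc->{eprintid} == $eprintid && $doc->{pos} == $pos;
    }
    return undef;
}

my $hit = first_doc_at( \@docs, 2, 1 );
print $hit->{main}, "\n";   # prints "A.pdf"; B.pdf is never returned
```

Which of the two records comes back first in a real repository depends on how EPrints orders the results, which this sketch does not try to model.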

 

Onto the question about how items get into this state:

This sounds very similar to an issue we had with our Symplectic connector - and how it merged two EPrints together when the corresponding Symplectic items were merged together. This ends up with two documents attached to the same EPrint existing in the same 'pos'.

 

EPrints' default behaviour is to remove the 'pos' during a clone *only* when the doc is being cloned to the same parent: https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/DataObj/Document.pm#L374

 

In some circumstances, this is not the correct course of action - EPrints should check that a doc doesn't already exist at that pos for that eprint.
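
The kind of check meant here could look something like the sketch below. It is plain Perl rather than EPrints code, choose_pos is a hypothetical helper, and "lowest unused number" is just one possible policy assumed for illustration: keep the requested pos if nothing on the target eprint uses it, otherwise pick a free one.

```perl
use strict;
use warnings;

# Given the wanted pos and the positions already used on the target eprint,
# keep the wanted pos if it is free; otherwise return the lowest unused
# positive integer. (Hypothetical helper, not part of EPrints.)
sub choose_pos
{
    my( $wanted, @used ) = @_;
    my %taken = map { $_ => 1 } @used;
    return $wanted if defined $wanted && !$taken{$wanted};
    my $pos = 1;
    $pos++ while $taken{$pos};
    return $pos;
}

print choose_pos( 1, 1, 3 ), "\n";   # pos 1 is taken, so 2 is chosen
print choose_pos( 2, 1, 3 ), "\n";   # pos 2 is free, so it is kept
```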

 

I flagged the issue to Symplectic - the ticket reads:

#################

We've discovered an issue with the Elements/EPrints connector:

EPrint ID 1; document: A.pdf with pos=1.

EPrint ID 2; document: B.pdf with pos=1.

 

If both of these are attached to Elements records, which are then merged, the resulting EPrint ends up with two documents at pos=1.

This is not meant to happen, and will mean that one of the documents is unreachable.

 

The 'real' bug lies in EPrints - but the connector 'tickles' it when two records are merged - and the $document->clone() method is used (which possibly should be flagged as an 'internal' EPrints method).

#################

 

I've created a fix for the Symplectic connector - and submitted it to them for review/release as a new version of RT1.

As yet this hasn't been released.

 

The specific fix I have for the Symplectic connector is below (also saved as: https://gist.github.com/jesusbagpuss/d9e292bd4dd222f5199a36747989f708 in case the code gets mangled by email transport):

 

###########################################################################################
# Based on EPrints::DataObj::Document::clone
# NB Code duplication with Symplectic::RepoProcess::MergeManager
#
# Cloning documents can result in:
# - two documents with the same 'pos' field - and therefore sharing the same folder
# - 'spaces' in the document structure (e.g. pos=1 and pos=3, but no pos=2)
# this isn't what is needed. The code below manages these scenarios.
# EPrints' default behaviour is to remove the 'pos' during a clone *only* when the doc is being cloned to the same parent.
sub clone_document
{
        my( $self, %args ) = @_;
        my $eprint = $args{'eprint'};
        my $doc = $args{'doc'};
        my $reset_pos = $args{'reset_pos'};

        # Deep-copy the source document's data so the original is not modified
        my $data = EPrints::Utils::clone( $doc->{data} );

        # cloning within the same eprint, in which case get a new position!
        # (or the caller has explicitly asked for the pos to be reset)
        #if( defined $doc->parent && $eprint->id eq $doc->parent->id )
        if( ( defined $doc->parent && $eprint->id eq $doc->parent->id ) || $reset_pos )
        {
                $data->{pos} = undef;
        }

        $data->{eprintid} = $eprint->get_id;
        $data->{_parent} = $eprint;

        # First create a new doc object
        my $new_doc = $doc->{dataset}->create_object( $doc->{session}, $data );
        return undef if !defined $new_doc;

        my $ok = 1;

        # Copy files; stop at the first failure
        foreach my $file (@{$doc->get_value( "files" )})
        {
                $file->clone( $new_doc ) or $ok = 0, last;
        }

        # If any file failed to copy, remove the partial clone
        if( !$ok )
        {
                $new_doc->remove();
                return undef;
        }

        return $new_doc;
}
###########################################################################################

 

NB There are also some other changes required in the Symplectic connector to make this work. If you'd like more information about this fix, let me know!

 

If you want to know how many items in your repository are affected by the 'duplicated pos' issue, you can run the following SQL on the database:

SELECT
  eprintid, pos, count(*) as c
FROM
  document
GROUP BY
  eprintid, pos
HAVING c > 1;

 

If there are a few items, you may be able to resolve them by human effort.

If there are lots, then some scripting might be needed…
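
As a starting point for that sort of script, here is a minimal sketch in plain Perl. It does not touch EPrints or the database. The plan_renumbering helper, the [ docid, eprintid, pos ] row layout, and the rule that the lowest docid keeps its pos are all assumptions made for illustration, and it assumes pos is never NULL. It only works out which documents would need a new pos:

```perl
use strict;
use warnings;

# Rows as they might be read from the 'document' table: [ docid, eprintid, pos ].
# For each eprint, the lowest docid at a pos keeps it; later ones are given
# the lowest unused pos. Returns { docid => new_pos } for moved documents only.
sub plan_renumbering
{
    my( @rows ) = @_;
    my( %taken, @dupes, %plan );

    # First pass: record which positions are occupied, and collect clashes.
    foreach my $row ( sort { $a->[0] <=> $b->[0] } @rows )
    {
        my( $docid, $eprintid, $pos ) = @{$row};
        if( $taken{$eprintid}{$pos} ) { push @dupes, $row; }
        else { $taken{$eprintid}{$pos} = 1; }
    }

    # Second pass: give each clashing document the lowest free position.
    foreach my $row ( @dupes )
    {
        my( $docid, $eprintid ) = @{$row};
        my $pos = 1;
        $pos++ while $taken{$eprintid}{$pos};
        $taken{$eprintid}{$pos} = 1;
        $plan{$docid} = $pos;
    }
    return \%plan;
}

my $plan = plan_renumbering( [ 10, 2, 1 ], [ 11, 2, 1 ], [ 12, 2, 3 ] );
print "doc $_ -> pos $plan->{$_}\n" for sort { $a <=> $b } keys %{$plan};
# prints "doc 11 -> pos 2"
```

Applying such a plan (including dealing with the files that currently share a folder) would need to go through EPrints itself and is not shown here.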

 

Does that help at all?

Cheers,

John

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of James Kerwin
Sent: 15 August 2018 10:20
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Multiple Uploaded Files in One Directory

 

Morning all,

 

I'm very new to the world of EPrints and I'm still getting to grips with it.

 

I was alerted to a problem today where a file uploaded to EPrints is giving a "404 File not Found" warning when attempting to view/download the document.

 

On the repository server the document is present but appears in the same directory as another document (which can be accessed through EPrints). There is then a third document in a second directory that can be accessed.

 

Looking in the database I can see that all three documents are public and should be accessible.

 

As I understand it, the URL matches the file structure as:

 

 

And on the server the files are stored somewhere in the EPrints directory as:

 

[EP/ri/nt/sI/d]/DocPos/document.pdf

 

As in a one-to-one between DocPos and doc name (I've looked at some other examples with more than 2 documents in one EPrint and each one follows this so far).

 

Firstly, are my assumptions correct?

Has anybody had a similar thing happen before?

 

Thanks,

James


*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/