EPrints Technical Mailing List Archive


Message: #02849



[EP-tech] Re: Thousands of old eprints repropagated via OAI after epadmin redo_thumbnails &co.


Hi Florian,

Interesting scenario!

On 08/04/14 07:43, Florian Heß wrote:
Hi,

is there an option (am I just missing it? I'm using EPrints v3.3.10) to
leave the current lastmod timestamp untouched when running an epadmin
command or a similar routine automated by the tools that ship with
EPrints? We have needed in the past, and will still need, to
batch-process plenty of eprints, with epadmin redo_thumbnails for
instance. This results in, e.g., their being renotified via our
aggregator for freshly acquired media (the RSS feed and the mail channel
are both limited to 1000 items per request, so some really fresh ones
might be pushed off the list). Client OAI harvesters might treat them as
new too, which would not be very user-friendly.

In the OAI scenario, I think the OAI clients are at fault: updating the lastmod timestamp doesn't change the resource's unique identifier, which is what a client should use to tell whether an item is new or has been updated.

But I agree that certain actions shouldn't update the lastmod field (cf. below).

Thinking about it, I would even prefer EPrints to update it only when
a non-admin user has acted upon an eprint, i.e. when they have changed
metadata. But sometimes the admin might indeed want to touch eprints
visibly, e.g. when they have changed field values using the regular
workflow or when they explicitly opt in to that.

To put it in a nutshell, I wish I could use the EPrints API this way:

     use EPrints qw(no_autoupdate_lastmod);

     $dataobj->commit(); # stealth update if $dataobj in storage
     $dataobj->commit({ update_lastmod => 1 });
         # opt-in overwrite default {update_lastmod}
         #     = !exists $import_opts{no_autoupdate_lastmod}

In order to ensure that changes made by an admin are still visible in
terms of database-level debugging or "forensics", my idea is to have an
API-hidden and unprocessed native TIMESTAMP field, say "sql_updated",
and have it updated independently by means of the database engine.
(AFAIK, MySQL implies out-of-the-box "ON UPDATE CURRENT_TIMESTAMP()" for
the first TIMESTAMP field of a table.)

There's a "non_volatile_change" flag you can set (grep for it in DataObj/EPrint.pm), which does pretty much the same as "no_autoupdate_lastmod".
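
To make the flag idea concrete, here is a minimal, self-contained Perl sketch. ToyEPrint, its field names and its list of volatile fields are all made up for illustration; this is not the EPrints class, and the real semantics of non_volatile_change in 3.3.10 should be checked in DataObj/EPrint.pm. The sketch assumes the flag is raised whenever a non-volatile field changes, and that commit only refreshes lastmod while the flag is raised:

```perl
use strict;
use warnings;

# Toy stand-in for a data object; NOT the EPrints class.
package ToyEPrint;

# Fields treated as volatile in this sketch (hypothetical list).
my %VOLATILE = ( thumbnail_state => 1 );

sub new {
    my ( $class, %data ) = @_;
    return bless { data => {%data}, changed => {}, non_volatile_change => 0 }, $class;
}

sub set_value {
    my ( $self, $field, $value ) = @_;
    $self->{data}{$field}    = $value;
    $self->{changed}{$field} = 1;
    # Only a change to a non-volatile field raises the flag.
    $self->{non_volatile_change} = 1 unless $VOLATILE{$field};
    return;
}

sub commit {
    my ( $self, $now ) = @_;
    # lastmod is refreshed only while the flag is raised.
    $self->{data}{lastmod} = $now if $self->{non_volatile_change};
    $self->{changed}             = {};
    $self->{non_volatile_change} = 0;
    return 1;
}

package main;

my $eprint = ToyEPrint->new( title => 'Old title', lastmod => '2013-01-01 00:00:00' );

# A volatile-only change (e.g. regenerated previews) leaves lastmod alone.
$eprint->set_value( thumbnail_state => 'rebuilt' );
$eprint->commit('2014-04-08 09:00:00');
my $after_volatile = $eprint->{data}{lastmod};

# A metadata change bumps lastmod.
$eprint->set_value( title => 'New title' );
$eprint->commit('2014-04-08 10:00:00');
my $after_metadata = $eprint->{data}{lastmod};

print "$after_volatile\n$after_metadata\n";
```

Under these assumptions, a batch script that wanted a "stealth" commit would clear the flag on the object just before committing; whether that works on a real eprint object is something to verify against the source.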

I don't see a need for another timestamp, but I agree that the behaviour around lastmod could be reviewed. Also, I don't think whether a field is updated should depend on which part of the system you're using (workflow etc.) or on which user is modifying a resource. The behaviour should be consistent and intuitive (and handled at a low level for such system/internal fields).

What about reviewing which actions should update lastmod and which ones should NOT update lastmod?

I think that lastmod should be updated when the metadata is modified and/or when the file content is changed. Hence, for the available epadmin functions:

- rebuild_triples: no metadata/content change => no lastmod update
- recommit: by definition, this action should touch lastmod
- reorder: re-create the order values for searching => no lastmod update
- reindex: same as above => no lastmod update
- redo_mime_type: might modify the Document's mime type => update lastmod when the mime type is updated
- redo_thumbnails: generation of volatile files for previewing => no lastmod update
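
The proposal in the list above could be written down as a small lookup table. This is only a sketch of the suggested policy, not of current EPrints behaviour; %LASTMOD_POLICY and should_update_lastmod() are hypothetical names, not part of the EPrints API:

```perl
use strict;
use warnings;

# Proposed policy per epadmin action: 'never', 'always' or 'if_changed'.
my %LASTMOD_POLICY = (
    rebuild_triples => 'never',
    recommit        => 'always',
    reorder         => 'never',
    reindex         => 'never',
    redo_mime_type  => 'if_changed',
    redo_thumbnails => 'never',
);

# Hypothetical helper: should this action touch lastmod?
# $value_changed says whether the action actually modified a stored value.
sub should_update_lastmod {
    my ( $action, $value_changed ) = @_;
    my $policy = $LASTMOD_POLICY{$action};
    die "unknown epadmin action: $action\n" unless defined $policy;
    return 1 if $policy eq 'always';
    return 0 if $policy eq 'never';
    return $value_changed ? 1 : 0;
}

# Print the policy table, one action per line.
for my $action ( sort keys %LASTMOD_POLICY ) {
    printf "%-16s %s\n", $action, $LASTMOD_POLICY{$action};
}
```

Keeping the policy in one table would make it easy to extend once other actions have been reviewed.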

What do you reckon? Which other actions need to be reviewed/included here?

By the way, I guess there is no way to restore the timestamps other
than from backup dumps, is there? Is there already a way to commit an
eprint explicitly without updating the lastmod timestamp, which I could
use in future to prevent this?

You might/should be able to recover the timestamps by querying the "history" dataset, which keeps records of changes to eprint objects alongside their revision number (which is also stored in the eprint).
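
As a rough sketch of that recovery, assume the history records have already been exported into plain Perl hashes. The field names (objectid, revision, timestamp), the sample rows and the batch start time are all assumptions for illustration, and whether a given batch run wrote history records at all would need checking. The idea is simply to pick, for each eprint, the latest recorded change from before the batch job started:

```perl
use strict;
use warnings;

# Hypothetical history rows (field names assumed), one per recorded change.
my @history = (
    { objectid => 101, revision => 3, timestamp => '2013-05-02 11:20:00' },
    { objectid => 101, revision => 4, timestamp => '2014-04-07 23:15:00' },    # batch run
    { objectid => 102, revision => 7, timestamp => '2012-11-30 08:00:00' },
    { objectid => 102, revision => 8, timestamp => '2014-04-07 23:16:00' },    # batch run
);

# Assumed start time of the batch job.
my $batch_start = '2014-04-07 23:00:00';

# For each object, return the latest timestamp strictly before $cutoff.
# Timestamps in this fixed format compare correctly as strings.
sub last_change_before {
    my ( $cutoff, @rows ) = @_;
    my %latest;
    for my $row (@rows) {
        next unless $row->{timestamp} lt $cutoff;
        my $id = $row->{objectid};
        $latest{$id} = $row->{timestamp}
            if !defined $latest{$id} || $row->{timestamp} gt $latest{$id};
    }
    return \%latest;
}

my $restore = last_change_before( $batch_start, @history );
print "$_ => $restore->{$_}\n" for sort { $a <=> $b } keys %$restore;
```

The resulting map of eprint id to timestamp is a candidate list of lastmod values to write back, ideally without triggering another lastmod update.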

By setting the non_volatile_change flag you should be able to avoid the automatic update of lastmod.

I can create new github issues once we're happy with the revised behaviour.

Seb.



Regards
Florian