EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #00686



[EP-tech] IRStats / Access log issue


Hi All,

   I think I've found a potential issue that may affect users of the IRStats module and that also relates to the access logging components of EPrints. I noticed it after a status monitor on our repository showed an extended period of very high transaction rates against the back-end MySQL server. The issue is exposed by a loop in the update subroutine of the IRStats Access.pm module, which migrates access counts from the EPrints access log table to the main stats table. In particular, the loop segment that iterates:

sub update
{
...
        # Do chunks of 100,000 records because we can potentially be dealing with
        # millions of records
        for(my $accessid = $highest_destination_access_id; $accessid < $highest_source_access_id;)
        {
                $session->log("Processing from $accessid to $highest_source_access_id");

##because it's the first update, do twice
                $sql = "SELECT * FROM " . $database->quote_identifier($source_table) . " WHERE " .
                        $database->quote_identifier('accessid') . " > $accessid ORDER BY " .
                        $database->quote_identifier('accessid') . " ASC LIMIT 100000";
                $query = $database->do_sql($sql);

                while (my $row = $query->fetchrow_hashref()){

                        next unless valid_accesslog_entry($row);
                        my %hit = %$row;
                        $accessid = $hit{accessid};
...

   Across both loops, the $accessid value is only updated if the current fetched row is valid according to the valid_accesslog_entry() subroutine. Most rows are valid; however, we have noted that some access hits (rightly or wrongly) come from sources with masked or empty useragent values. These appear to be stored in the access table with NULL in requester_user_agent, which, when returned as undef by fetchrow_hashref(), causes valid_accesslog_entry() to fail. If the last record in a subset to be migrated ($accessid == ($highest_source_access_id - 1)) is such a record, $accessid is never advanced, so the outer loop cannot exit until a client with a valid useragent makes a further page access. While stuck in the loop, the SQL query is issued repeatedly, hammering the back-end database.

   As a resolution step, I'm looking at adding sanity checking to all access values written to the DB (./perl_lib/EPrints/Apache/LogHandler.pm:_create_access()). Is there perhaps a less invasive fix that could be carried forward across upgrades?
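   One possible direction, sketched here under assumptions rather than taken from the IRStats source, is to advance the cursor before the validity check, so that a trailing invalid row still moves $accessid forward. The sketch below models the access table as an in-memory array. The names fetch_chunk(), is_valid_entry() and migrate() are hypothetical stand-ins for the SQL query, valid_accesslog_entry() and the update loop; they are not part of EPrints or IRStats.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the access table. The last row has an undefined
# user agent, as a NULL requester_user_agent column would be returned.
my @access_rows = (
    { accessid => 1, requester_user_agent => 'Mozilla/5.0' },
    { accessid => 2, requester_user_agent => 'curl/8.0' },
    { accessid => 3, requester_user_agent => undef },
);

# Simplified stand-in for valid_accesslog_entry(); the real checks live in IRStats.
sub is_valid_entry {
    my ($row) = @_;
    return defined $row->{requester_user_agent}
        && length $row->{requester_user_agent};
}

# Stand-in for "SELECT ... WHERE accessid > ? ORDER BY accessid ASC LIMIT ?".
sub fetch_chunk {
    my ($after_id, $limit) = @_;
    my @chunk = grep { $_->{accessid} > $after_id } @access_rows;
    splice(@chunk, $limit) if @chunk > $limit;
    return @chunk;
}

# Model of the outer/inner loop pair, with the cursor advanced for every
# fetched row, valid or not.
sub migrate {
    my ($start_id, $highest_source_id, $chunk_size) = @_;
    my @migrated;
    my $queries  = 0;
    my $accessid = $start_id;

    while ($accessid < $highest_source_id) {
        my @chunk = fetch_chunk($accessid, $chunk_size);
        $queries++;
        last unless @chunk;    # defensive exit if nothing is left to read

        for my $row (@chunk) {
            $accessid = $row->{accessid};    # advance BEFORE the validity check
            next unless is_valid_entry($row);
            push @migrated, $row->{accessid};
        }
    }
    return (\@migrated, $accessid, $queries);
}

my ($migrated, $cursor, $queries) = migrate(0, 3, 2);
print "migrated: @$migrated; cursor: $cursor; queries: $queries\n";
```

   In this model, invalid rows are still skipped exactly as before; the only change is that they no longer pin the cursor. Whether the same reordering is safe in the real update subroutine depends on the code elided above, and since it would be a local patch to Access.pm it may not survive upgrades any better than a change to LogHandler.pm.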

Cheers,
Casey


Casey Hilliard
PC Consultant,
Health Sciences Library / QE2 Systems,
Memorial University
Phone: 709-777-2387 (HSL)
Phone: 709-864-6267 (QE2)

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2011.php
