EPrints Technical Mailing List Archive
Message: #01583
[EP-tech] Re: RFC access log table
- To: <eprints-tech@ecs.soton.ac.uk>
- Subject: [EP-tech] Re: RFC access log table
- From: Tim Brody <tdb2@ecs.soton.ac.uk>
- Date: Fri, 15 Feb 2013 14:00:49 +0000
On Fri, 15 Feb 2013 10:30:24 +0000, "Alan.Stiles" <Alan.Stiles@open.ac.uk> wrote:
> Hi Tim,
>
> Having a quick look through the access table, it might also be nice if
> there was the option to include / exclude a list of known robots and
> spiders from the csv dumps, and possibly just to strip them from the
> access table outside of the dumps, keeping it to a more manageable size
> without losing 'relevant' information - Bing and Yandex appear to be among
> our worst offenders.

The robots list we use is from Project COUNTER, but hasn't been updated
since Jan 2011. You can see it here:
https://github.com/eprints/eprints/blob/access_log/perl_lib/EPrints/Apache/LogHandler.pm#L253

The priority for COUNTER appears to be consistency over (necessarily)
accuracy.

I've created two tools, working on this branch (names may change ...):
https://github.com/eprints/eprints/commits/access_log

dump_access
 - write access log entries to CSV files "access_YYYYMM.csv"
 - remove written entries from the database
filter_access
 - re-run the robots filtering based on the LogHandler list
 - filter repeated requests based on a time-window

These use a new CSV exporter I'm working on, but could use the existing
CSV. (I'm working on a publicly usable CSV export/import, which only
operates on user-importable fields).

/Tim.

> -----Original Message-----
> From: Tim Brody [mailto:tdb2@ecs.soton.ac.uk]
> Sent: 15 February 2013 09:32
> To: eprints-tech@ecs.soton.ac.uk
> Subject: [EP-tech] Re: RFC access log table
>
> Hi,
>
> Yes, there is nothing in the core that relies on data in access*. The
> IRStats 1 & 2 use access to create their summary data.
>
> It looks like the best solution is to provide a tool to periodically dump
> historic access data to files, but that it is still useful to keep
> "current" (defined by config) data in the database.
>
> All the best,
> Tim.
>
> On Fri, 15 Feb 2013 08:13:52 +0100, Yuri <yurj@alfa.it> wrote:
>> We've a test server which is a clone of the production server. Can I
>> empty those access tables safely to save space? :) can I do a "delete *
>> from access" without any issue? The same for access__ordervalues_en and
>> all the languages?
>>
>> On 15/02/2013 03:13, Mark Gregson wrote:
>>> Hi Tim
>>>
>>> Because of the DB backup issues we invested some time a while ago in some
>>> scripts for archiving the access data off to monthly dumps and for
>>> restoring it (if required, say by the need to have IRStats reprocess all
>>> data). These scripts are not actually in production use because I haven't
>>> had time to test it to my satisfaction (sorry Nick!).
>>>
>>> CSV is a more accessible format than a MySQL dump, which may be a
>>> benefit.
>>>
>>> We are using IRStats for statistics which uses the access table but I
>>> guess this will be easily updated with a new parser. We also do some
>>> custom logging to the access table for reporting on outbound link clicks
>>> via IRStats. This logging is handled via EPrints::Apache::LogHandler.
>>>
>>> Cheers
>>> Mark
>>>
>>>
>>> -----Original Message-----
>>> From: eprints-tech-bounces@ecs.soton.ac.uk
>>> [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Tim Brody
>>> Sent: Thursday, 14 February 2013 8:01 PM
>>> To: eprints-tech@ecs.soton.ac.uk
>>> Subject: [EP-tech] RFC access log table
>>>
>>> Hi All,
>>>
>>> I'm thinking about the access log table and how it can be made
>>> sustainable.
>>>
>>> What I'm suggesting is to write accesses to CSV-formatted log files, one
>>> file per month. What I don't know is whether anyone is relying on the
>>> database table for generating statistics?
>>>
>>> The problem the access log table creates is in backing-up the EPrints
>>> database.
>>>
>>> I'd appreciate any thoughts/comments.
>>>
>>> --
>>> All the best,
>>> Tim
>>>
>>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>>> *** Archive: http://www.eprints.org/tech.php/
>>> *** EPrints community wiki: http://wiki.eprints.org/
>
> --
> All the best,
> Tim.

--
All the best,
Tim.
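
For readers who want a concrete picture of the two tools described in the top reply, here is a minimal Python sketch of the same ideas. It is not the code on the access_log branch (that is Perl, and the message does not show it). The column names (datestamp, ip, eprintid, user_agent), the two robot markers, and the one-hour default window are all assumptions made for illustration; the real robots list is the one in EPrints::Apache::LogHandler. Deleting dumped rows from the database is left out.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical column names; the real access table may differ.
FIELDS = ["datestamp", "ip", "eprintid", "user_agent"]

# Hypothetical robot markers, standing in for the LogHandler list.
ROBOT_MARKERS = ("bingbot", "yandex")


def is_robot(user_agent):
    """True if the user agent contains one of the robot markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def filter_access(rows, window=timedelta(hours=1)):
    """Drop robot hits, then drop repeats of the same (ip, eprintid)
    that fall within `window` of the last hit that was kept."""
    last_kept = {}
    kept = []
    for row in sorted(rows, key=lambda r: r["datestamp"]):
        if is_robot(row["user_agent"]):
            continue
        key = (row["ip"], row["eprintid"])
        previous = last_kept.get(key)
        if previous is not None and row["datestamp"] - previous < window:
            continue
        last_kept[key] = row["datestamp"]
        kept.append(row)
    return kept


def dump_by_month(rows):
    """Return {filename: csv_text}, one "access_YYYYMM.csv" per month."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["datestamp"].strftime("access_%Y%m.csv")].append(row)
    files = {}
    for name, month_rows in by_month.items():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(month_rows)
        files[name] = buf.getvalue()
    return files
```

Running filter_access before dump_by_month keeps robot and repeat hits out of the monthly files; running the dump first would instead preserve the raw log and allow re-filtering later with an updated robots list.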