EPrints Technical Mailing List Archive
Message: #05841
Re: [EP-tech] Seeing unusually high downloads in IRStats
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] Seeing unusually high downloads in IRStats
- From: John Salter <J.Salter@leeds.ac.uk>
- Date: Tue, 26 Jul 2016 09:11:21 +0000
Hi Betsy,

Also worth noting: EPrints itself does some robot filtering before adding things to the ‘Access’ dataset:
https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Apache/LogHandler.pm#L61-L98
(this list is a bit old; see https://github.com/eprints/eprints/issues/239 and https://github.com/eprints/eprints/issues/311)

I’m not sure if the reported behaviour from these IPs is this:
- Some activity with robot UA
- Some more activity with browser UA

If this *is* the pattern, making sure that whatever the robot UA is appears in the above code should mean that subsequent requests from the same IP (with a non-robot UA) would also get filtered for a certain amount of time (based on a brief analysis of the code. I may be reading it wrong!).

If the robots are already in the Access table, my previous email will help filter them from IRStats2. The info above may help prevent them getting into the Access dataset (which is processed by IRStats2) in the first place.

Cheers,
John

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of John Salter

Hi Betsy,

As these requests do not identify themselves as robots in their User-Agent, it’s not as simple as adding a new UA to a list.

The user-agent filtering is done by:
EPrints::Plugin::Stats::Filter::Robots (~/lib/plugins/EPrints/Plugin/Stats/Filter/Robots.pm)

I think that you should duplicate this to a new filter:
EPrints::Plugin::Stats::Filter::IP

As the list of bad IPs might be quite dynamic, you might want to make the equivalent of the @ROBOTS into a config variable?

As to the question about applying the new filters to the current dataset, I think you can re-process all the stats, but this may take some time on a busy/established system!

Cheers,
John

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
On Behalf Of Coles, Elizabeth A. (Betsy)

Forwarding from JISC-REPOSITORIES list. We’ve been seeing this in California too, and our IRStats2 counts have been through the roof for the last couple of weeks.

Can anyone tell me how to filter out these robots in IRStats2? And how to clean the access file so that our IRStats2 reports are not distorted by this deluge? I assume I’d want to delete all entries with a requester_id in the table below and rerun IRStats2 setup from scratch.

Thanks,
Betsy Coles
Caltech – Digital Library Development

From: Repositories discussion list [mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK]
On Behalf Of Hilary Jones

Hi everyone,

There was a discussion, via UKCORR mailing list, on why there are exceptionally high downloads being seen this week in IRStats and what might be causing it. After some investigation we have found that the unusually high downloads are down to four IP ranges:

These IPs have been systematically trawling and downloading files from many UK repositories. Looking at their User Agent strings, they do not declare themselves as bots but masquerade as normal users.

Happily, the IRUS-UK ingest has been filtering out these robotic downloads, so you won’t see a massive spike in your IRUS-UK stats.

We hope this is of help.

Best wishes
Hilary
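
As a rough illustration of the IP-based filter suggested in the thread (a configurable list of bad ranges, applied to access records whose User-Agent looks like a normal browser), the logic might be sketched as below. This is Python for illustration only, not an EPrints plugin: a real EPrints::Plugin::Stats::Filter::IP would be written in Perl. The names BLOCKED_RANGES, Access, compile_ranges, is_blocked and filter_accesses are hypothetical, the record layout is an assumption loosely based on the requester_id field mentioned above, and the ranges shown are reserved documentation networks, not the four ranges from the original message.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical config value. The real ranges are not given in the thread,
# so these are reserved documentation networks (RFC 5737), used as placeholders.
BLOCKED_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


@dataclass
class Access:
    """Assumed minimal shape of one access record (not the real EPrints schema)."""
    requester_id: str  # client IP address
    user_agent: str


def compile_ranges(ranges):
    """Parse the CIDR strings once so that each lookup is cheap."""
    return [ipaddress.ip_network(r) for r in ranges]


def is_blocked(ip, networks):
    """Return True if the address falls inside any configured range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def filter_accesses(accesses, networks):
    """Keep only records whose requester is outside every blocked range."""
    return [a for a in accesses if not is_blocked(a.requester_id, networks)]


networks = compile_ranges(BLOCKED_RANGES)
records = [
    Access("192.0.2.17", "Mozilla/5.0"),      # blocked range, browser-like UA
    Access("203.0.113.5", "Mozilla/5.0"),     # outside every blocked range
    Access("198.51.100.200", "Mozilla/5.0"),  # outside the /25 (.0 to .127)
]
kept = filter_accesses(records, networks)
print([a.requester_id for a in kept])  # ['203.0.113.5', '198.51.100.200']
```

Keeping the ranges in a single config value means the list can be updated as new offending ranges are reported, without touching the filter logic; the same check could be applied when re-processing existing access records.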
- References:
- [EP-tech] Seeing unusually high downloads in IRStats
- From: "Coles, Elizabeth A. (Betsy)" <bcoles@caltech.edu>
- Re: [EP-tech] Seeing unusually high downloads in IRStats
- From: John Salter <J.Salter@leeds.ac.uk>
- [EP-tech] Seeing unusually high downloads in IRStats