EPrints Technical Mailing List Archive

Message: #05848


Re: [EP-tech] Seeing unusually high downloads in IRStats


With Apache:

RewriteEngine On
RewriteCond %{HTTP:User-Agent} (?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot|searchmetrics|SearchmetricsBot|Baidu) [NC]
RewriteRule .? - [F]

Just add the offending user agents to the pattern.

Problem solved :-D
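
The rule above only matches on the User-Agent header. The quoted message further down reports that the four ranges in question do not identify themselves as bots, so a variant that matches on the client address may be needed. A minimal sketch, assuming mod_rewrite is enabled, that the server sees the real client address (no reverse proxy in front), and that blocking each reported range as a whole /24 is acceptable:

```apache
RewriteEngine On
# Assumption: the four ranges are the ones listed in the quoted
# JISC-REPOSITORIES message below, each treated as a full /24.
RewriteCond %{REMOTE_ADDR} ^103\.25\.156\. [OR]
RewriteCond %{REMOTE_ADDR} ^103\.36\.96\. [OR]
RewriteCond %{REMOTE_ADDR} ^111\.221\.28\. [OR]
RewriteCond %{REMOTE_ADDR} ^202\.89\.235\.
# Answer matching requests with 403 Forbidden.
RewriteRule .? - [F]
```

This only refuses future requests; it does not remove entries that have already been recorded in the statistics.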

On 26/07/2016 14:13, Graham, Clinton T wrote:
The University of Pittsburgh opened ticket UCM000000270852 with Bing Webmaster Support last week regarding this and received the following response:

“Thank you for contacting Bing Webmaster Support. The activity you are seeing is most likely caused by one of our bots used for verifying your site rather than indexing your site as Bingbot does. These crawlers do not have the same UA, and are in place to make sure the verification aspects of your site are in place.”

Yesterday, we requested additional information on what “verification” really means, and described the problem of conflating user-generated activity with bot-generated activity, especially for the scholarly publication process.

I’ll reply again here if this support request goes anywhere, but perhaps others might be interested in similarly engaging Bing Webmaster Support?

Enjoy,

- Clinton Graham

Systems Developer

University of Pittsburgh | University Library System

412-383-1057

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Coles, Elizabeth A. (Betsy)
Sent: Monday, July 25, 2016 7:45 PM
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Seeing unusually high downloads in IRStats

Forwarding from JISC-REPOSITORIES list – we’ve been seeing this in California too, and our IRStats2 counts have been through the roof for the last couple of weeks.

Can anyone tell me how to filter out these robots in IRStats2? And how to clean the access file so that our IRStats2 reports are not distorted by this deluge? I assume I’d want to delete all entries with a requester_id in the table below and rerun IRStats2 setup from scratch.

Thanks,

Betsy Coles

Caltech – Digital Library Development

bcoles@caltech.edu

From: Repositories discussion list [mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK] On Behalf Of Hilary Jones
Sent: Friday, July 15, 2016 3:43 AM
To: JISC-REPOSITORIES@JISCMAIL.AC.UK
Subject: Seeing unusually high downloads in IRStats - IRUS-UK's explanation and why this isn't affecting IRUS-UK stats

Hi everyone,

There was a discussion, via the UKCORR mailing list, on why exceptionally high downloads are being seen this week in IRStats and what might be causing it.

After some investigation we have found that the unusually high downloads are down to four IP ranges:

IP range        Organisation            Location   No. IP addresses
103.25.156.*    Microsoft Bingbot       China      128
103.36.96.*     Microsoft Corporation   China      216
111.221.28.*    Microsoft Bingbot       China      256
202.89.235.*    Microsoft Bingbot       China      80

These IPs have been systematically trawling and downloading files from many UK repositories. Their User-Agent strings do not declare them as bots; instead they masquerade as normal users.

Happily, the IRUS-UK ingest has been filtering out these robotic downloads, so you won’t see a massive spike in your IRUS-UK stats.

We hope this is of help.

Best wishes

Hilary

Jisc
Hilary Jones
Services and Projects Support

0161 413 7541
Skype hilary.jones@jisc.ac.uk
Twitter @JonesHilaryJ
6th Floor Churchgate House, 56 Oxford Street, Manchester, M1 6EU

jisc.ac.uk


*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/