EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #05849



Re: [EP-tech] Seeing unusually high downloads in IRStats


What do you propose the User-Agent match should be?  We found each of the following coming from Bing, among others:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
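
As a rough illustration (not part of the original exchange), the Python sketch below tests the pattern from the Apache rule quoted further down this thread against the six strings listed above. The names `BLOCK_PATTERN`, `OBSERVED_USER_AGENTS`, and `is_blocked` are hypothetical, and Python's `re.IGNORECASE` is assumed to stand in for Apache's `[NC]` flag.

```python
import re

# Pattern copied from the quoted RewriteCond; IGNORECASE approximates [NC].
BLOCK_PATTERN = re.compile(
    r"(?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot"
    r"|searchmetrics|SearchmetricsBot|Baidu)",
    re.IGNORECASE,
)

# User-Agent strings reported in this message as coming from Bing.
OBSERVED_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
]


def is_blocked(user_agent):
    """Return True if the User-Agent string matches the block pattern."""
    return BLOCK_PATTERN.search(user_agent) is not None


blocked = [ua for ua in OBSERVED_USER_AGENTS if is_blocked(ua)]
print(len(blocked), "of", len(OBSERVED_USER_AGENTS), "observed strings match")
# prints "0 of 6 observed strings match"
```

None of the observed strings contains a crawler name, so a name-based User-Agent rule of this kind would let all of them through.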

We asked Bing Support either to describe any existing pattern by which these requests can be identified, or to comply with the use of the From header described in RFC 2616 section 14.22, so that we could recommend to Project COUNTER that this header be considered for bot identification.
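
For illustration only, here is a minimal Python sketch of how a From header could feed bot identification if crawlers supplied one. The function name `robot_contact` and the example address are hypothetical; nothing here reflects what Bing actually sends or what Project COUNTER specifies.

```python
from email.utils import parseaddr


def robot_contact(headers):
    """Return the e-mail address given in a From request header, or None.

    `headers` is assumed to be a plain dict of request header names to values.
    Header names are compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == "from":
            _, address = parseaddr(value)
            # Treat the header as usable only if it looks like an address.
            return address if "@" in address else None
    return None


# A request that declares a contact address versus one that does not.
declared = {"User-Agent": "ExampleBot/1.0", "From": "crawler-admin@example.org"}
undeclared = {"User-Agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36"}

print(robot_contact(declared))
print(robot_contact(undeclared))
```

A statistics filter could then treat any request carrying such an address as robotic, regardless of its User-Agent string.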

Enjoy,

- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Yuri
Sent: Tuesday, July 26, 2016 9:21 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Seeing unusually high downloads in IRStats

With Apache:

# Refuse (403 Forbidden) any request whose User-Agent contains a listed crawler name.
RewriteEngine On
RewriteCond %{HTTP:User-Agent} (?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot|searchmetrics|SearchmetricsBot|Baidu) [NC]
RewriteRule .? - [F]

Just add the offending user agents to the list.

Problem solved :-D

On 26/07/2016 14:13, Graham, Clinton T wrote:
>
> The University of Pittsburgh opened ticket UCM000000270852 with Bing 
> Webmaster Support last week regarding this and received the following 
> response:
>
> Thank you for contacting Bing Webmaster Support.  The activity you are 
> seeing is most likely caused by one of our bots used for verifying 
> your site rather than indexing your site as Bingbot does.  These 
> crawlers do not have the same UA, and are in place to make sure the 
> verification aspects of your site are in place.
>
> Yesterday, we requested additional information on what "verification" 
> really means, and described the problem of conflating user-generated 
> activity with bot-generated activity, especially for the scholarly 
> publication process.
>
> I'll reply again here if this support request goes anywhere, but 
> perhaps others might be interested in similarly engaging Bing 
> Webmaster Support?
>
> Enjoy,
>
> - Clinton Graham
>
> Systems Developer
>
> University of Pittsburgh | University Library System
>
> 412-383-1057
>
> *From:*eprints-tech-bounces@ecs.soton.ac.uk 
> [mailto:eprints-tech-bounces@ecs.soton.ac.uk] *On Behalf Of *Coles, 
> Elizabeth A. (Betsy)
> *Sent:* Monday, July 25, 2016 7:45 PM
> *To:* eprints-tech@ecs.soton.ac.uk
> *Subject:* [EP-tech] Seeing unusually high downloads in IRStats
>
> Forwarding from JISC-REPOSITORIES list - we've been seeing this in 
> California too, and our IRStats2 counts are through the roof for the 
> last couple of weeks.
>
> Can anyone tell me how to filter out these robots in IRStats2?  And 
> how to clean the access file so that our IRStats2 reports are not 
> distorted by this deluge?  I assume I'd want to delete all entries 
> with a requester_id in the table below and rerun the IRStats2 setup 
> from scratch.
>
> Thanks,
>
> Betsy Coles
>
> Caltech - Digital Library Development
>
> bcoles@caltech.edu <mailto:bcoles@caltech.edu>
>
> *From:* Repositories discussion list 
> [mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK] *On Behalf Of *Hilary Jones
> *Sent:* Friday, July 15, 2016 3:43 AM
> *To:* JISC-REPOSITORIES@JISCMAIL.AC.UK 
> <mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK>
> *Subject:* Seeing unusually high downloads in IRStats - IRUS-UK's 
> explanation and why this isn't affecting IRUS-UK stats
>
> Hi everyone,
>
> There was a discussion on the UKCORR mailing list about why 
> exceptionally high downloads are being seen this week in IRStats and 
> what might be causing them.
>
> After some investigation we have found that the unusually high 
> downloads are down to four IP ranges:
>
> IP range       Organisation            Location   No. IP addresses
> 103.25.156.*   Microsoft Bingbot       China      128
> 103.36.96.*    Microsoft Corporation   China      216
> 111.221.28.*   Microsoft Bingbot       China      256
> 202.89.235.*   Microsoft Bingbot       China      80
>
> These IPs have been systematically trawling and downloading files from 
> many UK repositories. Looking at their User Agent strings they do not 
> declare themselves as bots but masquerade as normal users.
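
As a sketch of how requests from these ranges might be flagged (not part of the original message), the Python below checks an address against the four ranges. It assumes each `x.y.z.*` entry means the whole /24 network; the names `SUSPECT_NETWORKS` and `is_suspect` are hypothetical, and this is not a description of how IRUS-UK or IRStats2 actually filters.

```python
import ipaddress

# Assumption: each "x.y.z.*" range in the table is the corresponding /24.
SUSPECT_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "103.25.156.0/24",
        "103.36.96.0/24",
        "111.221.28.0/24",
        "202.89.235.0/24",
    )
]


def is_suspect(ip_string):
    """Return True if the address falls inside one of the listed ranges."""
    try:
        address = ipaddress.ip_address(ip_string)
    except ValueError:
        # Not a parseable IP address, so it cannot match any range.
        return False
    return any(address in network for network in SUSPECT_NETWORKS)


# Example: keep only the requester addresses outside the listed ranges.
requesters = ["103.25.156.17", "192.0.2.10", "202.89.235.200"]
kept = [ip for ip in requesters if not is_suspect(ip)]
print(kept)  # prints ['192.0.2.10']
```

Matching on address rather than User-Agent sidesteps the problem that these clients do not declare themselves as bots, though the ranges would need to be kept up to date.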
>
> Happily, the IRUS-UK ingest has been filtering out these robotic 
> downloads, so you won't see a massive spike in your IRUS-UK stats.
>
> We hope this is of help.
>
> Best wishes
>
> Hilary
>
> Jisc <http://www.jisc.ac.uk/>
>
> *Hilary Jones*
> Services and Projects Support
>
> 0161 413 7541
> Skype hilary.jones@jisc.ac.uk <mailto:hilary.jones@jisc.ac.uk>
> Twitter @JonesHilaryJ
> 6th Floor Churchgate House, 56 Oxford Street, Manchester, M1  6EU
>
> *jisc.ac.uk* <http://www.jisc.ac.uk/>
>
> Jisc is a registered charity (number 1149740) and a company limited by 
> guarantee which is registered in England under Company No. 5747339, 
> VAT No. GB 882 5529 90. Jisc's registered office is: One Castlepark, 
> Tower Hill, Bristol, BS2 0JA. T 0203 697 5800. jisc.ac.uk 
> <http://www.jisc.ac.uk/>
>
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
