EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09784



Re: [EP-tech] Bots - Server Resources


Hi James,

robots.txt is often as useful as a chocolate teapot, even with Google, who do tend to respect it.  Obviously you don't want to block Google outright.  You could take a look at a new feature I have added for 3.4.6, which is yet to be fully released.  (A release candidate is available at https://github.com/eprints/eprints3.4/releases/tag/v3.4.6-rc1 but the codebase will change before the full release.)  The new feature is described in this issue:

https://github.com/eprints/eprints3.4/issues/388

There is a wiki page with a basic description of how this can be configured:

https://wiki.eprints.org/w/Restrict_paths.pl

The patch for this is available at:

 https://github.com/eprints/eprints3.4/commit/ba45f7a50f7a0151b338efbd5b52f362a97d7c8f 

This still has some debug code in it, so all you really need is the block of code in Apache/Rewrite.pm plus the template configuration file, which you need to add to your archive, uncomment, and fill in with the settings you want.  I don't think there is a big risk in deploying this.  If anything goes wrong, just remove the config file and the new code will not be called, so the core codebase will behave as it did before.

There was some discussion about this still hitting the Perl handler, but the issue is not the extra load that using the Perl handler adds; it is how hard the database is hammered to generate the data needed (in your case, for IRStats2).

Regards

David Newman

On 24/07/2024 14:28, James Kerwin wrote:
CAUTION: This e-mail originated outside the University of Southampton.
Hi everyone,

I'm having an incredibly rough time with my server. Apache keeps getting killed by the OOM Killer because I'm out of memory. Mostly I can restart Apache, but it also sometimes kills a process that leaves me unable to log on until IT turn the server off and on again (I'm unable to do this myself).

I've been watching top output all morning and see the memory and CPU usage shoot up for /usr/bin/apach running as the eprints user.

Looking in my /var/log/apache2/other_vhosts_access.log file I can see LOADS of requests for stats pages under /cgi/stats/report. I suspect it's a crawler as an army of humans could never submit so many requests. 

I do have a robots.txt file in /opt/eprints3/archives/uolrepo/html/en that specifically disallows the /cgi/ directory, but that is either incorrect, being ignored or (likely) I'm not understanding things correctly.
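One way to rule out the "incorrect" possibility is to feed the rules to a robots.txt parser and ask whether a given path is allowed. The following is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content shown is an assumption (the actual file was not posted), as is the /id/eprint/ path used for comparison; the /cgi/stats/report path and the GoogleOther agent name are taken from the log lines in this message.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: the real robots.txt was not posted; this is a typical
# "disallow /cgi/ for everyone" rule set used purely for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def is_allowed(robots_txt, agent, path):
    """Return True if the given rules permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Path taken from the access log in this message.
stats_allowed = is_allowed(
    ROBOTS_TXT, "GoogleOther", "/cgi/stats/report/eprint/3151894?range=2016"
)
# Hypothetical non-/cgi/ path, for comparison only.
abstract_allowed = is_allowed(ROBOTS_TXT, "GoogleOther", "/id/eprint/3151894")

print(stats_allowed, abstract_allowed)
```

If the stats path comes back as disallowed, the rules themselves are probably fine and the crawler is simply not honouring them. A check like this says nothing about whether the file is actually being served at /robots.txt, which is worth confirming separately.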

Here is an example of the requests:

livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:23 +0100] "GET /cgi/stats/report/eprint/3151894?range=2016 HTTP/1.0" 200 70049 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"

livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:23 +0100] "GET /cgi/stats/report/eprint/3108433/requests?range=2019 HTTP/1.0" 200 72878 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"

livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:26 +0100] "GET /cgi/stats/report/eprint/3006551?range=2017 HTTP/1.0" 200 70049 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"

livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:27 +0100] "GET /cgi/stats/report/eprint/3033512/compare_years?range=2021 HTTP/1.0" 200 87158 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"
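To confirm that the load really comes from one crawler rather than many clients, the log can be tallied by user agent for the stats paths. This is a rough sketch only: the regular expression assumes every line has the same vhost-prefixed combined layout as the examples above, and the function name and the prefix default are illustrative choices, not anything provided by EPrints.

```python
import re
from collections import Counter

# Matches lines laid out like the examples above:
# vhost:port ip - - [time] "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarise(lines, prefix="/cgi/stats/report"):
    """Count requests and bytes sent per user agent for paths under `prefix`."""
    hits = Counter()
    bytes_sent = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or not m["path"].startswith(prefix):
            continue  # skip unparseable lines and unrelated paths
        hits[m["agent"]] += 1
        if m["size"].isdigit():  # size is "-" when no body was sent
            bytes_sent[m["agent"]] += int(m["size"])
    return hits, bytes_sent


# Two of the log lines quoted in this message, used as sample input.
SAMPLE = [
    'livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:23 +0100] "GET /cgi/stats/report/eprint/3151894?range=2016 HTTP/1.0" 200 70049 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"',
    'livrepository.liverpool.ac.uk:443 138.253.158.16 - - [24/Jul/2024:13:53:23 +0100] "GET /cgi/stats/report/eprint/3108433/requests?range=2019 HTTP/1.0" 200 72878 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.126 Mobile Safari/537.36 (compatible; GoogleOther)"',
]

hits, bytes_sent = summarise(SAMPLE)
for agent, count in hits.most_common(5):
    print(count, bytes_sent[agent], agent)
```

For a real run, an open file handle on /var/log/apache2/other_vhosts_access.log can be passed in place of SAMPLE, since the function only iterates over lines.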

*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/