EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10125



RE: [EP-tech] DDoS of EPrints advanced search


CAUTION: This e-mail originated outside the University of Southampton.

Hi,
I've also been seeing the same behaviour, but I don't think the traffic is intended as a DDoS, although that is the result.
It appears to be some poorly behaved non-human activity. It doesn't follow the normal 'robots' rules, but it definitely doesn't look like a human making the requests.

One additional trait I noted is the 'cache=XXX' parameter being sent, which is often relatively old.

I added a script to my server to log the number of search cache tables, and the min/max IDs of them for each hour.
I plan to use this to redirect requests with 'old' cache ids in the query-string to a static page, which will describe (to a human) how to re-run their search, but not provide a clickable link to do so.
If others are also seeing this pattern, I can share my stuff once it's ready.
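As a sketch of what that hourly logging can look like: EPrints stores each search cache in a database table named cacheNNN, so counting tables and taking the min/max of the numeric suffixes gives the summary John describes. This is illustrative only; the printf below is sample data standing in for the real table list (e.g. from mysql -N <repo_db> -e "SHOW TABLES LIKE 'cache%'"), and the database name would be your repository's:

```shell
# Summarise EPrints search cache tables as "count min-id max-id".
# The printf is sample data standing in for the output of:
#   mysql -N <repo_db> -e "SHOW TABLES LIKE 'cache%'"
summary=$(printf 'cache101\ncache102\ncache250\n' \
  | grep -o '[0-9]*$' \
  | sort -n \
  | awk 'NR==1 { min = $1 } { max = $1; n++ } END { print n, min, max }')
echo "$summary"
```

Run hourly from cron with a timestamp prepended, this gives a rough picture of how quickly cache ids grow, and hence which ids in incoming requests are "old".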

On one of our repositories we were seeing thousands of search requests for a unique author surname - so I put a specific redirect (using an EPrints trigger) to the static browse page for that author.

There are other 'collection' type websites that are seeing similar activity. The Code4lib wiki has this page: https://wiki.code4lib.org/Blocking_Bots, and there is a code4lib Slack channel relating to this activity too.

Cheers,
John

John Salter
https://orcid.org/0000-0002-8611-8266

White Rose Libraries Technical Officer
Library and Research Management team, IT
University of Leeds


-----Original Message-----
From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of Jens Witzel
Sent: 28 May 2025 12:37
To: eprints-tech@ecs.soton.ac.uk
Subject: RE: [EP-tech] DDoS of EPrints advanced search

CAUTION: External Message. Use caution opening links and attachments.

Hi all

We can confirm David's observation (as my colleague Martin already posted), but for some of you it may help to look for botnets at the class-B level (first two octets):


$ grep "cgi\/search\/.*advanced" /var/log/httpd/access_log_xxxxxxx \
    | awk '{ split($1, ip, "."); class_b = ip[1] "." ip[2];
             class_b_count[class_b]++; unique_ips[class_b][$1] = 1 }
           END { for (class_b in class_b_count)
                   printf "%s\t\t%d\t\t%d\n", class_b, class_b_count[class_b], length(unique_ips[class_b]) }' \
    | sort -k2,2nr | head -n 10
177.37          8294            5644
179.125         7461            5304
187.19          5980            4061
191.5           5408            3226
130.60          5154            10
177.22          4212            2658
170.254         4181            2538
45.70           4157            2658
179.108         4123            2652
168.232         4105            2635
$

Here the 1st column is the class-B prefix, the 2nd the number of requests, and the 3rd the number of distinct IPs seen from that class-B; xxxxxxx is your access_log name extension. (Note: the unique_ips[class_b][$1] array-of-arrays syntax requires GNU awk 4.0 or later.)

Good luck
Jens

--
Jens Witzel
Universität Zürich
Zentrale Informatik
Pfingstweidstrasse 60B
CH-8005 Zürich

mail:  jens.witzel@uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch/

-----Original Message-----
From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of David R Newman
Sent: Wednesday, 28 May 2025 11:26
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] DDoS of EPrints advanced search

Hi all,

We have been observing that a lot of EPrints repositories have been receiving Distributed Denial-of-Service (DDoS) attacks on their advanced search.  As running advanced search queries can put quite a lot of load on the server, this can lead to the repository becoming unresponsive.

Analysis of the requests has shown that these are typically bots working their way through the pages of results for the same search, rather than lots of individual searches. Typically, each affected repository will only see a few, maybe up to a dozen, distinct searches.  The following command will show you which searches these are (you may need to adapt it if your access log is elsewhere):

grep "GET /cgi/search/archive/advanced" /var/log/httpd/ssl_access_log | grep -v " 403 " | grep -o 'exp=[^&]\+' | sort | uniq -c | sort -n

Typically, /var/log/httpd/ssl_access_log will only cover the requests since sometime just after midnight on the previous Sunday, if you have the default logrotate configuration in place.  So, as it is Wednesday now, you should have a decent sample to analyse.

What we have done with these results is add a LocationMatch block inside the VirtualHost block in the EPrints Apache configuration. Typically, adding this to the HTTPS VirtualHost has been sufficient, but you may also need to add it to the HTTP VirtualHost.  Therefore, it may be worth putting the LocationMatch configuration in a separate file and then including it under both VirtualHost blocks.
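That separate-file layout could look something like this (a sketch; the file path and name are hypothetical, and the existing repository directives are elided):

```apache
# /etc/httpd/conf.d/blocked-searches.conf (hypothetical) would hold the
# <LocationMatch> rules; include it from both VirtualHost blocks:
<VirtualHost *:443>
    # ... existing HTTPS repository configuration ...
    Include /etc/httpd/conf.d/blocked-searches.conf
</VirtualHost>

<VirtualHost *:80>
    # ... existing HTTP repository configuration ...
    Include /etc/httpd/conf.d/blocked-searches.conf
</VirtualHost>
```

This way any later changes to the blocked-search list only need to be made in one place.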

Let's say your command above found one specific search that had been requested thousands of times in the last few days:

exp=0%7C1%7C-date%2Fcreators_name%2Ftitle%7Carchive%7C-%7Ctitle%3Atitle%3AALL%3AIN%3Afibromyalgia+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow

You should take this and strip off everything before %7Ctitle, keeping the title but removing the %7C, so it would look like*:

title%3Atitle%3AALL%3AIN%3Afibromyalgia+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow
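The strip-and-escape steps can also be done with a small shell helper. This is only an illustration of the manual procedure described above, not something from EPrints itself:

```shell
# Strip everything up to (and including) the '%7C' before 'title',
# then escape '+' symbols for use in an Apache regular expression.
exp='exp=0%7C1%7C-date%2Fcreators_name%2Ftitle%7Carchive%7C-%7Ctitle%3Atitle%3AALL%3AIN%3Afibromyalgia+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow'
pattern=$(printf '%s\n' "$exp" | sed -e 's/^.*%7Ctitle/title/' -e 's/+/\\+/g')
echo "$pattern"
```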

You can then add it to the Apache configuration as follows, being sure to escape with a '\' any plus (+) symbols:

<LocationMatch "^/cgi/search/archive/advanced">
   <If "%{QUERY_STRING} =~ /title%3Atitle%3AALL%3AIN%3Afibromyalgia\+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow/">
     Require all denied
   </If>
</LocationMatch>

If you have multiple searches you want to block, it is clearer to add an additional If block for each search expression (rather than trying to match multiple search expressions in the same regular expression).  Once you have finished adding these, run the appropriate Apache commands to check the configuration and reload it.  E.g.

apachectl configtest
apachectl graceful
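With multiple blocked searches, the structure would look something like the following (the patterns here are placeholders, not real search expressions):

```apache
<LocationMatch "^/cgi/search/archive/advanced">
   <If "%{QUERY_STRING} =~ /first-blocked-search-expression/">
     Require all denied
   </If>
   <If "%{QUERY_STRING} =~ /second-blocked-search-expression/">
     Require all denied
   </If>
</LocationMatch>
```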

We have found that this is quite effective in dealing with this problem.  However, it does mean some genuine users performing these specific legitimate searches will be affected: although the first page of results should return OK, re-ordering the results or going to the next page will return a 403 Forbidden response (but only for these specific searches). This is not ideal, but there is really no other straightforward way to handle this problem: the IP addresses vary too widely to block, and doing nothing may mean long periods where no user can access any part of the EPrints repository.

If anyone has any suggestions on how to refine this configuration, then please share.

Thanks and regards

David Newman

*Sometimes requests URL-encode the / symbol as %2F and sometimes they do not, so removing everything up to %7Ctitle ensures that the pattern you are matching covers both the encoded and un-encoded versions.