EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10123



Re: [EP-tech] DDoS of EPrints advanced search



Hi David,

 

Thanks, we have been observing this since last Thursday and are blocking all such requests that do not come from our institution's network.

Since then we have seen more than 2.5 million requests from over 100K different IP addresses.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of David R Newman <drn@ecs.soton.ac.uk>
Date: Wednesday, 28 May 2025 at 11:27
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] DDoS of EPrints advanced search

Hi all,

We have been observing that a lot of EPrints repositories have been
receiving Distributed Denial-of-Service (DDoS) attacks on their advanced
search.  As running advanced search queries can put quite a lot of load
on the server, this can lead to the repository becoming unresponsive.

Analysis of the requests has shown that typically these requests are
bots working their way through the pages of search results for the same
search rather than lots of individual searches. Typically, each affected
repository will only have a few, maybe up to a dozen different actual
searches.  The following command will allow you to see what searches
these are. (You may need to adapt this if your access log is elsewhere):

grep "GET /cgi/search/archive/advanced" /var/log/httpd/ssl_access_log |
grep -v " 403 " | grep -o 'exp=[^&]\+' | sort | uniq -c | sort -n

Typically, /var/log/httpd/ssl_access_log will only cover the requests
since sometime just after midnight on the previous Sunday, if you have
the default log rotation in place.  So, as it is Wednesday now, you should
have a decent sample to analyse.
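
If the current log does not yet hold enough data, the same pipeline can be fed from several log files at once. The following is only a sketch: the function name count_searches is hypothetical, and the exp= match is tightened slightly so that it also stops at whitespace, which matters when exp is the last parameter in the request line.

```shell
# count_searches (hypothetical helper): read access-log lines on stdin and
# print each distinct exp= value with its request count, least frequent first.
count_searches() {
    grep 'GET /cgi/search/archive/advanced' \
        | grep -v ' 403 ' \
        | grep -o 'exp=[^&[:space:]]*' \
        | sort | uniq -c | sort -n
}

# Possible usage, assuming rotated logs sit next to the current one.
# With GNU gzip, "zcat -f" passes uncompressed files through unchanged,
# so compressed and uncompressed logs can be mixed:
#   zcat -f /var/log/httpd/ssl_access_log* | count_searches
```

Reading from standard input keeps the log location out of the function, so it can be adapted to wherever your access logs live.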

What we have done with these results is add a LocationMatch block
inside the VirtualHost block in the EPrints Apache configuration.
Typically, adding this for the HTTPS VirtualHost has been sufficient, but
you may also need to add it to the HTTP VirtualHost.  Therefore, it may
be worth adding the LocationMatch configuration to a separate file and
then including it under both VirtualHost blocks.

Let's say your command above found one specific search that had been
requested thousands of times in the last few days:

exp=0%7C1%7C-date%2Fcreators_name%2Ftitle%7Carchive%7C-%7Ctitle%3Atitle%3AALL%3AIN%3Afibromyalgia+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow

You should take this and strip off everything before %7Ctitle, keeping
the title but removing the %7C, so it would look like*:

title%3Atitle%3AALL%3AIN%3Afibromyalgia+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow

You can then add it to the Apache configuration as follows, being sure
to escape with a '\' any plus (+) symbols:

<LocationMatch "^/cgi/search/archive/advanced">
   <If "%{QUERY_STRING} =~ /title%3Atitle%3AALL%3AIN%3Afibromyalgia\+symptoms%7C-%7Ceprint_status%3Aeprint_status%3AANY%3AEQ%3Aarchive%7Cmetadata_visibility%3Ametadata_visibility%3AANY%3AEQ%3Ashow/">
     Require all denied
   </If>
</LocationMatch>
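
The stripping and escaping steps above can be sketched as a small POSIX shell helper. The function name exp_to_pattern is hypothetical, and the assumption that the search fields begin after the first %7C-%7C separator is inferred from the example expression above rather than from any documented guarantee, so check the output by eye before using it.

```shell
# exp_to_pattern (hypothetical helper): turn an exp= value copied from the
# access log into a pattern suitable for the Apache If block.
exp_to_pattern() {
    exp=$1
    # Assumption: the search fields start after the first "%7C-%7C"
    # separator, so drop everything up to and including it. If the
    # separator is absent, the value is left unchanged.
    rest=${exp#*%7C-%7C}
    # Escape plus symbols so the regular expression treats them literally.
    printf '%s\n' "$rest" | sed 's/+/\\+/g'
}

# Possible usage:
#   exp_to_pattern 'exp=0%7C1%7C-date%2Fcreators_name%2Ftitle%7Carchive%7C-%7Ctitle%3A...'
```

Only plus symbols are escaped here, as described above. Any other regular expression metacharacters in a search term would need the same treatment.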

If you have multiple searches you want to block, it is clearer to add
an additional If block for each search expression (rather than trying to
match multiple search expressions in the same regular expression).  Once
you have finished adding these, run the appropriate Apache
commands to check the config and reload.  E.g.

apachectl configtest
apachectl graceful
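
When there are several searches to block, one option is to generate the LocationMatch block from a list of patterns instead of editing it by hand. This is a sketch only: the function name make_block and the file names in the usage comment are hypothetical, and each input line is assumed to be a pattern that has already been stripped and escaped.

```shell
# make_block (hypothetical helper): read one pattern per line on stdin and
# print a LocationMatch block containing one If block per pattern.
make_block() {
    printf '%s\n' '<LocationMatch "^/cgi/search/archive/advanced">'
    while IFS= read -r pattern; do
        [ -n "$pattern" ] || continue   # skip blank lines
        printf '   <If "%%{QUERY_STRING} =~ /%s/">\n' "$pattern"
        printf '     Require all denied\n'
        printf '   </If>\n'
    done
    printf '%s\n' '</LocationMatch>'
}

# Possible usage (file names are placeholders):
#   make_block < blocked_patterns.txt > /path/to/blocked_searches.conf
#   apachectl configtest && apachectl graceful   # reload only if the check passes
```

The generated file could then be pulled into both VirtualHost blocks with Apache's Include directive, so the HTTP and HTTPS configurations stay in step.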

We have found that this is quite effective in dealing with this
problem.  However, it does mean that some genuine users performing
legitimate searches will find that, although the first page should
return OK, re-ordering the results or going to the next page returns a
403 Forbidden response (but only for these specific searches). This is
not ideal, but there is really no other straightforward way to handle
this problem, as the IP addresses vary so widely and doing nothing may mean
long periods where no user can access any part of the EPrints repository.

If anyone has any suggestions on how to refine this configuration, then
please share.

Thanks and regards

David Newman

*Sometimes requests URL-encode the / symbol as %2F and sometimes they
don't, so removing everything up to %7Ctitle ensures that the pattern you
are matching on covers both the encoded and un-encoded versions.