EPrints Technical Mailing List Archive


Message: #05889



[EP-tech] Paper of interest: Web robot detection


Hello guys,

I think this paper is very interesting and relevant to the
EPrints community.

Regards, Cristian

Title: Web robot detection in scholarly Open Access institutional repositories
Author(s): Greene, Joseph
http://hdl.handle.net/10197/7682

Abstract

Purpose -- This paper investigates the impact and techniques for
mitigating the effects of web robots on usage statistics collected by
Open Access institutional repositories (IRs).

Design/methodology/approach -- A review of the literature provides a
comprehensive list of web robot detection techniques. Reviews of
system documentation and open source code are carried out along with
personal interviews to provide a comparison of the robot detection
techniques used in the major IR platforms. An empirical test based on
a simple random sample of downloads with 96.20% certainty is
undertaken to measure the accuracy of an IR's web robot detection at a
large Irish University.

Findings -- While web robot detection is not ignored in IRs, there are
areas where the two main systems could be improved. The technique
tested here is found to have successfully detected 94.18% of web
robots visiting the site over a two-year period (recall), with a
precision of 98.92%. Due to the high level of robot activity in
repositories, correctly labelling more robots has an exponential
effect on the accuracy of usage statistics.

Limitations -- This study is performed on one repository using a
single system. Future studies across multiple sites and platforms are
needed to determine the accuracy of web robot detection in OA
repositories generally.

Originality/value -- This is the only study to date to have
investigated web robot detection in IRs. It puts forward the first
empirical benchmarking of accuracy in IR usage statistics.
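
For readers less familiar with the two measures quoted in the
findings, here is a minimal sketch of how recall and precision are
usually computed for a robot detector. The function name and the
counts are hypothetical and chosen only for illustration; they are not
taken from the paper, and this is not the paper's own method.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a robot detector.

    true_pos:  downloads flagged as robot that really were robots
    false_pos: downloads flagged as robot that were actually human
    false_neg: robot downloads the detector failed to flag
    """
    flagged = true_pos + false_pos   # everything the detector called a robot
    actual = true_pos + false_neg    # everything that really was a robot
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not from the paper).
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.2f} recall={r:.2f}")
```

Recall answers "what share of the real robots did we catch?", while
precision answers "what share of the downloads we flagged really were
robots?".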