EPrints Technical Mailing List Archive

Message: #06720



[EP-tech] Contents on EPrints repository is not featuring on Google Scholar


Hi,

I'd like to point the community to an SEO change applied to the EPrints core.

https://github.com/eprints/eprints/issues/450
---------------------------------------------

The problem

A number of repository administrators have noticed that their content is not featuring in Google Scholar search results.
We have been in discussion with Google Scholar in regard to how it discovers and indexes the contents of EPrints repositories.

While EPrints is designed to present its content to Google in the best way possible, Google Scholar is having trouble with the initial discovery of that content.
Google’s crawler processes hundreds of billions of links, and it needs a clearer way to identify that a link points to an EPrints repository rather than a normal website.
This would then allow Google Scholar to prioritise the crawling and indexing of those links. Google Scholar already has EPrints-specific rules in its crawler, and the Google Scholar team is happy to update them.

The solution

Google Scholar and I have come up with a plan to increase the discoverability of EPrints content.

Currently, records on EPrints have URLs which look like
http://YOUR-REPO/EPRINTID/ e.g. http://irep.ntu.ac.uk/12853/
However, a URL of this form cannot easily be identified as EPrints content without visiting the actual page, and Google has a lot of pages to visit.

We intend to promote the existing EPrints “URI” form of the links, which are easily identified as being EPrints content.
http://YOUR-REPO/id/eprint/EPRINTID/ e.g. http://irep.ntu.ac.uk/id/eprint/12853/
Currently the longer form of the URL redirects to the shorter version. We would like to swap that around so that the shorter form redirects to the longer version.
That way no existing links will stop working, but over time references to your repository, and more importantly Google's indexer, will use the longer, identifiable version.

Document URLs would need to be changed in a similar way. Again, any existing links would continue to work, but the promoted version of the links would change from
http://irep.ntu.ac.uk/12853/1/185527_3220%20Heasell%20prepublilsher.pdf
to
http://irep.ntu.ac.uk/id/eprint/12853/1/185527_3220%20Heasell%20prepublilsher.pdf
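The mapping from the short form to the long form described above can be sketched as follows. This is only an illustration of the two URL shapes, not how EPrints itself implements the redirect. The function name to_long_url and the assumption that a short-form path always begins with a numeric EPRINTID are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a short-form path is "/EPRINTID" optionally followed by "/...".
_SHORT_PATH = re.compile(r"^/(\d+)(/.*)?$")
_LONG_PREFIX = "/id/eprint/"


def to_long_url(url: str) -> str:
    """Return the long ("URI") form of a short-form record or document URL.

    URLs that are already in the long form, or that do not look like a
    short-form EPrints URL, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path.startswith(_LONG_PREFIX):
        return url  # already in the long form
    match = _SHORT_PATH.match(parts.path)
    if match is None:
        return url  # not a record or document URL under this assumption
    eprint_id = match.group(1)
    rest = match.group(2) or "/"  # add the trailing slash if it was missing
    return urlunsplit(parts._replace(path=f"{_LONG_PREFIX}{eprint_id}{rest}"))


if __name__ == "__main__":
    print(to_long_url("http://irep.ntu.ac.uk/12853/"))
    print(to_long_url(
        "http://irep.ntu.ac.uk/12853/1/185527_3220%20Heasell%20prepublilsher.pdf"
    ))
```

Applied to the two example URLs from this message, the sketch produces the long forms shown above. Other paths on the repository are left untouched.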


We have made the changes described above locally and they have proved successful.
We have now also applied the changes to the EPrints core.
These changes can be enabled by updating your 20_base_urls.pl to include
$c->{use_long_url_format} = 1;

If you apply these changes and would like Google Scholar to prioritise a reindex of your repository, get in touch with us and we’ll pass the message along to them.


Justin/Jiadi