EPrints Technical Mailing List Archive
Message: #08683
Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
- To: <jens.witzel@uzh.ch>
- Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
- From: David R Newman <drn@ecs.soton.ac.uk>
- Date: Tue, 27 Jul 2021 20:46:01 +0100
Hi Jens,
HEAD requests (adapting the curl command you provided) work for me on a repository running the latest GitHub commit. Looking back through the code, I think the fix to support HEAD requests correctly was put in place for the EPrints 3.4.1 release:
Regards
David Newman
CAUTION: This e-mail originated outside the University of Southampton.
Dear David
As a reminder: requesting only the headers (HEAD) still doesn't work either! We have seen this many times in our log files.
curl -v -I https://_some_eprint_server_/id/eprint/12345/
Thanks
Jens
--
Jens Witzel
Zentrale Informatik
Universität Zürich
Stampfenbachstrasse 73
CH-8006 Zürich
mail: jens.witzel@uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch
From: "David R Newman" <drn@ecs.soton.ac.uk>
To: jens.witzel@uzh.ch
Cc: eprints-tech@ecs.soton.ac.uk
Date: 26.07.2021 17:43
Subject: Re: Reply: Re: Reply: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
Hi Jens,
Great! If it had not fixed it, I would have been at a bit of a loss.
Regards
David Newman
On 26/07/2021 16:41, jens.witzel@uzh.ch wrote:
Sorry David,
it was my fault: I tried to fetch a non-existent link, caused by a mix-up between ISSUE and testing host #-)
Now I get my "HTTP/1.1 302 Found", which should be fine.
Thanks again
Jens
From: Jens Witzel/at/UZH
To: "David R Newman" <drn@ecs.soton.ac.uk>
Cc: eprints-tech@ecs.soton.ac.uk, jens.witzel@uzh.ch
Date: 26.07.2021 17:33
Subject: Reply: Re: Reply: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
Hi David
Thanks for your fast fix. I just tested it and unfortunately still get this ugly 404 :-/
Regards
Jens
From: "David R Newman" <drn@ecs.soton.ac.uk>
To: jens.witzel@uzh.ch
Cc: eprints-tech@ecs.soton.ac.uk
Date: 26.07.2021 16:03
Subject: Re: Reply: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
Hi Jens,
To fix your specific problem you need to modify perl_lib/EPrints/Apache/Rewrite.pm on or around line 422:
- && (index(lc($accept), "text/html") != -1 || index(lc($accept),"*/*") != -1 || $accept eq "" ) ## header must be text/html, or */*, or undef
+ && (index(lc($accept), "text/html") != -1 || index(lc($accept), "text/*") != -1 || index(lc($accept),"*/*") != -1 || $accept eq "" ) ## header must be text/html, text/*, */* or undef
I am reviewing the implications of this change and whether any further changes are needed, as I see references to the Accept MIME type in several other places and want to check whether setting the Accept MIME type to text/* on other requests would still break things.
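To see which Accept headers the changed condition lets through, here is a minimal shell sketch of the same substring test. It is only an illustration and not the EPrints code: the function name accept_allows_html is made up for this example, and it covers just the Accept-header part of the condition in Rewrite.pm, nothing else on that line.

```shell
# Hypothetical helper mirroring the patched substring test:
# succeed if the Accept header ($1) is empty or contains
# text/html, text/* or */* (compared case-insensitively).
accept_allows_html() {
  accept=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$accept" in
    ""|*text/html*|*"text/*"*|*"*/*"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Try the headers from the original report plus one that should be rejected.
for a in '' 'text/html' '*/*' 'text/*,application/*' 'application/json'; do
  if accept_allows_html "$a"; then
    echo "accepted: '$a'"
  else
    echo "rejected: '$a'"
  fi
done
```

Without the "text/*" alternative in the case pattern, 'text/*,application/*' falls through to the rejected branch, which matches the 404 reported at the start of this thread.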
Regards
David Newman
On 26/07/2021 09:55, jens.witzel@uzh.ch wrote:
Dear David
thank you for your support!
Kind regards
Jens
From: "David R Newman" <drn@ecs.soton.ac.uk>
To: eprints-tech@ecs.soton.ac.uk, jens.witzel@uzh.ch
Date: 26.07.2021 10:50
Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard
Hi Jens,
I can replicate the same problem on 3.4 GitHub HEAD [1]. I have created a GitHub issue for this [2] and will investigate.
Regards
David Newman
[1] https://github.com/eprints/eprints3.4
[2] https://github.com/eprints/eprints3.4/issues/159
On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:
Dear all
unfortunately, one of our partner crawlers reports a 404 error during download. The problem occurs when a wildcard is used as the MIME subtype.
Here is an example from our repository ZORA. Let us try to fetch publication no. 143147 via curl:
HTTP status 200 is returned when
- no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
- an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
- any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/
HTTP 404 status is returned if the MIME subtype is open, e.g. 'text/*'.
==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/
[...]
< HTTP/1.1 404 Not Found
< Date: Mon, 26 Jul 2021 08:23:04 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
< Cache-Control: no-store, no-cache, must-revalidate
< Strict-Transport-Security: max-age=15780000
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8
The header "Accept: text/*,application/*" should be valid. So we think something is going wrong around CRUD.pm [line 948] - elsif( $subtype eq '*' ) {}
Is this a bug or is there a workaround? Any help is appreciated.
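On why the header should be valid: in HTTP content negotiation (RFC 7231, section 5.3.2) the Accept header is a comma-separated list of media ranges, and the range text/* covers text/html. The following is a minimal shell sketch of that matching under stated assumptions: accept_matches is a hypothetical helper, not part of EPrints or CRUD.pm, and parameters such as q= are simply dropped, so a range marked q=0 would wrongly count as acceptable.

```shell
# Hypothetical helper: succeed if the Accept header ($1) admits the
# concrete media type ($2), e.g. text/html.
accept_matches() {
  header=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' \t')
  want_type=${2%%/*}
  want_sub=${2#*/}
  # No Accept header means any media type is acceptable.
  if [ -z "$header" ]; then return 0; fi
  # One media range per line; strip parameters such as ";q=0.8" (simplification).
  printf '%s\n' "$header" | tr ',' '\n' | sed 's/;.*//' | (
    while IFS=/ read -r type sub; do
      # */* matches everything.
      if [ "$type" = '*' ] && [ "$sub" = '*' ]; then exit 0; fi
      # Same type with either a wildcard or an identical subtype.
      if [ "$type" = "$want_type" ]; then
        if [ "$sub" = '*' ] || [ "$sub" = "$want_sub" ]; then exit 0; fi
      fi
    done
    exit 1
  )
}

# The header from the report against an HTML abstract page.
if accept_matches 'text/*,application/*' 'text/html'; then
  echo "text/html is acceptable"
fi
```

Matching each range separately avoids accidental substring hits, at the cost of more code than a plain substring test.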
Have a nice day
Jens