EPrints Technical Mailing List Archive

Message: #08684



[EP-tech] Antwort: Re: Crawler ends up with 404, dont know how to handle MIME subtype wildcard

  • To: David R Newman <drn@ecs.soton.ac.uk>
  • Subject: [EP-tech] Antwort: Re: Crawler ends up with 404, dont know how to handle MIME subtype wildcard
  • From: <jens.witzel@uzh.ch>
  • Date: Wed, 28 Jul 2021 08:16:26 +0200

CAUTION: This e-mail originated outside the University of Southampton.

Hi David

Thank you once more for your support. We are currently running 3.3.16 with some local changes, and I'll check these HEAD fixes.

Kind regards
Jens


--
Jens Witzel
Zentrale Informatik
Universität Zürich
Stampfenbachstrasse 73
CH-8006 Zürich

mail:  jens.witzel@uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch



From: "David R Newman" <drn@ecs.soton.ac.uk>
To: jens.witzel@uzh.ch
Cc: eprints-tech@ecs.soton.ac.uk
Date: 27.07.2021 21:46
Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard





Hi Jens,

HEAD requests (adapting the curl command you provided) work for me on a repository based on the latest GitHub commit. Looking back through the code, I think the fix to support HEAD requests correctly was put in place for the EPrints 3.4.1 release:

https://github.com/eprints/eprints3.4/commit/d723a6e8f30d3e041fe4ef9d6323ccf7f6a2fbcd#diff-0c92bd144bdc663271c5d8d071977aaa546702c4d8c1b379b32682e5cfa43527

Regards

David Newman

On 27/07/2021 08:15, jens.witzel@uzh.ch wrote:

    Dear David

    As a reminder: fetching only the headers (a HEAD request) still doesn't work either. We have seen this many times in our log files.


    curl -v -I https://_some_eprint_server_/id/eprint/12345/

    Thanks
    Jens


    From: "David R Newman" <drn@ecs.soton.ac.uk>
    To: jens.witzel@uzh.ch
    Cc: eprints-tech@ecs.soton.ac.uk
    Date: 26.07.2021 17:43
    Subject: Re: Antwort: Re: Antwort: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard





    Hi Jens,

    Great!  If it had not fixed it, I would have been at a bit of a loss.

    Regards

    David Newman

    On 26/07/2021 16:41, jens.witzel@uzh.ch wrote:

    Sorry David,

    It was my fault: I tried to fetch a non-existent link, caused by a mix-up between ISSUE and the testing host #-)
    Now I get my "HTTP/1.1 302 Found", which should be fine.

    Thanks again
    Jens

    From: Jens Witzel/at/UZH
    To: "David R Newman" <drn@ecs.soton.ac.uk>
    Cc: eprints-tech@ecs.soton.ac.uk, jens.witzel@uzh.ch
    Date: 26.07.2021 17:33
    Subject: Antwort: Re: Antwort: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard




    Hi David

    Thanks for your fast fix. I just tested it and unfortunately still get this ugly 404 :-/

    Regards
    Jens

    From: "David R Newman" <drn@ecs.soton.ac.uk>
    To: jens.witzel@uzh.ch
    Cc: eprints-tech@ecs.soton.ac.uk
    Date: 26.07.2021 16:03
    Subject: Re: Antwort: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard




    Hi Jens,

    To fix your specific problem you need to modify perl_lib/EPrints/Apache/Rewrite.pm on or around line 422:

    -                       &&  (index(lc($accept), "text/html") != -1 || index(lc($accept),"*/*") != -1 || $accept eq ""  )   ## header must be text/html, or */*, or undef
    +                       &&  (index(lc($accept), "text/html") != -1 || index(lc($accept), "text/*") != -1 || index(lc($accept),"*/*") != -1 || $accept eq ""  )   ## header must be text/html, text/*, */* or undef

    I am reviewing the implications of this change and whether any further changes are needed, as I see references to the Accept MIME type in several other places and want to check whether setting the Accept MIME type to text/* on other requests would still break things.
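
As an illustration only, the effect of the one-line change above can be sketched in Python (not the Perl that EPrints uses). The function names `accepts_html_before` and `accepts_html_after` are hypothetical; they mirror just the substring checks shown in the diff and ignore the other conditions on that line.

```python
def accepts_html_before(accept: str) -> bool:
    """Substring check as in the original line: text/html, */* or empty."""
    a = accept.lower()
    return "text/html" in a or "*/*" in a or accept == ""


def accepts_html_after(accept: str) -> bool:
    """Patched check: additionally lets a text/* media range through."""
    a = accept.lower()
    return "text/html" in a or "text/*" in a or "*/*" in a or accept == ""


if __name__ == "__main__":
    # Compare both checks for the headers discussed in this thread.
    for header in ["", "text/html", "*/*", "text/*,application/*"]:
        print(repr(header), accepts_html_before(header), accepts_html_after(header))
```

Because both versions are plain substring tests, they do not look at quality values: a header such as `text/*;q=0` would also pass the patched check.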

    Regards

    David Newman

    On 26/07/2021 09:55, jens.witzel@uzh.ch wrote:

    Dear David

    thank you for your support!

    Kind regards
    Jens

    From: "David R Newman" <drn@ecs.soton.ac.uk>
    To: eprints-tech@ecs.soton.ac.uk, jens.witzel@uzh.ch
    Date: 26.07.2021 10:50
    Subject: Re: [EP-tech] Crawler ends up with 404, dont know how to handle MIME subtype wildcard





    Hi Jens,

    I can replicate the same problem on 3.4 GitHub HEAD [1].  I have created a GitHub issue for this [2] and will investigate.

    Regards  

    David Newman

    [1] https://github.com/eprints/eprints3.4 

    [2] https://github.com/eprints/eprints3.4/issues/159 

    On 26/07/2021 09:31, jens.witzel--- via Eprints-tech wrote:

    Dear all

    Unfortunately, one of our partner crawlers reports a 404 error during download. The problem occurs when a wildcard is used as the MIME subtype.

    Here is an example on our repository ZORA. Let us try to get publication no. 143147 via curl:

    HTTP 200 status is returned when
    - no Accept header is specified: curl -v https://www.zora.uzh.ch/id/eprint/143147/
    - an exact MIME type is specified: curl -v -H 'Accept: text/html' https://www.zora.uzh.ch/id/eprint/143147/
    - any MIME type is specified: curl -v -H 'Accept: */*' https://www.zora.uzh.ch/id/eprint/143147/

    HTTP 404 status is returned if the MIME subtype is a wildcard, e.g. 'text/*':

    ==> curl -v -H 'Accept: text/*,application/*' https://www.zora.uzh.ch/id/eprint/143147/

    [...]
    < HTTP/1.1 404 Not Found
    < Date: Mon, 26 Jul 2021 08:23:04 GMT
    < Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_perl/2.0.11 Perl/v5.16.3
    < Cache-Control: no-store, no-cache, must-revalidate
    < Strict-Transport-Security: max-age=15780000
    < Transfer-Encoding: chunked
    < Content-Type: text/html; charset=utf-8

    The header "Accept: text/*,application/*" should be valid, so we think something is going wrong around CRUD.pm [line 948]:
    elsif( $subtype eq '*' ) {}
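
For reference, a minimal sketch in Python of how a wildcard subtype is commonly matched against a concrete media type. This is not the code in CRUD.pm; `parse_accept` and `matches` are hypothetical helpers, and parameters such as quality values (`q=`) are simply dropped for brevity.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, str]]:
    """Split an Accept header into (type, subtype) pairs, dropping parameters."""
    ranges = []
    for part in header.split(","):
        media = part.split(";", 1)[0].strip().lower()
        if "/" in media:
            mtype, subtype = media.split("/", 1)
            ranges.append((mtype, subtype))
    return ranges


def matches(header: str, content_type: str) -> bool:
    """True if content_type is covered by the header; an empty header accepts anything."""
    if header.strip() == "":
        return True
    ctype, csub = content_type.lower().split("/", 1)
    for mtype, subtype in parse_accept(header):
        if mtype == "*" and subtype == "*":
            return True  # */* covers every media type
        if mtype == ctype and subtype in ("*", csub):
            return True  # exact match, or wildcard subtype such as text/*
    return False
```

With this kind of matching, the header from the failing request, `text/*,application/*`, covers `text/html`, which is why a 200 response would be expected for it.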

    Is this a bug or is there a workaround? Any help is appreciated.

    Have a nice day
    Jens



    *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
    *** Archive: http://www.eprints.org/tech.php/
    *** EPrints community wiki: http://wiki.eprints.org/