EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #08703



[EP-tech] Antwort: Re: Antwort: Re: Message during process_stats IRStat2


CAUTION: This e-mail originated outside the University of Southampton.

Hi David,

Thank you for looking at this and for your explanations.

It looks like the history dataset may use quite a lot of processing time. Is it used anywhere? If not, I would deactivate it in our configuration, because I see no report for it.
I will add some timestamps to process_stats to find out how much time each dataset takes.
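
A minimal sketch of how such timestamps might be added, assuming each dataset's processing can be wrapped in a code reference. The helper name `time_step` and the dataset labels are hypothetical and are not taken from the IRStats2 code; the `select` calls merely stand in for the real work.

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

# Hypothetical helper (not part of IRStats2): run a code reference and
# return the elapsed wall-clock time in seconds.
sub time_step
{
    my( $label, $code ) = @_;

    my $t0 = [ gettimeofday ];
    $code->();
    my $elapsed = tv_interval( $t0 );   # second argument defaults to "now"

    printf STDERR "[%s] %s took %.3f s\n", scalar( localtime ), $label, $elapsed;
    return $elapsed;
}

# Placeholder work items; in process_stats these would be the calls that
# process each dataset (assumed labels, for illustration only).
my @steps = (
    [ access  => sub { select( undef, undef, undef, 0.05 ) } ],
    [ history => sub { select( undef, undef, undef, 0.10 ) } ],
    [ eprint  => sub { select( undef, undef, undef, 0.02 ) } ],
);

my %timings;
foreach my $step ( @steps )
{
    my( $label, $code ) = @$step;
    $timings{$label} = time_step( $label, $code );
}

# Report the steps from slowest to fastest.
foreach my $label ( sort { $timings{$b} <=> $timings{$a} } keys %timings )
{
    printf "%-8s %.3f s\n", $label, $timings{$label};
}
```

Writing the per-step lines to STDERR keeps them out of any regular output, so they can be redirected to a separate log when the script runs from cron.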

As mentioned, we are well beyond 30,000 items (about 170,000). The database uses InnoDB and runs on a separate, large MariaDB primary/replica server with a lot of InnoDB cache (48 GB, enough to keep the largest tables in cache), so the database should not be the bottleneck.

Kind regards,

Martin


--
Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Stampfenbachstr. 73
CH-8006 Zürich

mail: martin.braendle@uzh.ch
phone: +41 44 63 56705
fax: +41 44 63 54505
http://www.zi.uzh.ch



From: "David R Newman" <drn@ecs.soton.ac.uk>
To: martin.braendle@uzh.ch, eprints-tech@ecs.soton.ac.uk
Date: 14/08/2021 13:48
Subject: Re: Antwort: Re: [EP-tech] Message during process_stats IRStat2





Hi Martin,

I do not have any day-to-day involvement with the IRStats2 plugin. As with a couple of other complex plugins, I understand it well enough to test it against my changes to the core codebase, but I do not actively take part in its development.

The IRStats2 plugin suffers from its own success as probably the most popular plugin in the Bazaar. Lots of people have made local tweaks to it to meet their niche requirements, which makes it difficult to bring these back into a single coherent release, particularly as some of the changes in EPrints 3.4 (from 3.3) mean that it has its own individual requirements, like the issue Izwan raised. We have a modified version of IRStats2 that deals with these, along with other optimisations for the operating system and (virtual) hardware specification on which we run repositories with IRStats2.

One of my colleagues has a deeper understanding of the plugin, and when we worked in the same office I remember him cursing the significant lack of optimisation in some of the code for process_stats. I think some of these local modifications relate to improving this, but I am uncertain of the details. I will try to catch up with him about this, hopefully some time next week.

We run a number of repositories that have a six-figure number (100,000 plus) of items and a nine-figure number (100 million plus) of access table records. I cannot be certain how long process_stats takes to run on these repositories, but apart from when they unexpectedly come under heavy load overnight, I have not noticed process_stats still running during the day when I have been working on them. I would say that if you are running the latest IRStats2 Bazaar plugin, you would need quite a large repository (multiple tens of thousands of items, e.g. at least 30,000) before running process_stats daily might become a problem. I think that switching to InnoDB tables removes the issue of blocking on the access table while process_stats is running, which can affect responsiveness for those accessing abstract pages or downloading documents during this time. (This is why process_stats cron jobs should be run overnight, although with InnoDB tables this is more about moving the added CPU load to a quieter time of day.) However, having InnoDB tables is unlikely to significantly alter the amount of time process_stats takes to run.

From an organisational point of view, having both an eprints and an eprintsug GitHub repository for IRStats2 is unhelpful, and I think it is one reason for the lack of development, as the 'ownership' of IRStats2 is unclear. I am hopeful that this situation can be resolved, but there are various complexities that have prevented this up to now. However, I did recently deprecate some eprints GitHub organisation repositories and add pointers to their eprintsug equivalents to resolve this problem for some less contentious plugins.

Regards

David Newman

On 13/08/2021 17:00, martin.braendle@uzh.ch wrote:

    Hi David,

    Daily incremental updates are OK for small repositories.
    However, on a big repository such as ZORA (170K items in total), an incremental update takes 9-10 hours, so we run it weekly, on a separate compute server.


    Processing the access table (which is done in chunks of 100000) does not seem to be the time-limiting step. According to my last log, it takes about 1000 seconds for 250000 access records, so the rest of the time must have been spent on the history and the eprint set.
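
    The arithmetic behind that conclusion can be sketched as follows, using only the figures quoted in this message (1000 seconds for 250000 access records, and a 9-10 hour run) and assuming both refer to the same kind of incremental run. The variable names are illustrative.

    ```perl
    use strict;
    use warnings;

    my $access_records = 250_000;   # access records in the last log
    my $access_seconds = 1_000;     # time spent on them, in seconds

    printf "access throughput: %d records/s\n", $access_records / $access_seconds;

    # Share of the total run time spent on the access table, for the
    # 9-hour and 10-hour ends of the reported range.
    foreach my $hours ( 9, 10 )
    {
        my $total_seconds = $hours * 3600;
        printf "%2d h run: access takes %.1f%% of the total\n",
            $hours, 100 * $access_seconds / $total_seconds;
    }
    ```

    On these figures the access table accounts for roughly 3% of the run, which is consistent with most of the time being spent in the other steps.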


    Processing time seems to grow linearly with the number of repository items and exponentially with the number of accesses (because accesses grow exponentially as more items are added over time).


    Any insights on how the other steps (history, eprint statistics) can be improved performance-wise?


    From what I gather from GitHub (eprints and eprintsug), the Processor code has not been touched for years.


    Kind regards,


    Martin







    From: "David R Newman via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
    To: "MOHD.IZWAN SALIM" <mohdizwan8733@uitm.edu.my>
    Cc: "EDER Norbert via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
    Date: 13/08/2021 14:53
    Subject: Re: [EP-tech] Message during process_stats IRStat2
    Sent by: <eprints-tech-bounces@ecs.soton.ac.uk>





    Hi Izwan,

    Just to clarify, the process_stats script can be run in two different ways. One is an initial setup that performs various one-time tasks and then processes all of the existing access table records. The other is an incremental update of the stats for the previous day; this should be done via a daily cron job in the eprints crontab. If you are referring to the initial setup, then it might be worth regenerating the stats from scratch, because over the lifetime of your repository you may have had a lot of internal requests that are not currently marked as such in your usage stats. However, I don't expect a regular EPrints repository to make that many internal requests (i.e. from the eprints server itself). That would only happen if you have some bespoke functionality running on your repository that requests abstract pages or downloads documents. If, on the other hand, you were referring to the incremental daily update, and that increment covers only a day or a few days, then re-running it will not make any difference and regenerating all the stats from scratch would not be worth it, as you probably have very few, if any, internal requests in this timeframe.

    I get the feeling that you are not running the process_stats script on a daily basis to make these incremental updates. Check out the wiki page that explains this and how to set up a cron job:

    https://wiki.eprints.org/w/IRStats2#Processing 

    Regards

    David Newman

    On 13/08/2021 13:12, MOHD.IZWAN SALIM wrote:

    Dear David

    Should I apply the change and re-run the script?

    It has already been running for 2 days.

    Will the stats be any different after I apply the change?

    Regards

    Izwan
    UiTM Digital Library

    https://ir.uitm.edu.my/


    On Fri, Aug 13, 2021 at 4:45 PM David R Newman <drn@ecs.soton.ac.uk> wrote:
    Hi Izwan,

    Looking at the line of code that has the error:
     

    One of these two variables is not set. As there is a comparison involving $hostname further up in the file, it must be $self->{host} that is not set. It is set earlier, on line 24:

    $self->{host} = $self->{session}->config( "host" );

    My suspicion is that you have reconfigured your repository to be HTTPS only and have therefore only set $c->{securehost} and not $c->{host} in your archive's cfg/cfg.d/10_core.pl (or some other config file in the same directory). Setting $c->{host} to undef would produce the same error message. To resolve this problem, go to line 24 of /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm and add the following line after it:

    $self->{host} ||= $self->{session}->config( "securehost" );

    This will set $self->{host} to the config value of securehost if there is no value set for host. I have had to make various amendments to EPrints to support the convention that no $c->{host} means HTTPS only, so that it does not break things that expect it to always be set. However, leaving $c->{host} unset seemed like the most intuitive way for system administrators to indicate that they have configured their repository for HTTPS only.
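
    The effect of the added line can be illustrated in isolation. This is a minimal sketch that uses a plain hash as a hypothetical stand-in for the repository configuration (the host name is invented); it does not call any EPrints code.

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for an HTTPS-only configuration:
    # securehost is set, host is not.
    my %config = ( securehost => "repository.example.org" );

    my $self = {};
    $self->{host} = $config{host};          # undef, as on line 24 in this case
    $self->{host} ||= $config{securehost};  # fall back to securehost

    print "host: $self->{host}\n";
    ```

    Note that `||=` assigns whenever the left-hand side is false, so an empty string would be replaced as well as undef.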

    Unfortunately, there has not been a new release of IRStats2 since these changes were baked into recent versions of EPrints 3.4, in part because this only affects those who have configured their repositories for HTTPS only in this way.

    Regards

    David Newman

    On 13/08/2021 09:26, MOHD.IZWAN SALIM via Eprints-tech wrote:

    Dear all, I just migrated and upgraded EPrints 3.3.16 to 3.4.3. Everything worked fine until I ran process_stats --setup for IRStats2.

    The statistics processing is running (I guess), but it only shows these messages:
    Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
    Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
    Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
    Access: incremental commit to DB

    I'm using MySQL 8. I want to get rid of that message.

    Regards

    Izwan
    UiTM Digital Library

    http://ir.uitm.edu.my/

    DISCLAIMER: This e-mail and any files transmitted with it ("Message") is intended only for the use of the recipient(s) named above and may contain information that is non-public, proprietary, privileged, confidential and exempt from disclosure under applicable law including the Official Secrets Act 1972.

    *** Options:
    http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
    *** Archive:
    http://www.eprints.org/tech.php/
    *** EPrints community wiki:
    http://wiki.eprints.org/ 