EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #08702
Re: [EP-tech] Antwort: Re: Message during process_stats IRStat2
- To: <martin.braendle@uzh.ch>, <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] Antwort: Re: Message during process_stats IRStat2
- From: David R Newman <drn@ecs.soton.ac.uk>
- Date: Sat, 14 Aug 2021 12:48:32 +0100
Hi Martin,
I do not have any day-to-day involvement with the IRStats2
plugin. As with a couple of other complex plugins, I understand
it well enough to test it against my developments to the core
codebase, but I do not actively take part in its development.
The IRStats2 plugin suffers from its success as probably the most popular plugin in the Bazaar. Lots of people have made their own local tweaks to it to meet their niche requirements, which makes it difficult to bring these all back into a single coherent release. This is particularly so because some of the changes in EPrints 3.4 (from 3.3) mean that it now has individual requirements of its own, like the issue Izwan raised. We have a modified version of IRStats2 to deal with these, as well as other optimisations for the operating system and (virtual) hardware specification on which we run repositories with IRStats2.
One of my colleagues has a deeper understanding of the plugin and
when we worked in the same office I do remember him cursing the
significant lack of optimisation in some of the code for
process_stats. So I think some of these local modifications
relate to improving this, but I am uncertain of the details. I
will try to catch up with him about this, hopefully some time next
week.
We run a number of repositories that have a six-figure number (100
thousand plus) of items and a nine-figure number (100 million plus)
of access table records. I cannot be certain how long
process_stats takes to run on these repositories but apart from
when they unexpectedly come under heavy load overnight, I have not
noticed process_stats still running during the day when I have
been working on them. I would say that if you are running the
latest IRStats2 Bazaar plugin, then you would need quite a large
repository (multiple tens of thousands of items, e.g. at least
30,000+) before running process_stats daily might become a
problem. I think if you have switched over to using InnoDB tables
this removes the issue with blocking on the access table when
process_stats is running, which can affect responsiveness for
those accessing abstract pages or downloading documents during
this time (and why process_stats cron jobs should be run
overnight, although with InnoDB tables this is more to move the
added CPU load to a quieter time of day). However, having InnoDB
tables is unlikely to significantly alter the amount of time
process_stats takes to run.
From an organisational point of view, having both an eprints and an
eprintsug GitHub repository for IRStats2 is unhelpful and I think
it is one reason for the lack of development, as the 'ownership'
of IRStats2 is unclear. I am hopeful that this situation can be
resolved but there are various complexities that have prevented
this up to now. However, I did recently deprecate some eprints
GitHub organisation repositories and add pointers to their
eprintsug equivalents to resolve this problem for some less
contentious plugins.
Regards
David Newman
CAUTION: This e-mail originated outside the University of Southampton.
Hi David,
Daily incremental updates are OK for small repositories.
However, on a big repo such as ZORA (170K items in total), an incremental update takes 9-10 hours, so we do it weekly, on a separate compute server.
It doesn't seem that processing the access table (which is done in chunks of 100000) is the time-limiting step. According to my last log, that takes about 1000 seconds for 250000 access records, so the rest must have been spent on the history and the eprint set.
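As a rough check of this reasoning, the figures quoted above can be put into a short Perl calculation. This is only a sketch: it assumes that the 250000 access records were the whole increment of one 9-10 hour run, which the message does not state explicitly, and the `access_share` helper is a hypothetical name introduced for illustration.

```perl
use strict;
use warnings;

# Figures quoted in the message above.
my $access_seconds = 1000;       # time spent on the access table
my $access_records = 250_000;    # access records processed in that time
my @total_hours    = ( 9, 10 );  # reported length of an incremental run

# Hypothetical helper: fraction of the whole run spent on the access table.
# Assumes the access figures and the total run time refer to the same run.
sub access_share
{
	my( $access_secs, $total_hrs ) = @_;

	return $access_secs / ( $total_hrs * 3600 );
}

printf "throughput: %d access records per second\n",
	$access_records / $access_seconds;

for my $hours ( @total_hours )
{
	printf "%2d h run: access table takes %.1f%% of the time\n",
		$hours, 100 * access_share( $access_seconds, $hours );
}
```

Under that assumption the access table accounts for only about 3% of the run, which is consistent with the conclusion that the history and eprint set steps dominate.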
Processing time seems to grow linearly with the number of repository items and exponentially with the number of accesses (because accesses grow exponentially as more items are added over time).
Any insights on how the performance of the other steps (history, eprint statistics) can be improved?
From what I gather from GitHub (eprints and eprintsug), the Processor code has not been touched for years.
Kind regards,
Martin
--
Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Stampfenbachstr. 73
CH-8006 Zürich
"David R Newman via Eprints-tech" ---13/08/2021 14:53:14---Hi Izwan, Just to clarify the process_stats script can be run in two different
From: "David R Newman via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
To: "MOHD.IZWAN SALIM" <mohdizwan8733@uitm.edu.my>
Cc: "EDER Norbert via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
Date: 13/08/2021 14:53
Subject: Re: [EP-tech] Message during process_stats IRStat2
Sent by: <eprints-tech-bounces@ecs.soton.ac.uk>
Hi Izwan,
Just to clarify, the process_stats script can be run in two different ways. One is an initial setup that does various one-time tasks and then processes all of the existing access table records. The other just does an incremental update of the stats for the previous day. The latter should be done via a daily cron job in the eprints crontab.
If you are referring to the initial setup run, then it might be worth regenerating the stats from scratch, as over the lifetime of your repository you may have had a lot of internal requests that are not currently marked as such in your usage stats. However, I don't expect a regular EPrints repository to make that many internal requests (i.e. from the eprints server itself). It would only happen if you have some bespoke functionality running on your repository that requests abstract pages or downloads documents.
However, if you were referring to the incremental daily update (for process_stats), and assuming that the increment is only a day or a few days, then re-running it will not make any difference and regenerating all the stats from scratch would not be worth it, as you probably have very few if any internal requests in this timeframe.
I get a bit of a feeling that you are not running the process_stats script on a daily basis to make these incremental updates. Check out the wiki page that explains this and how to set up a cron job:
https://wiki.eprints.org/w/IRStats2#Processing
Regards
David Newman
On 13/08/2021 13:12, MOHD.IZWAN SALIM wrote:
Dear David
Should I apply the change and re-run the script?
I have already been running it for 2 days.
Will the statistics be any different after I apply the change?
Regards
Izwan
UiTM Digital Library
https://ir.uitm.edu.my/
On Fri, Aug 13, 2021 at 4:45 PM David R Newman <drn@ecs.soton.ac.uk> wrote:
Hi Izwan,
Looking at the line of code that has the error:
One of these two variables is not set. As there is a comparison involving $hostname further up in the file, it must be $self->{host} that is not set. This is set earlier, on line 24:
$self->{host} = $self->{session}->config( "host" );
My suspicion is that you have reconfigured your repository to be HTTPS only and therefore only set $c->{securehost} and not $c->{host} in your archive's cfg/cfg.d/10_core.pl (or some other config file in the same directory). If $c->{host} is set to undef, this would also produce the same error message. To resolve this problem you should go to line 24 of /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm and add the following line after it:
$self->{host} ||= $self->{session}->config( "securehost" );
This will set $self->{host} to the config value of securehost if there is no value set for host. I have had to make various amendments to EPrints to support the convention that no $c->{host} being set means HTTPS only, so that it does not break things that expect it to always be set. However, not setting $c->{host} seemed like the most intuitive way to let system administrators know they have configured their repository for HTTPS only.
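To illustrate how the `||=` fallback behaves, here is a minimal standalone Perl sketch. The `resolve_host` helper, the `%https_only` hash and the example hostname are hypothetical and only stand in for the repository configuration; the real plugin reads these values through `$self->{session}->config( ... )` as shown above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the configuration of an HTTPS-only archive:
# securehost is set, host is deliberately left unset.
my %https_only = ( securehost => "repo.example.org" );

# Hypothetical helper mirroring the two lines in Referrer.pm:
# take "host" first, then fall back to "securehost" if host has no true value.
sub resolve_host
{
	my( $config ) = @_;

	my $host = $config->{host};
	$host ||= $config->{securehost};

	return $host;
}

my $host = resolve_host( \%https_only );
print "host used for the referrer comparison: $host\n";
```

Note that `||=` assigns whenever the left-hand side is false, so it covers an empty string as well as undef. With a defined value in place, the string comparison on line 84 should no longer warn about an uninitialized value.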
Unfortunately, there has not been a new release of IRStats2 since these changes were baked into recent versions of EPrints 3.4, in part because this only affects those who have configured their repositories for HTTPS only in this way.
Regards
David Newman
On 13/08/2021 09:26, MOHD.IZWAN SALIM via Eprints-tech wrote:
Dear all,
I just migrated and upgraded EPrints 3.3.16 to 3.4.3. Everything worked fine until I ran process_stats --setup for IRStats2.
The statistics processing is running (I guess), but it only shows this message:
Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
Use of uninitialized value in string eq at /usr/share/eprints/lib/plugins/EPrints/Plugin/Stats/Processor/Access/Referrer.pm line 84.
Access: incremental commit to DB
I'm using MySQL 8. I want to get rid of that message.
Regards
Izwan
UiTM Digital Library
http://ir.uitm.edu.my/
DISCLAIMER : This e-mail and any files transmitted with it ("Message") is intended only for the use of the recipient(s) named above and may contain information that is non-public, proprietary, privileged, confidential and exempt from disclosure under applicable law including the Official Secrets Act 1972. READ MORE...
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/