EPrints Technical Mailing List Archive
See the EPrints wiki for instructions on how to join this mailing list and related information.
Message: #09864
< Previous (by date) | Next (by date) > | < Previous (in thread) | Next (in thread) > | Messages - Most Recent First | Threads - Most Recent First
Re: [EP-tech] Encoding for database tables - utf8mb4
- To: <eprints-tech@ecs.soton.ac.uk>, Matthew Kerwin <matthew@kerwin.net.au>
- Subject: Re: [EP-tech] Encoding for database tables - utf8mb4
- From: David R Newman <drn@ecs.soton.ac.uk>
- Date: Sun, 27 Oct 2024 14:08:45 +0000
Hi both,I am keen that EPrints can support utf8mb4 but realistically this is not something that can be changed in a minor EPrints release. As you have already discussed, there can be a lot of manual effort within the terminal to make sure everything is updated in the database, (and the codebase, for that matter).
For a major release of EPrints we would want to make this utf8mb4 by default but also provide a comprehensive script for sysadmins to convert. I have observed whilst testing different approaches for conversion to utf8mb4 that database fields can get mis-encoded, even though in theory the conversion should move to a wider representation (i.e. anything that can be represented in utf8 must be representable in utf8mb4).
Unfortunately, this (development work) all takes time, which is currently an ever rarer commodity. So, if either of you have developed scripts for such a conversion you are will to share that would be most appreciated. Maybe we could create a Git repository in a suitable location to host these scripts. However, I would be loathed to encourage others to hack around with the EPrints 3.4 codebase and their databases to support utf8mb4 unless they really know what they are doing. We may produce a generic solution that works well on a vanilla EPrints repository. However, experience tells me that every EPrints repository is different. Therefore, it is difficult to be confident how reliable any such solution might be on a random EPrints repository.
Thinking out loud, we could create a modified EPrints::Database::mysql module (maybe called EPrints::Database::mysql_utf8mb4) and then this could be switched out with the current one. This would either allow a new EPrints repository to start out as utf8mb4 or allow a current repository to be taken offline, the conversion script run and then the repository brought back online again with the new mysql_utf8mb4 database module. Unfortunately, I suspect this solution is over-simplistic, so some time will be need to consider the gotchas and how these can be overcome.
Regards David Newman On 27/10/2024 12:17 pm, Matthew Kerwin wrote:
CAUTION: This e-mail originated outside the University of Southampton. CAUTION: This e-mail originated outside the University of Southampton. Oh, haha, I realised the message you linked to was from an ancient version of me. Well, there you go. It's still working, fwiw. On Sun, 27 Oct 2024 at 22:14, Matthew Kerwin <matthew@kerwin.net.au> wrote:Hi John, On Fri, 25 Oct 2024 at 22:37, John Salter <J.Salter@leeds.ac.uk> wrote:Hi, Currently my EPrints database still uses utf8, rather than utf8mb4. There are a couple of fields (title, abstract) that I would like to update to utf8mb4 – before (at some point in the future) updating the whole database to utf8mb4. Has anyone done anything similar – changing a limited subset of columns - or should I just try and do the whole database at once?I definitely updated ours at some point, many years ago. I think I did the whole database, but if I recall correctly it required a lot of typing in a terminal because I had to change each column of each table, and collations, and various other things. I also changed the code somewhere around EPrints::Database (and possibly ::mysql) to ensure it connected with the right encoding/charset as well. I can probably dig out some more details once I'm back at work tomorrow, if you need it.These instructions seem to cover what I need to do for individual columns: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.eprints.org%2Fw%2FUnicode%23Managing_32-Bit_Unicode_Characters&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342649174%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=qVOF%2BPsWhp1C8UGzm76ptIQmVWSxrBBo52Suv2K3aSI%3D&reserved=0 and this is a useful guide: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.eprints.org%2Feptech%2Fmsg07198.html&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342649174%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=63M4ubV3HVCSh6%2BzAspFodiU%2Bsmke2KTqVaDaE6RF2Y%3D&reserved=0 If the fields are indexed, these changes are probably also needed: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.eprints.org%2Feptech%2Fmsg09275.html&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342649174%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6%2Bt%2Fxxyuk7kT8yiWf%2FKABhx8YsT78yuMdZtNmEwcND0%3D&reserved=0. The reason I want to do it this way is: I have some external systems that want to push utf8_mb4 data into those fields If I try to update the whole DB at the moment, I hit the index length issues, so need more time to investigate/reconfigure/resolve these Any details from anyone who's update part, or all their DB provision to utf8mb4 welcomed! Cheers, JohnCheers -- Matthew Kerwin [he/him] https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmatthew.kerwin.net.au%2F&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342649174%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=9DTofdfptjpSbgkBDljgpCgOpvO1Z7suAghrn1DqtV0%3D&reserved=0-- Matthew Kerwin [he/him] https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmatthew.kerwin.net.au%2F&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342805949%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=RPaA%2FdGpBBNAv4dG8Bz%2F4LiwS1hPwxFEDpi%2BOM3cdmY%3D&reserved=0 *** Options: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.eprints.org%2Fw%2FEprints-tech_Mailing_List&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342805949%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=IMvOqqvlWDEdiHuVReZYHJ4oX0hyE96gdOD6XFZeuf4%3D&reserved=0 *** Archive: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.eprints.org%2Ftech.php%2F&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342805949%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=QRBYShOqDguCSxMIEZAzpVJ4FYl9dbu1SFV4KqMYD14%3D&reserved=0 *** EPrints community wiki: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.eprints.org%2F&data=05%7C02%7Ceprints-tech%40ecs.soton.ac.uk%7C0e5293b5b5c54a5f10af08dcf690e2b9%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C638656349342805949%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=pHGX6Kweo9GQUe%2BDxsCnNxcDjGj3z462yCrEZ1ug8dI%3D&reserved=0
- References:
- [EP-tech] Encoding for database tables - utf8mb4
- From: John Salter <J.Salter@leeds.ac.uk>
- Re: [EP-tech] Encoding for database tables - utf8mb4
- From: Matthew Kerwin <matthew@kerwin.net.au>
- Re: [EP-tech] Encoding for database tables - utf8mb4
- From: Matthew Kerwin <matthew@kerwin.net.au>
- [EP-tech] Encoding for database tables - utf8mb4
- Prev by Date: Re: [EP-tech] Encoding for database tables - utf8mb4
- Previous by thread: Re: [EP-tech] Encoding for database tables - utf8mb4
- Index(es):