EPrints Technical Mailing List Archive


Message: #10236



Re: [EP-tech] Alphabetically sort names with special characters


CAUTION: This e-mail originated outside the University of Southampton.

Quoting Andrew M <eprints-tech@unitedgames.co.uk>:

Since the script was getting butchered in email form, I've thrown it
online here:
https://www.andrewjamesmehta.com/files/eprints/UnicodeSortExample.pm

However, the main part was:

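use English qw(-no_match_vars);   # provides @ARG as an alias for @_
use Unicode::Collate;

sub unicode_sort {
    my $self = shift;
    # Level 1 compares base letters only, so case and diacritics are ignored.
    my @configuration_to_ignore_case_and_diacritics = (level => 1);

    return Unicode::Collate->new(@configuration_to_ignore_case_and_diacritics)->sort(@ARG);
}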

As written about in the Perl Unicode cookbook:
https://perldoc.perl.org/perlunicook#%E2%84%9E-36:-Case-and-accent-insensitive-Unicode-sort

This is Perl, and not EPrints of course, so the next stage is to
figure out where such improved sorts need to be used in EPrints,
or whether EPrints already has an option for them.
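
As a rough sketch of how the same collator could be applied when the items being sorted are records rather than plain strings, the example below uses the cmp method of Unicode::Collate inside an ordinary sort block. The @creators list and its family/given field names are hypothetical, made up for illustration from the names in the original question; they are not an EPrints data structure or API.

```perl
use strict;
use warnings;
use utf8;                       # this source contains literal non-ASCII names
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical records; the field names are illustrative only.
my @creators = (
    { family => 'Church', given => 'B' },
    { family => 'Lee',    given => 'K' },
    { family => 'Ågren',  given => 'R' },
    { family => 'Çınar',  given => 'D' },
);

# level => 1 compares base letters only, ignoring case and diacritics.
my $collator = Unicode::Collate->new(level => 1);

# Sort by family name, falling back to given name on a tie.
my @sorted = sort {
    $collator->cmp($a->{family}, $b->{family})
        || $collator->cmp($a->{given}, $b->{given})
} @creators;

print "$_->{family}, $_->{given}\n" for @sorted;
```

Building the collator once and reusing it in the comparison avoids constructing a new Unicode::Collate object for every sort call.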





There was no need for the "our" before $a and $b in that code example.
Apologies. Was messing around with different things and left that in.


Quoting Andrew M <eprints-tech@unitedgames.co.uk>:

Was intrigued by this, and had a moment of spare time,
so wrote a short script that attempts three different sorts:

Default sort,

Default Unicode case-folding (case-insensitive) sort,

...and since the second made no difference, I hit the online cookbook...
https://perldoc.perl.org/perlunicook#%E2%84%9E-36:-Case-and-accent-insensitive-Unicode-sort
and learned about the default Unicode case-and-accent-insensitive sort.

So now we know how to do the correct kind of sort in Perl... next
we'd need to know where in the EPrints codebase to apply the fix.
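
For reference, a minimal sketch of the three sorts described above. This is not the script linked in the follow-up message, and the @names list is simply taken from the original question. It assumes Perl 5.16 or later for fc().

```perl
use strict;
use warnings;
use utf8;                       # this source contains literal non-ASCII names
use feature 'fc';               # fc() is available from Perl 5.16
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @names = ('Church, B', 'Lee, K', 'Ågren, R', 'Çınar, D');

# 1. Default sort: compares code points, so Å (U+00C5) and Ç (U+00C7)
#    land after every unaccented ASCII letter.
my @default = sort @names;

# 2. Case-folded sort: ignores case, but å and ç still have code points
#    above z, so the accented names stay at the end.
my @folded = sort { fc($a) cmp fc($b) } @names;

# 3. Unicode Collation Algorithm at level 1: compares base letters only,
#    ignoring case and diacritics.
my @collated = Unicode::Collate->new(level => 1)->sort(@names);

print 'default:  ', join(' | ', @default),  "\n";
print 'folded:   ', join(' | ', @folded),   "\n";
print 'collated: ', join(' | ', @collated), "\n";
```

Only the third sort moves the accented names into the body of the alphabet, which matches the observation that case folding alone made no difference.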

Where are you seeing the wrong order appearing? In what context do
you want the order to be changed?

Of course there may also be a simple EPrints option that switches to
more correct ordering,
so I probably should have checked the EPrints wiki before looking up
the Perl solution.

Attempting to copy and paste the short experimental script I just
wrote - hope it doesn't get butchered in email form:

====================



Quoting Will Hughes <w.p.hughes@reading.ac.uk>:

Hi

Hopefully a quick question with an easy answer:

How do we get alphabetic sorting to list accented characters at an
appropriate point in an alphabetic list? The default behaviour
seems to use Unicode values or something, as accented characters
appear at the end of the alphabet.

For example, when I see this kind of sequence from EPrints:


*   Church, B
*   Lee, K
*   Ågren, R
*   Çınar, D

I feel that it should (probably) be:


*   Ågren, R
*   Church, B
*   Çınar, D
*   Lee, K

Is there a simple setting to implement sorting in a way that
respects accented characters? (And will these characters reproduce
accurately after emailing? Image attached just in case.)

Best wishes

Will

Will Hughes
Emeritus Professor of Construction Management and Economics
School of the Built Environment
University of Reading, PO Box 219, Whiteknights
Reading, RG6 6DF, UK