EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10235



Re: [EP-tech] Alphabetically sort names with special characters


CAUTION: This e-mail originated outside the University of Southampton.

I was intrigued by this and had a moment of spare time,
so I wrote a short script that attempts three different sorts:

Default sort,

Default Unicode case-folding (case-insensitive) sort,

...and since the second made no difference, I hit the online cookbook...
https://perldoc.perl.org/perlunicook#%E2%84%9E-36:-Case-and-accent-insensitive-Unicode-sort
and learned about the default Unicode case-and-accent-insensitive sort.
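
As a minimal sketch of why the case-folded sort makes no difference here,
using the four names from the original question: fc only folds case, so
'Å' becomes 'å' (U+00E5), which still has a higher code point than 'z'
(U+007A). The helper name folded_sort is made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.16;    # provides fc and say
use utf8;     # this source contains literal non-ASCII characters
use open qw(:std :encoding(UTF-8));

# Hypothetical helper: sort by case-folded code points.
sub folded_sort { return sort { fc($a) cmp fc($b) } @_ }

my @names  = ('Lee, K', 'Church, B', 'Çınar, D', 'Ågren, R');
my @folded = folded_sort(@names);

# Folding keeps the accent: 'Å' becomes 'å', which is still above 'z'.
printf "fc('Å') is U+%04X, 'z' is U+%04X\n", ord(fc('Å')), ord('z');
say "* $_" for @folded;
```

The accented names still land after 'Lee, K', because the comparison is
still by code point.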

So now we know how to do the correct kind of sort in Perl....next we'd
need to know where in the EPrints codebase to apply the fix.
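
Depending on where the fix turns out to belong, the code there may
compare two values at a time, or store a precomputed ordering string,
rather than sort a whole list in one call. As a sketch under that
assumption (nothing below is taken from the EPrints codebase),
Unicode::Collate also offers cmp and getSortKey methods for those cases:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.16;
use utf8;     # this source contains literal non-ASCII characters
use open qw(:std :encoding(UTF-8));
use Unicode::Collate;

# level => 1 compares base letters only, ignoring case and accents.
my $collator = Unicode::Collate->new(level => 1);

my @names = ('Lee, K', 'Church, B', 'Çınar, D', 'Ågren, R');

# Pairwise comparison, usable inside an existing sort block.
my @sorted = sort { $collator->cmp($a, $b) } @names;
say "* $_" for @sorted;

# Precomputed keys: binary strings that order correctly under plain cmp,
# which helps when the ordering value is stored and compared later.
my %key_for = map { $_ => $collator->getSortKey($_) } @names;
my @by_key  = sort { $key_for{$a} cmp $key_for{$b} } @names;
say "* $_" for @by_key;
```

Both lists come out in the same order; which method is more convenient
depends on how the surrounding code does its comparisons.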

Where are you seeing the wrong order appear? In what context do you
want the order to be changed?
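
One reason the context matters: the expected position of a letter can
depend on the language of the audience. As an illustration only (the
choice of Swedish is an assumption made here, not something from the
original question), Unicode::Collate::Locale applies language-specific
tailoring, and in Swedish alphabetical order Å comes after Z:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.16;
use utf8;     # this source contains literal non-ASCII characters
use open qw(:std :encoding(UTF-8));
use Unicode::Collate::Locale;

my @names = ('Lee, K', 'Church, B', 'Çınar, D', 'Ågren, R');

# Assumption for illustration: a Swedish-language listing.
# level => 1 still ignores case and any accents the locale does not
# treat as separate letters.
my $swedish_collator = Unicode::Collate::Locale->new(
    locale => 'sv',
    level  => 1,
);

my @swedish = $swedish_collator->sort(@names);
say "* $_" for @swedish;
```

With this tailoring 'Ågren, R' moves to the end of the list, so the
"correct" order is something to confirm for the site in question.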

Of course there may also be a simple EPrints option that switches to
more correct ordering,
so I probably should have checked the EPrints wiki before looking up
the Perl solution.

Attempting to copy and paste the short experimental script I just
wrote - hope it doesn't get butchered in email form:

====================

#!/usr/bin/env perl

package UnicodeSortExample;

# Used throughout:
use     strict;
use     warnings;
use     v5.16;  # Enables the 'fc' keyword, as well as unicode_strings,
                # say and other useful things. See
                # https://perldoc.perl.org/feature#FEATURE-BUNDLES
use     utf8;
use     English;    # Allows use of $ARG instead of $_. See
                    # https://perldoc.perl.org/perlvar
use     Unicode::Collate;   # Allows sorting in Unicode.
                            # Included in core Perl since Perl 5.8

# Global Encoding Settings:
my  $encoding_layer;

SET_ENCODING_LAYER_AT_COMPILE_TIME: BEGIN {
    # Change this to the desired encoding value.
    my  $encoding_to_use                =   'UTF-8';

    # This is what actually gets used for the layer.
    $encoding_layer                     =   ":encoding($encoding_to_use)";
};

# The effect of :std is global.
use     open ':std'                     ,   "$encoding_layer";
binmode STDIN                           ,   $encoding_layer;
binmode STDOUT                          ,   $encoding_layer;
binmode STDERR                          ,   $encoding_layer;

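# PERL_UNICODE is only read when a perl interpreter starts, so setting it
# here does not change this process (the binmode calls above do that);
# it is inherited by any perl processes launched from this script.
# A = Expect @ARGV values to be UTF-8 strings.
# S = Shortcut for I+O+E - standard input, output and error will be UTF-8.
$ENV{'PERL_UNICODE'}                    =   'AS';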

=pod Pod Documentation for UnicodeSortExample.pm

=encoding utf8

=cut

=pod FILENAME, VERSION, SYNOPSIS, DESCRIPTION, VERSION

=head2 FILENAME

UnicodeSortExample.pm - Experimenting with alphabetical sorting.

=head2 VERSION

This is Version v1.0.0.

=cut

our $VERSION                            =   'v1.0.0';

=pod SYNOPSIS, DESCRIPTION

=head2 SYNOPSIS

    # Run file at the command line:
    perl ./UnicodeSortExample.pm

=head2 DESCRIPTION

Modulino for experiments with alphabetical sorting.

=cut

UnicodeSortExample->start() unless caller;

=pod SUBROUTINES

=head2 SUBROUTINES

=cut

=head3 UnicodeSortExample->start()

Sets input (a hardcoded unsorted list).

Does processing (calls a variety of sort methods and saves the results
to a series of lists).

Displays output (describes what each list is, and then displays it).

=cut

sub start {

    # Initial Values:
    my  $self           =   shift;

    my  @unsorted_list  =   (
                                'Lee, K',
                                'Church, B',
                                'Çınar, D',
                                'Ågren, R',
                            );
    # Processing:
    my  @default_sort       =   $self->default_sort(@unsorted_list);
    my  @case_folding_sort  =   $self->case_folding_sort(@unsorted_list);
    my  @unicode_sort       =   $self->unicode_sort(@unsorted_list);

    # Output:
    say '';
    say 'Unordered Input:';
    say '';
    say "* $ARG\n" for @unsorted_list;

    say 'Applying default alphabetical sort:';
    say '';
    say "* $ARG\n" for @default_sort;

    say 'Applying case folded alphabetical sort:';
    say '';
    say "* $ARG\n" for @case_folding_sort;

    say 'Applying Unicode sort:';
    say '';
    say "* $ARG\n" for @unicode_sort;

}

=head3 $self->default_sort(@unordered_list);

Takes a list,
and returns it sorted,
according to the standard alphabetical sort described at:
L<https://perldoc.perl.org/functions/sort>

=cut

sub default_sort {
    my  $self   =   shift;
    return (sort {our $a cmp our $b} @ARG);
}

=head3 $self->case_folding_sort(@unordered_list);

Takes a list,
and returns it sorted,
according to the standard case insensitive alphabetical sort described at:
L<https://perldoc.perl.org/functions/sort>

=cut

sub case_folding_sort {
    my  $self   =   shift;
    # fc folds case across all of Unicode,
    # so comparisons are always case insensitive.
    return (sort {fc(our $a) cmp fc(our $b)} @ARG);
};

=head3 $self->unicode_sort(@unordered_list);

Takes a list,
and returns it sorted,
according to the standard unicode case and accent insensitive
alphabetical sort described at:
L<https://perldoc.perl.org/perlunicook#%E2%84%9E-37:-Unicode-locale-collation>
and elaboration can be found at:
L<https://perldoc.perl.org/Unicode::Collate>
or L<Unicode::Collate>.

=cut

sub unicode_sort {
    my  $self   =   shift;
    my  @configuration_to_ignore_case_and_diacritics    =   (level => 1);

    return Unicode::Collate
        ->new(@configuration_to_ignore_case_and_diacritics)
        ->sort(@ARG);
}

=head2 AUTHOR (en-GB)

Andrew Mehta

=cut

=head2 COPYRIGHT AND LICENSE (en-GB)

Copyright ©2025, Andrew Mehta.

This program is free software; you can redistribute it and/or modify
it under the same terms as Perl 5.42.0.
For more details, see the full text of the licenses via
L<perlartistic> and L<perlgpl>.
This program is distributed in the hope that it will be useful, but
without any warranty;
without even the implied warranty of merchantability or fitness for a
particular purpose.

=cut

1;

__END__



Quoting Will Hughes <w.p.hughes@reading.ac.uk>:

Hi

Hopefully a quick question with an easy answer:

How do we get alphabetic sorting to list accented characters at an
appropriate point in an alphabetic list? The default behaviour seems
to use Unicode values or something, as accented characters appear at
the end of the alphabet.

For example, when I see this kind of sequence from EPrints:


  *   Church, B
  *   Lee, K
  *   Ågren, R
  *   Çınar, D

I feel that it should (probably) be:


  *   Ågren, R
  *   Church, B
  *   Çınar, D
  *   Lee, K

Is there a simple setting to implement sorting in a way that
respects accented characters? (and will these characters reproduce
accurately after emailing! Image attached just in case)

Best wishes

Will

Will Hughes
Emeritus Professor of Construction Management and Economics
School of the Built Environment
University of Reading, PO Box 219, Whiteknights
Reading, RG6 6DF, UK