EPrints Technical Mailing List Archive


Message: #02741

[EP-tech] Re: UTF-8 issues on BibTeX import?


If you want to diagnose and fix character encoding problems properly,
you should trace the strings back to their origin (likely somewhere in
the DBI driver) and check, at each point in the control flow where they
are used, whether the actual internal representation matches the
expected one. Pay particular attention to conversions between
characters and byte streams, in either direction: every such conversion
happens in terms of some character encoding, and it is essential to
ensure that the intended one is used.
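
To make the point about conversions concrete, here is a minimal sketch
using the core Encode module. It assumes the incoming bytes are UTF-8,
which is only an assumption for illustration; the variable names are
hypothetical and nothing here is taken from the EPrints code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file, socket or database driver:
# "Péron" encoded as UTF-8 (the e-acute is the two bytes C3 A9).
my $bytes = "P\xC3\xA9ron";

# Byte stream -> characters: name the encoding explicitly.
my $chars = decode('UTF-8', $bytes);

# Characters -> byte stream: again, name the encoding explicitly.
my $out = encode('UTF-8', $chars);

printf "bytes in:   %d\n", length $bytes;   # 6
printf "characters: %d\n", length $chars;   # 5
printf "bytes out:  %d\n", length $out;     # 6
```

If the lengths at some point in your own code do not match what you
expect, a conversion is probably missing there or has been applied twice.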

More on the dirty details and how to examine strings in Perl:

http://plosquare.blogspot.com/2009/04/viewing-internal-representation-of.html
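
As a rough sketch of that kind of examination (not the method from the
linked post), the helper below reports Perl's internal UTF8 flag, the
length and the code points of a string. The name describe() is
hypothetical; utf8::is_utf8 and utf8::decode are built in.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a string by its UTF8 flag, its length
# and the code point of each element.
sub describe {
    my ($s) = @_;
    my $flag = utf8::is_utf8($s) ? 1 : 0;
    my $cps  = join ' ', map { sprintf 'U+%04X', ord } split //, $s;
    return "utf8=$flag len=" . length($s) . " [$cps]";
}

my $raw     = "P\xC3\xA9ron";   # undecoded UTF-8 bytes
my $decoded = $raw;
utf8::decode($decoded);         # in place: bytes -> characters

print describe($raw), "\n";     # six elements, C3 and A9 listed separately
print describe($decoded), "\n"; # five elements, a single U+00E9
```

Calling such a helper at each step between the database and the output
shows where a string stops being what you think it is.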

If you are not very careful, chances are that whatever solution you come
up with for your environment will break something for someone else...

Andrew Beeken wrote:
> The truth is, I’m not sure where this should really go - the issue seems
> to be in the standard BibTeX importer in perl_lib so ideally I’d like to
> extend this to sanitise these kinds of characters out of the data.
> 
> On 10/03/2014 15:45, "Ian Stuart" <Ian.Stuart@ed.ac.uk> wrote:
> 
>> Reading strings?
>>
>> Have you tried
>>
>>   $count = utf8::upgrade($name)
>>
>> see http://perldoc.perl.org/utf8.html
>>
>> (I tried all sorts of things over the years... and I don't think I've
>> been consistent)
>>
>> On 10/03/14 15:31, Andrew Beeken wrote:
>>> Interesting!
>>>
>>> Looking into this a bit further, the issue seems to be around the keys
>>> that records take with them out of, say, a Scopus export. For example, a
>>> record may be given a key of Péron20141; note the accent - this is the
>>> part that's causing the issue and is probably understandable if the key
>>> is conforming to specific standards. With this in mind, is there a
>>> workaround?
>>>
>>> On 10/03/2014 11:24, "Ian Stuart" <Ian.Stuart@ed.ac.uk> wrote:
>>>
>>>> On 10/03/14 11:02, Andrew Beeken wrote:
>>>>> Me again!
>>>>>
>>>>> Another issue that has been flagged up by our admin users is that a
>>>>> BibTeX import will fall over when it encounters accented characters
>>>>> in an author name. I've already flagged a problem with UTF-8 encoding
>>>>> in output in another email and I'm wondering if there is a similar
>>>>> fix here?
>>>>
>>>> Something to consider (I fell over this) is that web servers have a
>>>> tendency to not actually send UTF-8, even when you ask them to....
>>>>
>>>> I have a script that wouldn't render the name of some Dutch university
>>>> correctly..... but when I added in the name of a Chinese one, it was
>>>> fine.
>>>>
>>>> It was a blinkin' NIGHTMARE to figure out.... and in the end I bypassed
>>>> the EPrints output, and just "printed" directly, with the line
>>>>
>>>>     binmode(STDOUT, ":utf8");
>>>>
>>>> in my code.
>>
>>
>> --
>>
>> Ian Stuart.
>> Developer: ORI, RJ-Broker, and OpenDepot.org
>> Bibliographics and Multimedia Service Delivery team,
>> EDINA,
>> The University of Edinburgh.
>>
>> http://edina.ac.uk/
>>
>>
>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive: http://www.eprints.org/tech.php/
>> *** EPrints community wiki: http://wiki.eprints.org/
>> *** EPrints developers Forum: http://forum.eprints.org/
> 
> 