EPrints Technical Mailing List Archive
Message: #07197
Re: [EP-tech] A specific eprint doesn't get indexed ,
- To: <eprints-tech@ecs.soton.ac.uk>
- Subject: Re: [EP-tech] A specific eprint doesn't get indexed ,
- From: Matthew Kerwin <matthew@kerwin.net.au>
- Date: Sat, 3 Mar 2018 11:45:00 +1000
We've dealt with this over the years, too. Some pointers, which might be difficult depending on your situation:

1. Make sure the relevant columns/tables/database use a Unicode encoding (currently the 3.3 branch is set up for 'utf8', but I've migrated ours to 'utf8mb4'). This involves both:
   a) making sure the EPrints code uses the right encoding parameters in all its database queries (not just EPrints::Database and EPrints::Database::mysql, but also any other library or package that handles its own database connections), and
   b) ensuring that any existing database tables are converted correctly (see: https://dev.mysql.com/doc/refman/5.7/en/alter-table.html#alter-table-character-set )

2. Make sure the connection to the database uses a Unicode encoding; for example:
   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database/mysql.pm#L242
   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database.pm#L164
   * https://dev.mysql.com/doc/refman/5.7/en/mysql-command-options.html#option_mysql_default-character-set

3. Make sure EPrints/Perl handles Unicode strings correctly and consistently.

It's a bit of a pain, but we're working at it!

Cheers

On 3 March 2018 at 10:53, David R Newman <drn@ecs.soton.ac.uk> wrote:
> Hi Avi,
>
> I have noted this issue happening quite a lot as well. I have tracked it
> down to a problem indexing PDF documents where the extracted word to be
> indexed contains non-ASCII characters. If the whole word consists of
> non-ASCII characters, the empty string effectively gets indexed. If more
> than one word consists entirely of non-ASCII characters, indexing fails
> with the error you see below, because the empty string cannot be indexed
> twice for the same EPrint and field (i.e. documents). This is because the
> eprint__rindex table has three fields that make up its primary key: field,
> word and eprintid. As the middle one is not set, that is why you see
> documents--91 rather than something like documents-word-91 in your error
> message.
>
> As far as I can tell, this only stops the one badly encoded word from
> being indexed, rather than preventing all indexing for the whole EPrint. I
> have tested this by writing a script to completely de-index an EPrint and
> then running reindex: I could see the records disappear from the
> eprint__rindex table and then reappear after the reindex.
>
> I am going to see if I can get the encoding issue sorted out, as this is
> likely to be problematic for people who are indexing publications with
> non-Latin alphabets. However, this is never straightforward, based on past
> experience.
>
> Regards
>
> David Newman
>
> On 02/03/2018 10:53, Stenger, Avischai wrote:
> > Hello all,
> >
> > I have some eprints that do not get reindexed. If I execute, as an
> > example:
> >
> > ~/bin/epadmin reindex REPO eprint 91
> >
> > I get the error:
> >
> > DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key
> > 'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm line
> > 1287.
> >
> > I noticed that if I replace the PDF document in this eprint, I can index
> > it without any error message.
> >
> > If I check the PDF with an online PDF checker, it says the PDF is okay.
> > (https://www.pdf-online.com/osa/validate.aspx)
> >
> > Thanks, and have a good weekend
> >
> > Avi
> >
> > *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> > *** Archive: http://www.eprints.org/tech.php/
> > *** EPrints community wiki: http://wiki.eprints.org/
> > *** EPrints developers Forum: http://forum.eprints.org/

--
Matthew Kerwin
https://matthew.kerwin.net.au/
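
For anyone who wants to see the failure mode David describes in isolation, here is a minimal sketch. It is not EPrints code: it uses an in-memory SQLite table named rindex as a stand-in for eprint__rindex, with the same three-column primary key (field, word, eprintid). The helper names ascii_only_terms and index_terms are hypothetical, and the lossy step that strips non-ASCII characters is only an assumption about how a fully non-ASCII word ends up as an empty string.

```python
import re
import sqlite3


def ascii_only_terms(words):
    """Hypothetical lossy step: drop every non-ASCII character from each word."""
    return [re.sub(r"[^\x00-\x7f]", "", word) for word in words]


def index_terms(conn, field, eprintid, terms):
    """Insert one row per term; return the terms rejected by the primary key."""
    rejected = []
    for term in terms:
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)",
                (field, term, eprintid),
            )
        except sqlite3.IntegrityError:
            # Same class of failure as MySQL's "Duplicate entry ... for key 'PRIMARY'".
            rejected.append(term)
    return rejected


# Stand-in table: three columns that together form the primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rindex ("
    " field TEXT NOT NULL,"
    " word TEXT NOT NULL,"
    " eprintid INTEGER NOT NULL,"
    " PRIMARY KEY (field, word, eprintid))"
)

# One ASCII word and two words made up entirely of non-ASCII characters.
words = ["index", "שלום", "עולם"]
terms = ascii_only_terms(words)          # both non-ASCII words collapse to ""
rejected = index_terms(conn, "documents", 91, terms)

print(terms)     # the two non-ASCII words are now empty strings
print(rejected)  # the second empty string violates the primary key
```

The first empty string is stored as ('documents', '', 91), which corresponds to the documents--91 key in the error message; the second one collides with it. The ASCII word is indexed normally, which matches the observation that only the badly encoded words are lost rather than the whole EPrint.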