EPrints Technical Mailing List Archive
Message: #07198
Re: [EP-tech] A specific eprint doesn't get indexed ,
- To: eprints-tech@ecs.soton.ac.uk
- Subject: Re: [EP-tech] A specific eprint doesn't get indexed ,
- From: David R Newman <drn@ecs.soton.ac.uk>
- Date: Sat, 3 Mar 2018 18:49:24 +0000
Hi Matthew,
Thanks for the advice. That seems to fix the issue I observed. To give an example walkthrough for Avi, I did the following:
1. Added the $dsn.= ";mysql_enable_utf8=1"; line just before the return line of the build_connection_string method in perl_lib/EPrints/Database.pm
2. Changed $self->do("SET NAMES 'utf8'"); to $self->do("SET NAMES 'utf8mb4'"); in the connect method of perl_lib/EPrints/Database/mysql.pm
3. Ran the following commands at the MySQL prompt. (I am not sure whether the collate lines are needed, but I wanted to keep things consistent):
ALTER TABLE eprint__rindex CONVERT TO CHARACTER SET utf8mb4;
ALTER TABLE eprint__rindex MODIFY COLUMN word varchar(128) NOT NULL COLLATE 'utf8mb4_bin';
ALTER TABLE eprint__rindex MODIFY COLUMN field varchar(64) NOT NULL COLLATE 'utf8mb4_bin';
4. Ran my script to de-index the record. This should not be necessary, but it let me confirm that the index entries were removed before being re-added.
5. Ran epadmin reindex on the appropriate record.
6. Queried the database to make sure that the words that had previously failed to be indexed succeeded this time.
7. Did an advanced search on the documents field using one of the newly indexed terms found by the database query, to confirm that the EPrint is returned as a result.
It is probably worth doing a complete reindex of all EPrint records using epadmin reindex. This will achieve two things: test that the original problem is resolved, and make all EPrints searchable on the terms that were intended to be indexed.
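As background to the switch from 'utf8' to 'utf8mb4' in steps 2 and 3: MySQL's 'utf8' character set stores at most three bytes per character, while 'utf8mb4' stores up to four, so only the latter covers characters outside the Basic Multilingual Plane. The following Python sketch shows which words fit within the three-byte limit. The helper names are hypothetical and not part of EPrints, and the thread does not state whether the words that failed here actually contained four-byte characters.

```python
def utf8_byte_length(ch: str) -> int:
    """Number of bytes needed to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes.

    That is the limit of MySQL's 'utf8' (utf8mb3) character set;
    'utf8mb4' also accepts the 4-byte characters.
    """
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# Illustrative sample words (assumptions, not taken from the thread).
samples = [
    "index",               # plain ASCII, 1 byte per character
    "Universit\u00e4t",    # contains a 2-byte character (a with diaeresis)
    "\U0001D6FC-decay",    # starts with a 4-byte character (mathematical italic alpha)
]

for word in samples:
    print(repr(word), "fits in 3-byte utf8:", fits_mysql_utf8(word))
```

A word for which fits_mysql_utf8 returns False can only be stored intact in a column that has been converted to utf8mb4.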
Regards
David Newman

On 03/03/2018 01:45, Matthew Kerwin wrote:
We've dealt with this over the years, too. Some pointers, which might be difficult depending on your situation:

1. Make sure the relevant columns/tables/database use a Unicode encoding (currently the 3.3 branch is set up for 'utf8', but I've migrated ours to 'utf8mb4'). This involves both:
   a) making sure the EPrints code uses the right encoding parameters in all its database queries (not just EPrints::Database and EPrints::Database::mysql, but also any other library or package that handles its own database connections), and
   b) ensuring that any existing database tables are converted correctly (see: https://dev.mysql.com/doc/refman/5.7/en/alter-table.html#alter-table-character-set )
2. Make sure the connection to the database uses a Unicode encoding; for example:
   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database/mysql.pm#L242
   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database.pm#L164
   * https://dev.mysql.com/doc/refman/5.7/en/mysql-command-options.html#option_mysql_default-character-set
3. Make sure EPrints/Perl handles Unicode strings correctly and consistently.

It's a bit of a pain, but we're working at it!

Cheers

On 3 March 2018 at 10:53, David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi Avi,

I have noted this issue happening quite a lot as well. I have tracked it down to a problem indexing PDF documents where the extracted word to be indexed contains non-ASCII characters. If the whole word consists of non-ASCII characters, effectively the empty string gets indexed. If more than one word consists entirely of non-ASCII characters, indexing fails with the error you see below, because the empty string cannot be indexed twice for the same EPrint and field (i.e. documents). This is because the eprint__rindex table has three fields that make up its primary key: field, word and eprintid. As the middle one is not set, that is why you see documents--91 rather than something like documents-word-91 in your error message.
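To illustrate the collision described above, here is a minimal Python sketch that uses an in-memory SQLite table as a stand-in for eprint__rindex. Only the three primary-key columns (field, word, eprintid) come from the description above; the table name rindex, the helper functions and the ascii_only step are assumptions made for illustration and are not EPrints' actual indexing code.

```python
import sqlite3


def ascii_only(word: str) -> str:
    """Hypothetical lossy step: drop every non-ASCII character from a word."""
    return word.encode("ascii", errors="ignore").decode("ascii")


def index_words(conn, eprintid, field, words):
    """Insert one row per word; return the words whose insert was rejected."""
    rejected = []
    for word in words:
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)",
                (field, ascii_only(word), eprintid),
            )
        except sqlite3.IntegrityError:
            # Same (field, word, eprintid) triple already present.
            rejected.append(word)
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rindex (
        field    TEXT    NOT NULL,
        word     TEXT    NOT NULL,
        eprintid INTEGER NOT NULL,
        PRIMARY KEY (field, word, eprintid)
    )
    """
)

# One ASCII word and two words made up entirely of non-ASCII (Greek) letters.
words = ["thesis", "\u03bb\u03bf\u03b3\u03bf\u03c2", "\u03b5\u03c1\u03b3\u03bf\u03bd"]
rejected = index_words(conn, 91, "documents", words)

stored = [row[0] for row in conn.execute("SELECT word FROM rindex ORDER BY word")]
print("stored words:", stored)
print("rejected words:", rejected)
```

In this sketch the first all-non-ASCII word is stored as the empty string, and the second one is rejected because the triple ('documents', '', 91) already exists, which mirrors the 'documents--91' duplicate entry in the error message.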
As far as I can tell, this only prevents the one badly encoded word from being indexed, rather than preventing all indexing for the whole EPrint. I tested this by writing a script to completely de-index an EPrint and then running reindex; I could see the records disappear from the eprint__rindex table and then reappear after the reindex.

I am going to see if I can get the encoding issue sorted out, as this is likely to be problematic for people who are indexing publications in non-Latin alphabets. However, based on past experience, this is never straightforward.

Regards
David Newman

On 02/03/2018 10:53, Stenger, Avischai wrote:

Hello to all,

I have some eprints that do not get rindexed. If I execute, as an example:

~/bin/epadmin reindex REPO eprint 91

I get the error:

DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key 'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm line 1287.

I noticed that if I replace the PDF document in this eprint, I can index it without any error message. If I check the PDF with an online PDF checker, it says the PDF is okay. (https://www.pdf-online.com/osa/validate.aspx)

Thanks and have a good weekend,
Avi

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/
- References:
- [EP-tech] A specific eprint doesn't get indexed ,
- From: "Stenger, Avischai" <avischai.stenger@ulb.tu-darmstadt.de>
- Re: [EP-tech] A specific eprint doesn't get indexed ,
- From: David R Newman <drn@ecs.soton.ac.uk>
- Re: [EP-tech] A specific eprint doesn't get indexed ,
- From: Matthew Kerwin <matthew@kerwin.net.au>