EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09179

Re: [EP-tech] Modification of language by script

To: András Holl <holl.andras@konyvtar.mta.hu>
Subject: Re: [EP-tech] Modification of language by script
From: John Salter <J.Salter@leeds.ac.uk>
Date: Wed, 18 Jan 2023 18:39:03 +0000

CAUTION: This e-mail originated outside the University of Southampton.

> so far I was not able to find out the exact process of generating indexcodes

For a PDF, by default, it's:
/usr/bin/pdftotext -enc UTF-8 -layout SOURCE_DOC TARGET_DOC


The full story goes something like this:

        $doc->make_indexcodes();
This searches for 'Convert' plugins that can create 'indexcodes' types. By default, there is only one:
        EPrints::Plugin::Convert::IndexCodes

The 'can_convert' method in that module looks for other plugins that can convert the document type (e.g. pdf) into 'text/plain'.
If one exists, the Convert::IndexCodes plugin returns, saying that it can convert the document.

The actual generation of the indexcodes.txt file is done by:
        EPrints::Plugin::Convert::IndexCodes::export
This calls the other plugin (the one that can convert $doc into text/plain), extracts the words from the resulting text, and saves the file.

In most cases, this is done by
        EPrints::Plugin::Convert::PlainText
At the top of that module file, there is a hash, defining which application to use for which document format:
%EPrints::Plugin::Convert::PlainText::APPS = qw(
pdf             pdftotext
doc             doc2txt
htm             elinks
html            elinks
xml             elinks
ps              ps2ascii
txt             _special
);

NB There is special handling for docx files in Convert::PlainText.

The programs are defined in:
        ~/lib/syscfg.d/executables.pl
That file also checks whether each program exists.
The actual commands used are defined in:
        ~/lib/syscfg.d/invocations.pl

By default, for a PDF, indexcodes.txt is generated by calling:
        /usr/bin/pdftotext -enc UTF-8 -layout $(SOURCE) $(TARGET)
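
To reproduce that extraction step by hand, outside EPrints, something like the following might do. It is only a sketch: the template string is copied from the command above (the real one lives in invocations.pl and may differ on your install), and the function names are made up for illustration.

```python
import subprocess

# Template copied from the message above; check invocations.pl on your own
# install, since paths and flags may differ there.
PDFTOTEXT_TEMPLATE = "/usr/bin/pdftotext -enc UTF-8 -layout $(SOURCE) $(TARGET)"


def build_command(template, source, target):
    """Turn the template into an argument list, filling in the two placeholders.

    Working on a list (rather than one shell string) means file names that
    contain spaces are passed through untouched.
    """
    replacements = {"$(SOURCE)": source, "$(TARGET)": target}
    return [replacements.get(token, token) for token in template.split()]


def extract_text(source, target):
    """Run the conversion; raises CalledProcessError if pdftotext fails."""
    subprocess.run(build_command(PDFTOTEXT_TEMPLATE, source, target), check=True)
```

Calling extract_text("Document.pdf", "Document.txt") should then produce the same plain text that the indexer starts from, before any word extraction is applied.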

And in describing all that, I have discovered that pdftotext is not installed on my VM, and I haven't been generating the indexcodes.txt files for PDFs for a few years.
Something for me to add to my to-do list!

Cheers,
John

-----Original Message-----
From: András Holl [mailto:holl.andras@konyvtar.mta.hu]
Sent: 18 January 2023 09:24
To: John Salter <J.Salter@leeds.ac.uk>
Cc: eprints-tech <eprints-tech@ecs.soton.ac.uk>
Subject: Re: Modification of language by script



Hi John,


Many thanks! I am aware of indexcodes, and for this purpose (finding out the language) those bags of words could indeed
be useful! For the text mining I am a bit reluctant to use them - so far I was not able to find out the exact
process of generating indexcodes - so I have relied on my own stuff and standard text extraction tools instead.

Andras Holl

--
Holl András
informatikai főigazgató-helyettes / deputy director (IT)
MTA Könyvtár és Információs Központ / MTA Library and Information Centre


----- Original Message -----
From: "John Salter" <J.Salter@leeds.ac.uk>
To: "holl andras" <holl.andras@konyvtar.mta.hu>
Cc: "eprints-tech" <eprints-tech@ecs.soton.ac.uk>
Sent: Wednesday, 18 January, 2023 09:59:26
Subject: RE: Modification of language by script

Hi András,
Just had an additional thought...

If you have full-text indexing turned on for your repository, there might be useful stuff in the 'indexcodes.txt' documents that are generated from the PDFs.

This may already have extracted words from the PDF that can be fed in to the language-identification tool that you use to check the title.

The indexcodes files should be in a directory in the EPrint folder e.g. for EPrintID 56, the folder is:
        EPRINTS_ROOT/archives/ARCHIVE_ID/documents/disk0/00/00/00/56/
The sub folders may look like this:
        01/Document.pdf
        02/lightbox.jpg <-- thumbnails
        03/preview.jpg
        04/medium.jpg
        05/small.jpg
        06/indexcodes.txt    <---- this file might be useful to look at!
        revisions/ (contains multiple XML files)

The order of the subdirectories might be different - depending on what the indexer processed first - thumbnails or indexing.
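
If you want to find these files from outside EPrints, a sketch along these lines might help. It assumes the directory name is the EPrint ID zero-padded to eight digits and split into pairs (inferred only from the single example above), and that everything lives under disk0. The function names are hypothetical.

```python
from pathlib import Path


def eprint_dir(eprints_root, archive_id, eprintid):
    """Build the storage directory for one EPrint.

    Assumption: the ID is zero-padded to 8 digits and split into 2-digit
    path components, e.g. 56 -> 00/00/00/56.
    """
    padded = f"{eprintid:08d}"
    pairs = [padded[i:i + 2] for i in range(0, 8, 2)]
    return Path(eprints_root, "archives", archive_id, "documents", "disk0", *pairs)


def find_indexcodes(directory):
    """Return every indexcodes.txt one level below the EPrint directory.

    The numbered subdirectory varies (it depends on what the indexer
    processed first), so each one is checked rather than assuming '06'.
    """
    return sorted(Path(directory).glob("*/indexcodes.txt"))
```

For example, find_indexcodes(eprint_dir(EPRINTS_ROOT, ARCHIVE_ID, 56)) would list any indexcodes.txt files for EPrint 56, with the two placeholders replaced by your own values.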

You can get the indexcodes document from the original using something like this ($doc is the 'original' document, from e.g. $eprint->get_all_documents):
        my $index_doc = $doc->search_related( "isIndexCodesVersionOf" )->item( 0 );
        if( defined $index_doc )
        {
            # get the filestream and send contents to the language identifier
        }

There are some caveats to the indexcodes data:
- I think that the thumbnails/indexcodes are not normally generated until an item is live in the repository
- the words may be 'stemmed' - removing some word-endings 's', 'ing' type stuff.

Not sure if that's useful or not - thought it was worth sharing in case!

Cheers,
John

-----Original Message-----
From: András Holl [mailto:holl.andras@konyvtar.mta.hu]
Sent: 17 January 2023 10:10
To: John Salter <J.Salter@leeds.ac.uk>
Cc: eprints-tech <eprints-tech@ecs.soton.ac.uk>
Subject: Re: Modification of language by script



Dear John,

Yes, there are some EPrints with multiple documents, and they might be in different languages.
And yes, I do want to process everything.

However, not necessarily with the same tool. I already have scripts for examining the
language settings, and also some tools for finding out what the language is. (At the moment
what I do is the following: get the title, run it through a spell checker with a given dictionary, and
find out the percentage of the known words in the title. This checks in one step whether
the first hypothesis [the language is Hungarian] is true or not. Later I might want to
process the full text layer of the PDF. Next I might repeat the same with the next
choice of language, say, English.)
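
The title check described above could be sketched roughly as follows. This is not the actual tooling used here: plain word sets stand in for the spell checker's dictionaries, and the 0.5 threshold and the function names are assumptions for illustration.

```python
import re


def known_word_fraction(title, dictionary):
    """Fraction of the words in the title that appear in the dictionary.

    Words are runs of letters (accented letters included), compared in
    lower case. An empty title scores 0.0.
    """
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", title)]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)


def guess_language(title, dictionaries, threshold=0.5):
    """Try each (language_code, word_set) hypothesis in order.

    Returns the first language whose dictionary recognises at least
    `threshold` of the title's words, or None if no hypothesis holds.
    """
    for code, vocabulary in dictionaries:
        if known_word_fraction(title, vocabulary) >= threshold:
            return code
    return None
```

Listing Hungarian first and English second mirrors the order of hypotheses described above. Titles that match neither come back as None and can be left for the full-text pass.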

Also, I might do the simple cases (only one document) first.

So fix_language might be just the thing I need - I am looking at it.

Thank You!

András

----- Original Message -----
From: "John Salter" <J.Salter@leeds.ac.uk>
To: "holl andras" <holl.andras@konyvtar.mta.hu>
Cc: "eprints-tech" <eprints-tech@ecs.soton.ac.uk>
Sent: Tuesday, 17 January, 2023 10:35:06
Subject: RE: Modification of language by script

Hi András,
Do some EPrints have multiple documents, and can those documents be in different languages?

It sounds like you want to process everything, rather than searching for specific documents/EPrints to update.

From the details in these pages:
- https://wiki.eprints.org/w/API:EPrints/DataSet
- https://wiki.eprints.org/w/API:EPrints/List

I would start with something like this:
https://gist.github.com/jesusbagpuss/a8cc8c5328aa6e33e068609bc6f3d6ca

The bits you need to work out are in the 'process_eprint' function:
https://gist.github.com/jesusbagpuss/a8cc8c5328aa6e33e068609bc6f3d6ca#file-fix_language-L78 - how to calculate the language based on the EPrint details
https://gist.github.com/jesusbagpuss/a8cc8c5328aa6e33e068609bc6f3d6ca#file-fix_language-L86 - how to calculate the language from the document

As-is, the script will not change anything. The 'commit' lines are commented out for safety.
It also references a field that might not exist (eprint.language).

You may want to check the existing setting for the language of the document before updating it.
If you have EPrints with multiple documents attached, you might want to do something like this:
        $eprint->set_under_construction(1);
        # ... update (commit) multiple doc changes here
        $eprint->set_under_construction(0);
        $eprint->commit;
This means that the EPrint will have one new revision, rather than a revision for each document updated.

Let me know if that helps at all!

Cheers,
John




-----Original Message-----
From: András Holl [mailto:holl.andras@konyvtar.mta.hu]
Sent: 16 January 2023 12:05
To: John Salter <J.Salter@leeds.ac.uk>
Cc: eprints-tech <eprints-tech@ecs.soton.ac.uk>
Subject: Re: Modification of language by script



Dear John,

I am using EPrints 3.3.15. So far, the scripts for 3.2 have worked for me.

Since we installed EPrints (around 2008), the language field for
documents has been hidden. For each uploaded document, EPrints used the
language setting of the browser as a guess for the language of the document,
and we did not care.

Now we have embarked upon a text mining project, and suddenly it has become
important what the language is. I will process the content of the repository
(some 200k items), and find out what the language is, based first on the language
of the title, and then maybe the language of the text layer of the PDFs.

But when I know (or have a reasonable guess), I might try to set the language
of the EPrint document.

With kind regards,

Andras Holl

--
Holl András
informatikai főigazgató-helyettes / deputy director (IT)
MTA Könyvtár és Információs Központ / MTA Library and Information Centre

----- Original Message -----
From: "John Salter" <J.Salter@leeds.ac.uk>
To: "eprints-tech" <eprints-tech@ecs.soton.ac.uk>, "holl andras" <holl.andras@konyvtar.mta.hu>
Sent: Monday, 16 January, 2023 12:37:30
Subject: RE: Modification of language by script

Hi András,
Which version of EPrints are you using?

The scripts you found were written against EPrints 3.2, so might not work if you are using EPrints 3.3 or 3.4.

How do you determine which documents need the language field to be updated?
Is there a field at the EPrint, or at the Document level that you need to search for, to work out which ones, or do you have a list of IDs, or something similar?

Cheers,
John


-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of András Holl via Eprints-tech
Sent: 13 January 2023 12:51
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Modification of language by script


Dear All,

I would like to modify language settings of a document in a given EPrint by a script.

How should I do it, with the script search_and_modify.pl found at
https://www.eprints.org/services/training/resources/scripts/eprints3_2/bin/ ?

With kind regards,

András Holl

--
Holl András
informatikai főigazgató-helyettes / deputy director (IT)
MTA Könyvtár és Információs Központ / MTA Library and Information Centre

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
