Machine translation

Need help with translating WW1, Inter-War or WW2 related documents or information?
Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Machine translation

#1

Post by Der Alte Fritz » 21 Aug 2016, 08:19

I thought that it might be useful for people to have a review of the various methods of translating books and documents.

1) Scanners
I use a book scanner called the Plustek OpticBook 3800, which is similar to a normal flatbed scanner except that the scan area extends to within 8mm of the edge of the scanner box. This means that you can hang the book over the edge of the scanner and get right into the centre spine of the book. I bought a second-hand one on Amazon for £70. It comes with basic software to handle images and for OCR (Abbyy FineReader 9) (Book Pavilion), so it does a basic job straight out of the box. To get better results you can run the scans through external programmes such as:

2) OCR Programmes
Abbyy FineReader 12 Professional gives pretty accurate results. Programmes like this will either connect directly to a scanner or take a pile of documents and images on your computer, join them all up into one document, apply OCR to them and deliver a single document in .doc, .rtf, .pdf, etc. The OCR is of decent quality; it can be trained for a particular document, it shows you low-confidence characters and lets you correct them, and it has on-board dictionaries to help with this. It distinguishes between text, tables, images, etc., which makes handling documents very quick and easy. It costs around £100.

3) Machine translation
Google Translate is pretty good at German; the Microsoft Translator app is better at Russian but not so flexible. The advantage that Google has is that if you log into a Google account as well, you can train it to recognise odd words or give better translations of words, which it will remember next time (mainly). (Google still insists that the translation of the Russian KA is "spacecraft" despite my putting in "Red Army" every time.) Cut-and-paste entry is fairly easy for small bits of text, or both can take uploaded documents for a larger job.

You can get a much better level of control over a larger document by using the link at the bottom of Google Translate, which is Translator Toolkit. This allows you to work on a large document with Google Translate as you go along AND you can load custom glossaries which it uses to help with the translation. So I have a Soviet Military Glossary and a German one to help along. Except that recently the Soviet one has stopped working in Google, and I am sure that the problem is at their end.

Finally if you have a document in Fraktur you can use Abbyy OCR online service to get a decent Latin version. See here:

http://www.frakturschrift.com/en:products:onlineocr

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#2

Post by Der Alte Fritz » 27 Aug 2016, 06:35

The secret behind successful machine translation is to get as accurate an input text as possible, in exactly the right format, so that the individual words join up to make coherent sentences, especially in languages such as German with a lot of hyphenation. My main input format is .rtf (Rich Text Format), keeping line numbering but with line breaks and hyphenation switched off, in standard website fonts such as Arial, Trebuchet or Verdana.
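
If your OCR output has already been saved with hard line breaks and end-of-line hyphens, the same clean-up can be roughly approximated afterwards. The following is only a minimal sketch in Python, not anything FineReader itself does; the function name and the rules are assumptions made for illustration. It joins a word split by a hyphen at the end of a line when the next line starts with a lower-case letter, keeps the hyphen when the next line starts with a capital (as in Divisions-Artillerie), and unwraps single line breaks while leaving blank lines between paragraphs alone.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Mend end-of-line hyphenation and join hard-wrapped lines into paragraphs."""
    text = text.replace("\r\n", "\n")
    # "Divi-\nsion" -> "Division": the hyphen was only there because of the line break.
    # Caveat: this also joins cases like "Ein-\nund Ausgang", so check the result by eye.
    text = re.sub(r"(\w)-\n(?=[a-zäöüßа-яё])", r"\1", text)
    # "Divisions-\nArtillerie" -> "Divisions-Artillerie": a real hyphen, keep it.
    text = re.sub(r"-\n(?=\w)", "-", text)
    # A single line break inside a paragraph becomes a space; blank lines are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Tidy up any doubled spaces left behind.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


sample = (
    "Die gesamte Divisions-\nArtillerie wird der Divi-\nsion wieder unter-\nstellt.\n\n"
    "Ein Infanterie-Regiment\nwird befohlen."
)
print(unwrap_ocr_text(sample))
```

The point of doing it this way is that the translator sees whole words and whole sentences, which is the same reason for switching line breaks and hyphenation off at export.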

Try to keep all the features switched on and in use over several documents, as many of the systems have the capability to learn. So with Google Translate, log in with your Google account (to keep any corrections that you make); with Google Translator Toolkit, again log in and use glossaries, etc.

The way to get the best out of your input scan is to do the work methodically in Abbyy FineReader 12 and to use the correct settings at each stage. Learning these is a matter of trial and error, but here are some of my thoughts:
1) Do each stage of the process as a separate job rather than trying to lump them all together.

2) Start with the raw scan and process the image manually so that you get individual pages rather than double pages, a high contrast between page and text, a whitened background, the right dpi setting, and straightened, deskewed text lines.

3) Save different versions of your work at this stage as these changes cannot be adjusted later.

4) Analyse the text using FineReader and then go through it manually so that the text boxes are close around the text and you have removed as many marks and stray blobs on the page as possible. Use the different types of text box to identify tables, images and other non-body text items. Adjust tables so that you get clear blocks of text within each cell. You can set it to ignore or include headers, footers or page numbers, though you need a well-produced book for it to work successfully first time around. Page numbers are a bore as they break up your text, so they are best ignored.

5) If working on a book, I would concentrate at this stage on a few test pages to see how I get on, and produce test documents to run through Google to see if I can improve things. AFR12 tells you the error rate; if it is much above 5%, your Google output will be pretty garbled, so try to aim for 5% or less.

6) When the analysis of the pages is as good as you can get it, Read the text for your test pages and see what sort of error rate you come out with. You can Read again and again, and it is worth adjusting the settings to see if you get different results with a lower error rate. Only when it is as good as you can get is it worthwhile manually Verifying the text, as these corrections are lost the next time that you press the Read button, so a Save at this stage would seem to be a good idea. Picking the right Language is crucial, especially in multi-language documents or where, as in Russian, there are three different versions to pick from, or close relatives such as Ukrainian, Belorussian, Serbian or Bulgarian. Since this is an internal scan of characters, the right typeface seems to influence things as well, so make the right choice here to lower the error rate.

7) Manually training your system seems to work only to quite a limited extent, and really only when the characters are not part of a normal typeface, such as fancy book titles and specialist symbols. You need to do several pages to produce any really consistent results, but it can help with poor-quality scans where you get a consistent error such as misreading a 'b' and replacing it with an 'e'. It can also help with academic books in correctly identifying reference notes on the page, which again tend to break up the flow of your text and produce error words such as 'army124'.

8) When the AFR12 Reading settings are as good as you can get them on your group of test pages, use the same settings to do the whole book and then Verify the text if you want. This lets you manually correct any low-confidence characters, but often they are simply that, low confidence and not incorrect. A quicker way is to rely on the in-built dictionary and Spell Check the document, jumping from error to error until you find a word underlined in red and then correcting that. A right click of the mouse brings up suggestions or alternatives, or you can dive into the text and correct it by eye. If a lot of the problems are from words broken into bits by hyphens, go back a stage and see if you can improve this in the settings.

9) You can load your own custom dictionary which can be useful if you are using a lot of technical terms. Getting a glossary in a spreadsheet and then just selecting the column with the relevant language technical terms and pasting this into a .txt document and then uploading it is a quick and easy way to accomplish this. At least this gets rid of the errors in AFR12 and when they re-occur in Google Translate, you know that the problem is here.

10) It does make a difference at this stage what type of output text you have selected (Editable Text, Formatted Text, Plain Text, etc.), so the hyphen problem may disappear by manipulating this rather than the actual text, but you may have to sacrifice formatting for translation. Again the settings behind these have quite an influence on the output text, so it is worth making a trial of the different settings. For instance, keeping 'line breaks and hyphens' seems to stop Google Translate from translating at all, while un-checking this box allows a translation to be made in Russian. No idea why. Since you have your document as an AFR12 document, you can use this to generate a number of types of text document for the translator of your choice to work on.

11) When you finally upload your document into Google Translate (it can handle around 250-300 pages at a time), there is the issue of what you save the document as. The output text is basically a webpage and you can:
a) Save as a webpage. This seems to keep the background text as well, so that you can mouse over the blocks of text and see the original untranslated text, which is useful if you are having to correct persistent mistakes with acronyms, a constant problem in Russian. Why гиу should appear as SMI rather than GIU, who knows! Probably the State Mortgage Institution is a more popular search than the Red Army.
b) Copy the text and then paste it into a .doc document. This at least has the advantage that you can use the Find and Replace feature to correct persistent mistakes, e.g. converting 'spacecraft' into 'Red Army'. :o Or you can use the inbuilt or uploaded dictionaries to try and correct the text some more using the Spellcheck feature.
c) Print the page using CutePDF or Microsoft Print to PDF and get a .pdf document.
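
The Find and Replace idea in option b) can also be applied to a plain-text copy of the translation in one pass. This is only a rough sketch in Python under my own assumptions: the function name is made up, and the correction list holds just the two examples mentioned above, so you would build up your own list as you spot persistent mistakes. It matches whole words only, so 'SMI' is changed but 'SMITH' is left alone, and it is case-sensitive, so 'Spacecraft' at the start of a sentence would need its own entry.

```python
import re

# Illustrative corrections only, taken from the two examples in the post above.
CORRECTIONS = {
    "spacecraft": "Red Army",
    "SMI": "GIU",
}


def apply_corrections(text: str, corrections: dict) -> str:
    """Replace each wrong term with its correction, matching whole words only."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function is used as the replacement so that 'right' is inserted literally.
        text = re.sub(pattern, lambda match, r=right: r, text)
    return text


translated = "The spacecraft advanced; SMI ordered more trucks. SMITH stayed."
print(apply_corrections(translated, CORRECTIONS))
```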

I would welcome any help or tips anyone has.
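
For step 9, pulling one language column out of a glossary spreadsheet can be done by hand as described, or with a few lines of Python if the glossary is exported as CSV. This is a sketch under assumptions: the column headings ('ru', 'en') and the sample rows are invented for illustration, and it assumes the dictionary file wants one term per line, which is what pasting a spreadsheet column into a .txt file gives you.

```python
import csv
import io


def column_to_wordlist(csv_text: str, column: str) -> str:
    """Return the terms from one CSV column, one per line, without duplicates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    terms = []
    for row in reader:
        term = (row.get(column) or "").strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return "\n".join(terms) + "\n"


# Hypothetical glossary export with a heading row.
glossary_csv = "ru,en\nтыл,rear\nжелезная дорога,railway\nтыл,rear services\n"
print(column_to_wordlist(glossary_csv, "ru"))
```

To use it on a real file, read the CSV with open(path, encoding="utf-8", newline="").read() and write the result out as a UTF-8 .txt file.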


Felix C
Member
Posts: 1201
Joined: 04 Jul 2007, 17:25
Location: Miami, Fl

Re: Machine translation

#3

Post by Felix C » 05 Sep 2016, 22:12

I did a few paragraphs of Russian translation by manually keying the text into various free online translators. The results were quite good and nearly matched the same text in English. That is, I used a passage from a book in English and the same passage from the book's Russian translation. You are correct above that the context needs to be understood for the output to make sense.

Appreciate your posts DerAF.

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#4

Post by Der Alte Fritz » 06 Sep 2016, 07:32

What is great about the Abbyy FineReader is the number of document types it is able to handle, from your own scans to all types of image file on your computer, pdf, djvu and other e-book formats. So long as you can get a good enough scan, your text should bear some resemblance to the original translation. At least this lets you identify areas of more interest and then perhaps do a manual translation.

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#5

Post by Der Alte Fritz » 12 Sep 2016, 09:58

I am planning a trip to the British Library soon to look at some Russian and German books. You are allowed to take laptop, notebooks, pencil and camera into the Reading Room so here is how I will deal with languages that I cannot read.

I will take the book and the camera and photograph the page that I want. Then I take out the camera memory card and insert it into the laptop's card reader, open the photo with Abbyy FineReader, scan it and save it as text. Insert the text into Google Translate using the Library Wifi and you have a fair translation in about 2 minutes.

Doing this with the Contents page for instance, I can decide what I need from the book and then with around 900 photos on the camera, I can 'scan' an entire book and transfer it onto the laptop while I get on with another book using a second Memory Card.

Neat eh!
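
Getting the photos off the card and into one tidy folder per book is the part that is easy to script. Below is a minimal sketch in Python; the function name, folder layout and the page_0001.jpg naming are my own assumptions, not anything FineReader requires. It assumes the camera numbers its files in shooting order (IMG_0001.JPG, IMG_0002.JPG and so on), so sorting by name gives page order.

```python
import shutil
from pathlib import Path


def import_photos(card_dir, dest_root, book_name):
    """Copy JPEGs from the card into dest_root/book_name as page_0001.jpg, page_0002.jpg, ..."""
    dest = Path(dest_root) / book_name
    dest.mkdir(parents=True, exist_ok=True)
    # Collect JPEG files anywhere on the card; sorting by path keeps shooting order.
    photos = sorted(
        p for p in Path(card_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    copied = []
    for number, photo in enumerate(photos, start=1):
        target = dest / f"page_{number:04d}{photo.suffix.lower()}"
        shutil.copy2(photo, target)  # copy, so the card keeps the originals
        copied.append(target)
    return copied
```

For example, import_photos("E:/DCIM", "C:/Scans", "book_one") would fill C:/Scans/book_one; both paths and the book name here are placeholders. The whole folder can then be opened in the OCR programme as one job.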

GregSingh
Member
Posts: 3877
Joined: 21 Jun 2012, 02:11
Location: Melbourne, Australia

Re: Machine translation

#6

Post by GregSingh » 12 Sep 2016, 12:20

Wouldn't it be quicker just to connect the camera to the laptop with a USB cable and transfer the photos that way, so you don't have to swap the memory card 500 times?
Have fun anyway!

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#7

Post by Der Alte Fritz » 12 Sep 2016, 12:32

GregSingh
That is a good idea although it might be physically a bit clumsy with the wire and everything. I will have an experiment tonight. :)

Jeff Leach
Host - Archive section
Posts: 1433
Joined: 19 Jan 2010, 10:08
Location: Stockholm, Sweden

Re: Machine translation

#8

Post by Jeff Leach » 28 Sep 2016, 07:49

This is very useful information, but I would like to caution that machine translation only eases the translation of documents and speeds up the process.

Here is a paragraph in German

"Die rum.13.Div. folgt planmäßig um 09:30 Uhr über die Brücken von Skulyany. Der Division wird das Inf.Rgt.6 der rum.14.Div., das bisher der 198.Div. unterstellt war, unterstellt. Die rum.13.Div. bekommt den Auftrag, die Höhen von Buchumyany in Besitz zu nehmen und sich bereitzuhalten, entsprechend dem Vorgehen der 198.Div. rechts rückwärts gestaffelt zu folgen. Hierzu wird um 12:00 Uhr die gesamte Divisions-Artillerie der Division wieder unterstellt. Ein Infanterie-Regiment wird als Korps-Reserve in die Gegend von Blindesti befohlen. Für die Wegnahme der Höhen um Buchumyany wird die Artillerie der rum.14.Div. eingesetzt, weil die eigene Artillerie der Division im Stellungswechsel begriffen und noch nicht feuerbereit ist."

Google translate

"The rum.13.Div. follows on schedule at 09:30 on the bridges of Skulyany. The Division is the Inf.Rgt.6 the rum.14.Div., The date of 198.Div. was subordinated, subordinated. The rum.13.Div. gets the order to take the heights of Buchumyany in possession and to be ready, according to the procedure of 198.Div. right reverse staggered to follow. To this end, the entire division artillery division is assumed again at 12:00. An infantry regiment is commanded as Corps reserve in the area of Blindesti. For the capture of the heights to Buchumyany the artillery of rum.14.Div is. used because our artillery Division conceived the position change and is not yet ready to fire. One of the Romanian 13th Infantry Regiments are to be stationed near Blindesti as part of the corps reserve."

and my quick translation using both the above texts (I am also familiar with the context of the material)

"The Romanian 13th Infantry Division started crossing the bridges at Skulyany on schedule at 09:30. The Romanian 6th Infantry Regiment of the 14th Infantry Division, which had been subordinated to the 198th Infantry Division, was now subordinated to the division. The division was ordered to occupy the heights near Buchumyany and to remain ready there. It was to follow the 198th Infantry Division staggered rearwards off its right flank when it moved out. In order to carry out this mission, control of the divisional artillery will be returned to it at 12:00. This division’s own artillery wasn’t able to deploy forward quickly enough to support the capture of the heights near Buchumyany, so the artillery of the Romanian 14th Infantry Division will support the division during this phase of the operation instead.*

* The Romanian artillery was surprisingly flexible and was probably on par with the German. The Germans could integrate it into their forces and employ it as if it were German artillery. The cooperation was so good that there are examples of a Romanian artillery battalion supporting a German infantry regiment as if it was part of the German divisional artillery. On numerous occasions the Germans complimented the Romanian artillery on the good fire support it gave to their units. If the Romanian artillery had weaknesses, they were in its mobility and logistics (resupply)."


This final text is most likely still poorly written English and will have to be worked on a few more times to improve readability.

I often use Google Translate when translating German documents; it speeds up the process and really eases the mental load. In some cases I also use LEO and Linguee to help with difficult passages. If it went so far that I wanted to quote a passage, I would most certainly ask for a second opinion, usually on this forum.

What you need to be clear about is this: these machine translation tools can make foreign-language documents accessible, but they don't replace the need to be able to read the language yourself. Even a little familiarity with the foreign language will show how poor the machine translations are at times. It can go so far that the machine translation completely misrepresents the text being translated.

In a nutshell: machine translation programs are a godsend but if you work a lot with a foreign language you are still going to need to learn it to a certain degree.

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#9

Post by Der Alte Fritz » 28 Sep 2016, 09:52

I would agree with what you say, but in terms of productivity when translating a book or other large volume of work, it can really help you identify sections of interest and then concentrate your own translation skills on the important passages rather than wasting time on something of no interest. For instance, I have just scanned a 700-page Russian-language book and can now easily identify the parts I need and spend some time getting those passages into a more precise form.

For German I find http://dict.tu-chemnitz.de/ quite good for specialised words.
For Russian the new Lingvo is good, see: https://www.lingvolive.com as is this dictionary http://dic.academic.ru/.

I use this online Russian keyboard to type out cyrillic: http://russian.typeit.org/

One trick is to save your document from FineReader in html format. Then open it in your browser and use Google Translation app to translate it straight from the webpage. Then if you scroll over a section of the document it shows you the original text which you can copy and translate more accurately.

Felix C
Member
Posts: 1201
Joined: 04 Jul 2007, 17:25
Location: Miami, Fl

Re: Machine translation

#10

Post by Felix C » 28 Sep 2016, 18:44

I would love something to OCR scan and then translate Russian. Fritz, are you working with Russian?

Tempted to use the Abbyy software above.

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#11

Post by Der Alte Fritz » 29 Sep 2016, 03:07

I am working in Russian at the moment with a range of books, articles and original documents. Most of them are about logistics, so motor vehicles, railways and Rear troops. What are your interests?

Jeff Leach
Host - Archive section
Posts: 1433
Joined: 19 Jan 2010, 10:08
Location: Stockholm, Sweden

Re: Machine translation

#12

Post by Jeff Leach » 29 Sep 2016, 07:36

Der Alte Fritz wrote: I would agree with what you say, but in terms of productivity when translating a book or other large volume of work, it can really help you identify sections of interest and then concentrate your own translation skills on the important passages rather than wasting time on something of no interest. For instance, I have just scanned a 700-page Russian-language book and can now easily identify the parts I need and spend some time getting those passages into a more precise form.
I work for the most part with primary source material (much of it handwritten), so scanning is less of a time saver than when working with secondary sources. With most of the Soviet source material already scanned, it might be worth trying to run it through an OCR program to see what happens.
Der Alte Fritz wrote: For German I find http://dict.tu-chemnitz.de/ quite good for specialised words.
For Russian the new Lingvo is good, see: https://www.lingvolive.com as is this dictionary http://dic.academic.ru/.
Der Alte Fritz wrote: I use this online Russian keyboard to type out cyrillic: http://russian.typeit.org/
An alternative is Lexilogos, which lets you use your own keyboard; this is much faster if you can touch-type. I find this can lead to more misspellings than I like and have thought about purchasing a Cyrillic keyboard (there are also stickers you can put on the keyboard you are using).
Der Alte Fritz wrote: One trick is to save your document from FineReader in html format. Then open it in your browser and use Google Translation app to translate it straight from the webpage. Then if you scroll over a section of the document it shows you the original text which you can copy and translate more accurately.
Good idea.

The main point I was trying to make was that machine translating can be a helpful tool but it still doesn't replace the need to learn a language to a certain degree.

Der Alte Fritz
Member
Posts: 2171
Joined: 13 Dec 2007, 22:43
Location: Kent United Kingdom

Re: Machine translation

#13

Post by Der Alte Fritz » 29 Sep 2016, 11:36

The main point I was trying to make was that machine translating can be a helpful tool but it still doesn't replace the need to learn a language to a certain degree.
This is a good point to make, as machine translation can be pretty poor the further apart the two languages are. Between French and English it is good, as they share many of the same words, the same sentence structure, etc. German is less so as it has less of a Romance language base, and sentence structure is often a problem as the approach is radically different in German. Russian is even worse as it lacks many of these common features. Languages with little in common such as Hungarian must really be a challenge.

This is why it is important to share information about best practice and which translation programmes work best with which language combinations. Similarly there are always problems with typewritten or handwritten documents which can be very laborious to translate. As you say nothing beats being able to learn a language.

Felix C
Member
Posts: 1201
Joined: 04 Jul 2007, 17:25
Location: Miami, Fl

Re: Machine translation

#14

Post by Felix C » 30 Sep 2016, 13:05

Der Alte Fritz wrote: I am working in Russian at the moment with a range of books, articles and original documents. Most of them are about logistics, so motor vehicles, railways and Rear troops. What are your interests?
Russian WW1 and Russian Civil War naval operations. For example, there is Tovarisch - Russian Submarine Operations of the First World War, which I aim to be able to read through. Then there is Kaspii god 1920, a White Russian naval memoir of the Caspian, etc.

I have downloaded some books in PDF and would like to run them through OCR and read them. As mentioned, I typed several paragraphs into a translator and the results were good, but I have two issues: 1. Books are very time-consuming to type in manually in their entirety. 2. There is no standard for what a Cyrillic keyboard is; that is, the different translators use different symbol placement. I can learn one quite well, but then my finger muscle memory becomes confused when using another. I typically use different types of translators for long or complex text. I end up chicken-pecking at the keyboard instead of entering characters fairly quickly, as I can when my fingers are trained to the keyboard.
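
One way around the differing on-screen layouts is to type phonetically in Latin letters with a single scheme and convert the result to Cyrillic before pasting it into whichever translator you use. The following is only a sketch in Python; the letter scheme is my own simplified assumption (it is not the layout of any particular site), and it has known gaps, for example it cannot tell т+с apart from ц, so 'ts' always becomes ц.

```python
# Simplified, hypothetical Latin-to-Cyrillic scheme; adjust it to whatever you find natural.
TRANSLIT = {
    "shch": "щ", "zh": "ж", "kh": "х", "ts": "ц", "ch": "ч", "sh": "ш",
    "yu": "ю", "ya": "я", "yo": "ё", "eh": "э",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е", "z": "з",
    "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "y": "ы",
    "'": "ь", '"': "ъ",
}


def to_cyrillic(latin: str) -> str:
    """Convert phonetic Latin typing to Cyrillic, trying the longest key first."""
    keys = sorted(TRANSLIT, key=len, reverse=True)
    out = []
    i = 0
    while i < len(latin):
        for key in keys:
            chunk = latin[i:i + len(key)]
            if chunk.lower() == key:
                cyr = TRANSLIT[key]
                # Keep a capital letter if the typed chunk started with one.
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += len(key)
                break
        else:
            # Digits, spaces and punctuation pass through unchanged.
            out.append(latin[i])
            i += 1
    return "".join(out)


print(to_cyrillic("Tovarishch"))
print(to_cyrillic("Kaspij god 1920"))
```

The advantage is that your fingers only ever learn one scheme, on your normal keyboard, whichever translator the text ends up in.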


Last edited by Felix C on 30 Sep 2016, 18:59, edited 1 time in total.

Felix C
Member
Posts: 1201
Joined: 04 Jul 2007, 17:25
Location: Miami, Fl

Re: Machine translation

#15

Post by Felix C » 30 Sep 2016, 13:14

Der Alte Fritz wrote:
The main point I was trying to make was that machine translating can be a helpful tool but it still doesn't replace the need to learn a language to a certain degree.
This is a good point to make, as machine translation can be pretty poor the further apart the two languages are. Between French and English it is good, as they share many of the same words, the same sentence structure, etc. German is less so as it has less of a Romance language base, and sentence structure is often a problem as the approach is radically different in German. Russian is even worse as it lacks many of these common features. Languages with little in common such as Hungarian must really be a challenge.

This is why it is important to share information about best practice and which translation programmes work best with which language combinations. Similarly there are always problems with typewritten or handwritten documents which can be very laborious to translate. As you say nothing beats being able to learn a language.
To add to the above

Another issue is scanners. I recall there were handheld scanners. They looked like a T-shaped mouse with a head about 6" wide, for a typical 6" wide book. These were 300 DPI back then and plugged into a port on the PC or laptop. I now see scanning pens with limited memory, but not what I recall from the Windows 95 era. I searched and cannot find any.

I have a flatbed at home, but it is a bit difficult to scan if the binding is stiff, as many older books are rebound in library buckram. If the scan is not clear then of course the OCR will not recognize the text, and that can be an issue with columns of text warped by stiff bindings.
