Abbyy Finereader trained for long s

From DPWiki

Introduction

While the program Guiprep has a specific option to convert texts containing the long s character after OCR, it is also possible to run Abbyy Finereader in 'training mode' so that it recognises the various long s characters and replaces them with the letter 's' in the OCR output. Once the 'training' has been done, the resulting pattern can be saved to a file (with a .FBT extension), which can then be loaded and used again.

I have been through two old printed texts, one printed in London (1738) and the other printed in the U.S.A. (1799). Content providers are more than welcome to try my Training File (Dropbox link).

I use Abbyy Finereader 12 Pro. I am assuming users with later versions will still be able to use my training file; otherwise the whole exercise is a bit limited, as it would only be useful for those with FR12 (this needs to be checked with other users).

How to configure Abbyy Finereader for my user pattern

Once you have downloaded my Training File (Dropbox link), in Abbyy Finereader choose TOOLS on the menu bar, then Options. Under the "User patterns and languages" heading click the "Load from File..." button and select my training file (.FBT extension). To check it is active, click the "Pattern Editor..." button; it should show "long_s(1730-1820)_v1(active)". The v1 stands for version 1, as I may update it some time in the future.

Then, under the "Training" heading, select "Use built-in and users patterns". Keep the "Read with training" choice unticked unless you want to add further to the training pattern.

Examples of characters trained in Abbyy Finereader from the Caslon Font

FR user patterns.jpg

Examples from two texts showing the OCR output

I trained Finereader on two texts, one from the USA and the other from England: A Treatise On The Plague And Yellow Fever, by James Tytler, printed in Salem (1799), and The British Librarian, printed in London (1738).

Here are some typical examples of the font style and the OCR output produced by my trained version of Finereader. Note there are still problems, which I discuss below along with possible ways to deal with them.

Long s font examples.jpg

Make sure your scans for OCR are 600 dpi

If they are not, Abbyy Finereader will give the following warning: "Selected user pattern has been training at a different image resolution and may not work for this image"

I use IrfanView's batch conversion to upscale images to 600 dpi if required. Note there is a box within the IrfanView options to set the DPI to a particular value.

Things to do after you have run the OCR

I still recommend running the resulting text through Guiprep.

I could not get my Abbyy Finereader to accept the æ and œ characters into its trained character set, so instead you will see [ae] and [oe] in the OCR output. (I figured that if the default character set needs changing somehow, then others, certainly those with FR12, would have the same issue.) It is, however, a simple task to search and replace my [ae] and [oe] with æ and œ afterwards.
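For anyone who prefers to script this step rather than use an editor's search and replace, here is a minimal sketch in Python. The function name restore_ligatures is only an illustration, and the sketch assumes the OCR text has already been read into a string; it covers only the two lowercase placeholders mentioned above.

```python
# Placeholders produced by the training file, mapped to the intended ligatures.
LIGATURES = {"[ae]": "æ", "[oe]": "œ"}

def restore_ligatures(text):
    """Replace the [ae] and [oe] placeholders with the æ and œ characters."""
    for placeholder, ligature in LIGATURES.items():
        text = text.replace(placeholder, ligature)
    return text
```

For example, restore_ligatures("C[ae]sar") gives Cæsar. When reading and writing the text files, open them with a Unicode encoding such as UTF-8 so that the ligatures are saved correctly.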

For some unknown reason my Abbyy Finereader 12 puts a capital S in place of a small s in the OCR output for some of the recognised long s characters, so you will find perSon and likewiSe. There are also instances where, try as I might (see below), Abbyy Finereader puts a capital S where I have trained it to recognise the letter f, so you may find Srom instead of from, and aSter instead of after.

To deal with these I recommend the following:-

  • First, look down all the words with a capital S, decide whether they should have the letter f instead, and search and replace them. (I use Guiprep.)
  • Secondly, you can use the following regular expression to replace every capital letter S within a word with the small letter s: replace ([a-z])S([a-z]) with $1s$2
  • Thirdly, make the common replacements I have found necessary on the two test texts, such as replacing the word os with of.
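
The three steps above can also be sketched in Python, as an alternative to doing them in Guiprep. This is only an illustration: the function names are my own invention, and the first function merely lists candidate words, since deciding which capital S should really be an f still has to be done by eye.

```python
import re

# The pattern from the list above: a capital S with a lowercase letter on each side.
INNER_CAPITAL_S = re.compile(r"([a-z])S([a-z])")

def words_with_capital_s(text):
    """Return the distinct words containing a capital S, sorted, for manual review."""
    return sorted(set(re.findall(r"[A-Za-z]*S[A-Za-z]*", text)))

def lowercase_inner_s(text):
    """Replace a capital S between two lowercase letters with a small s."""
    # Python writes the replacement groups as \1 and \2 rather than $1 and $2.
    return INNER_CAPITAL_S.sub(r"\1s\2", text)

def replace_os_with_of(text):
    """Replace the standalone word 'os' with 'of'."""
    return re.sub(r"\bos\b", "of", text)
```

The order matters: make the f corrections first, otherwise aSter would be turned into aster rather than after. Note also that the pattern as given leaves a doubled capital S alone (poSSeSs becomes poSSess), because each S needs a lowercase letter on both sides, so such words still need checking by hand.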

Some comments on strange indeterminate problems I had with training Abbyy Finereader

As an example, I trained the long-s-looking letter f in the word from to be recognised correctly as the letter f, and it is recognised correctly when I OCR just the single word, but... if I then OCR the complete page, the word changes to Srom.

Similarly, all the long s characters are trained only as the small letter s, but for some unexpected reason they appear as a capital letter S in the OCR output.

Feedback and comments welcome

Do let me (EricHutton) know if you have tried my training file, particularly which version of Abbyy Finereader you have and whether or not it works with that version. (I think the .FBT file extension has only been used for user training patterns from FR11 onwards, but I am not sure.)

Also let me know if you think I have missed any characters or ligatures in my training, as updated versions of the training pattern are certainly possible.