
C# Tesseract OCR Alternative

Photo by Shahadat Rahman

Originally Posted On: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-tesseract-ocr/

 

Comparing Iron OCR to Tesseract for C# and .NET Software Projects

Tesseract is an excellent academic OCR library, available to developers free of charge for almost all use cases. So why would we use Iron OCR over Tesseract, particularly as Iron OCR itself implements Tesseract? The simple answer is that Iron OCR does use Tesseract, but that is not all it uses.

For reference, a typical Tesseract call from C#, here through the tessnet2 wrapper, looks like this:

    using System;
    using System.Collections.Generic;
    using System.Drawing;
    using tessnet2;

    string path = @"C:\pic\mytext.jpg";
    Bitmap image = new Bitmap(path);
    Tesseract ocr = new Tesseract();
    ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // Only if the text is digits only
    ocr.Init(@"C:\tessdata", "eng", false); // Point to the correct tessdata folder
    List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
    // Print each recognized word with its confidence score
    foreach (tessnet2.Word word in result)
        Console.WriteLine("{0} : {1}", word.Confidence, word.Text);


Installation

You will note that when working with Tesseract, you are working with a C++ library, which is not a lot of fun in .NET. It requires us to choose the bitness of our application, meaning that a given build can be deployed only to 32-bit or only to 64-bit targets.

With Iron OCR, installation happens entirely through the NuGet Package Manager, and you do not need to choose a bitness. The entire C++ layer is managed for you, and there are no extra DLLs to install. Everything is automatic.

Real World Accuracy

Tesseract was designed for perfect documents, where a machine renders high-resolution text to a screen and then reads it back. That is what Tesseract is good at: reading perfect documents.

The problem is that in the real world, that is not what we have. If Tesseract encounters an image that is rotated, skewed, low-DPI, scanned, or affected by background noise, it becomes almost impossible for Tesseract to get data from that image. Tesseract will also take a very long time to process such a document before giving you back nonsense.

In the example below, a simple document that is very easy to read by eye cannot be read well by Tesseract. However, the code example and output below show that Iron OCR is significantly more appropriate for real-world use cases.

The Truth of Using Tesseract

Tesseract is a library for reading straight, perfect text in standardized typefaces. To use Tesseract on scanned or photographed documents, where the images are not digitally perfect like screenshots, we need to perform image preprocessing. This is normally done with Photoshop batch scripts or advanced ImageMagick usage.

Generally, this needs to be developed on a case-by-case basis for each type of document you are trying to deal with, and it can take weeks of development.

The key selling point of Iron OCR is that it takes all of this away. Iron OCR has simple variables that you can set to automatically detect and preprocess all of your images, so that you get your text out without weeks of development for specific image use cases.
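
To give a feel for what one such preprocessing step involves, here is a minimal C# sketch of binarization: turning a grayscale image into pure black and white before handing it to an OCR engine. It is an illustration only. The `Preprocess` class and `Binarize` method are hypothetical names, not part of Tesseract, tessnet2, or Iron OCR, and the mean-intensity threshold is a deliberately simple stand-in for the more careful techniques a real pipeline would probably need.

```csharp
using System;

// Hypothetical helper for illustration; not part of any OCR library.
public static class Preprocess
{
    // Converts 8-bit grayscale pixels to pure black (0) or white (255),
    // using the mean intensity of the whole image as a global threshold.
    public static byte[] Binarize(byte[] gray)
    {
        if (gray == null) throw new ArgumentNullException(nameof(gray));
        if (gray.Length == 0) return Array.Empty<byte>();

        // Compute the mean intensity across all pixels.
        long sum = 0;
        foreach (byte pixel in gray) sum += pixel;
        double threshold = (double)sum / gray.Length;

        // Pixels brighter than the mean become white, the rest black.
        var output = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            output[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return output;
    }
}
```

Calling `Preprocess.Binarize(new byte[] { 30, 40, 200, 220 })` uses a mean of 122.5 as the threshold and returns `{ 0, 0, 255, 255 }`. Even this small step assumes evenly lit input; rotation, skew, low resolution, and noise would each need their own handling, which is where the development time goes.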

Fault Tolerance

In addition, Iron OCR has an excellent error model: it gives very specific information if a fault occurs during an OCR process, so that you know exactly what has gone wrong and can correct it, rather than being left with a generic or null error.

In conclusion, Tesseract is an excellent resource for developers, but it is not a complete OCR library when dealing with scanned or photographed images because these images need to be processed so as to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can accurately work with them.

In contrast, Iron OCR can do this automatically in a single line of code. We think it is worth it.

Sample Project

We are currently working on a sample project to demonstrate the differences between Iron OCR and Tesseract for C#, which will be posted as a download and also shared on GitHub.

For now, you can download the binaries and make a comparison between Tesseract and Iron OCR yourself.
