Tesseract OCR in C# using IronOCR
Photo by Javier Quesada
Originally Posted On: Tesseract OCR with C# .NET | Iron OCR (ironsoftware.com)
How to use Tesseract OCR in C# – Summary
Tesseract is an excellent academic OCR library, available to developers free of charge for almost all use cases.
C# developers are fortunate to have one of the fastest and most accurate Tesseract libraries available.
IronOCR extends Google Tesseract with IronTesseract – a native C# OCR library with improved stability and higher accuracy than the free Tesseract library.
This article explains why .NET developers should strongly consider using IronOCR's IronTesseract over vanilla Tesseract.
Code Example for .NET Tesseract Usage
- // PM > Install-Package IronOcr
- // using IronOcr;
- var Ocr = new IronTesseract();
- // Hundreds of languages available
- Ocr.Language = OcrLanguage.English;
- using (var Input = new OcrInput())
- {
- Input.Add(@"imgexample.tiff");
- // Input.DeNoise(); optional
- // Input.Deskew(); optional
- IronOcr.OcrResult Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- // Explore the OcrResult using IntelliSense
- }
Installation
Google Tesseract with .NET
When working with Tesseract, most of us are working with a C++ library.
Interop is not a lot of fun in .NET, and it has poor cross-platform and Azure compatibility. It requires us to choose the bitness of our application, meaning that we may only deploy to either 32-bit or 64-bit targets.
We may need to ensure that Visual C++ runtimes are installed, and even compile Tesseract ourselves to get the latest version. Free C# wrappers for these may be years behind the latest release.
We also have to find, download and manage C++ DLLs and EXEs we may not understand, and deploy them in environments where permissions may not allow them to run.
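To make the bitness constraint above concrete, here is a minimal sketch that reports whether the current process is 32-bit or 64-bit, which is what determines the native DLL build an interop wrapper can load. It uses only standard .NET APIs. The `BitnessCheck` class and the "x64"/"x86" folder names are assumptions chosen for illustration (a common convention, not something this article or any particular wrapper prescribes).

```csharp
using System;

public static class BitnessCheck
{
    // Pointer width of the running process: 32 or 64.
    public static int ProcessBits => IntPtr.Size * 8;

    // Hypothetical helper: the subfolder a wrapper might probe for its native DLLs.
    // The folder names are an assumed convention and ignore ARM targets.
    public static string NativeFolder() => Environment.Is64BitProcess ? "x64" : "x86";

    // One-line summary suitable for a startup log.
    public static string Describe() =>
        "process: " + ProcessBits + "-bit, OS: " +
        (Environment.Is64BitOperatingSystem ? "64" : "32") + "-bit, native folder: " + NativeFolder();
}
```

Calling `Console.WriteLine(BitnessCheck.Describe());` at startup is a quick way to confirm which native binaries a deployment actually needs.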
Iron Tesseract for C#
With IronOCR, installation happens entirely using the NuGet Package Manager.
PM > Install-Package IronOcr
There are no native DLLs or EXEs to install. Everything is handled by a single .NET component library.
The entire API is native .NET, exposed through a simple C# interface.
It supports:
- .NET Framework 4.5 and above
- .NET Standard 2.0 and above (including 3.x & .NET 5 Beta)
- .NET Core 2.0 and above (including 3.x & .NET 5 Beta)
Up To Date & Maintained
Google Tesseract with C#
The latest builds of Tesseract 5 have never been designed to compile on Windows.
Installing Tesseract 5 for C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. As of today, the MinGW cross-compile chain does not successfully produce Windows interop binaries.
In addition, free C# API wrappers on GitHub may be years behind or incompatible.
Iron Tesseract for .NET
Runs Tesseract 5 (as well as 4 and 3) out of the box on Windows, macOS, Linux, Azure, AWS, Lambda, Mono and Xamarin Mac with little or no configuration. No native binaries to manage. Framework and Core compatible.
There is little else to say other than it has been done right.
Tesseract 5 API in Iron Tesseract
To date, IronTesseract is the only known implementation of Tesseract 5 for .NET Framework or Core.
- // using IronOcr;
- var Ocr = new IronTesseract(); // nothing to configure
- using (var Input = new OcrInput(@"imagesimage.png"))
- {
- var Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- }
Tesseract 4 API in Iron Tesseract
- // using IronOcr;
- var Ocr = new IronTesseract();
- Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract4;
- using (var Input = new OcrInput(@"imagesimage.png"))
- {
- var Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- }
Tesseract 3 API in Iron Tesseract
- // using IronOcr;
- var Ocr = new IronTesseract();
- //Legacy Support
- Ocr.Configuration.EngineMode = TesseractEngineMode.TesseractOnly;
- Ocr.Language = OcrLanguage.English;
- // Does not support any Dictionaries marked Best or Fast
- using (var Input = new OcrInput(@"imagesimage.png"))
- {
- OcrResult Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- }
Accuracy
Google Tesseract in .NET Projects
Tesseract as a library was designed for perfect documents where a machine printed out high-resolution text to a screen and then read it. That is what Tesseract is good at: reading perfect documents.
The problem is that in the real world, that is not what we have. If Tesseract encounters an image which is rotated, skewed, is of a low DPI, scanned, or has background noise, it becomes almost impossible for Tesseract to get data from that image. In addition, Tesseract will also take a very long time to process that document before giving you back nonsense information.
A simple document that is very easy for a person to read may still be read poorly by Tesseract.
Tesseract is a free library optimal for reading straight and perfect text of standardized typefaces.
To use Tesseract with scanned or photographed documents, where the images are not digitally perfect the way screenshots are, we need to perform image preprocessing. This is normally done with Photoshop batch scripts or advanced ImageMagick usage.
Generally, this needs to be developed on a case-by-case basis for each type of document you are trying to deal with, and it can take weeks of development.
Iron Tesseract in .NET Projects
IronOCR takes this headache away. Users often achieve 99.8-100% accuracy with minimal configuration.
- // PM > Install-Package IronOcr
- // using IronOcr;
- var Ocr = new IronTesseract();
- using (var Input = new OcrInput())
- {
- Input.Add(@"imgexample.tiff");
- Input.DeNoise(); // fixes digital noise
- Input.Deskew(); // fixes rotation and perspective
- // there are dozens more filters, but most users won't need them
- IronOcr.OcrResult Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- }
Image Compatibility
Google Tesseract in .NET
Tesseract only accepts the Leptonica PIX image format, which surfaces in C# as an IntPtr to a C++ object. PIX objects are not managed memory, and failure to handle them with care in C# results in memory leaks.
Leptonica has good general image compatibility but throws many console warnings and errors. There are known issues with TIFF files and limited support for PDF OCR.
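The unmanaged-memory problem described above is usually handled in .NET by wrapping the raw pointer in a `SafeHandle`, so that the native resource is released deterministically on `Dispose`. The sketch below shows the pattern only: `NativeBufferHandle` is a hypothetical class, and it allocates with `Marshal.AllocHGlobal` as a stand-in for a real PIX pointer. A real wrapper would instead call the native library's own destroy function inside `ReleaseHandle`.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a native image handle such as a Leptonica PIX pointer.
// The memory here comes from Marshal.AllocHGlobal purely for illustration.
public sealed class NativeBufferHandle : SafeHandle
{
    public NativeBufferHandle(int byteCount)
        : base(IntPtr.Zero, true) // IntPtr.Zero marks an invalid handle; we own the handle
    {
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    // The runtime skips ReleaseHandle when the handle was never set.
    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called once by the runtime, on Dispose or finalization.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

Used inside a `using` block, for example `using (var buffer = new NativeBufferHandle(1024)) { ... }`, the memory is freed when the block exits, even if an exception is thrown.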
Iron Tesseract for .NET
Images are memory managed. PDF & TIFF supported. System.Drawing, Stream and Byte Array included for every file format.
Broad image support:
- PDF Documents
- PDF Pages
- MultiFrame TIFF files
- JPEG & JPEG2000
- GIF
- PNG
- BMP
- WBMP
- System.Drawing.Image
- System.Drawing.Bitmap
- System.IO.Streams of images
- Binary image Data (byte[])
- And more that I don’t have space to list.
OCR Image Compatibility Code Example
- var Ocr = new IronTesseract();
- using (var input = new OcrInput())
- {
- input.AddPdf("example.pdf","password");
- input.AddMultiFrameTiff("multi-frame.tiff");
- input.AddImage("image1.png");
- input.AddImage("image2.jpeg");
- //… many more
- var Result = Ocr.Read(input);
- Console.WriteLine(Result.Text);
- }
Performance
Free Google Tesseract
Google Tesseract can produce fast and accurate results if it is properly tuned and the input images have been preprocessed using Photoshop or ImageMagick.
You will notice that most Tesseract examples online are actually from high-resolution screenshots with no digital noise, in fonts that Tesseract has been designed to work well with.
Tesseract's own documentation states that input images should be sampled at 300 DPI or higher for OCR to be effective.
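To put the 300 DPI guideline in concrete terms, the arithmetic can be sketched as below. The `DpiMath` class is a hypothetical helper written for this illustration: it converts a physical page size into the pixel dimensions a scan needs, and estimates the effective DPI of an existing image when the physical width it covers is known.

```csharp
using System;

public static class DpiMath
{
    public const double MillimetersPerInch = 25.4;

    // Pixels needed along one edge to capture a physical length at a given DPI.
    public static int PixelsFor(double millimeters, double dpi) =>
        (int)Math.Round(millimeters / MillimetersPerInch * dpi);

    // Effective DPI of a scan, given its pixel width and the physical width it covers.
    public static double EffectiveDpi(int pixels, double millimeters) =>
        pixels / (millimeters / MillimetersPerInch);
}
```

For an A4 page (210 mm by 297 mm), `DpiMath.PixelsFor(210, 300)` and `DpiMath.PixelsFor(297, 300)` give 2480 by 3508 pixels. A phone photo of the same page that is only 1240 pixels wide works out to roughly 150 DPI, half of the recommended sampling rate.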
Iron Tesseract Library
The IronOcr C# Tesseract Library works accurately and at speed for most images out of the box. We have implemented multithreading to make use of the multi-core processors that most machines now use.
Even low-resolution images generally work with a high degree of accuracy. No Photoshop required.
Developers often achieve over 99% accuracy with little configuration, which matches current Machine Learning web APIs without the ongoing costs, security risks and bandwidth issues.
Speeds are fast, but can be improved with a little coding.
Performance Tuning Example
- var Ocr = new IronTesseract();
- // Configure for speed. 35% faster and only 0.2% loss of accuracy
- Ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|@¢©«»°±·×‑–—‘’“”•…′″€←↑→↓⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☒♡✓✰";
- Ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
- Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
- Ocr.Configuration.EngineMode = TesseractEngineMode.LstmOnly;
- Ocr.Configuration.ReadBarCodes = false;
- Ocr.Language = OcrLanguage.EnglishFast;
- using (var Input = new OcrInput(@"imgPotter.tiff"))
- {
- var Result = Ocr.Read(Input);
- Console.WriteLine(Result.Text);
- Utils.Accuracy.Compare(Result, "txt/Potter.txt");
- }
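The `Utils.Accuracy.Compare` call in the example above is not defined in this article. One common way to measure OCR accuracy against a known-good transcript is character-level edit distance, sketched below. This is an assumed approach for illustration, not the helper the example actually uses, and the `OcrAccuracy` class and its method names are hypothetical.

```csharp
using System;

public static class OcrAccuracy
{
    // Levenshtein edit distance using two rolling rows of the DP table.
    public static int EditDistance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // Swap rows so "previous" always holds the row just computed.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Character accuracy in [0, 1]: one minus the error rate, clamped at zero.
    public static double CharacterAccuracy(string ocrText, string referenceText)
    {
        if (referenceText.Length == 0) return ocrText.Length == 0 ? 1.0 : 0.0;
        double errorRate = (double)EditDistance(ocrText, referenceText) / referenceText.Length;
        return Math.Max(0.0, 1.0 - errorRate);
    }
}
```

With a reference transcript loaded from disk, `OcrAccuracy.CharacterAccuracy(Result.Text, File.ReadAllText("txt/Potter.txt"))` would give a single figure that can be compared before and after a tuning change.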
API
Google Tesseract OCR in .NET
We have two free choices:
- Work with Interop layers – Many of those found on GitHub are out of date and have unresolved tickets, memory leaks and console warnings. They may not support .NET Core or Standard.
- Work with the command line EXE – Hard to deploy and constantly interrupted by virus scanners and security policies.
Neither of the above may work well in Web Applications, Azure, Mono, Xamarin, Linux, Docker or Mac.
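For reference, the command-line option looks roughly like the sketch below. It assumes a `tesseract` executable is already installed and on the PATH, and uses the standard `tesseract <image> stdout -l <lang>` invocation to capture recognized text from standard output. The `TesseractCli` class is a hypothetical wrapper written for this illustration.

```csharp
using System;
using System.Diagnostics;

public static class TesseractCli
{
    // Builds the argument string for: tesseract "<image>" stdout -l <language>
    public static string BuildArguments(string imagePath, string language = "eng") =>
        "\"" + imagePath + "\" stdout -l " + language;

    // Runs the external executable and returns whatever it writes to standard output.
    // Assumes "tesseract" resolves on the PATH unless exePath is supplied.
    public static string Run(string imagePath, string language = "eng", string exePath = "tesseract")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(imagePath, language),
            RedirectStandardOutput = true,
            UseShellExecute = false, // required for output redirection
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

Every deployment target then needs the executable, its language files and permission to launch child processes, which is where the deployment friction described above comes from.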
Iron Tesseract OCR Library for .NET
A managed and tested .NET Library for Tesseract called IronTesseract.
Fully documented with IntelliSense support.
Simplest Hello World for Tesseract in .NET
- var Text = new IronOcr.IronTesseract().Read("img.png").Text;
It is actively developed and supported by professional software engineers with a median experience level of over 20 years.
Compatibility
Google Tesseract + Interop for .NET
May be made to work on most platforms if you are willing to find dependencies, build from source or update a free C# interop wrapper. These resources may not be fully compatible with .NET Core or .NET Standard projects.
At present, we have not encountered any logical and simple way to install LibTesseract5 for Windows safely without IronTesseract.
Iron Tesseract .NET OCR Library
Unit Tested with CI, and has everything you need to run on:
- Desktop Applications
- Console Apps
- Server Processes
- Web Applications & MVC
- JetBrains Rider
- Xamarin Mac
On:
- Windows
- Azure
- Linux
- Docker
- Mac
- BSD and FreeBSD
.NET Support for:
- .NET Framework 4.5 and above
- .NET Core – All active versions above 2.0
- .NET Standard – All active versions above 2.0
- Mono
- Xamarin Mac
Language Support
Google Tesseract
Tesseract dictionaries are managed as files and must be cloned from https://github.com/tesseract-ocr/tessdata. This is about 4 GB.
Some Linux distros have some help to manage Tesseract dictionaries via apt-get.
Exact folder structures must be maintained or Tesseract fails.
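Because a missing or misplaced dictionary file only shows up as a failure at run time, it can help to verify the folder before starting OCR. The sketch below assumes the usual Tesseract layout of one `<language>.traineddata` file per language inside a single data folder. The `TessdataCheck` class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TessdataCheck
{
    // Expected path of a language file, e.g. tessdata/eng.traineddata.
    public static string TrainedDataPath(string tessdataFolder, string languageCode) =>
        Path.Combine(tessdataFolder, languageCode + ".traineddata");

    // True when the language file exists in the given folder.
    public static bool HasLanguage(string tessdataFolder, string languageCode) =>
        File.Exists(TrainedDataPath(tessdataFolder, languageCode));

    // Returns the language codes whose files are missing, in the order requested.
    public static List<string> MissingLanguages(string tessdataFolder, params string[] languageCodes)
    {
        var missing = new List<string>();
        foreach (var code in languageCodes)
            if (!HasLanguage(tessdataFolder, code)) missing.Add(code);
        return missing;
    }
}
```

Calling `TessdataCheck.MissingLanguages("tessdata", "eng", "ara")` at startup and failing fast on a non-empty result gives a clearer error than a failure deep inside the native library.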
Iron Tesseract
Supports more languages than github.com/tesseract-ocr/tessdata, and they are all managed as NuGet packages or easily installable downloads.
Unicode Language Example
- // using IronOcr;
- // PM > Install-Package IronOcr.Languages.Arabic
- var Ocr = new IronTesseract();
- Ocr.Language = OcrLanguage.Arabic;
- using (var input = new OcrInput())
- {
- input.AddImage("img/arabic.gif");
- // Add image filters if needed
- // In this case, even though the input is very low quality
- // IronTesseract can read what conventional Tesseract cannot.
- var Result = Ocr.Read(input);
- // Console can’t print Arabic on Windows easily.
- // Let’s save to disk instead.
- Result.SaveAsTextFile("arabic.txt");
- }
Multiple Language Example
It is also possible to OCR using multiple languages at the same time. This can really help to capture English-language metadata and URLs in Unicode documents.
- // using IronOcr;
- // PM > Install-Package IronOcr.Languages.ChineseSimplified
- var Ocr = new IronTesseract();
- Ocr.Language = OcrLanguage.ChineseSimplified;
- Ocr.AddSecondaryLanguage(OcrLanguage.English);
- // We can add any number of languages
- using (var input = new OcrInput())
- {
- input.Add("multi-language.pdf");
- var Result = Ocr.Read(input);
- Result.SaveAsTextFile("results.txt");
- }
What Else
Iron Tesseract has additional features for .NET software developers.
- Automatic image analysis to configure Tesseract for common errors
- Image to Searchable PDF Conversion
- PDF OCR
- Can make any PDF searchable and indexable on search engines
- OCR to HTML output
- TIFF to PDF conversion
- Barcode Reading
- QR Code Reading
- Multithreading
- An advanced OcrResult Class that allows inspection of: Blocks, Paragraphs, Lines, Words, Characters, Fonts and OCR statistics.
Conclusion
Google Tesseract for C# OCR
This is the right library to use for free & academic projects in C#.
Tesseract is an excellent resource for C++ developers, but it is not a complete OCR library for .NET.
This is especially true when dealing with scanned or photographed images, because these images need to be processed so as to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can accurately work with them.
Iron Tesseract OCR Library for .NET Framework & Core
In contrast, IronOCR can do this and more in a single line of code.
It is true: IronOCR uses Tesseract for its internal OCR engine.
It is a very finely tuned Tesseract build for C#, with a lot of performance improvements and features added as standard.
It is the right choice for any project where developer time is valuable. When was the last time you found a .NET software engineer with weeks of time on their hands?
Get Started on your C# Tesseract Project
PM > Install-Package IronOcr
Or you can download the Iron Tesseract .NET DLL and install it manually.
Any .NET coder should be able to get started with Iron Tesseract OCR in 5 minutes using examples on this page.