Accuracy of Optical Character Recognition Software Google Tesseract

Joshua A. Suitter, University of Southern Maine

Date

Spring 4-2015

Document Type

Poster Session

Department

Engineering

Advisor

Mariusz Jankowski

Keywords

Tesseract, OCR

Abstract

Tesseract is an open-source OCR (Optical Character Recognition) software engine originally developed by HP between 1985 and 1995; it is now sponsored by Google (Google Tesseract). While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance, that is, its ability to correctly recognize characters in a scan or image. During my research I found that certain fonts are recognized more reliably than others, and that font size, spacing, and image quality all play a role in how Tesseract performs. In this project, I also look into Wolfram Mathematica’s built-in Tesseract-based function, TextRecognize. This project shows how different fonts, font sizes, image quality, and tilting of an image affect Tesseract’s recognition accuracy. In the first part of the project, I tested fonts and font sizes using Tesseract directly. I did the error calculations by eye, looking for words that came back incorrectly in the output text file. I used Mathematica’s version so that I could automate the error calculation and obtain a more accurate result. I found that both the original Tesseract program and Mathematica’s built-in version are very accurate, especially on higher-quality images.
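
To illustrate how a by-eye count of incorrectly returned words might be automated, here is a minimal Python sketch. It is not the method used in the poster, which does not state its error formula; it assumes a word-level edit distance (substitutions, insertions, and deletions) between a known reference text and the OCR output. The function names `word_errors` and `word_accuracy` are hypothetical and chosen only for this example.

```python
def word_errors(reference: str, recognized: str) -> int:
    """Count word-level edits (substitute, insert, delete) needed to
    turn the recognized text into the reference text.

    Assumption: words are separated by whitespace and compared exactly,
    including case and punctuation.
    """
    ref = reference.split()
    hyp = recognized.split()

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j recognized words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from OCR output
                curr[j - 1] + 1,     # extra word in OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, recognized: str) -> float:
    """Return the fraction of reference words recognized correctly,
    clamped to 0.0 when the OCR output has more errors than words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference text must contain at least one word")
    return max(0.0, 1.0 - word_errors(reference, recognized) / ref_len)
```

With a reference of "the quick brown fox" and an OCR output of "the qu1ck brown fox", this counts one substituted word out of four, giving a word accuracy of 0.75. Comparing whole words keeps the measure close to the manual check described in the abstract; a character-level distance would be the natural alternative if finer resolution were needed.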

Start Date

April 2015

Recommended Citation

Suitter, Joshua A., "Accuracy of Optical Character Recognition Software Google Tesseract" (2015). Thinking Matters Symposium Archive. 46.
https://digitalcommons.usm.maine.edu/thinking_matters/46

Included in

Graphics and Human Computer Interfaces Commons
