Android and OCR

I still remember it well: the first piece of software I wrote when I came to the US was a de-skewing algorithm. Deskewing an image helps a lot if you want to do OCR, OMR, or barcode detection, or simply want to improve the readability of scanned images.
At the time, I was working for a small software company, developing TeleForm, an application that reads data from paper forms and stores that data in previously created databases. The Cardiff TeleForm product was later re-branded Verity-TeleForm for a brief period in 2004 and 2005 when Verity Inc. acquired Cardiff Software. In 2005, when Autonomy acquired Verity, the Cardiff brand was reintroduced as Autonomy Cardiff (http://www.cardiff.com); more recently, Autonomy was acquired by HP.

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text. Image deskewing is the process of removing skew from images (especially bitmaps created using a scanner). Skew is an artifact that can occur in scanned images when the camera or scanner is misaligned, because of imperfections in the scanning surface, or simply because the paper was not placed completely flat when scanned.

Nowadays most data entry and origination happens on the Web, where most forms processing has moved as well, so OCR hasn’t been in vogue for quite a while. However, the popularity of smartphones with built-in high-quality cameras has created a new category of mobile applications that benefit greatly from OCR. Take Word Lens (http://questvisual.com) as an example: an augmented-reality translation application that recognizes the letters in an image, looks the words up in a dictionary, and then draws the translated words back on the screen.

On Device or In The Cloud?

Before deciding on an OCR library, one needs to decide where the OCR process should take place: on the smartphone or in the cloud. Each approach has its advantages.
On-device OCR can be performed without requiring an Internet connection; instead of sending a photo, which can potentially be huge (many phones now have 8 or 12 megapixel cameras), the text is recognized by an on-board OCR engine.
However, OCR libraries tend to be large, which means the mobile application will be of considerable size. Depending on the amount of text that needs to be recognized and the available data-transfer speed, a cloud service may deliver the result faster. A cloud service can also be updated more easily, but individually optimizing (training) an OCR engine may work better when done locally on the device.

Which OCR Library to choose?

After taking a closer look at all the comparisons, Tesseract stands out: it provides good accuracy, it’s open source and Apache-licensed, and it has broad language support. It was created by HP and is now developed by Google.

Also, since Tesseract is open source and Apache-licensed, we can take the source and port it to the Android platform, or put it on a web server to run our very own cloud service.

A tesseract is a four-dimensional object, much like a cube is a three-dimensional object. A square has two dimensions; you can make a cube from six squares, giving it three dimensions. The tesseract is made the same way, but in four dimensions.

1. Tesseract

The Tesseract OCR engine was developed at Hewlett Packard Labs and is currently sponsored by Google. It was among the top three OCR engines in terms of character accuracy in 1995 (http://code.google.com/p/tesseract-ocr/).

1.1. Running Tesseract locally on a Mac

As with so many other Unix and Linux tools, Homebrew (http://mxcl.github.com/homebrew/) is the easiest and most flexible way to install the UNIX tools Apple didn’t include with OS X. Once Homebrew is installed (https://github.com/mxcl/homebrew/wiki/installation), Tesseract can be installed on OS X as easily as:
$ brew install tesseract
Once installed, running
$ brew info tesseract
returns something like this:

tesseract 3.00
http://code.google.com/p/tesseract-ocr/
Depends on: libtiff
/usr/local/Cellar/tesseract/3.00 (316 files, 11M)
Tesseract is an OCR (Optical Character Recognition) engine.
The easiest way to use it is to convert the source to a Grayscale tiff:
convert source.png -type Grayscale terre_input.tif
then run tesseract:
tesseract terre_input.tif output
http://github.com/mxcl/homebrew/commits/master/Library/Formula/tesseract.rb

Tesseract doesn’t come with a GUI and instead runs from a command-line interface. To OCR a TIFF-encoded image located on your desktop, you would do something like this:
$ tesseract ~/Desktop/cox.tiff ~/Desktop/cox
Using the image below, Tesseract wrote the resulting text with perfect accuracy into
~/Desktop/cox.txt

 

There are at least two projects providing a GUI front-end for Tesseract on OS X:

  1. TesseractGUI, a native OSX client: http://download.dv8.ro/files/TesseractGUI/
  2. VietOCR, a Java Client: http://vietocr.sourceforge.net/

 

TesseractGUI, a native OSX Client for Tesseract
VietOCR, a Java Client for Tesseract

1.2. Running Tesseract as a Cloud-Service on a Linux Server

One of the fastest and easiest ways to deploy Tesseract as a web service uses Tornado (http://www.tornadoweb.org/), an open-source (Apache-licensed), non-blocking Python web server. Since Tesseract accepts TIFF-encoded images but our cloud service should work with the more popular JPEG image format, we also need to deploy the free Python Imaging Library (http://www.pythonware.com/products/pil/); the license terms are here: http://www.pythonware.com/products/pil/license.htm

The deployment on an Ubuntu 11.10 64-bit server looks something like this:

1.2.1. The HTTP Server-Script for port 8080

The server receives a JPEG image file and stores it locally in the ./static directory before calling image_to_string, which is defined in the Python script below:

1.2.2. image_to_string function implementation

1.2.3. The Service deploy/start Script

After the service has been started, it can be accessed through a web browser, as shown here: http://proton.techcasita.com:8080. I’m currently running Tesseract 3.01 on Ubuntu Linux 11.10 64-bit. Please be gentle; it runs on an Intel Atom CPU 330 @ 1.60 GHz with 4 cores (the kind typically found in netbooks).

The HTML encoded result looks something like this:

1.3. Accessing the Tesseract Cloud-Service from Android

The OCRTaskActivity below utilizes Android’s built-in AsyncTask as well as the Apache Software Foundation’s HttpComponents library, HttpClient 4.1.2, available here: http://hc.apache.org/httpcomponents-client-ga/index.html. OCRTaskActivity expects the image to be passed in as the Intent extra “ByteArray” of type byte array. The OCR result is returned to the calling Activity as OCR_TEXT, as shown here:
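The following is a minimal sketch of such an Activity, not the original listing: it assumes the Tornado service accepts a raw JPEG POST body at the server root and returns plain text, so the endpoint path and upload format need to be adjusted to match the actual server script.

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

import android.app.Activity;
import android.content.Intent;
import android.os.AsyncTask;
import android.os.Bundle;

public class OCRTaskActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // The calling Activity passes the JPEG-encoded image in the "ByteArray" extra.
        byte[] jpeg = getIntent().getByteArrayExtra("ByteArray");
        new OcrTask().execute(jpeg);
    }

    private class OcrTask extends AsyncTask<byte[], Void, String> {

        @Override
        protected String doInBackground(byte[]... params) {
            // Assumed endpoint: the Tornado service described above, listening on port 8080.
            HttpClient client = new DefaultHttpClient();
            try {
                HttpPost post = new HttpPost("http://proton.techcasita.com:8080/");
                post.setHeader("Content-Type", "image/jpeg");
                post.setEntity(new ByteArrayEntity(params[0]));
                HttpResponse response = client.execute(post);
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            } catch (IOException e) {
                return "";
            } finally {
                client.getConnectionManager().shutdown();
            }
        }

        @Override
        protected void onPostExecute(String text) {
            // Hand the recognized text back to the calling Activity as OCR_TEXT.
            Intent result = new Intent();
            result.putExtra("OCR_TEXT", text);
            setResult(RESULT_OK, result);
            finish();
        }
    }
}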

This sample Android app has an Activity that sends a small JPEG image to the cloud service, which is running the Tesseract OCR engine.

1.4. Building a Tesseract native Android Library to be bundled with an Android App

This approach allows an Android application to perform OCR even without a network connection, i.e., the OCR engine is on board. There are currently two source bases to start from:

  1. Tesseract Tools for Android is a set of Android APIs and build files for the Tesseract OCR and Leptonica image processing libraries:
  2. A fork of Tesseract Tools for Android (tesseract-android-tools) that adds some additional functions:

… I went with option 2.

1.4.1. Building the native lib

Each project can be built with the same build steps (see below), and neither works with Android’s NDK r7. However, going back to NDK r6b solved that problem. Here are the build steps; building takes a little while, even on a fast machine.
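Roughly, with the SDK and NDK tools on the PATH, the sequence looks like this (the checkout path is illustrative; check the project’s README for the authoritative steps):
$ git clone git://github.com/rmtheis/tess-two tess
$ cd tess/tess-two
$ ndk-build
$ android update project --path .
$ ant release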

The build steps create the native libraries in the libs/armeabi and libs/armeabi-v7a directories.

The tess-two project can now be included as a library project in an Android application, and with the JNI layer in place, calling into the native OCR library looks something like this:
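Here is a minimal sketch of such a call, assuming the tess-two TessBaseAPI wrapper and an eng.traineddata file already sitting in a tessdata directory on the SD card:

import android.graphics.Bitmap;
import com.googlecode.tesseract.android.TessBaseAPI;

public class NativeOcr {

    // datapath must point at the directory that *contains* the tessdata folder,
    // e.g. /mnt/sdcard/ when the traineddata lives in /mnt/sdcard/tessdata/.
    public static String recognize(String datapath, Bitmap bitmap) {
        TessBaseAPI baseApi = new TessBaseAPI();
        baseApi.init(datapath, "eng");   // language code matching eng.traineddata
        baseApi.setImage(bitmap);
        String recognizedText = baseApi.getUTF8Text();
        baseApi.end();                   // release the native memory
        return recognizedText;
    }
}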


1.4.2. Developing a simple Android App with built-in OCR capabilities

1.4.2.1. Libraries / TrainedData / App Size

The native libraries are about 3 MB in size. Additionally, a language- and font-dependent training resource file is needed.
The eng.traineddata file (available, for instance, with the desktop version of Tesseract) is placed into the Android project’s assets/tessdata folder and deployed with the application, adding another 2 MB to the app. However, due to compression, the actual downloadable Android application is “only” about 4.1 MB.

During the first start of the application, the eng.traineddata resource file is copied to the phone’s SD card.
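A minimal sketch of that first-start copy step might look like this (the helper class name and target path are illustrative):

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import android.content.Context;
import android.os.Environment;

public class TrainedDataInstaller {

    // Copies assets/tessdata/eng.traineddata to <sdcard>/tessdata/eng.traineddata,
    // so the native engine can read it from a regular file path.
    public static void installEnglishTrainedData(Context context) throws Exception {
        File tessdataDir = new File(Environment.getExternalStorageDirectory(), "tessdata");
        File trainedData = new File(tessdataDir, "eng.traineddata");
        if (trainedData.exists()) {
            return; // already installed during an earlier start
        }
        tessdataDir.mkdirs();
        InputStream in = context.getAssets().open("tessdata/eng.traineddata");
        OutputStream out = new FileOutputStream(trainedData);
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) > 0) {
            out.write(buffer, 0, len);
        }
        out.close();
        in.close();
    }
}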

The ocr() method for the sample app may look something like this:
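A sketch along those lines (not the original listing) follows; it assumes the traineddata was copied to the SD card as described above, and it decodes the captured JPEG into an ARGB_8888 bitmap, the configuration the tess-two wrapper works with:

import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.os.Environment;

import com.googlecode.tesseract.android.TessBaseAPI;

public class SimpleOcr {

    // Decodes the captured JPEG and runs the on-board Tesseract engine over it.
    public String ocr(String jpegPath) {
        // Ask the decoder for ARGB_8888 and scale a multi-megapixel photo down a bit.
        BitmapFactory.Options options = new BitmapFactory.Options();
        options.inPreferredConfig = Bitmap.Config.ARGB_8888;
        options.inSampleSize = 2;
        Bitmap photo = BitmapFactory.decodeFile(jpegPath, options);

        // Parent directory of the tessdata folder filled during the first start.
        String datapath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/";
        TessBaseAPI baseApi = new TessBaseAPI();
        baseApi.init(datapath, "eng");
        baseApi.setImage(photo);
        String recognized = baseApi.getUTF8Text();
        baseApi.end();
        photo.recycle();
        return recognized;
    }
}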

OCR on Android

The popularity of smartphones, combined with built-in high-quality cameras, has created a new category of mobile applications that benefit greatly from OCR.

OCR is a very mature technology with a broad range of available libraries to choose from. There are Apache- and BSD-licensed, fast, and accurate solutions available from the open-source community. I have taken a closer look at Tesseract, which was created by HP and is now developed by Google.

Tesseract can be used to build a desktop application or a cloud service, and it can even be baked into a mobile Android application to perform on-board OCR. All three variations of OCR with the Tesseract library have been demonstrated above.

Focusing on mobile applications, however, it became very clear that even on phones with a 5 MP camera, the accuracy of the results still varies greatly, depending on lighting conditions, font and font sizes, as well as surrounding artifacts.

Just like with the TeleForm application, even the best OCR engines perform poorly if the input image has not been prepared correctly. To make OCR work on a mobile device, no matter whether the OCR will eventually run on board or in the cloud, much development time needs to be spent training the engine and, even more importantly, selecting and preparing the image areas that will be provided as input to the OCR engine. It’s going to be all about the pre-processing.
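As a small illustration of such a pre-processing step (not the pipeline used here), converting the captured bitmap to grayscale before handing it to the engine can be done with a ColorMatrix:

import android.graphics.Bitmap;
import android.graphics.Canvas;
import android.graphics.ColorMatrix;
import android.graphics.ColorMatrixColorFilter;
import android.graphics.Paint;

public class PreProcessing {

    // Renders the source Bitmap through a desaturating ColorMatrix,
    // a cheap first step before cropping, thresholding, or deskewing.
    public static Bitmap toGrayscale(Bitmap source) {
        Bitmap gray = Bitmap.createBitmap(source.getWidth(), source.getHeight(),
                Bitmap.Config.ARGB_8888);
        ColorMatrix desaturate = new ColorMatrix();
        desaturate.setSaturation(0f);
        Paint paint = new Paint();
        paint.setColorFilter(new ColorMatrixColorFilter(desaturate));
        new Canvas(gray).drawBitmap(source, 0, 0, paint);
        return gray;
    }
}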

 

This shows my Capture OCR sample Android application (with the Tesseract OCR engine built in), after it performed OCR on a just-taken photo of a book cover.
