Android and OCR – Wolf Paulus

I still remember it well: the first piece of software I wrote when I came to the US was a de-skewing algorithm. Deskewing an image helps a lot if you want to do OCR, OMR, or barcode detection, or just improve the readability of scanned images.
At the time, I was working for a small software company, developing TeleForm, an application that reads data from paper forms and stores that data in previously created databases. The Cardiff TeleForm product was later re-branded Verity-TeleForm for a brief period in 2004 and 2005 when Verity Inc. acquired Cardiff Software. In 2005, when Autonomy acquired Verity, the Cardiff brand was reintroduced as Autonomy Cardiff (http://www.cardiff.com); more recently, Autonomy was acquired by HP.

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text.
Image deskew is the process of removing skew from images (especially bitmaps created using a scanner). Skew is an artifact that can occur in scanned images because the camera was misaligned, because of imperfections in the scanner or the scanned surface, or simply because the paper was not placed completely flat when scanned.
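To make the idea of skew concrete, here is a minimal sketch of one common way to estimate it, a projection-profile search. This is not the TeleForm algorithm; the function names `profile_score` and `estimate_skew`, the angle range, and the step size are assumptions chosen for illustration. The input is simply a list of (x, y) coordinates of dark pixels.

```python
import math

def profile_score(points, angle_deg):
    """Score how well the dark pixels line up in horizontal rows after
    rotating them by -angle_deg about the origin (higher is better)."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        # y-coordinate of the pixel after rotating it by -angle_deg
        row = int(round(y * cos_t - x * sin_t))
        rows[row] = rows.get(row, 0) + 1
    # Aligned text lines concentrate their pixels in few rows, so the
    # sum of squared row counts is largest near the true skew angle.
    return sum(count * count for count in rows.values())

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Try every candidate angle in [-max_angle, max_angle] and return
    the one (in degrees) with the best profile score."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda angle: profile_score(points, angle))
```

Once the angle is known, the image would be rotated back by that amount before it is handed to the OCR engine. A brute-force search like this is slow on large images, so a real implementation would typically work on a downscaled copy.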

Nowadays most data entry or origination happens on the Web, and most forms processing has moved there as well, i.e. OCR hasn’t been in vogue for quite a while. However, the popularity of smartphones, combined with built-in high-quality cameras, has created a new category of mobile applications that benefit greatly from OCR. Take Word-Lens (http://questvisual.com) as an example: an augmented reality translation application that tries to find out what the letters in an image are, looks the words up in a dictionary, and eventually draws the translated words back on the screen.

On Device or In The Cloud ?

Before deciding on an OCR library, one needs to decide where the OCR process should take place: on the smartphone or in the cloud. Each approach has its advantages.
On-device OCR can be performed without an Internet connection, and instead of sending a photo, which can potentially be huge (many phones have 8 or 12 megapixel cameras now), the text is recognized by an on-board OCR engine.
However, OCR-libraries tend to be large, i.e. the mobile application will be of considerable size. Depending on the amount of text that needs to be recognized and the available data transfer speed, a cloud-service may provide the result faster. A cloud-service can be updated more easily but individually optimizing (training) an OCR engine may work better when done locally on the device.
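The trade-off can be put into rough numbers. The sketch below is only a back-of-the-envelope model; the function names and every figure in the usage note (image size, uplink speed, OCR times) are hypothetical values picked for illustration, not measurements.

```python
def upload_seconds(image_bytes, uplink_bits_per_second, round_trip_seconds=0.0):
    """Rough time needed to send an image to a cloud OCR service."""
    return image_bytes * 8.0 / uplink_bits_per_second + round_trip_seconds

def cloud_is_faster(image_bytes, uplink_bits_per_second,
                    cloud_ocr_seconds, device_ocr_seconds,
                    round_trip_seconds=0.0):
    """Compare upload time plus server-side OCR against on-device OCR."""
    cloud_total = (upload_seconds(image_bytes, uplink_bits_per_second,
                                  round_trip_seconds)
                   + cloud_ocr_seconds)
    return cloud_total < device_ocr_seconds
```

For example, a hypothetical 2 MB JPEG over a 1 Mbit/s uplink needs 16 seconds for the upload alone, so the cloud only wins if the image is cropped or compressed first, or if the on-device engine is much slower than the server.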

Which OCR Library to choose ?

Wikipedia has a “non-exhaustive” but still very broad comparison of optical character recognition software here: http://en.wikipedia.org/wiki/List_of_optical_character_recognition_software
A comparison of some of the more popular OCR-Engines can be found here: http://www.freewaregenius.com/2011/11/01/how-to-extract-text-from-images-a-comparison-of-free-ocr-tools/
Linux OCR software comparison with a strong focus on accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

After taking a closer look at all the comparisons, Tesseract stands out. It provides good accuracy, it’s open source and Apache-licensed, and it has broad language support. It was created by HP and is now developed by Google.

Also, since Tesseract is open source and Apache-licensed, we can take the source and port it to the Android platform, or put it on a Web server to run our very own cloud service.

A Tesseract is a four-dimensional object, much like a cube is a three-dimensional object. A square has two dimensions. You can make a cube from six squares. A cube has three dimensions. The tesseract is made in the same way, but in four dimensions.

1. Tesseract

The Tesseract OCR engine was developed at Hewlett Packard Labs and is currently sponsored by Google. It was among the top three OCR engines in terms of character accuracy in 1995. http://code.google.com/p/tesseract-ocr/

1.1. Running Tesseract locally on a Mac

As with so many other Unix and Linux tools, Homebrew (http://mxcl.github.com/homebrew/) is the easiest and most flexible way to install the UNIX tools Apple didn’t include with OS X. Once Homebrew is installed (https://github.com/mxcl/homebrew/wiki/installation), installing Tesseract on OS X is as easy as:
$ brew install tesseract
Once installed,
$ brew info tesseract
will return something like this:
tesseract 3.00
http://code.google.com/p/tesseract-ocr/
Depends on: libtiff
/usr/local/Cellar/tesseract/3.00 (316 files, 11M)
Tesseract is an OCR (Optical Character Recognition) engine.
The easiest way to use it is to convert the source to a Grayscale tiff:
  `convert source.png -type Grayscale terre_input.tif`
then run tesseract:
  `tesseract terre_input.tif output`
http://github.com/mxcl/homebrew/commits/master/Library/Formula/tesseract.rb
Tesseract doesn’t come with a GUI and instead runs from a command-line interface. To OCR a TIFF-encoded image located on your desktop, you would do something like this:
$ tesseract ~/Desktop/cox.tiff ~/Desktop/cox
Using the image below, Tesseract wrote the resulting text, with perfect accuracy, into ~/Desktop/cox.txt

There are at least two projects, providing a GUI-front-end for Tesseract on OS X

TesseractGUI, a native OSX client: http://download.dv8.ro/files/TesseractGUI/
VietOCR, a Java Client: http://vietocr.sourceforge.net/

TesseractGUI, a native OSX Client for Tesseract

1.2. Running Tesseract as a Cloud-Service on a Linux Server

One of the fastest and easiest ways to deploy Tesseract as a Web service uses Tornado (http://www.tornadoweb.org/), an open source (Apache-licensed), non-blocking Python web server. Since Tesseract accepts TIFF-encoded images, but our cloud service should rather work with the more popular JPEG image format, we also need to deploy the free Python Imaging Library (http://www.pythonware.com/products/pil/). Its license terms are here: http://www.pythonware.com/products/pil/license.htm

The deployment on Ubuntu 11.10 64-bit server looks something like this:

sudo apt-get install python-tornado
sudo apt-get install python-imaging
sudo apt-get install tesseract-ocr

1.2.1. The HTTP Server-Script for port 8080

#!/usr/bin/env python
import tornado.httpserver
import tornado.ioloop
import tornado.web
import pprint
import Image
from tesseract import image_to_string
import StringIO
import os.path
import uuid
 
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # Serve a minimal upload form.
        self.write('<form action="/" method="post" enctype="multipart/form-data">'
                   '<input type="file" name="the_file" />'
                   '<input type="submit" value="Submit" />'
                   '</form>')

    def post(self):
        self.set_header("Content-Type", "text/html")
        self.write("<html><body>")

        # create a unique file name for the uploaded image
        tempname = str(uuid.uuid4()) + ".jpg"
        myimg = Image.open(StringIO.StringIO(self.request.files.items()[0][1][0]['body']))
        myfilename = os.path.join(os.path.dirname(__file__), "static", tempname)

        # save image to file as JPEG
        myimg.save(myfilename)

        # do OCR, print result
        self.write(image_to_string(myimg))
        self.write("</body></html>")
 
settings = {
    "static_path": os.path.join(os.path.dirname(__file__), "static"),
}
 
application = tornado.web.Application([
    (r"/", MainHandler),
], **settings)
 
if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8080)
    tornado.ioloop.IOLoop.instance().start()

The Server receives a JPEG image file and stores it locally in the ./static directory, before calling image_to_string, which is defined in the Python script below:

1.2.2. image_to_string function implementation

#!/usr/bin/env python
 
tesseract_cmd = 'tesseract'
 
import Image
import StringIO
import subprocess
import sys
import os
 
__all__ = ['image_to_string']
 
def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False):
    '''
    runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`
 
    returns the exit status of tesseract, as well as tesseract's stderr output
 
    '''
 
    command = [tesseract_cmd, input_filename, output_filename_base]
 
    if lang is not None:
        command += ['-l', lang]
 
    if boxes:
        command += ['batch.nochop', 'makebox']
 
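    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE)
    # communicate() drains stderr while waiting, so a chatty tesseract
    # cannot block on a full pipe
    stderr_output = proc.communicate()[1]
    return (proc.returncode, stderr_output)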
 
def cleanup(filename):
    ''' tries to remove the given filename. Ignores non-existent files '''
    try:
        os.remove(filename)
    except OSError:
        pass
 
def get_errors(error_string):
    '''
    returns all lines in the error_string that contain the string "Error";
    if there are none, returns the whole (stripped) error_string
 
    '''
 
    lines = error_string.splitlines()
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    if len(error_lines) > 0:
        return '\n'.join(error_lines)
    else:
        return error_string.strip()
 
def tempnam():
    ''' returns a temporary file-name '''
 
    # prevent os.tempnam from printing an error...
    stderr = sys.stderr
    try:
        sys.stderr = StringIO.StringIO()
        return os.tempnam(None, 'tess_')
    finally:
        sys.stderr = stderr
 
class TesseractError(Exception):
    def __init__(self, status, message):
        self.status = status
        self.message = message
        self.args = (status, message)
 
def image_to_string(image, lang=None, boxes=False):
    '''
    Runs tesseract on the specified image. First, the image is written to disk,
    and then the tesseract command is run on the image. Tesseract's result is
    read, and the temporary files are erased.
 
    '''
 
    input_file_name = '%s.bmp' % tempnam()
    output_file_name_base = tempnam()
    if not boxes:
        output_file_name = '%s.txt' % output_file_name_base
    else:
        output_file_name = '%s.box' % output_file_name_base
    try:
        image.save(input_file_name)
        status, error_string = run_tesseract(input_file_name,
                                             output_file_name_base,
                                             lang=lang,
                                             boxes=boxes)
        if status:
            errors = get_errors(error_string)
            raise TesseractError(status, errors)
        f = file(output_file_name)
        try:
            return f.read().strip()
        finally:
            f.close()
    finally:
        cleanup(input_file_name)
        cleanup(output_file_name)
 
if __name__ == '__main__':
    if len(sys.argv) == 2:
        filename = sys.argv[1]
        try:
            image = Image.open(filename)
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print image_to_string(image)
    elif len(sys.argv) == 4 and sys.argv[1] == '-l':
        lang = sys.argv[2]
        filename = sys.argv[3]
        try:
            image = Image.open(filename)
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print image_to_string(image, lang=lang)
    else:
        sys.stderr.write('Usage: python tesseract.py [-l language] input_file\n')
        exit(2)
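The script above targets Python 2: `os.tempnam`, `file()`, the `StringIO` module, and the `print` statement do not exist in Python 3. As a rough sketch of how the same flow might look on Python 3 using only the standard library, the version below takes the path of an image file instead of a PIL image. The names `build_command` and `ocr_file` are my own, and it assumes the `tesseract` binary is on the PATH.

```python
import os
import subprocess
import tempfile

TESSERACT_CMD = 'tesseract'  # assumption: the binary is on the PATH

def build_command(input_filename, output_base, lang=None, boxes=False):
    """Build the same command line that run_tesseract() above executes."""
    command = [TESSERACT_CMD, input_filename, output_base]
    if lang is not None:
        command += ['-l', lang]
    if boxes:
        command += ['batch.nochop', 'makebox']
    return command

def ocr_file(input_filename, lang=None, boxes=False):
    """Run tesseract on an image file and return the recognized text."""
    # The temporary directory (and tesseract's output file in it) is
    # removed automatically when the with-block exits.
    with tempfile.TemporaryDirectory(prefix='tess_') as tmpdir:
        output_base = os.path.join(tmpdir, 'out')
        proc = subprocess.run(
            build_command(input_filename, output_base, lang, boxes),
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.decode('utf-8', 'replace').strip())
        extension = '.box' if boxes else '.txt'
        with open(output_base + extension, encoding='utf-8') as f:
            return f.read().strip()
```

Calling `ocr_file('cox.tiff')` would then play the role of `image_to_string(image)`, provided the file is in a format the installed Tesseract build can read.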

1.2.3. The Service deploy/start Script

description  "OCR WebService"
 
start on runlevel [2345]
stop on runlevel [!2345]
 
pre-start script
    # -p keeps the script from failing when the directories already exist
    mkdir -p /tmp/ocr/static
    cp /usr/share/ocr/*.py /tmp/ocr
end script
exec /tmp/ocr/tesserver.py

After the service has been started, it can be accessed through a Web browser, as shown here: http://proton.techcasita.com:8080. I’m currently running Tesseract 3.01 on Ubuntu Linux 11.10 64-bit. Please be gentle: it runs on an Intel Atom CPU 330 @ 1.60GHz with 4 cores (typically found in netbooks).

The HTML encoded result looks something like this:

<html><body>Contact Us
www. cox.com
Customer Serv 760-788-9000
Repair 76O—788~71O0
Cox Telephone 888-222-7743</body></html>
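For quick testing without a browser, the service can also be called from a script. The following is a sketch under a few assumptions: it targets Python 3 and its standard library only, the helper names are made up for this example, and it relies on the server wrapping its answer in `<html><body>…</body></html>` as in the result shown above.

```python
import urllib.request
import uuid

def build_multipart(field_name, filename, data, boundary):
    """Encode a single file as a multipart/form-data request body."""
    head = ('--%s\r\n'
            'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
            'Content-Type: image/jpeg\r\n\r\n' % (boundary, field_name, filename))
    tail = '\r\n--%s--\r\n' % boundary
    return head.encode('ascii') + data + tail.encode('ascii')

def extract_body_text(html):
    """Return the text between <body> and </body>; if the tags are
    missing, return the stripped input unchanged."""
    start = html.find('<body>')
    end = html.rfind('</body>')
    if start == -1 or end == -1:
        return html.strip()
    return html[start + len('<body>'):end].strip()

def ocr_via_service(url, jpeg_bytes):
    """POST a JPEG to the OCR service and return the recognized text."""
    boundary = uuid.uuid4().hex
    body = build_multipart('the_file', 'the_image.jpg', jpeg_bytes, boundary)
    request = urllib.request.Request(url, data=body, headers={
        'Content-Type': 'multipart/form-data; boundary=' + boundary})
    with urllib.request.urlopen(request) as response:
        return extract_body_text(response.read().decode('utf-8', 'replace'))
```

With the server from section 1.2.1 running locally, `ocr_via_service('http://localhost:8080/', open('cox.jpg', 'rb').read())` should return the plain text; the host and file name here are placeholders.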

1.3 Accessing the Tesseract Cloud-Service from Android

The OCRTaskActivity below uses Android’s built-in AsyncTask as well as HttpClient 4.1.2 from the Apache Software Foundation’s HttpComponents project, available here: http://hc.apache.org/httpcomponents-client-ga/index.html. OCRTaskActivity expects the image to be passed in as the Intent extra “ByteArray”, a byte array. The OCR result is returned to the calling Activity as OCR_TEXT, as shown here:

setResult(Activity.RESULT_OK, getIntent().putExtra("OCR_TEXT", result));

import android.app.Activity;
import android.graphics.BitmapFactory;
import android.os.AsyncTask;
import android.os.Bundle;
import android.util.Log;
import android.view.View;
import android.widget.ImageView;
import android.widget.ProgressBar;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.HttpMultipartMode;
import org.apache.http.entity.mime.MultipartEntity;
import org.apache.http.entity.mime.content.ByteArrayBody;
import org.apache.http.entity.mime.content.StringBody;
import org.apache.http.impl.client.DefaultHttpClient;
 
import java.io.BufferedReader;
import java.io.InputStreamReader;
 
public class OCRTaskActivity extends Activity {
    private static String LOG_TAG = OCRTaskActivity.class.getSimpleName();
    private static String[] URL_STRINGS = {"http://proton.techcasita.com:8080"};
 
    private byte[] mBA;
    private ProgressBar mProgressBar;
 
    @Override
    public void onCreate(final Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.ocr);
        mBA = getIntent().getExtras().getByteArray("ByteArray");
        ImageView iv = (ImageView) findViewById(R.id.ImageView);
        iv.setImageBitmap(BitmapFactory.decodeByteArray(mBA, 0, mBA.length));
        mProgressBar = (ProgressBar) findViewById(R.id.progressBar);
        OCRTask task = new OCRTask();
        task.execute(URL_STRINGS);
    }
 
    private class OCRTask extends AsyncTask<String, Void, String> {
        @Override
        protected String doInBackground(final String... urls) {
            String response = "";
            for (String url : urls) {
                try {
                    response = executeMultipartPost(url, mBA);
                    Log.v(LOG_TAG, "Response:" + response);
                    break;
                } catch (Throwable ex) {
                    Log.e(LOG_TAG, "error: " + ex.getMessage());
                }
            }
            return response;
        }
 
        @Override
        protected void onPostExecute(final String result) {
            mProgressBar.setVisibility(View.GONE);
            setResult(Activity.RESULT_OK, getIntent().putExtra("OCR_TEXT", result));
            finish();
        }
    }
 
    private String executeMultipartPost(final String stringUrl, final byte[] bm) throws Exception {
        HttpClient httpClient = new DefaultHttpClient();
        HttpPost postRequest = new HttpPost(stringUrl);
        ByteArrayBody bab = new ByteArrayBody(bm, "the_image.jpg");
        MultipartEntity reqEntity = new MultipartEntity(HttpMultipartMode.BROWSER_COMPATIBLE);
        reqEntity.addPart("uploaded", bab);
        reqEntity.addPart("name", new StringBody("the_file"));
        postRequest.setEntity(reqEntity);
        HttpResponse response = httpClient.execute(postRequest);
        BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
        String sResponse;
        StringBuilder s = new StringBuilder();
 
        while ((sResponse = reader.readLine()) != null) {
            s = s.append(sResponse).append('\n');
        }
        int i = s.indexOf("body");
        int j = s.lastIndexOf("body");
        return s.substring(i + 5, j - 2);
    }
}

This sample Android app has an Activity that sends a small JPEG image to the Cloud-Service, which is running the Tesseract OCR engine.

1.4. Building a Tesseract native Android Library to be bundled with an Android App

This approach allows an Android application to perform OCR even without a network connection, i.e. the OCR engine is on-board. There are currently two source bases to start from, both building on the original Tesseract project:

Tesseract Tools for Android is a set of Android APIs and build files for the Tesseract OCR and Leptonica image processing libraries:
```
svn checkout http://tesseract-android-tools.googlecode.com/svn/trunk/ tesseract-android-tools
```
A fork of Tesseract Tools for Android (tesseract-android-tools) that adds some additional functions:
```
git clone git://github.com/rmtheis/tess-two.git
```

I went with option 2, tess-two.

1.4.1. Building the native lib

Each project can be built with the same build steps (see below), and neither works with Android’s NDK r7. However, going back to NDK r6b solved that problem. Here are the build steps. The build takes a little while, even on a fast machine.

cd <project-directory>/tess-two
export TESSERACT_PATH=${PWD}/external/tesseract-3.01
export LEPTONICA_PATH=${PWD}/external/leptonica-1.68
export LIBJPEG_PATH=${PWD}/external/libjpeg
ndk-build
android update project --path .
ant release

The build steps create the native libraries in the libs/armeabi and libs/armeabi-v7a directories.

The tess-two project can now be included as a library project in an Android project. With the JNI layer in place, calling into the native OCR library looks something like this:

1.4.2. Developing a simple Android App with built-in OCR capabilities

...
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(DATA_PATH, LANG);
baseApi.setImage(bitmap);
String recognizedText = baseApi.getUTF8Text();
baseApi.end();
...

1.4.2.1. Libraries / TrainedData / App Size

The native libraries are about 3 MBytes in size. Additionally, a language- and font-dependent training resource file is needed.
The eng.traineddata file (e.g. available with the desktop version of Tesseract) is placed into the Android project’s assets/tessdata folder and deployed with the application, adding another 2 MBytes to the app. However, due to compression, the actual downloadable Android application is “only” about 4.1 MBytes.

During the first start of the application, the eng.traineddata resource file is copied to the phone’s SDCard.

The ocr() method for the sample app may look something like this:

protected void ocr() {
 
        BitmapFactory.Options options = new BitmapFactory.Options();
        options.inSampleSize = 2;
        Bitmap bitmap = BitmapFactory.decodeFile(IMAGE_PATH, options);
 
        try {
            ExifInterface exif = new ExifInterface(IMAGE_PATH);
            int exifOrientation = exif.getAttributeInt(ExifInterface.TAG_ORIENTATION, ExifInterface.ORIENTATION_NORMAL);
 
            Log.v(LOG_TAG, "Orient: " + exifOrientation);
 
            int rotate = 0;
            switch (exifOrientation) {
                case ExifInterface.ORIENTATION_ROTATE_90:
                    rotate = 90;
                    break;
                case ExifInterface.ORIENTATION_ROTATE_180:
                    rotate = 180;
                    break;
                case ExifInterface.ORIENTATION_ROTATE_270:
                    rotate = 270;
                    break;
            }
 
            Log.v(LOG_TAG, "Rotation: " + rotate);
 
            if (rotate != 0) {
 
                // Getting width & height of the given image.
                int w = bitmap.getWidth();
                int h = bitmap.getHeight();
 
                // Setting pre rotate
                Matrix mtx = new Matrix();
                mtx.preRotate(rotate);
 
                // Rotating Bitmap
                bitmap = Bitmap.createBitmap(bitmap, 0, 0, w, h, mtx, false);
            }

            // Tesseract requires ARGB_8888, whether or not the image was rotated
            bitmap = bitmap.copy(Bitmap.Config.ARGB_8888, true);
 
        } catch (IOException e) {
            Log.e(LOG_TAG, "Rotate or conversion failed: " + e.toString());
        }
 
        ImageView iv = (ImageView) findViewById(R.id.image);
        iv.setImageBitmap(bitmap);
        iv.setVisibility(View.VISIBLE);
 
        Log.v(LOG_TAG, "Before baseApi");
 
        TessBaseAPI baseApi = new TessBaseAPI();
        baseApi.setDebug(true);
        baseApi.init(DATA_PATH, LANG);
        baseApi.setImage(bitmap);
        String recognizedText = baseApi.getUTF8Text();
        baseApi.end();
 
        Log.v(LOG_TAG, "OCR Result: " + recognizedText);
 
        // clean up and show
        if (LANG.equalsIgnoreCase("eng")) {
            recognizedText = recognizedText.replaceAll("[^a-zA-Z0-9]+", " ");
        }
        if (recognizedText.length() != 0) {
            ((TextView) findViewById(R.id.field)).setText(recognizedText.trim());
        }
    }
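Two small pieces of logic in ocr() are easy to try out in isolation: the mapping from the EXIF orientation tag to a rotation angle, and the clean-up of the recognized text. The Python sketch below mirrors them; the function names are hypothetical, and the numeric keys are the standard EXIF orientation values behind ExifInterface.ORIENTATION_ROTATE_90, _180, and _270.

```python
import re

# EXIF orientation value -> degrees to rotate, as in the switch statement above
EXIF_ROTATION = {6: 90, 3: 180, 8: 270}

def rotation_for(exif_orientation):
    """Return the rotation in degrees; 0 for normal or unhandled values."""
    return EXIF_ROTATION.get(exif_orientation, 0)

def clean_english(text):
    """Replace every run of non-alphanumeric characters with one space,
    the same substitution ocr() applies when LANG is "eng"."""
    return re.sub(r'[^a-zA-Z0-9]+', ' ', text).strip()
```

Note what the clean-up costs: `clean_english('Repair 760-788~7100')` yields `'Repair 760 788 7100'`, so stray OCR noise disappears, but so does meaningful punctuation such as the dashes in a phone number or the dots in a URL.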

OCR on Android

The popularity of smartphones, combined with built-in high-quality cameras, has created a new category of mobile applications that benefit greatly from OCR.

OCR is a very mature technology with a broad range of available libraries to choose from. There are fast and accurate Apache- and BSD-licensed solutions available from the open-source community. I have taken a closer look at Tesseract, which was created by HP and is now developed by Google.

Tesseract can be used to build a desktop application or a cloud service, and it can even be baked into a mobile Android application that performs on-board OCR. All three variations of OCR with the Tesseract library have been demonstrated above.

Focusing on mobile applications, however, it became very clear that even on phones with a 5 MP camera, the accuracy of the results still varies greatly, depending on lighting conditions, fonts, and font sizes, as well as surrounding artifacts.

Just like with the TeleForm application, even the best OCR engines perform poorly if the input image has not been prepared correctly. To make OCR work on a mobile device, no matter whether the OCR will eventually run on-board or in the cloud, much development time needs to be spent training the engine and, even more importantly, selecting and preparing the image areas that will be provided as input to the OCR engine. It’s going to be all about the pre-processing.
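As one illustration of what such pre-processing can look like, here is a minimal sketch of global binarization with Otsu's method, which picks the gray level that best separates ink from background. It is a textbook technique, not something the sample apps above implement; the function names are mine, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class
    variance of the dark and light pixel groups."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark group
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold

def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255."""
    return [0 if value <= threshold else 255 for value in pixels]
```

A single global threshold works for evenly lit scans; photos taken with a phone often have shadows and gradients, where a locally adaptive threshold tends to do better.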

This shows my Capture OCR sample Android application (with the Tesseract OCR engine built in), after it performed OCR on a freshly taken photo of a book cover.
