OCR - Optical Character Recognition - Converting Documents to Full Text

Ch: 6 - Computer Concepts and Legal Applications

Full Text Search & Retrieval & Optical Character Recognition (OCR)

Recognition software, referred to as OCR (optical character recognition) and ICR (intelligent character recognition), has steadily grown in importance as a subset of document imaging technology. Both technologies convert visually readable characters into ASCII text, which a computer can store, edit, and process. OCR, which was developed first, recognizes type fonts by pattern matching, character assessment, and a crude learning process; ICR reads hand printing and, to a lesser extent, handwriting.

OCR software converts the letters and numbers in a bitmapped image of a page into computer-readable text, typically ASCII. The software analyzes the dots, or pixels, on the bitmapped page and tries to work out which dots form an “A” shape, which form an “8” shape, and so on. It then compares each shape to its library of pattern templates. If the match is close enough, it sends the ASCII equivalent of that character to the output file. The recognized letters form words, which can then be searched using full text software. OCR software works well with typed pages, but not as well with documents that contain boxes, lines, or handwriting.
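The template comparison described above can be sketched in a few lines of Python. This is only an illustration, not how any particular OCR product works: the 5x5 templates, the `match_score` and `recognize` names, and the 0.85 threshold are all assumptions made for the example.

```python
# Hypothetical 5x5 pattern templates: '#' is a black dot, '.' is a white dot.
TEMPLATES = {
    "A": (
        ".###.",
        "#...#",
        "#####",
        "#...#",
        "#...#",
    ),
    "8": (
        ".###.",
        "#...#",
        ".###.",
        "#...#",
        ".###.",
    ),
}


def match_score(glyph, template):
    """Return the fraction of dot positions where glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += (g == t)
    return same / total


def recognize(glyph, threshold=0.85):
    """Return the best-matching character, or None if nothing is close enough."""
    best_char, best_score = None, 0.0
    for char, template in TEMPLATES.items():
        score = match_score(glyph, template)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= threshold else None


# A scanned "A" with one speckle: an extra black dot in the top-left corner.
noisy_a = (
    "####.",
    "#...#",
    "#####",
    "#...#",
    "#...#",
)

print(recognize(noisy_a))          # the speckled glyph still matches "A"
print(recognize((".....",) * 5))   # a blank glyph matches nothing well enough
```

The threshold is the trade-off: set it too high and clean characters are rejected, set it too low and smudged shapes are accepted as the wrong character.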

A two-step process is involved in converting paper into full text. First, the paper is “scanned” using a scanner that converts the page into a digital bitmap image, a collection of black and white dots. Then, OCR software reads the dots and converts them into alphabetic letters or numbers. The converted full text is then loaded into a full text program for searching. Images do not have to be converted immediately; they can be converted later if the information becomes important as a coder reviews the document.

OCR is gr~at, but how can I correct err~rs ~n the conver~~on to a f~ll text docu~e~nt?

The major issues in the conversion process are CONVERSION ACCURACY and COST. Depending on the type and format of the written documents, the conversion rate can easily fluctuate between 70% and 99%. This means that only 70% to 99% of the document was accurately converted into full text. The problem is that the converted document has misspellings and unintelligible characters. The number of mistakes depends on the quality of the original paper document and the OCR software. If it is a first generation photocopy with clear type, like a deposition, then the conversion rate can be quite good, in the 95-99% range. If the paper is a 3rd or 4th generation photocopy with lines, smudges, handwriting, etc., the general rule of thumb is that the conversion will end up 50-80% accurate. Conversion accuracy is improving, and with the use of spell checkers and ICR (Intelligent Character Recognition), the conversion rate should increase over the next few years.

OCR and ICR software measures accuracy based only on the mistakes it knows it made. When the software cannot decipher a character, it highlights the word and presents the user with the percentage of errors it made. One of the major problems with OCR and ICR software is substitution errors, where the software is convinced it has read a character correctly when in fact it is wrong. When substitution errors are factored in, the accuracy rate drops.
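The gap between the accuracy the software reports and the accuracy it actually achieved can be shown with a small Python sketch. It assumes a known-correct copy of the text is available for comparison, that the OCR output has the same length as the original, and that the software marks characters it could not read with a "~" (a hypothetical reject marker chosen for this example).

```python
def accuracy_report(truth, ocr_output, reject_mark="~"):
    """Compare OCR output to the known-correct text, character by character.

    Assumes both strings have the same length, i.e. the OCR pass neither
    dropped nor inserted characters.
    """
    if len(truth) != len(ocr_output):
        raise ValueError("this sketch only handles equal-length strings")

    total = len(truth)
    # Rejects: characters the software knows it could not read.
    rejects = sum(1 for c in ocr_output if c == reject_mark)
    # Substitutions: characters the software believes are right but are wrong.
    substitutions = sum(
        1 for t, o in zip(truth, ocr_output) if o != reject_mark and o != t
    )
    return {
        "rejects": rejects,
        "substitutions": substitutions,
        "reported_accuracy": (total - rejects) / total,
        "true_accuracy": (total - rejects - substitutions) / total,
    }


truth = "OCR is great for 5 clear pages"
ocr = "OCR is gr~at for S c1ear pages"

report = accuracy_report(truth, ocr)
print(report)
```

In this sample the software would count one error (the rejected character), while the two silent substitutions ("S" for "5" and "1" for "l") lower the true accuracy further.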

Other factors reducing conversion accuracy:

Colored paper and forms with narrow borders or constraints;
Skewed or dog-eared documents;
Faded or faxed letters and columns;
Speckles, blotches and indecipherable marks;

Lookup tables, dictionaries and automated spell checkers help to correct errors. Lookup tables can check addresses, zip codes and social security numbers automatically. Some systems will highlight the imaged document and the full text errors for easy user correction.
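As a rough illustration of automated checking, the Python sketch below flags fields and words that look wrong. It is a simplification under stated assumptions: the format patterns stand in for a real lookup table (they check only the shape of a US ZIP code or Social Security number, not whether it exists), and the tiny `DICTIONARY` set is a hypothetical placeholder for a full word list.

```python
import re

# Hypothetical format checks standing in for real lookup tables.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical word list; a real system would load a full dictionary plus
# names and technical terms specific to the case.
DICTIONARY = {"the", "witness", "signed", "contract", "on", "monday"}


def is_valid_zip(field):
    """True if the field has the shape of a ZIP or ZIP+4 code."""
    return ZIP_PATTERN.fullmatch(field) is not None


def is_valid_ssn(field):
    """True if the field has the shape of a Social Security number."""
    return SSN_PATTERN.fullmatch(field) is not None


def flag_unknown_words(text):
    """Return the words in text that are not in the dictionary, in order."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if w.lower() not in DICTIONARY]


print(is_valid_zip("9O210"))        # letter "O" in place of a zero
print(is_valid_ssn("123-4S-6789"))  # letter "S" in place of a five
print(flag_unknown_words("The witness s1gned the contract"))
```

Flagged items would then be shown to a person for correction next to the page image, as described above.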

Some systems also let you review a fax image and decide whether to OCR the whole document into your application or clip out just a portion of the text to be OCR’d.

What do errors cost to clean up?

10 seconds to repair an error;
60 errors in a page that has been 97% accurately converted;
10 minutes to repair the page;
Typist at $24/hour (including benefits) = $4.00 per page to clean up. (Many service bureaus charge $1.00 to $2.00 to clean up a page that has been OCR’d.);
A 100-page document may cost as much as $400 to repair.
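The arithmetic behind these figures can be reproduced with a short Python sketch. The function names are made up for the example, and the 2,000 characters per page is an assumption: it is the page size implied by 60 errors at 97% accuracy, not a number stated in the list.

```python
def errors_on_page(chars_per_page, accuracy):
    """Estimate the number of wrong characters on a page."""
    return round(chars_per_page * (1 - accuracy))


def cleanup_cost(errors_per_page, seconds_per_error, hourly_rate, pages=1):
    """Return (minutes per page, cost per page, cost for all pages)."""
    minutes_per_page = errors_per_page * seconds_per_error / 60
    cost_per_page = hourly_rate * minutes_per_page / 60
    return minutes_per_page, cost_per_page, cost_per_page * pages


# Assumed page size of 2,000 characters, converted at 97% accuracy.
errors = errors_on_page(2000, 0.97)

minutes, per_page, total = cleanup_cost(
    errors_per_page=errors,
    seconds_per_error=10,
    hourly_rate=24.0,
    pages=100,
)
print(errors, minutes, per_page, total)  # 60 10.0 4.0 400.0
```

Changing the inputs shows how sensitive the cost is to accuracy: at 99% accuracy the same page would have about 20 errors, roughly a third of the cleanup time.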

One technique used by many firms is to OCR important documents and not clean them up. Even if the conversion rate is 70%, you still get hits on the 70% of the text that was recognized. One can also purchase full text software with sophisticated features, such as adaptive pattern or fuzzy searching, to locate words that were not accurately OCR’d but are similar enough to the search term to be found. In addition, firms will use an abstracted database along with full text searching. OCR is a critical component of any office and litigation support system.
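Fuzzy searching can be illustrated with Python's standard `difflib` module. This is a minimal sketch of the general idea, not the method used by any commercial full text product; the `fuzzy_hits` helper, the sample sentence, and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib


def fuzzy_hits(query, ocr_text, cutoff=0.8):
    """Return words in the OCR text that closely resemble the query."""
    words = ocr_text.lower().split()
    return difflib.get_close_matches(query.lower(), words, n=10, cutoff=cutoff)


# Uncorrected OCR output: the "o" in "deposition" was read as a zero.
ocr_text = "the dep0sition was taken in the morning"

print("deposition" in ocr_text.split())    # an exact search misses the word
print(fuzzy_hits("deposition", ocr_text))  # a fuzzy search still finds it
```

A lower cutoff finds more damaged words but also returns more false hits, so the setting is a judgment call for each document collection.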

OCR Features and Products. Some important OCR software features to consider:

Formatting - Should allow you to recognize tabs, margins, boldfacing, underlining, italics, different fonts, and varying font sizes.
Fonts – Should be able to recognize letters and numbers by font type. This increases the accuracy.
Preprocessing – De-skewing is the ability of OCR software to recognize and compensate for paper that has been improperly scanned in. It should also be able to recognize and delete boxes from forms.
Zoning and forms control – Allows you to define an unlimited number of “zones” or areas from which characters are used to create, index or populate a database. You draw boxes around these areas you define as zones. Zones are stored as templates.
Foreign language – Recognizes foreign languages.
Character files - Allows you to choose “numerical only” or “alpha only”, so it won’t confuse a “5” with an “S”.
Image file formats - Should allow you to convert common and uncommon image formats, such as TIFF, G3, G4, PCX, GIF, Bitmap, and more.
Output File Format - The translation output file should be able to be saved in a word processing, spreadsheet, database, or ASCII format.
Spell Checker and Context Check – The software should offer suggestions from a spell-check type dictionary for unrecognized words. It should also check letters and numbers in context, so that an “I” in a word won’t be mistakenly recognized as a “1”. The software should allow you to build supplemental dictionaries of names and technical words germane to the case you are working on.
Type Size - Can recognize characters as small as 4-point or as large as 72-point. Newspapers have 72-point headlines. Directories have a lot of 4-point type.
Control – Should allow you to control the scan resolution (dots per inch) and the number of pages to scan, and to scan now and OCR the documents later. Other considerations are whether you can OCR multiple pages and save all pages as a single file or as separate files. Can it scan a double-sided page and keep the pages in order?
Other features - unattended operation, multiple scan jobs, color support, one-click scanning, access from other programs, training, Windows XP and earlier version support, and toll free support.
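The “numerical only” setting and the context check in the list above can be sketched in Python as follows. This is a simplified illustration: the look-alike tables and the majority-vote rule in `fix_token` are assumptions made for the example, and the letter table favors uppercase, so a digit “1” inside a word always becomes a capital “I”.

```python
# Hypothetical tables of look-alike characters.
TO_DIGIT = str.maketrans({"S": "5", "O": "0", "I": "1", "l": "1"})
TO_LETTER = str.maketrans({"5": "S", "0": "O", "1": "I"})


def fix_numeric_field(field):
    """'Numerical only' zone: every look-alike letter becomes a digit."""
    return field.translate(TO_DIGIT)


def fix_token(token):
    """Context check: repair a token according to what most of it looks like."""
    digits = sum(c.isdigit() for c in token)
    letters = sum(c.isalpha() for c in token)
    if digits > letters:
        return token.translate(TO_DIGIT)
    if letters > digits:
        return token.translate(TO_LETTER)
    return token  # evenly split, so leave it for a person to review


print(fix_numeric_field("SS5"))  # 555
print(fix_token("9O21O"))        # 90210
print(fix_token("W1TNESS"))      # WITNESS
```

Restricting a zone to digits removes the ambiguity entirely, which is why defining the character set per zone is usually more reliable than guessing from context.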

Some products to consider: OmniPage™ (http://www.nuance.com/) and TypeReader™ by Expervision (www.expervision.com).
