OCR - Optical Character Recognition - Converting Documents to Full Text

Recognition software, referred to as OCR (optical character recognition) and ICR (intelligent character recognition), has steadily grown in importance as a subset of document imaging technology. Both technologies convert visually readable characters into ASCII text, which a computer can store, edit, and process. OCR, which was developed first, recognizes type fonts by pattern matching, character assessment and a crude learning process; ICR reads hand printing and, to a lesser extent, handwriting.

OCR is software that takes the letters or numbers on a page, once they have been captured as a bitmapped image, and converts them into computer-readable text known as ASCII. OCR software analyses the dots, or pixels, on a bitmap page and tries to figure out which dots form an “A” shape, which form an “8” shape, or some other graphical shape. It then compares the character to its library of pattern templates. If there is enough of a match, it sends the ASCII equivalent of that character to the output file. Converting these dots into individual letters in turn forms words, which can then be searched using full text software. OCR software works well with typed pages, but not as well with boxes, lines, or handwriting on the document.

A two-step process is involved in converting paper into full text. First, the paper is “scanned” using a scanner that converts the paper into a digital bitmap image, a collection of black and white dots. Then, OCR software reads the dots and converts them into alphabetic letters or numbers. The converted full text is then loaded into a full text program for searching. Images do not have to be converted immediately, but can instead be converted later if the information becomes important as a coder reviews the document.

OCR is gr~at, but how can I correct err~rs ~n the conver~~on to a f~ll text docu~e~nt?
The major issues in the conversion process are CONVERSION ACCURACY and COST. Depending on the type and format of the written documents, the conversion rate can easily fluctuate between 70% and 99%. This means that only 70% to 99% of the document is accurately converted into full text; the rest comes out as misspellings and unintelligible characters. The number of mistakes depends on the quality of the original paper document and the OCR software. If it is a first generation photocopy with clear type, like a deposition, then the conversion rate can be quite good, in the 95-99% range. If a 3rd or 4th generation photocopy has lines, smudges, handwriting etc., the general rule of thumb is that the conversion will end up 50-80% accurate. Conversion accuracy is improving, and with the use of spell checkers and ICR (Intelligent Character Recognition) it should increase over the next few years.

OCR and ICR software measures accuracy based on the mistakes it knows it made. When the software cannot decipher a character it will highlight the word and present the user with the percentage of errors it made. One of the major problems with OCR and ICR software is substitution errors, where the software is convinced it has read a character correctly when in fact it is wrong. When substitution errors are factored in, the accuracy rate drops.

Other factors reducing conversion accuracy:
Lookup tables, dictionaries and automated spell checkers help to correct errors. Lookup tables can check addresses, zip codes and social security numbers automatically. Some systems will highlight the imaged document and the full text errors for easy user correction. Some systems also let you view a fax image and decide whether to OCR the whole document into your application or clip out a portion of the text to be OCR’d.

What do errors cost to clean up?
One technique used by many firms is to OCR important documents and not clean them up. Even if the conversion rate is 70%, you still get hits on the 70% of the text that was recognized. You can also purchase sophisticated full text software features, such as adaptive pattern or fuzzy searching, to locate words which have not been accurately OCR’d but which will still be found because of their similarity to the search terms. In addition, firms will use an abstracted database along with full text searching. OCR is a critical component of any office and litigation support system.

OCR Features and Products

Some important OCR software features to consider:
Some products to consider: OmniPage™ (http://www.nuance.com/) and TypeReader™ by Expervision (www.expervision.com).
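
As an illustration of the fuzzy searching mentioned above, here is a minimal Python sketch. It is not the method used by any particular full text product. It assumes the uncorrected OCR output is available as a plain string and uses the difflib module from the Python standard library to find words that are close to a search term even though they were misrecognized. The function name fuzzy_hits, the 0.75 similarity cutoff and the sample text are hypothetical, chosen only for illustration.

```python
import difflib
import re


def fuzzy_hits(ocr_text, term, cutoff=0.75):
    """Return the words in ocr_text whose similarity to term is >= cutoff.

    Hypothetical helper for illustration; the cutoff value is an assumption
    and would need tuning against real OCR output.
    """
    # Split the text into words, keeping digits and "~" because OCR
    # substitution errors often put them in the middle of a word.
    words = re.findall(r"[A-Za-z0-9~]+", ocr_text)
    term = term.lower()
    hits = []
    for word in words:
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in sequence.
        ratio = difflib.SequenceMatcher(None, term, word.lower()).ratio()
        if ratio >= cutoff:
            hits.append(word)
    return hits


# Sample text with two substitution errors ("0" for "o", "l" for "i").
sample = "The dep0sition was taken on Monday. The depositlon transcript is attached."
print(fuzzy_hits(sample, "deposition"))  # ['dep0sition', 'depositlon']
```

An exact-match search for "deposition" would miss both misrecognized words in the sample; the similarity threshold is what lets an uncorrected document still produce hits. A lower cutoff finds more damaged words at the price of more false matches.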