Form Processing in India
At VServe, we follow these steps in forms processing.
Just as paper documents must be prepared before they are fed into a scanner (by removing staples, smoothing wrinkles, positioning them for optimal registration, and so on), the image of a form document must be prepared through the following steps before it can be intelligently recognized:
Document scanning
Pages of forms are scanned and converted into bit-mapped images (usually TIFF), which are either compressed and stored for later batch processing or passed immediately in an uncompressed format to an ICR engine for recognition.
Image analysis
The document image is cleaned up: character image quality is improved using image enhancement techniques, and background "noise" is removed from the form.
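As an illustration of background noise removal, here is a minimal sketch of one simple technique, despeckling, which clears ink pixels that have no neighbouring ink. It is not a description of any particular ICR product: the function name `despeckle` and the list-of-rows image representation are assumptions made for the example, and real engines use more elaborate filters.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated ink pixels cleared.

    `image` is a list of rows; 1 means ink, 0 means background.
    (Hypothetical representation chosen for this sketch.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Count ink pixels among the (up to) eight surrounding pixels.
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbours += image[ny][nx]
            if neighbours == 0:
                cleaned[y][x] = 0  # a lone speck is treated as noise
    return cleaned


page = [
    [0, 0, 0, 0, 1],  # the lone pixel at the right is a speck
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned_page = despeckle(page)
```

Here the 2x2 block of ink survives because each of its pixels touches other ink, while the lone pixel in the top row is cleared.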
Form alignment
The image is registered and deskewed by the ICR software, which automatically aligns the form by locating special symbols on the document, called registration marks, and using them as guides.
Form identification
The document is identified by certain predefined characteristics that the ICR software is trained to look for, so that the zones containing the fields designated for recognition can be located by a customized, predefined ICR template. Form ID attributes can include form numbers, corporate logos, or the name of the form itself imprinted somewhere on the form.
Form background removal
This stage is not necessary if the form was originally printed in a colored ("drop-out") ink that is invisible to the scanner being used. If such ink is not used, the form image may contain lines, boxes, fine print, and other form attributes (passive data) that tend to confuse the ICR engine. These form attributes must be removed from the image of the form so that only the character images (the active data) are left behind. Broken and fragmented characters are automatically repaired and restored to their original shapes.
Character field location
The predefined ICR template automatically locates the fields that contain character data. The template identifies which individual fields on the form image require character recognition and what the nature of each field is: hand print, machine print, numeric, alphabetic, alphanumeric, etc. The template also identifies which areas are barcode or check box recognition zones.
Character segmentation
Sophisticated software routines analyze, separate, and break down the character fields into isolated characters. If the form is "ICR-friendly," characters are segmented with the aid of graphic devices such as boxes, tick marks, and connected boxes called "combs" that force the form user to separate the characters legibly from one another.
Character classification
Individual characters are classified by ICR algorithms according to their ASCII category and assigned a confidence value, which is an index of how "certain" the ICR engine "feels" about the selection it has made. Alternate character choices are ranked according to those values so that they can be incorporated into editing procedures that improve ICR accuracy. For example, the alternate choice "1" might be used instead of the first-ranked choice "I" when contextual analysis reports that the field is all-numeric.
Post-processing
The initial or "raw" recognition results are validated using edit procedures such as grammatical rules, spell-checkers, dictionaries, check-sum routines, and look-up tables. Ambiguous and erroneous data fields (the "rejects") are identified and sent to data entry operators at workstations for manual correction.
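The interplay between ranked alternate choices, contextual edits, and rejects described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of any specific ICR engine: the names `pick_character` and `read_field`, the `(character, confidence)` tuple layout, the confidence figures, and the 0.30 reject threshold are all hypothetical.

```python
def pick_character(candidates, numeric_only):
    """Choose the best candidate allowed by the field type.

    `candidates` is a list of (character, confidence) pairs for one
    character position (hypothetical format for this sketch).
    Returns the chosen pair, or None if no candidate fits the field.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if numeric_only:
        # Contextual edit: in an all-numeric field, skip non-digit choices.
        ranked = [pair for pair in ranked if pair[0].isdigit()]
    return ranked[0] if ranked else None


def read_field(field_candidates, numeric_only=False, min_confidence=0.30):
    """Return the recognized text, or None to mark the field as a reject.

    `min_confidence` is an assumed threshold, not a recommended value.
    """
    chars = []
    for candidates in field_candidates:
        choice = pick_character(candidates, numeric_only)
        if choice is None or choice[1] < min_confidence:
            return None  # reject: route the field to a data entry operator
        chars.append(choice[0])
    return "".join(chars)


# One list of ranked alternates per character position (made-up values).
field = [
    [("I", 0.55), ("1", 0.41), ("l", 0.04)],
    [("7", 0.93)],
    [("0", 0.88), ("O", 0.10)],
]
as_numeric = read_field(field, numeric_only=True)
as_text = read_field(field)
rejected = read_field([[("B", 0.25), ("8", 0.20)]], numeric_only=True)
```

With these made-up values, the all-numeric reading selects the alternate "1" over the first-ranked "I", while the last field is rejected because its only digit candidate falls below the assumed threshold.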
Manual correction of rejected character fields
The manner in which rejected data is presented to the data entry operator for correction can dramatically affect both the speed and the accuracy of the reject repair process. In particular, the data entry GUI is important because the ergonomics of data entry are what enable a given data entry operator to reach his or her maximum correction speed.
What is interesting in forms processing is that only one of the steps, character classification, is specifically concerned with identifying character data. The rest of the steps have to do with either preparing the imaged characters for classification or interpreting the results of character classification. With the opportunity for error accumulating at each successive step, it is remarkable that ICR accuracy rates can attain (and sometimes exceed) human performance levels.
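To make the point about accumulating error concrete, here is a small arithmetic sketch. It rests on two simplifying assumptions that are not claims about any real system: that errors at each step are independent, and that every step is 99% accurate (an invented figure). Under those assumptions the end-to-end accuracy is the product of the per-step accuracies.

```python
import math


def end_to_end_accuracy(stage_accuracies):
    """Multiply per-stage accuracies, assuming independent stage errors."""
    return math.prod(stage_accuracies)


# Ten steps, from document scanning through manual correction,
# each given the same invented accuracy of 99%.
stages = [0.99] * 10
overall = end_to_end_accuracy(stages)
print(f"{overall:.3f}")  # prints 0.904
```

Even with every step at 99% in this hypothetical, roughly one item in ten would be affected somewhere along the way, which is why each preparation step matters as much as classification itself.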