Using a Neural Network Character Classifier
by Scott Teresi
In studying methods of handwritten character recognition, what better system to investigate and model than one which is already very successful at such a taskthe human brain. The visual cortex contains 10 billion neurons, each with at least a thousand synapses (Simon, 176). Indeed, such a fantastic network is made up of smaller, modular networks which have developed over time to perform specific tasks. These multiple regions function in parallel and interact to form a robust system for pattern recognition. Accidental malfunction or destruction of certain sections of this area will result in unevenly impaired visual recognition. People with damaged regions of their visual cortex may find that they can recognize letters but not entire words, or specific objects but not an entire scene full of objects. Many character recognition systems appear to suffer from similar maladies in that they can perform one segment of the overall task well but are unable to fully duplicate the richness of the humans character recognition ability. Newer models exploit the same feedback and interaction between independent systems as is present within the visual cortex and provide the diversified processing power needed in order to function in a more robust manner.
Performance of single-algorithm systems drops precipitously as the quality of input decreases. In such situations, a human subject can continue to perform accurate recognition, showing only a gradual decrease in reliability (Wang, 84). Collaboration between separate algorithms proves beneficial, in that such systems will allow a gradation of recognition levels expressed as probabilities or loose guesses to be passed from one level to the next. More specifically, a front-end system will perform some useful first-order basic processing. Then a second level of processing will be engaged which will judge whether to assimilate the results of the first process, extend them and proceed to the next stage with a positive recognition, or to dismiss them and reinvoke the first level again while asking for modifications.
The multiple-layered system which makes up any robust handwriting recognizer has progressed greatly from the days when character recognition meant reading printed numerals of a fixed-size OCR-A font. However, only recently have the successes within the field approached the level of a truly practical handwriting recognizer. Various accepted methods will be outlined and compared to one of the first commercially viable general handwriting recognition products, the Apple Newton.
Components of a Handwriting Recognizer
What kind of magic is required for a computer to become a competent handwriting recognizer? First, an image containing text must be appropriately supplied and preprocessed. Next, the text must either undergo segmentation or feature extraction. Small processed pieces of the text will be the result, and these must undergo recognition by the system. Finally, contextual information should be applied to the recognized symbols to verify the result. The following table briefly outlines these major areas involved in the process.
Table: steps involved in handwriting recognition.
Once the input data has been normalized to some degree, the most challenging field of recognition takes oversegmentation. Two main strategies of segmentation can be applied. The bottom-up, or analytical approach builds words out of the recognition of their component characters. This is the more traditional meaning of segmentation, where letters (or parts of wordsword units) are intelligently separated from the rest of the word. On the other hand, top-down, or holistic models attempt to recognize the attributes present within words. They operate on a higher level of abstraction, extracting features from words rather than partitioning the word and attempting to recognize each part of it.
Segmentation can be combined with the removal of noise and extraneous non-textual strokes, such as when characters are meant to be printed neatly within individual boxes. This requires a quite simpler segmentation procedure than, say, cursive word partitioning. In address and zip code recognition, however, removal of noise such as postal symbols and markings on an envelope is not a trivial matter. One method accomplishes this by following paths of minimal curvature change"good" continuationto intersect the boundaries for segmentation (Impedovo, 10).
Segmentation can also be a byproduct of the character recognition algorithm. This "implicit" segmentation occurs as symbols are tentatively drawn from pieces of the word in question without as yet committing to the exact size of the piece. Contextual or statistical information is then used to determine a match for the tentative symbol. Segmentation is more a byproduct of this process than an algorithm in itself. Segments are the result of recognition, rather than vice versa (Lecolinet, 247).
Methods of analytical or explicit segmentation involve making more rigid guesses for the boundaries of characters. Vertical and horizontal histograms can be used to discover variations in thickness which correspond to boundaries. Also, upper or lower contours can be used to pinpoint them (Impedovo, 12). On-line character recognition can provide invaluable temporal information to aid in segmentation. Some methods even attempt to extract such information from off-line texts.
After initial segmentation, individual symbols are fed to a contextual post-processing unit which then recognizes them. The recognition step generally can modify the boundaries produced by what would then be called the initial process of "loose segmentation." No segmentation method based upon pixels alone can possibly be foolproof. The contextual analysis stage is extremely essential in resolving the unavoidable ambiguities and errors of recognition and adjusting further segmentation attempts.
The holistic approach to segmentation attempts to address these problems more directly. Instead of recognizing characters individually, it mimics the way in which a human may perceive text. This scheme can sometimes tolerate dramatic amounts of deformation within words, as is often seen in cursive script. However, it is greatly dependent upon its prescribed lexicon of words, as they are the units by which the objects of recognition are compared. How can this model be extended to support the nearly infinite variety of inputs which could conceivably be handled by bottom-up approaches? One way to generalize the top-down approach is to provide rules for deriving the lexicon from a database. Representations for new words can then be generated by a reconstruction model which uses generic information about character and ligature formations in the database (Lecolinet, 239).
Given the complexity of an entire word as opposed to a single character, how can the holistic approach manage to match a word shape to an item within the lexicon? Early techniques attempted to detect the "middle zone" of a word and then note where ascenders and descenders and other such notable features were located. However, these comparison techniques were not robust enough to withstand large word deformations. More advanced practices use dynamic programming, which computes and optimizes the distance between the given word and another in the lexicon, provided a set of simple transformations (Lecolinet, 243). Hidden Markov Models are also worthy candidates for word as well as letter matching. They construct probabilistic models of the structure from a set of unknown primitives. HMMs are more analytic than holistic, however, being derived from lower-level letter or intra-letter states. Given this arsenal of tools, todays broad range of multi-layered handwriting recognition systems show great promise in attaining levels of recognition approaching that of humans in some cases.
Introducing the Newton
Apples Newton MessagePad has found success in on-line printed handwriting recognition using the Apple Print-Recognizer. The APR incorporates a three-step bottom-up approach: (1) character-level analysis and loose segmentation, (2) classification, and (3) context-sensitive search (Yaeger, 74). An artificial neural network serves as a classifier for each character. The top finishers within its output vector of probabilities is then entered into a search engine within a lexical context which provides as a result a best guess of the most likely word given the past string of probable characters
To manage the near-impossibility of providing segmentation without context to the neural net, the net must consider many possible segmentations; final decisions are deferred until the search stage. As the user inputs characters, each individual stroke is enumerated and grouped with its neighbors in every possible combination to be fed into the network. The networks possible classifications are finally sent to the search engine which simply looks up the minimum-cost path through its dictionary, obeying the legal transitions between the tentative segmentations which were tried. Grammars for changes in case, the improbability of numbers embedded within words, the use of punctuation, and common patterns such as phone numbers dramatically help improve accuracy beyond the dictionary look-up.
A Neural Network Character Classifier
Neural networks were chosen for the classification engine of the Apple Print-Recognizer based on their extensive history in outperforming other recognition approaches. Great care was taken in generating the networks structure and preparing its input to form a representation most useful for the network. A fairly standard multilayer feedforward network trained by backpropagation was chosen but modularized into two parallel recognizers which were then joined at the final output layer (Yaeger, 75). One half of the network accepted an anti-aliased image of a stroke feature and the strokes enumeration number. Beneath this were two fully-connected hidden layers of 72 and 104 elements each. Since networks respond best to smoothly varying inputs, anti-aliased image data was chosen for input because of the clear improvement over straight binary data.
The other half of the network accepted a 14x14-pixel anti-aliased image of a character, and this was then connected to eight partially-connected hidden layers. These layers selectively accepted activations from the input layer in order to become more sensitized to certain features of the input character. For instance, a hidden layer matrix measuring 1x7 would take the average pixel value over every odd horizontal line of the image in the input layer. This would generate a sort of vertical histogram of the pixels. Other matrices likewise favored the top, bottom, left, and right sides of the input image. Also, 5x5 and 7x7 grids were included to observe features of the input existing on larger scales. These parallel hidden layers were fully connected with one large hidden layer of 112 units.
The resulting outputs from both independent halves of the network were merged together simply by the networks architecture, into one final output layer expressing a probability for each of 95 discrete characters. These outputs would decide the possible candidates for the next character in the current word.
The networks second- and third-choice outputs were very poor, however, as a result of the mean-squared error minimization which occurs in backpropagation. Also the training sets consisted entirely of 0s except for a 1 marking the correct output, and indeed the network attempted to mimic this behavior at all costs. To resolve these issues, the backpropagation error for non-target nodes was reduced by normalizing the error seen at a given output relative to the error seen at the target node. (Error at the target output remained unchanged.) As a result of this training modification, the networks outputs gracefully degraded away from the top-choice classification and hence provided more useful second- and third-choices (Yaeger, 77). While this method slightly reduced the accuracy of the primary character classification, the net effect was always an increase in word accuracy.
Besides reliably producing the correct classification for well-segmented inputs, the classifier network must also carry the weight of processing input patterns resulting from poor segmentation. Consequently, an integral part of network learning was "negatively training" in which badly-segmented input and a target output of all zeros was applied. However, what mechanism would prevent the network from immediately unlearning its correct classification of characters such as "l" or "o" when such inputs could conceivably be presented as examples of incorrect segmentation? Two methods reduced the undesirable impacts of negative training: reducing the learning rate and discarding 70% to 95% of all possible negative examples. Again, this may seem contradictory to the goal of developing a set of weights for the network which produces accurate classifications, but the tradeoff results in significant increases in whole-word recognition.
Several other mechanisms helped increase the APRs robustness. Repeatedly applying the same input images after varying combinations of scaling, skewing, and rotation provided improved generalization of input, decreasing its dependency upon the handwriting supplied by the training data set. Symbols which occur very frequently in the English language were applied to the network less often. Throughout the entire training process, the learning rate was moderated by a type of simulated annealing. At the end of any epoch whose total squared error increased on the training set over the previous epoch, the learning rate was decayed by a factor of 0.9. This adjustment happened to take place every few dozen epochs, allowing a starting learning rate of nearly 1.0. Judging by the favorable results, this apparently aided a great deal in the escaping of local minima.
Operated by a meticulous and trained person, the Apple Print-Recognizer can achieve 100% accuracy in hand-print recognition. Only 80% accuracy is yielded by uninformed complete novice users. Many users, however, have unscientifically reported accuracies above 95% and up to 98%, proving this technology extremely workable. Many of the techniques of APR which force the network to allocate resources to underrepresented stimuli proved extremely valuable to the entire systems high accuracy. Some of these design choices included output error normalization for otherwise stifled second- and third-choice outputs, biasing against the most often occurring inputs, and reduced-probability negative training. These techniques acted together to temper the inherent inclination of neural networks to closely converge upon the trained output. Instead, the network was steered toward more balanced results which a context grammar could dependably exploit. The advantages of the multi-layered model have been shown through the robust performance exhibited by this practical commercial solution to on-line handwriting recognition.
About this paper
I wrote this paper while earning my Masters of Computer Science degree at the University of Illinois at Urbana-Champaign, May 1998. This was my final assignment for CS 442: "Artificial Neural Networks," taught by Dr. Sylvian Ray.
Unfortunately I don't have any additional information on handwriting recognition other than what appears in this paper. I can only recommend checking out the books listed below. I didn't write any actual algorithms which could recognize handwriting.
© 2000 Scott Teresi. This text is available on my web site: http://www.teresi.us
Please notify me if you re-distribute this paper. I would be glad to hear it!
Dori, Dov, and Alfred Bruckstein, ed. Shape, Structure and Pattern Recognition. New Jersey: World Scientific Publishing Co., 1995.
Gorsky, N.D. "Off-line Recognition of Bad Quality Handwritten Words Using Prototypes." Fundamentals in Handwriting Recognition. Ed. Sebastiano Impedovo. New-York: Springer-Verlag, 1994.
Impedovo, Sebastiano. "Frontiers in Handwriting Recognition." Fundamentals in Handwriting Recognition. Ed. Sebastiano Impedovo. New-York: Springer-Verlag, 1994.
Licolinet, Eric, and Olivier Baret. "Cursive Word Recognition: Methods and Strategies." Fundamentals in Handwriting Recognition. Ed. Sebastiano Impedovo. New-York: Springer-Verlag, 1994.
Simon, J.C. "On the Robustness of Recognition of Degraded Line Images." Fundamentals in Handwriting Recognition. Ed. Sebastiano Impedovo. New-York: Springer-Verlag, 1994.
Wang, Patrick Shen-Pei. "Learning, Representation, Understanding and Recognition of Words - An Intelligent Approach." Fundamentals in Handwriting Recognition. Ed. Sebastiano Impedovo. New-York: Springer-Verlag, 1994.
Yaeger, Larry S., Brandyn J. Webb, and Richard F. Lyon. "Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the Newton." A.I. Magazine. 19(1): 73-89, 1998 Spring.
Young, Tzay Y., and King-Sun Fu, ed. Handbook of Pattern Recognition and Image Processing. New York: Academic Press, Inc., 1996.
Home | Contact Page
Professional Portfolio | Resume
Essay/Opinion | Photo Gallery | Digitally Altered | Video Editing | Making Music | Programming
Traveling | Skiing | Urban Exploration
About Me | Friends/Family
Jokes | Sound Clips | Links