Farsi Handwriting Recognition

Handwriting recognition is the ability to transform handwritten text, traditionally kept on paper, into electronic data. There are two ways to store handwriting in digital form. In offline handwriting, the writing is scanned and stored as an image. In online handwriting, by contrast, the data is a sequence of x-y coordinates sampled while the pen touches the writing surface. Thanks to the advent of tablet and smartphone technologies, handwriting can now be captured with much higher precision. A basic stroke is a single calligraphic mark moving in one direction across a writing surface. Online handwriting technology also makes some extra information available: the number and order of strokes, and the tilt and even the pressure of the digital pen at each point.
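
As an illustration of the online representation described above, the following minimal Python sketch stores pen samples and groups them into strokes. The names PenSample, Stroke and split_into_strokes are hypothetical and not part of any toolkit mentioned here. The sketch assumes a simplification: a stroke is taken to be one pen-down to pen-up run, so further segmentation would be needed to obtain basic strokes that move in a single direction.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class PenSample:
    """One sampled pen position; pressure and tilt are optional extras."""
    x: float
    y: float
    pressure: Optional[float] = None
    tilt: Optional[float] = None


@dataclass
class Stroke:
    """Samples recorded between one pen-down and the next pen-up."""
    samples: List[PenSample] = field(default_factory=list)


def split_into_strokes(events: Iterable[Tuple[float, float, bool]]) -> List[Stroke]:
    """Group (x, y, pen_down) events into strokes, one per pen-down run."""
    strokes: List[Stroke] = []
    current: Optional[Stroke] = None
    for x, y, pen_down in events:
        if pen_down:
            if current is None:
                # The pen has just touched the surface: start a new stroke.
                current = Stroke()
                strokes.append(current)
            current.samples.append(PenSample(x, y))
        else:
            # The pen is lifted: the current stroke, if any, is finished.
            current = None
    return strokes
```

The number of strokes and their order then follow directly from the returned list, which is the kind of extra information an offline image does not provide.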

Farsi Handwriting

The contemporary Farsi writing system was derived from the Arabic writing system around 940 AD. Both are categorized as Abjad writing systems, in which every word is made up of a right-to-left chain of symbols representing only the consonants of syllables. The most notable feature of an Abjad writing system compared with the Latin alphabetical writing system is the variability of letter shapes in cursive handwritten text. The Farsi writing system is built on a 32-letter alphabet. The letters are grouped as sticky or blunt. Sticky letters are written in a connected shape when placed inside a word, whereas blunt letters break the chain even when placed inside a word.
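
The effect of blunt letters can be sketched in a few lines of Python. The sketch below splits a word into its connected parts, assuming that the blunt letters are the standard non-joining letters of the Persian script (alef, alef with madda, dal, zal, re, ze, zhe and vav). This set and the function name connected_parts are assumptions made for illustration and are not taken from the grammar files distributed on this page. The zero-width non-joiner, which also breaks a chain, is not handled.

```python
# Letters that never connect to the letter that follows them
# (assumed here to be the "blunt" letters of the text).
NON_JOINING = {"ا", "آ", "د", "ذ", "ر", "ز", "ژ", "و"}


def connected_parts(word: str) -> list:
    """Split a word (in logical order) into its connected letter chains."""
    parts = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_JOINING:
            # A blunt letter ends the chain, even in the middle of a word.
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts
```

For example, a word whose third letter is alef is written as two separate chains: the first three letters joined together, followed by the remaining letters.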

Stroke Based Wordnet

One of the most noticeable challenges in developing a practical Farsi Handwriting Recognition System (HRS) is the unpredictable number of indivisible shapes of letter combinations. A common yet efficient approach to this problem is a two-phase framework: first the handwritten sample is classified with respect to its shape descriptor features using methods of image processing and pattern recognition, and then the shape-classified strokes are mapped to vocabulary words using linguistic processing.
We develop a wordnet based on Farsi handwritten stroke categories. Like a mid-level compiler, a stroke-level wordnet translates sequences of strokes into lists of the most relevant words. This wordnet can be incorporated into a Farsi HRS that operates through independent modules for stroke shape classification and word inference.
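
The second phase can be pictured with a small Python sketch that maps a sequence of stroke classes to candidate words. This is only a simplified exact-match lookup under stated assumptions, not the HTK word network itself, which is a lattice searched by a decoder. The stroke labels (S1, S2, S3), the placeholder words and the function names are hypothetical.

```python
from collections import defaultdict


def build_stroke_index(lexicon: dict) -> dict:
    """Invert a word -> stroke-sequence lexicon into stroke-sequence -> words."""
    index = defaultdict(list)
    for word, strokes in lexicon.items():
        index[tuple(strokes)].append(word)
    return dict(index)


def candidates(index: dict, observed) -> list:
    """Return every word whose stroke sequence equals the observed one."""
    return index.get(tuple(observed), [])


# Hypothetical lexicon: two words share the same stroke-class sequence.
LEXICON = {
    "word_a": ("S1", "S2"),
    "word_b": ("S1", "S2"),
    "word_c": ("S3",),
}
INDEX = build_stroke_index(LEXICON)
```

Because several words can share one stroke sequence, the lookup returns a list rather than a single word, which is why the word inference module has to rank or disambiguate the candidates.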

HTK

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has also been used for numerous other applications, including research into speech synthesis, character recognition and DNA sequencing.

Stroke-level Wordnet

Here, we have compiled a stroke-level wordnet for Farsi words by applying a translational grammar of the Farsi writing system rules to 54,350 words of the Persian Generative Lexicon. The compiled wordnet and the required grammar and dictionary files, in an HTK-compatible format, are available via the following links:

       Word to Stroke Grammar [ grammar.rar ]
       Word to Stroke Wordnet [ wordnet.rar ]
       Stroke to Base_Glyph Dictionary [ dict.rar ]

The wordnet can be reproduced simply by using the HParse utility in the HTK package.
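
As a rough sketch of that step, HParse takes a grammar file and the name of the word network file to write. The Python helper below builds and runs that command. The file names grammar.txt and wordnet.slf are assumptions, since the actual names inside the archives are not stated here, and the helper functions themselves are hypothetical wrappers rather than part of HTK.

```python
import shutil
import subprocess


def hparse_command(grammar_path: str, wordnet_path: str) -> list:
    """Build the argument list for: HParse <grammar file> <output network>."""
    return ["HParse", grammar_path, wordnet_path]


def run_hparse(grammar_path: str, wordnet_path: str) -> None:
    """Run HParse if it is installed; raise a clear error otherwise."""
    if shutil.which("HParse") is None:
        raise FileNotFoundError("HParse was not found on PATH; install HTK first")
    subprocess.run(hparse_command(grammar_path, wordnet_path), check=True)


# Hypothetical file names; replace them with the files extracted from the archives.
COMMAND = hparse_command("grammar.txt", "wordnet.slf")
```

Calling run_hparse with the extracted grammar file should write the word network to the given output path, provided the HTK binaries are installed.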

Reference

Please cite the following paper if you use the stroke-level wordnet,

Contact

Ali Katanforoush, <a_katanforosh@sbu.ac.ir>, Department of Computer Science, Shahid Beheshti University, G.C., Tehran.

Time-stamp: "2015-10-23 10:20:24 katanforoush"