Algorithm for Segmentation of Urdu Script

Segmentation of script plays a critical role in script recognition. It is critical to understand the script used in writing a document before developing or using a model to recognize it. Earlier approaches use chain codes, among other techniques. In the ligature model, a word model is used at document, page and word level for segmentation. Our algorithm for segmentation of Urdu script uses a character model and a Hidden Markov Model (HMM) to enhance previous work. We extract features from images and calculate the maximum likelihood to match characters, in an inference algorithm, with a feature extracted from a text sample. The main features of the system are pre-processing, connected component analysis, recognition and segmentation of text up to character level. The algorithm provides a means to implement an Urdu OCR system on the basis of the character model.

Keywords: Preprocessing, Segmentation of characters, character model, Optical character recognition (OCR), max and argmax.

Introduction

We use an OCR system / scanner to acquire images of text [1]. In preprocessing, the image is converted to a noise-free black-and-white (B/W) image.

1.1 Segmentation

Segmentation is splitting an image into smaller segments or pieces [2]. Segmentation occurs on two levels. At the first level, text and graphics are separated for further processing. At the second level, segmentation is performed on the text to separate paragraphs, words, characters, etc. Segmentation of text can be performed at document, page, paragraph and character levels [3]. Various segmentation approaches have been suggested [4], namely:

Holistic Method

Segmentation based approach

Segmentation free approach

In the holistic method the whole word is classified using a dictionary; the features of the test input are matched against trained prototypes [5]. The limitation is that the method does not work well for larger classes and can only be used together with the other two methods. The segmentation based approach divides a word into smaller segments: the image of the word is broken up into several entities called characters [4]. Segmentation depends on human intuition. In the segmentation free approach a character model can be used to concatenate characters and form words. For instance, a segmentation free approach can be based on a Hidden Markov Model (HMM), which is a stochastic model.

1.2. Urdu Language and Text Segmentation

Urdu is a cursive (written with the characters joined) writing language. Urdu characters are similar in shape and have curves that make them hard for a machine to recognize. Furthermore, Urdu has more than one symbol to represent a character. Due to its cursive nature, characters / scripts in Urdu are hard for a computer program to recognize, so a very accurate technique is needed. Urdu characters have four simple shapes:

Basic Symbols (38 Symbols)

Table 1 shows the basic symbols / shapes for the Urdu language.

Beginning Symbols (26 Symbols)

Table 2 shows the beginning symbols / shapes for the Urdu language.

Mid Symbols (40 Symbols)

Table 3 shows the mid symbols / shapes for the Urdu language.

Other Symbols

These include symbols for numbers and special symbols like zabar, zair, paish, etc.

The symbol tables, Table 1, Table 2, Table 3 and Table 4, for the Urdu language are given below:

Table 1. Basic Symbols

Table 2. Beginning Symbols

Table 3. Mid Symbols

Table 4. Other Symbols

We used the Urdu script Nastaliq for our work. We extracted images for the Urdu character set (basic, beginning, mid and other symbols) using an available Nastaliq font.

Literature Review

In a structural approach to script identification, stroke geometry has been utilized for script characterization and identification [6]. Individual character images in a document are classified either by prototype classification or by a support vector machine. Ligatures are used for segmentation / recognition of Urdu characters. A ligature is a sequence of characters in a word separated by non-joiner characters such as space.

The approach in [1] uses a ligature model and is divided into two stages:

Line Segmentation

Line segmentation deals with the detection of text lines in the image. The image is scanned horizontally from right to left, top to bottom, in search of a text pixel. Afterwards, it is determined whether this pixel belongs to a primary ligature or a secondary ligature, as shown in Fig 1. The Freeman chain codes (FCC) of the ligature are compared with the already calculated FCC of the secondary ligatures.

Character Segmentation

The text is skeletonized and a label matrix is constructed that contains the identifiers of all ligatures in the image. The position of individual characters in a word is determined. Segmentation is done using primary ligatures only.

Fig 1. (a) Urdu word (b) Seven ligatures (c) Three primary ligatures (d) Four secondary ligatures [7].
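The Freeman chain codes used in the line segmentation stage can be illustrated with a small sketch. It encodes an ordered list of 8-connected boundary pixels as direction codes 0 to 7. The direction numbering (0 = east, counting counter-clockwise, with image rows growing downward) and the function name `freeman_chain_code` are assumptions made for illustration; the method in [1] may use a different convention.

```python
# Assumed convention: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE.
# Offsets are (d_row, d_col); rows grow downward, so "north" is d_row = -1.
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def freeman_chain_code(points):
    """Encode consecutive (row, col) boundary pixels as 8-direction codes."""
    codes = []
    for (r0, c0), (r1, c1) in zip(points, points[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes


# A short hypothetical stroke: right, up-right, up, left.
stroke = [(2, 0), (2, 1), (1, 2), (0, 2), (0, 1)]
chain = freeman_chain_code(stroke)
```

Two ligatures can then be compared by comparing their code sequences, which is the role the FCC plays in the description above.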

The limitations of the method are as follows. First, segmentation is performed on the basis of primary ligatures only; hence it will not differentiate between seen and sheen, because it ignores secondary ligatures, i.e. dots. Second, the dictionary of images stored for training will be huge. Third, there are problems of over-segmentation and under-segmentation. In [8], a ligature and word model is proposed for Urdu word segmentation. It works in three phases:

In the 1st phase, data is collected. Ligatures are identified and word probabilities are calculated using a probabilistic measure. From the input set of ligatures, all sequences of words are generated and ranked using lexicon lookup.

In the 2nd phase, the top k sequences are selected for further processing using a chosen beam value. A valid-words heuristic is used for the selection process.

In the 3rd phase, the most probable sequence among these k word sequences is selected. The method uses a dictionary of ligatures / words and chain codes, and it uses the HMM toolkit HTK to find the most probable sequences and recognize a word / ligature. The authors recommend that their work can be further improved by using the character model for Urdu text segmentation [9].

Poor segmentation leads to poor recognition [10]. The approach in [11] divides the image into smaller blocks, checks them for uniformity, groups uniform blocks using colour similarity and identifies text in these blocks. In [12], edge density based noise detection is used to segment out text areas in video / images. Segmentation of an image into text and non-text regions affects performance in OCR development [13]. In [14], a line segmentation method using histogram equalization is proposed, various problems are indicated, and text lines are segmented into ligatures using chain codes. A bounding box based approach is presented in [15] for segmentation of tables of contents in Urdu script. In [16], horizontal and vertical projection profiles are analyzed for line and character segmentation; misclassification occurs at character level. In [17], text lines are extracted using vertical projection, marking all points where pixel values are not found, and text lines are segmented into ligatures using stroke geometry. In [18], partial words (i.e. connected components) are identified in a text line, and horizontal / vertical projections are used to identify words by relative distance matching. A dictionary is used in [19] for text line and ligature segmentation in online text.
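Several of the works above rely on projection profiles for line segmentation. As a minimal sketch of that idea, assuming a clean binary image where 1 marks an ink pixel, the code below sums the ink in each row and reports every run of non-empty rows as one text line. The function name `segment_lines` and the list-of-lists image format are illustrative assumptions, not taken from any of the cited papers.

```python
def segment_lines(binary_image):
    """Return (start_row, end_row) pairs, end exclusive, one per run of inked rows."""
    # Projection profile across each row: number of ink pixels per row.
    profile = [sum(row) for row in binary_image]
    lines = []
    start = None
    for r, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = r                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, r))       # a blank row closes the line
            start = None
    if start is not None:                  # the image ends inside a text line
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
line_ranges = segment_lines(page)
```

The same routine applied to columns instead of rows gives the projection along the other axis. In this idealized form, lines are only separated where at least one completely blank row lies between them.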

Problem Statement

Previous work has the limitation that it cannot correctly perform segmentation in a few cases, which leads to misclassification problems. Furthermore, it can recognize only a limited set of connected components or ligatures.

Proposed Segmentation Algorithm

We enhance previous work by proposing an improved algorithm for Urdu script segmentation that uses a character model. For this purpose we have created a set of characters. There are about 114 characters, excluding some special characters like zabar, zair, paish, etc. We have used characters of fixed size and style in this work. We use all the variations of each character in a writing style, e.g. bay has three shapes: a basic, a beginning and a mid shape. Our algorithm uses a character model with Hidden Markov Models (HMMs) for segmentation of Urdu text. To the best of our knowledge, this work has not been done previously. We work with offline text, i.e. scanned, pre-processed B/W Urdu characters, and we use Matlab ver. 7.12 as the programming tool.

4.1 Our Method

Our method is divided into three broad steps:

Step # 1 Data Acquisition / Feature Extraction:

In the first step, the algorithm transforms images of symbols into binary form as a matrix. It then extracts features from the images using our feature extraction program and stores them on disk. These features are represented as hidden states: X(i) = { x(0), x(1), . . . , x(k) } where each X(i) represents a feature (in matrix form) for each shape in the Urdu character set; x(k) is a position vector in the matrix X(i).
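The conversion of a symbol image into a binary matrix can be sketched as follows. The fixed threshold of 128, the cropping step and the names `binarize` and `crop_to_ink` are assumptions for illustration only; the feature extraction program used in this work is not described in enough detail to reproduce, so this is a stand-in rather than the actual method.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 0/1, where 1 is a dark (ink) pixel."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_to_ink(binary):
    """Trim empty border rows and columns so the matrix tightly bounds the glyph."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []  # no ink at all
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


# Hypothetical 4 x 4 grayscale patch: white background (255) with a few dark pixels.
gray_patch = [
    [255, 255, 255, 255],
    [255,   0,  30, 255],
    [255, 200,  10, 255],
    [255, 255, 255, 255],
]
feature = crop_to_ink(binarize(gray_patch))
```

Each resulting matrix plays the role of one X(i); its rows correspond to the position vectors x(k) in the notation above.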

Step # 2 Get Observed Data:

The observed data contain sequences of Urdu characters. In our study we have used a line of Urdu text. After acquiring this filtered image, we transform it into binary form and extract a feature from it using our feature extraction program. This feature contains several Urdu characters. The algorithm scans it and performs segmentation by calculating maximum probabilities against the hidden states and locating observations in the feature using HMMs. These observations form the observable states: O(i) = { o(0), o(1), . . . , o(k) } where each O(i) represents a feature (in matrix form) for each shape in the observed states; o(k) is a position vector in the matrix O(i).

Step # 3 Apply HMMs:

We are given:

Hidden states: X(i) = { x(1), x(2), . . . , x(k) } where i = 1, 2, … , m (for m characters).

Observable states: O(i) = { o(1), o(2), . . . , o(k) } where i = 1, 2, … , n.

Initial distribution X(0).

In a hidden Markov model the state variable x(i) is observable only through its measurements o(i). Now, suppose that a sequence O(i) of emissions has been observed.
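Given such an observed emission sequence, the most probable hidden state sequence of a discrete HMM is commonly decoded with the Viterbi algorithm. The sketch below shows that generic decoding step on a toy model. The two states, the "tall" / "wide" observation symbols and all probabilities are invented for illustration and are not values from this work.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability into s.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Toy model (all numbers are made up).
states = ["alif", "bay"]
start_p = {"alif": 0.5, "bay": 0.5}
trans_p = {"alif": {"alif": 0.3, "bay": 0.7}, "bay": {"alif": 0.6, "bay": 0.4}}
emit_p = {"alif": {"tall": 0.9, "wide": 0.1}, "bay": {"tall": 0.2, "wide": 0.8}}

prob, path = viterbi(["wide", "tall"], states, start_p, trans_p, emit_p)
```

In this toy run the decoded path is bay followed by alif, which mirrors the bay-alif example used in the figures of this section.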

Fig 2 shows the transformation of a character and of an observed sequence, captured as MATLAB matrices.

Fig 2. (a) An m x n matrix showing the Urdu character Alif. (b) Sample observation showing a connected component of two characters, bay and alif, spelled out Ba.

Instead of using the characters themselves, our algorithm extracts features from all the characters to reduce computational complexity. These features are used as hidden states in the HMM, i.e. x(i), and are stored on disk. For example, features for the characters alif and bay, captured using MATLAB, are shown in Fig 3.

Fig 3. (a) Feature for character Alif, (b) Feature for character Bay and (c) Feature for sample S(i) taken from word Ba, i.e. bay-alif.

The algorithm extracts a feature from a line of sample text S(i). In the forward algorithm, the feature s(1), … , s(k) is matched against each of the hidden states X(i) by matching rows of X(i) with rows of S(i). The process continues for all characters and stops after calculating the probabilities for all the characters, i.e. P(X(i)|Z(i)). Afterwards it takes the maximum of these probabilities and in this way finds observation O(1) from S(1). The forward algorithm then continues from s(k+1), … , s(L) to find observations O(2), … , O(n). If there is more than one probable character, we can use the so-called Viterbi algorithm, which finds the argmax and gives the optimal probable sequence if we are not close to the actual results. The algorithm for the HMMs is as under:

Algorithm Segsha ( S, L )

j = 1

while ( j < L )

for i = 1 to n

Sample s ( j ) ~ { w }

wi ' = Pr ( s ( j ) | X ( i ) )

end-for

O ( i ) = O ( i ) U { max ( wi ' ) }

j = j + 1

end-while

Where S is a sample feature of vectors obtained from an observed sequence O(i), i.e. a line of Urdu text; L is the dimension of S (length of S); s(j) is a sample taken from S each time to match against the character feature X(i), and the probability of matching gives us the weights, wi ', for each character; max(wi ') is the maximization of probability, which proceeds as follows:

Here max(wi ') can be calculated by comparing wi ' ~ w, using eq. 1 [20].
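The matching loop described above can be sketched in Python under some simplifying assumptions. Here the weight wi ' is approximated by the fraction of pixels on which a window of the sample agrees with a character feature, the scan runs left to right over columns, and the scan advances by the width of the best matching feature. None of these choices is taken from eq. 1; the names `match_weight` and `segment` and the tiny templates are hypothetical.

```python
def match_weight(window, template):
    """Fraction of agreeing pixels: an assumed stand-in for Pr(s(j) | X(i))."""
    total = sum(len(row) for row in template)
    same = sum(
        1
        for w_row, t_row in zip(window, template)
        for w, t in zip(w_row, t_row)
        if w == t
    )
    return same / total


def segment(sample, templates):
    """Greedily label a binary sample by picking the max-weight template at each position."""
    width = len(sample[0])
    labels = []
    j = 0
    while j < width:
        best_label, best_weight, best_width = None, -1.0, 1
        for label, template in templates.items():
            t_width = len(template[0])
            if j + t_width > width:
                continue  # this template does not fit in the remaining columns
            window = [row[j:j + t_width] for row in sample]
            weight = match_weight(window, template)
            if weight > best_weight:  # keep the maximum weight seen so far
                best_label, best_weight, best_width = label, weight, t_width
        if best_label is None:
            break  # nothing fits any more
        labels.append(best_label)
        j += best_width
    return labels


# Tiny made-up features (same height, fixed size), not real Nastaliq glyphs.
templates = {
    "alif": [[1], [1], [1]],
    "bay": [[0, 0, 0], [1, 0, 1], [1, 1, 1]],
}
sample = [
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
labels = segment(sample, templates)
```

Because the sketch takes a single maximum at each position, ties or near-ties between characters are not resolved; that is the situation in which the text suggests falling back on the Viterbi algorithm.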

Results

A total of 1200 words were used, which together include all the characters in our character set. Sample scanned text was taken from a Nastaliq font with point size 36. We found that 1176 out of 1200 words were completely recognized. In the remaining words, only one or two characters were misclassified rather than the whole word. The reported accuracy of 97 % was very encouraging for us and we are looking forward to working further in this area.

Conclusion

We tested our approach on images of text taken from a Nastaliq font scanned at 300 dpi and found that better results can be achieved by using an HMM with the character model. These results were checked on a prototype using a set of characters. We have achieved 97 % accuracy.

Future Work and Enhancements

In future we are planning two things:

1. To eliminate the restriction of fixed font size and style.

2. To work with handwritten Urdu text.

We will pursue both options using the same method, but that is another story.
