Saturday, August 28, 2010

Thaana OCR Development


The engine used is Tesseract OCR.

After a few hours of training the engine, the following image was passed in for processing.


The result is:

ށްރަވަ ލަސައްމަ އެ ންމުބުލި ންސަޓިމެ ށްނަސަމިކޮ އެ ނީބު ންނުޝްމިކޮ އެ
ދިއ .ށެވެމަކަ ވީރުދޮ ށްނަންހުލުފު ށްމަރުކު ގުއީހުތަ ލަސައްމަ އެ އިލަބަ ށްޅަނގުރަ
ށްމަކަ ނެވާއިފަދީ ންލުއ ންގެދިއެ ންހުލުފު ތުމާނޫއަމަ ރުތުއި ޅޭގު އާލަސައްމަ އެ
.ވެއެ ނެބު ންނުޝްމިކޮ އެ ސްވެ

As you can see, it's not even close to perfect. The next stage is to do more systematic training so that the accuracy improves.

The idea is to improve the training and develop a small tool that helps with Thaana OCR, which will be available for FREE and as an open source project. Tesseract presently does not support RTL languages at all, and that is also something that needs to be handled.

Update:

I tried another OCR test, and the results are below. This time it shows clear improvement (could we say it's roughly 40% accurate?). This was after including 2 more training files. The processed result is below:

ގެތުލަދައް ނީދުވަންގެ ންމުވަޅުދާވި ންރުބަމްމެ ޅުކޮދިއި ގެހުލީޖިމަ ކުމަކަ
ނެހޭޖު ންދަހޯ ންހުރު ގެހުލީޖިމަ ރުއި ނަގާ އްޓަންރަދަ ތަނުވަ ނެއްހޯ ށްކަޓަންފުލް
ންނޫ އްމެކަ ތްއޮ ންރަކު ންކުޓަންފުއްކު މެންކޮމެންކޮ ކީމަތުއޮ ނަންއޮ ށްމަ
.ށެވެމަކަ އްމެކަ ހޭންޖެރަކު ރުއިނަގާ ނުހޯ ށްކަމަރިފުންކު ގެތުލަދައް އީއެ ށާއިމަ
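The 40% figure is a guess by eye. One way to put a number on it is to type out the correct text for the image by hand and compare it with the OCR output character by character. The sketch below does that with a plain edit distance; the function names are made up for this example, and it assumes both strings are in the same reading order.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Share of the hand-typed text that the OCR got right, from 0.0 to 1.0."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(truth, ocr)
    return max(0.0, 1.0 - errors / len(truth))


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the test image.
    print(character_accuracy("abcde", "abXde"))
```

Counting in code points means a wrong fili and a wrong letter each count as one error, which is a simplification.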


Note: Read the text in reverse order, since Tesseract does not handle RTL yet.
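Reversing the output can be scripted. A plain character-by-character reverse would push each fili in front of its letter, so the sketch below keeps a letter and the marks that follow it together and reverses those groups instead. It assumes the OCR output stores each fili right after its base letter, and reverse_clusters and reverse_text are names invented for this example.

```python
import unicodedata


def reverse_clusters(line: str) -> str:
    """Reverse one line, keeping each base letter together with its marks."""
    clusters = []
    for ch in line:
        # Thaana fili and sukun are in Unicode category Mn (a "mark");
        # attach any mark to the cluster that came just before it.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_text(text: str) -> str:
    """Apply reverse_clusters to every line, keeping the line order."""
    return "\n".join(reverse_clusters(line) for line in text.splitlines())


if __name__ == "__main__":
    import sys
    # Usage: python reverse_thaana.py outputfilename.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(reverse_text(handle.read()))
```

Spaces and full stops are treated as ordinary single-character groups, so a full stop that shows up at the start of an OCR line ends up at the end after reversing.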

If any of you want to try it out, here is the traineddata file.

If you are on Ubuntu, install Tesseract with
apt-get install tesseract-ocr
then copy the downloaded div.traineddata to the tessdata directory, e.g. /usr/local/share/tessdata/ (a packaged install may look in /usr/share/tesseract-ocr/tessdata/ instead).

To test, you can download the images (Thaana news) from this post, or take a block of text from haveeru.com.mv, and run
tesseract example.tif outputfilename -l div
Once done, run cat outputfilename.txt, which should show you the processed text.
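For anyone who wants to script the test over several images, the same command can be wrapped in a few lines of Python. This is only a sketch: build_command and run_ocr are names made up here, and it assumes the tesseract binary is on the PATH and the div data is already installed.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="div"):
    """Build the same command line as above: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="div"):
    """Run Tesseract and return the recognised text.

    Tesseract adds the .txt extension to the output base name itself.
    """
    subprocess.run(build_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # example.tif is the placeholder image name used in the post.
    print(run_ocr("example.tif", "outputfilename"))
```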

The training files and training data used are here.

9 comments:

SoE said...

an interesting project..

perhaps we can have a look at hocr (Hebrew OCR) for a right-to-left implementation? it may be possible to modify/retrain its engine.

also, have you seen
http://www.xiosis.com/scribefeatures.htm#OCR

who are these guys? There's no info on who's involved or whatnot but it sure looks interesting.

ÎĦΣçҜәѓ™ said...

Most likely, a recognition engine specially designed for Thaana needs to be programmed!

@SoE:
Yeah, I have also been wondering who the Scribe dudes are! Though I haven't tested it, it's an awesome idea!

chopey said...

The RTL issue can be easily sorted. Right now Tesseract works very well at recognizing Thaana. We just need to do more training. Even if you look at the samples provided in my post, you'd notice that it's not bad (you have to read it in reverse though).

Yeah. seen the demo on youtube. Looks good and I think it's a commercial product, selling for almost Mrf 500. Must say it's good work based on the demos on youtube.

What we need to achieve is a bit different. We can work on a framework which developers can use for Thaana OCR. It would be cool if this framework included other Thaana-related stuff like Latin converters, ASCII to Unicode, etc.

SoE said...

missing something here?

Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/div.unicharset

chopey said...

@SoE all the needed files have to be in the same folder

chopey said...

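After you get the output file, if you need to test it, just reverse the file. Maybe a simple script can even do the job. Example:


#!/usr/local/bin/perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');
# Read the OCR output as UTF-8 so characters, not bytes, get reversed.
open(my $thaanaFile, '<:encoding(UTF-8)', 'dhivehi.txt') or die "Cannot open dhivehi.txt: $!";
while (my $line = <$thaanaFile>) {
    chomp $line;
    # Reverse by grapheme cluster (\X) so each fili stays on its letter.
    my $rtl = join '', reverse $line =~ /\X/g;
    print "$rtl\n";
}
close($thaanaFile);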

subcorpus said...

good project ...
best of luck dude ...
maybe someday we can scan the old Maldivian books and make them available on the internets ... :)

Smart and Sassy said...

Why are you not using Xiosis Scribe instead of reinventing the wheel? It's already there; and it works great. You can check it out here:

http://xiosis.com/

chopey said...

@Smart and Sassy, the idea is to have a free and open port. Plus, like you highlighted, why reinvent the wheel? Tesseract is a good OCR engine, and has a good future too. If we base Thaana OCR on Tesseract it will be more flexible and portable too. The only thing we need to work on is training it for Thaana fonts. Advantages include being able to implement OCR on mobile devices, Windows, Mac and even Linux. OCR can be used for multiple things, and having an open and free library is always good. That is the idea behind this, as indicated in the post.