Saturday, August 28, 2010

Thaana OCR Development


The engine used is Tesseract OCR.

After a few hours of training the engine, the following image was passed to it for processing.


The result is:

ށްރަވަ ލަސައްމަ އެ ންމުބުލި ންސަޓިމެ ށްނަސަމިކޮ އެ ނީބު ންނުޝްމިކޮ އެ
ދިއ .ށެވެމަކަ ވީރުދޮ ށްނަންހުލުފު ށްމަރުކު ގުއީހުތަ ލަސައްމަ އެ އިލަބަ ށްޅަނގުރަ
ށްމަކަ ނެވާއިފަދީ ންލުއ ންގެދިއެ ންހުލުފު ތުމާނޫއަމަ ރުތުއި ޅޭގު އާލަސައްމަ އެ
.ވެއެ ނެބު ންނުޝްމިކޮ އެ ސްވެ

As you can see, it's not even close to being perfect. The next stage is to do more systematic training so that the accuracy improves.
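One way to track whether further training actually helps is to type out the correct text for a test image by hand and compare it with the OCR output character by character. The sketch below is only an illustration, not part of the original workflow: it assumes both strings are in the same character order, and the function names (edit_distance, char_accuracy) are made up for this example. It reports accuracy as one minus the edit distance divided by the length of the reference text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy in the range 0.0 to 1.0 (hypothetical helper)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    # Clamp at zero: very long, wrong output could otherwise go negative.
    return max(0.0, 1.0 - errors / len(reference))
```

Running char_accuracy on the same test images after each training round would give a number to compare, instead of an estimate made by eye.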

The idea is to improve the training and develop a small tool that helps with Thaana OCR, which will be available for FREE and as an open source project. Tesseract presently does not even support RTL languages, and that is also something that needs to be handled.

Update:

I tried another OCR test and the results are below. This time it shows clear improvement (can we conclude it's 40% accurate?). This was after including 2 more training files. The processed result is below:

ގެތުލަދައް ނީދުވަންގެ ންމުވަޅުދާވި ންރުބަމްމެ ޅުކޮދިއި ގެހުލީޖިމަ ކުމަކަ
ނެހޭޖު ންދަހޯ ންހުރު ގެހުލީޖިމަ ރުއި ނަގާ އްޓަންރަދަ ތަނުވަ ނެއްހޯ ށްކަޓަންފުލް
ންނޫ އްމެކަ ތްއޮ ންރަކު ންކުޓަންފުއްކު މެންކޮމެންކޮ ކީމަތުއޮ ނަންއޮ ށްމަ
.ށެވެމަކަ އްމެކަ ހޭންޖެރަކު ރުއިނަގާ ނުހޯ ށްކަމަރިފުންކު ގެތުލަދައް އީއެ ށާއިމަ


Note: Read the output text in reverse order, since Tesseract does not handle RTL yet.
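Until RTL is handled properly, the output can be put back into reading order with a small script. The following is a rough sketch under one assumption: that each line comes out reversed letter by letter, with every vowel sign (fili) still sitting after its own consonant, which is how the samples above appear. A plain character-by-character reversal would detach the fili from their letters, so the sketch keeps each base letter together with its combining marks. The function names are made up for this example.

```python
import unicodedata


def reverse_clusters(line: str) -> str:
    """Reverse one line while keeping each base character together
    with the combining marks (such as Thaana fili) that follow it."""
    clusters = []
    for ch in line:
        # Unicode general categories starting with "M" are marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the preceding letter
        else:
            clusters.append(ch)  # start a new cluster
    return "".join(reversed(clusters))


def reverse_text(text: str) -> str:
    """Apply reverse_clusters to every line; line order is unchanged."""
    return "\n".join(reverse_clusters(line) for line in text.split("\n"))
```

Note that this reverses everything on the line, so any Latin text or digits mixed into the output would come out backwards and would need separate handling.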

If any of you want to try it out, here is the traineddata file.

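If you are on Ubuntu, install Tesseract with
apt-get install tesseract-ocr
and then copy the downloaded div.traineddata to /usr/local/share/tessdata/ (if your package keeps its language data somewhere else, for example /usr/share/tesseract-ocr/tessdata/, copy the file there instead).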

To test, you can download the images (Thaana news) from this post or from haveeru.com.mv (take a block of text) and run
tesseract example.tif outputfilename -l div
Once it is done, run cat outputfilename.txt, which should show you the processed text.

The training files used and the training data are here

Tuesday, August 17, 2010

Thaana on Android 2.2



I have not tested this on a real Android phone, so I am not sure this will work 100%. Anyway, give it a try and let me know.

First you need to root your Android. (I have never done it, but here is a link which might help)

Next download these two font files


Next you need to have the Android SDK installed on your box.

Connect your phone to the box.

Push the Thaana fonts to /system/fonts using adb

adb remount
adb push {full path to DroidSans.ttf} /system/fonts
adb push {full path to DroidSerif-Regular.ttf} /system/fonts

* (in my case it's like adb push /Users/sofwathullahmohamed/Desktop/DroidSerif-Regular.ttf /system/fonts)

Finally, do a

adb shell reboot

I guess that should do it. If all went OK, you should be able to browse Thaana websites (Unicode ones only; I don't think www.haveeru.com.mv will work). I've tested this on the Android emulator and it works fine on Android 2.2.


Update (19 Aug 2010): Started work on a Thaana keyboard for Android


When I started on this, I realized it might not be a good idea to convert directly from the QWERTY layout (conventional keyboard) to a Thaana soft keyboard for mobile devices. The user experience probably would not be very good. We need to do some research into the best keyboard layout for mobile devices. This is important because more people will soon be using mobile devices (phones and other mobile computing devices) for daily work and entertainment. This is something that needs to be debated and agreed on before any layout is implemented. HELP is NEEDED here.

Thursday, August 12, 2010

iDhivehiSites for iPhone (FREE)


The project is now open source and can be downloaded from http://github.com/jinahadam/iDhivehiSites- and the Facebook page is here

We won't get much time to improve the features, so we hope interested developers will contribute. Jinah has implemented Twitter sharing and made changes to the UI, so it looks simpler. We had a problem with font rendering for www.haveeru.com.mv, which is also fixed now (but we do hope haveeru will soon move to Unicode).