The engine used is Tesseract OCR.
After a few hours of training the engine, the following image was passed to it for processing.
The result is:
ށްރަވަ ލަސައްމަ އެ ންމުބުލި ންސަޓިމެ ށްނަސަމިކޮ އެ ނީބު ންނުޝްމިކޮ އެ
ދިއ .ށެވެމަކަ ވީރުދޮ ށްނަންހުލުފު ށްމަރުކު ގުއީހުތަ ލަސައްމަ އެ އިލަބަ ށްޅަނގުރަ
ށްމަކަ ނެވާއިފަދީ ންލުއ ންގެދިއެ ންހުލުފު ތުމާނޫއަމަ ރުތުއި ޅޭގު އާލަސައްމަ އެ
.ވެއެ ނެބު ންނުޝްމިކޮ އެ ސްވެ
As you can see, it's not even close to perfect. The next stage is to do more systematic training so that the accuracy improves.
The idea is to improve the training and develop a small tool that helps with Thaana OCR, which will be available for FREE as an open source project. Tesseract presently does not even support RTL languages, and that also needs to be handled.
Update:
I tried another OCR test after including 2 more training files. This time the output shows clear improvement (maybe around 40% accurate, though that is a rough guess rather than a measurement). The processed result is below:
ގެތުލަދައް ނީދުވަންގެ ންމުވަޅުދާވި ންރުބަމްމެ ޅުކޮދިއި ގެހުލީޖިމަ ކުމަކަ
ނެހޭޖު ންދަހޯ ންހުރު ގެހުލީޖިމަ ރުއި ނަގާ އްޓަންރަދަ ތަނުވަ ނެއްހޯ ށްކަޓަންފުލް
ންނޫ އްމެކަ ތްއޮ ންރަކު ންކުޓަންފުއްކު މެންކޮމެންކޮ ކީމަތުއޮ ނަންއޮ ށްމަ
.ށެވެމަކަ އްމެކަ ހޭންޖެރަކު ރުއިނަގާ ނުހޯ ށްކަމަރިފުންކު ގެތުލަދައް އީއެ ށާއިމަ
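To turn a guess like "40% accurate" into a number, one option is to type out the correct text for the image by hand and compare it with the OCR output character by character. The sketch below is only an illustration of that idea, not part of Tesseract: the names edit_distance and char_accuracy are made up here, and it assumes both strings are in the same character order before they are compared.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy as 1 - distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, ocr_output) / len(reference))
```

Calling char_accuracy(reference_text, ocr_text) returns a value between 0.0 and 1.0, so the same test image can be scored again after each round of training.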
Note: Look at the text in reverse order.
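Reading the output backwards by hand gets tiring, so here is a minimal sketch that puts each line into reading order. It assumes the output is reversed line by line, and that each vowel sign (fili) still follows its consonant, which is how the samples above look. The function names are made up for this example. A plain character-by-character reversal would detach the fili from their consonants, so the sketch keeps every combining mark attached to the letter before it.

```python
import unicodedata


def split_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Thaana fili are combining marks (Unicode category Mn).
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_line(line: str) -> str:
    """Reverse one line cluster by cluster."""
    return "".join(reversed(split_clusters(line)))


def reverse_text(text: str) -> str:
    """Reverse every line on its own while keeping the line order."""
    return "\n".join(reverse_line(line) for line in text.splitlines())
```

With this, the first word of the first result above comes back as ވަރަށް. Note that any Latin text or digits in the output would be reversed too, so this is only good for pure Thaana blocks.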
If any of you want to try it out, here is the traineddata file.
If you are on Ubuntu, try
apt-get install tesseract-ocr
then copy the downloaded div.traineddata to /usr/local/share/tessdata/ (or to wherever your install keeps its tessdata directory).
To test, you can download the images (Thaana news) from this post or from haveeru.com.mv (take a block of text) and run
tesseract example.tif outputfilename -l div
Once done, run
cat outputfilename.txt
which should show you the processed text.
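If you would rather drive the test from a script, the same command can be wrapped in a few lines of Python. This is just a sketch: it assumes the tesseract binary is on your PATH and div.traineddata is already installed, and the helper names tesseract_command and run_ocr are made up for this example.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, out_base, lang="div"):
    """Build the same command line as above: tesseract IMAGE OUTBASE -l LANG."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_ocr(image, out_base, lang="div"):
    """Run tesseract and return the recognised text.

    Tesseract appends .txt to the output base name on its own.
    """
    subprocess.run(tesseract_command(image, out_base, lang), check=True)
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

For example, print(run_ocr("example.tif", "outputfilename")) does the same as the two commands above in one step.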
The training files and training data used are here.