PANLocalization · PAN Localization Support Network
NEED HELP for OCR Project
Message #460 of 469
Re: [PAN Localization] Re: NEED HELP for OCR Project

Thank you!

On Fri, Jun 26, 2009 at 4:09 PM, kruyvanna <kruyvanna@...> wrote:


Dear Sovathena,

I have tried training Tesseract on the Khmer language.
It is just a trivial test with 3 characters,
so you might want to train it on the complete symbol set.

Here is the link:
http://vannait.blogspot.com/2009/06/how-to-train-tesseract-ocr.html

Cheers,
Kruy Vanna
GITS, Waseda University.



--- In PANLocalization@yahoogroups.com, "Rajesh Pandey" <pandey.com.np@...> wrote:
>
> Dear Neth, you are always welcome.
>
>
> On 6/10/08, Vathena <nethsovathena@...> wrote:
> >
> > Dear Rajesh Pandey,
> >
> > Thank you for your guidance.
> >
> > Regards,
> >
> > NETH Sovathena
> >
> > On Mon, Jun 9, 2008 at 1:36 PM, Rajesh Pandey <pandey.com.np@...> wrote:
> >
> >> Dear Neth,
> >> I did a little research on the Khmer language, installed Catalan Unicode for
> >> Khmer script, and found that the words do not seem to be segmented. Khmer
> >> characters seem quite similar to Thai characters.
> >>
> >>
> >>
> >> Segmentation involves three steps:
> >> 1. Segmentation of the whole document into lines.
> >> 2. Segmentation of lines into words.
> >> 3. Segmentation of words into characters.
> >>
> >> Most OCR systems (e.g. English OCR, Nepali OCR) segment in a top-down
> >> fashion:
> >> *Document -> lines -> words -> characters*
> >>
> >> *For Khmer OCR*
> >> However, it looks like you will have to take this approach instead:
> >> *Document -> lines -> characters*
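
A minimal sketch of the Document -> lines -> characters approach described above, using projection profiles: rows with no ink separate lines, and columns with no ink separate characters within a line. This is an illustration under assumptions, not the method used by the Nepali OCR team or by Tesseract. It assumes the page is already binarized (1 = ink, 0 = background), and the names `runs_of_ink`, `segment_lines` and `segment_characters` are hypothetical. Real Khmer text has subscripts and vowels stacked above and below the base line, so a plain column profile would likely need refinement.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of non-zero values."""
    runs = []
    start = None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                      # a run of ink begins here
        elif count == 0 and start is not None:
            runs.append((start, i))        # the run ended on the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the end of the profile
    return runs


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Document -> lines: split on empty rows of the horizontal projection."""
    row_profile = [sum(row) for row in image]
    return runs_of_ink(row_profile)


def segment_characters(image: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Lines -> characters: split one line on empty columns."""
    top, bottom = line
    width = len(image[0]) if image else 0
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs_of_ink(col_profile)


# Toy page: two text lines; the first holds three glyphs, the second two.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)
chars_per_line = [segment_characters(page, line) for line in lines]
print(lines)
print(chars_per_line)
```

Each tuple is a half-open range of row (for lines) or column (for characters) indices, so a glyph can be cropped with ordinary slicing.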
> >>
> >>
> >> *Character segmentation:*
> >> You have an advantage over Nepali/Devanagari here:
> >> you don't have to worry much about character segmentation, because Khmer
> >> characters seem to be already segmented.
> >> In our case we have to put extra effort into segmenting characters,
> >> because Nepali/Devanagari characters are joined together in a word.
> >>
> >> *Word segmentation:*
> >> My preliminary research shows that Khmer words are not segmented, meaning
> >> I did not find spaces between the words. Instead I found long sequences of
> >> characters, with a whole sentence written as one run of characters. The
> >> speakers seem to segment according to syllables or so.
> >> So maybe you need to add some more algorithms for word segmentation, or
> >> use a spellchecker and/or grammar checker at the end.
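
One common word segmentation algorithm for unspaced text is greedy longest matching against a word list. The sketch below is only an illustration of that idea, not something proposed in this thread. The function name `segment_words` and the toy English lexicon are hypothetical; a real system would need a Khmer lexicon, and greedy matching can pick the wrong split when words overlap.

```python
from typing import List, Set


def segment_words(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right longest-match segmentation of unspaced text."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break
        else:
            # No lexicon entry starts here: emit the single character as-is.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Toy lexicon standing in for a real word list (an assumption for illustration).
lexicon = {"optical", "character", "recognition"}

print(segment_words("opticalcharacterrecognition", lexicon))
print(segment_words("opticalxcharacter", lexicon))
```

Characters that match no lexicon entry come out as single-character tokens, which gives a spellchecker at the end something concrete to flag.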
> >>
> >>
> >> The output should still be pretty good: the input has no spaces between
> >> the words, so there will be no spaces in the output either. I guess that
> >> will not be a problem.
> >>
> >> Initially you might try Tesseract OCR. I am sure you will get pretty
> >> good results once you have trained it.
> >> The homepage for tesseract-ocr is http://code.google.com/p/tesseract-ocr
> >> You might also subscribe to tesseract google groups :
> >> http://groups.google.com/group/tesseract-ocr
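
Once a trained language file is in place, Tesseract is normally run from the command line as `tesseract <image> <output base> -l <language>`. The sketch below wraps that call in Python. The helper names `build_tesseract_command` and `run_tesseract`, the file names, and the language code `khm` are assumptions for illustration; use whatever name your own trained data has.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, output_base: str, lang: str) -> List[str]:
    """Build the argument list for: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path: str, output_base: str, lang: str) -> Optional[str]:
    """Run Tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <output base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# "page.png" and "khm" are placeholder values, not files from this thread.
command = build_tesseract_command("page.png", "page", "khm")
print(" ".join(command))
```

Building the command separately from running it makes it easy to check the arguments before a long recognition run.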
> >>
> >> Good luck with training tesseract-ocr. I think trying this once
> >> will give you a clear picture of the overall OCR process.
> >>
> >>
> >>
> >>
> >>
> >> --- In PANLocalization@yahoogroups.com, "Bal Krishna Bal"
> >> <balkrishna7bal@> wrote:
> >> >
> >> > Dear Neth,
> >> > I have forwarded your email to the Nepali OCR Team and hopefully you will
> >> > receive a corresponding response very soon.
> >> > Regards,
> >> > Bal Krishna
> >> >
> >> >
> >> > On Mon, Jun 2, 2008 at 1:39 PM, Vathena nethsovathena@ wrote:
> >> >
> >> > > Dear All,
> >> > >
> >> > > My name is NETH Sovathena, a new Software Developer at PAN Localization
> >> > > Cambodia of IDRC.
> >> > > I am now responsible for the OCR (Optical Character Recognition) project.
> >> > >
> >> > > I am writing to all of you to ask for some help.
> >> > >
> >> > > I am having real difficulty with my OCR project. It is a complicated one
> >> > > for me, as I am a new Software Developer.
> >> > > After reading some documents on OCR, I have a basic understanding of
> >> > > the OCR process: Preprocessing, Segmentation, Feature Extraction,
> >> > > Recognition, and Post processing.
> >> > >
> >> > > Currently I am on the step "Understanding OCR" and focusing on
> >> > > SEGMENTATION. I have searched for algorithms used for OCR, but I do not
> >> > > understand them and have not yet found any more documents or algorithms.
> >> > >
> >> > > Moreover, I do not understand each task that I need to do for this
> >> > > project, such as:
> >> > >
> >> > > * Study OCR Framework
> >> > > * Document scope of OCR (font, sizes, styles, etc.)
> >> > > * Develop Segmentation Strategy
> >> > > * Develop Segmentation Module for Khmer in the frameworks
> >> > > * Test Segmentation Module
> >> > > * Prototype training Module
> >> > > * Collect Training and Test Data
> >> > > * Conduct Training
> >> > > * Conduct Testing
> >> > > * Post Processing
> >> > >
> >> > > etc.
> >> > >
> >> > > If possible, I would like to ask you for any explanation or more
> >> > > useful resources for this project.
> >> > >
> >> > > Best regards,
> >> > >
> >> > > NETH Sovathena
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
> >
> >
>
>
>
> --
> Regards,
> Rajesh Pandey
> Researcher and Developer in Nepali OCR Project
> PAN Localization Project, Nepal
> Madan Puraskar Pustakalaya
> Patan Dhoka, Lalitpur
> Phone: 977-1-5521393, Fax: 977-1-5536390
>




--
N. Sovathena
--------------------------------------------------------------
PAN Localization Cambodia (PLC) of IDRC
Mobile Phone: + 855 17 719 326
Office Phone : + 855 11 811 947
Skype: vathena007


Fri Jun 26, 2009 10:00 am

neth_sovathena
Dear All, My name is NETH Sovathena, a new Software Developer at PAN Localization Cambodia of IDRC. Now I am responsible for OCR (Optical Character...
Vathena
neth_sovathena
Jun 2, 2008
7:54 am

Dear Neth, I have forwarded your email to the Nepali OCR Team and hopefully you will receive a corresponding response very soon. Regards, Bal Krishna...
Bal Krishna Bal
balkrish_ru
Jun 2, 2008
8:06 am

Dear Neth, I did a small research on Khmer language, installed Catalan Unicode for Khmer script and found out that the words don't seem to be segmented. Khmer...
Rajesh Pandey
rajespande
Jun 9, 2008
6:36 am

Dear Rajesh Pandey, Thanks for your guide to me. Regards, NETH Sovathena On Mon, Jun 9, 2008 at 1:36 PM, Rajesh Pandey <pandey.com.np@...>...
Vathena
neth_sovathena
Jun 10, 2008
3:43 am

Dear Neth you are always welcome. ... -- Regards, Rajesh Pandey Researcher and Developer in Nepali OCR Project PAN Localization Project, Nepal Madan Puraskar...
Rajesh Pandey
rajespande
Jun 10, 2008
7:06 am

Dear Sovathena, I have tried to train tesseract with Khmer language. It's just a trivial test of 3 characters. So u might want to train it for the complete...
kruyvanna
Jun 26, 2009
9:09 am

Thank you! ... -- N. Sovathena ... PAN Localization Cambodia (PLC) of IDRC Mobile Phone: + 855 17 719 326 Office Phone : + 855 11 811 947 Skype: vathena007...
Vathena
neth_sovathena
Jun 26, 2009
10:01 am

Dear NETH Please visit our website www.ucsc.cmb.ac.lk\ltrl. You can find our current OCR system and relevant publications. Contact me for further...
Nishantha Medagoda
nmedagoda@...
Jun 3, 2008
1:46 am