On Fri, Jun 26, 2009 at 4:09 PM, kruyvanna <kruyvanna@...> wrote:
Dear Sovathena,
I have tried training Tesseract on the Khmer language.
It's just a trivial test with 3 characters,
so you might want to train it on the complete set of symbols.
Here is the link:
http://vannait.blogspot.com/2009/06/how-to-train-tesseract-ocr.html
Cheers,
Kruy Vanna
GITS, Waseda University.
--- In PANLocalization@yahoogroups.com, "Rajesh Pandey" <pandey.com.np@...> wrote:
>
> Dear Neth, you are always welcome.
>
> On 6/10/08, Vathena <nethsovathena@...> wrote:
> >
> > Dear Rajesh Pandey,
> >
> > Thank you for your guidance.
> >
> > Regards,
> >
> > NETH Sovathena
> >
> > On Mon, Jun 9, 2008 at 1:36 PM, Rajesh Pandey <pandey.com.np@...> wrote:
> >
> >> Dear Neth,
> >> I did a little research on the Khmer language, installed Catalan Unicode for
> >> Khmer script, and found that the words don't seem to be segmented. Khmer
> >> characters seem quite similar to Thai characters.
> >>
> >>
> >>
> >> Segmentation involves three steps:
> >> 1. Segmentation of the whole document into lines.
> >> 2. Segmentation of lines into words.
> >> 3. Segmentation of words into characters.
> >>
> >> Most OCRs (e.g. English OCR, Nepali OCR) segment in a top-down approach:
> >> *Document -> lines -> words -> characters*
> >>
> >> *For Khmer OCR*
> >> However, it looks like you have to approach it this way:
> >> *Document -> lines -> characters*
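The *Document -> lines -> characters* flow above can be sketched with projection profiles, a common textbook technique that is not named in this thread. This is only a minimal sketch under assumptions: the page is a hypothetical, already binarized grid (1 = ink, 0 = background), and the function names are made up for illustration. Khmer subscripts and vowel signs above or below the base character would likely need more than blank-column splitting.

```python
from typing import List, Tuple

Image = List[List[int]]  # hypothetical binarized page: 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Document -> lines: group rows that contain ink, splitting at blank rows."""
    return runs([sum(row) for row in image])


def segment_characters(image: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Line -> characters: group inked columns of one line, splitting at blank columns."""
    width = len(image[0]) if image else 0
    column_ink = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(column_ink)


# Toy page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
]
for top, bottom in segment_lines(page):
    print((top, bottom), segment_characters(page, top, bottom))
```

Each printed pair is a line's row span followed by the column spans of the character candidates found in that line.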
> >>
> >>
> >> *Character segmentation:*
> >> You have an advantage over Nepali/Devanagari characters:
> >> You don't have to worry much about character segmentation, because Khmer
> >> characters seem to be already segmented.
> >> In our case we have to put extra effort into segmenting characters
> >> because Nepali/Devanagari characters are joined together within a word.
> >>
> >> *Word segmentation:*
> >> My preliminary research shows that Khmer words are not segmented, meaning
> >> I did not find spaces between the words. Instead I found long sequences of
> >> characters, with a whole sentence written as one run of characters. Speakers
> >> seem to segment by syllable or something similar.
> >> So maybe you need to add some more algorithms for word segmentation, or
> >> use a spellchecker and/or grammar checker at the end.
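One simple candidate for the extra word segmentation step mentioned above is greedy longest matching against a word list. This is only a sketch and not something proposed in this thread: the function name and the toy lexicon are hypothetical, and Latin letters stand in for Khmer text so the example is easy to check by eye. A real system would need a Khmer lexicon and some handling of ambiguous splits, since greedy matching can pick the wrong word.

```python
from typing import List, Set


def segment_words(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match segmentation of unspaced text.

    At each position take the longest lexicon entry that matches; if none
    matches, emit a single character so the scan always advances.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: treat an unknown character as its own token
        # Try candidates from longest to shortest (length 1 is the fallback).
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words


# Hypothetical toy lexicon; Latin letters stand in for Khmer words.
toy_lexicon = {"this", "is", "a", "test", "the"}
print(segment_words("thisisatest", toy_lexicon))
```

A spellchecker or grammar checker at the end, as suggested above, could then be used to catch the splits that greedy matching gets wrong.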
> >>
> >>
> >> The output should still be pretty good: the input does not have any spaces
> >> between the words, so there will not be any spaces in the output either.
> >> I guess that will not be a problem.
> >>
> >> Initially you might think of giving Tesseract OCR a try. I am sure
> >> you will get pretty good results once you have trained it.
> >> The homepage for tesseract-ocr is http://code.google.com/p/tesseract-ocr
> >> You might also subscribe to the tesseract Google group:
> >> http://groups.google.com/group/tesseract-ocr
> >>
> >> Now good luck with training tesseract-ocr. I think trying this once
> >> will give you a clear picture of an overall OCR system.
> >>
> >>
> >>
> >>
> >>
> >> --- In PANLocalization@yahoogroups.com, "Bal Krishna Bal"
> >> <balkrishna7bal@> wrote:
> >> >
> >> > Dear Neth,
> >> > I have forwarded your email to the Nepali OCR Team and hopefully you will
> >> > receive a corresponding response very soon.
> >> > Regards,
> >> > Bal Krishna
> >> >
> >> >
> >> > On Mon, Jun 2, 2008 at 1:39 PM, Vathena nethsovathena@ wrote:
> >> >
> >> > > Dear All,
> >> > >
> >> > > My name is NETH Sovathena, a new Software Developer at PAN Localization
> >> > > Cambodia of IDRC.
> >> > > I am now responsible for the OCR (Optical Character Recognition) project.
> >> > >
> >> > > I am writing this email to ask all of you for some help.
> >> > >
> >> > > I am having real difficulty with my OCR project. It is a complicated one for
> >> > > me, as I am a new Software Developer working on it.
> >> > > After reading some documents related to OCR, I have a basic understanding of
> >> > > the OCR process: Preprocessing, Segmentation, Feature
> >> > > Extraction, Recognition, and Post processing.
> >> > >
> >> > > Currently, I am on the step "Understanding OCR", and am now focusing on
> >> > > SEGMENTATION. I have tried to search for algorithms used for OCR, but I
> >> > > don't understand them and have not yet found any more documents or algorithms.
> >> > >
> >> > > Moreover, I do not understand each task that I need to do for this project,
> >> > > such as:
> >> > >
> >> > > * Study OCR Framework
> >> > > * Document scope of OCR (font, sizes, styles, etc.)
> >> > > * Develop Segmentation Strategy
> >> > > * Develop Segmentation Module for Khmer in the frameworks
> >> > > * Test Segmentation Module
> >> > > * Prototype training Module
> >> > > * Collect Training and Test Data
> >> > > * Conduct Training
> >> > > * Conduct Testing
> >> > > * Post Processing
> >> > >
> >> > > etc.
> >> > >
> >> > > * If possible, I would like to ask you for any explanation or more useful
> >> > > resources for this project.
> >> > >
> >> > > Best regards,
> >> > >
> >> > > NETH Sovathena
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
> >
> >
>
>
>
> --
> Regards,
> Rajesh Pandey
> Researcher and Developer in Nepali OCR Project
> PAN Localization Project, Nepal
> Madan Puraskar Pustakalaya
> Patan Dhoka, Lalitpur
> Phone: 977-1-5521393, Fax: 977-1-5536390
>
--
N. Sovathena
--------------------------------------------------------------
PAN Localization Cambodia (PLC) of IDRC
Mobile Phone: + 855 17 719 326
Office Phone : + 855 11 811 947
Skype: vathena007