LingPipe
Chinese Model Quality
Message #649 of 777
Re: [LingPipe] Chinese Model Quality

Sue Chen wrote:
> ...
> I see LingPipe has functionality to extract English sentences from text,
> but it doesn't have this for Chinese. Do you know of any tools that do
> this for Chinese?

We did this for Chinese in the past by extending
sentences.HeuristicSentenceModel with the appropriate end tokens
for Chinese and using the tokenizer.CharacterTokenizerFactory.

I can't remember the Unicode code point for the circle that's
used as the end-of-sentence marker right now, and we didn't try
to get fancy with sequences of punctuation, but it worked well
for the corpus we had.
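
As a rough illustration of the idea (character-level tokens plus a set of sentence-final tokens), here is a minimal standalone sketch in plain Java. It does not use LingPipe's HeuristicSentenceModel or CharacterTokenizerFactory; the class name ChineseSentenceSplitter and the choice of end tokens are assumptions made for the example. The small circle is the ideographic full stop, U+3002; the fullwidth exclamation mark (U+FF01) and question mark (U+FF1F) are added here as likely end tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Standalone sketch; this does NOT use LingPipe classes.
class ChineseSentenceSplitter {

    // Assumed end-of-sentence tokens: ideographic full stop (U+3002),
    // fullwidth exclamation mark (U+FF01), fullwidth question mark (U+FF1F).
    static final Set<String> END_TOKENS = Set.of("\u3002", "\uFF01", "\uFF1F");

    // Treat each code point as one token and close a sentence at each end token.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String token = new String(Character.toChars(cp));
            i += Character.charCount(cp);
            current.append(token);
            if (END_TOKENS.contains(token)) {
                addIfNonBlank(sentences, current);
                current.setLength(0);
            }
        }
        addIfNonBlank(sentences, current); // trailing text without a final stop
        return sentences;
    }

    // Add the trimmed buffer contents as a sentence unless it is empty.
    private static void addIfNonBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        for (String s : split("我喜欢猫。你呢？太好了！")) {
            System.out.println(s);
        }
    }
}
```

Like the approach described above, this does nothing clever with runs of punctuation: a doubled end token would come out as a one-character sentence of its own.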

> Does LingPipe do chunking for English sentences?
> For example,
> [The bald man] [was sitting] [on his suitcase].

It can find the VPs and NPs, but you have to define
them in terms of parts of speech (see the part-of-speech
tutorial), or you have to train a named-entity chunker
with the chunks ahead of time.

This isn't language dependent, but you need a part-of-speech
tagger for the language in question, which requires
training data.
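
To make "define them in terms of parts of speech" concrete, here is a minimal standalone sketch in plain Java that groups already-tagged tokens into NP and VP chunks. It does not use LingPipe's chunking classes; the class name PosChunker, the Penn Treebank-style tags, and the tag-to-chunk mapping are all assumptions made for the example, and the tags themselves would have to come from a trained part-of-speech tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch; this does NOT use LingPipe classes.
class PosChunker {

    // Hypothetical mapping from Penn Treebank-style tags to chunk types.
    static String chunkType(String tag) {
        if (tag.equals("DT") || tag.equals("PRP$")
                || tag.startsWith("JJ") || tag.startsWith("NN")) {
            return "NP";
        }
        if (tag.startsWith("VB") || tag.equals("MD")) {
            return "VP";
        }
        return "O"; // outside any chunk
    }

    // Group maximal runs of tokens whose tags map to the same chunk type.
    static List<String> chunk(String[] tokens, String[] tags) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String type = chunkType(tags[i]);
            int j = i + 1;
            // Extend the run only for real chunk types, never for "O".
            while (!type.equals("O") && j < tokens.length
                    && chunkType(tags[j]).equals(type)) {
                j++;
            }
            String span = String.join(" ", Arrays.copyOfRange(tokens, i, j));
            out.add(type.equals("O") ? span : "[" + type + " " + span + "]");
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "bald", "man", "was", "sitting", "on", "his", "suitcase"};
        String[] tags = {"DT", "JJ", "NN", "VBD", "VBG", "IN", "PRP$", "NN"};
        System.out.println(String.join(" ", chunk(tokens, tags)));
    }
}
```

With this mapping the preposition "on" stays outside any chunk; getting [on his suitcase] as a single chunk would take an extra rule that attaches a preposition to the noun phrase that follows it.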

- Bob Carpenter
Alias-i



Thu Nov 20, 2008 7:28 pm

colloquialdo...

Hi Bob, I have a question on Model Quality. I used the ChineseToken sample to generate a words-zh-as.CompiledSpellChecker model, which has a size of 78,303 KB. I...
Sue Chen
suelingpipe
Nov 14, 2008
6:11 pm

... The other way to control model size is to take longer n-grams and prune out low-count sequences. If you follow the tutorial, you'll see where we run standard...
Bob Carpenter
colloquialdo...
Nov 15, 2008
2:08 am

Hi Bob, Thanks for replying. Does a longer n-gram model mean more accuracy? How do I prune out low-count sequences from a model using LingPipe? I have some...
Sue Chen
suelingpipe
Nov 18, 2008
7:49 pm

... Usually longer n-grams mean more accuracy up to a point, after which accuracy plateaus. Longer n-grams can overfit in some situations compared to shorter...
Bob Carpenter
colloquialdo...
Nov 19, 2008
6:12 pm

Thanks, Bob. The goal of making English Chinese word alignment is to create some TMX files for "translation memory" tools used by translators. We have some MT...
Sue Chen
suelingpipe
Nov 20, 2008
3:36 pm