LingPipe
Message #645 of 777
Re: [LingPipe] Chinese Model Quality


> I ran the Chinese Word Demo on a random Chinese article. I see
> some differences between the token outputs. How can I tell whether
> a 5-gram model is better than a 4-gram model? What is the rule to
> determine that?

The other way to control model size is to take
longer n-grams and prune out low-count sequences.
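
To make the pruning idea concrete, here's a minimal
sketch in plain Python. It is not LingPipe's API; the
names count_char_ngrams and prune and the toy strings
are made up for illustration. It counts character
n-grams up to a maximum length and then drops every
sequence seen fewer than min_count times.

```python
from collections import Counter

def count_char_ngrams(texts, max_n):
    """Count every character n-gram of length 1..max_n in the texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def prune(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

# Toy data (an assumption, not a real corpus).
counts = count_char_ngrams(["abab", "abc"], max_n=3)
pruned = prune(counts, min_count=2)
print(len(counts), "n-grams before pruning,", len(pruned), "after")
```

The trade-off is that longer n-grams add many
sequences seen only once or twice, and those are the
ones pruning throws away first.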

If you follow the tutorial, you'll see where
we run standard evaluations. Basically,
there's labeled data and we evaluate whether
the segmentations we pull out match the
annotations on a per-word or per-segmentation
basis.
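
Here's a rough sketch of that kind of scoring, assuming
you have a gold segmentation and a system segmentation
of the same text as lists of tokens. It is not the
tutorial's evaluation code; the function names are
hypothetical, and reading "per-segmentation" as an
exact match of the whole token sequence is my
assumption.

```python
def spans(tokens):
    """Turn a token list into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def word_prf(gold, predicted):
    """Per-word precision, recall and F1 over matching token spans."""
    g, p = spans(gold), spans(predicted)
    tp = len(g & p)  # tokens with identical start and end offsets
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

def exact_match_rate(gold_sents, pred_sents):
    """Fraction of sentences whose whole segmentation matches the gold one."""
    if not gold_sents:
        return 0.0
    hits = sum(1 for g, p in zip(gold_sents, pred_sents) if g == p)
    return hits / len(gold_sents)

# Toy example (made up): the system splits the second gold word in two.
gold = ["我们", "喜欢", "北京"]
pred = ["我们", "喜", "欢", "北京"]
print(word_prf(gold, pred))
print(exact_match_rate([gold], [pred]))
```

Comparing two models (say 4-gram and 5-gram) then comes
down to running both on the same held-out labeled data
and comparing these numbers.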

Warning 1: the evaluations are run on text
similar to the training data. When you
go out to wild text, results will almost
certainly be worse. Behavior on unknown words
is also worse than on words in the training set.

If you read Chinese (we don't), you might
want to run the kind of text you're interested
in through it and see what the results look
like. (I'd be interested in the results.)

You can also run our tokenizer in n-best mode and
get more than one possible segmentation if you have
some other way to disambiguate. For search,
that might be useful when building the index.
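
One way the n-best output could feed an index, sketched
in Python under the assumption that you already have
the candidate segmentations as token lists (the lists
below are hard-coded stand-ins, and index_terms is a
hypothetical helper, not a LingPipe call): index the
union of tokens from all candidates, keyed by their
start offset.

```python
def index_terms(segmentations):
    """Union of (token, start offset) pairs over all candidate segmentations."""
    terms = set()
    for tokens in segmentations:
        pos = 0
        for tok in tokens:
            terms.add((tok, pos))
            pos += len(tok)
    # Order by start offset, shorter tokens first, for stable output.
    return sorted(terms, key=lambda t: (t[1], len(t[0])))

# Two made-up candidate segmentations of the same four characters.
n_best = [["ab", "cd"], ["a", "b", "cd"]]
for token, start in index_terms(n_best):
    print(start, token)
```

Indexing every candidate trades some precision for
recall: a query matches if any of the plausible
segmentations contains the word.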

Warning 2: each corpus uses either traditional
or simplified characters, but not both.

A general app would need either to detect the
character set and then apply the matching model,
or to build some kind of combined model (just
train on both data sets -- it should work, but I'm
not sure how much overlap there is and, in cases
where there is overlap, whether tokenization
standards differ). For the first option, you could
do the type detection the way we describe in the
language ID tutorial, and then run one of the
character-set-specific models.
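
As a toy illustration of detect-then-dispatch, here's a
sketch that trains one character model per type and
picks the one giving the text the higher likelihood.
It is a stand-in written in plain Python, not the
language ID tutorial's code: the add-one-smoothed
unigram model, the class and function names, and the
one-line training samples are all assumptions made for
the example.

```python
import math
from collections import Counter

class CharUnigramModel:
    """Add-one-smoothed character unigram model."""
    def __init__(self, text):
        self.counts = Counter(text)
        self.total = len(text)

    def log_prob(self, text, vocab_size):
        return sum(
            math.log((self.counts[ch] + 1) / (self.total + vocab_size))
            for ch in text
        )

def train(samples):
    """Build one model per label from {label: training text}."""
    return {label: CharUnigramModel(text) for label, text in samples.items()}

def classify(models, text):
    """Return the label whose model assigns the text the highest likelihood."""
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen characters
    return max(models, key=lambda lab: models[lab].log_prob(text, vocab_size))

# Tiny made-up training samples; real use needs far more text per type.
models = train({
    "simplified": "这个国家的语言",
    "traditional": "這個國家的語言",
})
print(classify(models, "国语"))
print(classify(models, "國語"))
```

The label that comes back would then select which
segmentation model to run on the text.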

- Bob Carpenter
Alias-i



Sat Nov 15, 2008 2:08 am


Hi Bob, I have a question on Model Quality. I used the ChineseToken sample to generate a words-zh-as.CompiledSpellChecker model, which has size 78,303KB.  I...
Sue Chen (suelingpipe)
Nov 14, 2008 6:11 pm

... The other way to control model size is to take longer n-grams and prune out low-count sequences. If you follow the tutorial, you'll see where we run standard...
Bob Carpenter (colloquialdo...)
Nov 15, 2008 2:08 am

Hi Bob, Thanks for replying. Do longer n-gram models mean more accuracy? How do I prune out low-count sequences from a model using LingPipe? I have some...
Sue Chen (suelingpipe)
Nov 18, 2008 7:49 pm

... Usually longer n-grams mean more accuracy, up to a point at which accuracy plateaus. Longer n-grams can overfit in some situations compared to shorter...
Bob Carpenter (colloquialdo...)
Nov 19, 2008 6:12 pm

Thanks, Bob. The goal of making English-Chinese word alignment is to create some TMX files for "translation memory" tools used by translators. We have some MT...
Sue Chen (suelingpipe)
Nov 20, 2008 3:36 pm

... We did this for Chinese in the past by extending sentences.HeuristicSentenceModel with the appropriate end tokens for Chinese and using the...
Bob Carpenter (colloquialdo...)
Nov 20, 2008 7:28 pm