Re: [LingPipe] Chinese Model Quality

Sue Chen wrote:
>
>
> Hi Bob, thanks for replying.
> Do longer n-gram models mean more accuracy?

Longer n-grams usually mean more accuracy, up to a point
at which accuracy plateaus. Longer n-grams can also overfit
in some situations and perform worse on new data than
shorter ones. That's why you have to test how they work in
practice. It probably won't make a huge difference for
this task.
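
One way to run that kind of test is to train at several n-gram
lengths and score each on held-out text. The following is a minimal
sketch, not LingPipe code: it assumes a plain character n-gram model
with add-one smoothing as a simple stand-in for whatever smoothing the
real model uses, and the function names are made up for illustration.
Lower cross-entropy on the held-out text is better; if the number
stops improving or gets worse as n grows, the longer context is no
longer helping.

```python
import math
from collections import Counter


def train_counts(text, n):
    """Count each character n-gram and each (n-1)-character context."""
    padded = "^" * (n - 1) + text  # "^" pads the start of the text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts


def cross_entropy(text, n, ngrams, contexts, vocab_size):
    """Average bits per character of text under add-one smoothing."""
    padded = "^" * (n - 1) + text
    total_bits = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        prob = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        total_bits -= math.log2(prob)
    return total_bits / len(text)


def compare_orders(train, heldout, max_n=4):
    """Map each n in 1..max_n to held-out cross-entropy in bits/char."""
    # One extra slot in the vocabulary stands in for unseen characters.
    vocab_size = len(set(train) | set(heldout)) + 1
    results = {}
    for n in range(1, max_n + 1):
        ngrams, contexts = train_counts(train, n)
        results[n] = cross_entropy(heldout, n, ngrams, contexts, vocab_size)
    return results
```

In practice the training and held-out strings would come from
disjoint parts of the segmentation corpus.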

> How do I prune out
> low-count sequences from a model using LingPipe?

Check out the spelling tutorial, which goes through pruning.
Most of the tuning advice there doesn't apply to Chinese
word segmentation, but pruning does. Basically, you get
the underlying language model's sequence counter and prune
that.
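
To make the idea concrete, here is a minimal sketch of what pruning a
sequence counter means. It is not LingPipe code and does not use
LingPipe's API: a flat dictionary of substring counts stands in for
the model's counter, and `count_substrings` and `prune` are
hypothetical names chosen for the example.

```python
from collections import Counter


def count_substrings(text, max_n):
    """Count every character sequence of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def prune(counts, min_count):
    """Keep only the sequences seen at least min_count times."""
    return Counter({seq: c for seq, c in counts.items() if c >= min_count})
```

Raising `min_count` shrinks the table at the cost of forgetting rare
sequences, so the threshold is worth tuning against held-out accuracy
as well as model size.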

> I have some Chinese articles and their English human translations. I
> want to match the Chinese tokens with their English tokens. I think I
> will run the LingPipe Chinese Word Segmentation and English Word
> Segmentation programs, then align the extracted tokens?

I've never actually tried any alignment programs, so I
don't know what to expect. Are you going to use an existing
MT system to do the alignment?

> If I run the Chinese Word Segmentation, I guess I have to generate a
> model using some training data first. The training data from
> icwb2-data.zip is traditional Chinese. My Chinese articles are
> simplified Chinese.

The Peking University and Microsoft Research corpora are
simplified; the Academia Sinica and Hong Kong corpora are
traditional. Here's the official corpus description:

http://www.sighan.org/bakeoff2005/data/instructions.php.htm

You can find the tagging guidelines as PDFs here:

http://www.sighan.org/bakeoff2005/

(I'm afraid they're mostly in Chinese.)

> I found a simplified Chinese corpus, LCMC.
> But the LCMC corpus format is XML files with POS tags in them. Can
> LingPipe process it? Or do you have any other suggestions?

LingPipe doesn't have anything built in to process the
LCMC corpus, but as long as it can be converted to text
with spaces between the words, you can use it to train
LingPipe. We like to parse straight out of the XML, but
it may be easier to convert whatever data you have to a
simple text-based format and then feed that into the
training program for LingPipe as described in the tutorial.
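
The conversion to space-separated text could look something like the
sketch below. The element and attribute names are assumptions about
the LCMC layout (sentences in `<s>`, words in `<w POS="...">`,
punctuation in `<c POS="...">`) and should be checked against the
actual files; the sample string is made up for illustration. The POS
attributes are simply ignored.

```python
import xml.etree.ElementTree as ET


def sentences_to_lines(xml_text):
    """Return one space-separated line of tokens per <s> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for sentence in root.iter("s"):
        # Assumed layout: <w> holds a word, <c> holds punctuation.
        tokens = [(el.text or "").strip()
                  for el in sentence if el.tag in ("w", "c")]
        tokens = [t for t in tokens if t]
        if tokens:
            lines.append(" ".join(tokens))
    return lines


# Hypothetical fragment in the assumed layout, not real corpus data.
SAMPLE = (
    '<text><p><s n="0001">'
    '<w POS="n">中文</w><w POS="n">分词</w><c POS="ew">。</c>'
    '</s></p></text>'
)

if __name__ == "__main__":
    for line in sentences_to_lines(SAMPLE):
        print(line)
```

Writing the resulting lines to a UTF-8 text file gives the
one-sentence-per-line, space-delimited format that the training
step expects.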

- Bob Carpenter
Alias-i




Wed Nov 19, 2008 6:12 pm


Messages in this thread:

Sue Chen, Nov 14, 2008 6:11 pm
  Hi Bob, I have a question on Model Quality. I used the ChineseToken sample to generate a words-zh-as.CompiledSpellChecker model, which has size 78,303KB.  I...

Bob Carpenter, Nov 15, 2008 2:08 am
  ... The other way to control model size is take longer n-grams and prune out low-count sequences. If you follow the tutorial, you'll see where we run standard...

Sue Chen, Nov 18, 2008 7:49 pm
  Hi Bob, Thanks for replying. Does longer n-grams model mean more accuracy? How do I prune out low-count sequences from model using LingPipe? I have some...

Bob Carpenter, Nov 19, 2008 6:12 pm
  ... Usually longer n-grams means more accuracy up to a point at which accuracy plateaus. Longer n-grams can overfit in some situations compared to shorter...

Sue Chen, Nov 20, 2008 3:36 pm
  Thanks, Bob. The goal of making English Chinese word alignment is to create some TMX files for "translation memory" tools used by translators. We have some MT...

Bob Carpenter, Nov 20, 2008 7:28 pm
  ... We did this for Chinese in the past by extending sentences.HeuristicSentenceModel with the appropriate end tokens for Chinese and using the...