LingPipe
Chinese Model Quality: Message #648 of 777
Re: [LingPipe] Chinese Model Quality

Thanks, Bob.

The goal of the English-Chinese word alignment is to create some TMX files
for the "translation memory" tools used by translators.

We have some MT systems, such as Language Weaver, Systran, and AppTek.

I see LingPipe has functionality to extract English sentences from text, but it
doesn't have this for Chinese. Do you know of any tools that do this for Chinese?

Does LingPipe do chunking for English sentences?
For example,
[The bald man] [was sitting] [on his suitcase].

thanks,
Sue




________________________________
From: Bob Carpenter <carp@...>
To: LingPipe@yahoogroups.com
Sent: Wednesday, November 19, 2008 1:12:53 PM
Subject: Re: [LingPipe] Chinese Model Quality


Sue Chen wrote:
>
>
> Hi Bob, Thanks for replying.
> Does a longer n-gram model mean more accuracy?

Usually longer n-grams mean more accuracy, up to a point
at which accuracy plateaus. In some situations longer n-grams
overfit the training data and perform worse than shorter ones
on new data. That's why you have to test how they
work in practice. It probably won't make a huge difference
for this task.
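
One way to test this in practice is to segment some held-out text with
models of different n-gram lengths and score each result against a
hand-segmented reference. The following is a minimal sketch of such a
scorer, assuming both segmentations are given as space-separated strings
over the same characters. The class and method names are made up for
illustration and are not part of LingPipe.

```java
import java.util.HashSet;
import java.util.Set;

class SegmentationScore {

    // Turn a space-separated segmentation into a set of "start-end" character spans.
    static Set<String> spans(String segmented) {
        Set<String> result = new HashSet<>();
        int pos = 0;
        for (String word : segmented.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            result.add(pos + "-" + (pos + word.length()));
            pos += word.length();
        }
        return result;
    }

    // Returns {precision, recall, F1} of the system segmentation against the reference.
    static double[] score(String reference, String system) {
        Set<String> ref = spans(reference);
        Set<String> sys = spans(system);
        Set<String> correct = new HashSet<>(sys);
        correct.retainAll(ref); // words whose start and end both match the reference
        double p = sys.isEmpty() ? 0.0 : (double) correct.size() / sys.size();
        double r = ref.isEmpty() ? 0.0 : (double) correct.size() / ref.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        // Placeholder strings; real input would be reference and model output for held-out text.
        double[] prf = score("ab c de", "ab cd e");
        System.out.println("P=" + prf[0] + " R=" + prf[1] + " F=" + prf[2]);
    }
}
```

Running the same scorer for each n-gram length on the same held-out
text shows where accuracy stops improving.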

> How do I prune out
> low-count sequences from the model using LingPipe?

Check out the spelling tutorial, which goes through pruning.
Most of the tuning advice there doesn't apply to Chinese
word segmentation, but pruning does. Basically, you get
the underlying language model's sequence counter and prune
that.
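
To illustrate the idea only, here is a minimal sketch of count-based
pruning using plain java.util collections. It is not LingPipe's API
(the spelling tutorial shows the actual calls); the class, the method
names, and the minimum count of 2 are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;

class NGramPruneSketch {

    // Count every character n-gram of length 1..maxN in the text.
    // Works on UTF-16 code units, which is one unit per character for common Chinese text.
    static Map<String, Integer> countNGrams(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = 1; n <= maxN && start + n <= text.length(); n++) {
                counts.merge(text.substring(start, start + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Remove every n-gram seen fewer than minCount times (modifies the map in place).
    static void prune(Map<String, Integer> counts, int minCount) {
        counts.values().removeIf(c -> c < minCount);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countNGrams("abababc", 3);
        int before = counts.size();
        prune(counts, 2);
        System.out.println(before + " n-grams before pruning, " + counts.size() + " after");
    }
}
```

Raising the minimum count shrinks the model further, at the cost of
discarding rarer sequences, so it is worth re-checking accuracy after
pruning.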

> I have some Chinese articles and their English human translations. I want
> to match the Chinese tokens with their English tokens. I think I will run
> the LingPipe Chinese Word Segmentation and English Word Segmentation
> programs, then align those extracted tokens?

I've never actually tried any alignment programs, so I
don't know what to expect. Are you going to use an existing
MT system to do the alignment?

> If I run the Chinese Word Segmentation, I guess I have to generate a
> model using some training data first. The training data from
> icwb2-data.zip is Traditional Chinese. My Chinese articles are
> Simplified Chinese.

Peking University's and Microsoft Research's corpora are simplified;
Academia Sinica's and Hong Kong's are traditional. Here's
the official corpus description:

http://www.sighan.org/bakeoff2005/data/instructions.php.htm

You can find the tagging guidelines as PDFs here:

http://www.sighan.org/bakeoff2005/

(I'm afraid they're mostly in Chinese.)

> I found a Simplified Chinese corpus, LCMC.
> But the LCMC corpus format is an XML file with POS tags in it. Can LingPipe process
> it? Or do you have any other suggestions?

LingPipe doesn't have anything built in to process the
LCMC corpus, but as long as it can be converted
to text with spaces between the words, you can use
it to train LingPipe. We like to parse straight out
of the XML, but it may be easier to convert any
data you have to a simple text-based format and then
feed it into the training program for LingPipe
as described in the tutorial.
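
As a rough sketch of that conversion, the following uses only the
standard Java XML parser to write one sentence per line with spaces
between the words. It assumes sentences are marked with an "s" element
and words with a "w" element; those element names are assumptions that
should be checked against the actual LCMC files, and the class and
method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CorpusXmlToText {

    // Returns one line per sentence element, with the text of its word
    // elements joined by single spaces. POS attributes are ignored.
    static List<String> toSegmentedLines(String xml, String sentenceTag, String wordTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList sentences = doc.getElementsByTagName(sentenceTag);
        for (int i = 0; i < sentences.getLength(); i++) {
            NodeList words = ((Element) sentences.item(i)).getElementsByTagName(wordTag);
            List<String> tokens = new ArrayList<>();
            for (int j = 0; j < words.getLength(); j++) {
                tokens.add(words.item(j).getTextContent().trim());
            }
            lines.add(String.join(" ", tokens));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Tiny made-up sample: two words (Unicode escapes for two Chinese characters) in one sentence.
        String xml = "<text><s><w POS=\"r\">\u6211</w><w POS=\"v\">\u7231</w></s></text>";
        for (String line : toSegmentedLines(xml, "s", "w")) {
            System.out.println(line);
        }
    }
}
```

If the corpus marks punctuation or other tokens with a different
element, those would need to be collected as well so that the output
text matches the original character sequence.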

- Bob Carpenter
Alias-i










Thu Nov 20, 2008 3:36 pm

suelingpipe

Other messages in this thread:

Sue Chen (suelingpipe), Nov 14, 2008 6:11 pm
Hi Bob, I have a question on Model Quality. I used the ChineseToken sample to generate a words-zh-as.CompiledSpellChecker model, which has size 78,303KB.  I...

Bob Carpenter (colloquialdo...), Nov 15, 2008 2:08 am
... The other way to control model size is take longer n-grams and prune out low-count sequences. If you follow the tutorial, you'll see where we run standard...

Sue Chen (suelingpipe), Nov 18, 2008 7:49 pm
Hi Bob, Thanks for replying. Does a longer n-gram model mean more accuracy? How do I prune out low-count sequences from the model using LingPipe? I have some...

Bob Carpenter (colloquialdo...), Nov 19, 2008 6:12 pm
... Usually longer n-grams mean more accuracy, up to a point at which accuracy plateaus. Longer n-grams can overfit in some situations compared to shorter...

Sue Chen (suelingpipe), Nov 20, 2008 3:36 pm
Thanks, Bob. The goal of the English-Chinese word alignment is to create some TMX files for the "translation memory" tools used by translators. We have some MT...

Bob Carpenter (colloquialdo...), Nov 20, 2008 7:28 pm
... We did this for Chinese in the past by extending sentences.HeuristicSentenceModel with the appropriate end tokens for Chinese and using the...
