Thanks, Bob.
The goal of the English-Chinese word alignment is to create TMX files
for the "translation memory" tools used by translators.
We have some MT systems, such as Language Weaver, Systran, and AppTek.
I see LingPipe has functionality to extract English sentences from text, but it
doesn't have this for Chinese. Do you know of any tools that do this for Chinese?
Does LingPipe do chunking for English sentences?
For example,
[The bald man] [was sitting] [on his suitcase].
thanks,
Sue
________________________________
From: Bob Carpenter <carp@...>
To: LingPipe@yahoogroups.com
Sent: Wednesday, November 19, 2008 1:12:53 PM
Subject: Re: [LingPipe] Chinese Model Quality
Sue Chen wrote:
>
>
> Hi Bob, Thanks for replying.
> Does a longer n-gram model mean more accuracy?
Usually longer n-grams mean more accuracy, up to a point
at which accuracy plateaus. In some situations longer n-grams
can overfit compared to shorter ones and perform less
well on new data. That's why you have to test how they
work in practice. It probably won't make a huge difference
for this task.
> How do I prune out
> low-count sequences from model using LingPipe?
Check out the spelling tutorial, which goes through pruning.
Most of the tuning advice there doesn't apply to Chinese
word segmentation, but pruning does. Basically, you get
the underlying language model's sequence counter and prune
that.
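
To make the idea of pruning concrete, here is a minimal sketch in plain
Python. It is not LingPipe's API; the function names count_char_ngrams and
prune are made up for illustration. It counts character n-grams up to a
maximum length and then drops every sequence seen fewer than a minimum
number of times, which is the general effect pruning a sequence counter has.

```python
from collections import Counter

def count_char_ngrams(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def prune(counts, min_count):
    """Return a new counter keeping only n-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy data; a real model would be trained on a segmented corpus.
counts = count_char_ngrams("abcabcabd", max_n=3)
pruned = prune(counts, min_count=2)
```

Raising min_count shrinks the model; whether accuracy suffers is something
to measure on held-out data rather than assume.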
> I have some Chinese articles and their English human translations. I want
> to match the Chinese tokens with their English tokens. I think I will run
> the LingPipe Chinese Word Segmentation and English Word Segmentation
> programs, then align the extracted tokens?
I've never actually tried any alignment programs, so I
don't know what to expect. Are you going to use an existing
MT system to do the alignment?
> If I run the Chinese Word Segmentation, I guess I have to generate a
> model using some training data first. The training data from
> icwb2-data.zip is Traditional Chinese. My Chinese articles are
> Simplified Chinese.
Peking Uni's and Microsoft Research corpora are simplified,
Academia Sinica and Hong Kong are traditional. Here's
the official corpus description:
http://www.sighan.org/bakeoff2005/data/instructions.php.htm
You can find the tagging guidelines as PDFs here:
http://www.sighan.org/bakeoff2005/
(I'm afraid they're mostly in Chinese.)
> I found a Simplified Chinese corpus, LCMC.
> But the LCMC corpus format is XML files with POS tags in them. Can LingPipe
> process it? Or do you have any other suggestions?
LingPipe doesn't have anything built-in to process the
LCMC corpus, but as long as it could be converted
to text with spaces between the words, you could use
it to train LingPipe. We like to parse straight out
of the XML, but it may be easier to just convert any
data you have to a simple text-based format and then
just feed it into the training program for LingPipe
as described in the tutorial.
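
As a rough illustration of that conversion, here is a small Python sketch
using only the standard library. The element names are assumptions: it
supposes each sentence is an <s> element containing <w> (word) and <c>
(punctuation) elements with a POS attribute, and the sample data is made up.
Check the tag names against the actual corpus files before relying on it.

```python
import xml.etree.ElementTree as ET

# Made-up sample in an assumed layout; real corpus files may differ.
SAMPLE = """<text>
  <s n="0001"><w POS="n">中国</w><w POS="n">经济</w><w POS="v">发展</w><c POS="ew">。</c></s>
  <s n="0002"><w POS="r">我们</w><w POS="v">学习</w><w POS="n">中文</w></s>
</text>"""

def sentences_to_lines(xml_string, sentence_tag="s", token_tags=("w", "c")):
    """Return one space-separated line of tokens per sentence, dropping POS tags."""
    root = ET.fromstring(xml_string)
    lines = []
    for sent in root.iter(sentence_tag):
        tokens = [
            el.text.strip()
            for el in sent.iter()
            if el.tag in token_tags and el.text and el.text.strip()
        ]
        if tokens:
            lines.append(" ".join(tokens))
    return lines

lines = sentences_to_lines(SAMPLE)
```

Writing the resulting lines to a UTF-8 text file gives the "words separated
by spaces" format described above.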
- Bob Carpenter
Alias-i