Re: [LingPipe] Chinese Model Quality
> I ran the Chinese Word Demo using a random Chinese article. I see
> some differences in the token output. How can I tell if a
> 5-gram model is better than a 4-gram model? What is the rule to
> determine that?
The other way to control model size is to take
longer n-grams and prune out low-count sequences.
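To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not LingPipe code: the function names and the min_count threshold are made up for illustration, and a real language model would also have to keep its counts consistent after pruning.

```python
from collections import Counter


def count_ngrams(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def prune(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


# Toy data: long n-grams are collected, then the rare ones are dropped.
counts = count_ngrams("abababcabab", max_n=5)
pruned = prune(counts, min_count=2)
print(len(counts), "n-grams before pruning,", len(pruned), "after")
```

The trade-off is that frequent long sequences survive while the long tail of one-off sequences, which makes up most of the model size, is dropped.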
If you follow the tutorial, you'll see where
we run standard evaluations. Basically,
there's labeled data and we evaluate whether
the segmentations we pull out match the
annotations on a per-word or per-segmentation
basis.
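For a rough idea of what such an evaluation computes, here is a small sketch, again plain Python rather than the tutorial's actual evaluator. It assumes each segmentation is given as a list of words over the same underlying text, scores a word as correct only if both of its boundaries match the annotation, and counts a segmentation as correct only if it matches exactly.

```python
def word_spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def evaluate(reference, response):
    """Per-word precision, recall, and F1, plus a per-segmentation exact match."""
    ref, resp = word_spans(reference), word_spans(response)
    correct = len(ref & resp)
    precision = correct / len(resp) if resp else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1, reference == response


# Toy example: the response wrongly splits the middle word in two.
reference = ["我们", "喜欢", "北京"]
response = ["我们", "喜", "欢", "北京"]
p, r, f, exact = evaluate(reference, response)
```

Running this for a 4-gram model and a 5-gram model on the same held-out annotated data is one way to compare them.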
Warning 1: the evaluations are on
text similar to the training data. When you
go out to wild text, results will almost
certainly be worse. Behavior on unknown words
is also worse than on words in the training set.
If you read Chinese (we don't), you might
want to run the kind of text you're interested
in through it and see what the results look
like. (I'd be interested in the results.)
You can also run our tokenizer in n-best mode and
get more than one possible segmentation if you have
some other way to disambiguate. For search,
that might be useful to create the index.
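One way the n-best output might feed a search index is sketched below. This is only an assumption about how you might use it: the n-best list is passed in as plain lists of words rather than pulled from our tokenizer, and index_nbest is a made-up helper.

```python
from collections import defaultdict


def index_nbest(index, doc_id, segmentations):
    """Add every word from every candidate segmentation to an inverted index."""
    for words in segmentations:
        for w in words:
            index[w].add(doc_id)


# Hypothetical 2-best output for one short document.
index = defaultdict(set)
index_nbest(index, "doc1", [["北京", "大学"], ["北京大学"]])
```

Indexing the union of candidates trades some precision for recall: a query for either the long word or its parts will find the document.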
Warning 2: each corpus uses either traditional
or simplified characters, but not both.
A general app would need to detect and then
apply the right model or would have to build
some kind of combined model (just train on both
data sets -- it should work, but I'm not sure
how much overlap there is and in cases where
there is overlap, whether tokenization standards
differ). For the detection step, you could identify
the character set the way we describe in the language ID
tutorial, and then run the matching character-set
specific model.
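As a rough illustration of detect-then-dispatch, here is a sketch that is much cruder than the language ID tutorial's approach. It just counts a handful of characters that differ between the two scripts; the marker lists, the function names, and the models mapping are all assumptions for illustration, and a real app would want a trained classifier.

```python
# A few common characters whose simplified and traditional forms differ.
SIMPLIFIED_MARKERS = set("们这国说时来对会发学")
TRADITIONAL_MARKERS = set("們這國說時來對會發學")


def detect_script(text):
    """Guess 'simplified', 'traditional', or 'unknown' from marker counts."""
    simp = sum(1 for ch in text if ch in SIMPLIFIED_MARKERS)
    trad = sum(1 for ch in text if ch in TRADITIONAL_MARKERS)
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    return "unknown"


def segment(text, models, default="simplified"):
    """Run the segmenter registered for the detected script.

    models maps a script name to any callable taking text and returning words.
    """
    script = detect_script(text)
    return models.get(script, models[default])(text)
```

Here models would hold one segmenter per corpus, for example one trained on a simplified corpus and one on a traditional corpus.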
- Bob Carpenter
Alias-i