On November 12, 2008, suechen275 wrote:
> Hi, I am first time to use LingPipe. I tried ChineseToken sample
> that comes with LingPipe 3.6.0 with icwb2-data.zip. It doesn't work.
> It throws exception: ...
It sure does. Thanks for the detailed bug report.
The culprit is the following file:
$LINGPIPE/src/com/aliasi/spell/CompiledSpellChecker.java
The method setTokenSet() should be:
    public final void setTokenSet(Set<String> tokenSet) {
        if (tokenSet == null) return;
        int maxLen = 0;
        for (String token : tokenSet)
            maxLen = java.lang.Math.max(maxLen,token.length());
        mTokenSet = tokenSet;
        mTokenPrefixTrie = tokenSet == null ? null : prefixTrie(tokenSet);
    }
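For readers who want to see the guard in isolation, here is a minimal, self-contained sketch. It is not the LingPipe class: TokenSetHolder, prefixes(), and the accessor names are hypothetical stand-ins, and the prefix trie is modeled as a plain set of prefixes. It only illustrates what the early return does: a null argument leaves the existing state alone instead of being iterated over.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in, NOT com.aliasi.spell.CompiledSpellChecker.
// Only the null guard from the patched setter is modeled here.
final class TokenSetHolder {
    private Set<String> mTokenSet;       // null until a token set is supplied
    private Set<String> mTokenPrefixes;  // assumption: stands in for the prefix trie

    // Null-safe setter: a null argument returns before anything is iterated.
    void setTokenSet(Set<String> tokenSet) {
        if (tokenSet == null) return;
        mTokenSet = tokenSet;
        mTokenPrefixes = prefixes(tokenSet);
    }

    // Hypothetical helper: collects every non-empty prefix of every token.
    static Set<String> prefixes(Set<String> tokens) {
        Set<String> result = new HashSet<String>();
        for (String token : tokens)
            for (int end = 1; end <= token.length(); ++end)
                result.add(token.substring(0, end));
        return result;
    }

    Set<String> tokenSet() { return mTokenSet; }
    Set<String> tokenPrefixes() { return mTokenPrefixes; }

    public static void main(String[] args) {
        TokenSetHolder holder = new TokenSetHolder();
        holder.setTokenSet(null);  // no exception, state unchanged
        System.out.println("after null: " + holder.tokenSet());

        Set<String> tokens = new HashSet<String>();
        tokens.add("ab");
        holder.setTokenSet(tokens);
        System.out.println("prefix count: " + holder.tokenPrefixes().size());
    }
}
```

One consequence of this design is that passing null cannot clear a token set that was installed earlier; whether that matters depends on how the caller uses the setter.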
You can patch and recompile. Our next release will include the patch.

Here's the output of the revised program in the Chinese Tokens demo:
\lingpipe\demos\tutorial\chineseTokens>ant -Ddata.sighan05=\lingpipe-3.6.0\demos\data\sighan2005\dist run-cityu05
Buildfile: build.xml
compile:
[mkdir] Created dir: c:\carp\mycvs\lingpipe\demos\tutorial\chineseTokens\build\classes
[javac] Compiling 4 source files to c:\carp\mycvs\lingpipe\demos\tutorial\chineseTokens\build\classes
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
run-cityu05:
[java] CHINESE TOKENS 2005
[java] Data Zip File=c:\carp\test\lingpipe-3.6.0\demos\data\sighan2005\dist\icwb2-data.zip
[java] Corpus Name=cityu
[java] Output File Name=cityu.segments
[java] Known Tokens File Name=cityu.knownWords
[java] Max N-gram=5
[java] Lambda factor=5.0
[java] Num chars=5000
[java] Max n-best=1024
[java] Reading Data from entry=icwb2-data/training/cityu_training.utf8
[java] Found 53019 sentences.
[java] Found 4922 distinct characters.
[java] Found 69086 distinct tokens.
[java] Testing Results. Zip Entry=icwb2-data/gold/cityu_test_gold.utf8
[java] Found 1493 test sentences.
[java] Found 9001 test tokens.
[java] Found 1670 unknown test tokens.
[java] Found 2702 test characterss. Found 60 unknown test characters.
[java]
[java] Reference/Response Token Length Histogram
[java] Length, #REF, #RESP, Diff
[java] 1, 19115, 19541, 426
[java] 2, 18187, 17980, -207
[java] 3, 2682, 2459, -223
[java] 4, 759, 800, 41
[java] 5, 116, 158, 42
[java] 6, 36, 63, 27
[java] 7, 22, 24, 2
[java] 8, 9, 11, 2
[java] 9, 5, 4, -1
[java] Scores
[java] EndPoint: P=0.9686796936221845 R=0.9715039042693439 F=0.9700897434274647
[java] Chunk: P=0.9265060534457139 R=0.929108852843463 F=0.927805627721468
BUILD SUCCESSFUL
Total time: 59 seconds
(The other data sets had better performance, as indicated in the tutorial.)
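As a reading aid for the Scores lines, the logged F values are consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR/(P+R). The sketch below assumes that definition; FMeasureSketch and f1 are hypothetical names, not LingPipe API. It recomputes F from the logged P and R for both the EndPoint and Chunk rows.

```java
// Hypothetical helper class, not part of LingPipe.
final class FMeasureSketch {

    // Balanced F-measure: harmonic mean of precision and recall.
    // Returns 0.0 when both inputs are zero to avoid dividing by zero.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // P and R copied from the cityu run logged above.
        System.out.println("EndPoint F=" + f1(0.9686796936221845, 0.9715039042693439));
        System.out.println("Chunk    F=" + f1(0.9265060534457139, 0.929108852843463));
    }
}
```

The recomputed values should agree closely with the logged F figures; any difference in the last digit or two would come from floating-point evaluation order.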
- Bob Carpenter
Alias-i