Thanks much, Bob. The patch fixed my problem.
________________________________
From: Bob Carpenter <carp@...>
To: LingPipe@yahoogroups.com
Sent: Thursday, November 13, 2008 4:12:18 PM
Subject: [LingPipe] Re: Chinese Token Demo Bug patch
(CompiledSpellChecker.setTokenSet)
On November 12, 2008, suechen275 wrote:
> Hi, this is my first time using LingPipe. I tried the ChineseToken sample
> that comes with LingPipe 3.6.0 with icwb2-data.zip. It doesn't work.
> It throws exception: ...
It sure does. Thanks for the detailed bug report.
The culprit is the following file:
$LINGPIPE/src/com/aliasi/spell/CompiledSpellChecker.java
The method setTokenSet() should be:
public final void setTokenSet(Set<String> tokenSet) {
    if (tokenSet == null) return;
    int maxLen = 0;
    for (String token : tokenSet)
        maxLen = java.lang.Math.max(maxLen, token.length());
    mTokenSet = tokenSet;
    mTokenPrefixTrie = tokenSet == null ? null : prefixTrie(tokenSet);
}
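As a minimal, self-contained sketch of what the added null check guards against: a for-each loop over a null Set throws a NullPointerException, while the early return skips the loop entirely. The class and method names below (TokenSetGuardSketch, maxLenUnguarded, maxLenGuarded) are hypothetical and not part of LingPipe, and the exception elided in the original report may or may not be this one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not LingPipe code.
class TokenSetGuardSketch {

    // Loop without a null check: iterating a null Set throws
    // NullPointerException.
    static int maxLenUnguarded(Set<String> tokenSet) {
        int maxLen = 0;
        for (String token : tokenSet)
            maxLen = Math.max(maxLen, token.length());
        return maxLen;
    }

    // Same loop with an early return for null, mirroring the guard in
    // the patch. Returning 0 for null is an assumption of this sketch.
    static int maxLenGuarded(Set<String> tokenSet) {
        if (tokenSet == null) return 0;
        int maxLen = 0;
        for (String token : tokenSet)
            maxLen = Math.max(maxLen, token.length());
        return maxLen;
    }

    public static void main(String[] args) {
        Set<String> tokens =
            new HashSet<String>(Arrays.asList("ab", "abcd", "c"));
        System.out.println(maxLenGuarded(tokens));
        System.out.println(maxLenGuarded(null));
        try {
            maxLenUnguarded(null);
        } catch (NullPointerException e) {
            System.out.println("unguarded loop threw NullPointerException");
        }
    }
}
```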
You can patch and recompile. Our next release will
include the patch.
Here's the output of the revised program in the
Chinese Tokens demo:
\lingpipe\demos\tutorial\chineseTokens> ant -Ddata.sighan05=\lingpipe-3.6.0\demos\data\sighan2005\dist run-cityu05
Buildfile: build.xml
compile:
[mkdir] Created dir: c:\carp\mycvs\lingpipe\demos\tutorial\chineseTokens\build\classes
[javac] Compiling 4 source files to c:\carp\mycvs\lingpipe\demos\tutorial\chineseTokens\build\classes
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
run-cityu05:
[java] CHINESE TOKENS 2005
[java] Data Zip File=c:\carp\test\lingpipe-3..6.0\demos\data\sighan2005\dist\icwb2-data.zip
[java] Corpus Name=cityu
[java] Output File Name=cityu.segments
[java] Known Tokens File Name=cityu.knownWords
[java] Max N-gram=5
[java] Lambda factor=5.0
[java] Num chars=5000
[java] Max n-best=1024
[java] Reading Data from entry=icwb2-data/training/cityu_training.utf8
[java] Found 53019 sentences.
[java] Found 4922 distinct characters.
[java] Found 69086 distinct tokens.
[java] Testing Results. Zip Entry=icwb2-data/gold/cityu_test_gold.utf8
[java] Found 1493 test sentences.
[java] Found 9001 test tokens.
[java] Found 1670 unknown test tokens.
[java] Found 2702 test characters. Found 60 unknown test characters.
[java]
[java] Reference/Response Token Length Histogram
[java] Length, #REF, #RESP, Diff
[java] 1, 19115, 19541, 426
[java] 2, 18187, 17980, -207
[java] 3, 2682, 2459, -223
[java] 4, 759, 800, 41
[java] 5, 116, 158, 42
[java] 6, 36, 63, 27
[java] 7, 22, 24, 2
[java] 8, 9, 11, 2
[java] 9, 5, 4, -1
[java] Scores
[java] EndPoint: P=0.9686796936221845 R=0.9715039042693439 F=0.9700897434274647
[java] Chunk: P=0.9265060534457139 R=0.929108852843463 F=0.927805627721468
BUILD SUCCESSFUL
Total time: 59 seconds
(The other data sets had better performance, as
indicated in the tutorial).
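For reading the Scores lines in the output, the reported F values are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes F from the P and R values printed above. The class and method names (FMeasureSketch, f1) are hypothetical, and that the demo computes F exactly this way is an assumption based only on the numbers agreeing.

```java
// Hypothetical illustration only; not LingPipe code.
class FMeasureSketch {

    // Balanced F-measure: harmonic mean of precision and recall.
    // Returns 0 when both inputs are 0 to avoid dividing by zero.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // P and R copied from the EndPoint and Chunk lines of the output.
        System.out.println("EndPoint F="
            + f1(0.9686796936221845, 0.9715039042693439));
        System.out.println("Chunk F="
            + f1(0.9265060534457139, 0.929108852843463));
    }
}
```

The recomputed values should agree with the printed F scores up to floating-point rounding.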
- Bob Carpenter
Alias-i