LingPipe

Group Information

  • Members: 471
  • Category: Open Source
  • Founded: Oct 8, 2003
  • Language: English

Messages

Messages 198 - 233 of 1478

#198 From: "gdcaven" <gdcaven@...>
Date: Thu Jan 5, 2006 2:56 am
Subject: dealing with Chinese English Learners Corpus
 
I tried using LingPipe on the Chinese English Learners Corpus
(http://www.clal.org.cn/corpus/EngSearchEngine.aspx). The results
were very disappointing. How can I improve its performance on CELC?

Thanks.

#200 From: Otis Gospodnetić <otis_gospodnetic@...>
Date: Fri Jan 13, 2006 3:10 am
Subject: Classification for language detection?
 
Hello,

I used the classification functionality to detect sentiment as
described in LingPipe tutorials.  I haven't tried this yet, but just
like I can train and create a language model to determine the
sentiment in a piece of text, can I train the model with text from
different languages, and then use the classifier for language
detection purposes?

Thanks,
Otis

#201 From: Otis Gospodnetić <otis_gospodnetic@...>
Date: Fri Jan 13, 2006 3:20 am
Subject: Re: Named Entity Extraction
 
Hi Bob,

> In January, we'll roll out LingPipe 2.2, which introduces
> a new API for dealing with entities and allows n-best
> output (with conditional probabilities) and also extraction
> of n-best entities in order of posterior confidence.  One
> of the things I need to do before the release is write
> another tutorial!  The old API won't go away, because
> it's still faster than the new one.

What does n-best mean in this context?  Does it simply mean that the
library will be able to return N possible entities ordered by their
probability of being correct, as opposed to returning just one entity,
which may or may not be correct?

Thanks,
Otis

#202 From: "Bob Carpenter" <carp@...>
Date: Fri Jan 13, 2006 7:11 pm
Subject: Re: Classification for language detection?
 
Indeed.  You can also do it to spot topics.
We're using it to disambiguate genes by
looking at the context.

I'm going to write a tutorial in the next
month for language ID.  The classifiers
only work on characters, but if you cast
bytes to char, they could also work on
raw byte streams.   That way, you can
also detect the character encoding.
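
A minimal sketch of the byte-to-char idea, independent of LingPipe's API.
The class and method names are hypothetical.  It masks each byte to its
unsigned value, which is one reasonable reading of "cast bytes to char";
a plain (char) cast would sign-extend negative bytes instead.

```java
final class ByteChars {
    // Map each byte to the char with the same unsigned value (0..255),
    // so a character-level model can be trained on raw bytes.
    static char[] toChars(byte[] bytes) {
        char[] cs = new char[bytes.length];
        for (int i = 0; i < bytes.length; ++i) {
            cs[i] = (char) (bytes[i] & 0xFF);
        }
        return cs;
    }
}
```

The resulting char array can be wrapped in a String and handed to any
classifier that accepts character sequences.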

- Bob


#203 From: "Bob Carpenter" <carp@...>
Date: Tue Jan 17, 2006 4:36 pm
Subject: Re: Re: Named Entity Extraction
 
> What does n-best mean in this context?  Does it simply mean that the
> library will be able to return N possible entities ordered by their
> probability of being correct, as opposed to returning just one entity,
> which may or may not be correct?

I hope this answer isn't more confusing than the first
go round.  Sorry for any smoke it causes in your
brain.

That's what I'm calling confidence-based entity detection -- it
just returns an iterator, though -- you don't even need to fix
an upper bound on N, though doing so can save memory.

N-best gives you the top n analyses for the entire input sequence.
You can see an example in the POS demo:

      INPUT> This correlation was also confirmed by detection of early carcinoma.
      ...
      N BEST
      #   JointLogProb         Analysis
      0     -90.265  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
      1     -94.072  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVD  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
      2     -99.905  This_PND  correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
      3    -101.574  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_RR   carcinoma_NN   ._.
      4    -102.253  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_NN   carcinoma_NN   ._.

The differences are subtle -- for instance, hypotheses 0 and 1
differ only in the tag for the verb "confirmed".  The joint log
probabilities are on a log 2 scale, so the gap of about 94-90=4
means hypothesis 0 is about 2**4=16 times more likely than hypothesis 1.
So you get posterior confidence estimates here, too.
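
The arithmetic can be sketched as follows, assuming (as stated above)
that the scores are base-2 joint log probabilities.  The class and
method names are made up for illustration.

```java
final class NBestOdds {
    // Ratio p(a) / p(b), given the base-2 log probabilities of a and b.
    static double ratio(double log2ProbA, double log2ProbB) {
        return Math.pow(2.0, log2ProbA - log2ProbB);
    }
}
```

With the unrounded scores from the table, NBestOdds.ratio(-90.265, -94.072)
is 2**3.807, or roughly 14, which is in line with the rounded estimate of 16.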

Sequence n-best is useful if you want to do further analyses of a whole
input sequence, such as parsing.  Parsing would relate "correlation"
as the subject of "confirmed" -- a relation that's beyond the scope
of our part-of-speech taggers.

N-best is also good for rescoring.  And you know from
search there's a whole cottage industry of folks fiddling
with this kind of stuff.  For instance, you might want to relate what you
found earlier in a sentence to what you found later in a sentence,
(e.g. to infer that Sun is a company rather than a planet)
and it's too costly computationally to do this for all possible
analyses, so you just look at the N best.
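
As a rough sketch of rescoring (not LingPipe code; every name here is
hypothetical), you can add an adjustment from some second model to each
hypothesis's log probability and pick the new best one:

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

final class NBestRescorer {
    // Returns the index of the analysis maximizing
    // log2Probs[i] + adjustment(analysis i).
    static int best(List<String> analyses, double[] log2Probs,
                    ToDoubleFunction<String> adjustment) {
        if (analyses.isEmpty() || analyses.size() != log2Probs.length) {
            throw new IllegalArgumentException("need one score per analysis");
        }
        int bestIndex = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < log2Probs.length; ++i) {
            double score = log2Probs[i]
                + adjustment.applyAsDouble(analyses.get(i));
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

The adjustment function only ever sees the N analyses in the list, which
is what keeps an expensive second-pass model affordable.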

There are some very cool HMM decoding and search algorithms
that can be composed to implement these -- Viterbi and
forward-backward HMM decoding and A* [best first with
completion prediction] search.

- Bob

PS  Breck's back from skiing tomorrow (Wednesday), and I'll
be back in Williamsburg.

#204 From: "carl_t_white" <carl_t_white@...>
Date: Mon Jan 30, 2006 5:25 pm
Subject: Character vs. Word tokens with DynamicLMClassifier
 
Can anyone explain to me the benefit of using a character-based NGram
tokenizer for the DynamicLMClassifier's language model over a
word-based one? I notice that the ClassifyNews.java example uses the
default character-based one, which I find surprising. Wouldn't a
language model that consists of actual words be of superior use than
one that just uses character chunks? Or am I misunderstanding the
meaning of the character-based language model?

I tried changing it to use a word-based language model by changing the
line from:

DynamicLMClassifier classifier = new
DynamicLMClassifier(CATEGORIES,NGRAM_SIZE,BOUNDED);

to:

DynamicLMClassifier classifier = new DynamicLMClassifier(CATEGORIES,
1, new com.aliasi.tokenizer.IndoEuropeanTokenizerFactory ());

and I was surprised to see the accuracy of the demo drop from
0.9861111111111112 to 0.9583333333333334. Can anyone explain why?

#205 From: carp@...
Date: Mon Jan 30, 2006 10:31 pm
Subject: Re: Character vs. Word tokens with DynamicLMClassifier
 
carl_t_white wrote:
> Can anyone explain to me the benefit of using a character-based NGram
> tokenizer for the DynamicLMClassifier's language model over a
> word-based one?

> I notice that the ClassifyNews.java example uses the
> default character-based one, which I find surprising.

In almost all of our large-scale experiments, character-based models
have outperformed token-based classifiers.   And they're easier
to use, which is why we recommend them as the best first-choice
for text classification problems of any domain and any language.

And it's not just us.  Check out the work derived from Fuchun
Peng's dissertation (at Waterloo), such as this short
paper:

Fuchun Peng, Dale Schuurmans and Shaojun Wang;
Language and Task Independent Text Categorization with Simple Language Models.
HLT 2003.
http://www.cs.umass.edu/~fuchun/publication/HLT-NAACL03.pdf

> Wouldn't a
> language model that consists of actual words be of superior use than
> one that just uses character chunks?

In general, no.

> Or am I misunderstanding the
> meaning of the character-based language model?

I'm not sure what your understanding is, but the
basic idea is that they predict the next character
given the previous N characters rather than predicting
the next word given the previous M words.  N can be much
bigger than M for models of the same size on disk.

The two main advantages of character models are
that (1) you don't have to worry about tokenization
issues (especially problematic in bio-medical text
and languages like Chinese that are written without
spaces), and (2) they're more robust to alternative
spellings, morphology, etc.  For instance, a character
language model trained on "language" matches "languages"
pretty well.  Or one trained on "p-53" might match "p53"
pretty well, whereas the tokens "p" and "53" don't
match the token "p53".
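
To make the robustness point concrete, here is a crude overlap count.
It is not how LingPipe scores text (that uses smoothed n-gram
probabilities); it only counts what fraction of one string's character
n-grams were seen in another.  All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class CharNGramOverlap {
    // All distinct character n-grams of s.
    static Set<String> ngrams(String s, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= s.length(); ++i) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    // Fraction of the distinct n-grams of test that also occur in train.
    static double coverage(String train, String test, int n) {
        Set<String> testGrams = ngrams(test, n);
        if (testGrams.isEmpty()) {
            return 0.0;
        }
        Set<String> trainGrams = ngrams(train, n);
        int shared = 0;
        for (String g : testGrams) {
            if (trainGrams.contains(g)) {
                ++shared;
            }
        }
        return (double) shared / testGrams.size();
    }
}
```

Training on "language" covers 6 of the 7 trigrams in "languages", and
training on "p-53" covers 1 of the 2 bigrams in "p53", whereas an exact
token match gives nothing in either case.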

Having said this, our tokenized models are actually
smoothed with per-token character-level models.  (This
is very different smoothing than is typically described
in the textbooks or found in the SRI-Cambridge LM toolkit.)
Thus if you haven't seen "p53" in training, you use models
of other character sequences you have seen.  Note that
with smoothing of tokenized models to character models,
the character models have no context (e.g. the character
language models can use "human la" to predict the next "n"
character, whereas the tokenized models can't if they
haven't seen the token "languages").

To get around issue (2) with tokenized models, people
tend to use stemming (such as the Porter stemmer
provided with LingPipe), but that causes problems with
removing significant suffixes.  And stemmers are hard to get
for most other languages.  To get around (1), people
do things like build Chinese tokenizers (as we do in
another tutorial).

The main disadvantage with character-level models is
that they typically take up more space.  To that end,
I've spent a lot of time optimizing them (as reported
in a workshop paper from the 2005 ACL conference).

> I tried changing it to use a word-based language model by changing the
> line from:
>
> DynamicLMClassifier classifier = new
> DynamicLMClassifier(CATEGORIES,NGRAM_SIZE,BOUNDED);
>
> to:
>
> DynamicLMClassifier classifier = new DynamicLMClassifier(CATEGORIES,
> 1, new com.aliasi.tokenizer.IndoEuropeanTokenizerFactory ());
>
> and I was surprised to see the accuracy of the demo drop from
> 0.9861111111111112 to 0.9583333333333334. Can anyone explain why?

A tokenized language model with length 1 token is what's
called a Naive Bayes model (naive in the sense that it
doesn't use context).  With n-gram length 2 for a tokenized
model, the tokenized classifier has exactly the same performance
as our character models: .986111.  This is very rare to
find in an evaluation, but this is a toy dataset, so don't
try to draw too many conclusions.

With char LM:
  length=2,  accuracy=.82
  length=3,  accuracy=.94444
  length=4,  accuracy=.9722
  length=5+, accuracy=.986111

With token LM:
  length=1,  accuracy=.95833
  length=2+, accuracy=.986111

So you can't draw any conclusions from this data
on character vs. token LMs (at least our tokenized
LMs with character-level LM smoothing for unknown
tokens).

We tried to use smoothing that won't hurt you if the models
are too long, and that seems to work in that anything
from 5 to 10-gram character models and 2- to 5-gram
token models had the same performance.

We've seen improvements going up to character 10- or 12-grams
in some cases, though character 8-grams and token
3-grams are usually enough unless there are multiple
gigabytes of training data.

Another reason character language models are nice is that if
you use 8-grams or 10-grams, they tend to cover multi-word
stretches of short words and one or two word stretches of
long words.  In that way, they're kind of like variable-length
token n-grams.

I hope that helps.  This is a big issue and there's
a lot to say about it both theoretically and practically.

- Bob

PS:  Let me also try to clear up
some possible terminological confusions.

Tokenization involves breaking an input character
sequence down into (typically word-like) chunks.
Tokenized language models can build n-gram language
models over arbitrary tokenizers.

We've provided two implementations of the classification
interface for language models -- one with tokenized
language models and one with character language models.

There is a tokenizer that considers each (non-space)
character its own token.  We've used that mainly for
interfaces like named-entity extraction that require
tokenization (such as temporal entity extraction in Chinese).
We would not recommend running tokenized language models with
character tokenizers. It'd have roughly the same result as the
direct character language model implementation, but would be
much less time and space efficient.

(The sample code in the question above does this all the
right way, by the way.)

#206 From: "carl_t_white" <carl_t_white@...>
Date: Mon Jan 30, 2006 11:16 pm
Subject: Re: Character vs. Word tokens with DynamicLMClassifier
 
Bob-

Thank you for the clear and thoughtful response. I suppose it was just
intuitive to me that a word-based language model would be much better
at classifying text than a character-based model, but your explanation
makes a lot of sense. I believe that speech recognition applications
use word-based language models for predicting word sequences, but of
course they have some different constraints than text processing does.

My other question is: how do you establish which character count you
should use for a particular classification application? I assume there
is no "magic formula" that takes as inputs the amount of data you have
for a corpus and things like the frequency of certain words and spits
out the "right" NGram count to use for the language model. So is the
only real way to determine the best NGram count just through trial and
error experimentation?

Final question: is there any quick and concise definition of the
"bounded" parameter for the DynamicLMClassifier? The javadoc suggests
that it should just be set to false, but I couldn't find any further
description of the parameter.

#207 From: carp@...
Date: Mon Jan 30, 2006 11:58 pm
Subject: Re: Re: Character vs. Word tokens with DynamicLMClassifier
 
> I believe that speech recognition applications
> use word-based language models for predicting word sequences, but of
> course they have some different constraints than text processing does.

I spent a few years in the speech recognition world,
and the answer is that most large-vocabulary unconstrained
dictation systems use word-level language models.
Typically bigrams if they need to be fast and small
and trigrams if they need to be more accurate.

These are often combined with phonotactic models,
which are like character language models, except
that they run over sounds.  These are often linked
into pronunciation models.

Some really cool speech search engines model syllables
or phonemes directly, but I don't know of any
commercial systems doing that.  The advantage is that
you don't need a fixed vocabulary.  But given that
per-phoneme accuracy for large-vocab unconstrained
systems is about 70%, the combinatorics of approaching
99% coverage are pretty staggering.  Accuracy is better
on single-speaker trained systems like desktop dictation
systems.

They're also often combined with grammatical or
database constraints (if you call Delta airlines,
there's a bias in recognition toward cities they
fly to, for instance, and you can't say "from X from Y"
or other non-sensical things).

> My other question is: how do you establish which character count you
> should use for a particular classification application? I assume there
> is no "magic formula" that takes as inputs the amount of data you have
> for a corpus and things like the frequency of certain words and spits
> out the "right" NGram count to use for the language model. So is the
> only real way to determine the best NGram count just through trial and
> error experimentation?

Yes, but we prefer the term "empirical" to "trial and error" :-)
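
One way to organize that empirical search is sketched below.  It assumes
you supply a function that trains a model with a given n-gram length and
returns its accuracy on held-out data; that function and all names here
are hypothetical.

```java
import java.util.function.IntToDoubleFunction;

final class NGramSweep {
    // Returns the smallest n in [minN, maxN] with the highest
    // held-out accuracy.
    static int bestLength(int minN, int maxN,
                          IntToDoubleFunction heldOutAccuracy) {
        int best = minN;
        double bestAcc = heldOutAccuracy.applyAsDouble(minN);
        for (int n = minN + 1; n <= maxN; ++n) {
            double acc = heldOutAccuracy.applyAsDouble(n);
            if (acc > bestAcc) { // strict: ties keep the shorter model
                bestAcc = acc;
                best = n;
            }
        }
        return best;
    }
}
```

Preferring the shortest length on ties is a design choice: shorter models
are smaller, and it matches the observation earlier in this thread that
accuracy can plateau once n is large enough.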

> Final question: is there any quick and concise definition of the
> "bounded" parameter for the DynamicLMClassifier? The javadoc suggests
> that it should just be set to false, but I couldn't find any further
> description of the parameter.

Sorry about that.  It's pretty confusing.  I should've
used a DynamicLM factory and not tried to unfold the
constructor parameters into the classifier class.

The relevant javadoc for the com.aliasi.classify.DynamicLMClassifier
constructor is:

       * @param boundSequences Set to <code>true</code> to use a bounded
       * sequence character language model and <code>false</code> to use
       * a process language model.

I just added a link to where you'll find the sequence and bounded
models described.  The bounded ones are at:

http://www.alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramBoundaryLM.html

The doc's in the LM package.  There are two different
character NGram models -- bounded/boundary and process.
The basic idea is that the boundary model inserts special
begin/end-sequence characters and also uses those.

For something at the word level and often the sentence
level, boundedness can help.  For instance,
a process model would look at "unhappily" and predict
the initial "u" without any context and would not
treat the final "ily" specially.  A bounded
model uses a special unprintable character (let's call
it '#' so I can write it), and models  #unhappily#
(not predicting first marker but predicting second
marker to make the normalization work out right).
This helps predict that the word is an adverb
because "un" is a likely prefix for
adjectives or adverbs, and "ly" is a likely suffix
for adverbs.  All of this helps predict categories
for words, which is why we use bounded models for our
part-of-speech tagger's models of words given
a syntactic category.
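
The difference in available contexts can be seen by listing the n-grams
each kind of model gets to count.  This is only an illustration with
made-up names, using a printable '#' in place of the unprintable marker,
and it says nothing about the probability estimates themselves.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundaryNGrams {
    // Character n-grams of word, optionally wrapped in '#' markers
    // the way a bounded model wraps each sequence.
    static List<String> ngrams(String word, int n, boolean bounded) {
        String s = bounded ? "#" + word + "#" : word;
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); ++i) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }
}
```

For "unhappily" with n=3, the bounded version adds "#un" at the start
and "ly#" at the end, which are exactly the prefix and suffix cues
described above; the unbounded version starts at "unh" and stops at "ily".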

And this generalizes, too.  If I want to find a name,
I know "son", "ski" and "vitch" are common suffixes,
whereas if I want to find places, I know
"ville" and "borough" are common suffixes.
Of course, these are only statistical
tendencies, not hard and fast rules.

You'll find that the standard textbooks/survey papers hide
this important distinction and often get the normalizations
wrong (to make probabilities sum to 1.0).  They usually
only discuss process models (so-called because they're
a kind of random process).

If you want more detail at the mathematical level,
check out my ACL workshop paper from last year:

Bob Carpenter. 2005. Scaling High-Order Character Language
Models to Gigabytes.  In Proceedings of the Association
for Computational Linguistics Workshop on Software. Ann Arbor.

http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf

- Bob

#208 From: eduard barbu <eduard_barbu@...>
Date: Tue Jan 31, 2006 2:07 pm
Subject: Multilabel categories!
 
Do any of the classification algorithms implemented in LingPipe allow for multilabel categories (that is, overlapping categories)?

Regards!
Eduard



#209 From: carp@...
Date: Tue Jan 31, 2006 6:50 pm
Subject: Re: Multilabel (and hierarchical) categories!
 
eduard barbu wrote:
> Do any of the classification algorithms implemented in LingPipe allow
> for multilabel categories (that is, overlapping categories)?

Not directly.

The best we can offer out of the box is
running a whole bunch of one-vs-all
classifiers.  This is fine if the classes
are independent.
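
A minimal sketch of the one-vs-all arrangement, with each binary
classifier stood in for by a yes/no predicate.  This is not LingPipe
API; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class OneVsAll {
    // One independent accept/reject decision per category.
    private final Map<String, Predicate<String>> detectors =
        new LinkedHashMap<>();

    void add(String category, Predicate<String> accepts) {
        detectors.put(category, accepts);
    }

    // Every category whose detector accepts the input; may be empty.
    List<String> labels(String input) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> e : detectors.entrySet()) {
            if (e.getValue().test(input)) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

In practice each predicate would wrap a two-category classifier
(the category vs. everything else) trained on its own.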

If the classes are dependent, then you
have a much trickier classification problem,
because you want to model dependencies.  A
simple kind of dependency is a disjoint hierarchical
one that says that each document is a member
of exactly one leaf in a single inheritance
hierarchy.  Then if you're a member of a subclass,
you're also a member of all of its superclasses.
You can train superclasses on all of their
subclasses' data.  Then you can decode top-down
starting at the root of the hierarchy, choosing
among the root's daughters, then among the chosen
daughter's daughters, and so on.  In a fake example, you might have
three top level classes, A, B and C, of which
A has subclasses a1-a3 and B has subclasses
b1-b2 and C has no subclasses.  The goal is
to determine whether an input is an a1, a2, a3,
b1, b2 or C.

    A
       a1
       a2
       a3
    B
       b1
       b2
    C

You build three classifiers:

1) [A,B,C]
2) [a1,a2,a3]
3) [b1,b2]

You label each instance and if it's
a1, you train class a1 in (2) and
class A in (1).  If it's b2, you
train class b2 in (3) and B in (1).

You then classify an input using
classifier (1).  If the result is
A, then you apply classifier (2) to
determine if it's an a1, a2 or a3.
If it's a B, then you apply classifier
(3).  If it's C, then  you're done.
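
The decoding step for this two-level example might be sketched like
this, with each classifier stood in for by a function from input to
category name.  The names are hypothetical and this is not LingPipe API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TopDownClassifier {
    private final Function<String, String> root;
    private final Map<String, Function<String, String>> children =
        new HashMap<>();

    TopDownClassifier(Function<String, String> root) {
        this.root = root;
    }

    // Register the classifier over the subclasses of topCategory.
    void addChild(String topCategory, Function<String, String> classifier) {
        children.put(topCategory, classifier);
    }

    // Classify at the top level, then refine if that class has subclasses.
    String classify(String input) {
        String top = root.apply(input);
        Function<String, String> sub = children.get(top);
        return sub == null ? top : sub.apply(input);
    }
}
```

Here classifier (1) is the root, and classifiers (2) and (3) are
registered under "A" and "B"; "C" has no entry, so it is returned as is.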

Note that this assumes that the categorizations
are complete and exclusive.  You can always
get around completeness by including an
OTHER category at any level.  You just need
some training data for it.  Or you need to do
something like setting a threshold such
that you reject anything whose cross-entropy
against the category is too low (by looking
at the score).

If an instance can belong to multiple categories,
the categorization problem is much trickier
unless the categories are independent (or
an independence assumption is good enough
for your practical purposes -- it often is).

- Bob

#211 From: "icekle" <srinivasnanduri@...>
Date: Wed Feb 8, 2006 5:59 pm
Subject: Support for French?
 
I have just come across LingPipe and am still coming to understand its great
potential. It would be extremely useful for one of the projects I am
working on.
I wanted to know if French is supported by LingPipe. I will be
processing some French content and was not sure if LingPipe supports
that. In the FAQ, only Chinese, English, Hindi, Japanese and Klingon
are said to be supported.

Thanks.

#212 From: Breck Baldwin <breck@...>
Date: Wed Feb 8, 2006 6:21 pm
Subject: Re: Support for French?
 
We don't have any models for French in particular, but there should be
no character set issues if you want to process French.

If you want to detect named entities, do part-of-speech tagging or such things,
then you will have to find French corpora to train on. I am not aware of
any publicly available resources.

Look at the tutorials page for more of what we can do--many of the
components are language neutral....

http://www.alias-i.com/lingpipe/getting_started.html

good luck

breck


--
Breck Baldwin
Alias-i, Inc.
181 North 11th Street, Suite 401
Brooklyn, NY 11211
v:718.290.9170
f:718.290.9171
m:917.292.8845
breck@...

#213 From: carp@...
Date: Wed Feb 15, 2006 11:00 pm
Subject: Migrating to JDK 1.5 for LingPipe 3.0?
 
Please let us know what you think about
assuming JDK 1.5 for LingPipe 3.0 and
beyond.  Here's a quick roadmap:

Sun JDK 1.6 In
--------------
Sun just released the 1.6 JDK beta (aka Mustang).

       http://java.sun.com/j2se/

Sun JDK 1.5 Stable
------------------
1.5 is now the "standard" release.

Sun JDK 1.4 On its Way Out
--------------------------
The 1.4 version of the JDK is no longer even
mentioned on the main J2SE page.  Nevertheless,
it doesn't seem to have reached "end of life"
(EOL), as its pages are still up:

       http://java.sun.com/j2se/1.4.2/download.html

They support two versions at a time, so when
1.6 is stable, 1.4 will be expired.  This
should happen about the time we plan to roll
out LingPipe 3.0.

LingPipe 2.2 compatible with JDK 1.4+
-------------------------------------
We're in the final release engineering stages for
LingPipe 2.2.  It will be compiled under JDK 1.4
and be JDK 1.4 and 1.5 (and presumably 1.6) compatible.

LingPipe 3.0 coded to JDK 1.5???
--------------------------------
We'd like to assume JDK 1.5 libraries and syntax.
We don't anticipate a 3.0 release until summer 2006.

If I refactor existing code to use generics, it
may break some backward compatibility.

java.util.concurrent would likely be used behind the
scenes for things like CRSW cache synchronization
and in various multi-threaded demos.

What's in store for LingPipe 3.0?
---------------------------------
I'd like to refactor the I/O classes involved in
our 1.0 named-entity extraction or just deprecate
the com.aliasi.ne package in favor of the chunking
about to be introduced in 2.2.

We might add probabilistic CFG and/or dependency
parsing.

We might add general transduction, which has
applications to spelling/pronunciation/transliteration
and for morphology/stemming.

We might get more serious about clustering
and include k-means and some soft (EM) style
clusterers.  Depends on whether we want to do
non-database-linked coreference or not.

We might add hierarchical classification.

We might add a bunch of functionality around
Lucene for search, like query refinement,
results clustering, rescoring with newness,
etc.

There'll be a new set of demos and benchmarks
and models that will run on the web, as command
lines and as GUIs.  We're also thinking hard about
more annotation tools to exploit the new models
with tag-a-little, learn-a-little properties.

Anything else you'd like to see?  We're always
happy to field feature requests, and now's the
time to get them in for 3.0.

- Bob Carpenter
    Alias-i

#214 From: eduard barbu <eduard_barbu@...>
Date: Thu Feb 16, 2006 11:01 am
Subject: Re: Migrating to JDK 1.5 for LingPipe 3.0?
 
Hi all,

I think it is a good idea to migrate to JDK 1.5 [you will have some work to do with generics but I think it is worth the effort!]
As a feature, I would like to see some summarization capabilities. I think you have enough infrastructure to build a summarization system that performs well.

Regards!
Eduard


#216 From: "Bob Carpenter" <carp@...>
Date: Wed Feb 22, 2006 12:29 am
Subject: LingPipe 2.2.0 released
 
Alias-i is pleased to announce the availability of LingPipe 2.2.0.
Check it out at:

      http://www.alias-i.com/lingpipe

Cribbing from the new home page:


NEW FUNCTIONALITY

N-best and Confidence Chunking
N-best and confidence chunking with evaluation and training.
Applies to named entity mention extraction and phrase extraction.

MEDLINE 2006 Parser and Downloader
New DTDs included. New program to automatically download,
update and run checksums over MEDLINE data.


IMPROVEMENTS

LM Speedup
Sped up online tokenized and character language models by one to three
orders of magnitude, depending on branching factor and training set size,
for dynamic estimates and compilation (not compiled model speed).  This speeds
up the following by 1-3 orders of magnitude for online evaluation and
compilation:  classification, spelling, Chinese word segmentation,
part-of-speech tagging, named-entity chunking and significant phrase
extraction from dynamic models.

HMM Speedup
Sped up HMM decoder by adding emission probability cache; one to two
orders of magnitude for compiled models, which are the heart of
both part-of-speech tagging and the new entity detection.  To support this,
there is a new thread-safe high-throughput Map implementation designed
especially for caching: util.FastCache.

Spelling Speedup
Sped up spelling correction by two or more orders of magnitude
by converting two quadratic algorithms to linear (one in length of input,
one in length of n-gram).

Generalized Significant Phrases
Significant phrase extraction generalized to allow compiled background
models with a new tokenized LM interface.  (Thanks to user comments
for pushing us to do this one;  it's a lot faster this way.)

Chinese Word Segmentation
SigHan 2005 Chinese tokenization eval tutorial. LingPipe's
now the best published Chinese word segmenter over the largest
SigHan data set.


There are a few minor bug fixes and a bunch of little additions
here and there.  The biggest bug fix was with scoring for
joint classification (as done by language models, Naive Bayes,
etc.).  It was previously returning the cross-entropy rate rather
than the joint log probability (the former is the latter divided
by the number of characters).  Classification results won't change,
but numerical scores will.
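
A small sketch of why the winner is unchanged: every category scores the
same text, so dividing each joint log probability by the same character
count rescales the scores without reordering them.  The numbers and names
below are invented for the illustration and are not LingPipe output.

```java
// Hypothetical sketch comparing joint log probabilities with per-character scores.
final class JointVsRateSketch {
    // Per-character score: joint log (base 2) probability divided by text length.
    static double perCharacter(double jointLog2Prob, int numChars) {
        return jointLog2Prob / numChars;
    }

    // Index of the largest score, i.e. the winning category.
    static int best(double[] scores) {
        int top = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[top]) {
                top = i;
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up joint log2 probabilities of one 50-character text under three categories.
        double[] joint = { -120.0, -95.5, -130.25 };
        double[] rate = new double[joint.length];
        for (int i = 0; i < joint.length; i++) {
            rate[i] = perCharacter(joint[i], 50);
        }
        System.out.println("same winner: " + (best(joint) == best(rate)));
    }
}
```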

The only backward compatibility issue will be with the spell
checker, whose interface was rather drastically changed to
account for the new configurations.  The old models will still
work -- this only affects runtime config of decoders.

Comments are welcome, as usual.  We'd love to hear what
you're doing with LingPipe.

- Bob Carpenter
   Alias-i

PS:  If anyone has comments or suggestions on the new web design,
we'd love to hear them.  This is my first site design using cascading
style sheets (CSS).

PPS:  In case anyone thought the JDK version discussion
was about this release, it wasn't -- this one's compiled with
JDK 1.4.2.  We may change to JDK 1.5 (Java 5.0) for our
3.0 release, but only if 1.6 is released and support for 1.4.2
is discontinued by Sun.

#220 From: caven wang <gdcaven@...>
Date: Mon Mar 6, 2006 4:18 am
Subject: Re: LingPipe 2.2.0 released
 
Many thanks. I am still waiting for some tutorials on the coreference functionality.
 

#222 From: caven wang <gdcaven@...>
Date: Mon Mar 6, 2006 4:31 am
Subject: Re: Migrating to JDK 1.5 for LingPipe 3.0?
 
LingPipe can walk in its own way. Don't stay too close with SUN.
SUN is So Unhappy and Neglectful about the truth. SUN named JDK1.5 as JDK5,
which indicated a commercial company was over-anxious about some progress.
It is yet hard to find a lot of exciting improvement. Do we need LingPipe3.0
or LingPipe 2.3...?

carp@... wrote:
Please let us know what you think about
assuming JDK 1.5 for LingPipe 3.0 and
beyond.  Here's a quick roadmap:

Sun JDK 1.6 In
--------------
Sun just released the 1.6 JDK beta (aka Mustang).

      http://java.sun.com/j2se/

Sun JDK 1.5 Stable
------------------
1.5 is now the "standard" release.

Sun JDK 1.4 On its Way Out
--------------------------
The 1.4 version of the JDK is no longer even
mentioned on the main J2SE page.  Nevertheless,
it doesn't seem to have reached "end of life"
(EOL), as its pages are still up:

      http://java.sun.com/j2se/1.4.2/download.html

They support two versions at a time, so when
1.6 is stable, 1.4 will be expired.  This
should happen about the time we plan to roll
out LingPipe 3.0.

LingPipe 2.2 compatible with JDK 1.4+
-------------------------------------
We're in the final release engineering stages for
LingPipe 2.2.  It will be compiled under JDK 1.4
and be JDK 1.4 and 1.5 (and presumably 1.6) compatible.

LingPipe 3.0 coded to JDK 1.5???
--------------------------------
We'd like to assume JDK 1.5 libraries and syntax.
We don't anticipate a 3.0 release until summer 2006.

If I refactor existing code to use generics, it
may break some backward compatibility.

Util.concurrent would likely be used behind the
scenes for things like CRSW cache synchronization
and in various multi-threaded demos.

What's in store for LingPipe 3.0?
---------------------------------
I'd like to refactor the I/O classes involved in
our 1.0 named-entity extraction or just deprecate
the com.aliasi.ne package in favor of the chunking
about to be introduced in 2.2.

We might add probabilistic CFG and/or dependency
parsing.

We might add general transduction, which has
applications to spelling/pronunciation/transliteration
and for morphology/stemming.

We might get more serious about clustering
and include k-means and some soft (EM) style
clusterers.  Depends on whether we want to do
non-database-linked coreference or not.

We might add hierarchical classification.

We might add a bunch of functionality around
Lucene for search, like query refinement,
results clustering, rescoring with newness,
etc.

There'll be a new set of demos and benchmarks
and models that will run on the web, as command
lines and as GUIs.  We're also thinking hard about
more annotation tools to exploit the new models
with tag-a-little, learn-a-little properties.

Anything else you'd like to see?  We're always
happy to field feature requests, and now's the
time to get them in for 3.0.

- Bob Carpenter
   Alias-i

#225 From: carp@...
Date: Mon Mar 6, 2006 9:02 pm
Subject: Re: Migrating to JDK 1.5 for LingPipe 3.0?
 
caven wang wrote:
> LingPipe can walk in its own way. Don't stay too close with SUN.
> SUN is So Unhappy and Neglectful about the truth. SUN named JDK1.5 as
> JDK5, which indicated a commercial company was over-anxious about some
> progress. It is yet hard to find a lot of exciting improvement. Do we
> need LingPipe3.0 or LingPipe 2.3...?

The issue is compatibility.  The 1.5 release is a major
change in the underlying language.  And while not everyone
may be a type-safety freak like me, I'm very excited about
generics.  I think it'd make the code cleaner and safer
if we replaced hard-coded classes and unadorned collections
with instantiations of generic versions.  And while I
don't care so much about auto-boxing or variable length
args, using them will also break backward compatibility
in that LingPipe would no longer compile with JDK 1.4.
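
As a rough illustration of the difference (the class and method names are
invented here and are not LingPipe code), compare an unadorned collection
with its generic counterpart.  Only the second form requires 1.5 syntax.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch contrasting raw and generic collections.
final class GenericsSketch {
    // Pre-1.5 style: a raw List holds Objects, so every read needs a cast
    // that can only fail at run time.
    static int totalLengthRaw(List tokens) {
        int total = 0;
        for (int i = 0; i < tokens.size(); i++) {
            total += ((String) tokens.get(i)).length();
        }
        return total;
    }

    // 1.5 style: the compiler checks the element type, so no cast is needed.
    static int totalLength(List<String> tokens) {
        int total = 0;
        for (String token : tokens) {
            total += token.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("named", "entity");
        System.out.println(totalLength(tokens) == totalLengthRaw(tokens));
    }
}
```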

The other huge huge bonus for me in 1.5 is util.concurrent.
Doug Lea's brilliant set of synchronization libs has
been improved and dropped in as a standard library.
I'd like to use it to do things like create synchronization
wrappers within LingPipe itself.  We already use it for
applications in-house.

Just to show I'm not a hopeless Java sycophant, I'll
come out in opposition to the new extended character
support.  I've read the reports of how they came to
do it the way they did.  And it's an amazing integration
given the need to keep strings backward compatible.
But it means that a char is no longer big enough to
hold a character, in general.  This means no support outside of
16-bit Unicode yet for LingPipe without rewriting just about everything
(which is what Sun did -- they tried rewriting the
RegEx library with a bunch of variants before settling
on the current scheme).
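
A short sketch of the problem, using only standard JDK calls (the class
name is made up): a character outside the 16-bit range occupies two char
values in a String, so code that walks a string char by char sees two
"characters" where there is one.

```java
// Hypothetical sketch: one supplementary character, two char values.
final class CodePointSketch {
    // U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
    static final String G_CLEF = "\uD834\uDD1E";

    // Number of 16-bit char values in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (String.codePointCount was added in 1.5).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(G_CLEF) + " chars, "
                + codePointCount(G_CLEF) + " code point");
    }
}
```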

The issue isn't really one of marketing or what
numbers we should apply to LingPipe.  Left to our
own devices, we might have geekily chosen powers
of 2 (like MPEG) or primes (don't know of an example).
The only issue is whether people can keep them straight.
My fave Apache project, Lucene, just went from 1.4 to
1.9 without having a 1.5; many projects move Zeno-like
from 1.0 to 1.5 (to 1.75, to 1.875, never quite
arriving at version 2).  There's no sense anywhere to
version numbering.  Might as well ignore it if you're
not a marketing person.

Marketing aside, 1.5 is simply the most impressive Java yet.
It's a major  change in the language, almost all for the better in my
opinion.   Getting a product like Java out the door requires heroic
effort on the part of hundreds of engineers, and I'm frankly
amazed these releases come off so well.  Three cheers for
the Sun engineers: Hip hip hooray.  Hip hip hooray.  Hip hip,
hooray.  Without them, I'd still be suffering with C/C++.

- Bob

#226 From: "seth_a_farrington" <seth_a_farrington@...>
Date: Sat Mar 18, 2006 1:29 am
Subject: Very large language models
 
I'm working on a classifier and I've found that using DynamicLMClassifier,
my accuracy seems to consistently go up when both increasing the ngrams and
the number of pieces of text I train with. Currently, I've gotten it trained
on 100,000 documents (averaging about 1,000 words each) with a ngram of 6,
and I'm starting to run into memory problems. The JVM seems to be taking up
about a gig of memory building this language model, and so when I try to add
more documents or increase the ngram count, swapping makes the application
unacceptably slow.

Does anyone know if it would be possible to write some sort of
NGramBoundaryLM subclass that will store parts of the model on disk so as to
free up memory? Or are there any other techniques to reduce the amount of
memory that a language model uses?

Also, I know that it is highly dependent on the application and
classification task, but is it reasonable to assume that accuracy will
continue to increase forever by adding more and more documents at higher
ngram counts? I don't want to spend too much effort after a point of
diminishing returns, but I don't know when I'll reach that point. Are there
any tables or graphs out there that show the correlation between document
counts & ngrams to accuracy rates?

#227 From: "Bob Carpenter" <carp@...>
Date: Sat Mar 18, 2006 2:32 am
Subject: Re: Very large language models
 
Q1: can you train LMs on lots of data?
A1:  Yes, with pruning.

Q2:  Will classification accuracy keep going up with more data?
A2:  Almost always for natural language data, but the rate is usually logarithmic.

> I'm working on a classifier and I've found that using DynamicLMClassifier,
> my accuracy seems to consistently go up when both increasing the ngrams
> and the number of pieces of text I train with.

That's good to hear.  It'll almost certainly go up with n-gram size, at
least to 8 with that many docs, and maybe more if phrases are important.
I'd use larger n-grams combined with pruning (as indicated below).

> Currently, I've gotten it trained on 100,000 documents (averaging about
> 1,000 words each) with a ngram of 6, and I'm starting to run into memory
> problems. The JVM seems to be taking up about a gig of memory building
> this language model, and so when I try to add more documents or increase
> the ngram count, swapping makes the application unacceptably slow.

That's 100 M words, or about 600MB.  I was able to train a single
6-gram on almost 10GB of English news in 1.4GB of memory using
either  the 1.4 or 1.5 JDK on Windows.
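
The 600MB figure is back-of-the-envelope arithmetic.  A sketch of it, on
the assumption (mine, not stated above) of about six single-byte
characters per word, counting the trailing space:

```java
// Hypothetical sketch of the corpus-size estimate.
final class CorpusSizeSketch {
    // Rough size of a corpus in bytes; bytesPerWord is an assumed average.
    static long approxBytes(long docs, long wordsPerDoc, long bytesPerWord) {
        return docs * wordsPerDoc * bytesPerWord;
    }

    public static void main(String[] args) {
        // 100,000 docs x 1,000 words x ~6 bytes per word
        long bytes = approxBytes(100000L, 1000L, 6L);
        System.out.println(bytes / 1000000L + " MB");
    }
}
```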

One thing that could be causing problems is if you have a lot
of numerical data.  There are just endless ways those can go
together.

Of course, if you're training multiple LMs with each piece of
data, you have to add their memory requirements.

I provide some figures in the following paper on scaling LMs:

http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf

Memory requirements are largely determined
by how many n-grams you find of each size, which depends on branching
factor and how skewed the distribution is.

> Does anyone know if it would be possible to write some sort of
> NGramBoundaryLM subclass that will store parts of the model on disk so as
> to free up memory?

Unfortunately there is no such thing.  For what it's worth, it's already on
our "nice to have for 3.0" list.  I added the bit-level I/O operations to
support just that.  It's easy to write the counts out -- the harder part's
merging the huge files that result -- they'll have to be streamed.

In the end, I decided pruning (see below) combined
with a large-memory machine was good enough for most practical situations.

Although it won't affect memory, I'd think that if your docs are 1K each,
you'd want to use the process language models.  It probably won't make much
difference, as the other 998 characters will swamp the
boundaries.

> Or are there any other
> techniques to reduce the amount of memory that a language model uses?

Yes. The best thing to do is to prune the models.  This will
remove counts for sequences below a given minimum.
I'm sorry it's not more obvious how to do this from looking at the
language model doc, but what you want to do is:

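      // both classes live in the com.aliasi.lm package
      NGramBoundaryLM lm = ...;   // your trained dynamic language model

      // the counter holds the substring counts behind the model
      TrieCharSeqCounter counter = lm.substringCounter();
      counter.prune(MIN_COUNT);   // drop sequences seen fewer than MIN_COUNT times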

where MIN_COUNT is the minimum count for sequences you
want to preserve.
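
To show the effect of count pruning without LingPipe on the classpath,
here is a self-contained sketch over a plain HashMap of substring counts.
It is only an illustration: LingPipe keeps its counts in a trie, and the
class and method names below are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count character n-grams, then prune rare ones.
final class PruneSketch {
    // Counts every substring of length 1..maxNGram in the text.
    static Map<String, Integer> countSubstrings(String text, int maxNGram) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = 1; len <= maxNGram && start + len <= text.length(); len++) {
                counts.merge(text.substring(start, start + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Removes every sequence whose count is below minCount.
    static void prune(Map<String, Integer> counts, int minCount) {
        counts.values().removeIf(c -> c < minCount);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countSubstrings("abab", 2);
        int before = counts.size();
        prune(counts, 2);
        System.out.println(before + " sequences before pruning, " + counts.size() + " after");
    }
}
```

On real text most of the entries are the singletons that pruning with a
minimum count of 2 throws away.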

Given that natural language is highly skewed,
most sequences have a count of only 1, so even with MIN_COUNT=2,
you'll save a lot of space.  Distributions of both words
and phrases follow Zipf's law pretty closely.  Check
out:

http://en.wikipedia.org/wiki/Zipf's_law

This skew is also what accounts for why more data leads
to better models in almost all cases.  The tail is very very
long, to put it in probabilistic and trendy internet terms.

The thing to do is incrementally train and prune.  It's best to get the
models as large as possible before pruning.  I'd guess it'd be better
to use longer n-grams even if you have to prune more aggressively.

One thing you could do if you're using a classifier to save memory
is to train each language model separately and prune them.  You
can then compile them to disk and read them back in.  It takes
a lot of memory to compile, and the compiled models aren't much
smaller (but they are much faster).

> Also, I know that it is highly dependent on the application and
> classification task, but is it reasonable to assume that accuracy will
> continue to increase forever by adding more and more documents at higher
> ngram counts?

Pretty much, yes.  Even up to Google and MSN-sized
collections.  The classic paper on this topic is Banko and
Brill:

http://research.microsoft.com/~brill/Pubs/ACL2001.pdf

They only go up to a billion words, though :-)

> I don't want to spend too much effort after a point of diminishing
> returns, but I don't know when I'll reach that point. Are there any tables
> or graphs out there that show the correlation between document counts &
> ngrams to accuracy rates?

It's very very very task dependent.  The tables in my paper
show learning curves for cross-entropy rates, which is just
a measure of how good the model is at predicting unseen text
given the amount of sample text it's seen.

We've since gotten a 16GB memory machine (they're not that
expensive now with dual opteron setups, and by that I mean
in the US$6K range if you shop around).  I've extended the
results in my ACL paper above, and entropy continues to
decrease with more memory and more training data.

I don't know of anything that reports learning curves for
character language model classifiers.  What you want to
do is plot your accuracy versus amount of training data
and see when it begins to level off.  You'll need to do it
on a log scale -- accuracy tends to grow with the log of
the amount of data.
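
One way to read such a curve without a plotting tool is to compute the
accuracy gained per doubling of training data; when that number gets
small, the curve is leveling off on the log scale.  A minimal sketch,
with invented names and made-up measurements:

```java
// Hypothetical sketch: accuracy gain per doubling of training data.
final class LearningCurveSketch {
    // gains[i] is the accuracy change between point i and point i+1,
    // divided by the number of doublings of training size between them.
    static double[] gainPerDoubling(long[] sizes, double[] accuracy) {
        double[] gains = new double[sizes.length - 1];
        for (int i = 1; i < sizes.length; i++) {
            double doublings =
                (Math.log(sizes[i]) - Math.log(sizes[i - 1])) / Math.log(2.0);
            gains[i - 1] = (accuracy[i] - accuracy[i - 1]) / doublings;
        }
        return gains;
    }

    public static void main(String[] args) {
        // Made-up measurements: documents trained on, held-out accuracy.
        long[] sizes = { 1000L, 2000L, 4000L };
        double[] accuracy = { 0.70, 0.75, 0.78 };
        for (double gain : gainPerDoubling(sizes, accuracy)) {
            System.out.println("gain per doubling: " + gain);
        }
    }
}
```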

If you've got data you can share, we could probably help
you train larger models on our big memory machines.

- Bob

#228 From: "seth_a_farrington" <seth_a_farrington@...>
Date: Sat Mar 18, 2006 3:41 am
Subject: Re: Very large language models
 
Bob,

Thanks so much for the very quick response!

The pruning does indeed seem to make some difference, at least with the
quick tests that I've run using it. Pruning with a MIN_COUNT of 2 every
1,000 training events appears to reduce the totalSequenceCount for each
category by between 0.3% and 1.0% (e.g., one reduction I noticed was from
498,140,010 to 497,837,688). This'll probably save a fair amount of memory
over time, but it's not enormous. I'll try with higher MIN_COUNT limits and
see if it makes more of a difference.

One other thing: Zipf's Law seems to discuss word frequencies. Has it been
observed to apply to character-based language models as well? It doesn't
seem to me to be a necessary extrapolation of the rule that it would apply
equally to character and word language models, but I don't really have any
theoretical grounding in linguistics at all.

By the way, where did you get your 10GB of English news? My application is
working with news reports, so if you know of a resource that has that much
raw data available in the public domain, I'd love to be able to use it.



--- In LingPipe@yahoogroups.com, "Bob Carpenter" <carp@...> wrote:
>
> Q1: can you train LMs on lots of data?
> A1:  Yes, with pruning.
>
> Q2:  Will classification accuracy keep going up with more data?
> A2:  Almost always for natural language data, but the rate is usually
> logarithmic.
>
> > I'm working on a classifier and I've found that using DynamicLMClassifier,
> > my accuracy seems
> > to consistently go up when both increasing the ngrams and the number of
> > pieces of text I
> > train with.
>
> That's good to hear.  It'll almost certainly go up with n-gram size, at
> least to 8 with
> that many docs, and maybe more if phrases are important.  I'd use larger
> n-grams
> combined with pruning (as indicated below).
>
> > Currently, I've gotten it trained on 100,000 documents (averaging about
> > 1,000
> > words each) with a ngram of 6, and I'm starting to run into memory
> > problems.
>
> > The JVM
> > seems to be taking up about a gig of memory building this language model,
> > and so when I
> > try to add more documents or increase the ngram count, swapping makes the
> > application
> > unacceptable slow.
>
> That's 100 M words, or about 600MB.  I was able to train a single
> 6-gram on almost 10GB of English news in 1.4GB of memory using
> either  the 1.4 or 1.5 JDK on Windows.
>
> One thing that could be causing problems is if you have a lot
> of numerical data.  There are just endless ways those can go
> together.
>
> Of course, if you're training multiple LMs with each piece of
> data, you have to add their memory requirements.
>
> I provide some figures in the following paper on scaling LMs:
>
> http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf
>
> Memory requirements are largely determined
> by how many n-grams you find of each size, which depends on branching
> factor and how skewed the distribution is.
>
> > Does anyone know if it would be possible to write some sort of
> > NGramBoundaryLM subclass
> > that will store parts of the model on disk so as to free up memory?
>
> Unfortunately there is no such thing.  For what it's worth, it's already on
> our "nice to have for 3.0" list.  I added the bit-level I/O operations to
> support just that.  It's easy to write the counts out -- the harder part's
> merging the
> huge files that result -- they'll have to be streamed.
>
> In the end, I decided pruning (see below) combined
> with a large-memory machine was good enough for most practical situations.
>
> Although it won't affect memory, I'd think that if your docs are 1K each,
> you'd want to use the process language models.  It probably won't make much
> difference, as the other 998 characters will swamp the
> boundaries.
>
> > Or are there any other
> > techniques to reduce the amount of memory that a language model uses?
>
> Yes. The best thing to do is to prune the models.  This will
> remove counts for sequences below a given minimum.
> I'm sorry it's not more obvious how to do this from looking at the
> language model doc, but what you want to do is:
>
>      NGramBoundaryLM lm = ...;
>
>      TrieCharSeqCounter counter = lm.substringCounter();
>      counter.prune(MIN_COUNT);
>
> where MIN_COUNT is the minimum count for sequences you
> want to preserve.
>
> Given that natural language is highly skewed,
> most sequences have a count of only 1, so even with MIN_COUNT=2,
> you'll save a lot of space.  Distributions of both words
> and phrases follows pretty closely to Zipf's law.  Check
> out:
>
> http://en.wikipedia.org/wiki/Zipf's_law
>
> This skew is also what accounts for why more data leads
> to better models in almost all cases.  The tail is very very
> long, to put it in probabilistic and trendy internet terms.
>
> The thing to do is incrementally train and prune.  It's best to get the
> models as large as possible before pruning.   I'd guess it'd be better
> to use longer n-grams even if you have to prune more aggressively.
>
> One thing you could do if you're using a classifier to save memory
> is to train each language model separately and prune them.  You
> can then compile them to disk and read them back in.  It takes
> a lot of memory to compile, and the compiled models aren't much
> smaller (but they are much faster).
>
> > Also, I know that it is highly dependent on the application and
> > classification task, but is it
> > reasonable to assume that accuracy will continue to increase forever by
> > adding more and
> > more documents at higher ngram counts?
>
> Pretty much, yes.  Even up to Google and MSN-sized
> collections.  The classic paper on this topic is Banko and
> Brill:
>
> http://research.microsoft.com/~brill/Pubs/ACL2001.pdf
>
> They only go up to a billion words, though :-)
>
> > I don't want to spend too much effort after a point
> > of diminishing returns, but I don't know when I'll reach that point. Are
> > there any tables or
> > graphs out there that show the correlation between document counts &
> > ngrams to accuracy
> > rates?
>
> It's very very very task dependent.  The tables in my paper
> show learning curves for cross-entropy rates, which is just
> a measure of how good the model is at predicting unseen text
> given the amount of sample text it's seen.
>
> We've since gotten a 16GB memory machine (they're not that
> expensive now with dual opteron setups, and by that I mean
> in the US$6K range if you shop around).  I've extended the
> results in my ACL paper above, and entropy continues to
> decrease with more memory and more training data.
>
> I don't know of anything that reports learning curves for
> character language model classifiers.  What you want to
> do is plot your accuracy versus amount of training data
> and see when it begins to level off.  You'll need to do it
> on a log scale -- accuracy tends to grow with the log of
> the amount of data.
>
> If you've got data you can share, we could probably help
> you train larger models on our big memory machines.
>
> - Bob
>

#229 From: carp@...
Date: Mon Mar 20, 2006 7:24 pm
Subject: Re: Re: Very large language models
colloquialdo...
 
> The pruning does indeed seem to make some difference, at least with the
> quick tests that I've run using it. Pruning with a MIN_COUNT of 2 every
> 1,000 training events appears to reduce the totalSequenceCount for each
> category by between 0.3% and 1.0% (e.g., one reduction I noticed was from
> 498,140,010 to 497,837,688).

Hmm.  With n-grams of length six, the reduction should be over
50% on English news text.  If you're only getting a 0.3%
reduction in model size, something's amiss in either the
pruning or reporting.

Are you looking at totalSequenceCount for n-grams of length
6?  The lower order n-grams will have much higher counts.

You can also use com.aliasi.lm.TrieCharSeqCounter.nGramFrequencies
to get an array of counts for a given length to see what they look
like before pruning and after.
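
To make the before/after comparison concrete, here is a minimal
self-contained sketch that does not use LingPipe at all.  It counts
character n-grams in a plain map, prunes everything below a minimum
count, and reports both the number of distinct n-grams and the sum of
their counts.  The class and method names (PruneSketch, countNGrams,
prune, totalCount) are made up for illustration, and a HashMap is only
a stand-in for the trie-based counter.

```java
import java.util.HashMap;
import java.util.Map;

public class PruneSketch {

    // Count every character n-gram of exactly length n in the text.
    static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Drop every n-gram whose count is below minCount.
    static void prune(Map<String, Integer> counts, int minCount) {
        counts.values().removeIf(c -> c < minCount);
    }

    // Sum of all counts, i.e. total n-gram occurrences still stored.
    static long totalCount(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat and the cat ate the rat";
        Map<String, Integer> counts = countNGrams(text, 3);
        System.out.println("before: distinct=" + counts.size()
                           + " total=" + totalCount(counts));
        prune(counts, 2);
        System.out.println("after:  distinct=" + counts.size()
                           + " total=" + totalCount(counts));
    }
}
```

The number of distinct n-grams and the sum of their counts are
different measures and can shrink by quite different proportions, so
it is worth checking which one a given report reflects.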

Another issue arises from training on the same data
multiple times. That can happen in standard news feeds,
especially if you keep picking up the same AP or Reuters
story, or if you pick up corrections.  You can, in fact,
use the cross-entropy rate of the language model
(negative log probability divided by length) to assess whether
you've seen the text before.  If it's very low, you've
likely seen it before.
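
A minimal sketch of that check, assuming you have some function that
returns the base-2 log probability of a text under your model.  The
log2Prob parameter, the class name, and the threshold are hypothetical
stand-ins rather than LingPipe API, and a sensible threshold would
have to be chosen on held-out data.

```java
import java.util.function.ToDoubleFunction;

public class SeenBeforeSketch {

    // Cross-entropy rate in bits per character: -log2 P(text) / length.
    static double crossEntropyRate(ToDoubleFunction<CharSequence> log2Prob,
                                   CharSequence text) {
        if (text.length() == 0) {
            throw new IllegalArgumentException("empty text");
        }
        return -log2Prob.applyAsDouble(text) / text.length();
    }

    // Flag text whose rate falls below the given threshold (assumed value).
    static boolean probablySeen(ToDoubleFunction<CharSequence> log2Prob,
                                CharSequence text,
                                double maxBitsPerChar) {
        return crossEntropyRate(log2Prob, text) < maxBitsPerChar;
    }

    public static void main(String[] args) {
        // Toy stand-in model: every character costs exactly 2 bits.
        ToDoubleFunction<CharSequence> toyModel = s -> -2.0 * s.length();
        System.out.println(crossEntropyRate(toyModel, "hello"));
        System.out.println(probablySeen(toyModel, "hello", 2.5));
    }
}
```

Dividing by length matters: raw log probability always drops as text
gets longer, so only the per-character rate is comparable across
documents of different sizes.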

I'd be happy to look at the code.  Our own pruning code
gets a lot of exercise, so I'm pretty sure it's OK,
but I'll check that if I can reproduce a test case that
does the wrong thing pruning-wise.

I'd try to hold off as long as you can before pruning given
your memory, as it'll lead to better models of the same size.

> One other thing: Zipf's Law seems to discuss word frequencies. Has it also
> been observed that it also applies to character-based language models?

Yes, by everyone.  It's very contentious as to what
form the observation takes and how general/exact
the fit is, but what everyone has observed in just
about any count of linguistic interest, is a huge
skew toward common events and a very long tail of
uncommon events.

> It doesn't seem to me like it is a necessary extrapolation of the rule
> that it would apply equally to character and word language models, but I
> don't really have any theoretical grounding in linguistics at all.

You're right.  It's really an empirical observation
and not something that theoretical linguists care
about.  Traditional Chomskyan linguists only recognize three
distinct "counts", zero, one and infinity.

> By the way, where did you get your 10GB of English news? My application is
> working with news reports, so if you know of a resource that has that much
> raw data available in the public domain, I'd love to be able to use it.

Gigaword for English news, which I'm afraid is not in
the public domain.  It's distributed by the Linguistic
Data Consortium (LDC).

Reuters distributes a large corpus for research
use, but there are commercial restrictions.  People
like to use it for classification experiments, in
fact.

I also used MEDLINE for English biomedical text.
It's not public domain, but it is available free
from the US National Library of Medicine (NLM).

- Bob

#230 From: carp@...
Date: Mon Mar 20, 2006 11:14 pm
Subject: LingPipe 2.2.1 (Maintenance Release)
colloquialdo...
 
We needed to fix two serious bugs, so we just released
2.2.1.  It's otherwise almost identical to 2.2.0.

The first fix lets the new util.FastCache handle negative
hash codes.  I learned that not only may hash codes be
negative, but the remainder operator (%) can also return a
negative value when its left operand is negative.  The first
unit tests must've coincidentally used strings with only
positive hash codes.  This has been fixed and tried in
a large scale setting now.
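
For anyone who hasn't hit this before, here is a small self-contained
illustration of the pitfall.  It is not the FastCache code; the class
and method names are made up.  Math.floorMod only arrived in Java 8,
so the masking variant is what would work on a 1.4 or 1.5 JVM.

```java
public class BucketIndexSketch {

    // Naive index: % keeps the sign of the left operand, so this can
    // be negative and would be an illegal array index.
    static int naiveIndex(Object key, int numBuckets) {
        return key.hashCode() % numBuckets;
    }

    // Java 8+: floorMod yields a value in [0, numBuckets) when
    // numBuckets is positive.
    static int floorModIndex(Object key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Works on older JVMs: clear the sign bit before taking the remainder.
    static int maskedIndex(Object key, int numBuckets) {
        return (key.hashCode() & 0x7fffffff) % numBuckets;
    }

    public static void main(String[] args) {
        Integer key = Integer.valueOf(-7); // an Integer's hash code is its value
        System.out.println("naive:    " + naiveIndex(key, 4));
        System.out.println("floorMod: " + floorModIndex(key, 4));
        System.out.println("masked:   " + maskedIndex(key, 4));
    }
}
```

Note that Math.abs(hash) % n is not a safe alternative, because
Math.abs(Integer.MIN_VALUE) is still negative.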

The second was some subtle changes to token sensitivity
and spelling, and a couple more tuning features.

If you're not using either of these features, the new
release won't affect you at all.

Let us know if you have any questions.

We're working on a new release that'll have more demos
and at least a simple tutorial for the new n-best and
confidence named-entity detection.

- Bob Carpenter
    Alias-i

#231 From: Jason Lustig <lustig@...>
Date: Mon Mar 27, 2006 7:01 pm
Subject: LingPipe on the mac
stagnification
 
Hi,

I am trying to use LingPipe for some natural language processing
applications that I am working on, and it looks excellent! I would
like to get it working on my mac so that I can program with it, and
it does not seem to be working properly. I am using Java 1.4.2_09,
and am getting this error when trying to run the command tutorial
from http://www.alias-i.com/lingpipe/demos/command/tutorial.html:

dyn-129-64-208-12:~/Desktop/lingpipe-2.2.1/demos/command jason$ java \
  -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \
  -cp "command-demo.jar:../../lingpipe-2.0.0.jar:../../lib/xml-apis-2.7.1.jar:../../lib/xercesImpl-2.7.1.jar" \
  AnnotateCmd -model=../models/EN_NEWS.model \
  -inputDir=../data/commands/input -outputDir=../data/commands/output \
  -contentType="text/xml; charset=UTF-8"
Exception in thread "main" java.lang.UnsupportedClassVersionError: AnnotateCmd (Unsupported major.minor version 49.0)
          at java.lang.ClassLoader.defineClass0(Native Method)
          at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
          at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
          at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
          at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
          at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
          at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)

Any thoughts on how to get it to work properly? I would like to build
a java program to do entity recognition and sentence splitting.

Jason

#232 From: Sanjay Singh <sspal@...>
Date: Mon Mar 27, 2006 7:08 pm
Subject: Re: LingPipe on the mac
sspal
 
Jason,

        Compile LingPipe yourself, using either ant or your
own javac. That helped me get rid of this error.

Regards
--
Jay

--- Jason Lustig <lustig@...> wrote:

> Hi,
>
> I am trying to use LingPipe for some natural language processing
> applications that I am working on, and it looks excellent! I would
> like to get it working on my mac so that I can program with it, and
> it does not seem to be working properly. I am using Java 1.4.2_09,
> and am getting this error when trying to run the command tutorial
> from http://www.alias-i.com/lingpipe/demos/command/tutorial.html:
>
> dyn-129-64-208-12:~/Desktop/lingpipe-2.2.1/demos/command jason$ java \
>   -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \
>   -cp "command-demo.jar:../../lingpipe-2.0.0.jar:../../lib/xml-apis-2.7.1.jar:../../lib/xercesImpl-2.7.1.jar" \
>   AnnotateCmd -model=../models/EN_NEWS.model \
>   -inputDir=../data/commands/input -outputDir=../data/commands/output \
>   -contentType="text/xml; charset=UTF-8"
> Exception in thread "main" java.lang.UnsupportedClassVersionError: AnnotateCmd (Unsupported major.minor version 49.0)
>          at java.lang.ClassLoader.defineClass0(Native Method)
>          at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
>          at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
>          at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
>          at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
>          at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
>          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
>          at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
>
> Any thoughts on how to get it to work properly? I would like to build
> a java program to do entity recognition and sentence splitting.
>
> Jason
>



#233 From: carp@...
Date: Mon Mar 27, 2006 7:52 pm
Subject: Re: LingPipe on the mac - Java major/minor versions
colloquialdo...
 
The "Unsupported major.minor version 49.0" bug
arises when you try to load classes in a Java 1.4
installation that were compiled under 1.5.
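
If you want to check which compiler target a given class file was
built for, the version sits in the first eight bytes of the file:
a four-byte magic number (0xCAFEBABE), then a two-byte minor version,
then a two-byte major version.  Major version 48 corresponds to
Java 1.4 and 49 to Java 1.5.  The following is a minimal sketch with
made-up class and method names; for a class inside a jar you would
need to extract it first.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ClassVersionSketch {

    // Parse the major version from a class file header:
    // u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version,
    // all stored big-endian.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("bad magic number");
        }
        buf.getShort();                  // skip the minor version
        return buf.getShort() & 0xFFFF;  // major version, read as unsigned
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersionSketch path/to/Some.class");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("major version: " + majorVersion(bytes));
    }
}
```

This sketch itself needs a modern JVM to run (it uses java.nio.file),
so it is a diagnostic to run elsewhere, not on the 1.4 installation.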

There are multiple possible fixes.

1.  Recompile LingPipe in your JVM
This is what Sanjay Singh suggested, and he's
right -- that'll work.  This will work for
other projects where you get this error, too,
if they were written with 1.4 compatibility
in mind (as LingPipe is).

2.  Switch to a 1.5 JVM
This may not be possible depending on your
platform, but if it is, I'd highly recommend
it.  It's faster and other software's getting
released with 1.5 compiles.

3.  Use our last version from the web archive:

http://www.alias-i.com/lingpipe-2.1.0-website/lingpipe-2.1.0.tar.gz

The only change is that 2.1.1 fixed a bug in
caching -- don't use HMM caching until upgrading
to 2.1.1 from 2.1.0.

4.  Wait for me to post a 1.4-compiled version of
2.1.1, which I'll try to do tonight.  I'll mail the
list when it's up.  (Our servers at
work won't actually run 1.4 JVMs -- yet another
reason we'd like to switch to 1.5.)

- Bob Carpenter
