LingPipe

Group Information

  • Members: 470
  • Category: Open Source
  • Founded: Oct 8, 2003
  • Language: English

Messages

#938 From: Bob Carpenter <carp@...>
Date: Wed Jun 2, 2010 6:11 pm
Subject: LingPipe 4.0.0 released
 
LingPipe 4.0.0 is up at the home page:

http://alias-i.com/lingpipe

There are instructions on migrating from 3.9.x.

We'll be supporting 3.9.x on a branch going forward.

Let us know if you need help migrating.  The basic
drill is to compile in 3.9 until there are no more
deprecation warnings, at which point the code's compatible
with 4.0.

Models compiled in 3.9 will work in 4.0 as is (modulo
requiring 4.0 methods to access their functionality).

- Bob Carpenter & Breck Baldwin
    LingPipe, Inc

#939 From: Ning Yu <ninginiu@...>
Date: Wed Jun 2, 2010 6:58 pm
Subject: Questions regarding EM
 
Hello,

Sorry to bother you again.

I wonder if I could ask some questions regarding EM.

1. Is there a way to control the impact of unlabeled data? If not too much
trouble, could you let me know which java code I should update in order to
add this control parameter?
2. Is there a way to print out the model built during each iteration (epoch)
so that one can track what has been changed/updated?
3. I understand that the EM demo calls the traditional naive Bayes classifier,
which implements a character-level language model. If I want to use unigrams or
higher-order n-grams as classification features, does it mean that I just need
to update the source code under the tutorial/em/src folder to
call NaiveBayesClassifier instead of TradNaiveBayesClassifier?

Thank you for your time and I appreciate your help,
  Ning



#940 From: Bob Carpenter <carp@...>
Date: Wed Jun 2, 2010 9:07 pm
Subject: Re: Questions regarding EM
 
1.  The impact of unlabeled data is controlled by its
size.  There's no other way to control it.  You can
duplicate the labeled data or sample from the unlabeled
data to help match sizes.

2.  Yes, use the iterator method, which iterates over
the models produced at each epoch.

3.  Sorry for the naming confusion.  Our class NaiveBayesClassifier
uses token-level multinomial classifiers with character
LMs used for smoothing.  It's essentially a token unigram
model with a non-traditional form of smoothing.

Our class TradNaiveBayesClassifier is the
"traditional" naive Bayes that's multinomial all the way,
so it's just smoothed token counts with no relation
between the tokens.

So I'm afraid there's not a good way built in to use
our non-traditional NaiveBayesClassifier with EM, as
I only implemented EM for TradNaiveBayesClassifier.  The
main obstacle for applying EM to the LM-based classifiers
is that the trainer needs to support fractional counts,
which the language model classifiers don't do.  It's possible,
but it would be a major rewrite of a bunch of classes
used everywhere in LingPipe, so we're unlikely to ever get
around to doing it.

- Bob Carpenter
    LingPipe, Inc.
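
The size matching described in point 1 can be sketched in plain Java with no LingPipe dependency. The class and method names below (BalanceSketch, duplicate, sample) are hypothetical and only illustrate the idea: repeat the labeled items to give them more weight, or draw a smaller random subset of the unlabeled pool, before handing both sets to EM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of LingPipe.
class BalanceSketch {

    // Repeat every labeled item `copies` times so the labeled data carries more weight.
    static <T> List<T> duplicate(List<T> labeled, int copies) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            out.addAll(labeled);
        }
        return out;
    }

    // Draw up to `size` items without replacement from the unlabeled pool.
    static <T> List<T> sample(List<T> unlabeled, int size, long seed) {
        List<T> shuffled = new ArrayList<>(unlabeled);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, Math.min(size, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> labeled = Arrays.asList("doc-a", "doc-b");
        List<String> unlabeled = Arrays.asList("u1", "u2", "u3", "u4", "u5", "u6");
        System.out.println(duplicate(labeled, 3).size());     // 6
        System.out.println(sample(unlabeled, 2, 42L).size()); // 2
    }
}
```

Which ratio of labeled to unlabeled items works best is an empirical question; the sketch only shows the mechanics.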


#941 From: Ning Yu <ninginiu@...>
Date: Thu Jun 3, 2010 12:51 am
Subject: Re: Questions regarding EM
 
Dear Bob,

Thank you so much!
Your answer is very helpful.

Ning


#942 From: "carat_e" <carat_e@...>
Date: Sat Jun 12, 2010 7:23 am
Subject: Significant Keywords instead of Significant Phrases
 
Hello,

I tried out the significant phrases tutorial - first of all, thanks for the
lingpipe package and the great tutorials.

While the significant phrases demo is generating significant phrases of a
foreground vs. a background model, I am looking into extracting significant
keywords, given the same foreground and background models.

As I am relatively new to NLP and not too Java savvy, I did not succeed in
making this happen. How would I need to modify this demo in order to generate
unigrams instead of bigrams?

Any help or guidance would be appreciated.

Thanks!

#943 From: "Bob Carpenter" <carp@...>
Date: Sun Jun 13, 2010 5:40 pm
Subject: Re: Significant Keywords instead of Significant Phrases
 
You can do this by changing the size of the
n-grams in the tutorial.  That's the constant
NGRAM for the models and NGRAM_REPORTING_LENGTH
for the output.

I've updated everything to the LingPipe
4.0 API, so the code in the tutorial itself is
a little out of date. The arrays of ScoredObject
are replaced with sorted sets with the appropriate
generic specification.

    SortedSet<ScoredObject<String[]>> newTerms
        = foregroundModel.newTermSet(NGRAM_REPORTING_LENGTH,
                                     MIN_COUNT,
                                     MAX_COUNT,
                                     backgroundModel);

    report(newTerms);


- Bob Carpenter
   LingPipe, Inc


#944 From: "carat_e" <carat_e@...>
Date: Thu Jun 17, 2010 8:17 am
Subject: Re: Significant Keywords instead of Significant Phrases
 
Thanks for your quick reply, Bob!

I tried what you have suggested with the demo and the rec.sport.hockey data sets
and it works perfectly for NGRAM >=2 and NGRAM_REPORTING_LENGTH >=2.

If I specify NGRAM=1 and NGRAM_REPORTING_LENGTH=2 it runs but doesn't return
any results (as with any other combination where NGRAM_REPORTING_LENGTH >
NGRAM), which makes sense to me.

If I set NGRAM=1 and NGRAM_REPORTING_LENGTH=1 it throws the exception below
("Require n-gram >= 2 for chi square independence. Found nGram length=1").

Now, if I set NGRAM=2 and NGRAM_REPORTING_LENGTH=1, it still throws the exact
same exception ("Require n-gram >= 2 for chi square independence. Found nGram
length=1") - although n-gram is set to be 2 hence meeting the requirement.

Can you please let me know what I have to change to get it running?
(As mentioned, I would like to retrieve significant keywords, i.e. unigrams)

Thanks!

=========================

Training background model
Training on ..\..\..\data\rec.sport.hockey\train

Assembling collocations in Training
Exception in thread "main" java.lang.IllegalArgumentException: Require n-gram >= 2 for chi square independence. Found nGram length=1
         at com.aliasi.lm.TokenizedLM.chiSquaredIndependence(TokenizedLM.java:995)
         at com.aliasi.lm.TokenizedLM$CollocationCollector.scoreNGram(TokenizedLM.java:691)
         at com.aliasi.lm.TokenizedLM$Collector.handle(TokenizedLM.java:667)
         at com.aliasi.lm.TokenizedLM$Collector.handle(TokenizedLM.java:646)
         at com.aliasi.lm.TrieIntSeqCounter.handleNGrams(TrieIntSeqCounter.java:316)
         at com.aliasi.lm.TrieIntSeqCounter.handleNGrams(TrieIntSeqCounter.java:321)
         at com.aliasi.lm.TrieIntSeqCounter.handleNGrams(TrieIntSeqCounter.java:261)
         at com.aliasi.lm.TokenizedLM.collocationSet(TokenizedLM.java:790)
         at Test01.main(Test01.java:50)







#945 From: "Bob Carpenter" <carp@...>
Date: Thu Jun 17, 2010 4:55 pm
Subject: Re: Re: Significant Keywords instead of Significant Phrases
 
Sorry, my bad.  I forgot that the collocation method
in TokenizedLM is looking for words that are associated
with each other.  It doesn't make sense to do it for
unigrams -- the exception was correct and had the
right error message.

If you only want to find frequent unigrams, it's much
easier.  Just run over the tokenizer and store the
token counts in a map.  I find our util.ObjectToCounterMap
useful for this kind of thing.
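
A minimal sketch of the count-tokens-in-a-map approach, using only the Java standard library rather than LingPipe's tokenizers or ObjectToCounterMap. The regular-expression split is a stand-in tokenizer and the names UnigramCountSketch, count and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of LingPipe.
class UnigramCountSketch {

    // Count lowercased tokens, splitting on runs of non-letter characters.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return the k most frequent tokens, ties broken alphabetically.
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return keys.subList(0, Math.min(k, keys.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("The puck hit the post and the puck stopped.");
        System.out.println(topK(counts, 2)); // [the, puck]
    }
}
```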

In order to use the same TokenizedLM interface, you can
use the frequentTermSet() method with an n-gram size of 1.

The handleNGrams() method also gives you a general
visitor implementation that filters on n-gram length
and minimum count.

In terms of trying to find significant keywords, though,
what this'll give you is frequent keywords.  What might
be better is to build a background model over some
generic text (say lots of Wikipedia samples or your whole
corpus if you're dividing it up by parts).  Then for the
text for which you're trying to find keywords, build
a foreground model.  In the foreground model, call
newTermSet() which returns results sorted in order of
their increased likelihood in the foreground model over
the background model.  That should get rid of lots of
the junk like function words ("the", "of", etc.).

Or you can just filter out stop words by hand over
the most frequent terms.

- Bob Carpenter
   LingPipe, Inc
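
The foreground-versus-background idea can be sketched without LingPipe as follows. This is an illustration under stated assumptions, not the scoring used by newTermSet(): it estimates each token's background probability with add-one smoothing (an assumption) and scores the foreground count with a binomial z-score. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of LingPipe.
class ForegroundScoreSketch {

    // Stand-in tokenizer: lowercase, split on non-letters, count tokens.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // z = (k - n*p) / sqrt(n*p*(1-p)), where k is the foreground count,
    // n the foreground size and p the smoothed background probability.
    // Assumes the foreground text is non-empty.
    static double zScore(String token, Map<String, Integer> fg, Map<String, Integer> bg) {
        int n = total(fg);
        int k = fg.getOrDefault(token, 0);
        // Add-one smoothing with one extra slot reserved for unseen tokens.
        double p = (bg.getOrDefault(token, 0) + 1.0) / (total(bg) + bg.size() + 1.0);
        return (k - n * p) / Math.sqrt(n * p * (1.0 - p));
    }

    public static void main(String[] args) {
        Map<String, Integer> bg = count("the game was on the ice and the crowd was loud");
        Map<String, Integer> fg = count("the goalie saved the puck and the goalie smiled");
        System.out.printf("goalie: %.2f%n", zScore("goalie", fg, bg));
        System.out.printf("the:    %.2f%n", zScore("the", fg, bg));
    }
}
```

In this toy input, "goalie" scores higher than "the" even though "the" is more frequent in the foreground, which is the effect described above for function words.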

#946 From: Otis Gospodnetic <otis_gospodnetic@...>
Date: Thu Jun 17, 2010 5:20 pm
Subject: 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
 
Hi,

I just spotted a mention of NGRAM_REPORTING_LENGTH in one of the messages.  I
didn't recall seeing this variable in the code/javadoc/tutorials before, so I
looked it up and ended up with the following questions:

Q1: In places like the TokenizedLM constructor (
http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#TokenizedLM%28com.aliasi.tokenizer.TokenizerFactory,%20int%29
), is the n-gram order the *maximum* n-gram length that the model will use?  For
example, if I use 3, will the model end up containing only 3-grams, or will it
create unigrams, bigrams and trigrams?


Q2: http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
mentions NGRAM_REPORTING_LENGTH here:

   ScoredObject[] newTerms = foregroundModel.newTerms(NGRAM_REPORTING_LENGTH,
       MIN_COUNT, MAX_COUNT, backgroundModel);

And here is the javadoc for that method:
http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#newTermSet%28int,%20int,%20int,%20com.aliasi.lm.LanguageModel.Tokenized%29

So the purpose of NGRAM_REPORTING_LENGTH is to act as a filter that restricts
the output n-grams to only that specified length?
Is there an API that lets me say "give me n-grams of size min to max"?

Also, why is MIN_COUNT needed?  I assume it does the obvious (prevents this
method from returning n-grams with fewer than MIN_COUNT occurrences), but do you
find that consumers of this API really need the ability to filter by n-gram count?

Thanks,
Otis
----
Lucene ecosystem search :: http://search-lucene.com/

#947 From: Bob Carpenter <carp@...>
Date: Thu Jun 17, 2010 6:26 pm
Subject: Re: 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
 
Answers.

Q1.  The N_GRAM parameter for language models means
you'll store all the n-gram counts for n-grams
up to the specified length, 0-grams, 1-grams,
2-grams, ..., n-grams all get counted.   There
are methods to access the counts in underlying
sequence counters and with the visitor for tokenized
LMs.

Q2. a.  Yes, NGRAM_REPORTING_LENGTH only restricts
the outputs.  Of course, it only does the inner
loop work on the specified length.

b. Unfortunately, there's no API call to get
all the n-grams in a range, but it's no slower to
just run all the lengths and collect up the results.

c. Min count does two things.  It saves an inner
loop computation for every n-gram below the min count.
This can be significant for things like chi-square
collocation or t-test relative importance, which are
both arithmetic and memory intensive.  Second, it filters
the results in a useful way.  You often only
want to report results for frequent n-grams, because
a statistically significant difference might not
be worth reporting for rare terms.
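
The answers above can be illustrated with a small self-contained sketch: count every n-gram up to a maximum length in one pass, then collect a range of lengths while skipping anything below a minimum count. This is plain Java with hypothetical names (NGramRangeSketch, countUpTo, inRange), not the TokenizedLM implementation, and it omits the 0-gram count mentioned in Q1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of LingPipe.
class NGramRangeSketch {

    // Count every n-gram of length 1..maxN over the token sequence.
    static Map<List<String>, Integer> countUpTo(String[] tokens, int maxN) {
        Map<List<String>, Integer> counts = new HashMap<>();
        List<String> all = Arrays.asList(tokens);
        for (int start = 0; start < tokens.length; start++) {
            for (int n = 1; n <= maxN && start + n <= tokens.length; n++) {
                List<String> gram = new ArrayList<>(all.subList(start, start + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep n-grams with minN <= length <= maxN and count >= minCount.
    static Map<List<String>, Integer> inRange(Map<List<String>, Integer> counts,
                                              int minN, int maxN, int minCount) {
        Map<List<String>, Integer> kept = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
            int len = e.getKey().size();
            // Cheap checks first, so any expensive scoring would be skipped
            // for rare n-grams and for lengths outside the range.
            if (e.getValue() >= minCount && len >= minN && len <= maxN) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = "new york yankees beat new york mets".split(" ");
        Map<List<String>, Integer> counts = countUpTo(tokens, 3);
        System.out.println(inRange(counts, 1, 2, 2).size()); // 3
    }
}
```

The three survivors in the example are "new", "york" and "new york", each seen twice.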

Bonus question.

I don't have a good handle on how to compare
results across n-gram lengths. For instance,
"Yankees", "York Yankees", "New York" and
"New York Yankees" might all be relatively
frequent in one corpus over the other.  Same
thing for "W Bush", "George W", "Bush" and
"George W Bush".

What we sometimes do is look at the output, and if
there's a superstring at a longer n-gram that's significant,
don't report the substring.  But I'm
not sure about things like "New York Yankees".
If that's significant, you might also want
to report "New York" and "Yankees".  But probably
not "York Yankees".
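
The superstring heuristic just described might be sketched like this, assuming the significant n-grams have already been collected as token lists. The names are hypothetical. Note that this simple version also drops "New York" and "Yankees" whenever "New York Yankees" is present, which is exactly the case the paragraph above is unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of LingPipe.
class SuperstringFilterSketch {

    // Drop any n-gram that occurs as a contiguous run inside a longer n-gram in the list.
    static List<List<String>> dropSubsumed(List<List<String>> grams) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> gram : grams) {
            boolean subsumed = false;
            for (List<String> other : grams) {
                if (other.size() > gram.size()
                        && Collections.indexOfSubList(other, gram) >= 0) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                kept.add(gram);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> grams = Arrays.asList(
            Arrays.asList("new", "york", "yankees"),
            Arrays.asList("new", "york"),
            Arrays.asList("york", "yankees"),
            Arrays.asList("yankees"),
            Arrays.asList("bush"));
        System.out.println(dropSubsumed(grams)); // [[new, york, yankees], [bush]]
    }
}
```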

One approach might be to count whole noun
phrases like unigrams.

Any other suggestions on how to
(easily) solve this problem would be greatly
appreciated!  One approach might be to evaluate
the "phrasiness" of the string using something
like part-of-speech tagging.

- Bob Carpenter
    LingPipe, Inc



#948 From: Otis Gospodnetic <otis_gospodnetic@...>
Date: Thu Jun 17, 2010 8:57 pm
Subject: Re: 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
 
Thanks Bob.

Regarding your question about phrases and subphrases, I read something related the
other day:
http://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2009/ranlp09_camera_ready.pdf

  Otis
----
Lucene ecosystem search :: http://search-lucene.com/




#949 From: "otis_gospodnetic" <otis_gospodnetic@...>
Date: Fri Jun 18, 2010 8:37 pm
Subject: Re: 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
 
Thanks Bob.

In Q2 a and Q2 c you mention some "inner loop" ("it only does the inner loop
work on the specified length").  I don't know what you are referring to.  Is
there a place in the code that I should look to understand this better?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch




#950 From: Bob Carpenter <carp@...>
Date: Fri Jun 18, 2010 9:24 pm
Subject: Re: Re: 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
 
Look in the TokenizedLM implementation or doc.

What's going on is that there's a general visitor
implementation that visits all the token sequences
above a given count and performs some operation on them.

Collocation and new terms are computed with the visitor.
For collocations, it computes chi-square values to test
if sequences of terms (an n-gram) occur together more frequently than
would be expected by chance (i.e. assuming independence).
For new term discovery, the visitor computes a t-test on the
probability of the sequence in the foreground language model versus
a background language model to determine if the count
is significantly higher in the foreground model.  Both
of these operations are fairly memory and compute
intensive.

It's just that underlying t-test or chi-square test
that I'm talking about being "the inner loop".  I
probably should've said "callback", because that's
what's really going on with the visitor.

Rather than computing a t-test over all n-grams,
it only does the work for n-grams of the specified
length with a count above the specified threshold.

- Bob Carpenter
    LingPipe, Inc
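
To make the "callback" work concrete, here is a sketch of the standard Pearson chi-square statistic for a single bigram, computed from a 2x2 contingency table. It is the textbook formula and is not taken from the TokenizedLM source; the names are hypothetical. It also shows why the test needs at least two tokens: independence is a relation between the first and second position.

```java
// Hypothetical helper, not part of LingPipe.
class BigramChiSquareSketch {

    // pairCount:   times the bigram (w1 w2) occurs
    // firstCount:  times w1 occurs as the first token of any bigram
    // secondCount: times w2 occurs as the second token of any bigram
    // total:       total number of bigrams
    static double chiSquare(long pairCount, long firstCount, long secondCount, long total) {
        double o11 = pairCount;                // w1 followed by w2
        double o12 = firstCount - pairCount;   // w1 followed by something else
        double o21 = secondCount - pairCount;  // something else followed by w2
        double o22 = total - o11 - o12 - o21;  // neither
        double denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22);
        if (denominator == 0.0) {
            return 0.0; // degenerate table: no evidence either way
        }
        double diff = o11 * o22 - o12 * o21;
        return total * diff * diff / denominator;
    }

    public static void main(String[] args) {
        // The two words always occur together: strong association.
        System.out.println(chiSquare(10, 10, 10, 100)); // 100.0
        // Observed pair count equals what independence predicts (20 * 50 / 100 = 10).
        System.out.println(chiSquare(10, 20, 50, 100)); // 0.0
    }
}
```

A visitor that applies a minimum count would simply skip the chiSquare call for any bigram whose pairCount falls below the threshold.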


#952 From: Mandefro Legesse <mandada764@...>
Date: Tue Jun 29, 2010 10:18 am
Subject: Re: Asking information
 
Dear Moges,
  There is no attached file. Could you attach your code and point out the specific
part that generated the error? Then forward this to the LingPipe group so that
we can also suggest possible corrections.

Regards
Mandefro L.




________________________________
From: moges ahmed <moges_m@...>
To: LingPipe@yahoogroups.com
Sent: Sun, June 27, 2010 4:42:57 AM
Subject: [LingPipe] Asking information


Dear Bob Carpenter,
Currently, I am in the design phase of my thesis (named entity recognition for
the Amharic language), hand in hand with testing the prototype using your own code.
But I ran into a problem when training with the code after some modifications; as
a result I am stuck for the time being. I have attached the code, a POS model
for Amharic (which I trained with the LingPipe POS code), and sample data for
training, dev, and testing.
So, could you help me identify what the problem could be?
I know how much of a burden I am putting on you, but I have no other option because
I must finish my thesis by the end of July.
With regards,
Moges A.

--- On Thu, 6/17/10, Bob Carpenter <carp@...> wrote:

From: Bob Carpenter <carp@...>
Subject: Re: [LingPipe] 2 Qs: n-gram order and NGRAM_REPORTING_LENGTH
To: LingPipe@yahoogroups.com
Date: Thursday, June 17, 2010, 11:26 AM

Answers.

Q1.  The N_GRAM parameter for language models means
you'll store all the n-gram counts for n-grams
up to the specified length, 0-grams, 1-grams,
2-grams, ..., n-grams all get counted.   There
are methods to access the counts in underlying
sequence counters and with the visitor for tokenized
LMs.
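
To make the answer to Q1 concrete, here is a small sketch in plain Java of what "counting everything up to length n" means. It does not use LingPipe's sequence counters. The class name is hypothetical, and the 0-gram (the total count) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NGramCounter {

    /** Counts every token n-gram of length 1..maxNGram in the sequence. */
    static Map<List<String>, Integer> count(List<String> tokens, int maxNGram) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= maxNGram && start + len <= tokens.size(); len++) {
                // copy the sublist so the key does not depend on the input list
                List<String> gram = new ArrayList<>(tokens.subList(start, start + len));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With maxNGram set to 3, the sequence "new york yankees new york" yields counts for unigrams, bigrams and trigrams, not trigrams alone.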

Q2. a.  Yes, NGRAM_REPORTING_LENGTH only restricts
the outputs.  Of course, it only does the inner
loop work on the specified length.

b. Unfortunately, there's no API call to get
all the n-grams in a range, but it's no slower to
just run all the lengths and collect up the results.

c. Min count does two things.  It saves an inner
loop computation for every n-gram below the min count.
This can be significant for things like chi-square
collocation or t-test relative importance, which are
both arithmetic and memory intensive.  Second, it filters
the results in a useful way.  You often only
want to report results for frequent n-grams, because
a statistically significant difference might not
be worth reporting for rare terms.

Bonus question.

I don't have a good handle on how to compare
results across n-gram lengths. For instance,
"Yankees", "York Yankees", "New York" and
"New York Yankees" might all be relatively
frequent in one corpus over the other.  Same
thing for "W Bush", "George W", "Bush" and
"George W Bush".

What we sometimes do is look at the output, and if
there's a superstring at a longer n-gram that's significant,
don't report the substring.  But I'm
not sure about things like "New York Yankees".
If that's significant, you might also want
to report "New York" and "Yankees".  But probably
not "York Yankees".
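
The superstring heuristic can be sketched as a post-processing step over the significant n-grams. This is an illustration in plain Java with a made-up class name, not something LingPipe provides. It assumes each n-gram is a list of tokens and that every n-gram in the input has already passed the significance test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SubstringFilter {

    /** Keeps only n-grams that are not contained in a longer n-gram in the list. */
    static List<List<String>> dropContained(List<List<String>> significant) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> candidate : significant) {
            boolean contained = false;
            for (List<String> other : significant) {
                // contiguous containment, e.g. [York, Yankees] in [New, York, Yankees]
                if (other.size() > candidate.size()
                        && Collections.indexOfSubList(other, candidate) >= 0) {
                    contained = true;
                    break;
                }
            }
            if (!contained) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Note that this blunt version also drops "New York" and "Yankees" whenever "New York Yankees" is significant, which is exactly the case where the heuristic is questionable.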

One approach might be to count whole noun
phrases like unigrams.

Any other suggestions on how to
(easily) solve this problem would be greatly
appreciated!  One approach might be to evaluate
the "phrasiness" of the string using something
like part-of-speech tagging.

- Bob Carpenter
    LingPipe, Inc

Otis Gospodnetic wrote:
> Hi,
>
> I just spotted a mention of NGRAM_REPORTING_LENGTH in one of the
> messages. I didn't recall seeing this variable in the
> code/javadoc/tutorials before, so I looked it up and ended up with the
> following questions:
>
> Q1: In places like TokenizedLM ctor (
>
> http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#TokenizedLM%28com.aliasi.tokenizer.TokenizerFactory,%20int%29
> ), is the n-gram order the *maximum* n-gram that the model will use?
> For example, if I use 3, will the model end up containing only 3-grams,
> or will it create unigrams, bigrams and trigrams?
>
> Q2:
> http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
> mentions NGRAM_REPORTING_LENGTH here:
>
> ScoredObject[] newTerms =
> foregroundModel.newTerms(NGRAM_REPORTING_LENGTH, MIN_COUNT, MAX_COUNT,
> backgroundModel);
>
> And here is the javadoc for that method:
>
> http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#newTermSet%28int,%20int,%20int,%20com.aliasi.lm.LanguageModel.Tokenized%29
>
> So the purpose of NGRAM_REPORTING_LENGTH is to act as a filter that
> restricts the output ngrams to be of only that specified length?
> Is there an API that lets me say "give me ngrams of size min to max size"?
>
> Also, why is MIN_COUNT needed? I assume it does the obvious - prevents
> this method from returning ngrams with fewer than MIN_COUNT occurrences,
> but do you find that consumers of this API really need the ability to filter
> by ngram count?
>
> Thanks,
> Otis
> ----
> Lucene ecosystem search :: http://search-lucene.com/


#953 From: "xiongwu" <hopexia@...>
Date: Wed Jun 30, 2010 3:55 pm
Subject: Lowercase Tokenizer for Spellchecker
hopexia
 
I would like to ignore the case sensitivity in the SpellChecker example.

I modified QuerySpellCheck.java to use lowercase Tokenizer.
         TokenizerFactory tokenizerFactory
             = new EnglishStopTokenizerFactory(
                 new LowerCaseTokenizerFactory(IndoEuropeanTokenizerFactory.INSTANCE));

However, it does not seem to work when I enter the exact keywords in uppercase.
For example:

wayn gretsky
      [java] >Found 0 document(s) that matched query 'wayn gretsky':
      [java] Found 258 document(s) matching best alt='wayne gretzky':
      [java]
WAYN GRETSKY
      [java] >Found 0 document(s) that matched query 'WAYN GRETSKY':
      [java] Best alternative not valid query.
      [java] Alternative=* * * * * * *
      [java]

Could you help me to resolve the issue? Thanks

#954 From: Bob Carpenter <carp@...>
Date: Wed Jun 30, 2010 4:37 pm
Subject: Re: Lowercase Tokenizer for Spellchecker
colloquialdo...
 
Thanks for reporting the problem so clearly.
I replicated the bug and tracked down the
source.

The problem is that the tokenizer factory isn't getting
serialized with the spell checker, only the
set of valid tokens is.  The strange best alternative
you're seeing (all asterisks) is because there's a
lot of useless punctuation in the training data.

A deeper problem is that if I patch this to
do the sensible thing, which is serialize the
tokenizer factory if there is one, it'll break
backward compatibility on behavior.  Someone
could train with a tokenizer factory that's not
serializable or worse yet, implements Serializable
but has an unserializable component.  It'd also
change the behavior of the compiled models from
what they are now.  I'm not sure what to do here.
At the very least, we need better doc.

To get the behavior you expect, use the setTokenizerFactory()
method on CompiledSpellChecker to set the tokenizer
to what you used for training.  Here's what I did:

$LINGPIPE/demos/tutorial/querySpellCheck/src/QuerySpellCheck.java

...
TokenizerFactory tokenizerFactory
    = new com.aliasi.tokenizer.LowerCaseTokenizerFactory(IndoEuropeanTokenizerFactory.INSTANCE);
...
CompiledSpellChecker compiledSC = readModel(MODEL_FILE);
compiledSC.setTokenizerFactory(tokenizerFactory);
...

Here's what I get using your examples over
the 4 newsgroups sample data:

wayn gretsky
       [java] >Found 0 document(s) that matched query 'wayn gretsky':
       [java] Found 129 document(s) matching best alt='wayne gretzky':
       [java]
WAYN GRETSKY
       [java] >Found 0 document(s) that matched query 'WAYN GRETSKY':
       [java] Found 129 document(s) matching best alt='wayne gretzky':

One more thing.  We hadn't imagined using stoplisted
tokenizer factories because the words on the stoplist
will actually be removed from the text.  Just make
sure that's what you want -- it might be confusing
in a user interface.  On the other hand, it'll avoid
all that correction to really popular known words.

Then again, the example shows that you probably
do want to remove gratuitous punctuation, at least
from training.

- Bob Carpenter
    LingPipe, Inc



xiongwu wrote:
>
>
> I would like to ignore the case sensitivity in the SpellChecker example.
>
> I modified QuerySpellCheck.java to use lowercase Tokenizer.
> TokenizerFactory tokenizerFactory
> = new EnglishStopTokenizerFactory(new
> LowerCaseTokenizerFactory(IndoEuropeanTokenizerFactory.INSTANCE));
>
> However, it does not seem to work when I enter the exact keywords in
> uppercase. For example:
>
> wayn gretsky
> [java] >Found 0 document(s) that matched query 'wayn gretsky':
> [java] Found 258 document(s) matching best alt='wayne gretzky':
> [java]
> WAYN GRETSKY
> [java] >Found 0 document(s) that matched query 'WAYN GRETSKY':
> [java] Best alternative not valid query.
> [java] Alternative=* * * * * * *
> [java]

#955 From: "ningyu_coco" <ninginiu@...>
Date: Wed Jun 30, 2010 4:57 pm
Subject: how to read a model file?
ningyu_coco
 
Hello,

I wonder how to read a model (e.g., the subjectivity.model that is generated for
the sentiment analysis tutorial).

Thank you,
  Ning

#956 From: Bob Carpenter <carp@...>
Date: Wed Jun 30, 2010 5:14 pm
Subject: Re: how to read a model file?
colloquialdo...
 
It's a serialized Java object.

You can either read it using a java.io.ObjectInput
or you can use the utility method we provide:

com.aliasi.util.AbstractExternalizable.readObject(File)

You'll need to cast it back to what you want.
In the case of the subjectivity.model, that'll
be an instance of

LMClassifier<LanguageModel,MultivariateDistribution>

I added a note to the doc to this effect (won't be out
until the next release).  It's not
very clear from the doc as is, because there's so much
indirection in all the parameter doc.
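
For the java.io route, a minimal sketch using only the JDK might look like the following. The class and method names are hypothetical, and writeTemp exists only so the reader can be tried without a LingPipe model at hand. With LingPipe on the classpath, the object read from subjectivity.model would be cast to the LMClassifier type named above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class ModelIo {

    /** Reads one serialized object from the file; the caller casts the result. */
    static Object readModel(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            // the class of the stored object must be on the classpath
            throw new IllegalStateException("class of serialized object not found", e);
        }
    }

    /** Helper for trying the reader out: serializes any object to a temp file. */
    static File writeTemp(Serializable obj) {
        try {
            File file = File.createTempFile("model", ".ser");
            file.deleteOnExit();
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(obj);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cast is unchecked at run time for generic parameters, so a wrong type argument only shows up when the classifier is used.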

- Bob Carpenter
    LingPipe, Inc

ningyu_coco wrote:
>
> I wonder how to read a model (e.g., the subjectivity.model that is
> generated for the sentiment analysis tutorial).

#957 From: "xiongwu" <hopexia@...>
Date: Wed Jun 30, 2010 5:20 pm
Subject: Re: Lowercase Tokenizer for Spellchecker
hopexia
 
Thanks Bob. The trick works. Yeah, it would be good to update the document.

BTW, do you have a punctuation removal tokenizer to use?



#958 From: Bob Carpenter <carp@...>
Date: Wed Jun 30, 2010 5:43 pm
Subject: Re: Re: Lowercase Tokenizer for Spellchecker (removing punctuation)
colloquialdo...
 
Getting tokenization right's a tricky business.

You can remove punctuation in a couple of ways.
The easiest is probably to use a RegExFilteredTokenizerFactory
with a Pattern constructed from a regex that only
matches alpha-numerics.  I'm not the world's greatest
regex engineer, so you may want to test things for
your data.

For instance, for Unicode, you could use something
that matches alphanumeric tokens only

[\p{L}\p{Nd}]*

Or you can create a regex that matches anything but
punctuation

[^\p{Punct}]*

The problem here is that with the Indo-European tokenizer
factory, numbers like 1.05 get tokenized as a single token
and will be thrown away by this filter.
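
The effect of the two patterns can be checked with java.util.regex alone, before wiring anything into a tokenizer factory. The sketch below does not call RegExFilteredTokenizerFactory. The class name is made up, and it uses `+` rather than `*` so that an empty token never passes, which is a small departure from the patterns as written above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TokenFilter {

    // Unicode letters and decimal digits only
    static final Pattern ALNUM = Pattern.compile("[\\p{L}\\p{Nd}]+");
    // anything except ASCII punctuation
    static final Pattern NOT_PUNCT = Pattern.compile("[^\\p{Punct}]+");

    /** Keeps the tokens that match the pattern in full. */
    static List<String> keepMatching(List<String> tokens, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Running either pattern over the tokens "wayne", "gretzky", ",", "1.05", "99", "***" keeps the two words and "99" but discards "1.05", which is the problem with decimal numbers mentioned above.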

You may also extend ModifyTokenTokenizerFactory by
implementing its modifyToken(String) method -- a null
return removes the token.

- Bob Carpenter
    LingPipe, Inc

xiongwu wrote:
> BTW, do you have a punctuation removal tokenizer to use?

#959 From: Ning Yu <ninginiu@...>
Date: Wed Jun 30, 2010 8:39 pm
Subject: Re: how to read a model file?
ningyu_coco
 
Thank you, Bob. I just did.
But I wonder if there is any document explaining how to understand the
language model in the following format:

Max NGram=8
Log2 Uniform Estimate=-15.999978
i c suff prob 1-lambda firstChild
0 ? -1 -15.999978 1 -10.209756
1   0 -2.3965468 64 -7.930619
2 ! 0 -13.727352 120 -2.7004397
3 " 0 -9.83829 121 -6.366322
4 # 0 -11.803265 122 -2.371559
5 $ 0 -15.186342 127 -0.36257008
6 % 0 -17.184265 134 -0.32192808
7 & 0 -11.847117 136 -4.409391
8 ' 0 -8.476222 137 -3.229588
9 ( 0 -9.748228 162 -6.455327
...
794234   464193 -1.1902922
794235 d 482649 -1.8973387
794236 d 501012 -1.2531055

Best,
  Ning


#960 From: Bob Carpenter <carp@...>
Date: Wed Jun 30, 2010 8:53 pm
Subject: Re: compiled form of LM doc
colloquialdo...
 
That's the compiled form of the LM, which is
documented in the javadoc

http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/CompiledNGramProcessLM.html

and there's a paper-length exposition of the whole
setup in:

http://lingpipe.files.wordpress.com/2008/04/alias-i-acl05soft.pdf

- Bob Carpenter
    LingPipe, Inc


Ning Yu wrote:

> But I wonder if there is any document explaining how to understand the
> language model in the following format:
>
> Max NGram=8
> Log2 Uniform Estimate=-15.999978
> i c suff prob 1-lambda firstChild
> 0 ? -1 -15.999978 1 -10.209756
> 1 0 -2.3965468 64 -7.930619
> 2 ! 0 -13.727352 120 -2.7004397
> ...
> 794234 464193 -1.1902922
> 794235 d 482649 -1.8973387
> 794236 d 501012 -1.2531055

#961 From: Ning Yu <ninginiu@...>
Date: Thu Jul 1, 2010 7:36 pm
Subject: Re: compiled form of LM doc
ningyu_coco
 
Thank you Bob. It is very helpful.

Ning


#962 From: "Iman" <imanhassansaleh@...>
Date: Fri Jul 2, 2010 1:50 pm
Subject: Re: patched Arabic NER (ANER) corpus
ih.saleh
 
Hi Bob,

I was trying to run LingPipe with Yassine's ANER corpus, but I get an
exception and I do not know exactly what the problem is. Here is the console
output:

-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~
Input Files
     NE Corpus: E:\June2010\ANERCorpNER\corpus\ANERCorp

Parameters
    N-Gram=8
    Num chars=1024
    Interpolation Ratio=8.0
    Number of Analyses Rescored=512
    Including MISC entity type=true
    Use dictionary=false


Exception in thread "main" java.lang.IllegalArgumentException: Illegal tag
sequence. tagging.tag(8)=O tagging.tag(9)=I-MISC
         at
com.aliasi.chunk.BioTagChunkCodec.toChunking(BioTagChunkCodec.java:302)
         at ANERXval.toChunking(ANERXval.java:308)
         at ANERXval.parseANER(ANERXval.java:282)
         at ANERXval.main(ANERXval.java:118)
Java Result: 1
-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~


--- In LingPipe@yahoogroups.com, "Bob Carpenter" <carp@...> wrote:
>
>
> Yassine Benajiba patched the ANER corpus of Arabic
> named entities so it is now well formed with no
> typos in the types or ill-formed tag sequences.
> I love open source!
>
> Yassine's at Columbia now and the new location for
> the corpus is:
>
> http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
>
> The patches improved our rescoring detector's
> cross-validated F-measure by 0.2% and also reduced the
> number of types by removing erroneous ones, so models
> should run a bit faster, too.
>
> Now our tutorial code runs without patching the
> corpus:
>
> http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
>
> Now if I could only retroactively patch the CoNLL 2002
> data...
>
> - Bob Carpenter
>   Alias-i
>

#963 From: "Bob Carpenter" <carp@...>
Date: Fri Jul 2, 2010 4:10 pm
Subject: Re: Re: patched Arabic NER (ANER) corpus
colloquialdo...
 
The error is caused by ill-formed input.

Are you running the latest version of LingPipe
and have you downloaded the latest ANER?

http://www1.ccls.columbia.edu/~ybenajiba/downloads.html

I just ran through the instructions in our
read-me from scratch and the corpus appears
to be well-formed.
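
If you want to locate ill-formed tag sequences in a corpus before training, a check along these lines may help. It is a plain-Java sketch of the usual BIO constraint (an I-X tag must follow B-X or I-X), not the logic inside BioTagChunkCodec, and the class name is hypothetical.

```java
import java.util.List;

final class BioCheck {

    /**
     * Returns the index of the first tag that breaks the BIO scheme,
     * or -1 if the sequence is well formed.
     */
    static int firstIllegalIndex(List<String> tags) {
        String previous = "O"; // treat the start of the sentence like O
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("I-")) {
                String type = tag.substring(2);
                boolean continues = previous.equals("B-" + type)
                        || previous.equals("I-" + type);
                if (!continues) {
                    return i; // e.g. O followed by I-MISC
                }
            } else if (!tag.equals("O") && !tag.startsWith("B-")) {
                return i; // not a BIO tag at all
            }
            previous = tag;
        }
        return -1;
    }
}
```

The sequence in the stack trace above, O at position 8 followed by I-MISC at position 9, is the first case the check flags.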

- Bob Carpenter
   LingPipe, Inc

On July 2, 2010 09:50:56 A.M. EDT, Iman <> wrote:

> I was trying to run Lingpipe with Yassine ANER corpus, but I receive
> an exception, and I do not know exactly what is the problem. Here is
> the console output:
>
> -~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~
> Input Files
>     NE Corpus: E:\June2010\ANERCorpNER\corpus\ANERCorp
>
> Parameters
>    N-Gram=8
>    Num chars=1024
>    Interpolation Ratio=8.0
>    Number of Analyses Rescored=512
>    Including MISC entity type=true
>    Use dictionary=false
>
>
> Exception in thread "main" java.lang.IllegalArgumentException:
> Illegal tag sequence. tagging.tag(8)=O tagging.tag(9)=I-MISC
>         at
> com.aliasi.chunk.BioTagChunkCodec.toChunking(BioTagChunkCodec.java:302)
>         at ANERXval.toChunking(ANERXval.java:308)
>         at ANERXval.parseANER(ANERXval.java:282)
>         at ANERXval.main(ANERXval.java:118)
> Java Result: 1
> -~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~-~~

#964 From: Iman Saleh <imanhassansaleh@...>
Date: Sat Jul 3, 2010 5:58 am
Subject: Re: Re: patched Arabic NER (ANER) corpus
ih.saleh
 
OK thanks. I was using an older version of the corpus.



--
Iman



#965 From: Mandefro Legesse <mandada764@...>
Date: Wed Jul 7, 2010 6:49 am
Subject: How does the LingPipe's Evaluator evaluates NER Models?
mandada764
 
Hello Everyone. I have trained an NER model with my own training data following
the NER tutorial on LingPipe. It works and the performance (Precision, Recall,
F1....) is also displayed. However, I don't understand how the evaluator
evaluates the model. That is, I want to know how the evaluator measures
performance when it is given only tagged data. Can anyone tell me how it works
or point me to a document that explains the details?

Thanks

Mandefro L.





#966 From: "yalanciborsaci" <amac@...>
Date: Wed Jul 7, 2010 7:12 am
Subject: Using preprocessed vectors in classification
yalanciborsaci
 
Hello,

This is my first week with LingPipe and I'm trying to find my way around. I
already represent several documents as vectors (bags of words). I do the
tokenization/vectorization outside of LingPipe. Currently, when I want to do
classification, say with NaiveBayesClassifier, I expand these vectors into text
and go on from there. For example, if my document vector is dog = {tail:3,
leash:1, cat:2}, I create this String="tail tail tail leash cat cat" and
classify this using unigram models.

I am sure that's a very inefficient way to do that (apart from being limited to
unigram models). Could anyone give a hint on where I should look for using a
NaiveBayesClassifier (or any other classifier) which uses a map of precomputed
features instead of tokens as features?

Thanks in advance,

Amaç Herdağdelen

#967 From: Bob Carpenter <carp@...>
Date: Wed Jul 7, 2010 6:11 pm
Subject: Re: Using preprocessed vectors in classification
colloquialdo...
 
I'm afraid neither the traditional nor the
LM-based naive Bayes classifiers are set up to
work directly on multinomial data (the vector
of counts).

It's designed that way to encapsulate the vector
extraction in the classifier.  We don't usually
have to classify objects more than once, so there's
not an efficiency argument.

There is an implementation, BigVectorClassifier,
that lets you build a straight up Vector classifier
if you have linear coefficients.  If you add an
intercept for the log category prob and make the
vector values equal to log probs of tokens in
a category, that'll give you the same classification
behavior.  We actually wrote this one to be scalable
for large numbers of categories, where you'd use
a sparse vector representation.  You usually don't
get that out of naive Bayes.
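
The equivalence described above is easy to see in a few lines of plain Java. This sketch does not use BigVectorClassifier. The class name and the log probabilities a caller supplies are hypothetical, and unseen tokens are rejected rather than smoothed to keep it short.

```java
import java.util.Map;

final class LinearNaiveBayes {

    /**
     * Log joint score of one category for a bag of words:
     * intercept (log category probability) plus
     * the sum over tokens of count * log p(token | category).
     */
    static double score(Map<String, Integer> counts,
                        double logCategoryProb,
                        Map<String, Double> logTokenProb) {
        double sum = logCategoryProb;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Double weight = logTokenProb.get(e.getKey());
            if (weight == null) {
                // sketch only: a real model would smooth unseen tokens
                throw new IllegalArgumentException("unknown token: " + e.getKey());
            }
            sum += e.getValue() * weight;
        }
        return sum;
    }
}
```

Scoring the vector {tail:3, leash:1, cat:2} this way gives the same number as walking the expanded string "tail tail tail leash cat cat" token by token, and the category with the highest score wins.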

If you're getting something like contest data
that's pre-vectorized, you can use our logistic
regression implementation, which is in terms of
vectors.  It's also a better classifier in terms
of accuracy, but it's more fiddly to train.
(I'm talking about the class stats.LogisticRegression,
not the classifier).

I could add vector classification to the naive Bayes
implementations, but it'd require the user to keep
the symbol table and tokenizer factory synchronized.

- Bob Carpenter
    LingPipe, Inc

yalanciborsaci wrote:
>
>
> Hello,
>
> This is my first week with LingPipe and I'm trying to find my way
> around. I already represent several documents as vectors (bags of
> words). I do the tokenization/vectorization outside of LingPipe.
> Currently, when I want to do classification, say with
> NaiveBayesClassifier, I expand these vectors into text and go on from
> there. For example, if my document vector is dog = {tail:3, leash:1,
> cat:2}, I create this String="tail tail tail leash cat cat" and classify
> this using unigram models.
>
> I am sure that's a very inefficient way to do that (apart from being
> limited to unigram models). Could anyone give a hint on where I should
> look for using a NaiveBayesClassifier (or any other classifier) which
> uses a map of precomputed features instead of tokens as features?
>
> Thanks in advance,
>
> Amaç Herdağdelen

#968 From: Bob Carpenter <carp@...>
Date: Wed Jul 7, 2010 6:13 pm
Subject: Re: How does the LingPipe's Evaluator evaluates NER Models?
colloquialdo...
 
The tutorial explains the top-level details.  The full
documentation for NER scoring starts in:

http://alias-i.com/lingpipe/docs/api/com/aliasi/chunk/ChunkerEvaluator.html

There are several contained classes that compute the
precision-recall, as they're reused for other evaluators.
The precision-recall evaluation classes are in the classify
package -- they're all referenced if you start from
the class linked above.
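
As a rough picture of what a precision-recall evaluation does with tagged data: the chunks in the tagged (reference) data are compared with the chunks the model produces (response) for the same text. The sketch below is generic plain Java, not ChunkerEvaluator itself. It assumes a chunk counts as correct only when span and type match exactly, and it encodes each chunk as a hypothetical "start-end:type" string.

```java
import java.util.HashSet;
import java.util.Set;

final class ChunkScores {

    /** Returns {precision, recall, F1} for exact-match chunk comparison. */
    static double[] evaluate(Set<String> reference, Set<String> response) {
        Set<String> truePositives = new HashSet<>(response);
        truePositives.retainAll(reference); // chunks the system got exactly right
        double tp = truePositives.size();
        double precision = response.isEmpty() ? 0.0 : tp / response.size();
        double recall = reference.isEmpty() ? 0.0 : tp / reference.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0 : 2.0 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}
```

With four reference chunks and three response chunks, two of them correct, precision is 2/3 and recall is 2/4.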

- Bob Carpenter
    LingPipe, Inc

Mandefro Legesse wrote:
>
> Hello Everyone. I have trained an NER model with my own training data
> following the NER tutorial on LingPipe. It works and the performance
> (Precision, Recall, F1....) is also displayed. However, I don't
> understand how the evaluator evaluates the model. That is I want to know
> the way the evaluator predict the performance by only accepting tagged
> data. Can anyone just tell me how it works or provide me with a document
> that explains the details.
