LingPipe
#747 From: "rafaelcotta" <rcotta@...>
Date: Sun Sep 6, 2009 5:10 pm
Subject: More questions on Classifiers serialization
 
I really tried to figure out a solution by myself, with no luck. So nothing is left
but to ask the experts...

I have a class that holds references to a NaiveBayesClassifier and a Classifier (the
compiled version of the NaiveBayesClassifier), and I would like to be able to
save my class in persistent storage so I can get back to the same state after an
application shutdown.

The problem is that when I try to serialize the object I get a
NotSerializableException. If I make the references to Naive and Classifier
transient, the class is serialized with no problem.

The method I am using to save the object looks like this:

public void saveObject(String key, Object object) throws Exception {
  try {
    FileOutputStream fout =
      new FileOutputStream(getObjectFileName(key));
    ObjectOutputStream oos = new ObjectOutputStream(fout);
    oos.writeObject(object);
    oos.close();
  } catch (Exception e) {
    throw e;
  }
}

And I get a java.io.NotSerializableException, as the following stack trace shows:

java.io.NotSerializableException: com.aliasi.classify.NaiveBayesClassifier
	 at java.io.ObjectOutputStream.writeObject0(Unknown Source)
	 at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
	 at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
	 at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
	 at java.io.ObjectOutputStream.writeObject0(Unknown Source)
	 at java.io.ObjectOutputStream.writeObject(Unknown Source)
	 at br.cefet.engine.dao.ObjectDAOImpl.saveObject(ObjectDAOImpl.java:41)
	 at test.Tester.testClassifier(Tester.java:49)
	 at test.Tester.main(Tester.java:13)

What am I missing? I don't need a ready-made solution, just a pointer in the right direction.

Thanks in advance.

Rafael Cotta

#748 From: "Bob Carpenter" <carp@...>
Date: Tue Sep 8, 2009 3:53 pm
Subject: Re: More questions on Classifiers serialization
 
I'm afraid com.aliasi.classify.NaiveBayesClassifier
is not serializable.  It does implement
com.aliasi.util.Compilable, so you can save the
compiled (but not the dynamic) form.

To compile naive Bayes, just use:

ObjectOutput out = ...;
NaiveBayesClassifier classifier = ...;
classifier.compileTo(out);

What gets read back in is just the compiled form.
So you can't train any more.   It is much more
efficient to run in compiled form.
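
The compile-then-read-back pattern described above can be sketched with JDK classes only. This is a minimal illustration under stated assumptions: TrainableCounts and CompiledCounts are hypothetical stand-ins, not LingPipe classes, and the compileTo method here merely imitates the idea of writing a read-only form to an ObjectOutput.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

final class CompileDemo {
    /** Read-only form (hypothetical); this is what gets serialized. */
    static final class CompiledCounts implements Serializable {
        private static final long serialVersionUID = 1L;
        private final HashMap<String, Double> probs;
        CompiledCounts(HashMap<String, Double> probs) { this.probs = probs; }
        double prob(String token) { return probs.getOrDefault(token, 0.0); }
    }

    /** Trainable form (hypothetical); deliberately NOT Serializable. */
    static final class TrainableCounts {
        private final Map<String, Integer> counts = new HashMap<>();
        private int total = 0;

        void train(String token) {
            counts.merge(token, 1, Integer::sum);
            ++total;
        }

        /** Writes only the read-only form; the raw counts are not saved. */
        void compileTo(ObjectOutput out) throws IOException {
            HashMap<String, Double> probs = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet())
                probs.put(e.getKey(), e.getValue() / (double) total);
            out.writeObject(new CompiledCounts(probs));
        }
    }

    /** Compiles to an in-memory buffer and reads the compiled form back. */
    static CompiledCounts roundTrip(TrainableCounts model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                model.compileTo(out);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompiledCounts) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TrainableCounts model = new TrainableCounts();
        model.train("a");
        model.train("a");
        model.train("b");
        System.out.println(roundTrip(model).prob("a"));
    }
}
```

What comes back has no train method at all, which mirrors the point that the compiled form cannot be trained further.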

Naive Bayes would be serializable in an ideal world.
As is, it's not because tokenized LMs are not
serializable, and serializing them is a lot of
work (though I'm working at the moment on a partial
solution to this that'll allow scaling to really large
corpora using serialized forms of token counts).

It trains pretty quickly, so it shouldn't be
too much of a pain to retrain from scratch.

- Bob Carpenter
   Alias-i



#749 From: Rafael Cotta <rcotta@...>
Date: Tue Sep 8, 2009 4:00 pm
Subject: Re: More questions on Classifiers serialization
 
Hi, Bob. Thanks for your reply.

Over the weekend I've been developing a solution to retrain it from
scratch, as I was afraid I wasn't going to succeed in finding a
solution to this question.

The bad part is that I'll have to store all texts used in training.
But as I want to give the users the ability to add more categories
after the initial training, I would be forced to store all this
information anyway, as I must give the categories list while
constructing the classifier.
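
The store-the-texts-and-retrain approach described above can be sketched as follows. This is only an illustration under assumptions: ToyModel is a hypothetical stand-in for a trainable, non-serializable classifier, and PersistentTrainer is a hypothetical wrapper that serializes the training samples and rebuilds the transient model when it is read back.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

final class RetrainOnLoadDemo {
    /** Hypothetical stand-in for a trainable classifier that is not Serializable. */
    static final class ToyModel {
        final Map<String, Integer> samplesPerCategory = new HashMap<>();
        void train(String category, String text) {
            samplesPerCategory.merge(category, 1, Integer::sum);
        }
    }

    /** Serializes the training samples only; the model itself is transient. */
    static final class PersistentTrainer implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<String[]> samples = new ArrayList<>(); // {category, text}
        private transient ToyModel model = new ToyModel();

        void train(String category, String text) {
            samples.add(new String[] { category, text });
            model.train(category, text);
        }

        ToyModel model() { return model; }

        /** Called by Java deserialization: restore samples, then retrain from scratch. */
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            model = new ToyModel();
            for (String[] sample : samples)
                model.train(sample[0], sample[1]);
        }
    }

    /** Serializes to an in-memory buffer and reads the trainer back. */
    static PersistentTrainer roundTrip(PersistentTrainer trainer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(trainer);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PersistentTrainer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PersistentTrainer trainer = new PersistentTrainer();
        trainer.train("sports", "the match ended in a draw");
        trainer.train("sports", "the team won the cup");
        trainer.train("politics", "the vote was postponed");
        System.out.println(roundTrip(trainer).model().samplesPerCategory);
    }
}
```

The cost is the one noted above: every training text has to be kept, and loading takes as long as retraining.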

Although I couldn't find the solution I wanted for my problem, let me
use this opportunity to say again that LingPipe is such a great tool.

Thanks for your reply.

Rafael Cotta




#750 From: "pranav8494" <pranav8494@...>
Date: Fri Sep 11, 2009 2:10 pm
Subject: Natural Language date and time parser for java
 
Hey guys,

I am working on a natural language parser which examines a sentence in English
and extracts information such as names and dates. One of the problems I am having
is parsing the dates.

For example: "Let's meet next Tuesday at 5 PM at the beach."

So the output would be something like: "Let's meet 15/09/2009 at 1700 hr at the
beach"

So basically, what I want to know is whether there is any way in LingPipe to
parse dates from a sentence and output them in a specified format.

Regards,
Pranav

#751 From: "Bob Carpenter" <carp@...>
Date: Fri Sep 11, 2009 5:31 pm
Subject: Re: Natural Language date and time parser for java
 
No, we don't have anything in LingPipe
for normalizing dates.  We participated
in the TERN evaluation where we trained
named-entity recognizers for English and
Chinese date extraction.  And there's
TIMEX chunking information in the MUC
and ACE data.  So it's easy enough to
train date extractors with training data.

But normalization is difficult, because "next
Tuesday" has an interpretation that depends
on the context of when it was said.  And
the word "next" for days in American English
is ambiguous as to the next one in this week
or the one next week.  And times are ambiguous
on 12-hour clocks.  And lots of references
are vague ("around 5 o'clock"), include further
indexicals ("after work", "at sunset"), refer to
frequency ("twice a day"), or refer to
intervals ("from 9 to 5").
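
The dependence on context and the two readings of "next Tuesday" can be made concrete with a small sketch using the JDK's java.time package (not LingPipe). The class and method names are hypothetical, and treating Monday as the first day of the week is an assumption made only for this illustration.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAdjusters;

final class NextTuesdayDemo {
    /** Reading 1: the first such day strictly after the reference date. */
    static LocalDate nearestNext(LocalDate reference, DayOfWeek day) {
        return reference.with(TemporalAdjusters.next(day));
    }

    /** Reading 2: that day within the week following the reference date
     *  (assumes weeks start on Monday). */
    static LocalDate inFollowingWeek(LocalDate reference, DayOfWeek day) {
        LocalDate nextMonday = reference.with(TemporalAdjusters.next(DayOfWeek.MONDAY));
        return nextMonday.with(TemporalAdjusters.nextOrSame(day));
    }

    static String format(LocalDate date) {
        return date.format(DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    }

    public static void main(String[] args) {
        LocalDate friday = LocalDate.of(2009, 9, 11);  // said on a Friday
        LocalDate monday = LocalDate.of(2009, 9, 14);  // said on the following Monday
        for (LocalDate reference : new LocalDate[] { friday, monday }) {
            System.out.println("said on " + format(reference)
                + ": nearest=" + format(nearestNext(reference, DayOfWeek.TUESDAY))
                + " followingWeek=" + format(inFollowingWeek(reference, DayOfWeek.TUESDAY)));
        }
    }
}
```

Said on Friday 11 September 2009, both readings agree on 15/09/2009; said on Monday 14 September, they differ by a week. The sketch only resolves an expression that has already been extracted; finding the expression in text is the separate extraction problem.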

There's been work on normalizing dates,
but I haven't followed it very closely.  I'd
look in things like the TERN proceedings
for ideas, if not software:

http://fofoca.mitre.org/tern.html

- Bob Carpenter
   Alias-i


#754 From: "reckb" <breck@...>
Date: Mon Sep 21, 2009 3:45 pm
Subject: Apologies for recent spam
 
Spammer has been deleted from membership.

thanks

Breck

#755 From: prasenjit mukherjee <prasen.bea@...>
Date: Tue Oct 13, 2009 9:39 am
Subject: Generate a random matrix
 
Is there a utility class in LingPipe to generate a random matrix
(given dimensions m, n)?

-Thanks,
Prasen

#756 From: Bob Carpenter <carp@...>
Date: Tue Oct 13, 2009 5:48 pm
Subject: Re: Generate a random matrix
 
prasenjit mukherjee wrote:
> Is there a utility class to generate a random matrix ( given
> dimensions m,n ) in lingpipe ?

No, but it's really easy.  If you want to populate
an M x N matrix with random doubles in [0, 1):

// needs java.util.Random, com.aliasi.matrix.Matrix
// and com.aliasi.matrix.DenseMatrix
int M = 5; // rows
int N = 7; // columns
Random random = new Random();
Matrix matrix = new DenseMatrix(M,N);
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        matrix.setValue(i,j,random.nextDouble());

You could generate different kinds of random values
by scaling the unit interval generated by nextDouble().
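
As a sketch of the scaling mentioned above, the following uses a plain double[][] rather than a LingPipe matrix so that it runs with the JDK alone. The class and method names are hypothetical; the values could be copied into any matrix class cell by cell.

```java
import java.util.Random;

final class RandomMatrixDemo {
    /** Fills a rows x cols array with uniform values in [lo, hi). */
    static double[][] uniform(int rows, int cols, double lo, double hi, Random random) {
        double[][] values = new double[rows][cols];
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                values[i][j] = lo + (hi - lo) * random.nextDouble(); // scale and shift [0, 1)
        return values;
    }

    /** Fills a rows x cols array with normal values of the given mean and deviation. */
    static double[][] gaussian(int rows, int cols, double mean, double stdDev, Random random) {
        double[][] values = new double[rows][cols];
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                values[i][j] = mean + stdDev * random.nextGaussian();
        return values;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // fixed seed so runs are repeatable
        double[][] m = uniform(5, 7, -1.0, 1.0, random);
        System.out.println(m.length + " x " + m[0].length + ", first cell " + m[0][0]);
    }
}
```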

- Bob Carpenter
    Alias-i

#757 From: prasenjit mukherjee <prasen.bea@...>
Date: Fri Oct 16, 2009 10:55 am
Subject: NaN values after SVD
 
I have a relatively sparse matrix with integer values (so they
are not normalized). I am getting NaN for almost everything
(singular values, left vectors, right vectors). Do the
row (or column) vectors of the input matrix need to be probability
distributions? In my case they are integer values. Is that the
reason for the NaNs in the SvdMatrix?

Any help is greatly appreciated.

-Thanks,
Prasen

#758 From: "Bob Carpenter" <carp@...>
Date: Fri Oct 16, 2009 9:28 pm
Subject: Re: NaN values after SVD
 
The SVD algorithm should work for any matrix.

There are two variants -- sparse and partial.
If your matrix has specified values, and the
others are zeroes, use the svd() method.  If
the matrix has some known values and the others
are unknown, use the partialSvd() method.

You might have a learning rate that's too high
for the size/density of the problem.  Have you tried
different (smaller) learning rates?  If the rate's too
large, you'll wind up with NaNs.  If it's too
small, you won't converge.
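
How an oversized learning rate ends in NaN can be seen in a toy problem. The sketch below is not LingPipe's SVD code; it is plain gradient descent on a one-parameter least-squares fit, with hypothetical names, chosen because its behavior is easy to check by hand: each step multiplies the error by (1 - learningRate * scale^2), so the error shrinks when that factor has magnitude below 1 and grows without bound otherwise.

```java
final class LearningRateDemo {
    /** Minimizes 0.5 * (scale * x - target)^2 by gradient descent, starting at x = 0.01. */
    static double fit(double scale, double target, double learningRate, int epochs) {
        double x = 0.01;
        for (int epoch = 0; epoch < epochs; ++epoch) {
            double error = scale * x - target;   // residual of the current fit
            x -= learningRate * scale * error;   // gradient step
        }
        return x;
    }

    public static void main(String[] args) {
        // factor = 1 - 0.001 * 100 = 0.9: error shrinks, x approaches target / scale
        System.out.println(fit(10.0, 20.0, 0.001, 2000));
        // factor = 1 - 0.1 * 100 = -9: error grows, overflows to infinity, then NaN
        System.out.println(fit(10.0, 20.0, 0.1, 2000));
    }
}
```

With larger input values (a larger scale here), the same learning rate crosses the stability threshold sooner, which is consistent with the advice to try smaller rates on unnormalized count data.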

The choice of regularization shouldn't be an
issue.

If you can't get it to work with any learning
rate, I can take a look at what's going on if
you send me your matrix and the parameters you
call it with.

- Bob Carpenter
   Alias-i




#759 From: BA YORO <yo_ba@...>
Date: Fri Oct 16, 2009 9:40 pm
Subject: Re: NaN values after SVD
 
Hi all,
I am working on WSD and I want to use Wikipedia as the dictionary. Is it possible to
replace the dictionary in Senseval? How can I transform the wikitext to the
Senseval format?
Thank you
"Without the freedom to criticize, there is no flattering praise"


#760 From: "Bob Carpenter" <carp@...>
Date: Fri Oct 16, 2009 10:44 pm
Subject: Re: word sense over Wikipedia
 
On October 16, 2009, BA YORO <yo_ba@...> wrote:
> i am working on WSD and I want to use
> Wikipedia as dictionary. It is
> possible to replace the dictionary
> on senseval? How can I transforme
> the wikitext to the senseval format?

I'm not sure what exactly you're proposing.  As
is, the word sense disambiguation (WSD) demo:

http://alias-i.com/lingpipe/demos/tutorial/wordSense/read-me.html

uses training data consisting of words plus sets
of contexts (like snippets) in which they're used.

One thing you could do is take the Wikipedia disambiguation
pages, such as this one:

http://en.wikipedia.org/wiki/NLP

and convert it to word-sense disambiguation training
data by

1.  Senses = target pages.  That is, for the word
"NLP", the senses are the target pages for
disambiguation (e.g. "natural language processing",
"nonlinear programming", "National Library of
the Philippines", ...)

2.  Take each target page and scrape the text out.
That'll be a different process for HTML and for
the Wikitext downloaded from Wikipedia.  We don't
supply Wikipedia parsers, so you're on your own,
there.  There are lots of HTML parsers.

3.  Use the text as the training data for the sense.

4.  Compile the classifier (only really needed
for speed/persistence).

5.  Disambiguate senses of words in contexts by
just classifying the text.

You don't actually need to convert to Senseval
format for step 3 -- LingPipe lets you train the
dynamic classifiers like LMs by supplying
category/text samples.

That's it.  Then test it by giving snippets of
text containing the target word and the classifier
returns a distribution over senses.
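
The steps above can be sketched end to end with a toy classifier. This is a stand-in only: ToySenseClassifier is a hypothetical add-one-smoothed unigram model written for this illustration, not one of LingPipe's language-model classifiers, and the two training strings are made-up placeholders for text scraped from the target pages.

```java
import java.util.*;

final class ToySenseClassifier {
    private final Map<String, Map<String, Integer>> countsBySense = new TreeMap<>();
    private final Map<String, Integer> totalBySense = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Step 3: use the page text as training data for the sense. */
    void train(String sense, String text) {
        Map<String, Integer> counts =
            countsBySense.computeIfAbsent(sense, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalBySense.merge(sense, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Step 5: classify a context; returns a probability for each sense. */
    Map<String, Double> classify(String context) {
        List<String> tokens = tokenize(context);
        Map<String, Double> logScores = new TreeMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : countsBySense.entrySet()) {
            String sense = entry.getKey();
            // add-one smoothing; the extra +1 reserves mass for unseen tokens
            double denominator = totalBySense.getOrDefault(sense, 0) + vocabulary.size() + 1;
            double logScore = 0.0; // uniform prior over senses
            for (String token : tokens)
                logScore += Math.log((entry.getValue().getOrDefault(token, 0) + 1.0) / denominator);
            logScores.put(sense, logScore);
            max = Math.max(max, logScore);
        }
        double sum = 0.0;
        for (double logScore : logScores.values())
            sum += Math.exp(logScore - max);
        Map<String, Double> distribution = new TreeMap<>();
        for (Map.Entry<String, Double> entry : logScores.entrySet())
            distribution.put(entry.getKey(), Math.exp(entry.getValue() - max) / sum);
        return distribution;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+"))
            if (!token.isEmpty()) tokens.add(token);
        return tokens;
    }

    public static void main(String[] args) {
        ToySenseClassifier classifier = new ToySenseClassifier();
        classifier.train("natural language processing",
            "computers analyze text speech and language with parsers and classifiers");
        classifier.train("nonlinear programming",
            "optimization of an objective function subject to nonlinear constraints");
        System.out.println(classifier.classify("a parser for text and speech"));
    }
}
```

The category/text training call is the only interface the pipeline needs, which is why no conversion to Senseval format is required.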

Logistic regression will probably work a bit
better than the LMs I used for the senseval demo,
but they're much more complex to train.  See that
tutorial for more info.  The use, after training,
would be the same.

- Bob Carpenter
   Alias-i

#761 From: Chou Enlai <adougher9@...>
Date: Sat Oct 17, 2009 12:07 am
Subject: Re: word sense over Wikipedia
 
BA YORO, I can probably help with the Wikipedia processing, as I have custom and
existing Perl modules for that.  I have converted Wikipedia to dictd format
before.  If you have server space I can set it up on that, as my own computer
systems are overstretched.  I'm interested in using Wikipedia (and more) to
build a very large word/phrase/acronym definition system, using word sense
induction among other techniques.  The only problem is that I am struggling to make a
living, and even if I weren't, I have many projects, so I can't guarantee a
real-time response.  The other thing is that I would want to know what you are using
it for, as I don't contribute (directly) to colonialist oppression.  For more
info, see http://frdcsa.org.





#762 From: BA YORO <yo_ba@...>
Date: Sat Oct 17, 2009 11:15 am
Subject: Re: word sense over Wikipedia
 
Thank you Bob,

http://en.wikipedia.org/wiki/NLP

I know I need the disambiguation page of the words I want to disambiguate.
My problem is how to get this format (sense id, synset and gloss), like this
lexelt format,
from the Wikipedia resources I have.

<lexelt item="activate.v">
<sense id="38201" source="ws"
        synset="activate actuate energize start stimulate"
        gloss="to initiate action in; make active."/>
.....
...
</lexelt>

Is the LM classification enough? Thank you.







#763 From: prasenjit mukherjee <prasen.bea@...>
Date: Sun Oct 18, 2009 3:27 am
Subject: Re: NaN values after SVD
 
I have sent the data (it's quite large) and the parameters to
carp@.... I would really appreciate it if you could take a look.
-Thanks,
Prasen


#764 From: prasenjit mukherjee <prasen.bea@...>
Date: Sun Oct 18, 2009 7:09 am
Subject: problem Interpreting SVD values
 
I am trying to evaluate  partialSvd() on a smaller matrix and this is
what my findings are. Below is my input matrix, assuming 4 terms and 3
docs.

doc0 => (2,t0) (2,t1)
doc1 => (2,t0) (2,t1)
doc2 => (2,t2) (2,t3)

As one can see, docs d0 and d1 are identical: each contains 4 term occurrences,
2 each of t0 and t1. The third doc is different: it contains 4 term occurrences,
2 each of t2 and t3. Below is the matrix representation as doc,term,count triples:

0,0,2
0,1,2
1,0,2
1,1,2
2,2,2
2,3,2
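
For readers who want to see the triples as an ordinary matrix, here is a minimal sketch that expands them into a dense doc x term array. The class and method names are hypothetical, and reading each line as row (doc), column (term), value (count) is an assumption based on the description of the three documents above.

```java
import java.util.Arrays;

final class TriplesToMatrix {
    /** Expands {row, column, value} triples into a dense array; missing cells stay 0. */
    static double[][] toDense(int[][] triples, int numRows, int numCols) {
        double[][] dense = new double[numRows][numCols];
        for (int[] triple : triples)
            dense[triple[0]][triple[1]] = triple[2];
        return dense;
    }

    public static void main(String[] args) {
        int[][] triples = {
            {0, 0, 2}, {0, 1, 2},   // doc0: t0, t1
            {1, 0, 2}, {1, 1, 2},   // doc1: t0, t1
            {2, 2, 2}, {2, 3, 2}    // doc2: t2, t3
        };
        for (double[] row : toDense(triples, 3, 4))
            System.out.println(Arrays.toString(row));
    }
}
```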

I ran with maxOrder =2 and following input  params :
         double featureInit = 0.01;
         double initialLearningRate = 0.005;
         int annealingRate = 1000;
         double regularization = 0.00;
         double minImprovement = 0.0001;
         int minEpochs = 2;
         int maxEpochs = 100;//50000;
and was expecting to get d0,d1 in 1 cluster and d2 in another.
Contrary to my expectation I am getting the following output ( See U,V
values) :

      [java]       :00 Start
      [java]       :00   Factor=0
      [java]       :00     epoch=0 rmse=1.9999848100360043
      [java]       :00     epoch=1 rmse=1.9999835637692873
      [java]       :00     epoch=2 rmse=1.999982296871324
      [java]       :00 Converged in epoch=2 rmse=1.999982296871324 relDiff=3.167271940722782E-7
      [java]       :00 Order=0 RMSE=1.9999835637692873
      [java]       :00   Factor=1
      [java]       :00     epoch=0 rmse=1.9999522133829444
      [java]       :00     epoch=1 rmse=1.9999506819096369
      [java]       :00     epoch=2 rmse=1.99994912138043
      [java]       :00 Converged in epoch=2 rmse=1.99994912138043 relDiff=3.901420744799641E-7
      [java]       :00 Order=1 RMSE=1.9999506819096369
      [java] SVD Computation Done. Singular Values:
      [java]  2.796903874825226E-4  2.536844759290206E-4
      [java] Output U_Matrix: ./rundir/U_out.matrix
      [java] Output V_Matrix: ./rundir/V_out.matrix


And my U,V matrices are :
U:
0,0,-0.690807182791581
0,1,0.6535363126818338
1,0,0.053924014251416
1,1,-0.2055548955329534
2,0,-0.7210254065499858
2,1,0.7284486755624372

Shouldn't the coefficients for rows 0 and 1 be the same in U, since they
refer to d0 and d1?

V:
0,0,-0.7473523845369358
0,1,-0.14168050325102471
1,0,0.35114591804331297
1,1,0.6137947267695599
2,0,-0.4945242525093567
2,1,0.776371576839163
3,0,0.27130558646761577
3,1,0.02073265696164584
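
One way to sanity-check the expectation that d0 and d1 should get equal coefficients is to compute the leading singular triple of the 3 x 4 doc x term matrix directly. The sketch below uses plain power iteration on the dense matrix; it is not LingPipe's algorithm, the names are hypothetical, and starting from an all-ones vector is an assumption that happens to work for this matrix.

```java
import java.util.Arrays;

final class PowerIterationDemo {
    /** Returns A v. */
    static double[] multiply(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; ++i)
            for (int j = 0; j < v.length; ++j)
                result[i] += a[i][j] * v[j];
        return result;
    }

    /** Returns A^T u. */
    static double[] multiplyTransposed(double[][] a, double[] u) {
        double[] result = new double[a[0].length];
        for (int i = 0; i < a.length; ++i)
            for (int j = 0; j < result.length; ++j)
                result[j] += a[i][j] * u[i];
        return result;
    }

    static double norm(double[] x) {
        double sumOfSquares = 0.0;
        for (double value : x) sumOfSquares += value * value;
        return Math.sqrt(sumOfSquares);
    }

    /** Estimates the top right singular vector by iterating v <- A^T A v / |A^T A v|. */
    static double[] topRightVector(double[][] a, int iterations) {
        double[] v = new double[a[0].length];
        Arrays.fill(v, 1.0);
        for (int iteration = 0; iteration < iterations; ++iteration) {
            double[] w = multiplyTransposed(a, multiply(a, v));
            double length = norm(w);
            for (int j = 0; j < v.length; ++j) v[j] = w[j] / length;
        }
        return v;
    }

    public static void main(String[] args) {
        double[][] docsByTerms = { {2, 2, 0, 0}, {2, 2, 0, 0}, {0, 0, 2, 2} };
        double[] v = topRightVector(docsByTerms, 100);
        double[] u = multiply(docsByTerms, v);   // equals sigma * u before scaling
        double sigma = norm(u);
        for (int i = 0; i < u.length; ++i) u[i] /= sigma;
        System.out.println("sigma=" + sigma + " u=" + Arrays.toString(u)
            + " v=" + Arrays.toString(v));
    }
}
```

For this matrix the leading singular value is 4 and the rows for d0 and d1 receive the same coefficient, so unequal values in U point to the fit not having converged rather than to a property of the data.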

#766 From: prasenjit mukherjee <prasen.bea@...>
Date: Mon Oct 19, 2009 3:19 am
Subject: Re: problem Interpreting SVD values
 
Thanks for the quick response.  Your results make much more sense. Maybe
I am doing something wrong with the LingPipe package, and that's why
I found it difficult to interpret the LingPipe output:

And my U,V matrices are:

U:
        0        1
0   -0.69     0.65
1    0.0539  -0.205
2   -0.721    0.728


V:
        0        1
0   -0.74    -0.141
1    0.35     0.613
2   -0.494    0.776
3    0.271    0.0207

-Prasen

On Sun, Oct 18, 2009 at 10:58 PM, Ted Dunning <ted.dunning@...> wrote:
> I have not worked with lingpipe, but ...
>
> When I follow the steps you are taking using R, I get this:
>
> > docs=data.frame(d0=c(2,2,0,0), d1=c(2,2,0,0), d2=c(0,0,2,2),
>     row.names=c("t0","t1","t2","t3"))
> > docs
>   d0 d1 d2
> t0  2  2  0
> t1  2  2  0
> t2  0  0  2
> t3  0  0  2
> > svd(docs)
> $d
> [1] 4.000000 2.828427 0.000000
>
> $u
>           [,1]       [,2]       [,3]
> [1,] -0.7071068  0.0000000 -0.7071068
> [2,] -0.7071068  0.0000000  0.7071068
> [3,]  0.0000000 -0.7071068  0.0000000
> [4,]  0.0000000 -0.7071068  0.0000000
>
> $v
>           [,1] [,2]       [,3]
> [1,] -0.7071068    0 -0.7071068
> [2,] -0.7071068    0  0.7071068
> [3,]  0.0000000   -1  0.0000000
>
>
> Note how my document matrix differs substantially from yours, but that is
> simply because we are using different representations.  You have lines that
> have triples containing document number, term number and count, I have the
> resulting matrix.
>
> As far as my results are concerned, the diagonal component of the svd
> (labeled $d above) clearly shows that there are only 2 nonzero singular values.
> This means that the first two columns of u and v are the only ones necessary
> for reconstructing my docs matrix.  The third vector in each represents the
> null space of the document matrix.
>
> Moreover, if you look at the first two columns of my u matrix, you see a
> representation that shows that documents tend to contain t0 and t1 in equal
> number or they contain t2 and t3 in equal number, but they don't tend to
> contain any other pattern.  Singular vectors are not normally so easy to
> interpret.
>
> For reference, I normally prefer document x term matrices.  Here is that
> form of the computation:
>
> > docs=data.frame(t0=c(2,2,0), t1=c(2,2,0), t2=c(0,0,2), t3=c(0,0,2),
>     row.names=c("d0","d1","d2"))
> > docs
>   t0 t1 t2 t3
> d0  2  2  0  0
> d1  2  2  0  0
> d2  0  0  2  2
> > svd(docs)
> $d
> [1] 4.000000 2.828427 0.000000
>
> $u
>          [,1] [,2]       [,3]
> [1,] 0.7071068    0 -0.7071068
> [2,] 0.7071068    0  0.7071068
> [3,] 0.0000000    1  0.0000000
>
> $v
>          [,1]      [,2]       [,3]
> [1,] 0.7071068 0.0000000 -0.7071068
> [2,] 0.7071068 0.0000000  0.7071068
> [3,] 0.0000000 0.7071068  0.0000000
> [4,] 0.0000000 0.7071068  0.0000000
>
> The results are the same, of course with some names changed.
>
> On Sun, Oct 18, 2009 at 12:46 AM, prasenjit mukherjee
> <prasen.bea@...> wrote:
>
>> Apologies, as I know the question is actually for lingpipe, but was
>> hoping if I could get some response from mahout users as well ( who
>> has probably worked with  lingpipe )
>>
>>
>> ---------- Forwarded message ----------
>> From: prasenjit mukherjee <prasen.bea@...>
>> Date: Sun, Oct 18, 2009 at 12:39 PM
>> Subject: problem Interpreting SVD values
>> To: lingpipe <lingpipe@yahoogroups.com>
>>
>>
>> I am trying to evaluate  partialSvd() on a smaller matrix and this is
>> what my findings are. Below is my input matrix, assuming 4 terms and 3
>> docs.
>>
>> doc0 => (2,t0) (2,t1)
>> doc1 => (2,t0) (2,t1)
>> doc2 => (2,t2) (2,t3)
>>
>> As one can see docs d0,d1 are exactly same containing 4 terms  with 2
>> from t0,t1 each.  3rd doc is different containing 4 terms with 2 from
>> t2,t3 each. Below is their matrix representation  ( in TXD form ) :
>>
>> 0,0,2
>> 0,1,2
>> 1,0,2
>> 1,1,2
>> 2,2,2
>> 2,3,2
>>
>> I ran with maxOrder =2 and following input  params :
>>        double featureInit = 0.01;
>>        double initialLearningRate = 0.005;
>>        int annealingRate = 1000;
>>        double regularization = 0.00;
>>        double minImprovement = 0.0001;
>>        int minEpochs = 2;
>>        int maxEpochs = 100;//50000;
>> and was expecting to get d0,d1 in 1 cluster and d2 in another.
>> Contrary to my expectation I am getting the following output ( See U,V
>> values) :
>>
>>     [java]       :00 Start
>>     [java]       :00   Factor=0
>>     [java]       :00     epoch=0 rmse=1.9999848100360043
>>     [java]       :00     epoch=1 rmse=1.9999835637692873
>>     [java]       :00     epoch=2 rmse=1.999982296871324
>>     [java]       :00 Converged in epoch=2 rmse=1.999982296871324
>> relDiff=3.167271940722782E-7
>>     [java]       :00 Order=0 RMSE=1.9999835637692873
>>     [java]       :00   Factor=1
>>     [java]       :00     epoch=0 rmse=1.9999522133829444
>>     [java]       :00     epoch=1 rmse=1.9999506819096369
>>     [java]       :00     epoch=2 rmse=1.99994912138043
>>     [java]       :00 Converged in epoch=2 rmse=1.99994912138043
>> relDiff=3.901420744799641E-7
>>     [java]       :00 Order=1 RMSE=1.9999506819096369
>>     [java] SVD Computation Done. Singular Values:
>>     [java]     2.796903874825226E-4  2.536844759290206E-4
>>     [java] Output U_Matrix: ./rundir/U_out.matrix
>>     [java] Output V_Matrix: ./rundir/V_out.matrix
>>
>>
>> And my U,V matrices are :
>> U:
>> 0,0,-0.690807182791581
>> 0,1,0.6535363126818338
>> 1,0,0.053924014251416
>> 1,1,-0.2055548955329534
>> 2,0,-0.7210254065499858
>> 2,1,0.7284486755624372
>>
>> Shouldn't the coeffs of 0 and 1s be the same in U, because they refer
>> to d0 and d1  ?
>>
>> V:
>> 0,0,-0.7473523845369358
>> 0,1,-0.14168050325102471
>> 1,0,0.35114591804331297
>> 1,1,0.6137947267695599
>> 2,0,-0.4945242525093567
>> 2,1,0.776371576839163
>> 3,0,0.27130558646761577
>> 3,1,0.02073265696164584
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

#768 From: Bob Carpenter <carp@...>
Date: Mon Oct 19, 2009 5:11 pm
Subject: Re: Re: problem Interpreting SVD values
colloquialdo...
 
The main problem you have is the example, which
is numerically unstable for the approximation method
we're using for many learning rates.  That's because
there's an exact solution at rank 1, so the rank 2
solutions don't really matter.

Here's the example matrix:

M = { { 2, 2, ? }, { 2, 2, ? }, { ?, 2, 2 } }

The first two rows are the same, { 2, 2, ? }, and
the last is different.  The ?s represent UNKNOWN
values (use sparse SVD if you want to treat these
as zero values).

I put a demo program to run this example up at:

http://alias-i.com/misc/SvdDemo.java

I'm not promising it'll be there long term.  I'll
repeat the important parts below.

The right way to code M for LingPipe's SVD is:

int[][] columnIds
    = new int[][]
    { { 0, 1 }, { 0, 1 }, { 1, 2 } };

double[][] values
    = new double[][]
    { { 2.0, 2.0 }, { 2.0, 2.0 }, { 2.0, 2.0 } };
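If the data starts out as (row, column, value) triples, as in the original question, getting it into this row-wise sparse form might look roughly like the sketch below. The class and method names (`SparseRows`, `fromTriples`) are made up for illustration and are not part of LingPipe; only the two output arrays mirror the layout shown above.

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseRows {
    public final int[][] columnIds;
    public final double[][] values;

    private SparseRows(int[][] columnIds, double[][] values) {
        this.columnIds = columnIds;
        this.values = values;
    }

    // Each triple is { row, column, value }. Rows without entries come out empty.
    // (Hypothetical helper; not a LingPipe method.)
    public static SparseRows fromTriples(double[][] triples, int numRows) {
        TreeMap<Integer, TreeMap<Integer, Double>> byRow = new TreeMap<>();
        for (double[] t : triples) {
            byRow.computeIfAbsent((int) t[0], r -> new TreeMap<Integer, Double>())
                 .put((int) t[1], t[2]);
        }
        int[][] ids = new int[numRows][];
        double[][] vals = new double[numRows][];
        for (int row = 0; row < numRows; ++row) {
            TreeMap<Integer, Double> cols
                = byRow.getOrDefault(row, new TreeMap<Integer, Double>());
            ids[row] = new int[cols.size()];
            vals[row] = new double[cols.size()];
            int k = 0;
            // TreeMap iteration gives column ids in increasing order.
            for (Map.Entry<Integer, Double> e : cols.entrySet()) {
                ids[row][k] = e.getKey();
                vals[row][k] = e.getValue();
                ++k;
            }
        }
        return new SparseRows(ids, vals);
    }
}
```

The `columnIds` and `values` fields of the result can then be passed wherever the arrays above are used.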

I then took these parameters:

          int maxOrder = 2;
          double featureInit = 0.5;
          double initialLearningRate = 0.1;
          double annealingRate = 500;
          double regularization = 0.0;
          double minImprovement = 0;
          int minEpochs = 100;
          int maxEpochs = 5000;
          Reporter reporter = Reporters.stdOut().setLevel(LogLevel.DEBUG);

and called:

          SvdMatrix matrix
              = SvdMatrix
              .partialSvd(columnIds,
                          values,
                          maxOrder,
                          featureInit,
                          initialLearningRate,
                          annealingRate,
                          regularization,
                          reporter,
                          minImprovement,
                          minEpochs,
                          maxEpochs);


Printing out the results gives:

Reconstructed Matrix
2.0, 1.9999999999999987, 1.9999999999999976
2.0000000000000004, 1.9999999999999984, 1.9999999999999971
2.0000000000000027, 2.000000000000001, 2.0

Singular Values
sigma[0] 5.999999999999998
sigma[1] 3.1928790430530444E-15

Left Singular Vectors
row[0]=
-0.5773502691896256, -0.5339821085725156
row[1]=
-0.5773502691896255, -0.5336534604380071
row[2]=
-0.5773502691896263, -0.6558026318085269

Right Singular Vectors
row[0]=
-0.5773502691896265, 0.6513335382455789
row[1]=
-0.5773502691896258, 0.16025484158672923
row[2]=
-0.577350269189625, -0.7416758103811311

----------------------------

The first thing to note is that the reconstructed matrix
has values that match the inputs very well and simply
fills in 2 for all the unknown values.

The second thing to note is that indeed, the singular
vectors are as expected.  The first two rows get the
same representation as left singular vectors.

But this won't work with all learning rates.
With many choices of learning rates, you'll
get a low value for singular value 1 and varying
values for the second dimension of the left singular
vectors.

The important thing to note is that the first singular value is 6
and the second is 3E-15!  That's as close to zero as you're
going to get with floating point approximations.  So it's
effectively a rank 1 solution.  That is, only the first
column of the singular vectors and first singular value are
required to reconstruct the input matrix exactly.
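To check the rank 1 claim by hand, here is a small sketch in plain Java (no LingPipe calls) that multiplies out sigma[0] * u[i] * v[j] using the numbers printed above. The class name `RankOneCheck` is just for illustration.

```java
public class RankOneCheck {
    // Values copied from the output above: sigma[0] and the first
    // dimension of the left and right singular vectors.
    static final double SIGMA0 = 5.999999999999998;
    static final double[] LEFT
        = { -0.5773502691896256, -0.5773502691896255, -0.5773502691896263 };
    static final double[] RIGHT
        = { -0.5773502691896265, -0.5773502691896258, -0.577350269189625 };

    // Entry (i, j) of the rank 1 reconstruction: sigma * u_i * v_j.
    public static double reconstruct(int i, int j) {
        return SIGMA0 * LEFT[i] * RIGHT[j];
    }

    public static void main(String[] args) {
        for (int i = 0; i < LEFT.length; ++i) {
            StringBuilder row = new StringBuilder();
            for (int j = 0; j < RIGHT.length; ++j) {
                if (j > 0) row.append(", ");
                row.append(reconstruct(i, j));
            }
            System.out.println(row);
        }
    }
}
```

Every entry comes out within rounding error of 2, without touching the second singular value or the second dimension of the vectors.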

In the situation where you try to fit singular values and
vectors beyond the rank of the input matrix, gradient descent
algorithms run into problems because of the relative scales of
the output.  Here's the end of the log for fitting the first
singular value and the first dimension of the singular vectors:

        :02     epoch=4994 rmse=1.2979008917057966E-15
        :02     epoch=4995 rmse=1.2979008917057966E-15
        :02     epoch=4996 rmse=1.2979008917057966E-15
        :02     epoch=4997 rmse=1.2979008917057966E-15
        :02     epoch=4998 rmse=1.2979008917057966E-15
        :02     epoch=4999 rmse=1.2979008917057966E-15
        :02 Order=0 RMSE=1.2979008917057966E-15

That's a signal to stop.  We're fit as perfectly as
floating point allows after the first (indexed 0)
dimension.

Going on to the second dimension means fitting
roughly at noise levels.  It'll start with more error because
the feature inits are bad, and then work down to about
the same overall level of error.  That requires incrementing
the singular value down to zero, which it effectively does.
The problem is that as the singular value goes to zero, the
effect of the singular vectors also goes to zero, so the
overall algorithm's a bit unstable.
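For intuition about what fitting one factor by gradient descent involves, here is a deliberately stripped-down sketch. It is not LingPipe's partialSvd(): it fits a single factor only, uses a constant learning rate, and has no annealing or regularization. The class name and the fixed 0.5 initialization are assumptions made for the example.

```java
public class TinySgdFactor {
    // Observed cells of M as { row, column }; every observed value is 2.
    static final int[][] CELLS = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {2, 1}, {2, 2} };
    static final double VALUE = 2.0;

    // Unnormalized row and column factors, all initialized to 0.5.
    public final double[] u = { 0.5, 0.5, 0.5 };
    public final double[] v = { 0.5, 0.5, 0.5 };

    // One pass over the observed cells, nudging u[i] and v[j] toward
    // reducing the squared error of the prediction u[i] * v[j].
    public void epoch(double learningRate) {
        for (int[] cell : CELLS) {
            int i = cell[0];
            int j = cell[1];
            double err = VALUE - u[i] * v[j];
            double oldU = u[i];
            u[i] += learningRate * err * v[j];
            v[j] += learningRate * err * oldU;
        }
    }

    // Root mean square error over the observed cells only.
    public double rmse() {
        double sumSq = 0.0;
        for (int[] cell : CELLS) {
            double err = VALUE - u[cell[0]] * v[cell[1]];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / CELLS.length);
    }

    // Singular value implied by the unnormalized factors: |u| * |v|.
    public double singularValue() {
        return norm(u) * norm(v);
    }

    private static double norm(double[] x) {
        double sumSq = 0.0;
        for (double xi : x) sumSq += xi * xi;
        return Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        TinySgdFactor f = new TinySgdFactor();
        for (int e = 0; e < 2000; ++e) f.epoch(0.05);
        System.out.println("rmse=" + f.rmse() + " sigma=" + f.singularValue());
    }
}
```

Once this single factor fits the observed cells, the residual left over for a second factor is essentially rounding noise, which is the situation described above.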

- Bob Carpenter
    Alias-i

#769 From: Fred Parnon <fparnon@...>
Date: Mon Oct 19, 2009 4:51 am
Subject: Re: Apologies for the spam was Re: Cheapest Dating Software on the internet
fparnon
 
Hi Breck:

I was wondering how that one got through!

Best Regards,
Fred Parnon

breck wrote:
> websolsoftwares wrote:
>
>
> Sorry for the spam folks. Spammer deleted and moderation in place--somehow
> moderation for memberships was turned off. I'll moderate for a while to
> keep things clean.
>
> Breck
>
>
>> Cheapest Dating Software on the internet
>> [repeat of spam removed]

#770 From: BA YORO <yo_ba@...>
Date: Tue Oct 20, 2009 3:41 pm
Subject: Re: word sense over Wikipedia
yo_ba
 
"Without the freedom to criticize, there is no flattering praise"

--- On Fri, 16.10.09, Bob Carpenter <carp@...> wrote:

From: Bob Carpenter <carp@...>
Subject: Re: [LingPipe] word sense over Wikipedia
To: LingPipe@yahoogroups.com
Date: Friday, October 16, 2009, 11:44 pm

> 3.  Use the text as the training data for the sense.

Hello Bob,
If I understand correctly, I have to use the text from Wikipedia as the
dictionary and not dictionary.mapping.xml. Is this right, or should I do
the mapping manually first? Thank you for your help.

[Non-text portions of this message have been removed]

#771 From: Bob Carpenter <carp@...>
Date: Tue Oct 20, 2009 5:30 pm
Subject: Re: word sense over Wikipedia
colloquialdo...
 
I'm afraid LingPipe just isn't set up to be
a standalone command-line tool.  It's a Java
API that requires Java programs to access it.

The tutorials are meant to be examples of how
to write such programs.  They're not meant to
be flexible command-line based programs for
doing computational linguistics experiments.

What you can do without getting into Java
programming per se is munge any data you have
into the same format expected by our tutorials.
Then you can just run the tutorial code by
changing input locations.

The munging is up to you, though, and itself
often requires a lot of programming.

If you are comfortable with Java,
I'd  suggest that you read about
our classifiers and just use them directly.
That means implementing a parser for the
data you have in Java and plugging it into
a classifier using a second piece of Java code.

The simplest example of this process is
in the first classification tutorial:

http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.htm

- Bob Carpenter
    Alias-i

BA YORO wrote:

> 3. Use the text as the training data for the sense.
>
> Hello Bob,
> If I understand correctly, I have to use the text from Wikipedia as the
> dictionary and not dictionary.mapping.xml. Is this right, or should I do
> the mapping manually first? Thank you for your help.

#772 From: "fayas" <fayas_h@...>
Date: Tue Oct 27, 2009 4:11 pm
Subject: Accessing LingPipe from Servlet ?
fayas_h
 
Hi all,

I have two requirements.

Requirement No 1:
-----------------

I have installed (unzipped) LingPipe on my Windows XP machine, and I can run
the demo files (.bat files) without any problem. But I don't know how to run
LingPipe itself, or where to pass the inputs and get a response.

Requirement No 2:
-----------------

I am developing a web application and I want to do analytics on my site content.
How can I access LingPipe from a servlet to send a request and retrieve a response?

If there is a code example, it would be very helpful.

Thanks & Regards,
Fayas

#773 From: prasenjit mukherjee <prasen.bea@...>
Date: Tue Oct 20, 2009 7:10 am
Subject: Re: Re: problem Interpreting SVD values
prasen_bea
 
Thanks for your reply.

Seems that the solution (SvdDemo) slightly varies with each run. One of the
runs gave me the following result :

Singular Values
      [java] sigma[0]72.86991439169711
      [java] sigma[1]2.2882752129455235
      [java]
      [java] Left Singular Vectors
      [java] row[0]=
      [java] -0.7081428760032364, -0.016589125848884684
      [java] row[1]=
      [java] -0.7055497573194808, -0.02076764463081699
      [java] row[2]=
      [java] 0.027077797404641172, -0.9996466905062298
      [java]
      [java] Right Singular Vectors
      [java] row[0]=
      [java] -0.038845155669432425, -0.42071833667019104
      [java] row[1]=
      [java] -0.03791002129449349, -0.9070980861851431
      [java] row[2]=
      [java] 0.9985258555322785, -0.013005507630171286

-Prasen

On Mon, Oct 19, 2009 at 10:41 PM, Bob Carpenter <carp@...> wrote:


[Non-text portions of this message have been removed]

#774 From: "Bob Carpenter" <carp@...>
Date: Tue Oct 27, 2009 4:58 pm
Subject: Re: Accessing LingPipe from Servlet ?
colloquialdo...
 
I'm afraid to say there is no "running LingPipe".
LingPipe is just a set of Java APIs.  That means
you need to write Java code to access LingPipe's
functionality.  (In this way, it's just like Lucene.)

The commands (.bat and .sh files) are just there
for demo purposes.  You can look at the code for
them in $LINGPIPE/demos/generic and design your
own commands or embed the same processing in servlets.

The other places to find example code for LingPipe
are the tutorials and the sandbox repositories.
The tutorials are much better documented and kept
up to date with releases of LingPipe and other
dependent libraries.

There's also an example of embedding the same
commands in servlets in the same place.  It's
a bit complex, though, because I wrote adapters
to use the same basic processor framework
for all the apps, so that it'd be runnable
from the command line, from a GUI, or from
a servlet.

In practice, you'll want to just write your own
servlet to fit into whatever web development
framework/environment you're using.

Calling LingPipe from a servlet is no different
than calling any other code from a servlet.

You'll have to make sure LingPipe's jar's on
the classpath, as well as any dependent
jars you need for the application (e.g. for
HTML parsing, for file upload, etc.).  You'll
also want the compiled form of the models
you need on the classpath so you can access
them as resources from within an application.

All of LingPipe's services are thread safe
for analysis (not necessarily for training),
so you can just have a single instance and
let the servlet run multi-threaded.

You'll probably need to read models in, which
can be done through resources.  Just open the resource
as a stream and wrap it in an object input stream to
deserialize the model.
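A minimal sketch of that last step, using only the JDK. The class name `ModelLoader`, its method names, and the resource path in the note below are placeholders, not LingPipe API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;

public class ModelLoader {
    // Reads one serialized object from the stream, closing it afterward.
    public static Object readModel(InputStream in)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream objIn = new ObjectInputStream(in)) {
            return objIn.readObject();
        }
    }

    // Looks the serialized model up on the classpath (for a web app,
    // that includes WEB-INF/classes and the jars in WEB-INF/lib).
    public static Object readModelResource(String resourceName)
            throws IOException, ClassNotFoundException {
        InputStream in = ModelLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Resource not found: " + resourceName);
        }
        return readModel(in);
    }
}
```

In a servlet you would typically call something like readModelResource("/models/my-model") once in init(), cast the result to the interface you need, and keep that single instance in a field for all request threads to share.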

- Bob Carpenter
   Alias-i

On October 27, 2009, fayas <fayas_h@...> wrote:

> Hi all,
>
> I have two requirements.
>
> Requirement No 1:
> -----------------
>
> I have installed (unzipped) LingPipe on my Windows XP machine, and I
> can run the demo files (.bat files) without any problem. But I don't
> know how to run LingPipe itself, or where to pass the inputs and get
> a response.
>
> Requirement No 2:
> -----------------
>
> I am developing a web application and I want to do analytics on my site
> content. How can I access LingPipe from a servlet to send a request and
> retrieve a response?
>
> If there is a code example, it would be very helpful.
>
> Thanks & Regards,
> Fayas

#775 From: "Bob Carpenter" <carp@...>
Date: Tue Oct 27, 2009 5:01 pm
Subject: Re: Re: problem Interpreting SVD values
colloquialdo...
 
If it converges, you should get the same
solution up to sign variation.  To monitor
convergence, look at the error term and
try different learning rates and annealing
rates.

Because you have products of two variables, the
sign isn't identified.  It's like
sqrt(4), which can be -2 or 2.
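As a small illustration in plain Java (the class name `SignFlip` is invented for the example), negating both the left and the right vector of a factor leaves the reconstruction unchanged:

```java
public class SignFlip {
    // Rank 1 reconstruction: entry (i, j) is sigma * u[i] * v[j].
    public static double[][] reconstruct(double sigma, double[] u, double[] v) {
        double[][] m = new double[u.length][v.length];
        for (int i = 0; i < u.length; ++i) {
            for (int j = 0; j < v.length; ++j) {
                m[i][j] = sigma * u[i] * v[j];
            }
        }
        return m;
    }

    // Returns a copy of x with every sign flipped.
    public static double[] negate(double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; ++i) {
            y[i] = -x[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[] u = { 0.6, 0.8 };
        double[] v = { 0.8, -0.6 };
        double[][] a = reconstruct(3.0, u, v);
        double[][] b = reconstruct(3.0, negate(u), negate(v));
        System.out.println(java.util.Arrays.deepEquals(a, b));
    }
}
```

So two runs that differ only by a matched sign flip in a column of both sets of vectors are the same solution.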

As I said in the last message,
using the stochastic gradient form of this
for small matrices is not the best solution.
Those can be solved with exact algorithms
if you can handle O(n*m) memory (where n is
number of rows, m number of columns).

- Bob Carpenter
   Alias-i

On October 20, 2009, prasenjit mukherjee <prasen.bea@...> wrote:

> Thanks for your reply.
>
> Seems that the solution (SvdDemo) slightly varies with each run. One of the
> runs gave me the following result :
>
> Singular Values
>      [java] sigma[0]72.86991439169711
>      [java] sigma[1]2.2882752129455235
>      [java]
>      [java] Left Singular Vectors
>      [java] row[0]=
>      [java] -0.7081428760032364, -0.016589125848884684
>      [java] row[1]=
>      [java] -0.7055497573194808, -0.02076764463081699
>      [java] row[2]=
>      [java] 0.027077797404641172, -0.9996466905062298
>      [java]
>      [java] Right Singular Vectors
>      [java] row[0]=
>      [java] -0.038845155669432425, -0.42071833667019104
>      [java] row[1]=
>      [java] -0.03791002129449349, -0.9070980861851431
>      [java] row[2]=
>      [java] 0.9985258555322785, -0.013005507630171286
>
> -Prasen

#776 From: "fayas" <fayas_h@...>
Date: Thu Oct 29, 2009 2:29 pm
Subject: Re: Accessing LingPipe from Servlet ?
fayas_h
 
Hi,

Thanks for your detailed information.

I got it working with my project. The sample code in the tutorial folder helped
me a lot in solving the issue.

Basically, it's very simple if we use the Eclipse IDE for servlet development.

Steps:

Install Tomcat.

Install the Eclipse JEE IDE environment and run a sample servlet.
Ref: http://www.youtube.com/watch?v=EOkN5IPoJVs

Now all you have to do is include lingpipe-3.8.2.jar in the environment and
start using its classes by importing packages.

Thanks
Fayas.

--- In LingPipe@yahoogroups.com, "Bob Carpenter" <carp@...> wrote:
>
>
> I'm afraid to say there is no "running LingPipe".
> LingPipe is just a set of Java APIs.  That means
> you need to write Java code to access LingPipe's
> functionality.  (In this way, it's just like Lucene.)
>
> The commands (.bat and .sh files) are just there
> for demo purposes.  You can look at the code for
> them in $LINGPIPE/demos/generic and design your
> own commands or embed the same processing in servlets.
>
> The other places to find example code for LingPipe
> are the tutorials and the sandbox repositories.
> The tutorials are much better documented and kept
> up to date with releases of LingPipe and other
> dependent libraries.
>
> There's also an example of embedding the same
> commands in servlets in the same place.  It's
> a bit complex, though, because I wrote adapters
> to use the same basic processor framework
> for all the apps, so that it'd be runnable
> from the command line, from a GUI, or from
> a servlet.
>
> In practice, you'll want to just write your own
> servlet to fit into whatever web development
> framework/environment you're using.
>
> Calling LingPipe from a servlet is no different
> than calling any other code from a servlet.
>
> You'll have to make sure LingPipe's jar's on
> the classpath, as well as any dependent
> jars you need for the application (e.g. for
> HTML parsing, for file upload, etc.).  You'll
> also want the compiled form of the models
> you need on the classpath so you can access
> them as resources from within an application.
>
> All of LingPipe's services are thread safe
> for analysis (not necessarily for training),
> so you can just have a single instance and
> let the servlet run multi-threaded.
>
> You'll probably need to read models in, which
> can be done through resources.  Just open the resource
> as a stream and wrap it in an object input stream to
> deserialize the model.
>
> - Bob Carpenter
>   Alias-i
>
> On October 27, 2009, fayas <fayas_h@...> wrote:
>
> > Hi all,
> >
> > I have two requirements.
> >
> > Requirement No 1:
> > -----------------
> >
> > I have installed (unzipped) LingPipe on my Windows XP machine, and I
> > can run the demo files (.bat files) without any problem. But I don't
> > know how to run LingPipe itself, or where to pass the inputs and get
> > a response.
> >
> > Requirement No 2:
> > -----------------
> >
> > I am developing a web application and I want to do analytics on my site
> > content. How can I access LingPipe from a servlet to send a request and
> > retrieve a response?
> >
> > If there is a code example, it would be very helpful.
> >
> > Thanks & Regards,
> > Fayas
>

#777 From: Bob Carpenter <carp@...>
Date: Thu Oct 29, 2009 6:58 pm
Subject: Re: Re: Accessing LingPipe from Servlet ?
colloquialdo...
 
Glad you got it working.  Using LingPipe's like
using any other library.  You just need to make
the jar accessible to the web app.

While you can install the jar directly in the
web server's lib directory, we prefer to package LingPipe
along with a web archive (.war) file, so that everything's
in one place and you can install/deploy the
war as a single file.

The main reason this is easier is that it isolates
dependencies.  Tomcat has its own class loader that
keeps the libs for different applications separate.
This allows different versions of the same libs to
run in the same servlet container.

- Bob Carpenter
    Alias-i

fayas wrote:

> Thanks for your detailed information.
>
> I got it working with my project. The sample code in the tutorial folder
> helped me a lot in solving the issue.
>
> Basically, it's very simple if we use the Eclipse IDE for servlet development.
>
> Steps:
>
> Install Tomcat.
>
> Install Eclipse JEE IDE environment. And execute a sample servlet
> Ref: http://www.youtube.com/watch?v=EOkN5IPoJVs
> <http://www.youtube.com/watch?v=EOkN5IPoJVs>
>
> Now all you have to do is include lingpipe-3.8.2.jar in the environment and
> start using its classes by importing packages.
>
> Thanks
> Fayas.

#778 From: Brian Frutchey <mbfrutchey@...>
Date: Sat Nov 21, 2009 12:30 am
Subject: Chunker performance
mbfrutchey
 
I am hoping to improve the performance of my LingPipe Named Entity Recognition
tests.  Currently I am only able to extract entities from about 5K of text/sec
using either of the below methods:

{code}
Chunker chunker = (Chunker) AbstractExternalizable.readObject(
        new File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));
Chunking entities = chunker.chunk(sentence);
for (Chunk entity : entities.chunkSet()) {
    String entityValue
        = sentence.substring(entity.start(), entity.end()).trim();
    logger.info("Found entity (" + entityValue + ", type: "
            + entity.type() + ") with score of " + entity.score());
}
{/code}

{code}
ConfidenceChunker chunker = (ConfidenceChunker) AbstractExternalizable.readObject(
        new File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));
Iterator<Chunk> entities = chunker.nBestChunks(sentence.toCharArray(),
        0, sentence.length(), 6);
while (entities.hasNext()) {
    Chunk entity = entities.next();
    if (entity.score() > .98) {
        String entityValue
            = sentence.substring(entity.start(), entity.end()).trim();
        logger.info("Found entity (" + entityValue + ", type: "
                + entity.type() + ") with score of " + entity.score());
    }
}
{/code}


Apparently the LmRescoringChunkers are the slowest performers, but my corpus is
news data not biomedical so testing with the other downloadable models produces
no/incorrect/undesirable entities - however they do perform faster.  Do I need
to train a HmmChunker for news to speed things up?  I am hoping there are a few
other tricks to be implemented, but I don't have the time to trace code on a
hope...  help?

#779 From: breck <breck@...>
Date: Sat Nov 21, 2009 6:01 pm
Subject: Re: Chunker performance
reckb
 
Brian Frutchey wrote:
> I am hoping to improve the performance of my LingPipe Named Entity Recognition
> tests.  Currently I am only able to extract entities from about 5K of text/sec
> using either of the below methods:
>
> {code}
> Chunker chunker =
> (Chunker)AbstractExternalizable.readObject(new
> File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));
> Chunking entities = chunker.chunk(sentence);
> for(Chunk entity : entities.chunkSet()) {
>     String entityValue = sentence.substring(entity.start(),
>             entity.end()).trim();
>     logger.info("Found entity ("+entityValue+", type: "
>             +entity.type()+") with score of "+entity.score());
> }
> {/code}
>
> {code}
> ConfidenceChunker chunker =
> (ConfidenceChunker)AbstractExternalizable.readObject(new
> File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));
> Iterator<Chunk> entities = chunker.nBestChunks(sentence.toCharArray(),
>         0, sentence.length(), 6);
> while(entities.hasNext()) {
>     Chunk entity = entities.next();
>     if(entity.score() > .98) {
>         String entityValue = sentence.substring(entity.start(),
>                 entity.end()).trim();
>         logger.info("Found entity ("+entityValue+", type: "
>             +entity.type()+") with score of "+entity.score());
>     }
> }
> {/code}
>
>
> Apparently the LmRescoringChunkers are the slowest performers, but my corpus
> is news data not biomedical so testing with the other downloadable models
> produces no/incorrect/undesirable entities - however they do perform faster.  Do
> I need to train a HmmChunker for news to speed things up?

The other chunkers are much faster. I can get you models for English
news when I get back to the office. But the named entity tutorial
explains clearly how to train models if you have data.

If you really want to do a proper NE eval, you should match the training
data to the evaluation data. That means creating around 300,000 tokens of
annotated text. It could be done by one person in about a week.

best

Breck


> I am hoping there are a few other tricks to be implemented, but I don't have
> the time to trace code on a hope...  help?

#780 From: Bob Carpenter <carp@...>
Date: Mon Nov 23, 2009 7:15 pm
Subject: Re: Chunker performance
colloquialdo...
 
Indeed, the rescoring chunker is the slowest.  It
works by generating an n-best list of possible
chunkings using a base HMM chunker, then rescores
them  using a more compute-intensive longer-distance
model.  Time will depend on the number of chunkings
that get rescored, and that's configurable.  It may
also help to add a cache to the contained chunker.
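The generate-then-rescore idea can be sketched generically as below. This is only an illustration of the control flow, not LingPipe's implementation; `NBestRescorer` and its parameters are invented names, and the two scoring functions stand in for the cheap base model and the expensive rescoring model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class NBestRescorer {
    // Keeps the numRescored candidates ranked highest by the cheap base
    // score, then returns whichever of those the expensive model prefers.
    public static <T> T best(List<T> candidates,
                             ToDoubleFunction<T> baseScore,
                             ToDoubleFunction<T> rescore,
                             int numRescored) {
        if (candidates.isEmpty() || numRescored < 1) {
            throw new IllegalArgumentException("need at least one candidate");
        }
        List<T> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(baseScore).reversed());
        int limit = Math.min(numRescored, sorted.size());
        T best = sorted.get(0);
        double bestScore = rescore.applyAsDouble(best);
        // The expensive model is called once per rescored candidate,
        // so run time grows with numRescored.
        for (int k = 1; k < limit; ++k) {
            T candidate = sorted.get(k);
            double score = rescore.applyAsDouble(candidate);
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

With numRescored set to 1 the result is just the base model's first-best answer, which is why lowering the number of rescored chunkings trades accuracy for speed.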

First, cast to the actual type (which you can
infer using .getClass() and printing it out):

AbstractCharLmRescoringChunker chunker
= (AbstractCharLmRescoringChunker) deserializedChunker;

You can also make the rescoring chunker faster by reducing
the number of chunkings rescored:

chunker.setNumChunkingsRescored(NUM_CHUNKINGS_RESCORED);

Then, pull out the base chunker, cast it to an HMM
chunker,

HmmChunker baseChunker
= (HmmChunker) chunker.baseChunker();

pull out the HMM decoder from the HMM chunker,

HmmDecoder decoder
= baseChunker.getDecoder();

and set up a cache on it:

decoder.setEmissionLog2Cache(new FastCache<String,double[]>(HMM_CACHE_SIZE));

The cache size is measured in number of entries.  Each
entry maps a token to an array with one value per tag
in the model, and the number of tags will be around
5 times the number of chunk types.  The cache may
not make much difference with a lot of chunkings being
rescored -- the rescoring model will dominate.

You can further speed up the base HMM chunker by
setting beams.  For this, you need to set the smallest
beam that doesn't cause errors (or even smaller to
trade more accuracy for speed).

baseChunker.setLog2Beam(HMM_LOG2_BEAM);

baseChunker.setLog2EmissionBeam(HMM_LOG2_EMISSION_BEAM);

Once the cache is configured, the chunker is thread
safe, so you can speed things up by using more than
one thread, too.
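One way to do that with a standard thread pool is sketched below. The `analyzer` function is a stand-in for whatever you do with the shared chunker (for example, a lambda that calls chunker.chunk(text)); `ParallelAnalysis` and `analyzeAll` are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelAnalysis {
    // Applies one shared, thread-safe analyzer to every text using a
    // fixed pool of worker threads.  Results come back in input order.
    public static <R> List<R> analyzeAll(List<String> texts,
                                         Function<String, R> analyzer,
                                         int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<R> task = () -> analyzer.apply(text);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The important part is that the caches and beams are set once, before any worker thread starts using the chunker.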

If you're willing to sacrifice some accuracy,
just use a base HMM chunker directly.  Configured
with a reasonable beam and cache, it'll run
first best on the order of 300K tokens/second
in a single thread.

You can pull one out of the chunker you're using, which
is a rescoring chunker -- just use the baseChunker
above and configure its cache.

The base HMM chunker is actually better at pulling
out chunks by confidence.  But you'll need to apply
the linear cache to do that (same way as with the
log2 cache).

- Bob Carpenter
    Alias-i
