My question was -- what kind of network analysis FOR NORMALIZING AUTHORS
AND AFFILIATIONS?
I.e., are you doing any yet, or envisioning how to do it, or just
dreaming?
I'm guessing you're doing rudimentary stuff in that particular area, and
wishing it weren't so rudimentary, but that's just a guess.
;)
Best,
CAM
At 03:18 AM 7/21/2006, William Hayes wrote:
Hi Curt,
Not would, we run collaboration network analyses for Biogen Idec
researchers looking at a disease area or target. We've even looked
at treatment technologies to figure out who to work with or consult based
on whether they are a supernode or provide significant linking
potential. We use Cytoscape for visualization and Medline or Dialog
for providing the literature sources for analysis. Our main area of
improvement is normalizing the authors and affiliations more
effectively. It's not a major problem, but it would enhance the
utility of the results.
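
A minimal sketch of the kind of collaboration-network scoring described above, assuming each paper is just a list of already-normalized author names. The function names, the sample data, and the use of "extra components created when an author is removed" as a stand-in for linking potential are all illustrative assumptions, not the method used at Biogen Idec.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(papers):
    """Map each author to the set of authors they have published with."""
    graph = defaultdict(set)
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph[author]  # create a node even for single-author papers
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def count_components(graph, skip=None):
    """Count connected components, optionally ignoring one node."""
    seen = set()
    components = 0
    for start in graph:
        if start == skip or start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour != skip and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def rank_authors(papers):
    """Return (author, degree, linking) rows, best-connected first.

    degree  = number of distinct co-authors (a crude "supernode" signal)
    linking = extra components created if the author is removed
              (a crude, hypothetical "linking potential" signal)
    """
    graph = build_coauthor_graph(papers)
    base = count_components(graph)
    rows = []
    for author in graph:
        degree = len(graph[author])
        linking = max(0, count_components(graph, skip=author) - base)
        rows.append((author, degree, linking))
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


if __name__ == "__main__":
    # Made-up author lists, purely for illustration.
    sample = [
        ["Smith J", "Lee K"],
        ["Lee K", "Garcia M"],
        ["Lee K", "Chen W"],
        ["Garcia M", "Chen W"],
        ["Patel R", "Lee K"],
    ]
    for row in rank_authors(sample):
        print(row)
```

The resulting edge list could be exported for a visualization tool; the scoring only makes sense once author names have been normalized, which is the open problem mentioned above.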
It's similar in nature to what we've seen from Boston Consulting Group,
just more flexible and internally produced.
Thank you for your introduction. I noticed with significant
interest your work on collapsing name variants in a systematic
fashion. One of our outstanding problems for generating
collaboration networks using Cytoscape is just that issue of name
variants (and making sure the names refer to specific individuals).
We expect that network analysis will play a part in this
normalization. I'd like to suggest Medline (the National Library of
Medicine's biomedical abstracts database), which contains author and
affiliation information with no associated unique primary IDs, as an
excellent database on which to perform this work :)
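
A minimal sketch of one common first step for collapsing name variants, assuming author strings arrive in a few typical shapes ("Hayes WS", "Hayes, William S.", "W. S. Hayes"). The key format, the function names, and the co-author overlap score are illustrative assumptions; a surname-plus-initial key will merge distinct people, which is where network evidence such as shared co-authors might help tease them apart.

```python
import unicodedata
from collections import defaultdict


def name_key(raw):
    """Reduce an author string to a (surname, first initial) blocking key."""
    # Strip accents so that "Müller" and "Muller" share a key.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace(".", " ").strip()
    if "," in text:
        # "Surname, Given Names"
        surname, given = [part.strip() for part in text.split(",", 1)]
    else:
        parts = text.split()
        if not parts:
            return ("", "")
        if len(parts) > 1 and parts[-1].isupper() and len(parts[-1]) <= 3:
            # "Surname INITIALS", e.g. "Hayes WS"
            surname, given = " ".join(parts[:-1]), parts[-1]
        else:
            # "Given Names Surname"
            surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].upper())


def group_variants(names):
    """Group raw author strings that share a blocking key."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items()}


def coauthor_overlap(coauthors_a, coauthors_b):
    """Jaccard overlap of two sets of co-author keys (0.0 to 1.0)."""
    union = coauthors_a | coauthors_b
    return len(coauthors_a & coauthors_b) / len(union) if union else 0.0


if __name__ == "__main__":
    variants = ["Hayes WS", "Hayes, William S.", "W. S. Hayes", "Monash CA"]
    print(group_variants(variants))
```

Two records with the same key and a high co-author overlap are more plausibly the same person than two records that merely share a key; any threshold would have to be tuned on real data.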
If you do find this interesting as part of your graduate studies, I'd be
happy to help you get started.
Regards,
William
That sounds pretty interesting, William. What kind of network
analysis would you do?
Thanks,
CAM
Curt A. Monash, Ph.D.
President, Monash Information Services
curtmonash@...
(978) 266-1815 (main)
Backups: curtmonash@..., (978) 266-1866
Hi all,
Just a short introduction of myself and a response to Seth's question.
I'm a senior lecturer at Macquarie University, Sydney, Australia,
where I am doing research in the area of question answering.
My answer to Seth's question, besides suggesting that you look at
my own question answering project
<http://www.ics.mq.edu.au/~diego/answerfinder/>, is to look at the
question answering track of the TREC conferences
<http://trec.nist.gov/>, or at the CLEF conferences
<http://clef.isti.cnr.it/>, where much of the current research in QA is
published.
Text-based question answering is a very dynamic area of R&D, and the
main web search engines are starting to incorporate QA technology.
Expect to see more and more QA capabilities in future web search
engines. Current systems focus on short fact-based questions, and
they are starting to attempt more complex questions where
the answer needs to be composed from bits and pieces found in various
sources.
The following list of QA systems is not exhaustive but it can give you
an idea of what you can find nowadays.
http://www.ics.mq.edu.au/~pizzato/repository
Hope this helps.
Cheers,
Diego
--- In TextAnalytics@yahoogroups.com, Seth Grimes <grimes@...> wrote:
>
> Hello all,
>
> What's going on in the world of natural-language query/question
> answering? For examples of this, see (and try!) --
>
> http://start.csail.mit.edu/
>
> http://brainboost.com
>
> http://answers.com
>
> For that matter, go to http://google.com and enter "2 + 2 - 1/17"
> or "map Georgia." Google Enterprise has partners that are extending this
> capability to cover artifacts produced in response to a Google OneBox
> "search." If I understand this correctly, the partner's software inserts
> Google index entries and Google lists those artifacts along with document
> hits.
>
> I'm interested in particular in implementations for governmental /
> social / economic statistics (and maps). The sites I cited do alright for
> simple questions about demographics but fail on more complex but still
> typical questions. I'd guess that's because they're broadly targeted;
> perhaps they could be tuned for the vocabularies and syntaxes of stats
> questions.
>
> I'd like to hear about academic and industrial research and
> productizations and to get pointers to papers.
>
> Thanks,
>
> Seth
>
>
> --
> Seth Grimes Alta Plana Corp, analytical computing & data management
> Intelligent Enterprise magazine (CMP), Contributing Editor
> grimes@... http://altaplana.com 301-270-0795
>
I also know about the Ngram Statistics Package (NSP).
"The Ngram Statistics Package (NSP) is a suite of programs that aids in
analyzing Ngrams in text files. We define an Ngram as a sequence of 'n'
tokens that occur within a window of at least 'n' tokens in the text;
what constitutes a "token" can be defined by the user."
Project page:
http://search.cpan.org/~tpederse/Text-NSP-0.97/Docs/README.pod#DESCRIPTION
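
To make the quoted definition concrete, here is a minimal Python sketch of counting Ngrams within a window. It is not NSP itself (NSP is a Perl package); the function names, the regex tokenizer, and the choice to anchor each Ngram at the first token of its window are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """A simple, user-replaceable notion of "token": runs of word characters."""
    return re.findall(r"\w+", text.lower())


def count_ngrams(tokens, n=2, window=None):
    """Count ordered n-token sequences that fall within `window` tokens.

    With window == n (the default) this yields ordinary contiguous Ngrams.
    With window > n, tokens may be skipped, but order is preserved and each
    Ngram starts at the first token of its window.
    """
    window = n if window is None else window
    if n < 1 or window < n:
        raise ValueError("need 1 <= n <= window")
    counts = Counter()
    for i, first in enumerate(tokens):
        following = tokens[i + 1 : i + window]
        for rest in combinations(following, n - 1):
            counts[(first,) + rest] += 1
    return counts


if __name__ == "__main__":
    words = tokenize("the cat sat on the cat")
    print(count_ngrams(words, n=2).most_common(3))
    print(count_ngrams(words, n=2, window=3).most_common(3))
```

Swapping in a different `tokenize` function changes what counts as a token, which mirrors the flexibility the package description mentions.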
roxana
> I'm working on a catalog of open-source software for text-analytics and
> related functions. Here's what I have so far. Please add to the list.
> After a bit more review, I'll paste this info into the Wikipedia Text
> Analytics entry.
>
> It would be great to have your reactions to the various packages!
>
> Seth
>
>
> OpenNLP
>
> "An organizational center for open source projects related to natural
> language processing.... OpenNLP also hosts a variety of java-based NLP
> tools which perform sentence detection, tokenization, pos-tagging,
> chunking and parsing, named-entity detection, and coreference using the
> OpenNLP Maxent machine learning package."
>
> Home page: http://opennlp.sourceforge.net/
>
> Project page: http://sourceforge.net/projects/opennlp/
>
>
> Carrot2
>
> "A search results clustering framework. Includes clustering components and
> a stand-alone meta search component. Combines well with indexing and
> search engines (open source and proprietary)."
>
> Home page: http://www.carrot2.org
>
> Project page: http://sourceforge.net/projects/carrot2/
>
>
> FreeLing
>
> "An open source language analysis tool suite."
>
> http://garraf.epsevg.upc.es/freeling/
>
>
> GATE -- General Architecture for Text Engineering
>
> "GATE is ... the leading toolkit for Text Mining ... comprised of an
> architecture, a free open source framework (or SDK) and graphical
> development environment."
>
> Home page: http://gate.ac.uk/index.html
>
> Project page: http://sourceforge.net/projects/gate
>
>
> Graphviz -- Graph Visualization Software
>
> "Graph visualization is a way of representing structural information as
> diagrams of abstract graphs and networks.... The Graphviz layout programs
> take descriptions of graphs in a simple text language, and make diagrams
> in several useful formats such as images and SVG for web pages, Postscript
> for inclusion in PDF or other documents; or display in an interactive
> graph browser."
>
> http://graphviz.org
>
>
> jTokeniser
>
> "The jTokeniser package was designed to combine a set of tokenisers that
> range from basic whitespace tokenisers to more complex ones that deal
> intuitively with natural language.... Tokenisers include:
>
> * WhiteSpaceTokeniser
> * StringTokeniser (based on specified delimiters)
> * RegexTokeniser (regular expression defines a token)
> * RegexSeparatorTokeniser (define what is *not* a token)
> * BreakIteratorTokeniser (sophisticated locale-specific tokeniser)
> * SentenceTokeniser (sentence segmentation)"
>
> http://www.andy-roberts.net/software/jTokeniser/
>
>
> Kea
>
> "Kea-3.0 automatically extracts keyphrases from the full text of
> documents.... Kea-4.0 is a new version of Kea that has been developed for
> controlled indexing of documents in the domain of agriculture."
>
> http://www.nzdl.org/Kea/
>
>
> LingPipe ** free but not open source
>
> "A suite of Java libraries for the linguistic analysis of human language."
>
> http://www.alias-i.com/lingpipe/index.html
>
>
> LTC -- Linguistic Tree Constructor
>
> "LTC is a free program for building linguistic syntax trees from text."
>
> Home page: http://ltc.sourceforge.net
>
> Project page: http://sourceforge.net/projects/ltc
>
>
> Lucene
>
> "Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java."
>
> http://lucene.apache.org/
>
>
> NLTK
>
> "NLTK, the Natural Language Toolkit, is a suite of program modules, data
> sets and tutorials supporting research and teaching in computational
> linguistics and natural language processing."
>
> Home page: http://nltk.sourceforge.net/index.html
>
> Project page: http://sourceforge.net/projects/nltk
>
>
> Nutch
>
> "Nutch builds on Lucene Java to provide web search application software."
>
> http://lucene.apache.org/nutch/
>
>
> TouchGraph
>
> "TouchGraph provides a hands-on way to visualize networks of interrelated
> information. Networks are rendered as interactive graphs, which lend
> themselves to a variety of transformations."
>
> Home page: http://www.touchgraph.com/
>
> Project page: http://touchgraph.sourceforge.net/
>
>
> Weka
>
> "Weka is a collection of machine learning algorithms for data mining
> tasks.... Weka contains tools for data pre-processing, classification,
> regression, clustering, association rules, and visualization. It is also
> well-suited for developing new machine learning schemes."
>
> http://www.cs.waikato.ac.nz/~ml/weka/
>
> See Weka-related projects:
> http://weka.sourceforge.net/wiki/index.php/Related_Projects
>
>
>
> --
> Seth Grimes Alta Plana Corp, analytical computing & data management
> Intelligent Enterprise magazine (CMP), Contributing Editor
> grimes@... http://altaplana.com 301-270-0795
I forgot to include UIMA --
UIMA -- Unstructured Information Management Architecture
"An open, industrial-strength, scaleable and extensible platform for
creating, integrating and deploying unstructured information management
solutions from combinations of semantic analysis and search components."
Project site at IBM Research: http://www.research.ibm.com/UIMA/
SDK site at IBM alphaWorks: http://www.alphaworks.ibm.com/tech/uima
Framework site at SourceForge: http://uima-framework.sourceforge.net
On Wed, 19 Jul 2006, Seth Grimes wrote:
> [snip]
--
Seth Grimes Alta Plana Corp, analytical computing & data management
Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@... http://altaplana.com 301-270-0795
I am interested in whatever pointers to the literature you can give me for author or affiliation disambiguation and normalization.
Good luck in your studies. I can't imagine how hard it is to keep a PhD program going while working full-time. I was lucky enough to be able to concentrate on just my PhD and thought it took forever.
William,
Thank you for your reply.
I'm sure that at some point my work will expand to include structured
data like citations, which have similar problems no matter what the
domain. But for now my work is limited to unstructured full text
documents in the domain of genealogy, so MedLine is not in the cards for
my dissertation, but perhaps for the future beyond my dissertation.
My husband is on the faculty of SUNY Upstate Medical University here in
Syracuse, so I imagine at some point I will get involved with medical or
perhaps biomedical domain work. During dinner with his colleagues one
night I was trying to explain my dissertation topic, and one of his
fellow researchers immediately drew a parallel with genes and proteins.
If you'd like, I can point you to literature that does involve work with
citations -- there's quite a bit out there.
-- Thanks again,
Mary D. Taffet
Ph.D. Candidate/Syracuse University School of Information Studies
Scientist/TextWise LLC
Syracuse, NY
> [snip]
Without knowing why you need to segment your text and what you are going to do with it downstream, I'd have to agree with Dominic that parsing the text into paragraphs is one of the best ways to segment text into passages that are consistent in content (at least in the European languages with which I'm familiar - caveat - I'm not a linguist). Sentences are designed to express an atomic fact (mostly), and paragraphs are designed to present a concept and its supporting evidence.
Text Tiling is obviously a good choice. However, as far as I know, the
implementation of this method is not easy, the algorithm is
time-consuming, and the results can be unpredictable.
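
For readers new to the idea, here is a heavily simplified sketch of the intuition behind text tiling: compare the vocabulary of adjacent blocks of sentences and place a boundary where similarity dips. It is not the full algorithm (it has no smoothing or depth scoring), and the function names, block size, and threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, block=2):
    """Similarity between the `block` sentences before and after each gap."""
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block) : gap], Counter())
        right = sum(bags[gap : gap + block], Counter())
        scores.append(cosine(left, right))
    return scores


def find_boundaries(sentences, block=2, threshold=0.1):
    """Return indexes of sentences that start a new segment.

    A gap becomes a boundary when its score is below `threshold` and is a
    local minimum among its neighbouring gaps.
    """
    scores = gap_scores(sentences, block)
    cuts = []
    for i, score in enumerate(scores):
        left_ok = i == 0 or score <= scores[i - 1]
        right_ok = i == len(scores) - 1 or score <= scores[i + 1]
        if score < threshold and left_ok and right_ok:
            cuts.append(i + 1)
    return cuts


if __name__ == "__main__":
    sample = [
        "The cat chased the mouse.",
        "The mouse hid from the cat.",
        "Stock prices fell sharply.",
        "Investors sold stock as prices fell.",
    ]
    print(find_boundaries(sample, block=1))
```

Even this toy version shows why results can be hard to predict: the boundaries depend on the block size, the threshold, and how words are tokenized.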
Have you thought about simply dividing your documents into paragraphs
or (overlapping or non-overlapping) window passages (i.e. sequences of
words)?
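
Both simple alternatives can be sketched in a few lines. This is a minimal illustration, assuming paragraphs are separated by blank lines and that passages are measured in whitespace-separated words; the function names and default sizes are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def window_passages(text, size=100, step=50):
    """Cut text into passages of `size` words, advancing `step` words each time.

    step < size gives overlapping windows; step == size gives
    non-overlapping windows.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    words = text.split()
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start : start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return passages


if __name__ == "__main__":
    demo = " ".join(f"w{i}" for i in range(10))
    print(window_passages(demo, size=4, step=2))  # overlapping
    print(window_passages(demo, size=4, step=4))  # non-overlapping
```

For a document that arrives as one solid block with no paragraph breaks, only the window approach applies directly.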
Regards,
Dominic
--- In TextAnalytics@yahoogroups.com, tamer adel <tamadel2003@...> wrote:
>
> Hi all,
> I have a text document that is one solid mass, and I want to divide it
> into multiple paragraphs that are coherent portions. The subject is new
> to me. I searched and found that the text tiling algorithm is the
> preferred method for this task. Can anyone who knows more than I do
> point me to another method or algorithm for solving this problem?
> Please reply soon; it is urgent for me to have an answer by tomorrow
> afternoon.
>
> regards,
> tam
>
>
> Tamer Abu Elenain
> Software Developer
> (+2) 012 562 74 21
Hello,
My name is Mary D. Taffet. I have a Bachelor's degree in Linguistics
from UNC-Chapel Hill, a Master's degree in Linguistics from Syracuse
University, an MLS degree in Information and Library Science from
Syracuse University's School of Information Studies and am currently a
Ph.D. Candidate at the School of Information Studies.
In between my bachelor's and master's programs, I became a business
applications programmer working with COBOL on a mainframe. Needless to
say, at some point I realized that I didn't have to choose between
working with language and working with computers, both of which I
enjoy and am fairly good at. So I went back to school with the goal of
learning about Natural Language Processing/Computational Linguistics,
which I have been focusing on since 1999. [And in the process became
one of the few skilled COBOL programmers to never ever work on a Y2K
project, though I got offers most every week it seemed...]
I was a Research Assistant at TextWise from 1999-2000, then was a
Research Assistant at the Center for Natural Language Processing at
Syracuse University's School of Information Studies from 2000-2004. Now
I'm back at TextWise as a full-time employee since 2005 working on
contextual advertising.
At some point after I started grad school, I became addicted to
genealogy, and have done the bulk of my genealogical research online
since then, with a few trips to Salt Lake City and the Montreal Archives
along the way. I am a very frustrated online genealogical researcher
due to the difficulty of searching names online. Fortunately the
difficulty in searching names online is something that even a
non-genealogical researcher can do as a dissertation as it is a general
problem for all sorts of applications. So that's the focus of my
dissertation. I am looking at the relationship between people and the
way people are referred to in written documents. I hope to bring
together all variant forms of a person's name, while at the same time
teasing apart identical names that refer to different people. I have an
electronic corpus of 14,000+ biographies from a 1904 publication
supplied by Ancestry.com.
I had to put my dissertation work aside for a while during my father's
hospitalization last year, and am still trying to get back into the
swing of things after my father passed away. It's not easy with a
fulltime job, but hopefully I will get there before too much more time
has passed.
-- Mary D. Taffet
Ph.D. Candidate/Syracuse University-School of Information Studies
Scientist/TextWise LLC
Syracuse, NY
---------- Forwarded message ----------
Date: Mon, 17 Jul 2006 16:37:12 -0500
From: "Marsh, Brice" <Brice.F.Marsh@...>
I'm academically curious, but I don't have the time to become an active
participant. My name is Brice Marsh and I'm the Executive Director of Teen
Think Tanks of America, Inc. (www.teenthinktanks.org) and we generate lots
of collaborative material that we need to classify and organize for
reporting purposes. However, my "day job" is as a senior computer
scientist with a federal contractor for NASA at Marshall Space Flight
Center and that keeps me busy. But, I do wish to be able to stay abreast
of your work and the progress of your research. So in this regard, I must
be classified as a "taker" and not a "giver", I'm sorry; but you're more
than welcome to review/use any of the material we have posted on the TTT
website, only with attribution, please.
Thanks.
Brice F. Marsh
bricemarsh@...
Tam,
Text Tiling is obviously a good choice. However, as far as I know, the
implementation of this method is not easy, the algorithm is
time-consuming, and the results can be unpredictable.
Have you thought about simply dividing your documents into paragraphs
or (overlapping or non-overlapping) window passages (i.e. sequences of
words)?
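A minimal sketch of the two simpler options, paragraph splitting and fixed-size word windows, in Python. The function names and the default window size and step are illustrative assumptions, not taken from any particular library, and whitespace tokenization is the crudest possible choice.

```python
from typing import List


def window_passages(text: str, size: int = 50, step: int = 25) -> List[str]:
    """Split text into fixed-size word windows.

    step == size gives non-overlapping passages; step < size gives
    overlapping ones. The defaults are illustrative only.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    words = text.split()  # naive whitespace tokenization
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return passages


def paragraph_passages(text: str) -> List[str]:
    """Split on blank lines, the simplest paragraph heuristic."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Paragraph splitting only helps if the document still has blank lines between paragraphs; for a single undivided block, the window version is the one that applies.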
Regards,
Dominic
--- In TextAnalytics@yahoogroups.com, tamer adel <tamadel2003@...> wrote:
>
> Hi all,
> I have a text document that is one undivided block, and I want to
> divide it into multiple paragraphs that form coherent portions. The
> subject is new to me. I searched and found that the text tiling
> algorithm is the preferred method for this task. Can anyone who knows
> more than I do point me to another method or algorithm for solving
> this problem?
> Please reply by tomorrow afternoon; this is urgent.
>
> regards,
> tam
>
>
> Tamer Abu Elenain
> Software Developer
> (+2) 012 562 74 21
> tamadel2003@...
>
Thank you for your suggestions. I'm testing/reviewing them to figure out the best solution for our problem. I'll submit my opinions/results back to the list when I'm done. I appreciate your time and enthusiasm. This list has a very nice cross-section of expertise and interests based on the introductions and responses so far. Kudos to our moderators for starting this list.
I have a text document that is one undivided block, and I want to divide it into multiple paragraphs that form coherent portions. The subject is new to me. I searched and found that the text tiling algorithm is the preferred method for this task. Can anyone who knows more than I do point me to another method or algorithm for solving this problem?
Please reply by tomorrow afternoon; this is urgent.
Hi William,
You might want to check into Schemalogic. They are based in Kirkland
WA. www.schemalogic.com
Regards,
Lew Larson
--- In TextAnalytics@yahoogroups.com, "William Hayes"
<william.s.hayes@...> wrote:
>
> Hi all,
>
> Has anyone run across a good thesauri or ontology management server.
> Something that will allow user editing of hierarchically tagged
> canonical names with their synonyms, basic visualization and can
> export into various formats such as ANSI Thesaurus format and is
> accessible via web services for accessing synonym suggestions for
> enterprise search engines? I know that's a pretty tall order, but
> it's one that we need in the text analytics area. I've run across
> some fairly expensive tools for this, but I was hoping to find
> something a good deal less pricey (and hopefully easily extendable).
>
> We need to manage protein, disease, tissue, cell line, adverse event,
> pathological process, etc thesauri. I'd like to be able to tag the
> protein thesauri with various relationship information such as Pathway
> participation, Molecular function, Biological process, etc with is-a and
> part-of relations maintained and be able to tag particular synonyms with
> filterable labels (such as 'rarely used', 'ambiguousWithGeneralText',
> 'ambiguousWithOtherProtein','stopword', etc.). This would really be a
> terminological resource server for text analytic engines. The more
> easily edited (in an ad hoc) fashion, the easier it can be set up as a
> public resource to allow (moderated?) community curation of
> terminologies for text mining.
>
> TIA,
>
> William
>
I am new to the group and I would like to introduce myself. My name is Sanford Schram but everybody calls me Sandy. I have a Master's degree in Engineering. Early in my career I did a great deal of work in simulation. Then, after seven years at Xerox Data Systems, where I managed a number of development projects, I formed Computer Strategies and developed business systems. I first got involved with TA when I designed and implemented a document processing system for a large client of mine, Baxter Healthcare. I moved on to Business Intelligence, developed a number of large reporting systems, and assisted in the development of a number of data warehouses, where I was introduced to data mining.
I have been teaching classes at the undergraduate and graduate level in database design, project management, decision support systems, and enterprise system development.
On a recent consulting engagement I had an opportunity to apply my TA experience in a new area, that of documenting software development systems. This is where I am currently focused: bringing together the various documentation objects, database metadata and code into a coherent knowledge base with multiple taxonomies so that a variety of development and design groups can work and collaborate efficiently.
So I bring an engineer's rather than a scientist's perspective to the table. Hopefully my contributions on how to use specific techniques to solve business problems, and my questions along that line, will stimulate the more theoretical members of the group.
I am pleased to have been allowed to join this group.
Hi all, This is my first time sending a message or requesting help from the Text Analytics group, and I hope you can help me. I'm researching text segmentation, a subject that is new to me, in order to solve the following problem: I want to know the procedure or algorithm for partitioning one block of text into multiple coherent portions, to facilitate retrieval or full-text search over the document. Can you point me to any good references, with illustrations, on text segmentation? As you know, text segmentation is an essential step in text processing and a part of natural language processing.
Hello to Karl Wiig, Seth Grimes, Neil Raden, and everyone I've not yet had the
pleasure of meeting either electronically or in person,
I'm Joe Firestone, Managing Director and CEO of Center for the Open Enterprise,
LLC. COE is the parent company of the Knowledge Management Consortium
International (KMCI), the KMCI Publishing Group (KMCI Press and KMCI Online
Press) and the new Adaptive Metrics Center. Both KMCI and AMC do independent
research and also offer training and consulting services. KMCI in Knowledge
Management and AMC in Business Performance Management and Measurement.
My interest in text analytics goes back to the 1950s when I first learned about
content analysis applied to Soviet studies. Later on, I did some research for
the US Air Force on Intentions Analysis and Forecasting, and on applying
computerized content analysis to the study of national intentions and long-range
forecasting of inter-nation behavior. In the early to middle 70s, I published a
few academic articles using measures of national motives in statistical models
predicting national behavior. Since then, my changing interests have carried me
into many other areas, but I've always tried to keep up with the progress of
text analytics.
More recently, my work in Knowledge Management and Adaptive Metrics has led me
back to a greater focus on text analytics, since I'm persuaded that if you want
to measure the quality of problem and knowledge claim formulation, and also the
quality of knowledge claim evaluation, one of the best ways to develop metrics
is through analysis of the semantic patterns in text. This idea has an important
place in our training workshops and in our treatment of core software tools for
knowledge management in KMCI's CKIM Certificate Training Workshop. It is also
the idea I'll be pursuing most often in this group.
Best,
Joe
Joseph M. Firestone, Ph.D.
Managing Director, CEO
KMCI and the Adaptive Metrics Center
www.kmci.org
www.adaptivemetricscenter.com
http://radio.weblogs.com/0135950
CKO
Executive Information Systems, Inc.
www.dkms.com
703-461-8823
Thank you for allowing me to participate in the TA discussion group.
I would also like an opportunity to fill you in about my work and my
practice, Business Consulting Services.
Now in our 16th year, our focus continues in business process
improvement and strategic information technology consulting.
Specifically, we seek to improve client performance by streamlining
their business processes and work flows first, then recommending the
proper technology to deliver the highest ROI. We also provide support
for SOX compliance in the areas of work flow, IT and document management
policy. While our work doesn't specialize in TA techniques per se, it
often serves to introduce our clients to the values of the discipline,
and broadens their planning perspectives.
I continue to expand and enhance our strategic partner roster. Through
our affiliation with Research and Organization Management (Bethesda,
MD), we can perform assessments of staff and executive team's
performance. In addition, you can see how you compare with industry best
practices and hundreds of other organizations. Since last year, we
provide knowledge management training and certification classes as an
affiliate of the International Knowledge Management Institute (DC).
Classes are available for both groups and individuals, with special
discounts for multiple registrations.
We also deliver many ancillary services, such as research, analysis,
performance metrics, feasibility studies, RFP development, vendor
selection and project management, to name only a few.
If you think our services can augment or assist you in any way, I would
be happy to discuss the possibilities with you.
"Performance Improvement through Technology Planning and Operational
Redesign"
Business Consulting Services improves operating results through business
process improvement and information technology consulting. Serving the
business, government and non-profit communities, we provide only senior
level resources and skill sets at competitive fees affordable to a
client's budget.
* Certified Management Consultant (CMC) is a certification mark awarded
by the Institute of Management Consultants USA and represents evidence
of the highest standards of consulting, and his adherence to the
technical and ethical canons of the profession. Less than 1% of all
management consultants have achieved this level of performance.
Certified Computing Professional (CCP) is awarded by the Institute for
the Certification of Computing Professionals, and certifies proficiency
in the information technology field.
Tom Casey is one of fewer than 15 consultants in the world to have
achieved both the Certified Management Consultant (CMC) and Certified
Computing Professional (CCP) designations, the only internationally
accepted certification in each field. To achieve this distinction, Mr.
Casey has undergone peer reviews, client audits, competency tests and
oral interviews; he has complied with continuing education requirements
and has pledged to uphold the Codes of Ethics for both organizations.
Hi. I’m co-moderator of this list and my name is
Rob Raisch.
I work for Financial Media Holdings Group here in Boston, MA,
where we produce publications, products and events of and about corporate
regulatory compliance and governance. Our flagship publication, Compliance
Week (www.complianceweek.com), is a
weekly online newsletter reaching more than 40,000 financial and legal
executives at U.S. public companies. Each month we also produce a snazzy physical
(atoms not bits) version as well.
I’ve been involved with the Internet and other forms
of online information retrieval for more than twenty years as a programmer, systems
architect, writer, and entrepreneur, and along the way, I’ve consulted
with some pretty large companies on a variety of online technologies. (You’ll
find a very out-of-date bio at www.raisch.com.)
A lot of what I do here at CW is to help our writers and
analysts make sense of the documentation generated by public companies as
required by the U.S. Securities and Exchange Commission. If you haven’t
seen the mountains of filings public companies have to provide U.S. regulators
each year, I think you’d be amazed to learn that the vast majority were designed
to be reviewed and analyzed by human beings, rather than by machines. Equally
surprising, very little has changed in how these documents are structured since
the SEC was commissioned in the 1930’s so you can imagine the problems
they present to anyone interested in extracting usable knowledge from them
using a computer. (Check out http://edgar.sec.gov,
the free online repository of some of these documents.)
Basically, it’s a big, poorly structured corpus of
valuable information; just the thing for which text analytics exists! For
me, the only real saving grace is that this variety of business communications
doesn’t deviate much from a small subset of expression, so the task isn’t
completely impossible. ::grin::
So, a lot of my day is spent coming up with interesting ways
to determine which companies provide country club memberships (and other
perquisites) to their executives, or which pharmaceutical companies reported ecologically-related
issues as material weaknesses, or how companies over $5B in market cap account
for executive stock options. And while the data can be rather dry (arid!),
it’s the hunt I find most fun and rewarding.
To do this, I use a loose bag of tools I’ve either collected
from the public domain or developed myself including various tokenizers,
lexers, parsers, parts-of-speech taggers, named-entity extractors, back-prop
neural network classifiers, etc. The only tools we’ve purchased are
the real “lights-out” backbone systems, like our full-text search
engine (from Coveo) and relational database (from Microsoft.) But even
then, I’ll use open-source replacements like Lucene and MySQL if the job
calls for them.
So yes, you guessed it! I’m a serious geek and damned
proud of it.
Hopefully, I’ll provide some perspective for those of
us working on custom systems using mostly home-grown solutions. (Oh, I
should also mention I have nothing but the greatest respect for the vendors in
this space and for their incredibly cool tools.)
So, Welcome! Glad you’re here.
--
Robert Raisch, CTO - Financial Media Holdings Group, Inc.
Publishers of "Compliance Week"
<http://www.complianceweek.com>
Hello all,
What's going on in the world of natural-language query/question
answering? For examples of this, see (and try!) --
http://start.csail.mit.edu/
http://brainboost.com
http://answers.com
For that matter, go to http://google.com and enter "2 + 2 - 1/17"
or "map Georgia." Google Enterprise has partners that are extending this
capability to cover artifacts produced in response to a Google OneBox
"search." If I understand this correctly, the partner's software inserts
Google index entries and Google lists those artifacts along with document
hits.
I'm interested in particular in implementations for governmental /
social / economic statistics (and maps). The sites I cited do alright for
simple questions about demographics but fail on more complex but still
typical questions. I'd guess that's because they're broadly targeted;
perhaps they could be tuned for the vocabularies and syntaxes of stats
questions.
I'd like to hear about academic and industrial research and
productizations and to get pointers to papers.
Thanks,
Seth
--
Seth Grimes Alta Plana Corp, analytical computing & data management
Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@... http://altaplana.com 301-270-0795
I'm Curt Monash. I've been an analyst of the software industry
since 1981, and following linguistics-related technologies since about
1983, when I helped with an investment banking deal for natural language
pioneer Artificial Intelligence Corp. I'm writing a fair
amount about text analytics these days, mainly in Computerworld
(specifically in my monthly columns in July and probably also August),
and even more so at
www.texttechnologies.com
Experiences that helped me form my views include being involved in the
rise and fall of the classical AI companies in the 1980s; having my own
unsuccessful search/classification startup in the late 1990s; and helping
build one of the Web's premier sites about public search engines, the
Spider's Apprentice, also in the 1990s.
My other big areas of professional interest are all in the software and
online services industries -- database management, analytics, knowledge
discovery, etc. Most of what I write about those can be found in
Computerworld, at www.dbms2.com (database), and at www.monashreport.com
(industry strategy and trends, public policy, analytics, etc.).
My Ph.D. was in game theory, and my only post-doc was in public policy.
Hi all,
I am new to this group. Here are a few words about myself:
I hold a Ph.D. in cognitive computer science from Université du Québec à
Montréal. In my doctoral dissertation ("Application de techniques de forage
de textes de nature prédictive et exploratoire à des fins de gestion et
d'analyse thématique de documents textuels non structurés"), I explored and
validated the use of descriptive and predictive text mining techniques to
assist thematic analysis of unstructured documents. My current research
interests concern the use of hybrid text mining techniques (using concepts
and techniques from both linguistics and artificial intelligence) to assist
ontology development from unstructured documents. I also collaborate in
various projects concerning the application of text mining techniques in the
context of institutional repositories and digital libraries,
computer-assisted reading and text analysis, etc.
I am currently a postdoctoral fellow at Observatoire de Linguistique
Sens-Texte (Université de Montréal) and will be (starting December 1st,
2006) assistant professor at École de Bibliothéconomie et des Sciences de
l'Information (Université de Montréal).
Regards,
Dominic Forest
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Dominic Forest
Chercheur postdoctoral
Observatoire de Linguistique Sens-Texte (OLST)
Université de Montréal
Courrier électronique : dominic.forest@...
Sites Internet : www.dominicforest.com
_____________________________________________________
Whereas we help clients with "Windows" and Linux environments, we use
Mac OS 10.4.7 exclusively. Any suggestions for what you pursue?
Greetings
--
Karl M. Wiig
Chairman
Knowledge Research Institute, Inc.
7101 Lake Powell Drive, Arlington, TX 76016 USA
Phone: (817) 572-6254 / Cell: (682) 554-3998 / Fax: (817) 478-1048
http://www.krii.com
Has anyone run across a good thesauri or ontology management server. Something that will allow user editing of hierarchically tagged canonical names with their synonyms, basic visualization and can export into various formats such as ANSI Thesaurus format and is accessible via web services for accessing synonym suggestions for enterprise search engines? I know that's a pretty tall order, but it's one that we need in the text analytics area. I've run across some fairly expensive tools for this, but I was hoping to find something a good deal less pricey (and hopefully easily extendable).
We need to manage protein, disease, tissue, cell line, adverse event, pathological process, etc thesauri. I'd like to be able to tag the protein thesauri with various relationship information such as Pathway participation, Molecular function, Biological process, etc with is-a and part-of relations maintained and be able to tag particular synonyms with filterable labels (such as 'rarely used', 'ambiguousWithGeneralText', 'ambiguousWithOtherProtein','stopword', etc.). This would really be a terminological resource server for text analytic engines. The more easily edited (in an ad hoc) fashion, the easier it can be set up as a public resource to allow (moderated?) community curation of terminologies for text mining.
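One way to picture the data model behind such a terminology server is the minimal Python sketch below: canonical names with synonyms, filterable labels on individual synonyms, and is-a / part-of links. All class and method names here are hypothetical, the TNF entries and their labels are only illustrative, and the sketch leaves out the editing UI, visualization, export formats and web-service access the request also asks for.

```python
from dataclasses import dataclass, field
from typing import AbstractSet, Dict, List, Optional, Set


@dataclass
class Synonym:
    text: str
    labels: Set[str] = field(default_factory=set)  # e.g. {"rarely used"}


@dataclass
class Concept:
    canonical: str
    synonyms: List[Synonym] = field(default_factory=list)
    is_a: Optional[str] = None     # canonical name of the parent concept
    part_of: Optional[str] = None  # canonical name of the containing concept


class Thesaurus:
    """Hypothetical in-memory store keyed by canonical name."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add(self, concept: Concept) -> None:
        self.concepts[concept.canonical] = concept

    def synonyms(self, canonical: str,
                 exclude: AbstractSet[str] = frozenset()) -> List[str]:
        """Return synonym strings, skipping any tagged with an excluded label."""
        concept = self.concepts[canonical]
        return [s.text for s in concept.synonyms if not (s.labels & exclude)]

    def ancestors(self, canonical: str) -> List[str]:
        """Follow is-a links upward; stops at a root, a cycle, or an unknown parent."""
        chain: List[str] = []
        seen = {canonical}
        current = self.concepts[canonical].is_a
        while current is not None and current not in seen:
            chain.append(current)
            seen.add(current)
            parent = self.concepts.get(current)
            current = parent.is_a if parent else None
        return chain


# Illustrative entries only; the labels are made up for the example.
thesaurus = Thesaurus()
thesaurus.add(Concept("protein"))
thesaurus.add(Concept("cytokine", is_a="protein"))
thesaurus.add(Concept("TNF", is_a="cytokine", synonyms=[
    Synonym("tumor necrosis factor"),
    Synonym("TNF-alpha"),
    Synonym("cachectin", {"rarely used"}),
]))
print(thesaurus.synonyms("TNF", exclude={"rarely used"}))
print(thesaurus.ancestors("TNF"))
```

Keeping labels on the synonym rather than on the concept is what lets a search engine ask for synonym suggestions while filtering out, say, everything tagged 'stopword' or 'ambiguousWithGeneralText'.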
Hello! Oracle Corp in Redwood Shores, CA has a few full-time opportunities for software developers to work on the Oracle Text development team. If you have real-world experience with developing Search, Information Retrieval, or NLP and have expertise in C or C++, please take a look at www.oracle.com/technology/products/text/index/html to gain more information. If you are interested in applying, please send your resume to me at megan.delaney@...
Oracle Recruiting: "Continuously selected by our clients as the exclusive vendor of preeminent talent"
A quick welcome to the Text Analytics discussion group. I think you'll
find we have a nice mix of researchers, vendors, and practitioners here;
also some recruiters.
While I set up the group and Rob Raisch, CTO of Compliance Week, is serving
as co-moderator, I hope to do very little moderating. You're free to post
whatever statements, questions, problems, and announcements you wish so
long as they relate to text analytics as you define that term. Just
follow the usual rules regarding respect for other list members, and
please identify yourself when posting unless you have a good reason not
to. Do introduce yourself to the list if you wish.
Thanks all,
Seth Grimes
--
Seth Grimes Alta Plana Corp, analytical computing & data management
Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@... http://altaplana.com 301-270-0795