marco turchi wrote:
> Dear All,
> my name is Marco, and I'm new in this group and as lingpipe user, so do not
> hate me for some silly questions :-)
>
> My research project requires to extract Named Entity from documents written
> in different languages. I have read on the lingpipe web site that it allows
> to do it, but I do not understand if training data for the NE extractor are
> available for research purposes or they are present only in the Developer
> version of the product.
>
> Please can u help me?

We don't distribute any data, but our named entity tutorial
points to some sources of data. ELRA and LDC also distribute
data, but it's expensive. Most of the free data sets have
restrictive licenses.

You can also create your own training data using our
citationEntities sandbox project, but it's a lot of work.

Does the recognizer need to be multilingual in the
sense of handling single documents that mix several
languages? Or do you just need NE for documents that
are each written in one of several languages? If each
document contains only a single language, you can often
do language identification first and then run a
recognizer trained for that language.
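
To make the "language identification first" idea concrete, here is a
rough sketch in Python. It is not LingPipe code and uses no LingPipe
API. The stopword lists, the function names, and the placeholder
"recognizer" (which just returns capitalized tokens) are all made up
for illustration. A real setup would use a trained language
identifier and one trained NE model per language.

```python
# Toy sketch: identify the language, then dispatch to a recognizer
# for that language. All names here are hypothetical.

# Tiny stopword lists standing in for a real language identifier.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "it": {"il", "la", "di", "e", "che", "per"},
}


def identify_language(text):
    """Return the language whose stopwords overlap most with the text."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))


def make_recognizer(lang):
    """Build a placeholder recognizer for one language.

    It returns capitalized tokens (skipping the first word) paired with
    the language code. A trained per-language model would go here.
    """
    def recognize(text):
        words = text.split()
        return [(w.strip(".,"), lang) for w in words[1:] if w[:1].isupper()]
    return recognize


# One recognizer per supported language.
RECOGNIZERS = {lang: make_recognizer(lang) for lang in STOPWORDS}


def extract_entities(text):
    """Identify the document language, then run that language's recognizer."""
    lang = identify_language(text)
    return lang, RECOGNIZERS[lang](text)


if __name__ == "__main__":
    print(extract_entities("The capital of Italy is Rome."))
```

The point is only the shape of the pipeline: one language decision per
document, then a lookup of the matching model. This breaks down when a
single document mixes languages.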

For truly multilingual apps, you'll need to train
a single model with data from the different languages.
This has worked pretty well in our experience, at least
for English and Hindi.

- Bob Carpenter
Alias-i