billiontriples · The Billion Triples Challenge
#30 From: Jim Hendler <hendler@...>
Date: Fri Dec 7, 2007 3:00 am
Subject: Re: Re: Data set
 

On Dec 6, 2007, at 9:36 PM, Kingsley Idehen wrote:

[snip]

2. This project should ultimately produce a myriad of demonstrations of how Semantic
Data Web oriented technologies construct and extract value from a heterogeneous and
distributed linked data web woven together using dereferencable URIs.

[snip]

Ohh, I like that - Peter and I may have to steal it!



#29 From: Jim Hendler <hendler@...>
Date: Fri Dec 7, 2007 2:36 am
Subject: Re: Re: Intro from Talis
 
Gents - great discussion, and exactly why we created this challenge - we want everyone to be able to show off what /they/ think is something cool the Semantic Web will let us do with that kind of information -- that's why we don't want to limit it to some particular domain or questions -- please bring the things your tools can do to the set we create - that's what it's all about!!
 -JH


On Dec 6, 2007, at 6:00 PM, Ian Davis wrote:


On Thu, 2007-12-06 at 23:51 +0100, Georgi Kobilarov wrote:
> Hi Ian,
> 
> I disagree with you. Semantic Web doesn't compete with today's IR
> methods.
> I'm afraid that the questions you used as examples are quite artificial
> and boring.
> 
> Why? Because I need only 2 minutes to provide answers to all of them
> using "the current Web". There is a Wikipedia article with holidays in
> Italy...
> 

I probably shouldn't have included those sample questions because they
distracted from my point about heterogeneous data integration.

> In my opinion Semantic Web technologies have no potential providing a
> better solution for fulfilling these tasks.
> Sure, one could start designing an application using semweb technologies
> which is able to provide answers to questions like "how fast can a horse
> run". And it might work. And it might take only 2 years of development
> until users are able to find "answers" using this app as fast as they
> can today with Google. And nobody will use it.
> 
> Let us focus on tasks that are difficult to solve today. Tasks where all
> data *is* available in some form on the Web, but users are unable to
> integrate this data in order to solve the task. Data integration is the
> strength of the Semantic Web. And the tasks I think of are unlikely to
> be ever formulated into a NLP-style queries.

> Example:
> "I want to travel home from Bristol to Berlin next week. How?"
> Depending on how much time I want to spend on this task, I would
> 1. Look up flights from Bristol to Berlin
> 2. Look up flights from London to Berlin
> 3. Do this for different airlines 
> 4. Look up trains from Bristol to London
> 5. Look up busses from Bristol to London
> 6. Do this for other cities near Bristol as well 
> 7. Take into account that I could fly before Friday but need to take
> vacation from work
> 8. Compare travel times and costs
> 
> That is pretty annoying to do! Very annoying. And I end up spending more
> money than necessary because I'm too lazy.
> 
> Another example? I travel to London for a conference. Where should I
> stay? B&B or hotel? Should be near nice pubs as well as near to the
> conference. "near" to conference means short time to walk or short time
> using public transport. 
> And I have friends living in London. They have their addresses in
> Facebook...

I don't see those queries as being very much different than mine - just
more imaginative. The base problem remains of how to formulate those
questions over a heterogeneous data set including traversal.

> 
> While I wrote these two examples, more and more ways to solve them came
> into my mind. And that's one thing that should be taken into account as
> well: People cannot formulate their tasks into one query. At the
> beginning they only have an idea of what they want to achieve, and that
> idea is far from being complete. 
> 
> So, no offence, but let us find some more interesting tasks.

No offence taken :)

Ian


"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein

Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180





#28 From: "Kingsley Idehen" <kidehen@...>
Date: Fri Dec 7, 2007 2:36 am
Subject: [Dbpedia-discussion] Re: Data set
 
--- In billiontriples@yahoogroups.com, Sören Auer <auer@...> wrote:
>
> Kingsley Idehen wrote:
> > Soren: You know I can't let potential performance and scalability
> > misconceptions go unanswered :-)
>
> I didn't mean at all that Virtuoso is not capable of handling very large
> datasets (in fact I'm really impressed about its performance), but I
> think there is a very common misconception about handling a Billion triples:
>
> Handling a Billion triple or even a Trillion is easy as long as you only
> have one concurrent user. If you have 10 or even the whole Web
> everything looks completely different. But you are right with EC2 and
> Virtuoso Clustering even that might just be a matter of zeros on the
> check for Amazon ;-)
>
> Sören
>
Soren,

Sure :-)
That said, I did respond with more than 1 concurrent user in mind :-)
Anyway, the most important thing here is that:

1. a Billion Triples is low hanging fruit when you factor in all the Web 2.0
based user generated content that can be morphed (today) into RDF based
structured data via RDFizers
2. This project should ultimately produce a myriad of demonstrations of how
Semantic Data Web oriented technologies construct and extract value from a
heterogeneous and distributed linked data web woven together using
dereferencable URIs.


Kingsley

#27 From: Sören Auer <auer@...>
Date: Fri Dec 7, 2007 2:07 am
Subject: Re: [Dbpedia-discussion] Re: Data set
 
Kingsley Idehen wrote:
> Soren: You know I can't let potential performance and scalability
> misconceptions go unanswered :-)

I didn't mean at all that Virtuoso is not capable of handling very large
datasets (in fact I'm really impressed about its performance), but I
think there is a very common misconception about handling a Billion triples:

Handling a Billion triple or even a Trillion is easy as long as you only
have one concurrent user. If you have 10 or even the whole Web
everything looks completely different. But you are right with EC2 and
Virtuoso Clustering even that might just be a matter of zeros on the
check for Amazon ;-)

Sören

#26 From: Ian Davis <lists@...>
Date: Thu Dec 6, 2007 11:00 pm
Subject: RE: Re: Intro from Talis
 
On Thu, 2007-12-06 at 23:51 +0100, Georgi Kobilarov wrote:
> Hi Ian,
>
> I disagree with you. Semantic Web doesn't compete with today's IR
> methods.
> I'm afraid that the questions you used as examples are quite artificial
> and boring.
>
> Why? Because I need only 2 minutes to provide answers to all of them
> using "the current Web". There is a Wikipedia article with holidays in
> Italy...
>

I probably shouldn't have included those sample questions because they
distracted from my point about heterogeneous data integration.


> In my opinion Semantic Web technologies have no potential providing a
> better solution for fulfilling these tasks.
> Sure, one could start designing an application using semweb technologies
> which is able to provide answers to questions like "how fast can a horse
> run". And it might work. And it might take only 2 years of development
> until users are able to find "answers" using this app as fast as they
> can today with Google. And nobody will use it.
>
> Let us focus on tasks that are difficult to solve today. Tasks where all
> data *is* available in some form on the Web, but users are unable to
> integrate this data in order to solve the task. Data integration is the
> strength of the Semantic Web. And the tasks I think of are unlikely to
> be ever formulated into a NLP-style queries.


> Example:
> "I want to travel home from Bristol to Berlin next week. How?"
> Depending on how much time I want to spend on this task, I would
> 1. Look up flights from Bristol to Berlin
> 2. Look up flights from London to Berlin
> 3. Do this for different airlines
> 4. Look up trains from Bristol to London
> 5. Look up busses from Bristol to London
> 6. Do this for other cities near Bristol as well
> 7. Take into account that I could fly before Friday but need to take
> vacation from work
> 8. Compare travel times and costs
>
> That is pretty annoying to do! Very annoying. And I end up spending more
> money than necessary because I'm too lazy.
>
> Another example? I travel to London for a conference. Where should I
> stay? B&B or hotel? Should be near nice pubs as well as near to the
> conference. "near" to conference means short time to walk or short time
> using public transport.
> And I have friends living in London. They have their addresses in
> Facebook...

I don't see those queries as being very much different than mine - just
more imaginative. The base problem remains of how to formulate those
questions over a heterogeneous data set including traversal.


>
> While I wrote these two examples, more and more ways to solve them came
> into my mind. And that's one thing that should be taken into account as
> well: People cannot formulate their tasks into one query. At the
> beginning they only have an idea of what they want to achieve, and that
> idea is far from being complete.
>
> So, no offence, but let us find some more interesting tasks.

No offence taken :)

Ian

#25 From: "Giovanni Tummarello" <g.tummarello@...>
Date: Fri Dec 7, 2007 12:16 am
Subject: Re: Re: Data set
 
Hi Georgi,

human created links across specific topics are pretty precious in
reality (people follow them for a reason in fact).

An example: if you look at http://dbpedia.org/resource/Natalie_Ramsey
there is no semantic link between her and Cruel Intentions 3, while in
fact she did star there, as the text says on both the movie page and
her page.

imagine our surprise when we found out that sindice in fact knew the
link doing a semantic search! (URI based search, not a text based one)

example

http://www.sindice.com/query/lookup?type=uri&uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FNatalie_Ramsey

and for the movie

http://www.sindice.com/query/lookup?type=uri&uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FCruel_Intentions_3

Explanation? The sitemap http://dbpedia.org/sitemap.xml also contains
the dump of the links between the pages, which is however not served
as linked data. As sindice uses the dump it finds in the sitemap to do
the indexing, mystery solved.
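
The lookup links above are just the resource URI percent-encoded into a
query string. A minimal sketch in Python of how such a link is put
together, assuming only the base address and the "type" and "uri"
parameters visible in those links; build_uri_lookup and
extract_resource_uri are hypothetical helper names, and nothing is
fetched, since whether the service still answers is not assumed here:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base address copied from the lookup links quoted in this message.
SINDICE_LOOKUP = "http://www.sindice.com/query/lookup"


def build_uri_lookup(resource_uri: str) -> str:
    """Return a lookup URL asking for documents that mention resource_uri."""
    # urlencode percent-encodes ':' and '/', so the whole resource URI
    # travels as a single query parameter value.
    query = urlencode({"type": "uri", "uri": resource_uri})
    return SINDICE_LOOKUP + "?" + query


def extract_resource_uri(lookup_url: str) -> str:
    """Recover the original resource URI from a lookup URL."""
    return parse_qs(urlsplit(lookup_url).query)["uri"][0]


actress = build_uri_lookup("http://dbpedia.org/resource/Natalie_Ramsey")
movie = build_uri_lookup("http://dbpedia.org/resource/Cruel_Intentions_3")
print(actress)
print(movie)
```

The point of a URI based search is that both lookups are keyed on the
identifier itself, so a document that mentions both URIs links the
actress and the movie even when no text match exists.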

So, as Richard (who found this out!) says, this is probably to be
considered a "bug" in the sitemap. Nevertheless there is really no reason
not to serve the links (in theory!), as their semantics are in fact well
defined: they're links.. (there is no ambiguity).

  I understand that in a sense it's "subsymbolic semantics", but this
is the perfect food for computational intelligence algorithms... which
are nowadays most of what works in artificial intelligence.

And since we're making it machine readable.. the argument "visualizers
can't really handle many links" really really doesn't hold. Would a
human really want to look at a DBpedia-rendered RDF? Isn't it by
definition inferior to what Wikipedia has to offer?

but of course if that slows considerably the service, i see your point.
on the other hand if there is the technical way to serve this data...

Giovanni



>
> > Was wondering, will you be serving this information in the new version?
>
> Well, no, because I do not see any benefit and it would only slow down
> the linked data access.
>
> In my opinion Linked Data is about providing useful information, and
> this is hopefully not only a pure scalability contest.
>
> Cheers,
> Georgi
>

#24 From: Kingsley Idehen <kidehen@...>
Date: Thu Dec 6, 2007 10:57 pm
Subject: Re: [Dbpedia-discussion] Re: Data set
 
Richard Cyganiak wrote:
>
> On 6 Dec 2007, at 21:36, Kingsley Idehen wrote:
>>>> The current dbpedia does not serve as linked data all the information
>>>> it has. In fact it does not include the colinking information.
>>>> Although linking is, lets say, "weak semantic" it is still very
>>>> important.
>>>>
>>>> Was wondering, will you be serving this information in the new
>>>> version?
>>>
>>> I guess this kind of discussion fits better to the DBpedia mailinglist
>>> (which I include in my reply) since as I understood the BT-Challenge
>>> will provide a dataset for download anyway and not refer to LOD
>>> endpoints.
>>> Within DBpedia Chris and Kingsley are the ones who care about the
>>> SPARQL endpoint and I guess they will be happy to comment on this. I
>>> suppose it does (due to performance issues) not make sense to host
>>> all DBpedia datasets within one single endpoint.
>>>
>> Soren,
>>
>> We aren't worried about the size of the data sets :-) Even more so as we
>> add EC2 and Virtuoso Clustering to the mix :-)
>
> Just to be clear: The only dataset not served in the SPARQL endpoint
> and linked data is the pagelinks dataset.
>
> If we added the pagelinks into the linked data, then many resources
> would have ~100 pagelinks attached, compared to ~50 other, more
> interesting triples. Some user interfaces do not cope well with that.
> The interesting properties would be lost in the noise.
>
> The pure byte size of each HTML and RDF document would also increase
> significantly. This would further slow down response times.
>
> Considering that the pagelinks don't add much value for browsing or
> SPARQL queries, I believe that the decision not to serve the pagelinks
> is sound.

On that basis, for sure!

We need to make it easier for people to query against DBpedia; anything
that diminishes this goal is a detraction and ultimately detrimental to
the overall project.

Soren: You know I can't let potential performance and scalability
misconceptions go unanswered :-)



Kingsley
>
> Richard
>
>
>>
>>
>> --
>>
>>
>> Regards,
>>
>> Kingsley Idehen          Weblog: http://www.openlinksw.com/blog/~kidehen
>> President & CEO
>> OpenLink Software     Web: http://www.openlinksw.com
>>
>>
>>
>>
>>
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> Dbpedia-discussion@...
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>
>
>


--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software     Web: http://www.openlinksw.com

#23 From: "Georgi Kobilarov" <gkob@...>
Date: Thu Dec 6, 2007 10:51 pm
Subject: RE: Re: Intro from Talis
 
Hi Ian,

I disagree with you. Semantic Web doesn't compete with today's IR
methods.
I'm afraid that the questions you used as examples are quite artificial
and boring.

Why? Because I need only 2 minutes to provide answers to all of them
using "the current Web". There is a Wikipedia article with holidays in
Italy...

In my opinion Semantic Web technologies have no potential providing a
better solution for fulfilling these tasks.
Sure, one could start designing an application using semweb technologies
which is able to provide answers to questions like "how fast can a horse
run". And it might work. And it might take only 2 years of development
until users are able to find "answers" using this app as fast as they
can today with Google. And nobody will use it.

Let us focus on tasks that are difficult to solve today. Tasks where all
data *is* available in some form on the Web, but users are unable to
integrate this data in order to solve the task. Data integration is the
strength of the Semantic Web. And the tasks I think of are unlikely to
be ever formulated into a NLP-style queries.

Example:
"I want to travel home from Bristol to Berlin next week. How?"
Depending on how much time I want to spend on this task, I would
1. Look up flights from Bristol to Berlin
2. Look up flights from London to Berlin
3. Do this for different airlines
4. Look up trains from Bristol to London
5. Look up busses from Bristol to London
6. Do this for other cities near Bristol as well
7. Take into account that I could fly before Friday but need to take
vacation from work
8. Compare travel times and costs

That is pretty annoying to do! Very annoying. And I end up spending more
money than necessary because I'm too lazy.
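
The manual steps listed above amount to merging journey legs from
several sites and searching over the result. A minimal sketch in Python,
assuming each site can be reduced to a list of (origin, destination,
mode, minutes, cost) legs. All fares and durations below are invented
placeholders, and cheapest_route is a hypothetical helper, not part of
any tool mentioned in this thread:

```python
import heapq
from collections import defaultdict

# Each list stands in for a separate site (airlines, rail, coach).
# Every number here is an invented placeholder, not real timetable data.
FLIGHTS = [
    ("Bristol", "Berlin", "flight", 180, 140.0),
    ("London", "Berlin", "flight", 110, 60.0),
]
TRAINS = [("Bristol", "London", "train", 105, 45.0)]
COACHES = [("Bristol", "London", "coach", 170, 12.0)]


def cheapest_route(sources, start, goal):
    """Merge legs from all sources; return (total_cost, legs) or None."""
    graph = defaultdict(list)
    for legs in sources:
        for origin, dest, mode, minutes, cost in legs:
            graph[origin].append((dest, mode, minutes, cost))

    # Dijkstra's algorithm, using the fare as the edge weight.
    queue = [(0.0, start, [])]
    settled = set()
    while queue:
        cost, city, path = heapq.heappop(queue)
        if city == goal:
            return cost, path
        if city in settled:
            continue
        settled.add(city)
        for dest, mode, minutes, leg_cost in graph[city]:
            if dest not in settled:
                step = (city, dest, mode, minutes)
                heapq.heappush(queue, (cost + leg_cost, dest, path + [step]))
    return None


total, legs = cheapest_route([FLIGHTS, TRAINS, COACHES], "Bristol", "Berlin")
for origin, dest, mode, minutes in legs:
    print(f"{origin} -> {dest} by {mode} ({minutes} min)")
print("total cost:", total)
```

The search itself is the easy part. The hard part, and the one data
integration is meant to address, is getting the legs from each site into
one comparable shape in the first place. Constraints such as step 7
(days of vacation) would need a richer cost function than a single fare.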

Another example? I travel to London for a conference. Where should I
stay? B&B or hotel? Should be near nice pubs as well as near to the
conference. "near" to conference means short time to walk or short time
using public transport.
And I have friends living in London. They have their addresses in
Facebook...

While I wrote these two examples, more and more ways to solve them came
into my mind. And that's one thing that should be taken into account as
well: People cannot formulate their tasks into one query. At the
beginning they only have an idea of what they want to achieve, and that
idea is far from being complete.

So, no offence, but let us find some more interesting tasks.

Cheers,
Georgi

--
Georgi Kobilarov
www.georgikobilarov.com


> -----Original Message-----
> From: billiontriples@yahoogroups.com
> [mailto:billiontriples@yahoogroups.com] On Behalf Of Ian Davis
> Sent: Thursday, December 06, 2007 7:30 PM
> To: billiontriples@yahoogroups.com
> Subject: Re: [billiontriples] Re: Intro from Talis
>
> (this message grew in the telling...)
>
> On Wed, 2007-12-05 at 20:45 +0000, serendipity588 wrote:
>
> > With respect to your perspective, my response would be that we will
> > certainly not define a particular task you have to perform with the
> > data set. Much like with actually all good research contributions or
> > business propositions, the question is, what is, in your estimate,
> > that you could do with these billion triples that you believe your
> > competition or peers around the world can not do? What makes you
> > unique when it comes to handling this data set and how would you go
> > about proving it?
>
> Hmmm. That's a shame, because my feeling is that the challenge could
> work really well with a particular task. The theme is the "open web"
> with all the messiness and real-world expression of knowledge that
> entails. There could be a very interesting challenge involving
> discovery of information using a large number of unanticipated but
> interrelated vocabularies.
>
> I think one of the biggest challenges the semantic web faces is
> demonstrating that it offers practical benefits over the standard IR
> methods of the current crop of search engines. How can the semantic
> web help people get the information they need more easily than simply
> indexing textual content at scale.
>
> I've not seen a convincing demonstration that the semweb does enable
> this goal. I _believe_ it does, but I don't have any real evidence.
> One debate we had internally at Talis involved finding the height of
> the eiffel tower in feet. Just type those words into Google and you
> get a number of suggestions actually embedded into the abstracts of
> the search results. It even works with varying degrees of inaccuracy
> in spelling.
>
> One of the standard arguments against IR is that it only works when
> you know the search terms to use, and that they must be present on or
> near the resulting pages. IMHO the same problem occurs in the semweb
> but actually the potential number of terms (properties) is even larger
> than the human vocabularies we use to write our knowledge down in
> prose form.
>
> How do I even compose the question "what is the height of the eiffel
> tower in feet" using RDF and SPARQL (for example)? How do I know what
> terms to use for any of that query? How do I know that everyone will
> be using the same terms? Even getting to Google's level of listing the
> places where you might find the answer seems very hard to formulate a
> query for.
>
> RDF mitigates this with run-time strategies for discovering how terms
> relate to one another, e.g. dereferencing URIs to discover RDF that
> may relate the term to one you already know how to handle.
>
> So, my thought for a good challenge over "open web" and "big data" is
> to encourage the creation of several large datasets. These should use
> some schemas dereferenceable at the property and class URIs and
> strongly interlinked using RDFS and OWL to terms in other schemas. We
> already have a pile of linked data that would form a great basis for
> this.
>
> The challenge is then to build the best agent that can explore these
> datasets and answer a series of questions known to be satisfiable. The
> agents must not hard code any schema information but should discover
> the relationships at run time. That might be impractical so perhaps
> some level of base schema information needs to be encoded, but not too
> much.
>
> The goal is not to parse arbitrary natural language so the contestants
> could translate the queries into some internal format.
>
> The style of question I'm thinking of are:
>
> "Who won the best actor oscar in 1957?"
>
> "Which country has the highest GDP in South America?"
>
> "Find a recipe for guacamole"
>
> "Is 16th August 2008 a public holiday in Italy?"
>
> "How fast can a horse run in miles per hour?"
>
> Answering trivia is certainly not trivial.
>
> Ian

#22 From: Kingsley Idehen <kidehen@...>
Date: Thu Dec 6, 2007 9:36 pm
Subject: Re: [Dbpedia-discussion] Re: Data set
 
Sören Auer wrote:
> gtummarello wrote:
>
>> The current dbpedia does not serve as linked data all the information
>> it has. In fact it does not include the colinking information.
>> Although linking is, lets say, "weak semantic" it is still very important.
>>
>> Was wondering, will you be serving this information in the new version?
>>
>
> I guess this kind of discussion fits better to the DBpedia mailinglist
> (which I include in my reply) since as I understood the BT-Challenge
> will provide a dataset for download anyway and not refer to LOD endpoints.
> Within DBpedia Chris and Kingsley are the ones who care about the SPARQL
> endpoint and I guess they will be happy to comment on this. I suppose it
> does (due to performance issues) not make sense to host all DBpedia
> datasets within one single endpoint.
>
> Sören
>
>
>
Soren,

We aren't worried about the size of the data sets :-) Even more so as we
add EC2 and Virtuoso Clustering to the mix :-)

--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software     Web: http://www.openlinksw.com

#21 From: Jim Hendler <hendler@...>
Date: Thu Dec 6, 2007 8:45 pm
Subject: Re: Re: Intro from Talis
 
Ian (I wanted to inline my comments, but my mailer and the yahoo inserted junk in the email don't cooperate to let me do that)...
 I don't disagree with anything you say below.  I would just point out that search, and even discovery, is one of many metaphors for interacting with a huge data space.
There's also browsing, mining, social-recommendations, visualizing and I'm sure many others.  Peter and I are not trying to keep people from doing any of these
that interest them - we don't want to focus them on one specific aspect or task.  The idea that this needs to be a large, highly heterogeneous and interlinked dataset
is something we agree with completely (and you can bet we'll ask you to contribute your million or whatever triples)

<DIGRESSION>
  that said, there is one part of the below I actually have a strong opinion on, and I've been using the example since my first talks on this stuff before we called it Semantic Web, so at the risk of digressing from the challenge, let me mention it --- the example I used is "how many cows are in texas" (which looks a lot like your examples below). It turns out there not only isn't a correct answer to this question (which means any answer we find would need to be explored through a provenance system and such -- i.e. where did it come from), but also, the first time I typed this query using Altavista (that's how long ago it was), I found a radical vegetarian UFO web site (possibly a hoax, but who knows) which said there were no cows in Texas - that the aliens had replaced them with essentially fake cows, and that is why you shouldn't eat beef (and this was before the mad cow scare :-)) -- and it dawned on me: what makes the Web the Web is that to someone, something I believe is as weird as their belief is to me -- any kind of consensus Q/A system is not going to work on the Web, whether built from keywords or semantics, until it can also realize that it has to find a lot of different answers on the Web, and not just average them together...
  but that's a digression as I said
 -Jim H
p.s. Hmm, used to be if you Googled "How many cows are in Texas" most of the top hits were to my talks - now they're mostly sites about livestock -- however, I still seem to show up in the top ten, so use that query to find my presentations on this topic and such...
</DIGRESSION>
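
The point of the digression, that answers should be kept apart with
their provenance and not averaged together, can be sketched in a few
lines of Python. This is only an illustration: the counts and the
example.org source addresses are invented placeholders, and
answers_with_provenance is a hypothetical helper name:

```python
from collections import defaultdict

# Invented placeholder claims as (answer, source) pairs.
# None of these counts or addresses are real.
CLAIMS = [
    (12_000_000, "http://example.org/livestock-statistics"),
    (12_000_000, "http://example.org/ranching-news"),
    (0, "http://example.org/ufo-vegetarians"),
]


def answers_with_provenance(claims):
    """Group the sources under each distinct answer; discard nothing."""
    by_answer = defaultdict(list)
    for answer, source in claims:
        by_answer[answer].append(source)
    # Most widely asserted answer first, minority answers still listed.
    return sorted(by_answer.items(), key=lambda item: len(item[1]), reverse=True)


def naive_average(claims):
    """The 'consensus' shortcut: one number that hides every source."""
    return sum(answer for answer, _ in claims) / len(claims)


grouped = answers_with_provenance(CLAIMS)
for answer, sources in grouped:
    print(answer, "asserted by", sources)
print("average:", naive_average(CLAIMS))
```

With these placeholder claims the average is a figure that no source
asserted at all, while the grouped view lets a reader see who said what
and decide which sources to trust.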

On Dec 6, 2007, at 2:30 PM, Ian Davis wrote:

(this message grew in the telling...)

On Wed, 2007-12-05 at 20:45 +0000, serendipity588 wrote:

> With respect to your perspective, my response would be that we will
> certainly not define a particular task you have to perform with the
> data set. Much like with actually all good research contributions or
> business propositions, the question is, what is, in your estimate,
> that you could do with these billion triples that you believe your
> competition or peers around the world can not do? What makes you
> unique when it comes to handling this data set and how would you go
> about proving it?

Hmmm. That's a shame, because my feeling is that the challenge could
work really well with a particular task. The theme is the "open web"
with all the messiness and real-world expression of knowledge that
entails. There could be a very interesting challenge involving discovery
of information using a large number of unanticipated but interrelated
vocabularies.

I think one of the biggest challenges the semantic web faces is
demonstrating that it offers practical benefits over the standard IR
methods of the current crop of search engines. How can the semantic web
help people get the information they need more easily than simply
indexing textual content at scale.

I've not seen a convincing demonstration that the semweb does enable
this goal. I _believe_ it does, but I don't have any real evidence. One
debate we had internally at Talis involved finding the height of the
eiffel tower in feet. Just type those words into Google and you get a
number of suggestions actually embedded into the abstracts of the search
results. It even works with varying degrees of inaccuracy in spelling. 

One of the standard arguments against IR is that it only works when you
know the search terms to use, and that they must be present on or near
the resulting pages. IMHO the same problem occurs in the semweb but
actually the potential number of terms (properties) is even larger than
the human vocabularies we use to write our knowledge down in prose form.

How do I even compose the question "what is the height of the eiffel
tower in feet" using RDF and SPARQL (for example)? How do I know what
terms to use for any of that query? How do I know that everyone will be
using the same terms? Even getting to Google's level of listing the
places where you might find the answer seems very hard to formulate a
query for.

RDF mitigates this with run-time strategies for discovering how terms
relate to one another, e.g. dereferencing URIs to discover RDF that may
relate the term to one you already know how to handle. 

So, my thought for a good challenge over "open web" and "big data" is to
encourage the creation of several large datasets. These should use some
schemas dereferenceable at the property and class URIs and strongly
interlinked using RDFS and OWL to terms in other schemas. We already
have a pile of linked data that would form a great basis for this.

The challenge is then to build the best agent that can explore these
datasets and answer a series of questions known to be satisfiable. The
agents must not hard code any schema information but should discover the
relationships at run time. That might be impractical so perhaps some
level of base schema information needs to be encoded, but not too much.

The goal is not to parse arbitrary natural language so the contestants
could translate the queries into some internal format. 

The style of question I'm thinking of are:

"Who won the best actor oscar in 1957?"

"Which country has the highest GDP in South America?"

"Find a recipe for guacamole"

"Is 16th August 2008 a public holiday in Italy?"

"How fast can a horse run in miles per hour?"

Answering trivia is certainly not trivial.

Ian


"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein

Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180





#20 From: Ian Davis <lists@...>
Date: Thu Dec 6, 2007 7:30 pm
Subject: Re: Re: Intro from Talis
 
(this message grew in the telling...)

On Wed, 2007-12-05 at 20:45 +0000, serendipity588 wrote:

> With respect to your perspective, my response would be that we will
> certainly not define a particular task you have to perform with the
> data set. Much like with actually all good research contributions or
> business propositions, the question is, what is, in your estimate,
> that you could do with these billion triples that you believe your
> competition or peers around the world can not do? What makes you
> unique when it comes to handling this data set and how would you go
> about proving it?

Hmmm. That's a shame, because my feeling is that the challenge could
work really well with a particular task. The theme is the "open web"
with all the messiness and real-world expression of knowledge that
entails. There could be a very interesting challenge involving discovery
of information using a large number of unanticipated but interrelated
vocabularies.

I think one of the biggest challenges the semantic web faces is
demonstrating that it offers practical benefits over the standard IR
methods of the current crop of search engines. How can the semantic web
help people get the information they need more easily than simply
indexing textual content at scale?

I've not seen a convincing demonstration that the semweb does enable
this goal. I _believe_ it does, but I don't have any real evidence. One
debate we had internally at Talis involved finding the height of the
eiffel tower in feet. Just type those words into Google and you get a
number of suggestions actually embedded into the abstracts of the search
results. It even works with varying degrees of inaccuracy in spelling.

One of the standard arguments against IR is that it only works when you
know the search terms to use, and that they must be present on or near
the resulting pages. IMHO the same problem occurs in the semweb but
actually the potential number of terms (properties) is even larger than
the human vocabularies we use to write our knowledge down in prose form.

How do I even compose the question "what is the height of the eiffel
tower in feet" using RDF and SPARQL (for example)? How do I know what
terms to use for any of that query? How do I know that everyone will be
using the same terms? Even reaching Google's level, listing the
places where you might find the answer, seems very hard to formulate as
a query.

RDF mitigates this with run-time strategies for discovering how terms
relate to one another, e.g. dereferencing URIs to discover RDF that may
relate the term to one you already know how to handle.
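To make the run-time discovery idea above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the prefixed names (ex:, vocabA:, known:), the height value, and the dereference() helper, which only simulates fetching the RDF behind a property URI. It follows a single equivalence or sub-property link; a real agent would need to chase longer chains and cope with conflicting sources.

```python
# All URIs and the height value are illustrative placeholders, not real data.
FEET_PER_METRE = 3.28084

# Instance data from some source, using a vocabulary the agent has never seen.
DATA = [
    ("ex:EiffelTower", "vocabA:hoehe", 324.0),
]

# What the agent would get back if it dereferenced each property URI.
SCHEMA_DOCS = {
    "vocabA:hoehe": [
        ("vocabA:hoehe", "owl:equivalentProperty", "known:heightInMetres"),
    ],
}

LINKING_PREDICATES = {"owl:equivalentProperty", "rdfs:subPropertyOf"}


def dereference(uri):
    """Stand-in for an HTTP GET on the term URI; returns the triples found there."""
    return SCHEMA_DOCS.get(uri, [])


def maps_to(prop, known_prop):
    """True if prop is known_prop, or its schema document links it to known_prop."""
    if prop == known_prop:
        return True
    return any(
        s == prop and p in LINKING_PREDICATES and o == known_prop
        for s, p, o in dereference(prop)
    )


def height_in_feet(subject):
    """Return the subject's height in feet, or None if no usable property is found."""
    for s, p, o in DATA:
        if s == subject and maps_to(p, "known:heightInMetres"):
            return o * FEET_PER_METRE
    return None
```

The only term hard coded in the agent is known:heightInMetres; the mapping from vocabA:hoehe is found by looking at what the property URI itself returns.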

So, my thought for a good challenge over "open web" and "big data" is to
encourage the creation of several large datasets. These should use
schemas that are dereferenceable at the property and class URIs and
strongly interlinked, using RDFS and OWL, with terms in other schemas.
We already have a pile of linked data that would form a great basis for
this.

The challenge is then to build the best agent that can explore these
datasets and answer a series of questions known to be satisfiable. The
agents must not hard code any schema information but should discover the
relationships at run time. That might be impractical so perhaps some
level of base schema information needs to be encoded, but not too much.

The goal is not to parse arbitrary natural language so the contestants
could translate the queries into some internal format.

The style of question I'm thinking of is:

"Who won the best actor oscar in 1957?"

"Which country has the highest GDP in South America?"

"Find a recipe for guacamole"

"Is 16th August 2008 a public holiday in Italy?"

"How fast can a horse run in miles per hour?"

Answering trivia is certainly not trivial.

Ian

#19 From: "Georgi Kobilarov" <gkob@...>
Date: Thu Dec 6, 2007 7:20 pm
Subject: RE: Re: Data set
georgi.kobil...
 
Hi Giovanni,

> The current dbpedia does not serve as linked data all the information
> it has. In fact it does not include the colinking information.
> Although linking is, lets say, "weak semantic" it is still very
> important.

Could you give an example use case where it is useful to have that
dataset served as linked data?
I created it for the purpose of statistical analysis and that's done
only locally.

> Was wondering, will you be serving this information in the new version?

Well, no, because I do not see any benefit and it would only slow down
the linked data access.

In my opinion Linked Data is about providing useful information, and
this is hopefully not only a pure scalability contest.

Cheers,
Georgi

--
Georgi Kobilarov
www.georgikobilarov.com

#18 From: Sören Auer <auer@...>
Date: Thu Dec 6, 2007 7:05 pm
Subject: Re: Re: Data set
soerenauer
 
gtummarello wrote:
> The current dbpedia does not serve as linked data all the information
> it has. In fact it does not include the colinking information.
> Although linking is, lets say, "weak semantic" it is still very important.
>
> Was wondering, will you be serving this information in the new version?

I guess this kind of discussion fits better on the DBpedia mailing list
(which I include in my reply) since, as I understood, the BT-Challenge
will provide a dataset for download anyway and not refer to LOD endpoints.
Within DBpedia, Chris and Kingsley are the ones who care about the SPARQL
endpoint and I guess they will be happy to comment on this. I suppose it
does not make sense (due to performance issues) to host all DBpedia
datasets within one single endpoint.

Sören

#17 From: Jim Hendler <hendler@...>
Date: Thu Dec 6, 2007 7:01 pm
Subject: Re: Re: General thoughts
james.hendler
 

The following, somewhat tongue in cheek interaction between me and Sean Palmer on the SWIG page (http://swig.xmlhack.com/2007/12/05/2007-12-05.html) may help show the fine line we are walking:

Open Web, billion triple challenge (ISWC 08)
posted by hendler at 2007-12-05 18:28 (+) tags:
hendler: spread the word
sbp: iter('_:p <http://example.org/#prop> "%s" .' % i for i in xrange(1000000000))
hendler: the trick is we get to define the triples -- sorry sbp...
sbp: Blast.
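
sbp's one-liner is Python 2 (it relies on xrange). A rough Python 3 equivalent is sketched below; the function name synthetic_triples is made up here, and the triple template is copied from the exchange above. It shows why merely producing a billion triples proves little: the generator is lazy, so nothing is stored until someone consumes it.

```python
from itertools import islice


def synthetic_triples(count):
    """Lazily yield `count` N-Triples-style lines that differ only in the literal."""
    for i in range(count):
        yield '_:p <http://example.org/#prop> "%s" .' % i


# Nothing is materialised until the generator is consumed,
# so asking for a billion lines costs nothing up front.
billion = synthetic_triples(1_000_000_000)
first_three = list(islice(billion, 3))
```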

 


On one hand, we'd like people to show they're already using triple stores of this size; on the other hand, generating and storing 10^9 triples is not the challenge it once was (as your many incredible products and projects show).  What we have in mind, though, is for the data to be very heterogeneous, as Peter mentioned, precisely to force this competition to explore what one does with datasets like this, rather than that one can create or store them (or that there exist some useful ones).  I like the idea of making it easy for lots of people to provide us with triples they think should be included (obviously we need to think how), but I'd suggest we would not take more than a million or so from any one group, since one one-thousandth of the data coming from some source wouldn't be seen as giving anyone too much of an edge.  Taking all (or even a sizable fraction) from some existing source would seem to give an advantage.

That said, it seems to me that the interest shown by the folks on this list who already are playing with very large stores indicates that we should also consider some way that these things can compete.

 We're open to ideas

  JH

p.s. One interesting thing is that the SWC has previously been primarily of interest to academics; clearly this thrust should also include the startups and other players in this space - so again, we need some ideas.


#16 From: "Kingsley Idehen" <kidehen@...>
Date: Thu Dec 6, 2007 6:40 pm
Subject: Re: General thoughts
kidehen
 
--- In billiontriples@yahoogroups.com, "serendipity588" <pmika@...> wrote:
>
> Hi Hugh,
>
> I completely share your perspective.
>
> About your first parameter: the Web indeed has been an important
> motivation to start doing this. Most of the submissions we have
> received in previous years for the SWC have been closed domain
> applications, built on their own data sets.
>
> I'm not sure whether we actually need to place the data in different
> places. I have the feeling that many people would collect it all as a
> first step anyway and then do something with it. My idea would be to
> provide the data set in one place, but in the form of quads,
> preserving the provenance of the data.
>
> About the ontology: I would also be very dissatisfied if there was only
> one ontology in the data!
>
> Thanks again for your comments,
> Peter
>
>
Chris,

Very very important points!!

When done in an Open manner that leverages openly available linked
data, we actually get to a Billion Triples really quickly, which then
creates a nice segue to the real question: Show me something useful
from this collection of openly interlinked data that tangibly
demonstrates the new frontier we all know as the Semantic Data Web (or
Data Web for short).

As you know, I do believe that SIOC provides a really powerful Glue
Ontology for the Linked Data Web (or GGG), and we have a great
opportunity, via this project, to test this hypothesis :-)


Kingsley

#15 From: "gtummarello" <g.tummarello@...>
Date: Thu Dec 6, 2007 6:30 pm
Subject: Re: Data set
gtummarello
 
The current DBpedia does not serve all the information it has as linked
data. In fact, it does not include the colinking information.
Although linking is, let's say, "weak semantic", it is still very important.

Was wondering, will you be serving this information in the new version?

Giovanni

--- In billiontriples@yahoogroups.com, Sören Auer <auer@...> wrote:
>
> Peter Mika wrote:
> > Question 2: Could you contribute data, and how much, from any of these
> > sources?
>
> We are currently in the process of preparing the DBpedia V3 release,
> which will include extractions of most datasets in the 14 largest
> language versions, including the semantically very interesting infobox
> dataset.
> These multi-lingual infobox datasets will provide an extremely rich and
> challenging testbed for Ontology mapping, merging, querying and the like
> and I would be happy to see them as part of the BT-dataset.
> In fact they can also be seen as datasets from 14 different websites
> (that's what *.wikipedia.org are), with the nice addition that (almost)
> sameAs-Links are established between them by means of interwiki links.
> Links to the Release Candidate datasets will be shortly posted to the
> DBpedia mailing list [1]. I will be happy to forward a copy to this
> list.
>
> Best,
>
> Sören
>
> [1] https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>

#14 From: Sören Auer <auer@...>
Date: Thu Dec 6, 2007 5:07 pm
Subject: Re: Data set
soerenauer
 
Peter Mika wrote:
> Question 2: Could you contribute data, and how much, from any of these
> sources?

We are currently in the process of preparing the DBpedia V3 release,
which will include extractions of most datasets in the 14 largest
language versions, including the semantically very interesting infobox
dataset.
These multi-lingual infobox datasets will provide an extremely rich and
challenging testbed for Ontology mapping, merging, querying and the like
and I would be happy to see them as part of the BT-dataset.
In fact they can also be seen as datasets from 14 different websites
(that's what *.wikipedia.org are), with the nice addition that (almost)
sameAs-Links are established between them by means of interwiki links.
Links to the Release Candidate datasets will be shortly posted to the
DBpedia mailing list [1]. I will be happy to forward a copy to this
list.

Best,

Sören

[1] https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion

#13 From: Peter Mika <pmika@...>
Date: Thu Dec 6, 2007 11:19 am
Subject: Welcome to new members
serendipity588
 
All,

After less than 24 hours, we already have 44 members on our mailing
list, which is great to see. Welcome everyone!

Feel free to invite your colleagues who may have missed the announcement
on the mailing lists. I would also appreciate it if you could blog about
our initiative; just make sure the discussion is channeled back here.

Best,
Peter

#12 From: Peter Mika <pmika@...>
Date: Thu Dec 6, 2007 11:12 am
Subject: Data set
serendipity588
 
Hi All,

As Jim said, one of the most important aspects to discuss is the data
set. As you can expect, this is not something we have but something we
are planning to build with you as a community effort. (As has been done
for many evaluation initiatives in many other fields of science...)

The idea is to collect data from several sources, so I would like to
start with making a list of potential data sources (well, Chris started
already). So here we go:

D1. Linked data
D2. Crawls of Semantic Web search engines (mostly FOAF, I expect)
D3. Embedded metadata (RDFa, microformats)
D4. Folksonomies

Question 1: Is this list complete? Do you see other major sources?
Question 2: Could you contribute data, and how much, from any of these
sources?

With respect to D3, I'm planning to do some work myself. I expect
embedded metadata to be a significant source of information, and also
one where we can do an important favor to the community simply by making
it available for research.

My plan is to take one of the query logs of Yahoo's search engine for
the US and take a sample of queries that represent a typical work day.
Taking the top ten results for these queries, I plan to extract the
metadata from the webpages that show up. I expect this to be sparse, but
at this point it's anyone's guess just how sparse.
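
As a rough illustration of the extraction step described above, the sketch below uses Python's standard html.parser to pull RDFa property values and a few microformat root class names out of a single page. It is only a sketch under stated assumptions: the class list is a small, incomplete sample, the sample page is invented, and the query-log sampling and fetching of the top ten results are left out entirely.

```python
from html.parser import HTMLParser


class EmbeddedMetadataParser(HTMLParser):
    """Collect RDFa `property` values and microformat class names from one page."""

    # Assumed, incomplete sample of microformat root classes to look for.
    MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview"}

    def __init__(self):
        super().__init__()
        self.rdfa_properties = []
        self.microformats = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("property"):
            self.rdfa_properties.append(attributes["property"])
        for name in (attributes.get("class") or "").split():
            if name in self.MICROFORMAT_CLASSES:
                self.microformats.append(name)


def extract_metadata(html):
    """Return (rdfa_properties, microformats) found in the given HTML string."""
    parser = EmbeddedMetadataParser()
    parser.feed(html)
    return parser.rdfa_properties, parser.microformats


# Invented page; the unclosed <p> mimics the sloppy markup real pages contain.
SAMPLE_PAGE = """
<div class="vcard"><span class="fn">Example Person</span></div>
<span property="dc:title">Example title</span>
<p>Unclosed paragraph, as real pages often have
"""

found_properties, found_microformats = extract_metadata(SAMPLE_PAGE)
```

Counting how many crawled pages return two empty lists would give a first estimate of just how sparse the embedded metadata is.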

Why do I believe this will be interesting? First, the reason for
starting with a query log is that it reflects the part of the Web that
people actually want to see. Second, embedded metadata is the source
that poses the greatest challenges when it comes to data quality. In
short, my observation is that people have trouble writing HTML, let
alone following microformats...

Comments are more than welcome!

Peter

#11 From: "serendipity588" <pmika@...>
Date: Thu Dec 6, 2007 10:53 am
Subject: Re: General thoughts
serendipity588
 
Hi Hugh,

I completely share your perspective.

About your first parameter: the Web indeed has been an important
motivation to start doing this. Most of the submissions we have
received in previous years for the SWC have been closed domain
applications, built on their own data sets.

I'm not sure whether we actually need to place the data in different
places. I have the feeling that many people would collect it all as a
first step anyway and then do something with it. My idea would be to
provide the data set in one place, but in the form of quads,
preserving the provenance of the data.
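
A small sketch of what the quad idea could look like in practice: each statement carries a fourth element naming the document it came from, so the whole set can sit in one place without losing provenance. The identifiers and source URLs below are invented for illustration, and the tuple layout is just one possible encoding, not a proposed format for the challenge.

```python
from collections import defaultdict

# Each statement is (subject, predicate, object, source); all values are made up.
QUADS = [
    ("ex:alice", "foaf:knows", "ex:bob", "http://example.org/foaf/alice.rdf"),
    ("ex:alice", "foaf:name", '"Alice"', "http://example.org/foaf/alice.rdf"),
    ("ex:alice", "foaf:name", '"Alice A."', "http://example.net/crawl/page42"),
]


def triples_by_source(quads):
    """Group statements by the document they came from."""
    grouped = defaultdict(list)
    for s, p, o, source in quads:
        grouped[source].append((s, p, o))
    return dict(grouped)


def sources_for(quads, subject, predicate):
    """Map each value of subject/predicate to the set of sources asserting it."""
    values = defaultdict(set)
    for s, p, o, source in quads:
        if s == subject and p == predicate:
            values[o].add(source)
    return dict(values)
```

With the source kept on every statement, conflicting values (two different names for ex:alice here) can be traced back to where they were published.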

About the ontology: I would also be very dissatisfied if there was only
one ontology in the data!

Thanks again for your comments,
Peter



#10 From: "chrisbizer" <chris@...>
Date: Thu Dec 6, 2007 10:09 am
Subject: Re: General thoughts
chrisbizer
 
Hi Peter, Jim and all,

I think that it is a great idea to have a special track in the
Semantic Web challenge for applications that do innovative stuff with
lots of RDF data from the Web.

I also highly agree with Hugh that such a challenge will be most
interesting if the applications have to use data from lots of
independent data sources which will naturally be represented using
different vocabularies and ontologies.

Therefore, I wonder if you have to specify at all which datasets
should be used and how the datasets are distributed.

The challenge is about the WEB.

Therefore it would feel natural to me if the candidates would use
any kind of data that is published on the WEB and access the data
using standard WEB procedures (meaning HTTP).
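
As a hedged sketch of what "standard WEB procedures" can mean for a client, the snippet below builds an ordinary HTTP GET that asks for an RDF representation through the Accept header. The media types listed and the example.org URI are assumptions; a given server may offer different formats or none of these. The request is only constructed here, not sent.

```python
from urllib.request import Request

# Assumed media types; a given server may offer others or none of these.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, */*;q=0.1"


def linked_data_request(uri):
    """Build (but do not send) a GET request that asks for RDF rather than HTML."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


request = linked_data_request("http://example.org/resource/Thing")
# Passing `request` to urllib.request.urlopen would perform the actual fetch.
```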

The Linking Open Data community
(http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData)
has developed a network of independent and interlinked datasources
that publish data as Linked Data, RDF dumps and via SPARQL endpoints
on the Web.

Altogether the datasets are estimated to amount to around 2 or 3
billion triples.

See http://richard.cyganiak.de/2007/10/lod/ and
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets
for more information about the datasets.

Examples of applications that use that data are Semantic Web search
engines like Sindice or Falcons
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/SemanticWebSearchEngines
or Semantic Web browsers like Tabulator, the Zitgist browser, the
OpenLink browser or DISCO
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/SemWebClients

The LOD dataset cloud has all the characteristics Hugh was asking
about:

- they are published by independent sources
- they cover lots of different domains
- they use different ontologies
- they are part of the WEB.

Working with this data would raise all the interesting issues that
Semantic Web applications will face in the future:

1. Scalability
2. Mapping between different vocabularies and ontologies
3. Reasoning over logically inconsistent data
4. Trust and information quality assessment
5. Dealing with new, unexpected vocabularies and ontologies that are
discovered at run-time.

Therefore, I think the Billion triple challenge should not divide the
Semantic Web into different parts, by specifying a closed dataset and
any non-standard distribution mechanisms for the challenge, but just
rely on the data that is already published on the Semantic Web.

If you have some specific datasets in mind that you would like to use
for the challenge, it would feel more natural to me if you would just
publish these datasets as Linked Data on the Web and interlink them
with existing datasets from the LOD cloud, so that the data can be
browsed and crawled.

I'm really looking forward to the first round of your challenge, as I
think such a challenge is exactly what the Semantic Web needs right
now to steer the community into a strict WEB direction.

Cheers

Chris



#9 From: Steve Harris <steve.harris@...>
Date: Thu Dec 6, 2007 12:11 am
Subject: ObIntro: Garlik
theno23
 
Hi All,

I'm Steve Harris, of Garlik Ltd.

We have several RDF stores containing 1-2 billion triples, which
provide data for a product called DataPatrol. It is what Hugh would
rightly call Semantic Web Technologies, not the Semantic Web :)

We have another product called QDOS that is a Semantic Web endeavour,
but is nowhere near a billion triples.

However, the company, and I in particular, are interested in
scalability issues in RDF stores and SPARQL query engines.

- Steve

--
Steve Harris
Garlik Limited
2 Sheen Road
Richmond  TW9 1AE

T   +44(0)20 8973 2465
F   +44(0)20 8973 2301
www.garlik.com

Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
9AD

#8 From: "hughglaser" <hg@...>
Date: Wed Dec 5, 2007 10:49 pm
Subject: Re: General thoughts
hughglaser
 
Ah. I was just about to ask that question.
My perspective would be:
"What can you *do* with a billion triples?"
"Go on, wow me with something useful!"

So I think the challenge needs to be refined, which I suspect is the
question we are being asked at the moment.
I see at least two further parameters: Web and Ontology.

I am not really very excited about someone doing something with a
billion triples (BTs) in one place. That is Semantic Web Technologies,
not Semantic Web.
So I would suggest that you have to deal with BTs where the storage is
spread over, say, at least 100 sites.

I am also unexcited if the BTs are basically using the same Ontology.
So again, maybe we need to specify, say, 100 different ontologies, of
varying overlap.

Maybe these numbers are too high, but certainly they should be more
than 10 each.

Then we should have further discussion on issues such as liveness of
data (it should not be acceptable just to grab all the triples and put
them in one place, unless sophisticated caching is implemented). Also,
we should expect a degree of interlinking between the sources.

This is great stuff.
Hugh
-- 
Hugh Glaser,  Reader
              Dependable Systems & Software Engineering
              School of Electronics and Computer Science,
              University of Southampton,
              Southampton SO17 1BJ
Work: +44 (0)23 8059 3670, Fax: +44 (0)23 8059 3045
Mobile: +44 (0)78 9422 3822, Home: +44 (0)23 8061 5652
http://www.ecs.soton.ac.uk/~hg/


--- In billiontriples@yahoogroups.com, Jim Hendler <hendler@...> wrote:
>
> Just to be clear - we have lots of thinking to do about what's in the
> triples, how they're collected, etc. - but the main challenge isn't
> to be able to store them, it will be showing what can be done with
> them -- anything from visualization and analysis through search,
> inference, etc.  The only thing we are sure of at the moment is that
> it will be a very heterogeneous set of data, including some, like
> Foaf, with RDFS and OWL that can provide inferencing guidance, etc.
> So we definitely will look forward to seeing you all provide your
> capabilities.  We welcome feedback, by the way, on types of data, on
> how we might best distribute it, and how best to scope the challenge.
>   -Jim H
>
>
> "If we knew what we were doing, it wouldn't be called research, would
> it?." - Albert Einstein
>
> Prof James Hendler 		 http://www.cs.rpi.edu/~hendler
> Tetherless World Constellation Chair
> Computer Science Dept
> Rensselaer Polytechnic Institute, Troy NY 12180
>

#7 From: Jim Hendler <hendler@...>
Date: Wed Dec 5, 2007 10:33 pm
Subject: Re: General thoughts
james.hendler
 
Just to be clear - we have lots of thinking to do about what's in the triples, how they're collected, etc. - but the main challenge isn't to be able to store them, it will be showing what can be done with them -- anything from visualization and analysis through search, inference, etc.  The only thing we are sure of at the moment is that it will be a very heterogeneous set of data, including some, like Foaf, with RDFS and OWL that can provide inferencing guidance, etc.  So we definitely will look forward to seeing you all provide your capabilities.  We welcome feedback, by the way, on types of data, on how we might best distribute it, and how best to scope the challenge.
 -Jim H


"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein

Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180





#6 From: "Kingsley Idehen" <kidehen@...>
Date: Wed Dec 5, 2007 10:02 pm
Subject: Introduction: OpenLink Software
kidehen
 
All,

I assume this competition is really about demonstrating the
construction of a valuable graph comprised of at least a Billion
Triples? In our parlance this means, produce a Data Space of a Billion+
Triples with immediate, relevant, and obvious value.

Naturally, we are interested, across the spectrum of our Semantic Data
Web technology portfolio which includes:

Virtuoso [1] - Quad Store which can manage data storage for our
solutions or those of others (e.g. we now have an EC2 AMI instance [2]
for Virtuoso amongst other things). The key thing with Virtuoso is
demonstrable performance and scalability for Billion+ scale RDF data
management projects. Live examples include DBpedia [3] and the HCLS
Banff demo [4].

OpenLink Data Spaces [5] - A solution that meshes Identity and RDF
Linked Data at the Internet, Intranet, and Internet (i.e. Linked Data
Spaces in the Clouds) levels. For instance, we can see all Web
2.0-style user generated data as Linked Data coherently partitioned
using SIOC, FOAF, SKOS, Annotea Bookmarks, Annotea Annotations, vCard,
RDF Calendar, and other shared ontologies via Data Space Containers.

OpenLink Ajax Toolkit (OAT) [6] - collection of JavaScript APIs,
Widgets, and Applications that bind transparently to RDF Linked Data as
exemplified by the iSPARQL Query Builder, OpenLink RDF Browser, and
Zitgist Browser amongst other things. Thus, re this project these tools
can be used to demonstrate value via a myriad of views of the Billions
of Triples in the Data Spaces that emerge from this project.

Finally, benchmarking is a passion of ours, and we've been working on a
social-networking style benchmark based on SIOC for a while. This could
be a nice place to work on this effort [7].

Links:
1. http://dbpedia.org/resource/Virtuoso_Universal_Server
2. If you have an Amazon AWS account: AMI ID: ami-e2ca2f8b, Manifest: virtuoso-images/virtuoso-dataspace-server-manifest.xml
3. http://dbpedia.org
4. http://esw.w3.org/topic/HCLS/Banff2007Demo
5. http://virtuoso.openlinksw.com/wiki/main/Main/OdsIndex - ODS (Open Source Edition)
6. http://sourceforge.net/projects/oat - OAT Toolkit (live demos at: http://demo.openlinksw.com/oatdemo )
7. http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1276

Kingsley Idehen
Personal Web Data Space URI:
http://kidehen.idehen.net/dataspace/person/kidehen#this
Blog Data Space URL: http://www.openlinksw.com/blog/~kidehen

OpenLink Software
Organization Web Data Space URI:
http://www.openlinksw.com/dataspace/organization/openlink#this
Company Web Site URL: http://www.openlinksw.com

#5 From: "Orri Erling" <erling@...>
Date: Wed Dec 5, 2007 9:48 pm
Subject: Intro from OpenLink
oerling
 
Colleagues,

It is great to see all this interest in questions of sem web scalability.

For OpenLink, this challenge represents a forum both for refining the requirements on our technology and for testing our ongoing development work, most recently in clustered triple stores and mapping of relational data to RDF.

Concurrently with this, there is a nascent activity in the form of a W3C experimental group for gathering experiences and discussing the theme of RDF storage and access benchmarking. HP Labs, DERI and OpenLink, the founders of the group, will be circulating invitations shortly.

I see lots of possible synergy between this challenge and the benchmarking experimental group.

OpenLink will be offering its new generation RDF databasing platform to participants as hosted on Amazon EC2, as well as a part of its Virtuoso Open Source line.

Previews of the new generation of Virtuoso, as well as ideas for a social web oriented RDF benchmark workload, can be found at

http://virtuoso.openlinksw.com/blog

We are interested from the pure triple store scaling angle as well as in the matter of combining RDF and analytics and extending the reach of SPARQL. Innovative applications of large, web-style data sets are an ideal platform for this.

Regards

Orri Erling
Program Manager, Virtuoso
OpenLink Software

#4 From: "serendipity588" <pmika@...>
Date: Wed Dec 5, 2007 8:45 pm
Subject: Re: Intro from Talis
serendipity588
 
Ian,

Thank you very much for your offer to host the data set. We will be
very much looking for similar technical assistance in the future.

With respect to your perspective, my response would be that we will
certainly not define a particular task you have to perform with the
data set. Much as with all good research contributions or business
propositions, the question is: what is it, in your estimation, that
you could do with these billion triples that you believe your
competition or peers around the world cannot do? What makes you
unique when it comes to handling this data set, and how would you go
about proving it?

I expect most participants will have different points to make about
these billion triples, and different ways to prove them. But I hope the
winners will be those who manage to convince the jury that they did
something truly unique.

So what is your billion triples Challenge?

Best,
Peter



--- In billiontriples@yahoogroups.com, "Ian Davis" <lists@...> wrote:
>
> Hi all,
>
> Looks like there are quite a few people here (28 according to the
> website) so I thought it might be time to say hello.
>
> I'm not sure of the exact format for the billion triple challenge but
> I'm told that's why this group exists. With that in mind I'd like to
> throw Talis' perspective into the mix.
>
> If there is to be a billion triple challenge then we'd like to offer
> hosting a copy of the data set in our platform for free use by the
> entrants. We hope that by doing so we can enable many more entrants than
> would otherwise be possible since we will bear the infrastructure costs
> associated with doing interesting things at scale. The platform offers
> sparql, full text querying, linked data and various other REST based
> services that could be used to build applications. We could also offer
> write access to the data if that would make the challenge more
> interesting.
>
> Alternatively, if the challenge is more about who can host the biggest
> triple store or something like that then we're very unlikely to be
> interested in competing. We're much more interested in enabling exciting
> uses of large data sets by as many people as possible.
>
> Anyway, it'd be good to hear other people's views on what could be
> possible and what would make for a great challenge.
>
> Best regards,
>
> Ian Davis
> Chief Technology Officer, Talis
>

#3 From: Peter Mika <pmika@...>
Date: Wed Dec 5, 2007 8:37 pm
Subject: Welcome!
serendipity588
 
All,

Welcome to the mailing list!

For now, I would just want to note how delighted I am by the early
reactions to the new Challenge. The sun hasn't even set in Hawaii, and
we already have close to thirty subscriptions. I consider this very
positive, given how early a stage we are at.

We will wait a bit, and then start with some questions to you.

Best,
Peter

#2 From: "Ian Davis" <lists@...>
Date: Wed Dec 5, 2007 8:24 pm
Subject: Intro from Talis
ianalchemy
 
Hi all,

Looks like there are quite a few people here (28 according to the
website) so I thought it might be time to say hello.

I'm not sure of the exact format for the billion triple challenge but
I'm told that's why this group exists. With that in mind I'd like to
throw Talis' perspective into the mix.

If there is to be a billion triple challenge then we'd like to offer
hosting a copy of the data set in our platform for free use by the
entrants. We hope that by doing so we can enable many more entrants than
would otherwise be possible since we will bear the infrastructure costs
associated with doing interesting things at scale. The platform offers
sparql, full text querying, linked data and various other REST based
services that could be used to build applications. We could also offer
write access to the data if that would make the challenge more
interesting.

Alternatively, if the challenge is more about who can host the biggest
triple store or something like that then we're very unlikely to be
interested in competing. We're much more interested in enabling exciting
uses of large data sets by as many people as possible.

Anyway, it'd be good to hear other people's views on what could be
possible and what would make for a great challenge.

Best regards,

Ian Davis
Chief Technology Officer, Talis

#1 From: Jim Hendler <hendler@...>
Date: Wed Dec 5, 2007 3:11 pm
Subject: Re: Billion triples announce
james.hendler
 
OK, another test -- 

Interesting - to change my email address I had to rejoin the group, and then verify the new address - then it removes the other alternate address -- so the problem is that Yahoo! only allows 2 addresses, which I guess makes sense for most people - I have six or seven, but then I'm probably unusual (they accrued during the nearly 30 years I've been using email - I'm getting so danged old)
 -JH



On Dec 5, 2007, at 9:40 AM, Peter Mika wrote:

Hi Jim,

I've written a brief pre-announcement. Feel free to edit it! As you noticed, the group has been set up.

Best,
Peter


------------------- cut here -------------------
ANN: The Open Web, Billion Triples Challenge

This is the first public pre-announcement of the Open Web, Billion Triples Challenge, which will be organized as a special track of next year's Semantic Web Challenge [1]. This track will be in addition to the traditional SWC competition and it will focus on pushing the limits in tool design on the fronts of scalability in size and robustness in the face of data typically found on the Web. The goal of the competition is also to generate new application ideas, i.e. to show what is possible with Web metadata today.

The details of this Challenge are yet to be determined and we are calling on the Semantic Web community to help us in its formation. For this purpose we have set up a mailing list [2] and would like to invite everyone interested in this new Challenge to join this list. The mailing list will serve to discuss the data sets and rules of competition, and later to disseminate all other information regarding the new Challenge.

Best Regards,

The co-chairs of the SWC:

Jim Hendler (Rensselaer Polytechnic Institute)
Peter Mika (Yahoo! Research)





"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein

Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180




