Here are my views:
Triple Store:
The big problem with the Semantic Web, no matter how big the promises it
makes, is the number of triples that can be stored and processed. As
the number of triples increases, developers run into resource problems.
So the question is: how can I work with a billion triples? I am not
backed by an organization that can give me the resources to work at the
"big B" of a billion. Do we have a sandbox?
System requirements and benchmarking criteria are not clear.
Linked data:
This is most probably the best part of realizing the Semantic Web, and
I hope some killer apps will be developed that make people think,
"This is the reason to shift to the Semantic Web!" Until now, the Semantic
Web has been just academic hype.
Reasoning:
Reasoning comes after the triple store. Resource problem again!
Ontology Research:
Could you fine-tune this section? Is it about creating new ontologies,
creating a new language, or something else?
---
Amit Krishna Joshi
All-
Peter feels that we now have the collection and distribution of the
triples underway, which means he gets to make me do some work finally...
My role at the moment is to figure out what we would like
the challenge part of the challenge to be.
Here are some thoughts; I welcome feedback.
We see four very non-disjoint audiences for the challenge (in
fact, Peter, me, and most of the people on this list are in at least
several categories):
triple store developers, linked data technology developers, Semantic
Web researchers interested in scalable reasoning, and ontology-based
research groups.
Here are some of my thoughts with respect to these:
A - Triple Store Developers
We do not want this to be a "triple store shootout" in the sense
of who can process a query fastest or such. We don't see that
competition as being all that useful at a time when people are still
very much in development mode. Rather, we would like the outcome of
this event to be a realization in the outside world that triple stores
can and do handle these sorts of numbers (the DB folks still say
"triple stores break at a million triples" at conferences I go to - I
have no idea where they get that, but let's push it up a few orders of
magnitude!!).
So at the moment my thinking in this area is that we would like to
give you folks bragging rights for being able to support systems other
people develop (i.e., any of you who host this data and make it
available via SPARQL should be listed as "winners" in some way).
I also think that if some interesting, large, and complex SPARQL
queries are developed against this dataset (say, including FILTERs and
OPTIONALs), then those would become useful benchmarks, so we would
like to find a way to encourage the sharing of these (maybe for a
future date when a benchmarking shootout would be more appropriate).
B - Linked data technology developers
We write a lot about the Semantic Web as being the Web of linked
data, but to date, in practice, most of that data is either within an
enterprise or locked in a particular application. We are purposely
designing this dataset to be very heterogeneous, but with many
connections between pieces, so it should be a great dataset for
showing off tools that can exploit the dataweb.
In this area we are thinking of having some goals like "visualize
(or browse) the dataweb", data mining of this sort of data, etc. It
seems to us this is a ripe area for a challenge.
C - SW researchers interested in scalable reasoning
The data set we are developing will include a (large) number of
triples tied to FOAF, DOAP and other "small o" ontologies. We will also
make available a lot of data that was crawled from
microformats (where the "semantics" are well specified). This is thus
an ideal proving ground for the "little semantics goes a long way"
philosophy, and thus this also seems like an appropriate challenge area.
D - Ontology research
Big A-Box, you got it! Show us something.
So, I think we will have the "competition" be fairly unspecified - we
will identify several areas of interest from the above and work out
how to tie that into an "announcible" competition.
I welcome, NEED, your feedback on this.
-Jim H.
"If we knew what we were doing, it wouldn't be called research, would
it?" - Albert Einstein
Prof James Hendler http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180
A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications
With the rising importance of data analytics, there is more evidence than ever that
graph-style data systems can achieve new benefits by making it easier to link
and recombine complex data. But the Achilles heel of graph-style tuple storage
has always been a lack of performance at scale. Will the Semantic Web and
modern analytics finally drive innovation that makes these systems scalable? In
this SDForum interactive panel discussion we will explore that question and
more.
Join us for three unique presentations that will explore cutting-edge techniques for
scalable RDF/OWL storage, and the kinds of applications that make use of those
systems. First, we are honored to have representation from Vertica and the
Massachusetts Institute of Technology to describe how columnar store (C-Store)
data warehouse technology can enable large-scale data graphs supporting
billions of RDF triples. Next, we’ll get a peek at some GeoTemporal and Social
Network Analysis applications based on the federated Java RDF database from
Franz Technologies. Finally, a short synopsis of Oracle’s various approaches for
tuple-based storage (including in-memory, data grid, and Oracle Database RDF
solutions) will be presented and tradeoffs discussed.
Our expert guests include Andy Palmer from Vertica, Samuel R. Madden from MIT, and Jans
Aasman from Franz Technologies. Jeff Pollock from Oracle will moderate as well
as present a short summary of technical approaches to scalable RDF systems.
** our apologies if you receive multiple copies of this message **
==================================================================
CALL FOR PAPERS
ESWC 2008 Workshop
Identity and Reference on the Semantic Web (IRSW2008)
--------------------------------------------
Entity-centric Approaches to Information and
Knowledge Management on the Web
Tenerife, Spain - June 1 2008
http://www.okkam.org/IRSW2008
==================================================================
The recent developments of the Semantic Web - and the fast rise of Web
2.0 applications - make it more and more evident that the problem of
identity and reference through URIs is perhaps the single most
important issue for fostering the Semantic Web on a global scale. In a
nutshell: the effective use of the Semantic Web on a global scale
requires the systematic reuse of stable and global URIs. This in turn
requires that there exists decentralized agreement on how URIs can be
used to identify and refer to the same object. So far, uniqueness of
URIs and reference have often been taken for granted. Initiatives like
Linked Data, OntoWorld and the large number of proposals aiming at
using popular identifiers (e.g. Wikipedia's) as "canonical" URIs
(especially for "real world" objects that aren't accessible on the
Web) show that a solution to this issue is both urgent and relevant.
Solving this issue would enable and foster the decentralized and open
publication of data on the Semantic Web, would allow better and faster
semantic search engines, would be the basis for a new generation of
Semantic Web browsers, and would start the development of smarter
applications on the Web. Other vertical (and often commercial)
initiatives (like XRIs, LSID, DOI, etc.) prove that there is also
practical and business potential in a standard solution.
So far, there is little agreement on how this problem should be
addressed and solved. On the one hand we need to address technical
issues:
* How do we make sure that people and applications can find
and reuse pre-existing URIs for different types of entity?
* Is HTTP the most appropriate addressing scheme for these URIs?
* Should URIs for commonly identified entities, like people,
organizations or countries, be managed by a central service? If so,
under what conditions?
* Are centralized registries of URIs for different types of
entities necessary? Can such registries be built in a decentralized
manner while still linking data?
There are also issues of trust and security:
* What if the same URI is used to make contradictory or undesired
statements about an entity?
* Do people or groups really want a single URI to be
consistently used to represent knowledge about them on the Web, one
that could be used to effectively gather data about them?
* What is an acceptable level of security for any kind of URI registry?
* Where is the boundary between describing entities and violating
their privacy?
Despite the high level of awareness in the community, the potential
for the integration of information currently published on the Semantic
Web is still mostly unexploited. FOAF profiles do not have canonical
and reusable URIs for pointing to people one knows (only ad hoc
solutions are available, like the email hashcode); the most popular
ontology editors mint new URIs for any newly started OWL project;
social networks are not easily portable.
Starting from such a situation, this workshop aims at collecting
contributions which can roughly be grouped as follows:
* Foundations: formal and conceptual theories of identity and
reference for the Semantic Web
* Vision papers: visionary solutions to the problems of identity
and reference
* Project papers: descriptions of research & development projects
in this area
* Experiences: contributions from research and industry that
illustrate case studies or approaches to deal with the issues of
identity and reference
* Critical viewpoints: discussions of advantages and disadvantages
of previously proposed approaches.
We especially encourage contributions from groups or organizations
which are working on identification schemes for large semantic data
collections, in order to compare the different practical solutions
that have been developed to integrate Semantic Web data.
Workshop's anticipated outcome:
The anticipated outcome of the workshop is to assess the state of the
art in the area, as well as to discuss the approach and critically
evaluate the next steps in pursuing this topic. There is the potential
for creating the core of a consortium for future R&D projects on the
topic for both academia and industry.
Submission Details
------------------
All submissions will undergo a thorough peer-review process by
an international program committee, made up of leading members of
different communities from "Web 2.0", Semantic Web and Information
Retrieval researchers and companies.
Accepted contributions will be included on the ESWC2008
Conference CD as well as made available as CEUR Online Proceedings.
We invite submissions of two types:
1. full papers (up to 15 pages in LNCS format)
2. extended abstracts (up to 4 pages in LNCS format).
The authors of accepted abstracts will be requested to produce a full
paper by the time the camera-ready version is due. Accepted
contributions will be presented at the workshop. Additionally, some
submissions may be accepted as posters.
Submissions should be formatted in Springer LNCS format
(http://www.springer.de/comp/lncs/authors.html) and submitted in PDF
format.
The submission site can be reached through the webpage
http://www.easychair.org/conferences/?conf=irsw2008
Please note that at least one author of an accepted paper must
register for the ESWC 2008 conference.
Important Dates
* Paper/abstract submission: March 7, 2008
* Notification of acceptance: April 4, 2008
* Camera ready Paper submission: April 18, 2008
* Workshop: June 1, 2008
Organization
------------
Chair
Paolo Bouquet, University of Trento
Program Co-Chairs
Heiko Stoermer, University of Trento
Giovanni Tummarello, DERI Galway
Harry Halpin, University of Edinburgh
Program Committee:
Karl Aberer EPFL
Chris Bizer Freie Universität Berlin
David Booth HP
Werner Ceusters University of Buffalo
Richard Cyganiak DERI Galway
Anita De Waard Elsevier
Stefan Decker DERI Galway
Hugh Glaser University of Southampton
Andreas Harth DERI Galway
Tom Heath Talis Information Ltd
Kingsley Idehen OpenLink Software
Pierre Levy University of Ottawa
Alexander Löser SAP Research
Antonio Mana University of Malaga
Christian Morbidoni Universita' Politecnica delle Marche
Claudia Niederée L3S Research Center
Alan Ruttenberg Science Commons US
Matthias Samwald DERI Galway
Leo Sauermann DFKI
Henry Thompson University of Edinburgh UK
Marco Varone ExpertSystem IT
Bernard Vatant Mondeca FR
Thanks for the info.
-
Amit
Hi Amit,
No, I don't, as I'm not familiar with Jena. But basically the
MeasurableInputStream that you get as a result of the
response.contentAsStream() call on line 143 of the sample code is a Java
InputStream that you can process further with any API.
Best,
Peter
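For readers wondering what "process further with any API" might look like, here is a minimal sketch in plain Java. It assumes the stream carries N-Triples with one statement per line, as agreed for the challenge data, and it only counts statements per predicate. The class and method names (NTriplesStreamScan, scan) are made up for illustration and are not part of the LAW API or of Jena; the naive line splitting is a stand-in for a real RDF parser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a naive line-based scan of N-Triples from any InputStream. */
final class NTriplesStreamScan {

    /** Counts statements per predicate; skips blank lines and '#' comments. */
    static Map<String, Integer> scan(InputStream in) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                // Subject and predicate never contain whitespace in N-Triples,
                // so splitting off the first two tokens is enough for this sketch.
                String[] parts = line.split("\\s+", 3);
                if (parts.length < 3 || !parts[2].endsWith(".")) {
                    continue; // incomplete statement; a real parser would report it
                }
                counts.merge(parts[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for the content of one WARC record.
        String sample =
            "<http://example.org/a> <http://xmlns.com/foaf/0.1/name> \"Alice\" .\n"
          + "<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .\n"
          + "# a comment line\n"
          + "<http://example.org/b> <http://xmlns.com/foaf/0.1/name> \"Bob\" .\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(scan(in));
    }
}
```

In practice the ByteArrayInputStream would be replaced by the stream obtained from the WARC record. A toolkit such as Jena could presumably be given the same stream through its own N-Triples reader; that path is not tested here.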
Hello Peter,
Do we have any code written in Jena?
-
Amit
thanks for the clarification, jans
Hi Jans,
The plan is to have the entire dataset available for download in the
WARC format as a set of files. (Some users may have limitations storing
files larger than 2GB.)
The WARC format is a general format for storing the results of crawls.
It contains a header with the metadata and the HTTP response. The
example I've sent recreates the HTTP response, which you need to do if
you only have the content. (You can also store metadata in the HTTP
Response headers.)
100 million triples on our side seems to compress to about 3 GB.
Best,
Peter
Hi Peter, I'm not entirely sure what you are going to give us access
to. You (if everything goes right at Yahoo) will give us access to a
100 G crawl in ntriples but the format of the triples is based on Warc?
Jans
Dear All,
After some long and careful consideration, we have made the decision not
to invent our own format for exchanging data but to rely on an existing
format known as WARC [1], in particular WARC version 0.9. WARC archives
store provenance (URL) and timestamp in the header. The only additional
agreement we need to make is that we are going to encode files in
N-Triples format. (If that is a problem, let us know.)
What convinced us ultimately about WARC is the excellent tool support in
the form of a Java API from the Laboratory for Web Algorithmics [2] of
the Università degli studi di Milano <http://www.unimi.it/>. The API can
be downloaded from [3] and there is a separate tarball with all the
dependencies. (The license is LGPL.) One of the nice features of this
API is the ability to work with streams of compressed WARC records,
where metadata about each record is stored in the gzip header. This
means that the metadata can be read without uncompressing the content of
the record itself. Further, there are skip pointers in the file, which
means that a record can be easily skipped over.
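To illustrate why per-record compression makes skipping cheap, here is a minimal sketch using only the JDK. It assumes each record is written as its own gzip member and the members are simply concatenated; it does not use the LAW API and does not reproduce the metadata that the API keeps in the gzip header. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipMembers {
    /** Compresses each record as its own gzip member and concatenates the members. */
    static byte[] compressEach(String... records) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (String record : records) {
                GZIPOutputStream gz = new GZIPOutputStream(out);
                gz.write(record.getBytes(StandardCharsets.UTF_8));
                gz.finish(); // ends this member without closing the shared output
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all concatenated members back as one continuous stream. */
    static String decompressAll(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = compressEach("record one\n", "record two\n");
        System.out.print(decompressAll(archive));
    }
}
```

Because every member is independently compressed, a reader that knows a member's compressed length can jump over it without inflating it.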
To make it really easy, I've also created sample code that demonstrates
how to create WARC archives from a set of files or a directory structure
on disk, and how to read back the resulting WARC archive. The code is
simply attached to this email, if all is well. (First time I send
attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
authors of the LAW API, for his help and advice.
To support the Challenge, we at Yahoo! Research Barcelona are also hard
at work to get permission to release a microformat crawl of 100 million
triples. We hope this will be a significant contribution to the
state-of-the-art and will complement the existing data sets to be
provided by Semantic Web search engines.
As always, your comments and questions are more than appreciated. In
particular those of you planning to provide some data, please let us
know if you need any further help.
Thanks,
Peter
[1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
[2] http://law.dsi.unimi.it/
[3]
http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
package com.yahoo.corp.barcelona.billiontriples;
import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
import it.unimi.dsi.fastutil.io.FastBufferedOutputStream;
import it.unimi.dsi.law.warc.io.GZWarcRecord;
import it.unimi.dsi.law.warc.io.WarcRecord;
import it.unimi.dsi.law.warc.util.BURL;
import it.unimi.dsi.law.warc.util.BasicHttpResponse;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Date;
/** Sample code for creating Warc packages. This class is executable.
 *
 * @author pmika@...
 *
 */
public class WarcPackager {
    /** Maximum number of files and directories to visit; -1 means no limit. */
    public final static int MAX_RECORDS = -1;
    /** Number of files and directories visited so far. */
    private int count = 0;
    //MODIFY THIS if your filenames are not URLs
    protected BURL getURL(File file) {
        BURL result = null;
        try {
            result = BURL.parse(URLDecoder.decode(file.getName(), "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return result;
    }
    //MODIFY this if the last modification date of the file != crawl date
    protected Date getDate(File file) {
        return new Date(file.lastModified());
    }
    private WarcRecord createRecord(File file) throws UnsupportedEncodingException, IOException {
        //Validate the metadata first, so that no stream is opened for a file we cannot package
        BURL url = getURL(file);
        if (url == null) {
            throw new IllegalArgumentException("Warning: getURL() returned null for " + file);
        }
        Date date = getDate(file);
        if (date == null) {
            throw new IllegalArgumentException("Warning: getDate() returned null for " + file);
        }
        GZWarcRecord result = new GZWarcRecord();
        //Note: the stream is not closed here because the record may still read from it when it is written
        InputStream fis = new FileInputStream(file);
        //Recreate the HTTP response, since we only have the content on disk
        BasicHttpResponse response = new BasicHttpResponse();
        response.url(url);
        response.statusLine("HTTP/1.1 200 OK");
        response.status(200);
        response.contentAsStream(new FastBufferedInputStream(fis));
        response.toWarcRecord(result);
        result.header.creationDate = date;
        return result;
    }
    //recursive
    public void processFileOrDir(OutputStream out, File file) throws IOException {
        //if MAX_RECORDS is specified, and we've reached the limit, return
        if (MAX_RECORDS != -1 && count >= MAX_RECORDS) {
            return;
        }
        //Report progress every 100000 entries
        if (++count % 100000 == 0) System.err.println("Processed " + count + " files.");
        if (file.isDirectory()) {
            String[] names = file.list();
            if (names == null) {
                //list() returns null if the directory cannot be read
                System.err.println("Cannot list directory " + file);
                return;
            }
            for (String name : names) {
                processFileOrDir(out, new File(file.getAbsolutePath(), name));
            }
        } else {
            //Catch exceptions: failure to write a single file should not make us abort
            try {
                WarcRecord record = createRecord(file);
                record.write(out);
            } catch (Exception e) {
                System.err.println(e);
            }
        }
    }
    /**
     * Package the files or directories passed in as arguments.
     * Directories are processed recursively.
     *
     * The result is printed to standard out, errors/diagnostic messages to std err.
     *
     * @param args files or directories to package
     * @throws IOException
     * @throws UnsupportedEncodingException
     */
    public static void main(String[] args) throws UnsupportedEncodingException, IOException {
        if (args.length < 1) {
            System.err.println("Usage: WarcPackager <fileOrDir> ...");
            return;
        }
        FastBufferedOutputStream out = new FastBufferedOutputStream(System.out);
        WarcPackager packager = new WarcPackager();
        for (String arg : args) {
            packager.processFileOrDir(out, new File(arg));
        }
        out.close();
    }
}
package com.yahoo.corp.barcelona.billiontriples;
import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
import it.unimi.dsi.fastutil.io.MeasurableInputStream;
import it.unimi.dsi.law.warc.filters.Filter;
import it.unimi.dsi.law.warc.filters.Filters;
import it.unimi.dsi.law.warc.io.GZWarcRecord;
import it.unimi.dsi.law.warc.io.WarcFilteredIterator;
import it.unimi.dsi.law.warc.io.WarcRecord;
import it.unimi.dsi.law.warc.util.BURL;
import it.unimi.dsi.law.warc.util.WarcHttpResponse;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.openrdf.model.Statement;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParseException;
import org.openrdf.rio.helpers.RDFHandlerBase;
import org.openrdf.rio.ntriples.NTriplesParser;
/** Sample code for reading Warc packages.
 *
 * This class is executable.
 *
 * @author pmika@...
 *
 */
public class WarcReader {
    private NTriplesParser parser = new NTriplesParser();
    private CountHandler countHandler = new CountHandler();
    private int tripleCount = 0;
    private int lineCount = 0;
    /** Counts the statements the parser reports for a single document. */
    public class CountHandler extends RDFHandlerBase {
        private int count = 0;
        public void endRDF() throws RDFHandlerException {
            super.endRDF();
            //System.out.println("Counted " + count + " statements.");
        }
        public void handleStatement(Statement st) {
            count++;
        }
        public void startRDF() throws RDFHandlerException {
            super.startRDF();
            count = 0;
        }
    }
    /** A filter that accepts every URL. */
    public static class TrueFilter extends Filter<BURL> {
        @Override
        public boolean accept( BURL x ) {
            return true;
        }
        @Override
        public String toExternalForm() {
            return "true";
        }
    }
    //Parses the block as N-Triples and adds the number of statements to tripleCount
    public void countTriples(MeasurableInputStream block, String base) {
        parser.setRDFHandler(countHandler);
        try {
            parser.parse(block, base);
            tripleCount += countHandler.count;
        } catch (RDFParseException e) {
            e.printStackTrace();
        } catch (RDFHandlerException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    //Counts newline characters without parsing the content
    public void countLines(MeasurableInputStream block) throws IOException {
        int c = 0;
        while ((c = block.read()) != -1) {
            if (c == '\n') {
                lineCount++;
            }
        }
    }
    //Copies the content of the block to standard out
    public void dumpContent(MeasurableInputStream block) throws IOException {
        int c = 0;
        while ((c = block.read()) != -1) {
            System.out.write(c);
        }
    }
    /**
     * @param args the Warc file to read
     * @throws FileNotFoundException
     */
    public static void main(String[] args) throws FileNotFoundException {
        if (args.length < 1) {
            System.err.println("Usage: WarcReader <file>");
            return;
        }
        final FastBufferedInputStream in = new FastBufferedInputStream(
            new FileInputStream(new File(args[0])));
        GZWarcRecord record = new GZWarcRecord();
        Filter<WarcRecord> filter = Filters.adaptFilterBURL2WarcRecord(new TrueFilter());
        WarcFilteredIterator it = new WarcFilteredIterator(in, record, filter);
        int urlCount = 0;
        WarcReader reader = new WarcReader();
        WarcHttpResponse response = new WarcHttpResponse();
        try {
            while (it.hasNext()) {
                //Report progress every 100000 records
                if (++urlCount % 100000 == 0) System.err.println("Processed " + urlCount + " files.");
                WarcRecord nextRecord = it.next();
                //Get the HttpResponse
                try {
                    response.fromWarcRecord(nextRecord);
                    System.out.println("Processing: " + nextRecord.header.subjectUri);
                    //This will dump the content of the record
                    //reader.dumpContent(response.contentAsStream());
                    //This will count the number of triples by parsing the RDF
                    //reader.countTriples(response.contentAsStream(), nextRecord.header.subjectUri.toString());
                    //This will count the number of lines, which is equivalent to
                    //the number of triples in N-Triples format
                    reader.countLines(response.contentAsStream());
                } catch (IOException e) {
                    //A single unreadable record should not stop the run
                    e.printStackTrace();
                }
            }
        } catch (RuntimeException re) {
            //Report the failure instead of silently swallowing it, then print the totals
            System.err.println("Stopped reading: " + re);
        }
        System.out.println("Counted " + reader.lineCount + " triples from " + urlCount + " urls.");
    }
}
A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications
With the rising importance of data analytics, there is more evidence than
ever that graph style data systems can achieve new benefits by making it
easier to link and re-combine complex data. But the Achilles heel of graph
style tuple storage has always been a lack of performance at scale. Will the
Semantic Web and modern analytics finally drive innovation that makes these
systems scalable? In this SDForum interactive panel discussion we will
explore that question and more.
Join us for three unique presentations that will explore cutting-edge
techniques for scalable RDF/OWL storage, and the kinds of applications that
make use of those systems. First, we are honored to have representation from
Vertica and the Massachusetts Institute of Technology to describe how
columnar store (C-Store) data warehouse technology can enable large scale
data graphs supporting billions of RDF triples. Next, we’ll get a peek at
some GeoTemporal and Social Network Analysis applications based off the
federated Java RDF database from Franz Technologies. Finally, a short
synopsis of Oracle’s various approaches for tuple-based storage (including
in-memory, data grid, and Oracle Database RDF solutions) will be presented
and tradeoffs discussed.
Our expert guests include Andy Palmer from Vertica, Samuel R. Madden from
MIT, Jans Aasman from Franz Technologies. Jeff Pollock from Oracle will
moderate as well as present a short summary of technical approaches to
scalable RDF systems.
Hi list,
Peter Mika wrote:
>> (...)
>> URL = http://challenge.semanticweb.org/somefile.rdf
>> (...)
>> and the file would go in directory
>>
>> /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
>>
>> If we take the checksum on the contents of the file and create enough
>> levels, we can also make sure that files that are duplicates end up in
>> the same subdirectory regardless of the URL.
I quite like this last solution for one, very selfish reason: this is very
similar to the way the cache of Watson is organized. For example,
http://kmi-web05.open.ac.uk:81/cache/a/a0d/89b9/a7dd3/6e33577582/44cab50d0e34cb3ce
is the location of the file for which the sha1 checksum of the content is
aa0d89b9a7dd36e3357758244cab50d0e34cb3ce
This structure is pretty convenient for managing a large number of files,
providing a unique ID to each file according to its content, and therefore,
automatically avoiding duplicates.
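A minimal sketch of such a content-addressed layout follows. The segment widths (1, 3, 4, 5, 10 and 17 hex characters) are an assumption inferred from the single cache path quoted above, not a documented Watson convention, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentAddressedPath {
    // Assumption: widths inferred from the one example path; together they cover all 40 hex digits
    private static final int[] WIDTHS = {1, 3, 4, 5, 10, 17};

    /** Returns the SHA-1 digest of the content as 40 lowercase hex characters. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Splits a 40-character SHA-1 hex string into nested directory segments. */
    static String pathFor(String sha1Hex) {
        if (sha1Hex.length() != 40) {
            throw new IllegalArgumentException("expected 40 hex characters: " + sha1Hex);
        }
        StringBuilder path = new StringBuilder();
        int pos = 0;
        for (int width : WIDTHS) {
            if (pos > 0) {
                path.append('/');
            }
            path.append(sha1Hex, pos, pos + width);
            pos += width;
        }
        return path.toString();
    }

    public static void main(String[] args) {
        byte[] content = "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(pathFor(sha1Hex(content)));
    }
}
```

Two files with identical bytes map to the same path, so a second copy can be detected before it is written.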
> from my experience, file systems will have trouble at some point when
> there are too many files around. Thus, we avoid writing individual files
> to the file system.
> (...)
> The nice thing about ZIP archives is that you can access them from
> within any programming language (we've tried Java and Python).
Note that ZIP archives also have serious limitations, in particular a limit of
65,536 files (see http://www.info-zip.org/FAQ.html#limits).
Regards,
Mathieu.
Hi Peter,
Peter Mika wrote:
> I like this solution as well, the only thing I'm slightly worried about
> now is what happens when you unzip a large number of files. My extended
> suggestion is thus to take the SHA1 sum of the URL and create
> subdirectories based on that, say three level deep. For example, take a
> file with URL
>
> URL = http://challenge.semanticweb.org/somefile.rdf
>
> Now we could take the checksum of the URL or the checksum of the contents:
>
> checksum = ABCDEFG0123456789
>
> and the file would go in directory
>
> /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
>
> If we take the checksum on the contents of the file and create enough
> levels, we can also make sure that files that are duplicates end up in
> the same subdirectory regardless of the URL.
>
> What do you think?
>
From my experience, file systems will have trouble at some point when
there are too many files around. Thus, we avoid writing individual files
to the file system.
What worked here is:
  put source files into ZIP archives with URI urlencoded as filename
  for each file in the ZIP archive:
    process file
That way, we never have to actually put all files on the filesystem,
but do (de)compression on the fly. If we use command line tools in
the process, we iterate over the ZIP contents, write one file to disk,
process the file with the command line tool, and remove the file
again.
The nice thing about ZIP archives is that you can access them from
within any programming language (we've tried Java and Python).
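As an illustration of the loop described above, here is a minimal Java sketch that packs documents into a ZIP archive under their URL-encoded URIs and then processes each entry on the fly, without writing anything to the file system. "Processing" here is just counting lines; the class and method names are hypothetical, and the archive is kept in memory only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipCrawlArchive {
    /** Packs URI -> content pairs into a ZIP archive, using the URL-encoded URI as entry name. */
    static byte[] pack(Map<String, String> documents) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                for (Map.Entry<String, String> doc : documents.entrySet()) {
                    zip.putNextEntry(new ZipEntry(URLEncoder.encode(doc.getKey(), "UTF-8")));
                    zip.write(doc.getValue().getBytes(StandardCharsets.UTF_8));
                    zip.closeEntry();
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams over the archive and counts the lines of each entry, keyed by the decoded URI. */
    static Map<String, Integer> countLines(byte[] archive) {
        Map<String, Integer> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String uri = URLDecoder.decode(entry.getName(), "UTF-8");
                int lines = 0;
                int c;
                // read() returns -1 at the end of the current entry, not of the whole archive
                while ((c = zip.read()) != -1) {
                    if (c == '\n') {
                        lines++;
                    }
                }
                result.put(uri, lines);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("http://challenge.semanticweb.org/somefile.rdf", "line one\nline two\n");
        System.out.println(countLines(pack(docs)));
    }
}
```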
Regards,
Andreas.
--
http://harth.org/andreas/
Hi Andreas,
I like this solution as well. The only thing I'm slightly worried about
now is what happens when you unzip a large number of files. My extended
suggestion is thus to take the SHA1 sum of the URL and create
subdirectories based on that, say three levels deep. For example, take a
file with URL
URL = http://challenge.semanticweb.org/somefile.rdf
Now we could take the checksum of the URL or the checksum of the contents:
checksum = ABCDEFG0123456789
and the file would go in directory
/A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
If we take the checksum on the contents of the file and create enough
levels, we can also make sure that files that are duplicates end up in
the same subdirectory regardless of the URL.
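A minimal sketch of this scheme, assuming one uppercase hex digit of the checksum per directory level and the checksum taken over the file contents. The names are hypothetical; note also that java.net.URLEncoder leaves '.' unescaped, so the file name differs slightly from the hand-written example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

class ShardDirectories {
    /** Returns the SHA-1 digest of the data as 40 uppercase hex characters. */
    static String sha1Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString().toUpperCase(Locale.ROOT);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Builds a directory such as /A/B/C from the first 'levels' hex digits of the checksum. */
    static String shardDir(byte[] key, int levels) {
        String hex = sha1Hex(key);
        StringBuilder dir = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            dir.append('/').append(hex.charAt(i));
        }
        return dir.toString();
    }

    /** Places the file in the directory derived from its content, named by its URL-encoded URL. */
    static String pathFor(String url, byte[] content, int levels) {
        try {
            return shardDir(content, levels) + "/" + URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "same content".getBytes(StandardCharsets.UTF_8);
        // Two different URLs with identical content land in the same subdirectory
        System.out.println(pathFor("http://challenge.semanticweb.org/somefile.rdf", content, 3));
        System.out.println(pathFor("http://example.org/mirror/somefile.rdf", content, 3));
    }
}
```

Hashing the URL instead of the content only requires passing the URL bytes as the key; duplicates would then no longer be grouped together.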
What do you think?
Best,
Peter
Ian, good point, we will work hard to make sure all the data is freely sharable and displayable, having a good license that makes that clear would make a lot of sense - JH
On Feb 2, 2008, at 10:12 AM, Ian Davis wrote:
What licensing terms will the data be issued under? I encourage this project to adopt the ODC Public Domain Dedication and Licence, a licence that Talis and others developed in conjunction with Science Commons:
We expect this licence to be out of beta in a couple of weeks.
Our goal in developing the licence was to convert the web of data into a web of _useable_ data.
Ian
On Fri, 2008-02-01 at 18:23 +0100, Peter Mika wrote:
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>
"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein
Following some inquiries, I'd like to clarify that it's not the main
Sindice infrastructure that would provide a SPARQL endpoint (e.g. over
the entire dataset); it's just one of the machines we use for it that
can be set up for that.
Giovanni
--- In billiontriples@yahoogroups.com, "gtummarello"
<g.tummarello@...> wrote:
>
> Hi,
>
> we can provide one such data hosting within Sindice.
> let us know when/how/what and we'll get it running.
>
> Giovanni
>
What licensing terms will the data be issued under? I encourage this
project to adopt the ODC Public Domain Dedication and Licence, a licence
that Talis and others developed in conjunction with Science Commons:
http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/
and the community norms:
http://www.opendatacommons.org/odc-community-norms/
We expect this licence to be out of beta in a couple of weeks.
Our goal in developing the licence was to convert the web of data into a
web of _useable_ data.
Ian
On Fri, 2008-02-01 at 18:23 +0100, Peter Mika wrote:
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>
>
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>
Jim & Peter,
As we do with DBpedia[1][2], we are happy to be one of hopefully numerous RDF
data store providers for this effort.
Count OpenLink Software in re. our Virtuoso Quad Store [3] :-)
Links:
1. http://dbpedia.org
2. http://www4.wiwiss.fu-berlin.de/benchmarks-200801/
3. http://en.wikipedia.org/wiki/Virtuoso_Universal_Server
Kingsley Idehen
Hi,
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
> The goal is basically to find a way to transfer quints, i.e. RDF triples
> with provenance and timestamp. I will start by proposing two
> alternatives (with variations) and then leave the floor to others.
>
> Here we go:
>
> Each Semantic Web Document (SWD) is stored in a single file using Turtle
> format. The files are zipped together to form a single file.
as already discussed, I'd prefer this solution. Filenames in the ZIP
archive are the url-encoded URI of the file. Actually, ZIPs do also
preserve the timestamp, so the ZIP archive contains all the
information you require.
Regards,
Andreas.
Hi,
We can provide one such data hosting service within Sindice.
Let us know when/how/what and we'll get it running.
Giovanni
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>
I'm new to this discussion list. To introduce myself, I'm Marc-Alexandre Nolin from the Bio2RDF project (http://bio2rdf.org). Is the Billion Triples Challenge just about having a billion triples, about a billion triples that we can query in a single triple store, or about a billion triples that we can query across multiple triple stores?
I have a way to generate a huge amount of RDF from genomics data. Our current triple store, Sesame, can't hold that much, and we are in the process of installing and moving to Virtuoso in the hope that it can hold all of it.
I will keep you posted on the number of triples I manage to put in it.
My two cents: In the spirit of RDF, why not provide a 'directory'
triple file that has resources identifying each file and provides
timestamps, provenance etc as properties? Do we want the ability to
query statements on provenance information?
To introduce myself, I'm a PhD student, working with Joel Saltz at The
Ohio State University. I have developed a research prototype of a
parallel semantic engine that can run in a cluster setting. I am
hoping to use the billion triples challenge to see how it measures up.
I apologize in advance if I say something nonsensical :)
--Sivaramakrishnan
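The 'directory' idea above could be sketched roughly as follows. This only illustrates the shape of such a file: the choice of dcterms:source and dcterms:date as the provenance and timestamp properties is an assumption, as are the function name and the example file names. It also assumes file names and URLs contain no characters that would need escaping inside <...>.

```python
from datetime import datetime, timezone

PREFIXES = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def directory_turtle(entries):
    """Render (file_name, source_url, fetched_at) tuples as one Turtle document.

    Each file in the archive becomes a resource whose properties record
    where it came from and when it was fetched.
    """
    blocks = [PREFIXES]
    for file_name, source_url, fetched_at in entries:
        blocks.append(
            f"<{file_name}> dcterms:source <{source_url}> ;\n"
            f'    dcterms:date "{fetched_at.isoformat()}"^^xsd:dateTime .\n'
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(directory_turtle([
        ("doc1.ttl", "http://example.org/foaf.rdf",
         datetime(2008, 3, 14, 9, 26, 53, tzinfo=timezone.utc)),
    ]))
```

Because the directory is itself RDF, it can be loaded into the same store as the data, which is what would make provenance statements queryable.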
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> In the past few days we had talked to several of you about providing
> data for the billion triples challenge. I would like to start a brief
> discussion on the data format that we intend to use to provide data and
> to exchange the billion triples data set. This discussion is relevant
> for those who would be providing data and those would like to host this
> data set; we are hoping that the majority of participants will be able
> to rely on these hosting services to build applications.
>
> The goal is basically to find a way to transfer quints, i.e. RDF triples
> with provenance and timestamp. I will start by proposing two
> alternatives (with variations) and then leave the floor to others.
>
> Here we go:
>
> Each Semantic Web Document (SWD) is stored in a single file using Turtle
> format. The files are zipped together to form a single file.
>
> Alternative #1: Each SWD file is named using the SHA1 hash of the URL
> identifying the provenance of the file. There is a separate file linking
> such hash codes to the full URLs and timestamps.
>
> Alternative #2: The name of the file is irrelevant: provenance and
> timestamp are encoded as comments in Turtle.
>
> Comments, suggestions?
>
> Thanks,
> Peter
>
Dear All,
We are looking for persons or organizations who would like to offer
their help in hosting the Billion Triples data set. This is an important
part of the Challenge in that we expect many more developers who would
like to work with the data, but themselves do not have the means to host
the data. The organizers reserve the right to award the fastest and the
most reliable hosting service based on feedback from the participants
using their service.
The minimal criteria for hosting is to provide a SPARQL endpoint to the
dataset and an email address for support (with a maximum response time
of 24 hours). Hosting locations will be posted on the Semantic Web
Challenge website.
Thanks,
Jim and Peter
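For participants relying on a hosted copy, access through a SPARQL endpoint might look like the sketch below. It assumes the endpoint accepts the standard SPARQL protocol GET form (a "query" parameter) and can return JSON results; the endpoint URL in the test and the function names are hypothetical, since no hosting locations have been posted yet.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request(endpoint_url, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint_url + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(endpoint_url, query, timeout=30):
    """Send the query and return the list of result bindings.

    Requires network access and a live endpoint.
    """
    with urlopen(build_request(endpoint_url, query), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```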
Dear All,
In the past few days we had talked to several of you about providing
data for the billion triples challenge. I would like to start a brief
discussion on the data format that we intend to use to provide data and
to exchange the billion triples data set. This discussion is relevant
for those who would be providing data and those would like to host this
data set; we are hoping that the majority of participants will be able
to rely on these hosting services to build applications.
The goal is basically to find a way to transfer quints, i.e. RDF triples
with provenance and timestamp. I will start by proposing two
alternatives (with variations) and then leave the floor to others.
Here we go:
Each Semantic Web Document (SWD) is stored in a single file using Turtle
format. The files are zipped together to form a single file.
Alternative #1: Each SWD file is named using the SHA1 hash of the URL
identifying the provenance of the file. There is a separate file linking
such hash codes to the full URLs and timestamps.
Alternative #2: The name of the file is irrelevant: provenance and
timestamp are encoded as comments in Turtle.
Comments, suggestions?
Thanks,
Peter
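To make the two alternatives concrete, here is a rough sketch of each. The ".ttl" suffix, the tab-separated index layout, and the "# source:" / "# fetched:" comment labels are assumptions for illustration; the proposal itself does not fix any of them.

```python
import hashlib


def sha1_file_name(url):
    """Alternative #1: name the file after the SHA1 hash of its source URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".ttl"


def index_row(url, timestamp):
    """One row of the separate index file: hash-based name, full URL, timestamp."""
    return "\t".join([sha1_file_name(url), url, timestamp])


def with_provenance_comments(turtle_text, url, timestamp):
    """Alternative #2: keep provenance inside the file as leading Turtle comments."""
    return f"# source: {url}\n# fetched: {timestamp}\n{turtle_text}"


def read_provenance_comments(text):
    """Recover the comment fields written by with_provenance_comments."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("# "):
            break  # stop at the first non-comment line
        key, _, value = line[2:].partition(": ")
        fields[key] = value
    return fields
```

One practical difference: Turtle parsers discard comments, so Alternative #2 needs a separate reading step like read_provenance_comments, whereas Alternative #1 keeps the provenance outside the Turtle files entirely.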
Sören,
Hardly empty. I think "pre-release" is the better term.
The project documentation site is:
http://www.bigdata.com/projects/
The RDF layer documentation is:
http://www.bigdata.com/projects/multiproject/bigdata-rdf/index.html
The store is currently a triple store with a Sesame 1.x integration
scaling well past the 1B triple point with RDFS + owl:sameAs and
friends, truth maintenance, etc.
-bryan
--- In billiontriples@yahoogroups.com, Sören Auer <auer@...> wrote:
>
> thompsonbry wrote:
> > I would suggest that 1B is hardly the challenge point. Let's try 10B
> > or more.
> > [1] http://www.sourceforge.net/projects/bigdata
>
> Regarding the empty SourceForge project you are quite bold ;-)
>
> Good luck anyway!
>
> Sören
>
thompsonbry wrote:
> I would suggest that 1B is hardly the challenge point. Let's try 10B
> or more.
> [1] http://www.sourceforge.net/projects/bigdata
Regarding the empty SourceForge project you are quite bold ;-)
Good luck anyway!
Sören
Hello,
I am with SYSTAP, LLC. Among other things, we are interested in scale-
out databases and their applications to RDF and Topic maps. We have an
open source project, bigdata(R) [1], based on a scale-out database
architecture. The project is in pre-release, but the RDF layer
supports inference, etc. at the 1B+ statements level on a single host.
I would suggest that 1B is hardly the challenge point. Let's try 10B
or more.
[1] http://www.sourceforge.net/projects/bigdata
Thanks,
-bryan