billiontriples · The Billion Triples Challenge
#68 From: "crossthelimit" <joshiamitkrishna@...>
Date: Thu Mar 27, 2008 11:26 am
Subject: Re: From collection to use -- thinking about the competition
crossthelimit
 
Here are my views:

Triple Store:
The big problem with the Semantic Web, no matter how big the promises
it makes, is the number of triples that can be stored and processed.
As the number of triples increases, developers run into resource
problems. So the question is: how can I work with a billion triples?
I am not backed by an organization that can give me the resources for
working on the big B of billion. Do we have a sandbox?

System requirements and benchmarking criteria are not clear.

Linked data:

This is most probably the best part of realizing the Semantic Web, and
I hope some killer apps will be developed that make people think
"This is the reason to shift to the Semantic Web!" Until now, the
Semantic Web has been just academic hype.

Reasoning:
Reasoning comes after the triple store. Resource problem again!

Ontology Research:
Could you fine-tune this section? Is it the creation of new
ontologies, the creation of a new language, or something else?

---
Amit Krishna Joshi



#67 From: Jim Hendler <hendler@...>
Date: Wed Mar 26, 2008 5:23 pm
Subject: From collection to use -- thinking about the competition
james.hendler
 
All-
   Peter feels that we now have the collection and distribution of the
triples underway, which means he gets to make me do some work finally...
   My role at the moment is to figure out what we would like the
challenge part of the challenge to be.
   Here are some thoughts; I welcome feedback.
   We see four very non-disjoint audiences for the challenge (in
fact, Peter, me, and most of the people on this list are in at least
several categories):
Triple store developers, linked data technology developers, Semantic
Web researchers interested in scalable reasoning, and ontology-based
research groups.

Here are some of my thoughts with respect to these:

A - Triple Store Developers
     We do not want this to be a "triple store shootout" in the sense
of who can process a query fastest or such.  We don't see that
competition as being all that useful at a time when people are still
very much in development mode.  Rather, we would like the outcome of
this event to be a realization in the outside world that triple-stores
can and do handle these sorts of numbers (the DB folks still say
"triple stores break at a million triples" at conferences I go to - I
have no idea where they get that, but let's push it up a few orders of
magnitude!!)
     So at the moment my thinking on this area is that we would like to
give you folks bragging rights for being able to support systems other
people develop (i.e. any of you who host this data and make it
available via SPARQL should be listed as "winners" in some way)
    I also think that if some interesting, large, and complex SPARQL
queries are developed against this dataset (say including filters and
optionals), then those would become useful benchmarks, so we would
like to find a way to encourage the sharing of these (maybe for a
future date when a benchmarking shootout would be more appropriate)

B - Linked data technology developers
   We write a lot about the Semantic Web as being the Web of linked
data, but to date, in practice, most of that data is either within an
enterprise or locked in a particular application.  We are purposely
designing this dataset to be very heterogeneous, but with many
connections between pieces, so it should be a great dataset for
showing off tools that can exploit the dataweb.
    In this area we are thinking of having some goals like "visualize
(or browse) the dataweb", data mining of this sort of data, etc. It
seems to us this is a ripe area for a challenge.

C - SW researchers interested in scalable reasoning
   The data set we are developing will include a (large) number of
triples tied to FOAF, DOAP and other "small o" ontologies.  We also
have a lot of data that will be made available that was crawled from
microformats (where the "semantics" are well specified).  This is thus
an ideal proving ground for the "little semantics goes a long way"
philosophy, and thus this also seems like an appropriate challenge area.

D - Ontology research
   Big A-Box, you got it!  Show us something.

So, I think we will have the "competition" be fairly unspecified - we
will identify several areas of interest from the above and work out
how to tie that into an "announceable" competition.

I welcome, NEED, your feedback on this
   -Jim H.




"If we knew what we were doing, it wouldn't be called research, would
it?" - Albert Einstein

Prof James Hendler    http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180

#65 From: "Jeff Pollock" <jeff.pollock@...>
Date: Mon Mar 3, 2008 6:30 pm
Subject: RE: Are Scalable Graph Data Applications Possible? A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications
jeff_pollock
 

REMINDER

http://www.sdforum.org/index.cfm?fuseaction=Page.viewPage&pageId=656&parentID=483&nodeID=1

SDForum Semantic Web SIG Event

Are Scalable Graph Data Applications Possible?
A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications

With the rising importance of data analytics, there is more evidence than ever that graph style data systems can achieve new benefits by making it easier to link and re-combine complex data. But the Achilles heel of graph style tuple storage has always been a lack of performance at scale. Will the Semantic Web and modern analytics finally drive innovation that makes these systems scalable? In this SDForum interactive panel discussion we will explore that question and more.

 

Join us for three unique presentations that will explore cutting-edge techniques for scalable RDF/OWL storage, and the kinds of applications that make use of those systems. First, we are honored to have representation from Vertica and the Massachusetts Institute of Technology to describe how columnar store (C-Store) data warehouse technology can enable large scale data graphs supporting billions of RDF triples. Next, we’ll get a peek at some GeoTemporal and Social Network Analysis applications based off the federated Java RDF database from Franz Technologies.  Finally, a short synopsis of Oracle’s various approaches for tuple-based storage (including in-memory, data grid, and Oracle Database RDF solutions) will be presented and tradeoffs discussed.

 

Our expert guests include Andy Palmer from Vertica, Samuel R. Madden from MIT, and Jans Aasman from Franz Technologies. Jeff Pollock from Oracle will moderate and will also present a short summary of technical approaches to scalable RDF systems.

 

Next Meeting:

6:30 PM - 9:00 PM, March 5, 2008
Cubberley Community Center
4000 Middlefield Rd., Room H-1
Palo Alto, CA 94105

Agenda:

6:30pm - 7:00pm     Registration / Networking / Refreshments / Pizza
7:00pm - 7:10pm     Community announcement
7:10pm - 7:50pm     The Vertica C-Store DBMS for Scalable RDF Persistence
7:50pm - 8:30pm     Franz Technologies RDF Applications
8:30pm - 8:45pm     Oracle Infrastructure for Tuple-based Graph Storage
8:45pm - 9:00pm(+)  Dedicated Q&A period

Oracle
Jeff T. Pollock | Senior Director | Direct:650-506-4700

100 Oracle Parkway, Redwood Shores, California 94065
Oracle Fusion Middleware | Main:800-ORACLE1 | Fax:801-607-6504 | Mobile:415-971-2223
Middleware | Data Integrator | jeff.pollock@... | Blog | LinkedIn Profile | Publications

 


#63 From: "Giovanni Tummarello" <g.tummarello@...>
Date: Fri Feb 29, 2008 12:11 pm
Subject: Identity and Reference on the Semantic Web (IRSW2008) at ESWC2008 - deadline approaching
gtummarello
 
** our apologies if you receive multiple copies of this message **

  ==================================================================

                         CALL FOR PAPERS

                        ESWC 2008 Workshop

        Identity and Reference on the Semantic Web (IRSW2008)
         --------------------------------------------
         Entity-centric Approaches to Information and
                Knowledge Management on the Web

                  Tenerife, Spain - June 1 2008

                 http://www.okkam.org/IRSW2008

  ==================================================================

  The recent developments of the Semantic Web - and the fast rise of Web
  2.0 applications - make it more and more evident that the problem of
  identity and reference through URIs is perhaps the single most
  important issue for fostering the Semantic Web on a global scale. In a
  nutshell: the effective use of the Semantic Web on a global scale
  requires the systematic reuse of stable and global URIs. This in turn
  requires that there exist decentralized agreement on how URIs can be
  used to identify and refer to the same object. So far, uniqueness of
  URIs and reference have often been taken for granted. Initiatives like
  Linked Data, OntoWorld and the large number of proposals aiming at
  using popular identifiers (e.g. Wikipedia's) as "canonical" URIs
  (especially for "real world" objects that aren't accessible on the
  Web) show that a solution to this issue is both urgent and relevant.

  Solving this issue would enable and foster the decentralized and open
  publication of data on the Semantic Web, would allow better and faster
  semantic search engines, would be the basis for a new generation of
  Semantic Web browsers, would start the development of smarter
  applications on the Web. Other vertical (and often commercial)
  initiatives (like XRIs, LSID, DOI, etc.) prove that there is also a
  practical and business potential in a standard solution.

  So far, there is little agreement on how this problem should be
  addressed and solved. On the one hand we need to address technical
  issues:

     * How do we make sure that people and applications can find
  and reuse pre-existing URIs for different types of entity?
     * Is HTTP the most appropriate addressing scheme for these URIs?
     * Should URIs for commonly identified entities, like people,
  organizations or countries, be managed by a central service? If so,
  under what conditions?
     * Are centralized registries of URIs for different types of
  entities necessary? Can such registries be built in a decentralized
  manner while still linking data?

  There are also issues of trust and security:

     * What if the same URI is used to make contradictory or undesired
  statements about an entity?
     * Do people or groups really want a single URI to be
  consistently used to represent knowledge about them on the Web, one
  that could be used to effectively gather data about them?
     * What is an acceptable level of security for any kind of URI registry?
     * Where is the boundary between describing entities and violating
  their privacy?

  Despite the high level of awareness in the community, the potential
  for the integration of information currently published on the Semantic
  Web is still mostly unexploited. FOAF profiles do not have canonical
  and reusable URIs for pointing to people one knows (only ad hoc
  solutions are available, like the email hashcode); the most popular
  ontology editors mint new URIs for any newly started OWL project;
  social networks are not easily portable.

  Starting from such a situation, this workshop aims at collecting
  contributions which can roughly be grouped as follows:

     * Foundations: formal and conceptual theories of identity and
  reference for the Semantic Web
     * Vision papers: visionary solutions to the problems of identity
  and reference
     * Project papers: descriptions of research & development projects
  in this area
     * Experiences: contributions from research and industry that
  illustrate case studies or approaches to deal with the issues of
  identity and reference
     * Critical viewpoints: discussions of advantages and disadvantages
  of previously proposed approaches.

  We especially encourage contributions from groups or organizations
  which are working on identification schemes for large semantic data
  collections, in order to compare the different practical solutions
  that have been developed to integrate Semantic Web data.

  Workshop's anticipated outcome:

  The anticipated outcome of the workshop is to assess the state of the
  art in the area, as well as to discuss the approach and critically
  evaluate the next steps in pursuing this topic. There is the potential
  for creating the core of a consortium for future R&D projects on the
  topic for both academia and industry.

  Submission Details
  ------------------

  All submissions will undergo a thorough peer-review process by
  an international program committee, made up of leading members of
  different communities from "Web 2.0", Semantic Web and Information
  Retrieval researchers and companies.

  Accepted contributions will be included on the ESWC2008
  Conference CD as well as made available as CEUR Online Proceedings.

  We invite submissions of two types:

    1. full papers (up to 15 pages in LNCS format)
    2. extended abstracts (up to 4 pages in LNCS format).

  The authors of accepted abstracts will be requested to produce a full
  paper by the time the camera-ready version is due. Accepted
  contributions will be presented at the workshop. Additionally, some
  submissions may be accepted as posters.

  Submissions should be formatted in Springer LNCS format
  (http://www.springer.de/comp/lncs/authors.html) and submitted in PDF
  format.
  The submission site can be reached through the webpage
  http://www.easychair.org/conferences/?conf=irsw2008

  Please note that at least one author of an accepted paper must
  register for the ESWC 2008 conference.

  Important Dates
  ---------------

     * Paper/abstract submission: March 7, 2008
     * Notification of acceptance: April 4, 2008
     * Camera ready Paper submission: April 18, 2008
     * Workshop: June 1, 2008


  Organization
  ------------
  Chair
   Paolo Bouquet, University of Trento
  Program Co-Chairs
   Heiko Stoermer, University of Trento
   Giovanni Tummarello, DERI Galway
   Harry Halpin, University of Edinburgh


  Program Committee:

  Karl Aberer             EPFL
  Chris Bizer             Freie Universität Berlin
  David Booth             HP
  Werner Ceusters         University of Buffalo
  Richard Cyganiak        DERI Galway
  Anita De Waard          Elsevier
  Stefan Decker           DERI Galway
  Hugh Glaser             University of Southampton
  Andreas Harth           DERI Galway
  Tom Heath               Talis Information Ltd
  Kingsley Idehen         OpenLink Software
  Pierre Levy             University of Ottawa
  Alexander Löser         SAP Research
  Antonio Mana            University of Malaga
  Christian Morbidoni     Universita' Politecnica delle Marche
  Claudia Niederée        L3S Research Center
  Alan Ruttenberg         Science Commons US
  Matthias Samwald        DERI Galway
  Leo Sauermann           DFKI
  Henry Thompson          University of Edinburgh UK
  Marco Varone            ExpertSystem    IT
  Bernard Vatant          Mondeca FR

#62 From: "crossthelimit" <joshiamitkrishna@...>
Date: Thu Feb 28, 2008 4:10 pm
Subject: Re: Data format
crossthelimit
 
Thanks for the info.

-
Amit


#61 From: Peter Mika <pmika@...>
Date: Thu Feb 28, 2008 3:55 pm
Subject: Re: Re: Data format
serendipity588
 
Hi Amit,

No, I don't, as I'm not familiar with Jena. But basically the
MeasurableInputStream that you get as a result of the
response.contentAsStream() call on line 143 is an ordinary Java
InputStream that you can process further with any API.

Best,
Peter
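
A minimal sketch of that last step, written in Python rather than Java
purely for brevity: once a record's content is available as a plain byte
stream, N-Triples can be read one line at a time. The name iter_triples
and the regular expression are illustrative assumptions, not part of the
LAW API or Jena, and the pattern covers only the common N-Triples forms
(IRIs, blank nodes, and plain, language-tagged or typed literals), not
the full grammar.

```python
import io
import re

# Simplified term patterns (an assumption of this sketch, not the full grammar).
_IRI = r"<[^>]*>"
_BNODE = r"_:\S+"
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'

TRIPLE = re.compile(
    rf"^({_IRI}|{_BNODE})\s+({_IRI})\s+"
    rf"({_IRI}|{_BNODE}|{_LITERAL})\s*\.$"
)


def iter_triples(stream):
    """Yield (subject, predicate, object) strings from a binary N-Triples stream."""
    for line_number, raw in enumerate(stream, start=1):
        line = raw.decode("utf-8").strip()
        # Blank lines and comment lines carry no triple.
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"line {line_number} is not a recognised triple: {line!r}")
        yield match.groups()


# Usage: any file-like object that yields bytes lines will do.
sample = io.BytesIO(
    b'<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice"@en .\n'
    b"# a comment line\n"
    b"_:b1 <http://xmlns.com/foaf/0.1/knows> <http://example.org/a> .\n"
)
for subject, predicate, obj in iter_triples(sample):
    print(subject, predicate, obj)
```

Because the function is a generator, triples are handed to the caller
one at a time, so memory use does not grow with the size of the record.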

crossthelimit wrote:
>
> Hello Peter,
> Do we have any code written in Jena?
>
> -
> Amit
>
> --- In billiontriples@yahoogroups.com, Jans Aasman <ja@...> wrote:
> >
> > thanks for the clarification, jans
> >
> > Peter Mika wrote:
> > >
> > > Hi Jans,
> > >
> > > The plan is to have the entire dataset available for download in the
> > > WARC format as a set of files. (Some users may have limitations
> > > storing files larger than 2GB.)
> > >
> > > The WARC format is a general format for storing the results of crawls.
> > > It contains a header with the metadata and the HTTP response. The
> > > example I've sent recreates the HTTP response, which you need to do if
> > > you only have the content. (You can also store metadata in the HTTP
> > > Response headers.)
> > >
> > > 100 million triples on our side seems to compress to about 3 GB.
> > >
> > > Best,
> > > Peter
> > >
> > > jans.aasman wrote:
> > > >
> > > > Hi Peter, I'm not entirely sure what you are going to give us access
> > > > to. You (if everything goes right at Yahoo) will give us access to a
> > > > 100 G crawl in ntriples but the format of the triples is based on
> > > > Warc? Jans
> > > >
> > > > Peter Mika wrote:
> > > >
> > > >> Dear All,
> > > >>
> > > >> After some long and careful consideration, we have made the decision
> > > >> not to invent our own format for exchanging data but to rely on an
> > > >> existing format known as WARC [1], in particular WARC version 0.9.
> > > >> WARC archives store provenance (URL) and timestamp in the header. The
> > > >> only additional agreement we need to make is that we are going to
> > > >> encode files in N-Triples format. (If that is a problem, let us know.)
> > > >>
> > > >> What convinced us ultimately about WARC is the excellent tool support
> > > >> in the form of a Java API from the Laboratory for Web Algorithmics [2]
> > > >> of the Università degli studi di Milano <http://www.unimi.it/>. The
> > > >> API can be downloaded from [3] and there is a separate tarball with
> > > >> all the dependencies. (The license is LGPL.) One of the nice features
> > > >> of this API is the ability to work with streams of compressed WARC
> > > >> records, where metadata about each record is stored in the gzip
> > > >> header. This means that the metadata can be read without
> > > >> uncompressing the content of the record itself. Further, there are
> > > >> skip pointers in the file, which means that a record can be easily
> > > >> skipped over.
> > > >>
> > > >> To make it really easy, I've also created sample code that
> > > >> demonstrates how to create WARC archives from a set of files or a
> > > >> directory structure on disk, and how to read back the resulting WARC
> > > >> archive. The code is simply attached to this email, if all is well.
> > > >> (First time I send attachments to a Y! Group.) Many thanks to
> > > >> Sebastiano Vigna, one of the authors of the LAW API, for his help
> > > >> and advice.
> > > >>
> > > >> To support the Challenge, we at Yahoo! Research Barcelona are also
> > > >> hard at work to get permission to release a microformat crawl of
> > > >> 100 million triples. We hope this will be a significant contribution
> > > >> to the state-of-the-art and will complement the existing data sets
> > > >> to be provided by Semantic Web search engines.
> > > >>
> > > >> As always, your comments and questions are more than appreciated. In
> > > >> particular those of you planning to provide some data, please let us
> > > >> know if you need any further help.
> > > >>
> > > >> Thanks,
> > > >> Peter
> > > >>
> > > >> [1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
> > > >> [2] http://law.dsi.unimi.it/
> > > >> [3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
> > > >>
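
The per-record compression described in the quoted announcement can be
illustrated with a toy layout. This is a sketch under stated
assumptions, not the LAW API's actual on-disk format: each record
becomes its own gzip member, and a hypothetical "sk" subfield in the
gzip FEXTRA header carries the compressed length (acting as a skip
pointer) followed by a metadata string such as the source URL. The
names write_member and scan_metadata are made up for this example.

```python
import gzip
import struct
import zlib

# Hypothetical subfield ID for this sketch; not taken from WARC or the LAW API.
SUBFIELD_ID = b"sk"


def write_member(payload: bytes, metadata: bytes) -> bytes:
    """Build one gzip member whose FEXTRA field holds a skip length and metadata."""
    deflater = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate stream
    body = deflater.compress(payload) + deflater.flush()
    sub_data = struct.pack("<I", len(body)) + metadata
    extra = SUBFIELD_ID + struct.pack("<H", len(sub_data)) + sub_data
    header = (
        b"\x1f\x8b\x08\x04"                # magic, CM=deflate, FLG=FEXTRA
        + struct.pack("<IBB", 0, 0, 255)   # MTIME=0, XFL=0, OS=unknown
        + struct.pack("<H", len(extra))    # XLEN
        + extra
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def scan_metadata(archive: bytes):
    """Yield (metadata, member_offset) pairs without inflating any payload."""
    pos = 0
    while pos < len(archive):
        if archive[pos:pos + 2] != b"\x1f\x8b" or not archive[pos + 3] & 0x04:
            raise ValueError(f"no gzip member with FEXTRA at offset {pos}")
        (xlen,) = struct.unpack_from("<H", archive, pos + 10)
        extra_start = pos + 12
        if archive[extra_start:extra_start + 2] != SUBFIELD_ID:
            raise ValueError("unexpected FEXTRA subfield")
        (sub_len,) = struct.unpack_from("<H", archive, extra_start + 2)
        (body_len,) = struct.unpack_from("<I", archive, extra_start + 4)
        metadata = archive[extra_start + 8:extra_start + 4 + sub_len]
        yield metadata, pos
        # Jump over the deflate body and the 8-byte CRC32/ISIZE trailer.
        pos = extra_start + xlen + body_len + 8


record_1 = b"<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
record_2 = b"<http://example.org/b> <http://example.org/p> <http://example.org/c> .\n"
archive = write_member(record_1, b"http://example.org/doc1") + write_member(
    record_2, b"http://example.org/doc2"
)
for metadata, offset in scan_metadata(archive):
    print(offset, metadata.decode("ascii"))

# The result is still a valid multi-member gzip stream for ordinary tools.
print(gzip.decompress(archive) == record_1 + record_2)
```

The design point is that a reader interested only in provenance touches
a few header bytes per record and seeks past the rest, which is the
behaviour the announcement attributes to the metadata-in-header and
skip-pointer features.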

#60 From: "crossthelimit" <joshiamitkrishna@...>
Date: Thu Feb 28, 2008 3:49 pm
Subject: Re: Data format
crossthelimit
 
Hello Peter,
  Do we have any code written in Jena?

-
Amit

--- In billiontriples@yahoogroups.com, Jans Aasman <ja@...> wrote:
>
> thanks for the clarification, jans
>
> Peter Mika wrote:
> >
> > Hi Jans,
> >
> > The plan is to have the entire dataset available for download in the
> > WARC format as a set of files. (Some users may have limitations
> > storing files larger than 2GB.)
> >
> > The WARC format is a general format for storing the results of crawls.
> > It contains a header with the metadata and the HTTP response. The
> > example I've sent recreates the HTTP response, which you need to do if
> > you only have the content. (You can also store metadata in the HTTP
> > Response headers.)
> >
> > 100 million triples on our side seems to compress to about 3 GB.
> >
> > Best,
> > Peter
> >
> > jans.aasman wrote:
> > >
> > > Hi Peter, I'm not entirely sure what you are going to give us access
> > > to. You (if everything goes right at Yahoo) will give us access to a
> > > 100 G crawl in ntriples but the format of the triples is based on
> > > Warc? Jans
> > >
> > > Peter Mika wrote:
> > >
> > >> Dear All,
> > >>
> > >> After some long and careful consideration, we have made the decision
> > >> not to invent our own format for exchanging data but to rely on an
> > >> existing format known as WARC [1], in particular WARC version 0.9.
> > >> WARC archives store provenance (URL) and timestamp in the header. The
> > >> only additional agreement we need to make is that we are going to
> > >> encode files in N-Triples format. (If that is a problem, let us know.)
> > >>
> > >> What convinced us ultimately about WARC is the excellent tool support
> > >> in the form of a Java API from the Laboratory for Web Algorithmics [2]
> > >> of the Università degli studi di Milano <http://www.unimi.it/>. The
> > >> API can be downloaded from [3] and there is a separate tarball with
> > >> all the dependencies. (The license is LGPL.) One of the nice features
> > >> of this API is the ability to work with streams of compressed WARC
> > >> records, where metadata about each record is stored in the gzip
> > >> header. This means that the metadata can be read without
> > >> uncompressing the content of
> > >> the record itself. Further, there are skip pointers in the
file, which
> > >> means that a record can be easily skipped over.
> > >>
> > >> To make it really easy, I've also created sample code that
demonstrates
> > >> how to create WARC archives from a set of files or a
directory
> > structure
> > >> on disk, and how to read back the resulting WARC archive. The
code is
> > >> simply attached to this email, if all is well. (First time I
send
> > >> attachments to a Y! Group.) Many thanks to Sebastiano Vigna,
one of the
> > >> authors of the LAW API, for his help and advice.
> > >>
> > >> To support the Challenge, we at Yahoo! Research Barcelona are
also hard
> > >> at work to get permission to release a microformat crawl of
100 million
> > >> triples. We hope this will be a significant contribution to
the
> > >> state-of-the- art and will complement the existing data sets
to be
> > >> provided by Semantic Web search engines.
> > >>
> > >> As always, your comments and questions are more than
appreciated. In
> > >> particular those of you planning to provide some data, please
let us
> > >> know if you need any further help.
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >> [1]
> > >> http://archive- access.sourcefor ge.net/warc/ warc_file_
format-0.
> > 9.html
> > <http://archive-access.sourceforge.net/warc/warc_file_format-
0.9.html>
> > >> <http://archive- access.sourcefor ge.net/warc/ warc_file_
format-0.
> > 9.html
> > <http://archive-access.sourceforge.net/warc/warc_file_format-
0.9.html>>
> > >> [2] http://law.dsi. unimi.it/ <http://law.dsi.unimi.it/>
> > <http://law.dsi. unimi.it/ <http://law.dsi.unimi.it/>>
> > >> [3]
> > >> http://law.dsi. unimi.it/ index.php? option=com_
content&task=
> > section&id= 5&Itemid= 42
> > <http://law.dsi.unimi.it/index.php?
option=com_content&task=section&id=5&Itemid=42>
> >
> > >> <http://law.dsi. unimi.it/ index.php? option=com_
content&task=
> > section&id= 5&Itemid= 42
> > <http://law.dsi.unimi.it/index.php?
option=com_content&task=section&id=5&Itemid=42>>
> > >>
> > >
> >
> >
>

#59 From: Jans Aasman <ja@...>
Date: Wed Feb 27, 2008 9:04 pm
Subject: Re: Data format
jannesaasman
 
thanks for the clarification, jans

Peter Mika wrote:
>
> Hi Jans,
>
> The plan is to have the entire dataset available for download in the
> WARC format as a set of files. (Some users may have limitations storing
> files larger than 2GB.)
>
> The WARC format is a general format for storing the results of crawls.
> It contains a header with the metadata and the HTTP response. The
> example I've sent recreates the HTTP response, which you need to do if
> you only have the content. (You can also store metadata in the HTTP
> Response headers.)
>
> 100 million triples on our side seems to compress to about 3 GB.
>
> Best,
> Peter
>
> jans.aasman wrote:
> >
> > Hi Peter, I'm not entirely sure what you are going to give us access
> > to. You (if everything goes right at Yahoo) will give us access to a
> > 100 G crawl in ntriples but the format of the triples is based on
> > Warc? Jans
> >
> > Peter Mika wrote:
> >
> >> Dear All,
> >>
> >> After some long and careful consideration, we have made the decision not
> >> to invent our own format for exchanging data but to rely on an existing
> >> format known as WARC [1], in particular WARC version 0.9. WARC archives
> >> store provenance (URL) and timestamp in the header. The only additional
> >> agreement we need to make is that we are going to encode files in
> >> N-Triples format. (If that is a problem, let us know.)
> >>
> >> What convinced us ultimately about WARC is the excellent tool support in
> >> the form of a Java API from the Laboratory for Web Algorithmics [2] of
> >> the Università degli studi di Milano <http://www.unimi.it/>. The API can
> >> be downloaded from [3] and there is a separate tarball with all the
> >> dependencies. (The license in LGPL). One of the nice features of this
> >> API is the ability to work with streams of compressed WARC records,
> >> where metadata about each record is stored in the gzip header. This
> >> means that the metadata can be read without uncompressing the content of
> >> the record itself. Further, there are skip pointers in the file, which
> >> means that a record can be easily skipped over.
> >>
> >> To make it really easy, I've also created sample code that demonstrates
> >> how to create WARC archives from a set of files or a directory structure
> >> on disk, and how to read back the resulting WARC archive. The code is
> >> simply attached to this email, if all is well. (First time I send
> >> attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
> >> authors of the LAW API, for his help and advice.
> >>
> >> To support the Challenge, we at Yahoo! Research Barcelona are also hard
> >> at work to get permission to release a microformat crawl of 100 million
> >> triples. We hope this will be a significant contribution to the
> >> state-of-the-art and will complement the existing data sets to be
> >> provided by Semantic Web search engines.
> >>
> >> As always, your comments and questions are more than appreciated. In
> >> particular those of you planning to provide some data, please let us
> >> know if you need any further help.
> >>
> >> Thanks,
> >> Peter
> >>
> >> [1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
> >> [2] http://law.dsi.unimi.it/
> >> [3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
> >>
> >
>
>

#58 From: Peter Mika <pmika@...>
Date: Wed Feb 27, 2008 4:41 pm
Subject: Re: Data format
serendipity588
 
Hi Jans,

The plan is to have the entire dataset available for download in the
WARC format as a set of files. (Some users may have limitations storing
files larger than 2GB.)

The WARC format is a general format for storing the results of crawls.
It contains a header with the metadata and the HTTP response. The
example I've sent recreates the HTTP response, which you need to do if
you only have the content. (You can also store metadata in the HTTP
Response headers.)

100 million triples on our side seems to compress to about 3 GB.
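
As a rough back-of-the-envelope check of what these figures imply for a billion triples, here is a small sketch. It assumes that the compressed size grows linearly with the number of triples and that "GB" means 2^30 bytes; neither assumption is stated above, and the class and method names are made up for this illustration.

```java
final class DumpSizeEstimate {

    static final long GIB = 1L << 30;

    // Average compressed bytes per triple for a sample of known size.
    static double bytesPerTriple(long compressedBytes, long triples) {
        return (double) compressedBytes / triples;
    }

    // Smallest number of files such that no file exceeds maxFileBytes.
    static long filesNeeded(long totalBytes, long maxFileBytes) {
        return (totalBytes + maxFileBytes - 1) / maxFileBytes;
    }

    public static void main(String[] args) {
        long sampleBytes = 3 * GIB;             // "about 3 GB"
        long sampleTriples = 100_000_000L;      // "100 million triples"
        long targetTriples = 1_000_000_000L;    // the billion of the challenge

        // Assumption: ten times the triples need ten times the space.
        long totalBytes = sampleBytes * (targetTriples / sampleTriples);

        System.out.println(bytesPerTriple(sampleBytes, sampleTriples) + " bytes per triple");
        System.out.println((totalBytes / GIB) + " GiB in total");
        System.out.println(filesNeeded(totalBytes, 2 * GIB) + " files of at most 2 GiB each");
    }
}
```

Under these assumptions a triple costs roughly 32 bytes compressed, a billion triples come to about 30 GiB, and that splits into 15 files of exactly 2 GiB, so in practice somewhat more files of a smaller size.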

Best,
Peter



jans.aasman wrote:
>
> Hi Peter, I'm not entirely sure what you are going to give us access
> to. You (if everything goes right at Yahoo) will give us access to a
> 100 G crawl in ntriples but the format of the triples is based on
> Warc? Jans
>
> Peter Mika wrote:
>
>> Dear All,
>>
>> After some long and careful consideration, we have made the decision not
>> to invent our own format for exchanging data but to rely on an existing
>> format known as WARC [1], in particular WARC version 0.9. WARC archives
>> store provenance (URL) and timestamp in the header. The only additional
>> agreement we need to make is that we are going to encode files in
>> N-Triples format. (If that is a problem, let us know.)
>>
>> What convinced us ultimately about WARC is the excellent tool support in
>> the form of a Java API from the Laboratory for Web Algorithmics [2] of
>> the Università degli studi di Milano <http://www.unimi.it/>. The API can
>> be downloaded from [3] and there is a separate tarball with all the
>> dependencies. (The license in LGPL). One of the nice features of this
>> API is the ability to work with streams of compressed WARC records,
>> where metadata about each record is stored in the gzip header. This
>> means that the metadata can be read without uncompressing the content of
>> the record itself. Further, there are skip pointers in the file, which
>> means that a record can be easily skipped over.
>>
>> To make it really easy, I've also created sample code that demonstrates
>> how to create WARC archives from a set of files or a directory structure
>> on disk, and how to read back the resulting WARC archive. The code is
>> simply attached to this email, if all is well. (First time I send
>> attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
>> authors of the LAW API, for his help and advice.
>>
>> To support the Challenge, we at Yahoo! Research Barcelona are also hard
>> at work to get permission to release a microformat crawl of 100 million
>> triples. We hope this will be a significant contribution to the
>> state-of-the-art and will complement the existing data sets to be
>> provided by Semantic Web search engines.
>>
>> As always, your comments and questions are more than appreciated. In
>> particular those of you planning to provide some data, please let us
>> know if you need any further help.
>>
>> Thanks,
>> Peter
>>
>> [1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
>> [2] http://law.dsi.unimi.it/
>> [3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
>>
>

#57 From: "jans.aasman" <ja@...>
Date: Wed Feb 27, 2008 4:30 pm
Subject: Re: Data format
jannesaasman
 
Hi Peter, I'm not entirely sure what you are going to give us access to. You (if everything goes right at Yahoo) will give us access to a 100 G crawl in ntriples but the format of the triples is based on Warc? Jans

Peter Mika wrote:

Dear All,

After some long and careful consideration, we have made the decision not
to invent our own format for exchanging data but to rely on an existing
format known as WARC [1], in particular WARC version 0.9. WARC archives
store provenance (URL) and timestamp in the header. The only additional
agreement we need to make is that we are going to encode files in
N-Triples format. (If that is a problem, let us know.)

What convinced us ultimately about WARC is the excellent tool support in
the form of a Java API from the Laboratory for Web Algorithmics [2] of
the Università degli studi di Milano <http://www.unimi.it/>. The API can
be downloaded from [3] and there is a separate tarball with all the
dependencies. (The license in LGPL). One of the nice features of this
API is the ability to work with streams of compressed WARC records,
where metadata about each record is stored in the gzip header. This
means that the metadata can be read without uncompressing the content of
the record itself. Further, there are skip pointers in the file, which
means that a record can be easily skipped over.

To make it really easy, I've also created sample code that demonstrates
how to create WARC archives from a set of files or a directory structure
on disk, and how to read back the resulting WARC archive. The code is
simply attached to this email, if all is well. (First time I send
attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
authors of the LAW API, for his help and advice.

To support the Challenge, we at Yahoo! Research Barcelona are also hard
at work to get permission to release a microformat crawl of 100 million
triples. We hope this will be a significant contribution to the
state-of-the-art and will complement the existing data sets to be
provided by Semantic Web search engines.

As always, your comments and questions are more than appreciated. In
particular those of you planning to provide some data, please let us
know if you need any further help.

Thanks,
Peter

[1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
[2] http://law.dsi.unimi.it/
[3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42


#56 From: Peter Mika <pmika@...>
Date: Wed Feb 27, 2008 3:13 pm
Subject: Data format
serendipity588
 
Dear All,

After some long and careful consideration, we have made the decision not
to invent our own format for exchanging data but to rely on an existing
format known as WARC [1], in particular WARC version 0.9. WARC archives
store provenance (URL) and timestamp in the header. The only additional
agreement we need to make is that we are going to encode files in
N-Triples format. (If that is a problem, let us know.)

What convinced us ultimately about WARC is the excellent tool support in
the form of a Java API from the Laboratory for Web Algorithmics [2] of
the Università degli studi di Milano <http://www.unimi.it/>. The API can
be downloaded from [3] and there is a separate tarball with all the
dependencies. (The license is LGPL.) One of the nice features of this
API is the ability to work with streams of compressed WARC records,
where metadata about each record is stored in the gzip header. This
means that the metadata can be read without uncompressing the content of
the record itself. Further, there are skip pointers in the file, which
means that a record can be easily skipped over.

To make it really easy, I've also created sample code that demonstrates
how to create WARC archives from a set of files or a directory structure
on disk, and how to read back the resulting WARC archive. The code is
simply attached to this email, if all is well. (First time I send
attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
authors of the LAW API, for his help and advice.

To support the Challenge, we at Yahoo! Research Barcelona are also hard
at work to get permission to release a microformat crawl of 100 million
triples. We hope this will be a significant contribution to the
state-of-the-art and will complement the existing data sets to be
provided by Semantic Web search engines.

As always, your comments and questions are more than appreciated. In
particular those of you planning to provide some data, please let us
know if you need any further help.

Thanks,
Peter

[1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
[2] http://law.dsi.unimi.it/
[3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
package com.yahoo.corp.barcelona.billiontriples;

import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
import it.unimi.dsi.fastutil.io.FastBufferedOutputStream;
import it.unimi.dsi.law.warc.io.GZWarcRecord;
import it.unimi.dsi.law.warc.io.WarcRecord;
import it.unimi.dsi.law.warc.util.BURL;
import it.unimi.dsi.law.warc.util.BasicHttpResponse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Date;

import javax.xml.transform.TransformerConfigurationException;

/** Sample code for creating Warc packages. This class is executable.
  *
  * @author pmika@...
  *
  */
public class WarcPackager {

	 public final static int MAX_RECORDS = -1;

	 private int count = 0;

	 //MODIFY THIS if your filenames are not URLs
	 protected BURL getURL(File file) {
		 BURL result = null;
		 try {
			 result = BURL.parse(URLDecoder.decode(file.getName(), "UTF-8"));
		 } catch (UnsupportedEncodingException e) {

			 e.printStackTrace();
		 }
		 return result;
	 }

	 //MODIFY this if the last modification date of the file != crawl date
	 protected Date getDate(File file) {
		 return new Date(file.lastModified());
	 }

	 private WarcRecord createRecord(File file) throws UnsupportedEncodingException, IOException {
		 GZWarcRecord result = new GZWarcRecord();

		 BURL url = getURL(file);
		 if (url == null) {
			 throw new IllegalArgumentException("getURL() returned null for " + file);
		 }

		 Date date = getDate(file);
		 if (date == null) {
			 throw new IllegalArgumentException("getDate() returned null for " + file);
		 }

		 //Open the file only after both values have been validated
		 InputStream fis = new FileInputStream(file);

		 //Recreate the HTTP response around the file content
		 BasicHttpResponse response = new BasicHttpResponse();
		 response.url(url);
		 response.statusLine("HTTP/1.1 200 OK");
		 response.status(200);
		 response.contentAsStream(new FastBufferedInputStream(fis));

		 response.toWarcRecord(result);

		 result.header.creationDate = date;

		 return result;
	 }


	 //recursive
	 public void processFileOrDir(OutputStream out, File file) throws IOException {

		 //if MAX_RECORDS is specified, and we've reached the limit, return
		 if (MAX_RECORDS != -1 && count > MAX_RECORDS) {
			 return;
		 }

		 //Report progress every 100000 files
		 if (++count % 100000 == 0) System.err.println("Processed " + count + " files.");

		 if (file.isDirectory()) {
			 //file.list() returns null if the directory cannot be read
			 String[] names = file.list();
			 if (names == null) {
				 return;
			 }
			 for (String name : names) {
				 processFileOrDir(out, new File(file, name));
			 }
		 } else {
			 //Catch exceptions: failure to write a single file should not make us abort
			 try {
				 WarcRecord record = createRecord(file);
				 record.write(out);
			 } catch (Exception e) {
				 System.err.println(e);
			 }
		 }

	 }


	 /**
	  * Package the files or directories passed in as arguments.
	  * Directories are processed recursively.
	  *
	  * The result is printed to standard out, errors/diagnostic messages to std err.
	  *
	  * @param args
	  * @throws TransformerConfigurationException
	  * @throws IOException
	  * @throws UnsupportedEncodingException
	  */
	 public static void main(String[] args) throws TransformerConfigurationException, UnsupportedEncodingException, IOException {

		 if (args.length < 1) {
			 System.err.println("Usage: WarcPackager <fileOrDir> ...");
			 return;
		 }

		 FastBufferedOutputStream out = new FastBufferedOutputStream(System.out);
		 WarcPackager packager = new WarcPackager();

		 for (String arg: args) {
			 packager.processFileOrDir(out, new File(arg));
		 }

		 out.close();


	 }
}
package com.yahoo.corp.barcelona.billiontriples;

import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
import it.unimi.dsi.fastutil.io.MeasurableInputStream;
import it.unimi.dsi.law.warc.filters.Filter;
import it.unimi.dsi.law.warc.filters.Filters;
import it.unimi.dsi.law.warc.io.GZWarcRecord;
import it.unimi.dsi.law.warc.io.WarcFilteredIterator;
import it.unimi.dsi.law.warc.io.WarcRecord;
import it.unimi.dsi.law.warc.util.BURL;
import it.unimi.dsi.law.warc.util.WarcHttpResponse;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.openrdf.model.Statement;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParseException;
import org.openrdf.rio.helpers.RDFHandlerBase;
import org.openrdf.rio.ntriples.NTriplesParser;

/** Sample code for reading Warc packages.
  *
  * This class is executable.
  *
  * @author pmika@...
  *
  */
public class WarcReader {

	 private NTriplesParser parser = new NTriplesParser();

	 private CountHandler countHandler = new CountHandler();

	 private int tripleCount = 0;
	 private int lineCount =  0;

	 public class CountHandler extends RDFHandlerBase {

		 private int count = 0;

		 public void endRDF() throws RDFHandlerException {
			 super.endRDF();
			 //System.out.println("Counted " + count + " statements.");
		 }

		 public void handleStatement(Statement st) {
			 count++;
		 }

		 public void startRDF() throws RDFHandlerException {
			 super.startRDF();
			 count = 0;
		 }

	 }

	 public static class TrueFilter extends Filter<BURL> {

		 @Override
		 public boolean accept( BURL x ) {
			 return true;
		 }

		 @Override
		 public String toExternalForm() {

			 return "true";
		 }

	 }

	 public void countTriples(MeasurableInputStream block, String base) {
		 parser.setRDFHandler(countHandler);

		 try {
			 parser.parse(block, base);
			 tripleCount += countHandler.count;
		 } catch (RDFParseException e) {
			 e.printStackTrace();
		 } catch (RDFHandlerException e) {
			 e.printStackTrace();
		 } catch (IOException e) {
			 e.printStackTrace();
		 }
	 }

	 public void countLines(MeasurableInputStream block) throws IOException {
		 int c = 0;
		 while ((c = block.read()) != -1) {
			 if (c == '\n') {
				 lineCount++;
			 }
		 }
	 }

	 public void dumpContent(MeasurableInputStream block) throws IOException {
		 int c = 0;
		 while ((c = block.read()) != -1) {
			 System.out.write(c);
		 }
	 }


	 /**
	  * @param args
	  * @throws FileNotFoundException
	  */
	 public static void main(String[] args) throws FileNotFoundException {
		 if (args.length < 1) {
			 System.err.println("Usage: WarcReader <file>");
			 //Without an argument there is nothing to read
			 return;
		 }

		 final FastBufferedInputStream in = new FastBufferedInputStream(new FileInputStream(new File(args[0])));
		 GZWarcRecord record = new GZWarcRecord();
		 Filter<WarcRecord> filter = Filters.adaptFilterBURL2WarcRecord(new TrueFilter());
		 WarcFilteredIterator it = new WarcFilteredIterator(in, record, filter);
		 int urlCount = 0;

		 WarcReader reader = new WarcReader();
		 WarcHttpResponse response = new WarcHttpResponse();
		 try {
			 while (it.hasNext()) {

				 //Report progress every 100000 records
				 if (++urlCount % 100000 == 0) System.err.println("Processed " + urlCount + " files.");

				 WarcRecord nextRecord = it.next();
				 //Get the HttpResponse
				 try {
					 response.fromWarcRecord(nextRecord);
					 System.out.println("Processing: " + nextRecord.header.subjectUri);

					 //This will dump the content of the record
					 //reader.dumpContent(response.contentAsStream());

					 //This will count the number of triples by parsing the RDF
					 //reader.countTriples(response.contentAsStream(), nextRecord.header.subjectUri.toString());

					 //This will count the number of lines, which is equivalent to
					 //the number of triples in N-Triples format
					 reader.countLines(response.contentAsStream());
				 } catch (IOException e) {
					 e.printStackTrace();
					 continue;
				 }
			 }
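		 } catch (RuntimeException re) {
			 //Report the failure instead of silently ending the loop
			 System.err.println("Stopped reading: " + re);
		 }

		 System.out.println("Counted " + reader.lineCount + " triples from " + urlCount + " urls.");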


	 }

}
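
A note on the line-counting shortcut used in WarcReader: counting '\n' characters matches the number of triples only if every line holds exactly one statement and the content ends with a newline. N-Triples also allows blank lines and comment lines starting with '#'. The following is a minimal sketch of a slightly stricter counter. It is not part of the attached code, the class name is made up, and it still does not validate the statements themselves (for that, the commented-out countTriples() call with a real parser is the better choice).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class NTriplesLineCounter {

    // Counts lines that look like statements: non-blank and not a '#' comment.
    static int countStatements(Reader in) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Two statements, one comment, one blank line, no trailing newline.
        String sample =
              "# a comment line\n"
            + "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
            + "\n"
            + "<http://example.org/s> <http://example.org/p> \"a literal\" .";
        System.out.println(countStatements(new StringReader(sample)));
    }
}
```

For this sample the sketch prints 2, whereas counting newline characters alone would give 3.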

#54 From: "Jeff Pollock" <jeff.pollock@...>
Date: Fri Feb 8, 2008 5:30 pm
Subject: Are Scalable Graph Data Applications Possible? A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications
jeff_pollock
 

http://www.sdforum.org/index.cfm?fuseaction=Page.viewPage&pageId=656&parentID=483&nodeID=1

 

SDForum Semantic Web SIG Event

 

Are Scalable Graph Data Applications Possible?

A Look at C-Store, Java, and Data Grid Approaches to Semantic Web Applications

 

With the rising importance of data analytics, there is more evidence than ever that graph style data systems can achieve new benefits by making it easier to link and re-combine complex data. But the Achilles heel of graph style tuple storage has always been a lack of performance at scale. Will the Semantic Web and modern analytics finally drive innovation that makes these systems scalable? In this SDForum interactive panel discussion we will explore that question and more.

 

Join us for three unique presentations that will explore cutting-edge techniques for scalable RDF/OWL storage, and the kinds of applications that make use of those systems. First, we are honored to have representation from Vertica and the Massachusetts Institute of Technology to describe how columnar store (C-Store) data warehouse technology can enable large scale data graphs supporting billions of RDF triples. Next, we’ll get a peek at some GeoTemporal and Social Network Analysis applications based off the federated Java RDF database from Franz Technologies.  Finally, a short synopsis of Oracle’s various approaches for tuple-based storage (including in-memory, data grid, and Oracle Database RDF solutions) will be presented and tradeoffs discussed.

 

Our expert guests include Andy Palmer from Vertica, Samuel R. Madden from MIT, and Jans Aasman from Franz Technologies. Jeff Pollock from Oracle will moderate as well as present a short summary of technical approaches to scalable RDF systems.

 

Next Meeting:

 

6:30 PM - 9:00 PM March 5, 2008

Cubberley Community Center

4000 Middlefield Rd., Room H-1

Palo Alto, CA

94105

 

Agenda:

 

6:30pm  - 7:00pm   Registration / Networking / Refreshments / Pizza

 

7:00pm – 7:10pm   Community announcement

 

7:10pm  -  7:50pm  The Vertica C-Store DBMS for Scalable RDF Persistence

 

7:50pm  -  8:30pm  Franz Technologies RDF Applications

 

8:30pm  -  8:45pm  Oracle Infrastructure for Tuple-based Graph Storage

 

8:45pm  -  9:00pm(+)  Dedicated Q&A period

 

 

 

Oracle
Jeff T. Pollock | Senior Director | Direct:650-506-4700

100 Oracle Parkway, Redwood Shores, California 94065
Oracle Fusion Middleware | Main:800-ORACLE1 | Fax:801-607-6504 | Mobile:415-971-2223
Middleware | Data Integrator | jeff.pollock@... | Blog | LinkedIn Profile | Publications

 


#53 From: "M.Daquin" <m.daquin@...>
Date: Thu Feb 7, 2008 7:32 pm
Subject: RE: Re: data format
mathieu_daquin
 
Hi list,

Peter Mika wrote:
>> (...)
>> URL = http://challenge.semanticweb.org/somefile.rdf
>> (...)
>> and the file would go in directory
>>
>> /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
>>
>> If we take the checksum on the contents of the file and create enough
>> levels, we can also make sure that files that are duplicates end up in
>> the same subdirectory regardless of the URL.

I quite like this last solution for one, very selfish reason: this is very
similar to the way the cache of Watson is organized. For example,
    
http://kmi-web05.open.ac.uk:81/cache/a/a0d/89b9/a7dd3/6e33577582/44cab50d0e34cb3\
ce
is the location of the file for which the sha1 checksum of the content is
     aa0d89b9a7dd36e3357758244cab50d0e34cb3ce
This structure is pretty convenient for managing a large number of files,
providing a unique ID to each file according to its content, and therefore,
automatically avoiding duplicates.
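
To make the mapping concrete, here is a minimal sketch of how a checksum could be turned into such a path. The segment lengths (1, 3, 4, 5, 10 and 17 hex digits) are simply read off the example URL above; whether Watson derives them this way is an assumption, and the class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentAddressedPath {

    // Segment lengths taken from the example path; they add up to 40 hex digits.
    private static final int[] SEGMENTS = {1, 3, 4, 5, 10, 17};

    // Lower-case hex SHA-1 of the given content.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Splits a 40-digit checksum into nested directory names.
    static String toPath(String sha1Hex) {
        if (sha1Hex.length() != 40) {
            throw new IllegalArgumentException("Expected 40 hex digits: " + sha1Hex);
        }
        StringBuilder path = new StringBuilder();
        int pos = 0;
        for (int len : SEGMENTS) {
            if (pos > 0) {
                path.append('/');
            }
            path.append(sha1Hex, pos, pos + len);
            pos += len;
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // The checksum quoted above maps back to the path in the example URL.
        System.out.println(toPath("aa0d89b9a7dd36e3357758244cab50d0e34cb3ce"));
        // Two files with identical content always get the same path.
        System.out.println(toPath(sha1Hex("abc".getBytes(StandardCharsets.UTF_8))));
    }
}
```

The first line printed is a/a0d/89b9/a7dd3/6e33577582/44cab50d0e34cb3ce, the path from the example.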

> from my experience, file systems will have trouble at some point when
> there are too many files around.  Thus, we avoid writing individual files
> to the file system.
> (...)
> The nice thing about ZIP archives is that you can access them from
> within any programming language (we've tried Java and Python).

Note that zip archives also have serious limitations, in particular a limit of
65,536 files (see http://www.info-zip.org/FAQ.html#limits).

Regards,
Mathieu.

#52 From: Andreas Harth <andreas.harth@...>
Date: Thu Feb 7, 2008 7:28 pm
Subject: Re: Re: data format
andreasharth
 
Hi Peter,

Peter Mika wrote:
> I like this solution as well, the only thing I'm slightly worried about
> now  is what happens when you unzip a large number of files. My extended
> suggestion is thus to take the SHA1 sum of the URL and create
> subdirectories based on that, say three level deep. For example,  take a
> file with URL
>
> URL = http://challenge.semanticweb.org/somefile.rdf
>
> Now we could take the checksum of the URL or the checksum of the contents:
>
> checksum = ABCDEFG0123456789
>
> and the file would go in directory
>
> /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
>
> If we take the checksum on the contents of the file and create enough
> levels, we can also make sure that files that are duplicates end up in
> the same subdirectory regardless of the URL.
>
> What do you think?
>

from my experience, file systems will have trouble at some point when
there are too many files around.  Thus, we avoid writing individual files
to the file system.

What worked here is:

put source files into ZIP archives with URI urlencoded as filename
for each file in the ZIP archive:
        process file

That way, we never have to actually put all files on the filesystem,
but do (de)compression on the fly.  If we use command line tools in
the process, we iterate over the ZIP contents, write one file to disk,
process the file with the command line tool, and remove the file
again.

The nice thing about ZIP archives is that you can access them from
within any programming language (we've tried Java and Python).
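
For illustration, the approach described above can be sketched in Java with the standard java.util.zip classes. This is not the code we run; the class and method names are made up, and "process file" is reduced to counting lines. The archive is built in memory only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCrawlArchive {

    // Writes one entry per document; the entry name is the URL-encoded source URI.
    static byte[] pack(Map<String, String> documentsByUri) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> doc : documentsByUri.entrySet()) {
                zip.putNextEntry(new ZipEntry(URLEncoder.encode(doc.getKey(), "UTF-8")));
                zip.write(doc.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Streams through the archive without writing any file to disk and
    // returns the number of lines per decoded source URI.
    static Map<String, Integer> countLines(InputStream archive) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String uri = URLDecoder.decode(entry.getName(), "UTF-8");
                int lines = 0;
                int c;
                // read() returns -1 at the end of the current entry
                while ((c = zip.read()) != -1) {
                    if (c == '\n') {
                        lines++;
                    }
                }
                counts.put(uri, lines);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("http://challenge.semanticweb.org/somefile.rdf",
                 "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n");
        byte[] archive = pack(docs);
        System.out.println(countLines(new ByteArrayInputStream(archive)));
    }
}
```

The modification time of each entry is available through ZipEntry.getTime(), which is what makes the archive carry the timestamp as well as the URI.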

Regards,
Andreas.

--
http://harth.org/andreas/

#51 From: Peter Mika <pmika@...>
Date: Thu Feb 7, 2008 4:47 pm
Subject: Re: Re: data format
serendipity588
 
Hi Andreas,

I like this solution as well; the only thing I'm slightly worried about
now is what happens when you unzip a large number of files. My extended
suggestion is thus to take the SHA1 sum of the URL and create
subdirectories based on that, say three levels deep. For example, take a
file with URL

URL = http://challenge.semanticweb.org/somefile.rdf

Now we could take the checksum of the URL or the checksum of the contents:

checksum = ABCDEFG0123456789

and the file would go in directory

/A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A

If we take the checksum on the contents of the file and create enough
levels, we can also make sure that files that are duplicates end up in
the same subdirectory regardless of the URL.
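
A minimal sketch of the URL-based variant, with made-up class and method names: one directory level per leading hex digit of the SHA-1 of the URL, followed by the URL-encoded file name. Note that java.net.URLEncoder leaves '.' unescaped, so the file name differs slightly from the example above, and the trailing %0D%0A in that example (an encoded line break) is not reproduced here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class ShardedPath {

    // Upper-case hex SHA-1 of the UTF-8 bytes of a string.
    static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format(Locale.ROOT, "%02X", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Builds /X/Y/Z/<url-encoded URL> with one directory per checksum digit.
    static String pathFor(String url, int levels) {
        String checksum = sha1Hex(url);
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            path.append('/').append(checksum.charAt(i));
        }
        try {
            path.append('/').append(URLEncoder.encode(url, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println(pathFor("http://challenge.semanticweb.org/somefile.rdf", 3));
    }
}
```

With three levels of hex digits there are 16 x 16 x 16 = 4096 leaf directories, so a million files would average a few hundred per directory. Hashing the file contents instead of the URL only changes the input passed to sha1Hex().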

What do you think?

Best,
Peter



andreasharth wrote:
>
> Hi,
>
> --- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
> > The goal is basically to find a way to transfer quints, i.e. RDF triples
> > with provenance and timestamp. I will start by proposing two
> > alternatives (with variations) and then leave the floor to others.
> >
> > Here we go:
> >
> > Each Semantic Web Document (SWD) is stored in a single file using Turtle
> > format. The files are zipped together to form a single file.
>
> as already discussed, I'd prefer this solution. Filenames in the ZIP
> archive are the url-encoded URI of the file. Actually, ZIPs do also
> preserve the timestamp, so the ZIP archive contains all the
> information you require.
>
> Regards,
> Andreas.
>
>

#50 From: Jim Hendler <hendler@...>
Date: Wed Feb 6, 2008 5:57 pm
Subject: Re: data hosting
james.hendler
 
Ian, good point. We will work hard to make sure all the data is freely sharable and displayable; having a good license that makes that clear would make a lot of sense. - JH


On Feb 2, 2008, at 10:12 AM, Ian Davis wrote:


What licensing terms will the data be issued under? I encourage this
project to adopt the ODC Public Domain Dedication and Licence, a licence
that Talis and others developed in conjunction with Science Commons:

http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/

and the community norms:

http://www.opendatacommons.org/odc-community-norms/

We expect this licence to be out of beta in a couple of weeks.

Our goal in developing the licence was to convert the web of data into a
web of _useable_ data.

Ian

On Fri, 2008-02-01 at 18:23 +0100, Peter Mika wrote: 
> Dear All,
> 
> We are looking for persons or organizations who would like to offer 
> their help in hosting the Billion Triples data set. This is an important 
> part of the Challenge in that we expect many more developers who would 
> like to work with the data, but themselves do not have the means to host 
> the data. The organizers reserve the right to award the fastest and the 
> most reliable hosting service based on feedback from the participants 
> using their service.
> 
> The minimal criteria for hosting is to provide a SPARQL endpoint to the 
> dataset and an email address for support (with a maximum response time 
> of 24 hours). Hosting locations will be posted on the Semantic Web 
> Challenge website.
> 
> Thanks,
> Jim and Peter
> 


"If we knew what we were doing, it wouldn't be called research, would it?." - Albert Einstein

Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180





#49 From: "gtummarello" <g.tummarello@...>
Date: Sat Feb 2, 2008 10:13 pm
Subject: Re: data hosting
gtummarello
 
Following some inquiries, I'd like to clarify that it's not the main
Sindice infrastructure that would provide a SPARQL endpoint (e.g. over
the entire dataset); it's just one of the machines we use for it that
can be set up for that.
Giovanni

--- In billiontriples@yahoogroups.com, "gtummarello"
<g.tummarello@...> wrote:
>
> Hi,
>
> we can provide one such data hosting within Sindice.
> let us know when/how/what and we'll get it running.
>
> Giovanni
>
> --- In billiontriples@yahoogroups.com, Peter Mika <pmika@> wrote:
> >
> > Dear All,
> >
> > We are looking for persons or organizations who would like to offer
> > their help in hosting the Billion Triples data set. This is an important
> > part of the Challenge in that we expect many more developers who would
> > like to work with the data, but themselves do not have the means to host
> > the data. The organizers reserve the right to award the fastest and the
> > most reliable hosting service based on feedback from the participants
> > using their service.
> >
> > The minimal criteria for hosting is to provide a SPARQL endpoint to the
> > dataset and an email address for support (with a maximum response time
> > of 24 hours). Hosting locations will be posted on the Semantic Web
> > Challenge website.
> >
> > Thanks,
> > Jim and Peter
> >
>

#48 From: Ian Davis <lists@...>
Date: Sat Feb 2, 2008 3:12 pm
Subject: Re: data hosting
ianalchemy
 
What licensing terms will the data be issued under? I encourage this
project to adopt the ODC Public Domain Dedication and Licence, a licence
that Talis and others developed in conjunction with Science Commons:

http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/

and the community norms:

http://www.opendatacommons.org/odc-community-norms/

We expect this licence to be out of beta in a couple of weeks.

Our goal in developing the licence was to convert the web of data into a
web of _useable_ data.

Ian


On Fri, 2008-02-01 at 18:23 +0100, Peter Mika wrote:
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>


#47 From: "Kingsley Idehen" <kidehen@...>
Date: Fri Feb 1, 2008 10:37 pm
Subject: Re: data hosting
kidehen
 
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>

Jim & Peter,

As we do with DBpedia[1][2], we are happy to be one of hopefully numerous
RDF data store providers for this effort.

Count OpenLink Software in re. our Virtuoso Quad Store [3] :-)

Links:
1. http://dbpedia.org
2. http://www4.wiwiss.fu-berlin.de/benchmarks-200801/
3. http://en.wikipedia.org/wiki/Virtuoso_Universal_Server



Kingsley Idehen

#46 From: "andreasharth" <andreas.harth@...>
Date: Fri Feb 1, 2008 7:17 pm
Subject: Re: data format
andreasharth
 
Hi,

--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
> The goal is basically to find a way to transfer quints, i.e. RDF triples
> with provenance and timestamp. I will start by proposing two
> alternatives (with variations) and then leave the floor to others.
>
> Here we go:
>
> Each Semantic Web Document (SWD) is stored in a single file using Turtle
> format. The files are zipped together to form a single file.

as already discussed, I'd prefer this solution.  Filenames in the ZIP
archive are the url-encoded URI of the file.  Actually, ZIPs do also
preserve the timestamp, so the ZIP archive contains all the
information you require.
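
To illustrate, here is a small Python sketch that stores a document under its URL-encoded URI together with a fetch time and then recovers both from the archive alone. The function names are hypothetical. One caveat: ZIP entries use the MS-DOS date/time format, so timestamps have two-second resolution, carry no time zone, and cannot be earlier than 1980.

```python
import io
import zipfile
from datetime import datetime
from urllib.parse import quote, unquote


def add_document(zf, uri, data, fetched):
    """Store one document under its URL-encoded URI with its fetch time."""
    info = zipfile.ZipInfo(
        filename=quote(uri, safe=""),
        date_time=(fetched.year, fetched.month, fetched.day,
                   fetched.hour, fetched.minute, fetched.second),
    )
    zf.writestr(info, data, compress_type=zipfile.ZIP_DEFLATED)


def list_provenance(archive):
    """Return (uri, timestamp) pairs recovered from the archive alone."""
    with zipfile.ZipFile(archive) as zf:
        return [(unquote(i.filename), datetime(*i.date_time))
                for i in zf.infolist()]


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Hypothetical URI and fetch time, for illustration only.
    add_document(zf, "http://example.org/a.rdf", b"<s> <p> <o> .\n",
                 datetime(2008, 2, 1, 19, 17, 30))
buf.seek(0)
print(list_provenance(buf))
```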

Regards,
Andreas.

#45 From: "gtummarello" <g.tummarello@...>
Date: Fri Feb 1, 2008 6:31 pm
Subject: Re: data hosting
gtummarello
 
Hi,

we can provide one such data hosting within Sindice.
let us know when/how/what and we'll get it running.

Giovanni

--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> We are looking for persons or organizations who would like to offer
> their help in hosting the Billion Triples data set. This is an important
> part of the Challenge in that we expect many more developers who would
> like to work with the data, but themselves do not have the means to host
> the data. The organizers reserve the right to award the fastest and the
> most reliable hosting service based on feedback from the participants
> using their service.
>
> The minimal criteria for hosting is to provide a SPARQL endpoint to the
> dataset and an email address for support (with a maximum response time
> of 24 hours). Hosting locations will be posted on the Semantic Web
> Challenge website.
>
> Thanks,
> Jim and Peter
>

#44 From: "Marc-Alexandre Nolin" <lotus@...>
Date: Fri Feb 1, 2008 6:30 pm
Subject: Re: Re: data format
marc_alexand...
 
Hi,

I'm new to this discussion list. To introduce myself: I'm Marc-Alexandre Nolin from the Bio2RDF project (http://bio2rdf.org). Is the Billion Triples Challenge just about having a billion triples, about a billion triples that we can query in a single triple store, or about a billion triples that we can query across multiple triple stores?

I have a way to generate a huge amount of RDF from genomics data. Our current triple store, Sesame, can't hold that much, and we are in the process of installing and moving to Virtuoso in the hope that it can hold all of it.

I will keep you posted on the number of triples I manage to put in it.

Bye !!

Marc-Alexandre

2008/2/1, N. Sivaramakrishnan <k2_181@...>:

My two cents: In the spirit of RDF, why not provide a 'directory'
triple file that has resources identifying each file and provides
timestamps, provenance etc as properties? Do we want the ability to
query statements on provenance information?

To introduce myself, I'm a PhD student, working with Joel Saltz at The
Ohio State University. I have developed a research prototype of a
parallel semantic engine that can run in a cluster setting. I am
hoping to use the billion triples challenge to see how it measures up.

I apologize in advance if I say something nonsensical :)

--Sivaramakrishnan



--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> In the past few days we have talked to several of you about providing
> data for the billion triples challenge. I would like to start a brief
> discussion on the data format that we intend to use to provide data and
> to exchange the billion triples data set. This discussion is relevant
> for those who would be providing data and those who would like to host
> this data set; we are hoping that the majority of participants will be
> able to rely on these hosting services to build applications.
>
> The goal is basically to find a way to transfer quints, i.e. RDF triples
> with provenance and timestamp. I will start by proposing two
> alternatives (with variations) and then leave the floor to others.
>
> Here we go:
>
> Each Semantic Web Document (SWD) is stored in a single file using Turtle
> format. The files are zipped together to form a single file.
>
> Alternative #1: Each SWD file is named using the SHA1 hash of the URL
> identifying the provenance of the file. There is a separate file linking
> such hash codes to the full URLs and timestamps.
>
> Alternative #2: The name of the file is irrelevant: provenance and
> timestamp are encoded as comments in Turtle.
>
> Comments, suggestions?
>
> Thanks,
> Peter
>



#43 From: "N. Sivaramakrishnan" <k2_181@...>
Date: Fri Feb 1, 2008 6:16 pm
Subject: Re: data format
k2_181
 
My two cents: In the spirit of RDF, why not provide a 'directory'
triple file that has resources identifying each file and provides
timestamps, provenance etc as properties? Do we want the ability to
query statements on provenance information?
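
As a rough sketch of what such a 'directory' file could look like, the Python below emits Turtle that describes each archived file. The choice of `dcterms:source` and `dcterms:date` is only an assumption for illustration (the thread does not fix a vocabulary), and the file names and URLs are made up:

```python
def directory_triples(entries):
    """Yield Turtle lines describing each file's source URL and fetch time.

    `entries` is an iterable of (file_name, source_url, iso_timestamp).
    The Dublin Core properties used here are an illustrative choice.
    """
    yield "@prefix dcterms: <http://purl.org/dc/terms/> ."
    yield "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> ."
    for file_name, source_url, timestamp in entries:
        # The file name is written as a relative IRI, resolved against
        # wherever the directory file itself lives.
        yield (f"<{file_name}> dcterms:source <{source_url}> ;\n"
               f'    dcterms:date "{timestamp}"^^xsd:dateTime .')


entries = [
    ("0001.ttl", "http://example.org/a.rdf", "2008-02-01T18:16:00Z"),
    ("0002.ttl", "http://example.org/b.rdf", "2008-02-01T18:17:00Z"),
]
print("\n".join(directory_triples(entries)))
```

Because the provenance is itself RDF, it could be loaded into the same store and queried alongside the data.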

To introduce myself, I'm a PhD student, working with Joel Saltz at The
Ohio State University. I have developed a research prototype of a
parallel semantic engine that can run in a cluster setting. I am
hoping to use the billion triples challenge to see how it measures up.

I apologize in advance if I say something nonsensical :)

--Sivaramakrishnan

--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> In the past few days we have talked to several of you about providing
> data for the billion triples challenge. I would like to start a brief
> discussion on the data format that we intend to use to provide data and
> to exchange the billion triples data set. This discussion is relevant
> for those who would be providing data and those who would like to host
> this data set; we are hoping that the majority of participants will be
> able to rely on these hosting services to build applications.
>
> The goal is basically to find a way to transfer quints, i.e. RDF triples
> with provenance and timestamp. I will start by proposing two
> alternatives (with variations) and then leave the floor to others.
>
> Here we go:
>
> Each Semantic Web Document (SWD) is stored in a single file using Turtle
> format. The files are zipped together to form a single file.
>
> Alternative #1: Each SWD file is named using the SHA1 hash of the URL
> identifying the provenance of the file. There is a separate file linking
> such hash codes to the full URLs and timestamps.
>
> Alternative #2: The name of the file is irrelevant: provenance and
> timestamp are encoded as comments in Turtle.
>
> Comments, suggestions?
>
> Thanks,
> Peter
>

#42 From: "jans.aasman" <ja@...>
Date: Fri Feb 1, 2008 6:10 pm
Subject: Re: data format
jannesaasman
 
Hi Peter, I vote for option # 2, Jans

Peter Mika wrote:

Dear All,

In the past few days we have talked to several of you about providing
data for the billion triples challenge. I would like to start a brief
discussion on the data format that we intend to use to provide data and
to exchange the billion triples data set. This discussion is relevant
for those who would be providing data and those who would like to host
this data set; we are hoping that the majority of participants will be
able to rely on these hosting services to build applications.

The goal is basically to find a way to transfer quints, i.e. RDF triples
with provenance and timestamp. I will start by proposing two
alternatives (with variations) and then leave the floor to others.

Here we go:

Each Semantic Web Document (SWD) is stored in a single file using Turtle
format. The files are zipped together to form a single file.

Alternative #1: Each SWD file is named using the SHA1 hash of the URL
identifying the provenance of the file. There is a separate file linking
such hash codes to the full URLs and timestamps.

Alternative #2: The name of the file is irrelevant: provenance and
timestamp are encoded as comments in Turtle.

Comments, suggestions?

Thanks,
Peter


#41 From: Peter Mika <pmika@...>
Date: Fri Feb 1, 2008 5:23 pm
Subject: data hosting
serendipity588
 
Dear All,

We are looking for persons or organizations who would like to offer
their help in hosting the Billion Triples data set. This is an important
part of the Challenge in that we expect many more developers who would
like to work with the data, but themselves do not have the means to host
the data. The organizers reserve the right to award the fastest and the
most reliable hosting service based on feedback from the participants
using their service.

The minimal criteria for hosting is to provide a SPARQL endpoint to the
dataset and an email address for support (with a maximum response time
of 24 hours). Hosting locations will be posted on the Semantic Web
Challenge website.
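
For participants, using such a hosted endpoint would come down to a SPARQL Protocol request over HTTP. The Python sketch below only builds the request and does not send it; the endpoint URL is a placeholder, not an actual hosting location:

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_get_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    # The protocol passes the query text in the `query` URL parameter.
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


req = sparql_get_request(
    "http://example.org/sparql",  # hypothetical endpoint
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the query to a live endpoint.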

Thanks,
Jim and Peter

#40 From: Peter Mika <pmika@...>
Date: Fri Feb 1, 2008 5:22 pm
Subject: data format
serendipity588
 
Dear All,

In the past few days we have talked to several of you about providing
data for the billion triples challenge. I would like to start a brief
discussion on the data format that we intend to use to provide data and
to exchange the billion triples data set. This discussion is relevant
for those who would be providing data and those who would like to host
this data set; we are hoping that the majority of participants will be
able to rely on these hosting services to build applications.

The goal is basically to find a way to transfer quints, i.e. RDF triples
with provenance and timestamp. I will start by proposing two
alternatives (with variations) and then leave the floor to others.

Here we go:

Each Semantic Web Document (SWD) is stored in a single file using Turtle
format. The files are zipped together to form a single file.

Alternative #1: Each SWD file is named using the SHA1 hash of the URL
identifying the provenance of the file. There is a separate file linking
such hash codes to the full URLs and timestamps.

Alternative #2: The name of the file is irrelevant: provenance and
timestamp are encoded as comments in Turtle.
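
To make the two alternatives concrete, here is a small Python sketch. The tab-separated index line, the ".ttl" suffix, and the "# source:" / "# retrieved:" comment labels are assumptions for illustration, not an agreed format:

```python
import hashlib


def alternative_1(url, timestamp):
    """Return (archive_file_name, index_line) for the SHA1-named variant."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".ttl"
    # One line of the separate file linking hash codes to URL and timestamp.
    return name, f"{name}\t{url}\t{timestamp}"


def alternative_2(url, timestamp, turtle_text):
    """Return the Turtle text with provenance prepended as comments."""
    header = f"# source: {url}\n# retrieved: {timestamp}\n"
    return header + turtle_text


url = "http://example.org/a.rdf"   # hypothetical document URL
stamp = "2008-02-01T17:22:00Z"     # hypothetical retrieval time
print(alternative_1(url, stamp)[1])
print(alternative_2(url, stamp, "<s> <p> <o> .\n"))
```

Alternative #1 keeps the Turtle files untouched but needs the index to interpret them; alternative #2 is self-contained per file, but the provenance sits in comments that an RDF parser discards.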

Comments, suggestions?

Thanks,
Peter

#39 From: "thompsonbry" <bryan@...>
Date: Tue Dec 18, 2007 11:01 am
Subject: Re: Intro: bigdata
thompsonbry
 
Sören,

Hardly empty. I think "pre-release" is the better term.

The project documentation site is:

http://www.bigdata.com/projects/

The RDF layer documentation is:

http://www.bigdata.com/projects/multiproject/bigdata-rdf/index.html

The store is currently a triple store with a Sesame 1.x integration
scaling well past the 1B triple point with RDFS + owl:sameAs and
friends, truth maintenance, etc.

-bryan

--- In billiontriples@yahoogroups.com, Sören Auer <auer@...> wrote:
>
> thompsonbry wrote:
> > I would suggest that 1B is hardly the challenge point. Let's try 10B
> > or more.
> > [1] http://www.sourceforge.net/projects/bigdata
>
> Regarding the empty SourceForge project you are quite bold ;-)
>
> Good luck anyway!
>
> Sören
>

#38 From: Sören Auer <auer@...>
Date: Mon Dec 17, 2007 6:53 pm
Subject: Re: Intro: bigdata
soerenauer
 
thompsonbry wrote:
> I would suggest that 1B is hardly the challenge point. Let's try 10B
> or more.
> [1] http://www.sourceforge.net/projects/bigdata

Regarding the empty SourceForge project you are quite bold ;-)

Good luck anyway!

Sören

#37 From: "thompsonbry" <bryan@...>
Date: Mon Dec 17, 2007 6:45 pm
Subject: Intro: bigdata
thompsonbry
 
Hello,

I am with SYSTAP, LLC.  Among other things, we are interested in
scale-out databases and their applications to RDF and Topic maps.  We
have an open source project, bigdata(R) [1], based on a scale-out
database architecture.  The project is in pre-release, but the RDF layer
supports inference, etc. at the 1B+ statements level on a single host.

I would suggest that 1B is hardly the challenge point.  Let's try 10B
or more.

[1] http://www.sourceforge.net/projects/bigdata

Thanks,

-bryan


Copyright © 2009 Yahoo! Inc. All rights reserved.