Hi,
I am a student working on BTC together with a few of my friends.
I downloaded the BTC dataset and tried to load it into a Virtuoso store, but
I am getting errors about missing namespace declarations.
Any idea where the namespaces are defined? I would expect a
single file that holds all the namespace declarations for the entire dataset.
Please let me know if you know where these namespace declarations can be found.
Thanks a lot,
Best Regards,
Pramod.
Hi Jans,
Jans Aasman wrote:
> I do have another question about the blank nodes. I want to do a
> multistream load of the BTC but in that case I have to be sure that
> blank nodes are completely local to named graphs. Do you know if this is
> the case? To be more precise: would it be possible that a blank node
> that was introduced in one named graph is referred to in another named
> graph?
We've rewritten/skolemised the blank node IDs so that they
are globally unique and hence local to a named graph. You're ok
to just load the data without touching blank node IDs.
Regards,
Andreas.
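For readers writing their own loader, here is a minimal sketch of how blank node labels can be made unique per named graph. It is an illustration only, not the code used to prepare the BTC dataset: the class name `BnodeScoper`, the `_:b<hex>` label format and the choice of SHA-1 are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: scopes a blank node label to the graph it occurs in. */
final class BnodeScoper {

    /** Returns "_:b" followed by the hex SHA-1 of the graph URI plus the local label. */
    static String scope(String graphUri, String localLabel) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            // A space cannot occur inside a URI, so it safely separates the two parts.
            byte[] digest = sha1.digest(
                    (graphUri + " " + localLabel).getBytes(StandardCharsets.UTF_8));
            StringBuilder label = new StringBuilder("_:b");
            for (byte b : digest) {
                label.append(String.format("%02x", b));
            }
            return label.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same local label in two graphs yields two different scoped labels.
        System.out.println(scope("http://example.org/g1", "_:x"));
        System.out.println(scope("http://example.org/g2", "_:x"));
    }
}
```

Because the resulting labels contain only alphanumeric characters after the `_:` prefix and depend on the graph URI, data rewritten this way could be loaded in several parallel streams without two graphs sharing a blank node by accident.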
Thanks Andreas: that is great... now we can publish a conformant N-Quads
reader.
I do have another question about the blank nodes. I want to do a
multistream load of the BTC but in that case I have to be sure that
blank nodes are completely local to named graphs. Do you know if this is
the case? To be more precise: would it be possible that a blank node
that was introduced in one named graph is referred to in another named
graph?
Jans
Andreas Harth wrote:
Hi,
I've updated the BTC 2009 dataset at [1]. The updates
are minor fixes and concern encoding issues: Unicode
escaping in URIs and bnode identifiers containing
characters outside the alphanumeric range.
Regards,
Andreas.
[1] http://vmlion25.deri.ie/
Dear All,
We would like to give a brief update on the submissions we have
received for the Semantic Web Challenge 2008 and the process going
forward.
We are pleased to share that we have received a record number of
submissions: altogether 24 individuals and groups have submitted their
work. We have received 14 submissions for the Open Track and 10
submissions for the Billion Triples Track.
We are currently evaluating whether the submissions meet the minimal
criteria we have posted for these tracks. We will send out
notifications later this week. (We expect that the vast majority of
submissions will meet the minimal criteria.)
Best,
Jim & Peter
Dear All,
For those who submitted, we have just sent out a short notification
that we have received your submission. If you don't receive this email
for any reason, please let us know.
Thanks,
Jim and Peter
serendipity588 wrote:
> The submission deadline for the Semantic Web Challenge 2008 is
> tomorrow end of day by Central European Time.
Peter, thanks for clarifying this! The mathematician in me attempted to
interpret your deadline representation in the call (October 1, 2008,
12am CET) according to the common rules [1] and then the deadline would
have been tonight and I already scared my colleagues (who are still
heavily working on two submissions).
Maybe we can switch to European 24h date/time representations for
deadlines to avoid such confusion. ;-)
Sören
[1] http://en.wikipedia.org/wiki/12-hour_clock
--
--------------------------------------------------------------
Sören Auer, AKSW/Computer Science Dept., University of Leipzig
http://www.informatik.uni-leipzig.de/~auer, Skype: soerenauer
All,
The submission deadline for the Semantic Web Challenge 2008 is
tomorrow end of day by Central European Time. You may work on your
submission afterwards until the conference, but we will instruct the
judges not to take into account any features that have not been part
of the original description.
We thank you for your participation,
Jim & Peter
#############################################
SUBMISSIONS OPEN FOR THE SW CHALLENGE 2008
Open Track and Billion Triples Track
Submission deadline: October 1, 2008
Please visit http://challenge.semanticweb.org
#############################################
Call for Participation
for the Sixth Semantic Web Challenge
Open Track and Billion Triples Track
at the International Semantic Web Conference ISWC 2008
Karlsruhe, Germany
October 26-30, 2008
http://challenge.semanticweb.org/
****************************************************************************
We invite submissions to the sixth annual Semantic Web Challenge,
the premier event for demonstrating practical progress towards
achieving the vision of the Semantic Web.
The central idea of the Semantic Web is to extend the current
human-readable web by encoding some of the semantics of resources
in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the
Web. Computers will be better able to search, process, integrate
and present the content of these resources in a meaningful,
intelligent manner.
As the core technological building blocks are now in place,
the next challenge is to show off the benefits of semantic
technologies by developing integrated, easy to use applications
that can provide new levels of Web functionality for end users on
the Web or within enterprise settings. Applications submitted
should demonstrate clear practical value that goes above and
beyond what is possible with conventional web technologies alone.
Unlike in previous years, the Semantic Web Challenge of 2008 will
consist of two tracks: the Open Track and the Billion Triples Track.
The key difference between the two tracks is that the Billion
Triples Track requires the participants to make use of the data set
--a billion triples-- provided by the organizers.
The Open Track has no such restrictions.
As before, the Challenge is open to everyone from academia and
industry. The authors of the best applications will be awarded
prizes and featured prominently at special sessions during
the conference.
GOALS
-----
The overall goal of this event is to advance our understanding of
how semantic technologies can be exploited to produce useful
applications for the Web. Semantic Web applications should
integrate, combine, and deduce information from various sources
to assist users in performing specific tasks.
The specific goal of the Billion Triples Track is to demonstrate
the scalability of applications as well as to encourage the
development of applications that can deal with Web data.
We stress that the goal of this is not to be a benchmarking effort
between triple stores, but rather to demonstrate applications that
can scale to a Web scale using realistic Web-quality data.
Minimal Requirements
--------------------
Submissions for the Semantic Web Challenge must meet the
following minimum requirements:
For the Open Track:
~~~~~~~~~~~~~~~~~~~
1. The meaning of data has to play a central role.
* Meaning must be represented using formal descriptions.
* Data must be manipulated/processed in interesting ways to
derive useful information and
* this semantic information processing has to play a central
role in achieving things that alternative technologies
cannot do as well, or at all;
2. The information sources used
* should be under diverse ownership or control
* should be heterogeneous (syntactically, structurally, and
semantically), and
* should contain substantial quantities of real world data
(i.e. not toy examples).
3. The application has to be an end-user application, i.e. an
application that provides a practical value to domain experts.
Although we expect that most applications will use RDF, RDF Schema,
or OWL this is not a requirement. What is more important is that
whatever semantic technology is used, it plays a central role in
achieving interesting new levels of functionality or performance.
It is required that all applications assume an open world, i.e.
that the information is never complete.
Additional Desirable Features
-----------------------------
In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions.
- The application provides an attractive and functional Web interface
(for human users)
- Rigorous evaluations have taken place that demonstrate the benefits
of semantic technologies, or validate the results obtained.
- The application should be scalable (in terms of the amount of data
used and in terms of distributed components working together)
- Novelty, in applying semantic technology to a domain or task that
has not been considered before
- Functionality is different from or goes beyond pure information
retrieval
- The application has clear commercial potential and/or large existing
user base
- Contextual information is used for ratings or rankings
- Multi-media documents are used in some way
- There is a use of dynamic data (e.g. workflows), perhaps in
combination with static information
- The results should be as accurate as possible (e.g. use a ranking
of results according to context)
- There is support for multiple languages and
accessibility on a range of devices
For the Billion Triples Track:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. The primary goal is for submissions to show how they add value
to the very large triple store. This can involve anything from
helping people figure out what is in the store via browsing,
visualization, etc.; could include inferencing that adds
information not directly queryable in the original dataset;
could involve showing how ontological information could be tied
to part(s) or the whole of the dataset; etc.
2. The tool or application has to make use of at least a significant
portion of the data provided by the organizers.
3. The tool or application is allowed to use other data that can be
linked to the target dataset, but there is still an expectation
that the primary focus will be on the data provided.
4. The tool or application does not have to be specifically an
end-user application, as defined for the Open Track Challenge,
but usability is a concern. The key goal is to demonstrate an
interaction with the large data-set driven by a user
or an application. However, given the scale of this challenge,
solutions that can be justified as leading to such applications,
or as crucial to the success of future applications, will be
considered.
It is desired that all applications assume an open world, i.e. that
the information is never complete. However, applications that can show
useful ways to "close the world" for sections of the very large dataset
will be considered.
Additional Desirable Features
-----------------------------
In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions.
- The application should do more than simply store/retrieve
large numbers of triples
- The application or tool(s) should be scalable (in terms of the amount
of data used and in terms of distributed components working together)
- The application or tool(s) should show the use of the very large,
mixed quality data set
- The application should either function in real-time or,
if pre-computation is needed, have a real-time realization
(but we will take a wide view of "real time" depending on the
scale of what is done)
How to participate
------------------
Visit http://challenge.semanticweb.org in order to participate and register
for the Semantic Web Challenge by submitting the required information as
well as a link to the application on the online registration form. The form
will be open until October 1, 2008, 12am CET. The requirements of this entry
are:
1) Abstract: no more than 200 words.
2) Description: The description will show details of the system including
why the system is innovative, which features or functions the system
provides, what design choices were made and what lessons were learned.
Papers should not exceed eight pages and must be formatted according to the
same guidelines as the papers in the Research Track
(see http://iswc2008.semanticweb.org)
3) Web access: The application should be accessible via the web. If the
application is not publicly accessible, passwords should be provided. We
also ask you to provide (short) instructions on how to start and use the
application.
Descriptions will be published in the form of online proceedings.
Prizes
------
A cash prize will be awarded to the winners along with publicity for
their work. The winners will also be asked to give a live demonstration of
their application at the ISWC 2008 conference. The best applications will
also have a chance to appear as full papers in the Journal of Web
Semantics.
In the event that one of the tracks receives fewer than a minimal number of
submissions, the organizers reserve the right to merge the two tracks of
the competition.
IMPORTANT DATES
--------- -----
October 1, 2008 Submissions due
October 26-30, 2008 ISWC 2008 Technical Program
SWC Co-Chairs
-------------
Jim Hendler (Rensselaer Polytechnic Institute)
Peter Mika (Yahoo! Research Barcelona)
SWC Advisory Board
-------------------
Dean Allemang (TopQuadrant)
Jürgen Angele (Ontoprise)
Mike Dean (BBN Technologies)
Stefan Decker (DERI, Galway)
Jérôme Euzenat (INRIA Rhone-Alpes)
Ian Horrocks (University of Manchester)
Atanas Kiryakov (OntoText)
Michel Klein (Vrije Universiteit, Amsterdam)
Deborah McGuinness (Stanford University)
Rob Shearer (University of Manchester)
Amit Sheth (Wright State University)
York Sure (University of Karlsruhe)
Hideaki Takeda (National Institute of Informatics, Tokyo)
Ubbo Visser (University of Bremen)
Contact:
--------
Peter Mika
Yahoo! Research Barcelona
Ocata 1
08001 Barcelona, Spain
Tel: +34 935 421 165
Fax: +34 935 421 150
Email: pmika at yahoo-inc.com
Web: http://www.cs.vu.nl/~pmika/
Hi,
The content inside the WARC is encoded in N-Triples, see the sample code
(added to the files of the Yahoo! Group, see [1]) on how to extract it.
Once you have the N-Triples you can, as you say, process them using any
library. The sample code shows how to count the triples using Sesame.
Best,
Peter
[1]
http://f1.grp.yahoofs.com/v1/gLOhSKL3vLBuPUqI5V-eqZBzl0sjZGF52nYvngkFkAg-JaeZUgrYYx75kRXm7qz0uSZZeTvYnzWCU2lfNJi0hA/WarcReader.java
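For a quick sanity check that needs no RDF library at all, the extracted N-Triples can also be counted line by line. The sketch below is not the Sesame-based sample code mentioned above: `NTriplesLineCounter` is a hypothetical name, and it assumes one statement per line with `#` starting a comment. It does not validate the syntax of a statement.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: counts statement-like lines in N-Triples text. */
final class NTriplesLineCounter {

    /** Counts lines that are non-empty, are not comments, and end with '.'. */
    static long count(Reader source) {
        long statements = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // blank line or comment
                }
                if (trimmed.endsWith(".")) {
                    statements++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return statements;
    }

    public static void main(String[] args) {
        String sample =
                "# a comment\n"
              + "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
              + "\n"
              + "<http://example.org/s> <http://example.org/p> \"literal\" .\n";
        System.out.println(count(new StringReader(sample)));
    }
}
```

A real parser remains the better choice whenever the triples themselves are needed; a line count like this is only useful for comparing totals across tools.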
--- In billiontriples@yahoogroups.com, "huanxuezhou" <huanxuezhou@...>
wrote:
Thanks for the explanation. Personally, I would really appreciate it if you
could encode the files in N-Triples format, since this format represents RDF
well and can be easily parsed by Jena.
Hi,
Do you refer to the issue that the content might have changed since
the URL was crawled? I would say that for the sake of comparability,
please use the version included in the content. (Or do you have a
strong preference for recrawling?)
Thanks,
Peter
--- In billiontriples@yahoogroups.com, "huanxuezhou" <huanxuezhou@...>
wrote:
Hi Peter, one thing confuses me. A WARC record consists of a subject URI
and a content block. Since the content in the content block sometimes
differs from the content served at the URI, which content should we use?
OK.
Here is Wordnet as LD: http://wordnet.rkbexplorer.com/
Typical URI: http://wordnet.rkbexplorer.com/id/synset-odd-toed_ungulate-noun-1
(Note the use of owl:sameAs :-).)
If someone would really find some other sets useful I might manage it if asked :-)
Best
Hugh
------ Forwarded Message
From: Peter Mika <pmika@...>
Date: Fri, 11 Jul 2008 14:46:05 +0100
To: Hugh Glaser <hg@...>
Subject: Re: Hello
Hi Hugh,
Nice hearing from you and thanks for the heads up! If you have a minute,
please share your work with the billion triples mailing list:
billiontriples@yahoogroups.com
Even if you don't have time, maybe someone else can pick up the ball of
'lodding' the data set.
Cheers,
Peter
Hugh Glaser wrote:
> Hi Peter,
> Nice to meet you again.
> You may remember that you asked me about whether we were bringing up LD
> sites for the billion challenge.
> Well I went away and brought up wordnet at
> http://wordnet.rkbexplorer.com/
> But didn't have time to check it out, or bring up any more.
> And having got back seem to have no time for anything.
> So just thought I would tell you :-)
> Best
> Hugh
>
>
------ End of Forwarded Message
Tried and true tree-based filesystems provide services for a wealth of
carrier-class functions.
Use digest-based filenames to evenly populate the namespace and enable
fixed-length strings as labels as inodes.
Use hierarchically mounted spindles instead of constrained access to
specialty-class storage.
Best
Jim
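The digest-based layout discussed in this thread can be sketched in a few lines. This is one possible reading of the proposal, not code from any participant: `DigestPath` and its method names are hypothetical, the path is relative, and lower-case hex digits are used where the example in the quoted message shows upper case.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: maps a URL to a three-level, digest-based file path. */
final class DigestPath {

    /** Returns the SHA-1 of the given string as 40 lower-case hex digits. */
    static String sha1Hex(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Builds "d0/d1/d2/<urlencoded URL>" from the first three digest digits. */
    static String pathFor(String url) {
        String hex = sha1Hex(url);
        try {
            String fileName = URLEncoder.encode(url, "UTF-8");
            return hex.charAt(0) + "/" + hex.charAt(1) + "/" + hex.charAt(2) + "/" + fileName;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pathFor("http://challenge.semanticweb.org/somefile.rdf"));
    }
}
```

With three hex digits there are 16 x 16 x 16 = 4096 leaf directories, so a uniformly distributed digest spreads files evenly across them. Hashing the file contents instead of the URL, as suggested in the quoted message, would place duplicate files in the same directory.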
--- In billiontriples@yahoogroups.com, Andreas Harth
<andreas.harth@...> wrote:
>
> Hi Peter,
>
> Peter Mika wrote:
> > I like this solution as well, the only thing I'm slightly worried about
> > now is what happens when you unzip a large number of files. My extended
> > suggestion is thus to take the SHA1 sum of the URL and create
> > subdirectories based on that, say three level deep. For example, take a
> > file with URL
> >
> > URL = http://challenge.semanticweb.org/somefile.rdf
> >
> > Now we could take the checksum of the URL or the checksum of the contents:
> >
> > checksum = ABCDEFG0123456789
> >
> > and the file would go in directory
> >
> > /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
> >
> > If we take the checksum on the contents of the file and create enough
> > levels, we can also make sure that files that are duplicates end up in
> > the same subdirectory regardless of the URL.
> >
> > What do you think?
> >
>
> from my experience, file systems will have trouble at some point when
> there are too many files around. Thus, we avoid writing individual files
> to the file system.
>
> What worked here is:
>
> put source files into ZIP archives with URI urlencoded as filename
> for each file in the ZIP archive:
> process file
>
> That way, we never have to actually put all files on the filesystem,
> but do (de)compression on the fly. If we use command line tools in
> the process, we iterate over the ZIP contents, write one file to disk,
> process the file with the command line tool, and remove the file
> again.
>
> The nice thing about ZIP archives is that you can access them from
> within any programming language (we've tried Java and Python).
>
> Regards,
> Andreas.
>
> --
> http://harth.org/andreas/
>
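The ZIP-based approach described in the quoted message (URL-encoded URI as the entry name, entries processed one at a time without unpacking to disk) might look roughly like this in Java. It is a sketch under assumptions: `ZipWalker` and its methods are hypothetical names, the "processing" step only counts bytes, and the sample archive is built in memory so that the example runs without any files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: streams over a ZIP archive whose entry names are URL-encoded URIs. */
final class ZipWalker {

    /** Maps each decoded source URI to the number of bytes in its entry. */
    static Map<String, Long> entrySizes(InputStream zipData) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String uri = URLDecoder.decode(entry.getName(), "UTF-8");
                long size = 0;
                int n;
                // "Process file": here we only count the decompressed bytes.
                while ((n = zip.read(buffer)) != -1) {
                    size += n;
                }
                sizes.put(uri, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    /** Builds a two-entry archive in memory so the sketch runs without any files. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(URLEncoder.encode("http://example.org/a.rdf", "UTF-8")));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(URLEncoder.encode("http://example.org/b.rdf", "UTF-8")));
            zip.write("hi".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(entrySizes(new ByteArrayInputStream(sampleArchive())));
    }
}
```

Since each entry is decompressed on the fly, the file system never has to hold more than the archive itself, which is the point made in the quoted message.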
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Dear All,
>
> After some long and careful consideration, we have made the decision not
> to invent our own format for exchanging data but to rely on an existing
> format known as WARC [1], in particular WARC version 0.9. WARC archives
> store provenance (URL) and timestamp in the header. The only additional
> agreement we need to make is that we are going to encode files in
> N-Triples format. (If that is a problem, let us know.)
>
> What convinced us ultimately about WARC is the excellent tool support in
> the form of a Java API from the Laboratory for Web Algorithmics [2] of
> the Università degli studi di Milano <http://www.unimi.it/>. The API can
> be downloaded from [3] and there is a separate tarball with all the
> dependencies. (The license is LGPL.) One of the nice features of this
> API is the ability to work with streams of compressed WARC records,
> where metadata about each record is stored in the gzip header. This
> means that the metadata can be read without uncompressing the content of
> the record itself. Further, there are skip pointers in the file, which
> means that a record can be easily skipped over.
>
> To make it really easy, I've also created sample code that demonstrates
> how to create WARC archives from a set of files or a directory structure
> on disk, and how to read back the resulting WARC archive. The code is
> simply attached to this email, if all is well. (First time I send
> attachments to a Y! Group.) Many thanks to Sebastiano Vigna, one of the
> authors of the LAW API, for his help and advice.
>
> To support the Challenge, we at Yahoo! Research Barcelona are also hard
> at work to get permission to release a microformat crawl of 100 million
> triples. We hope this will be a significant contribution to the
> state-of-the-art and will complement the existing data sets to be
> provided by Semantic Web search engines.
>
> As always, your comments and questions are more than appreciated. In
> particular those of you planning to provide some data, please let us
> know if you need any further help.
>
> Thanks,
> Peter
>
> [1] http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html
> [2] http://law.dsi.unimi.it/
> [3] http://law.dsi.unimi.it/index.php?option=com_content&task=section&id=5&Itemid=42
>
> package com.yahoo.corp.barcelona.billiontriples;
>
> import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
> import it.unimi.dsi.fastutil.io.FastBufferedOutputStream;
> import it.unimi.dsi.law.warc.io.GZWarcRecord;
> import it.unimi.dsi.law.warc.io.WarcRecord;
> import it.unimi.dsi.law.warc.util.BURL;
> import it.unimi.dsi.law.warc.util.BasicHttpResponse;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.io.UnsupportedEncodingException;
> import java.net.URLDecoder;
> import java.util.Date;
>
> /**
>  * Sample code for creating Warc packages. This class is executable.
>  *
>  * @author pmika@...
>  */
> public class WarcPackager {
>
>     /** Maximum number of files and directories to visit; -1 means no limit. */
>     public final static int MAX_RECORDS = -1;
>
>     private int count = 0;
>
>     // MODIFY THIS if your filenames are not URLs
>     protected BURL getURL(File file) {
>         BURL result = null;
>         try {
>             result = BURL.parse(URLDecoder.decode(file.getName(), "UTF-8"));
>         } catch (UnsupportedEncodingException e) {
>             e.printStackTrace();
>         }
>         return result;
>     }
>
>     // MODIFY THIS if the last modification date of the file != crawl date
>     protected Date getDate(File file) {
>         return new Date(file.lastModified());
>     }
>
>     private WarcRecord createRecord(File file) throws IOException {
>         // Validate URL and date first, so a failed check cannot leak an open stream
>         BURL url = getURL(file);
>         if (url == null) {
>             throw new IllegalArgumentException("getURL() returned null for " + file);
>         }
>
>         Date date = getDate(file);
>         if (date == null) {
>             throw new IllegalArgumentException("getDate() returned null for " + file);
>         }
>
>         InputStream fis = new FileInputStream(file);
>
>         BasicHttpResponse response = new BasicHttpResponse();
>         response.url(url);
>         response.statusLine("HTTP/1.1 200 OK");
>         response.status(200);
>         response.contentAsStream(new FastBufferedInputStream(fis));
>
>         GZWarcRecord result = new GZWarcRecord();
>         response.toWarcRecord(result);
>         result.header.creationDate = date;
>
>         return result;
>     }
>
>     // Recursive
>     public void processFileOrDir(OutputStream out, File file) throws IOException {
>
>         // If MAX_RECORDS is specified and we've reached the limit, return
>         if (MAX_RECORDS != -1 && count >= MAX_RECORDS) {
>             return;
>         }
>
>         // Report progress every 100000 entries
>         if (++count % 100000 == 0) {
>             System.err.println("Processed " + count + " files.");
>         }
>
>         if (file.isDirectory()) {
>             String[] names = file.list();
>             if (names == null) {
>                 // list() returns null if the directory cannot be read
>                 System.err.println("Cannot list directory " + file);
>                 return;
>             }
>             for (String name : names) {
>                 processFileOrDir(out, new File(file, name));
>             }
>         } else {
>             // Catch exceptions: failure to write a single file should not make us abort
>             try {
>                 WarcRecord record = createRecord(file);
>                 record.write(out);
>             } catch (Exception e) {
>                 System.err.println(e);
>             }
>         }
>     }
>
>     /**
>      * Package the files or directories passed in as arguments.
>      * Directories are processed recursively.
>      *
>      * The result is printed to standard out, errors/diagnostic messages to std err.
>      *
>      * @param args files or directories to package
>      * @throws IOException if writing to standard out fails
>      */
>     public static void main(String[] args) throws IOException {
>
>         if (args.length < 1) {
>             System.err.println("Usage: WarcPackager <fileOrDir> ...");
>             return;
>         }
>
>         FastBufferedOutputStream out = new FastBufferedOutputStream(System.out);
>         WarcPackager packager = new WarcPackager();
>
>         for (String arg : args) {
>             packager.processFileOrDir(out, new File(arg));
>         }
>
>         out.close();
>     }
> }
>
> package com.yahoo.corp.barcelona.billiontriples;
>
> import it.unimi.dsi.fastutil.io.FastBufferedInputStream;
> import it.unimi.dsi.fastutil.io.MeasurableInputStream;
> import it.unimi.dsi.law.warc.filters.Filter;
> import it.unimi.dsi.law.warc.filters.Filters;
> import it.unimi.dsi.law.warc.io.GZWarcRecord;
> import it.unimi.dsi.law.warc.io.WarcFilteredIterator;
> import it.unimi.dsi.law.warc.io.WarcRecord;
> import it.unimi.dsi.law.warc.util.BURL;
> import it.unimi.dsi.law.warc.util.WarcHttpResponse;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.IOException;
>
> import org.openrdf.model.Statement;
> import org.openrdf.rio.RDFHandlerException;
> import org.openrdf.rio.RDFParseException;
> import org.openrdf.rio.helpers.RDFHandlerBase;
> import org.openrdf.rio.ntriples.NTriplesParser;
>
> /**
>  * Sample code for reading Warc packages.
>  *
>  * This class is executable.
>  *
>  * @author pmika@...
>  */
> public class WarcReader {
>
>     private NTriplesParser parser = new NTriplesParser();
>
>     private CountHandler countHandler = new CountHandler();
>
>     private int tripleCount = 0;
>     private int lineCount = 0;
>
>     /** RDF handler that counts the statements of one parse run. */
>     public class CountHandler extends RDFHandlerBase {
>
>         private int count = 0;
>
>         public void endRDF() throws RDFHandlerException {
>             super.endRDF();
>             //System.out.println("Counted " + count + " statements.");
>         }
>
>         public void handleStatement(Statement st) {
>             count++;
>         }
>
>         public void startRDF() throws RDFHandlerException {
>             super.startRDF();
>             count = 0;
>         }
>     }
>
>     /** Filter that accepts every URL. */
>     public static class TrueFilter extends Filter<BURL> {
>
>         @Override
>         public boolean accept( BURL x ) {
>             return true;
>         }
>
>         @Override
>         public String toExternalForm() {
>             return "true";
>         }
>     }
>
>     /** Parses the block as N-Triples and adds its statement count to tripleCount. */
>     public void countTriples(MeasurableInputStream block, String base) {
>         parser.setRDFHandler(countHandler);
>
>         try {
>             parser.parse(block, base);
>             tripleCount += countHandler.count;
>         } catch (RDFParseException e) {
>             e.printStackTrace();
>         } catch (RDFHandlerException e) {
>             e.printStackTrace();
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
>
>     /** Counts newline characters in the block and adds them to lineCount. */
>     public void countLines(MeasurableInputStream block) throws IOException {
>         int c;
>         while ((c = block.read()) != -1) {
>             if (c == '\n') {
>                 lineCount++;
>             }
>         }
>     }
>
>     /** Copies the block to standard out. */
>     public void dumpContent(MeasurableInputStream block) throws IOException {
>         int c;
>         while ((c = block.read()) != -1) {
>             System.out.write(c);
>         }
>     }
>
>     /**
>      * @param args path of the WARC file to read
>      * @throws FileNotFoundException if the file does not exist
>      */
>     public static void main(String[] args) throws FileNotFoundException {
>         if (args.length < 1) {
>             System.err.println("Usage: WarcReader <file>");
>             return;
>         }
>
>         final FastBufferedInputStream in = new FastBufferedInputStream(new FileInputStream(new File(args[0])));
>         GZWarcRecord record = new GZWarcRecord();
>         Filter<WarcRecord> filter = Filters.adaptFilterBURL2WarcRecord(new TrueFilter());
>         WarcFilteredIterator it = new WarcFilteredIterator(in, record, filter);
>         int urlCount = 0;
>
>         WarcReader reader = new WarcReader();
>         WarcHttpResponse response = new WarcHttpResponse();
>         try {
>             while (it.hasNext()) {
>
>                 // Report progress every 100000 records
>                 if (++urlCount % 100000 == 0) {
>                     System.err.println("Processed " + urlCount + " files.");
>                 }
>
>                 WarcRecord nextRecord = it.next();
>                 //Get the HttpResponse
>                 try {
>                     response.fromWarcRecord(nextRecord);
>                     System.out.println("Processing: " + nextRecord.header.subjectUri);
>
>                     //This will dump the content of the record
>                     //reader.dumpContent(response.contentAsStream());
>
>                     //This will count the number of triples by parsing the RDF
>                     //reader.countTriples(response.contentAsStream(), nextRecord.header.subjectUri.toString());
>
>                     //This will count the number of lines, which is equivalent to
>                     //the number of triples in N-Triples format
>                     reader.countLines(response.contentAsStream());
>                 } catch (IOException e) {
>                     // A single unreadable record should not stop the run
>                     e.printStackTrace();
>                 }
>             }
>         } catch (RuntimeException re) {
>             // Report unexpected failures instead of silently swallowing them
>             re.printStackTrace();
>         }
>
>         System.out.println("Counted " + reader.lineCount + " triples from " + urlCount + " urls.");
>     }
> }
>
On Mon, May 12, 2008 at 4:30 PM, serendipity588 <pmika@...> wrote:
Hi Kostis,
I've checked and indeed there seems to have been some problem with the
archive toward the end of the set. I've re-uploaded the data, please
let me know if you still have problems.
Thanks,
Peter
--- In billiontriples@yahoogroups.com, "Kostis Kyzirakos" <kkyzir@...>
wrote:
Hi All,
I have some trouble downloading the US Census data. The md5sum
signature seems ok, but the extraction fails after extracting
sumfile-towns-3-65.nt...
I downloaded the file 3 times (md5sum was always correct), but the
same error occurs:
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Anybody having the same problem?
Regards,
Kostis
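A matching md5sum combined with a failing extraction suggests that the file was already incomplete when its checksum was computed, so the checksum alone cannot detect the problem. A minimal sketch for checking a gzip stream end to end before extracting it follows; `GzipCheck` is a hypothetical helper, and the sample data is generated in memory rather than taken from the Census archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: checks whether a gzip stream can be read to its end. */
final class GzipCheck {

    /** Returns true if the stream decompresses completely without error. */
    static boolean isComplete(InputStream compressed) {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            while (gz.read(buffer) != -1) {
                // Discard the data: we only care whether the stream ends cleanly.
            }
            return true;
        } catch (IOException e) {
            // Truncation (EOFException) and corruption (ZipException) both end up here.
            return false;
        }
    }

    /** Compresses the given bytes; used only to build sample input. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] full = gzip("some sample content, repeated: abc abc abc".getBytes(StandardCharsets.UTF_8));
        byte[] truncated = Arrays.copyOf(full, full.length / 2);
        System.out.println("full: " + isComplete(new ByteArrayInputStream(full)));
        System.out.println("truncated: " + isComplete(new ByteArrayInputStream(truncated)));
    }
}
```

On the command line, `gzip -t <file>` performs the same kind of integrity test.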
--- In billiontriples@yahoogroups.com, Peter Mika <pmika@...> wrote:
>
> Hi All,
>
> The CFP and the dataset for the Billion Triples Challenge have been
> posted at [1]. Please let us know of any immediate problems you see with
> accessing the dataset.
>
> From this point on we are also looking for volunteers who would like to
> host this data and provide query access to it, e.g. in the form of
> SPARQL endpoints.
>
> Thanks,
> Peter
>
>
> [1] http://challenge.semanticweb.org
>
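A note on the failure reported above: a matching md5sum only shows
that the download is identical to the file on the server. If the
uploaded archive was itself cut short (which the later re-upload
suggests, though the thread does not say so outright), the checksum
still matches and the extraction still fails. The sketch below, using
only the Python standard library, reads every member of a .tar.gz to
the end and reports the last member reached. The function names and
the file name in the usage note are made up for illustration and are
not part of the challenge tooling.

```python
import hashlib
import tarfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_tar_gz(path):
    """Read every member of a .tar.gz archive through to the end.

    Returns (ok, last_member_name, error_message). A truncated archive
    yields ok == False together with the last member that was reached.
    """
    last_member = None
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                last_member = member.name
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Drain the member so a short gzip stream is detected.
                    while extracted.read(1 << 20):
                        pass
    except (EOFError, tarfile.TarError, OSError) as exc:
        return False, last_member, str(exc)
    return True, last_member, None
```

Calling check_tar_gz("uscensus.tar.gz") (a hypothetical file name) on a
truncated download should return False along with the name of the last
member it managed to reach, which helps when reporting the problem to
whoever hosts the data.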
Hi Steve,
The data providers support the Challenge by making the data available
for the purposes of the Challenge, which was a criterion for inclusion.
If you would like to use some or all of the data for different purposes,
you may need to talk to the owners of the data.
Cheers,
Peter
Steve Harris wrote:
>
> On 6 May 2008, at 12:35, Peter Mika wrote:
>
> > Hi All,
> >
> > The CFP and the dataset for the Billion Triples Challenge have been
> > > posted at [1]. Please let us know of any immediate problems you see
> > > with accessing the dataset.
>
> There's no licence listed for the majority of the data.
>
> That makes it problematic to use.
>
> - Steve
>
> --
> Steve Harris
> Garlik Limited
> 2 Sheen Road
> Richmond TW9 1AE
>
> T +44(0)20 8973 2465
> F +44(0)20 8973 2301
> www.garlik.com
>
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
> 9AD
>
>
Hi Andreas,
Thanks! I've made the change.
And although I haven't spelled this out in my previous email: a big
hearty thanks to all of you who have provided data for the Challenge.
Best,
Peter
Andreas Harth wrote:
>
> Hi Peter,
>
> Peter Mika wrote:
> > The CFP and the dataset for the Billion Triples Challenge have been
> > posted at [1]. Please let us know of any immediate problems you see
> > with accessing the dataset.
> >
> > From this point on we are also looking for volunteers who would like to
> > host this data and provide query access to it, e.g. in the form of
> > SPARQL endpoints.
> >
> it would be great if you could re-name the YARS-1 and YARS-2 dataset
> to SWSE-1 and SWSE-2 since YARS denotes the RDF store and SWSE
> the search engine. Thanks!
>
> We'll download the data and are willing to provide SPARQL access
> to the datasets. I'll let you know once the endpoint is available.
>
> Regards,
> Andreas.
>
> --
> http://swse.deri.org/
>
>
On 6 May 2008, at 12:35, Peter Mika wrote:
> Hi All,
>
> The CFP and the dataset for the Billion Triples Challenge have been
> posted at [1]. Please let us know of any immediate problems you see
> with accessing the dataset.
There's no licence listed for the majority of the data.
That makes it problematic to use.
- Steve
--
Steve Harris
Garlik Limited
2 Sheen Road
Richmond TW9 1AE
T +44(0)20 8973 2465
F +44(0)20 8973 2301
www.garlik.com
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
9AD
Hi Peter,
Peter Mika wrote:
> The CFP and the dataset for the Billion Triples Challenge have been
> posted at [1]. Please let us know of any immediate problems you see with
> accessing the dataset.
>
> From this point on we are also looking for volunteers who would like to
> host this data and provide query access to it, e.g. in the form of
> SPARQL endpoints.
>
it would be great if you could re-name the YARS-1 and YARS-2 dataset
to SWSE-1 and SWSE-2 since YARS denotes the RDF store and SWSE
the search engine. Thanks!
We'll download the data and are willing to provide SPARQL access
to the datasets. I'll let you know once the endpoint is available.
Regards,
Andreas.
--
http://swse.deri.org/
Hi All,
The CFP and the dataset for the Billion Triples Challenge have been
posted at [1]. Please let us know of any immediate problems you see with
accessing the dataset.
From this point on we are also looking for volunteers who would like to
host this data and provide query access to it, e.g. in the form of
SPARQL endpoints.
Thanks,
Peter
[1] http://challenge.semanticweb.org
On Mar 27, 2008, at 15:40, "avi.bernstein" <bernstein@...> wrote:
Dear Jim, dear all
May I propose another measuring criterion or facet of the challenge:
Can a user interactively do something useful with the data?
I think that it is great if we can store/retrieve/reason/etc. with a billion triples. But, ultimately, one of the challenges should be whether users can use it interactively for a useful task.
For this to work, storing/retrieving (e.g., SPARQL), reasoning, processing linked data, etc. might be useful prerequisites.
Best
Avi
--- In billiontriples@yahoogroups.com, Jim Hendler <hendler@...> wrote:
>
> All-
> Peter feels that we now have the collection and distribution of the
> triples underway, which means he gets to make me do some work finally...
> My role at the moment is to figure out what we would like to make
> the challenge part of the challenge be,
> Here are some thoughts, I welcome feedback
> We see four, very non disjoint audiences for the challenge (in
> fact, Peter, me, and most of the people on this list are in at least
> several categories):
> Triple store developers, linked data technology developers, Semantic
> Web researchers interested in scalable reasoning, ontology-based
> research groups
>
> Here are some of my thoughts with respect to these
>
> A - Triple Store Developers
> We do not want this to be a "triple store shootout" in the sense
> of who can process a query fastest or such. We don't see that
> competition as being all that useful at a time when people are still
> very much in development mode. Rather, we would like the outcome of
> this event to be a realization in the outside world that triple-stores
> can and do handle these sorts of numbers (the DB folks still say
> "triple stores break at a million triples" at conferences I go to - I
> have no idea where they get that, but let's push it up a few orders of
> magnitude!!)
> So at the moment my thinking on this area is that we would like to
> give you folks bragging rights for being able to support systems other
> people develop (i.e. any of you who host this data and make it
> available via SPARQL should be listed as "winners" in some way)
> I also think that if some interesting, large, and complex SPARQL
> queries are developed against this dataset (say including filters and
> optionals), then those would become useful benchmarks, so we would
> like to find a way to encourage the sharing of these (maybe for a
> future date when a benchmarking shootout would be more appropriate)
>
> B - Linked data technology developers
> We write a lot about the Semantic Web as being the Web of linked
> data, but to date, in practice, most of that data is either within an
> enterprise or locked in a particular application. We are purposely
> designing this dataset to be very heterogeneous, but with many
> connections between pieces, so it should be a great dataset for
> showing off tools that can exploit the dataweb.
> In this area we are thinking of having some goals like "visualize
> (or browse) the dataweb", Datamining of this sort of data, etc. --
> seems to us this is a ripe area for a challenge
>
> C - SW researchers interested in scalable reasoning
> The data set we are developing will include a (large) number of
> triples tied to FOAF, DOAP and other "small o" ontologies. We also
> have a lot of data that will be made available that was crawled from
> microformats (where the "semantics" are well specified). This is thus
> an ideal proving grounds for the "little semantics goes a long way"
> philosophy, and thus this also seems like an appropriate challenge area
>
> D - Ontology research
> Big A-Box, you got it! Show us something.
>
> So, I think we will have the "competition" be fairly unspecified - we
> will identify several areas of interest from the above and work out
> how to tie that into an "announcible" competition.
>
> I welcome, NEED, your feedback on this
> -Jim H.
>
>
>
>
> "If we knew what we were doing, it wouldn't be called research, would
> it?." - Albert Einstein
>
> Prof James Hendler http://www.cs.rpi.edu/~hendler
> Tetherless World Constellation Chair
> Computer Science Dept
> Rensselaer Polytechnic Institute, Troy NY 12180
>
Here are my views:
Triple Store:
The big problem with the Semantic Web, no matter how big the promises it
makes, is the number of triples that can be stored and dealt with. As
the number of triples increases, developers run into resource problems.
So the question is: how can I work with a billion triples? I am not
backed by an organization that can give me the resources for working on
the big B of billion. Do we have a sandbox?
System requirements and benchmarking criteria are not clear.
Linked data:
This is most probably the best part of realizing the Semantic Web, and
I hope some killer apps are going to be developed that will make people
think "This is the reason to shift to the Semantic Web!" Until now, the
Semantic Web has been just academic hype.
Reasoning:
Reasoning comes after the triple store. Resource problem again!
Ontology Research:
Could you fine-tune this section? Is it the creation of new ontologies,
the creation of a new language, or something else?
---
Amit Krishna Joshi
--- In billiontriples@yahoogroups.com, Jim Hendler <hendler@...> wrote:
>
> All-
> Peter feels that we now have the collection and distribution of the
> triples underway, which means he gets to make me do some work finally...
> My role at the moment is to figure out what we would like to make
> the challenge part of the challenge be,
> Here are some thoughts, I welcome feedback
> We see four, very non disjoint audiences for the challenge (in
> fact, Peter, me, and most of the people on this list are in at least
> several categories):
> Triple store developers, linked data technology developers, Semantic
> Web researchers interested in scalable reasoning, ontology-based
> research groups
>
> Here are some of my thoughts with respect to these
>
> A - Triple Store Developers
> We do not want this to be a "triple store shootout" in the sense
> of who can process a query fastest or such. We don't see that
> competition as being all that useful at a time when people are still
> very much in development mode. Rather, we would like the outcome of
> this event to be a realization in the outside world that triple-stores
> can and do handle these sorts of numbers (the DB folks still say
> "triple stores break at a million triples" at conferences I go to - I
> have no idea where they get that, but let's push it up a few orders of
> magnitude!!)
> So at the moment my thinking on this area is that we would like to
> give you folks bragging rights for being able to support systems other
> people develop (i.e. any of you who host this data and make it
> available via SPARQL should be listed as "winners" in some way)
> I also think that if some interesting, large, and complex SPARQL
> queries are developed against this dataset (say including filters and
> optionals), then those would become useful benchmarks, so we would
> like to find a way to encourage the sharing of these (maybe for a
> future date when a benchmarking shootout would be more appropriate)
>
> B - Linked data technology developers
> We write a lot about the Semantic Web as being the Web of linked
> data, but to date, in practice, most of that data is either within an
> enterprise or locked in a particular application. We are purposely
> designing this dataset to be very heterogeneous, but with many
> connections between pieces, so it should be a great dataset for
> showing off tools that can exploit the dataweb.
> In this area we are thinking of having some goals like "visualize
> (or browse) the dataweb", Datamining of this sort of data, etc. --
> seems to us this is a ripe area for a challenge
>
> C - SW researchers interested in scalable reasoning
> The data set we are developing will include a (large) number of
> triples tied to FOAF, DOAP and other "small o" ontologies. We also
> have a lot of data that will be made available that was crawled from
> microformats (where the "semantics" are well specified). This is thus
> an ideal proving grounds for the "little semantics goes a long way"
> philosophy, and thus this also seems like an appropriate challenge area
>
> D - Ontology research
> Big A-Box, you got it! Show us something.
>
> So, I think we will have the "competition" be fairly unspecified - we
> will identify several areas of interest from the above and work out
> how to tie that into an "announcible" competition.
>
> I welcome, NEED, your feedback on this
> -Jim H.
>
>
>
>
> "If we knew what we were doing, it wouldn't be called research, would
> it?." - Albert Einstein
>
> Prof James Hendler http://www.cs.rpi.edu/~hendler
> Tetherless World Constellation Chair
> Computer Science Dept
> Rensselaer Polytechnic Institute, Troy NY 12180
>
All-
Peter feels that we now have the collection and distribution of the
triples underway, which means he gets to make me do some work finally...
My role at the moment is to figure out what we would like to make
the challenge part of the challenge be.
Here are some thoughts; I welcome feedback.
We see four, very non-disjoint audiences for the challenge (in
fact, Peter, me, and most of the people on this list are in at least
several categories):
Triple store developers, linked data technology developers, Semantic
Web researchers interested in scalable reasoning, ontology-based
research groups
Here are some of my thoughts with respect to these
A - Triple Store Developers
We do not want this to be a "triple store shootout" in the sense
of who can process a query fastest or such. We don't see that
competition as being all that useful at a time when people are still
very much in development mode. Rather, we would like the outcome of
this event to be a realization in the outside world that triple-stores
can and do handle these sorts of numbers (the DB folks still say
"triple stores break at a million triples" at conferences I go to - I
have no idea where they get that, but let's push it up a few orders of
magnitude!!)
So at the moment my thinking on this area is that we would like to
give you folks bragging rights for being able to support systems other
people develop (i.e. any of you who host this data and make it
available via SPARQL should be listed as "winners" in some way)
I also think that if some interesting, large, and complex SPARQL
queries are developed against this dataset (say including filters and
optionals), then those would become useful benchmarks, so we would
like to find a way to encourage the sharing of these (maybe for a
future date when a benchmarking shootout would be more appropriate)
B - Linked data technology developers
We write a lot about the Semantic Web as being the Web of linked
data, but to date, in practice, most of that data is either within an
enterprise or locked in a particular application. We are purposely
designing this dataset to be very heterogeneous, but with many
connections between pieces, so it should be a great dataset for
showing off tools that can exploit the dataweb.
In this area we are thinking of having some goals like "visualize
(or browse) the dataweb", Datamining of this sort of data, etc. --
seems to us this is a ripe area for a challenge
C - SW researchers interested in scalable reasoning
The data set we are developing will include a (large) number of
triples tied to FOAF, DOAP and other "small o" ontologies. We also
have a lot of data that will be made available that was crawled from
microformats (where the "semantics" are well specified). This is thus
an ideal proving grounds for the "little semantics goes a long way"
philosophy, and thus this also seems like an appropriate challenge area
D - Ontology research
Big A-Box, you got it! Show us something.
So, I think we will have the "competition" be fairly unspecified - we
will identify several areas of interest from the above and work out
how to tie that into an "announcible" competition.
I welcome, NEED, your feedback on this
-Jim H.
"If we knew what we were doing, it wouldn't be called research, would
it?." - Albert Einstein
Prof James Hendler http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180