billiontriples · The Billion Triples Challenge
Message #82 of 141
Re: data format

Tried-and-true tree-based filesystems provide services for a wealth of
carrier-class functions.

Use digest-based filenames to populate the namespace evenly and to enable
fixed-length strings as labels, like inodes.
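
A minimal sketch of such a layout, assuming Python and a SHA-1 digest of the URL (the function name `digest_path` and the choice of one hex character per directory level are illustrative assumptions, not something specified in this thread):

```python
import hashlib
from pathlib import PurePosixPath


def digest_path(url: str, levels: int = 3) -> PurePosixPath:
    """Map a URL to a relative path such as a/b/c/<40-char hex digest>.

    Hypothetical helper: the digest is the fixed-length filename, and its
    leading hex characters name the nested directories.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # One hex character per level gives at most 16 entries per directory
    # level, and the digest spreads files evenly across them.
    prefix = list(digest[:levels])
    return PurePosixPath(*prefix, digest)


path = digest_path("http://challenge.semanticweb.org/somefile.rdf")
```

Hashing the file contents instead of the URL would place byte-identical duplicates at the same path, as suggested in the quoted message below.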

Use hierarchically mounted spindles instead of constrained access to
specialty-class storage.

Best
Jim



--- In billiontriples@yahoogroups.com, Andreas Harth
<andreas.harth@...> wrote:
>
> Hi Peter,
>
> Peter Mika wrote:
> > I like this solution as well, the only thing I'm slightly worried about
> > now is what happens when you unzip a large number of files. My extended
> > suggestion is thus to take the SHA1 sum of the URL and create
> > subdirectories based on that, say three levels deep. For example, take a
> > file with URL
> >
> > URL = http://challenge.semanticweb.org/somefile.rdf
> >
> > Now we could take the checksum of the URL or the checksum of the contents:
> >
> > checksum = ABCDEFG0123456789
> >
> > and the file would go in directory
> >
> > /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
> >
> > If we take the checksum on the contents of the file and create enough
> > levels, we can also make sure that files that are duplicates end up in
> > the same subdirectory regardless of the URL.
> >
> > What do you think?
> >
>
> from my experience, file systems will have trouble at some point when
> there are too many files around. Thus, we avoid writing individual files
> to the file system.
>
> What worked here is:
>
> put source files into ZIP archives with URI urlencoded as filename
> for each file in the ZIP archive:
>     process file
>
> That way, we never have to actually put all files on the filesystem,
> but do (de)compression on the fly. If we use command line tools in
> the process, we iterate over the ZIP contents, write one file to disk,
> process the file with the command line tool, and remove the file
> again.
>
> The nice thing about ZIP archives is that you can access them from
> within any programming language (we've tried Java and Python).
>
> Regards,
> Andreas.
>
> --
> http://harth.org/andreas/
>
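
The ZIP-based workflow described in the quoted message could look roughly like this in Python, using only the standard library. The helper names `write_archive` and `iter_archive` are hypothetical, and this is a sketch of the idea, not the code Andreas's group actually used:

```python
import zipfile
from urllib.parse import quote, unquote


def write_archive(zip_path, documents):
    """Store each document in the ZIP under its URL-encoded URI."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for uri, data in documents.items():
            # safe="" also encodes "/" and ":", so the URI becomes one flat name.
            archive.writestr(quote(uri, safe=""), data)


def iter_archive(zip_path):
    """Yield (uri, content bytes) one member at a time.

    Members are decompressed in memory, so nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            yield unquote(info.filename), archive.read(info)
```

To run a command-line tool on each member, the loop body would write the bytes to a temporary file, invoke the tool, and delete the file before moving on to the next member.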





Sat May 31, 2008 2:38 am

james_northrup

Dear All, In the past few days we had talked to several of you about providing data for the billion triples challenge. I would like to start a brief discussion...
Peter Mika (serendipity588), Feb 1, 2008 5:22 pm

Hi Peter, I vote for option # 2, Jans...
jans.aasman (jannesaasman), Feb 1, 2008 6:10 pm

My two cents: In the spirit of RDF, why not provide a 'directory' triple file that has resources identifying each file and provides timestamps, provenance etc...
N. Sivaramakrishnan (k2_181), Feb 1, 2008 6:16 pm

Hi, I'm new to this discussion list. I will introduce myself, I'm Marc-Alexandre Nolin from the Bio2RDF project (http://bio2rdf.org). His the billions triples...
Marc-Alexandre Nolin (marc_alexand...), Feb 1, 2008 6:30 pm

Hi, ... triples ... Turtle ... as already discussed, I'd prefer this solution. Filenames in the ZIP archive are the url-encoded URI of the file. Actually,...
andreasharth, Feb 1, 2008 7:17 pm

Hi Andreas, I like this solution as well, the only thing I'm slightly worried about now is what happens when you unzip a large number of files. My extended ...
Peter Mika (serendipity588), Feb 7, 2008 4:48 pm

Hi Peter, ... from my experience, file systems will have trouble at some point when there are too many files around. Thus, we avoid writing individual files ...
Andreas Harth (andreasharth), Feb 7, 2008 6:29 pm

tried and true tree based filesystems provide services for a wealth of carrier class functions. use digest based filenames to evenly populate namespace and...
Jim Northrup (james_northrup), May 31, 2008 2:38 am

Hi list, ... I quite like this last solution for one, very selfish reason: this is very similar to the way the cache of Watson is organized. For example, ...
M.Daquin (mathieu_daquin), Feb 7, 2008 7:32 pm