Tried-and-true tree-based filesystems already provide services for a
wealth of carrier-class functions.
Use digest-based filenames to populate the namespace evenly and to get
fixed-length strings that serve as labels, much as inodes do.
Use hierarchically mounted spindles instead of constrained access to
specialty-class storage.
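
A minimal sketch of the digest-based naming in Python, assuming SHA1
over the URL and one hex character per directory level. The function
name sharded_path and the levels parameter are made up for
illustration; nothing in this thread fixes them.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(url: str, levels: int = 3) -> PurePosixPath:
    """Map a URL to <c0>/<c1>/<c2>/<sha1-hex-digest> (hypothetical layout)."""
    # A SHA1 hex digest is always 40 characters, so every label has the
    # same length no matter how long the URL is.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # The leading hex characters pick the subdirectories; because digest
    # characters are close to uniformly distributed, the tree fills evenly.
    shards = list(digest[:levels])
    return PurePosixPath(*shards, digest)


if __name__ == "__main__":
    print(sharded_path("http://challenge.semanticweb.org/somefile.rdf"))
```

With three levels of one hex character each there are 16^3 = 4096 leaf
directories. Hashing the file contents instead of the URL would put
duplicate files at the same path.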
Best
Jim
--- In billiontriples@yahoogroups.com, Andreas Harth
<andreas.harth@...> wrote:
>
> Hi Peter,
>
> Peter Mika wrote:
> > I like this solution as well; the only thing I'm slightly worried
> > about now is what happens when you unzip a large number of files. My
> > extended suggestion is thus to take the SHA1 sum of the URL and
> > create subdirectories based on that, say three levels deep. For
> > example, take a file with URL
> >
> > URL = http://challenge.semanticweb.org/somefile.rdf
> >
> > Now we could take the checksum of the URL or the checksum of the
> > contents:
> >
> > checksum = ABCDEFG0123456789
> >
> > and the file would go in directory
> >
> > /A/B/C/http%3A%2F%2Fchallenge%2Esemanticweb%2Eorg%2Fsomefile%2Erdf%0D%0A
> >
> > If we take the checksum on the contents of the file and create enough
> > levels, we can also make sure that files that are duplicates end up
> > in the same subdirectory regardless of the URL.
> >
> > What do you think?
> >
>
> From my experience, file systems will have trouble at some point when
> there are too many files around. Thus, we avoid writing individual
> files to the file system.
>
> What worked here is:
>
>   put source files into ZIP archives with URI urlencoded as filename
>   for each file in the ZIP archive:
>       process file
>
> That way, we never have to actually put all files on the filesystem,
> but do (de)compression on the fly. If we use command line tools in
> the process, we iterate over the ZIP contents, write one file to disk,
> process the file with the command line tool, and remove the file
> again.
>
> The nice thing about ZIP archives is that you can access them from
> within any programming language (we've tried Java and Python).
>
> Regards,
> Andreas.
>
> --
> http://harth.org/andreas/
>
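
The ZIP workflow Andreas describes above could look roughly like this
in Python, using only the standard library. This is a sketch under
assumptions, not his actual code: add_source and process_archive are
hypothetical names, and process stands for whatever callable wraps the
command line tool and takes a path on disk.

```python
import os
import tempfile
import zipfile
from urllib.parse import quote, unquote


def add_source(archive_path, uri, data):
    """Store one source file in the archive under its urlencoded URI."""
    # Mode "a" appends to an existing archive or creates a new one.
    with zipfile.ZipFile(archive_path, "a", zipfile.ZIP_DEFLATED) as zf:
        # safe="" also encodes "/" so the member name has no directories.
        zf.writestr(quote(uri, safe=""), data)


def process_archive(archive_path, process):
    """Run process(path) on each member, one temporary file at a time."""
    results = {}
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            uri = unquote(info.filename)
            # Write exactly one member to disk ...
            with tempfile.NamedTemporaryFile(delete=False) as tmp:
                tmp.write(zf.read(info))
                tmp_path = tmp.name
            try:
                # ... hand it to the tool ...
                results[uri] = process(tmp_path)
            finally:
                # ... and remove it again before moving on.
                os.remove(tmp_path)
    return results
```

At most one extracted file exists on disk at any time, so the number
of files the filesystem sees stays constant however many sources the
archive holds.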