Hi, Xiaoming.
As I mentioned to Herbert when he raised this issue in an
off-list email, I agree that there's a problem with the meaning of
the MIME field in ARC files. As you note, it isn't really the MIME type
of the following data, but rather of some (protocol-specific) subpart of the
following data.
If it were truly a MIME field for the following ARC record, it should
probably be something like RFC2616's "message/http" type.
Instead, it may make sense to consider the "MIME" field in ARC version 1
records as being "URI-scheme-specific-information". In the case of "http" URIs,
it is the MIME type of the content-body, but no assumptions should be made
about what it describes for other captures.
This means that any reader must be sensitive to the URI schemes of records,
but that seems appropriate to me. If a reader does not recognize the URI
scheme in a record, it should not assume that it can make any sense of the
record content.
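
To illustrate, here is a minimal Python sketch of a reader that is sensitive
to the URI scheme in this way. It is not taken from any existing ARC tool:
the function names are hypothetical, and it assumes the version 1 URL-record
header is a single line of five space-separated fields (URL, IP address,
archive date, content type, length) and that an "http" record holds the raw
HTTP response, with headers ending at the first blank line.

```python
def parse_arc_v1_header(line):
    """Split an ARC v1 URL-record header line into its five fields.

    Assumes the fields are separated by single spaces and that the URL
    itself contains no spaces.
    """
    url, ip, date, mime, length = line.rstrip("\r\n").split(" ")
    return {"url": url, "ip": ip, "date": date,
            "mime": mime, "length": int(length)}


def uri_scheme(url):
    """Return the lowercased scheme, i.e. the part before the first colon."""
    return url.partition(":")[0].lower()


def content_body(header, record_bytes):
    """Return (mime_type, body) for a record, depending on its URI scheme.

    For "http" records the header's MIME field describes the HTTP
    content-body, so the protocol headers are skipped first. For any other
    scheme this sketch makes no assumption: it returns None as the type and
    hands back the record bytes untouched.
    """
    if uri_scheme(header["url"]) == "http":
        _head, sep, body = record_bytes.partition(b"\r\n\r\n")
        if not sep:
            raise ValueError("no end of HTTP headers found in record")
        return header["mime"], body
    return None, record_bytes
```

With this sketch, a record for a "doi:" URI would come back with a type of
None, which signals to the caller that it should not try to interpret the
bytes on the strength of the MIME field alone.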
This should be cleaned up in any future revision of the ARC format. Any
progress towards such a revision -- such as concrete proposals -- will be
discussed on this list, though the eventual decision to endorse a newer
format in preference to classic ARCs could arise from discussions inside
the IIPC (<http://www.netpreserve.org>), the consortium of national
libraries working on a number of these issues.
- Gordon @ IA
xmliu_23508 wrote:
> Hi, All,
>
> We are exploring the use of the ARC file format to store our
> collection of digital scholarly assets
> (mainly scholarly journal articles and related information from
> scientific publishers). Gordon Mohr has advised us to post questions
> in this list since there is discussion going on for future ARC extension.
>
> In the case of our work, the files (typically scholarly articles etc)
> would not really be retrieved from the network in the sense that we
> think the ARC specification assumes. In our case, files are obtained
> through many different ways, including shipment via tapes, ftp of tars
> of thousands of files, ... They are typically not 'harvested' as
> single files from the Web. We are looking into storing such files,
> obtained in whichever way, into ARC files. So, our files are very
> much 'local' files that do not really have a network location.
>
> As a result, an essential difference in our kind of use is that what
> we store is not protocol-related. The file we store has a URI (say a
> DOI), and it has a mime-type; immediately after the header we expect
> the actual 'content'. Thus we have the following questions:
>
> 1. In current ARC implementation, the mime-type in the header for a
> file stored in the ARC file provides information on what to expect
> immediately _after_ that protocol-specific part. Indeed, text/html is
> the mime-type of the actual 'content', and does not specify the
> mime-type of the protocol-specific information. However, to store
> local files, the mime-type really specifies what comes immediately after
> the header. We think there are subtle differences here.
>
> 2. For a 'reader' of the ARC file to understand that there will be no
> protocol-related information following the file header, the 'reader'
> would have to keep a list of all URI schemes for which indeed no
> protocol info will be provided -- such a 'reader' would probably be
> very hard to implement.
>
> So, somehow it seems to us that in order to allow for the
> implementation of ARC files in our use case (which could be a use case
> in many other archives that are not related to Web crawling), an extension
> of ARC format might be required with such possibilities: (1) some flags
> (probably one bit in 'reserved' field?) to indicate whether or not a
> 'local' or 'networked' document is archived, or, (2) separation
> between identifier and access protocol? For example, an archived file
> may have an http address, but is indeed archived by ftp or local file
> copy.
>
> I look forward to your feedback and guidance on these issues.
>
> many thanks,
>
> Xiaoming Liu
> digital library research & prototyping
> Los Alamos National Laboratory - Research Library