cross-posted from [Archive-access-discuss]
On Tue, 30 Jun 2009 13:47:51 -0400 Zhenzhen Xue <zjuzhenzhen@...>
wrote:
> I have some WARC files in hand which do not strictly follow
> the WARC standard. For example, each record ends with \n\n
> rather than \r\n\r\n.
interesting. is this really necessary?
> I found that the field "Content-Length" is always causing
> some problem.
what is the problem exactly? do you have an error message?
if the error came from a WARCReader class, it's likely from the
Heritrix common classes, which are included in NutchWAX. please
post your question to the Heritrix "archive-crawler" list for
more help. :)
> I want to know how Content-Length should be calculated. I calculated it
> as the number of characters starting from the line after
> "Content-Length" to the end of the record, which is \r\n\r\n.
Content-Length is defined in the WARC spec as "the number of
octets in the block, similar to [RFC2616]", and the block
is defined as part of the WARC record as follows:
    warc-file   = 1*warc-record
    warc-record = header CRLF
                  block CRLF CRLF
    header      = version warc-fields
    version     = "WARC/0.1718" CRLF
    warc-fields = *named-field CRLF
    block       = *OCTET
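so i think you want to count octets (bytes, not characters) in
the block only: start right after the blank line (CRLF) that
ends the header, not right after the "Content-Length" line, and
_exclude_ the two CRLFs after the block.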
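to make the counting rule concrete, here is a minimal sketch in Python. it is not the Heritrix or WARC Tools code; build_record and split_record are hypothetical helper names, and the "WARC/0.17" version string and the WARC-Type field are just placeholders. the only point is that Content-Length is the byte length of the block, and the trailing CRLF CRLF is added on top of it.

```python
CRLF = b"\r\n"


def build_record(version_line, fields, block):
    """Serialize one record as: header CRLF block CRLF CRLF."""
    lines = [version_line]
    lines += ["%s: %s" % (name, value) for name, value in fields]
    # Content-Length counts the octets of the block only,
    # not the CRLF CRLF that terminates the record.
    lines.append("Content-Length: %d" % len(block))
    header = b"".join(line.encode("utf-8") + CRLF for line in lines)
    return header + CRLF + block + CRLF + CRLF


def split_record(record):
    """Return (header, block) for one serialized record."""
    # The first blank line marks the end of the header.
    header, sep, rest = record.partition(CRLF + CRLF)
    if not sep:
        raise ValueError("no blank line after the header")
    length = None
    for line in header.split(CRLF)[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("no Content-Length field")
    block = rest[:length]
    # Exactly two CRLFs must follow the block.
    if rest[length:length + 4] != CRLF + CRLF:
        raise ValueError("block is not followed by CRLF CRLF")
    return header, block


# Placeholder data: a body whose character count and octet count differ.
body = "h\u00e9llo\n\n".encode("utf-8")
record = build_record("WARC/0.17", [("WARC-Type", "resource")], body)
header, block = split_record(record)
```

note that the body here ends with \n\n and contains a non-ASCII character, so counting characters, or scanning for a blank line to find the end of the record, would both give the wrong answer. reading exactly Content-Length octets avoids both problems.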
the WARCRecord class is designed to comply with the WARC spec
http://archive-access.sourceforge.net/warc/
and to be compatible with WARC Tools
http://code.google.com/p/warc-tools/
hope that helps.
/steve@...