Re: how should the field "Content-Length" be calculated?
Message #5911 of 6140
cross-posted from [Archive-access-discuss]

On Tue, 30 Jun 2009 13:47:51 -0400 Zhenzhen Xue <zjuzhenzhen@...>
wrote:
> I have some WARC files in hand which do not strictly follow
> the WARC standard. For example, each record is ended with \n\n
> rather than \r\n\r\n.

interesting. is this really necessary?

> I found that the field "Content-Length" is always causing
> some problem.

what is the problem exactly? do you have an error message?

if the error came from a WARCReader class, it's likely from the
Heritrix common classes, which are included in NutchWAX. please
post your question to the Heritrix "archive-crawler" list for
more help. :)

> I want to know how Content-Length should be calculated. I calculated it
> as the number of characters starting from the line after
> "Content-Length" to the end of the record, which is \r\n\r\n.

Content-Length is defined in the WARC spec as "the number of
octets in the block, similar to [RFC2616]", and the block
is defined as part of the WARC record as follows:

warc-file   = 1*warc-record
warc-record = header CRLF
              block CRLF CRLF
header      = version warc-fields
version     = "WARC/0.1718" CRLF
warc-fields = *named-field CRLF
block       = *OCTET

so i think you want to _exclude_ the CRLFs after the block.
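
here is a minimal sketch of that rule in Python, in case it helps. it assumes each named field sits on its own CRLF-terminated line and that a blank line ends the header. `build_record`, `read_block` and the "WARC/0.18" version string are made up for illustration; they are not part of Heritrix or WARC Tools. note that the length is counted in octets (bytes), not characters, and that the two CRLFs after the block are left out.

```python
CRLF = b"\r\n"


def build_record(version_line: bytes, fields: dict, block: bytes) -> bytes:
    """Assemble header CRLF block CRLF CRLF (hypothetical helper)."""
    named = dict(fields)
    # Content-Length counts the octets of the block only,
    # not the trailing CRLF CRLF and not the header.
    named["Content-Length"] = str(len(block))
    header = version_line + CRLF
    for name, value in named.items():
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # The extra CRLF is the blank line that ends the header.
    return header + CRLF + block + CRLF + CRLF


def read_block(record: bytes) -> bytes:
    """Recover the block using Content-Length rather than scanning for CRLFs."""
    header_end = record.index(CRLF + CRLF)
    header_lines = record[:header_end].decode("utf-8").split("\r\n")
    length = None
    for line in header_lines[1:]:  # skip the version line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("no Content-Length field found")
    start = header_end + len(CRLF + CRLF)
    block = record[start:start + length]
    # After the block the record must close with CRLF CRLF.
    if record[start + length:start + length + 4] != CRLF + CRLF:
        raise ValueError("Content-Length does not match the record terminator")
    return block


# The block deliberately contains a non-ASCII character and its own "\r\n\r\n".
text = "h\u00e9llo\r\n\r\nworld"
block = text.encode("utf-8")
record = build_record(b"WARC/0.18", {"WARC-Type": "resource"}, block)
recovered = read_block(record)
```

in this example the block has 14 characters but 15 octets, so the header carries "Content-Length: 15". because the block itself contains \r\n\r\n, scanning forward for the record terminator would cut the block short, while reading exactly Content-Length octets does not.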

the WARCRecord class is designed to comply with the WARC spec
http://archive-access.sourceforge.net/warc/

and to be compatible with WARC Tools
http://code.google.com/p/warc-tools/

hope that helps.


/steve@...

Wed Jul 1, 2009 6:00 pm

stearcorg