billiontriples · The Billion Triples Challenge
Message #91 of 141
Re: Data format

Hi,

Are you referring to the issue that the content might have changed since
the URL was crawled? I would say that, for the sake of comparability,
please use the version included in the content block. (Or do you have a
strong preference for recrawling?)

Thanks,
Peter

--- In billiontriples@yahoogroups.com, "huanxuezhou" <huanxuezhou@...>
wrote:
>
> Hi Peter, one thing confuses me. A WARC record consists of a subject URI
> and a content block. Since the content in the content block sometimes
> differs from what the URI returns, which content should we use?
>
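
A minimal sketch of the advice above: read the content block stored in the record and never refetch the subject URI. This assumes WARC 1.0-style named headers (`WARC-Target-URI`, `Content-Length`); the challenge files may follow an older WARC draft with a different header layout, so treat the field names as assumptions. `parse_record`, `archived_content` and the sample record are hypothetical, made up for illustration, and are not the sample code posted to the group.

```python
def parse_record(raw: bytes):
    """Split one record into (version line, header dict, content block)."""
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line between headers and content block")
    lines = head.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(headers["content-length"])
    return lines[0], headers, rest[:length]


def archived_content(raw: bytes):
    """Return (subject URI, content) using only what is stored in the record."""
    _, headers, block = parse_record(raw)
    # Deliberately no HTTP request here: the live page may have changed.
    return headers["warc-target-uri"], block.decode("utf-8")


# Made-up sample record holding one N-Triples statement.
body = b'<http://example.org/s> <http://example.org/p> "o" .\n'
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: resource\r\n",
    b"WARC-Target-URI: http://example.org/doc\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
    b"\r\n\r\n",
])

uri, content = archived_content(sample)
print(uri)
print(content, end="")
```

Working only from the stored block means everyone evaluates against the same bytes, which is the comparability point made in the reply.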

Tue Aug 12, 2008 8:19 am

serendipity588

Dear All, After some long and careful consideration, we have made the decision not to invent our own format for exchanging data but to rely on an existing ...
Peter Mika
serendipity588
Feb 27, 2008
3:15 pm

Hi Peter, I'm not entirely sure what you are going to give us access to. You (if everything goes right at Yahoo) will give us access to a 100 G crawl in...
jans.aasman
jannesaasman
Feb 27, 2008
4:32 pm

Hi Jans, The plan is to have the entire dataset available for download in the WARC format as a set of files. (Some users may have limitations storing files...
Peter Mika
serendipity588
Feb 27, 2008
4:42 pm

thanks for the clarification, jans...
Jans Aasman
jannesaasman
Feb 27, 2008
9:03 pm

Hello Peter, Do we have any codes written in Jena? - Amit ... the ... storing ... crawls. ... do if ... HTTP ... access ... access to a ... on ... an existing ...
crossthelimit
Feb 28, 2008
3:49 pm

Hi Amit, No, I don't as I'm not familiar with Jena. But basically the MeasurableInputStream that you get as a result of the response.contentAsStream() call on...
Peter Mika
serendipity588
Feb 28, 2008
3:56 pm

Thnx for the info. - Amit ... that you ... download in ... limitations ... of ... response. The ... need to ... the ... GB. ... us ... based ... the ... on ......
crossthelimit
Feb 28, 2008
4:10 pm

... not ... existing ... archives ... additional ... in ... API can ... of ... demonstrates ... structure ... the ... hard ... million ... ...
gwking2005
May 20, 2008
10:48 pm

Hi Peter, one thing confuses me. A WARC record consists of a subject URI and a content block. Since sometimes the content from the content block is different from the one...
huanxuezhou
Aug 12, 2008
2:10 am

Hi, Do you refer to the issue that the content might have changed since the URL was crawled? I would say that for the sake of comparability, please use the...
serendipity588
Aug 12, 2008
8:19 am

Thanks for the explanation. Personally, I would really appreciate it if you could encode the files in N-Triples format, since this format represents RDF well and can be easily...
huanxuezhou
Aug 12, 2008
4:18 pm

Hi, The content inside the WARC is encoded in N-Triples, see the sample code (added to the files of the Yahoo! Group, see [1]) on how to extract it. Once you...
serendipity588
Aug 12, 2008
4:43 pm