archive-crawler

Group Information

  • Members: 798
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English
Messages

Messages 364 - 393 of 8173
364 Lars Clausen
lrclause
May 3, 2004
1:38 pm
... To get away from the Alexa tools, which can be quite difficult to compile, we have developed some Java tools that are available at www.netarchive.dk. They...
365 Andrew Boyko
andyboyko
May 3, 2004
6:28 pm
I've successfully used a patched version of dk.netarkivet.ArcTools.ExtractCDX to generate Wayback-compliant CDX files from an Alexa-style DAT file. My patch is...
366 Gordon Mohr (Internet...
gojomo
May 3, 2004
10:06 pm
Thanks for the comments, everyone. Some of these are good ideas that we won't get to for 1.0, but will keep on the docket for later. ... Not automatically: it...
367 Lars Clausen
lrclause
May 4, 2004
7:29 am
... Using a Map instead of a Hashtable is the right thing, but I don't want to have DAT-extracts[1] automatically conform to Wayback format. If it was an...
368 Kaisa Kaunonen
kaisa_kaunonen
May 4, 2004
11:56 am
Hi, it turned out that I only needed offset information from cdx files. Those java tools are quite grand for such task and I managed to write a couple of Perl...
369 Lars Clausen
lrclause
May 4, 2004
12:29 pm
... Indeed. A confusing bit is that an ARC block does not have a newline after it; instead, there's a newline before the metadata line. ... Yes, as mentioned...
370 Andrew Boyko
andyboyko
May 4, 2004
3:18 pm
... There's at least one component of that patch that's a necessary bug fix to the existing code - in extractFromDat(), the fieldsread.clear() call when the...
371 Michael Stack
stack@...
May 5, 2004
1:38 am
... Good stuff. I just added a pointer to the developer documentation: http://crawler.archive.org/articles/developer_manual.html#arcreader. The same location...
372 penguinoamante2
May 10, 2004
9:41 pm
What are the best practices for submitting a batch of jobs? I have a list of FQDNs in a database and I want heritrix to consider each one a separate job....
373 Michael Stack
stack@...
May 10, 2004
9:58 pm
... That sounds right. Make sure the crawler is in the 'Crawling state' so that it'll just start the next job as soon as it's finished the current job. Yours, St.Ack...
374 Andrew Boyko
andyboyko
May 10, 2004
10:09 pm
We've seen a couple of cases here in which the alexa-tools av_procarc DAT file maker generates a DAT incorrectly, from Heritrix ARC files. For a small number...
375 Igor Ranitovic
iranitovic
May 10, 2004
10:28 pm
Hi Sunny, One problem with a batch of jobs is that there is no guarantee that the jobs will finish within a reasonable time window. If you don't have...
376 Igor Ranitovic
iranitovic
May 10, 2004
10:40 pm
Hi Andy, I would love to get my hands on some of these arc files so that I can reproduce this error. Please let me know if this is possible. Take care. i. P.S....
377 penguinoamante2
May 11, 2004
3:39 pm
First try was not successful. I created a directory called batchjob in the jobs directory, which contains three files: batchjob.job, job-batchjob.xml, and...
378 Michael Stack
stack@...
May 11, 2004
5:03 pm
... It looks like the code that reads the directory is only run on application startup. Restart. Does it work? Here is the pertinent code: ...
379 Kristinn Sigurdsson
kristsi25
May 11, 2004
5:24 pm
Michael is correct. Jobs are only read from disk during program startup. At other times, in-memory caching is used. A suitable workaround might be to create a...
380 penguinoamante2
May 11, 2004
5:53 pm
Yes, you guys are right. When I restart heritrix, the pending jobs on disk get loaded into the crawler. Thanks for the tips. I should be asking whether this...
381 Michael Stack
stack@...
May 11, 2004
6:14 pm
... Doing the latter sounds more manageable. See http://crawler.archive.org/articles/developer_manual.html#arcreader for a few notes on reading arcs. St.Ack ...
382 Miles Crawford
mcrawfor@...
May 12, 2004
12:25 am
Crawlers, I am trying to start crawls from an outside java class, but I'm having trouble. I looked in the new.jsp pages, but I'm getting: Exception in thread...
383 Michael Stack
stack@...
May 12, 2004
12:45 am
... Is there anything at Heritrix.getConfdir().getAbsolutePath() + File.separator + "profiles";? See ...
384 Miles Crawford
mcrawfor@...
May 12, 2004
1:04 am
Well, I was trying to add jobs to an already-running instance of Heritrix, so that they could then be monitored via the web ui. Tell you what though, I think...
385 Michael Stack
stack@...
May 12, 2004
1:12 am
... Ok (The aforementioned selftest method creates a job and runs it if you still want to go the other route). St.Ack...
386 penguinoamante2
May 13, 2004
3:10 pm
How is it more manageable crawling through all the customers' URIs and then dealing with all their data mixed up in one file? What programs exist for querying...
387 stack
stack@...
May 13, 2004
4:28 pm
... The general notion is that heritrix does crawling only. How the downloaded content is mined is domain specific and outside of the heritrix purview. That...
388 penguinoamante2
May 14, 2004
4:44 pm
Find all the pages that have "mailto", "contact", or that match the regular expression "(\d\d\d)\d\d\d-\d\d\d\d". It is unclear how I would do this using the...
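The filter described in message 388 can be sketched roughly as follows. This is only an illustration, not anything from the thread: the function name `page_matches`, the case-insensitive keyword check, and the assumption that page content is already available as a plain string are all hypothetical. Note that in the regular expression as quoted, the parentheses form a capture group rather than matching literal parentheses; the second pattern below is a guess at what may have been intended (a number written like "(555)123-4567").

```python
import re

# Keywords from the message; matching them case-insensitively is an assumption.
KEYWORDS = ("mailto", "contact")

# The pattern exactly as quoted in the message. The parentheses here are a
# capture group, so this matches six digits, a hyphen, then four digits.
QUOTED = re.compile(r"(\d\d\d)\d\d\d-\d\d\d\d")

# Hypothetical variant with escaped parentheses, assuming the poster meant
# numbers written like "(555)123-4567".
LITERAL_PARENS = re.compile(r"\(\d\d\d\)\d\d\d-\d\d\d\d")


def page_matches(text, pattern=QUOTED):
    """Return True if text contains a keyword or matches the given pattern."""
    lowered = text.lower()
    if any(word in lowered for word in KEYWORDS):
        return True
    return pattern.search(text) is not None
```

Calling `page_matches(text)` on the content of each crawled page would select the pages the message asks for; passing `LITERAL_PARENS` as the second argument switches to the escaped-parentheses reading.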
389 penguinoamante2
May 14, 2004
5:02 pm
When compiling heritrix with maven, it fails one of the tests. This is from heritrix-0.6.0-src.tar.gz. test:test: [junit] Running ...
390 Tom Emerson
tree02139
May 14, 2004
5:19 pm
... I am writing a C++ library for accessing information in ARC files, and implementing something like the above would be doable with that. I agree with St.Ack...
391 Michael Stack
stack@...
May 14, 2004
6:23 pm
... Sorry. It's a lot to bite off if you ain't swimming in it every day (Even then...). ... The way our build is written, it runs tests before it produces jars...
392 Michael Stack
stack@...
May 14, 2004
8:13 pm
... You'd use an instance of ARCReader or the netarchive.dk tools to get an iterator onto the content of your ARC and pass the content of each item found...
393 penguinoamante2
May 14, 2004
10:30 pm
OK, I want to try this. I fired up eclipse and imported the filesystem heritrix/src/java/org into my project. I also imported the heritrix.jar file. Two errors...

Copyright © 2010 Yahoo! Inc. All rights reserved.