archive-crawler
Messages

Messages 364 - 393 of 6147
364
... To get away from the Alexa tools, which can be quite difficult to compile, we have developed some Java tools that are available at www.netarchive.dk. They...
Lars Clausen
lrclause
May 3, 2004
1:38 pm
365
I've successfully used a patched version of dk.netarkivet.ArcTools.ExtractCDX to generate Wayback-compliant CDX files from an Alexa-style DAT file. My patch is...
Andrew Boyko
andyboyko
May 3, 2004
6:28 pm
366
Thanks for the comments, everyone. Some of these are good ideas that we won't get to for 1.0, but will keep on the docket for later. ... Not automatically: it...
Gordon Mohr (Internet...
gojomo
May 3, 2004
10:06 pm
367
... Using a Map instead of a Hashtable is the right thing, but I don't want to have DAT-extracts[1] automatically conform to Wayback format. If it was an...
Lars Clausen
lrclause
May 4, 2004
7:29 am
368
Hi, it turned out that I only needed offset information from CDX files. Those Java tools are quite grand for such a task and I managed to write a couple of Perl...
Kaisa Kaunonen
kaisa_kaunonen
May 4, 2004
11:56 am
369
... Indeed. A confusing bit is that an ARC block does not have a newline after it, instead there's a newline before the metadata line. ... Yes, as mentioned...
Lars Clausen
lrclause
May 4, 2004
12:29 pm
370
... There's at least one component of that patch that's a necessary bug fix to the existing code - in extractFromDat(), the fieldsread.clear() call when the...
Andrew Boyko
andyboyko
May 4, 2004
3:18 pm
371
... Good stuff. I just added a pointer to the developer documentation: http://crawler.archive.org/articles/developer_manual.html#arcreader. The same location...
Michael Stack
stack@...
May 5, 2004
1:38 am
372
What are the best practices for submitting a batch of jobs? I have a list of FQDNs in a database and I want Heritrix to consider each one a separate job....
penguinoamante2
May 10, 2004
9:41 pm
373
... That sounds right. Make sure the crawler is in the 'Crawling' state so that it'll start the next job as soon as it's finished the current job. Yours, St.Ack...
Michael Stack
stack@...
May 10, 2004
9:58 pm
374
We've seen a couple of cases here in which the alexa-tools av_procarc DAT file maker generates a DAT incorrectly, from Heritrix ARC files. For a small number...
Andrew Boyko
andyboyko
May 10, 2004
10:09 pm
375
Hi Sunny, One problem with a batch of jobs is that there is no guarantee that the jobs will finish within a reasonable time window. If you don't have...
Igor Ranitovic
iranitovic
May 10, 2004
10:28 pm
376
Hi Andy, I would love to get my hands on some of these arc files so that I can reproduce this error. Please let me know if this is possible. Take care. i. P.S....
Igor Ranitovic
iranitovic
May 10, 2004
10:40 pm
377
My first try was not successful. I created a directory called batchjob in the jobs directory, which contains three files: batchjob.job, job-batchjob.xml and...
penguinoamante2
May 11, 2004
3:39 pm
378
... It looks like the code that reads the directory is only run on application startup. Restart. Does it work? Here is the pertinent code: ...
Michael Stack
stack@...
May 11, 2004
5:03 pm
379
Michael is correct. Jobs are only read from disk during program startup. At other times, in-memory caching is used. A suitable workaround might be to create a...
Kristinn Sigurdsson
kristsi25
May 11, 2004
5:24 pm
380
Yes, you guys are right. When I restart Heritrix, the pending jobs on disk get loaded into the crawler. Thanks for the tips. I should be asking whether this...
penguinoamante2
May 11, 2004
5:53 pm
381
... Doing the latter sounds more manageable. See http://crawler.archive.org/articles/developer_manual.html#arcreader for a few notes on reading arcs. St.Ack ...
Michael Stack
stack@...
May 11, 2004
6:14 pm
382
Crawlers, I am trying to start crawls from an outside Java class, but I'm having trouble. I looked in the new.jsp pages, but I'm getting: Exception in thread...
Miles Crawford
mcrawfor@...
May 12, 2004
12:25 am
383
... Is there anything at Heritrix.getConfdir().getAbsolutePath() + File.separator + "profiles"? See ...
Michael Stack
stack@...
May 12, 2004
12:45 am
384
Well, I was trying to add jobs to an already-running instance of Heritrix, so that they could then be monitored via the web ui. Tell you what though, I think...
Miles Crawford
mcrawfor@...
May 12, 2004
1:04 am
385
... Ok (The aforementioned selftest method creates a job and runs it if you still want to go the other route). St.Ack...
Michael Stack
stack@...
May 12, 2004
1:12 am
386
How is it more manageable crawling through all the customers' URIs and then dealing with all their data mixed up in one file? What programs exist for querying...
penguinoamante2
May 13, 2004
3:10 pm
387
... The general notion is that Heritrix does crawling only. How the downloaded content is mined is domain-specific and outside of the Heritrix purview. That...
stack
stack@...
May 13, 2004
4:28 pm
388
Find all the pages that have "mailto", "contact", or that match the regular expression "(\d\d\d)\d\d\d-\d\d\d\d". It is unclear how I would do this using the...
penguinoamante2
May 14, 2004
4:44 pm
389
When compiling Heritrix with Maven, it fails one of the tests: This is from heritrix-0.6.0-src.tar.gz. test:test: [junit] Running ...
penguinoamante2
May 14, 2004
5:02 pm
390
... I am writing a C++ library for accessing information in ARC files, and implementing something like the above would be doable with that. I agree with St.Ack...
Tom Emerson
tree02139
May 14, 2004
5:19 pm
391
... Sorry. It's a lot to bite off if you ain't swimming in it every day (Even then...). ... The way our build is written, it runs tests before it produces jars...
Michael Stack
stack@...
May 14, 2004
6:23 pm
392
... You'd use an instance of ARCReader or the netarchive.dk tools to get an iterator onto the content of your ARC and pass the content of each item found...
Michael Stack
stack@...
May 14, 2004
8:13 pm
393
OK, I want to try this. I fired up Eclipse. Imported the filesystem heritrix/src/java/org into my project. I also imported the heritrix.jar file. Two errors...
penguinoamante2
May 14, 2004
10:30 pm
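
Message 388 asks how to find pages that contain "mailto", "contact", or a phone-number pattern, and message 392 suggests iterating over the content of an ARC and passing each item to such a check. The following is a minimal sketch of the matching step only. It does not read ARC files: page bodies are supplied as plain strings keyed by URI, and the class name ContactPageFilter and its methods are hypothetical, not part of Heritrix or the netarchive.dk tools. Two assumptions are made: the keyword checks are case-insensitive, and the parentheses in the quoted pattern are meant literally (as written, "(\d\d\d)" would be a capturing group and would match digits without parentheses).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a page body looks like a contact page.
class ContactPageFilter {
    // Assumption: the area-code parentheses are literal, so they are escaped.
    static final Pattern PHONE =
            Pattern.compile("\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d");

    // True if the body contains "mailto", "contact" (case-insensitive,
    // an assumption) or a phone number of the form (ddd)ddd-dddd.
    static boolean matches(String body) {
        String lower = body.toLowerCase(Locale.ROOT);
        return lower.contains("mailto")
                || lower.contains("contact")
                || PHONE.matcher(body).find();
    }

    // Returns the URIs of all matching pages, in the map's iteration order.
    static List<String> select(Map<String, String> pages) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (matches(page.getValue())) {
                hits.add(page.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice the bodies would come from an ARC reader.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.org/a", "<a href=\"mailto:info@example.org\">mail us</a>");
        pages.put("http://example.org/b", "Call (555)123-4567 today");
        pages.put("http://example.org/c", "Nothing relevant here");
        System.out.println(select(pages));
    }
}
```

In a real pipeline the map would be replaced by whatever iterator the chosen ARC-reading tool provides, with matches() applied to each record's content.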