... To get away from the Alexa tools, which can be quite difficult to compile, we have developed some Java tools that are available at www.netarchive.dk. They...
365
Andrew Boyko
andyboyko
May 3, 2004 6:28 pm
I've successfully used a patched version of dk.netarkivet.ArcTools.ExtractCDX to generate Wayback-compliant CDX files from an Alexa-style DAT file. My patch is...
366
Gordon Mohr (Internet...
gojomo
May 3, 2004 10:06 pm
Thanks for the comments, everyone. Some of these are good ideas that we won't get to for 1.0, but will keep on the docket for later. ... Not automatically: it...
367
Lars Clausen
lrclause
May 4, 2004 7:29 am
... Using a Map instead of a Hashtable is the right thing, but I don't want to have DAT-extracts[1] automatically conform to Wayback format. If it was an...
368
Kaisa Kaunonen
kaisa_kaunonen
May 4, 2004 11:56 am
Hi, it turned out that I only needed offset information from CDX files. Those Java tools are quite grand for such a task, and I managed to write a couple of Perl...
369
Lars Clausen
lrclause
May 4, 2004 12:29 pm
... Indeed. A confusing bit is that an ARC block does not have a newline after it; instead, there's a newline before the metadata line. ... Yes, as mentioned...
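The newline placement described above can be illustrated with a minimal sketch. It is not the netarchive.dk or Heritrix code. It assumes the ARC v1 layout, in which the metadata line is a space-separated header whose last field is the body length in bytes. The class and method names (`ArcBlockSketch`, `readAll`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: splits an in-memory, uncompressed ARC-style buffer into records.
class ArcBlockSketch {

    /** One parsed record: the metadata line and the raw body bytes. */
    static final class Record {
        final String metadataLine;
        final byte[] body;

        Record(String metadataLine, byte[] body) {
            this.metadataLine = metadataLine;
            this.body = body;
        }
    }

    static List<Record> readAll(byte[] data) {
        List<Record> records = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            // The separator newline sits BEFORE the metadata line,
            // not after the previous body, so skip it here.
            while (pos < data.length && data[pos] == '\n') {
                pos++;
            }
            if (pos >= data.length) {
                break;
            }
            // Read the metadata line up to its terminating newline.
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            if (eol >= data.length) {
                throw new IllegalArgumentException("Unterminated metadata line at " + pos);
            }
            String line = new String(data, pos, eol - pos, StandardCharsets.ISO_8859_1);
            // Assumption: the last space-separated field is the body length in bytes.
            String[] fields = line.split(" ");
            int length = Integer.parseInt(fields[fields.length - 1]);
            int bodyStart = eol + 1;
            if (length < 0 || bodyStart + length > data.length) {
                throw new IllegalArgumentException("Body length out of range at " + pos);
            }
            // The body is taken by length only; no trailing newline is expected.
            records.add(new Record(line, Arrays.copyOfRange(data, bodyStart, bodyStart + length)));
            pos = bodyStart + length;
        }
        return records;
    }
}
```

Because the body is consumed strictly by its declared length, a reader that expects a trailing newline after each block would drift out of step; skipping the newline at the start of the next record avoids that.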
370
Andrew Boyko
andyboyko
May 4, 2004 3:18 pm
... There's at least one component of that patch that's a necessary bug fix to the existing code - in extractFromDat(), the fieldsread.clear() call when the...
371
Michael Stack
stack@...
May 5, 2004 1:38 am
... Good stuff. I just added a pointer to the developer documentation: http://crawler.archive.org/articles/developer_manual.html#arcreader. The same location...
372
penguinoamante2
May 10, 2004 9:41 pm
What are the best practices for submitting a batch of jobs? I have a list of FQDNs in a database and I want Heritrix to consider each one a separate job....
373
Michael Stack
stack@...
May 10, 2004 9:58 pm
... That sounds right. Make sure the crawler is in the 'Crawling' state so that it'll just start the next job as soon as it's finished the current job. Yours, St.Ack...
374
Andrew Boyko
andyboyko
May 10, 2004 10:09 pm
We've seen a couple of cases here in which the alexa-tools av_procarc DAT file maker generates a DAT incorrectly, from Heritrix ARC files. For a small number...
375
Igor Ranitovic
iranitovic
May 10, 2004 10:28 pm
Hi Sunny, One problem with a batch of jobs is that there is no guarantee that the jobs will finish within a reasonable time window. If you don't have...
376
Igor Ranitovic
iranitovic
May 10, 2004 10:40 pm
Hi Andy, I would love to get my hands on some of these arc files so that I can reproduce this error. Please let me know if this is possible. Take care. i. P.S....
377
penguinoamante2
May 11, 2004 3:39 pm
First try was not successful. I created a directory called batchjob in the jobs directory which contains three files: batchjob.job job-batchjob.xml and...
378
Michael Stack
stack@...
May 11, 2004 5:03 pm
... It looks like the code that reads the directory is only run on application startup. Restart. Does it work? Here is the pertinent code: ...
379
Kristinn Sigurdsson
kristsi25
May 11, 2004 5:24 pm
Michael is correct. Jobs are only read from disk during program startup. At other times in-memory caching is used. A suitable workaround might be to create a...
380
penguinoamante2
May 11, 2004 5:53 pm
Yes you guys are right. When I restart heritrix the pending jobs on
disk get loaded into the crawler.
Thanks for the tips. I should be asking whether this...
381
Michael Stack
stack@...
May 11, 2004 6:14 pm
... Doing the latter sounds more manageable. See http://crawler.archive.org/articles/developer_manual.html#arcreader for a few notes on reading arcs. St.Ack ...
382
Miles Crawford
mcrawfor@...
May 12, 2004 12:25 am
Crawlers, I am trying to start crawls from an outside Java class, but I'm having trouble. I looked in the new.jsp pages, but I'm getting: Exception in thread...
383
Michael Stack
stack@...
May 12, 2004 12:45 am
... Is there anything at Heritrix.getConfdir().getAbsolutePath() + File.separator + "profiles"? See ...
384
Miles Crawford
mcrawfor@...
May 12, 2004 1:04 am
Well, I was trying to add jobs to an already-running instance of Heritrix, so that they could then be monitored via the web ui. Tell you what though, I think...
385
Michael Stack
stack@...
May 12, 2004 1:12 am
... Ok (The aforementioned selftest method creates a job and runs it if you still want to go the other route). St.Ack...
386
penguinoamante2
May 13, 2004 3:10 pm
How is it more manageable crawling through all the customers URIs and
then dealing with all their data mixed up in one file? What programs
exist for querying...
387
stack
stack@...
May 13, 2004 4:28 pm
... The general notion is that heritrix does crawling only. How the downloaded content is mined is domain specific and outside of the heritrix purview. That...
388
penguinoamante2
May 14, 2004 4:44 pm
Find all the pages that have "mailto", "contact", or that match the regular expression "(\d\d\d)\d\d\d-\d\d\d\d". It is unclear how I would do this using the...
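A minimal sketch of that filter using only `java.util.regex` follows. It assumes the page text has already been extracted from the crawl output into a `String`. It also assumes the parentheses in the pattern are meant to match literal parentheses around an area code, so they are escaped here. The class and method names (`ContactPageFilter`, `looksLikeContactPage`) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: flags page text that mentions contact details.
class ContactPageFilter {

    // "mailto" or "contact" as plain substrings, or a phone number shaped
    // like (ddd)ddd-dddd. Assumption: the parentheses are literal, so they
    // are escaped; unescaped they would only form a capture group.
    static final Pattern CONTACT = Pattern.compile(
            "mailto|contact|\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d");

    /** Returns true if any of the three alternatives occurs anywhere in the text. */
    static boolean looksLikeContactPage(String pageText) {
        return CONTACT.matcher(pageText).find();
    }
}
```

The match is case-sensitive as written; passing `Pattern.CASE_INSENSITIVE` as a second argument to `Pattern.compile` would also catch "Contact" and "MAILTO".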
389
penguinoamante2
May 14, 2004 5:02 pm
When compiling Heritrix with Maven it fails one of the tests: This is from heritrix-0.6.0-src.tar.gz. test:test: [junit] Running ...
390
Tom Emerson
tree02139
May 14, 2004 5:19 pm
... I am writing a C++ library for accessing information in ARC files, and implementing something like the above would be doable with that. I agree with St.Ack...
391
Michael Stack
stack@...
May 14, 2004 6:23 pm
... Sorry. It's a lot to bite off if you ain't swimming in it every day (Even then...). ... The way our build is written, it runs tests before it produces jars...
392
Michael Stack
stack@...
May 14, 2004 8:13 pm
... You'd use an instance of ARCReader or the netarchive.dk tools to get an iterator onto the content of your ARC and pass the content of each item found...
393
penguinoamante2
May 14, 2004 10:30 pm
Ok I want to try this. I fired up eclipse. Imported the filesystem heritrix/src/java/org into my project. I also imported the heritrix.jar file. Two errors...