... To get away from the Alexa tools, which can be quite difficult to compile, we have developed some Java tools that are available at www.netarchive.dk. They...
I've successfully used a patched version of dk.netarkivet.ArcTools.ExtractCDX to generate Wayback-compliant CDX files from an Alexa-style DAT file. My patch is...
Thanks for the comments, everyone. Some of these are good ideas that we won't get to for 1.0, but will keep on the docket for later. ... Not automatically: it...
... Using a Map instead of a Hashtable is the right thing, but I don't want to have DAT-extracts[1] automatically conform to Wayback format. If it was an...
Hi, it turned out that I only needed offset information from CDX files. Those Java tools are quite grand for such a task and I managed to write a couple of Perl...
... Indeed. A confusing bit is that an ARC block does not have a newline after it; instead, there's a newline before the metadata line. ... Yes, as mentioned...
... There's at least one component of that patch that's a necessary bug fix to the existing code - in extractFromDat(), the fieldsread.clear() call when the...
... Good stuff. I just added a pointer to the developer documentation: http://crawler.archive.org/articles/developer_manual.html#arcreader. The same location...
Michael Stack
stack@...
May 5, 2004 1:38 am
372
What are the best practices for submitting a batch of jobs? I have a list of FQDNs in a database and I want Heritrix to consider each one a separate job....
... That sounds right. Make sure the crawler is in the 'Crawling state' so that it'll just start the next job as soon as it's finished the current job. Yours, St.Ack...
Michael Stack
stack@...
May 10, 2004 9:58 pm
374
We've seen a couple of cases here in which the alexa-tools av_procarc DAT file maker generates a DAT incorrectly, from Heritrix ARC files. For a small number...
Hi Sunny, One problem with a batch of jobs is that there is no guarantee that the jobs will finish within the reasonable time window. If you don't have...
Hi Andy, I would love to get my hands on some of these arc files so that I can reproduce this error. Please let me know if this is possible. Take care. i. P.S....
First try was not successful. I created a directory called batchjob in the jobs directory which contains three files: batchjob.job, job-batchjob.xml and...
... It looks like the code that reads the directory is only run on application startup. Restart. Does it work? Here is the pertinent code: ...
Michael Stack
stack@...
May 11, 2004 5:03 pm
379
Michael is correct. Jobs are only read from disk during program startup. At other times, in-memory caching is used. A suitable workaround might be to create a...
Yes, you guys are right. When I restart Heritrix, the pending jobs on
disk get loaded into the crawler.
Thanks for the tips. I should be asking whether this...
... Doing the latter sounds more manageable. See http://crawler.archive.org/articles/developer_manual.html#arcreader for a few notes on reading arcs. St.Ack ...
Michael Stack
stack@...
May 11, 2004 6:14 pm
382
Crawlers, I am trying to start crawls from an outside Java class, but I'm having trouble. I looked in the new.jsp pages, but I'm getting: Exception in thread...
Miles Crawford
mcrawfor@...
May 12, 2004 12:25 am
383
... Is there anything at Heritrix.getConfdir().getAbsolutePath() + File.separator + "profiles"? See ...
Michael Stack
stack@...
May 12, 2004 12:45 am
384
Well, I was trying to add jobs to an already-running instance of Heritrix, so that they could then be monitored via the web ui. Tell you what though, I think...
Miles Crawford
mcrawfor@...
May 12, 2004 1:04 am
385
... Ok (The aforementioned selftest method creates a job and runs it if you still want to go the other route). St.Ack...
Michael Stack
stack@...
May 12, 2004 1:12 am
386
How is it more manageable crawling through all the customers' URIs and
then dealing with all their data mixed up in one file? What programs
exist for querying...
... The general notion is that Heritrix does crawling only. How the downloaded content is mined is domain specific and outside of the Heritrix purview. That...
stack
stack@...
May 13, 2004 4:28 pm
388
Find all the pages that have "mailto", "contact", or that match the regular expression "(\d\d\d)\d\d\d-\d\d\d\d". It is unclear how I would do this using the...
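A minimal sketch of the matching step described in this message, independent of how the page content is read out of the ARC. It assumes the parentheses in the quoted expression are meant as literal characters around an area code; written unescaped, "(\d\d\d)" is only a capture group and would match "415555-0100" rather than "(415)555-0100". The class and method names here are hypothetical and not part of Heritrix.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ContactPageMatcher {
    // Assumption: literal parentheses around the area code, so they are escaped here.
    private static final Pattern PHONE =
            Pattern.compile("\\(\\d{3}\\)\\d{3}-\\d{4}");

    // True if the content mentions "mailto" or "contact" (case-insensitive)
    // or contains something shaped like (ddd)ddd-dddd.
    static boolean looksLikeContactPage(String content) {
        String lower = content.toLowerCase(Locale.ROOT);
        return lower.contains("mailto")
                || lower.contains("contact")
                || PHONE.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeContactPage("Call (415)555-0100 today"));
        System.out.println(looksLikeContactPage("<a href=\"mailto:someone@example.org\">"));
        System.out.println(looksLikeContactPage("Nothing of interest here"));
    }
}
```

Plain substring checks are used for the two keywords because they need no regular expression; only the phone pattern does.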
... I am writing a C++ library for accessing information in ARC files, and implementing something like the above would be doable with that. I agree with St.Ack...
... Sorry. It's a lot to bite off if you ain't swimming in it every day (Even then...). ... The way our build is written, it runs tests before it produces jars...
Michael Stack
stack@...
May 14, 2004 6:23 pm
392
... You'd use an instance of ARCReader or the netarchive.dk tools to get an iterator onto the content of your ARC and pass the content of each item found...
Michael Stack
stack@...
May 14, 2004 8:13 pm
393
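The approach suggested in message 392 (iterate over the items in an ARC and hand each item's content to a check) might be sketched roughly as follows. The ArchivedItem type and the matchingUrls helper are hypothetical stand-ins invented for illustration; they are not the ARCReader or netarchive.dk API, whose actual record types and iterator methods should be taken from their own documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

class RecordScan {
    // Hypothetical stand-in for one archived item; not a Heritrix type.
    static final class ArchivedItem {
        final String url;
        final String content;

        ArchivedItem(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    // Walks the iterator once and collects the URLs whose content passes the check.
    static List<String> matchingUrls(Iterator<ArchivedItem> items,
                                     Predicate<String> contentCheck) {
        List<String> hits = new ArrayList<>();
        while (items.hasNext()) {
            ArchivedItem item = items.next();
            if (contentCheck.test(item.content)) {
                hits.add(item.url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // In-memory sample data standing in for records read from an ARC file.
        List<ArchivedItem> sample = Arrays.asList(
                new ArchivedItem("http://example.org/", "Welcome"),
                new ArchivedItem("http://example.org/about", "Write to mailto:info@example.org"));
        System.out.println(matchingUrls(sample.iterator(), c -> c.contains("mailto")));
    }
}
```

Keeping the check as a Predicate means the crawl output can be mined with whatever domain-specific test is needed, without touching the iteration code.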
OK, I want to try this. I fired up Eclipse. Imported the filesystem heritrix/src/java/org into my project. I also imported the heritrix.jar file. Two errors...