I only crawl my own test URLs. Heritrix needs to fetch robots.txt and DNS first, which costs a lot of time. I do not need Heritrix to crawl the robots...
I don't know about the robots.txt part, but you will still be forced to "crawl" (i.e., contact) the DNS to obtain the IP address of the server. This is mandatory. ...
Jean-Noël Rivasseau
elvanor@...
Oct 8, 2008 8:34 am
5503
Oh, I know. But could anyone tell me how to strip the robots.txt? Thanks ...
Hi, Could we please not publish this? I don't know if the original writer is real, or not. But it is suspicious. Eliminating one text file from a crawl to save...
I also don't know the reason for not downloading the robots.txt, but if it's because of not wanting to follow the "advice" in them, then there's a simple ...
As you said, I realize robots.txt is a very important file for limiting illegal crawlers. I'll start to learn more about it. I prefer to let it remain. Thanks a lot. ...
Hi, I tried to start a crawl job with Heritrix 2.0.1 and the AdaptiveRevisitFrontier, but unfortunately I ran into some NullPointerExceptions (the sheet config...
Juergen Umbrich
juergen@...
Oct 10, 2008 4:29 pm
5509
Hi Juergen, as far as I know the AdaptiveRevisitFrontier is broken in Heritrix 2.X. Adaptive revisit was an experiment for Heritrix 1.0x and is in the current...
Hi, We had a problem with Heritrix not writing any crawl reports in the following case: The job was first paused due to a Low Disk Pause (caused by the ...
Hi, During our last (broad) crawl we stumbled upon the following fact: directly after the start of the crawl the download rate (kb/s as tracked in the...
I think this is an FAQ, but I could not completely understand the current status of incremental crawling with Heritrix. I read the documents, the mailing...
Takeshi Kobayakawa
tskoba@...
Oct 14, 2008 8:13 pm
5513
Hi all, Is there a tool built in to the Heritrix 2 package that will create DAT files? If not, are there open source tools available? I appreciate any...
Hi Group, Does anyone have any code to automate monitoring of the crawler? I would like to see stats like how long a job is taking and how many...
Sorry, I mean DAT <http://www.archive.org/web/researcher/dat_file_format.php> files like the ones the Internet Archive uses as a type of index for ARCs. Thanks, Lauren...
Hi, Heritrix users may be interested in a project we just released as open source. CloudBase is a data warehouse system built on top of Hadoop. It is developed by...
I don't know if you can do that in Heritrix 2, but in Heritrix 1.14.1 with the cmdline-jmxclient you can monitor all of these stats. Check in this forum a...
Hi everybody. I'm using heritrix-1.14.1 and I want to use all of my bandwidth with the crawler. I'm using a profile that only downloads the HTML code by ...
Hi, we are using munin [1] to plot e.g. number of docs, MB/s, etc. Integrating with munin is simple: you can use any scripting language to produce the value you want...
Hi everyone. I am new to the Heritrix project and looking forward to using this software for a personal project. However, I have been trying to set up my Eclipse...
I just realized that compiling from the root pom.xml didn't create any lib folder in dist/target/... I just tried running a build (which fails) from...
Hi, In Heritrix 1.12 you could read popup help messages in the Configure Settings pages when you clicked the question mark beside a property. In Heritrix 2.0...
Hi, I really need some help, because I think I'm stuck in Heritrix 2.0.1. I want to crawl a list of seeds, and any URI that contains a keyword I want to rank...
Hi all, I wrote a module which detects the "real" media type (MIME type) of a file based on the magic number approach. At the moment I am comparing the Apache...
Juergen Umbrich
juergen@...
Oct 22, 2008 4:34 pm
5526
I wrote a DecideRule that checks the header from the contents to see what media type it was, but this requires the file to be downloaded first. Is that sort of...
I've asked 2 Java devs to compile Heritrix 2.0.1 and neither of them was able to do so... Anyone else getting errors? I feel it has something to do with Maven...
Hi ... ah ok, behind my approach is the requirement that the crawler should avoid unnecessary HTTP lookups. Given resource limitations and assuming that over...
Juergen Umbrich
juergen@...
Oct 22, 2008 5:07 pm
5529
I tried for 2 days to get it to compile. I was never able to get it to work inside Eclipse. The Maven plugin will not work for me. I did have better results...
Many thanks for the reply. I am relieved to know that I am not the only one having difficulties compiling Heritrix. I thought I was doing something wrong, but...