Dear Gordan, After entering the proxy address and port using the expert settings, the crawler seems to work a bit better. I entered a few seeds for testing, but only...
Hi, I think I have managed to install Heritrix on a Fedora Core 3 machine. I ran some crawls and now I want to display the results - BUT HOW? Also I'm wondering...
I still suspect proxy-related issues. Are you using the latest Heritrix code? Just a couple weeks ago, a problem with our HTTPClient library allowing...
... Heritrix's focus has been on the total veracity of the material collected. Thus, its native storage format is "ARC" files, which are essentially perfect...
In Heritrix, unit tests sit beside the classes tested though the common practise is to keep tests apart from the runtime src, in a directory of tests only....
+1 Never did like having all that clutter. - Kris ...
Here is what we at the Archive see as priorities for the 1.6 release of Heritrix. The "General Theme" will be "remote job-control and monitoring ...
Hi, I'd like to use Heritrix but I cannot seem to get around this issue: I have 100k seed urls. These have been split into 4k subsets. A Heritrix process is...
... I would definitely run Heritrix 1.4 in preference to 1.0.4 and 1.2. What sort of threading issues came up with Sun 1.5 JVM? Heritrix 1.4 does not require...
It would be nice for us if we could override the "max-document-download" setting, i.e. have it defined per domain. We are planning to crawl a bunch of domains...
-1 Mainly because it makes it much harder to test non-public methods. The other reasons for me to keep it as it is, are mostly a matter of taste. I don't see...
Hi. At the beginning of section 4.4.1, we read: ############# excerpt begin ############### 4.4.1. Pre-fetch processing chain The first chain is responsible for...
... Hi John Erik, just a small comment for clarification. Moving classes to another source folder does not mean moving into another package. The separation of...
Christian Kohlschuetter
ck-heritrix@...
May 4, 2005 12:29 pm
1788
This is my fault. The packages uploaded to sourceforge were compiled on my desktop w/ 1.5.0. Let me replace the sourceforge files w/ versions compiled w/ ...
What Stack fails to mention is that the DomainSensitiveFrontier is based on the now deprecated HostQueuesFrontier. A possible workaround for you, Soren, would...
Just to correct myself, I see that Stack DID mention the relation to the HostQueuesFrontier. I need to have my eyes checked :smile: - Kris ... From:...
Thanks for the clarification, Christian. Then taste is the only remaining reason for my vote. I still prefer the current solution, but if others don't, that's...
A somewhat similar behavior could be obtained by using the cost-policies to ensure download of a maximum of e.g. 2000 objects from each host at a time - this...
-.5 It's convenient to have the test right there with the sources. Even better, whenever possible, have a main unit test in the class itself (in addition to...
-1. I am not under the impression that the src tree is (yet) polluted with unit tests. Though, it would be great to be in that position. At this point, I think...
-1 for now, for the same reasons Igor mentioned: it's not too much clutter yet, and the alongside-prominence may help highlight where they're lacking. (If/when...
... It's true. Heritrix still does not have all classes tested. However, we already have more than 60 unit tests (and more than 300 application classes). ...
Christian Kohlschuetter
ck-heritrix@...
May 6, 2005 8:29 am
1801
Thanks for your help, everyone! Even though it is deprecated, the solution using the DomainSensitiveFrontier fits the bill perfectly, because we need to make ...
Dear all. It seems that in version 1.4.0 the crawl reports and seeds reports are not available from the GUI after the job is finished. On the screen, you get...