The sourceforge-link is wrong. It should be: http://sourceforge.net/tracker/index.php?func=detail&aid=1106992&group_id=73833&atid=539102 regards ... Søren...
... Exception is coming from here: http://crawler.archive.org/xref/org/archive/crawler/framework/CrawlController.html#651 Are you using bdbfrontier? What kind...
Sounds like a regression in Heritrix, Jay. Can we have the original page URL? It will help in making the fix (you can send it privately if you like). Thanks, St.Ack...
Will Heritrix do HTTP/1.1 requests? If not, are there any plans to make it do so? best -- Bjarne Andersen IT-udvikler STATSBIBLIOTEKET Universitetsparken 8000 Århus C ...
Not at the moment. There has been talk about making this an option, but no concrete plans last time I heard. I think people were unsure how you could best...
Hi all, I downloaded and set up the 1.5 version of Heritrix from svn, in the hope that its memory performance was significantly better than the older 1.3...
I don't think it is correct to extrapolate the memory usage in a linear fashion like that. I'm currently running a crawl that has completed 5.3 million documents...
Hi, I want to know how Heritrix stops ToeThreads from copying already-seen URIs into the FrontierDB. Is there any chance of duplicate URIs in the database? ...
The only thing I did was to: * Log in to the web admin. * Create a new job based on the default template * Settings / change 'user agent', and 'from' fields *...
I'm not exactly sure what you are asking about, but I'll try to answer. The ToeThreads do not handle duplicate detection. This is done in the Frontier....
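To illustrate the point that duplicate detection sits in the Frontier rather than in the ToeThreads, here is a minimal sketch of an "already seen" check. The class and method names are hypothetical and this is not Heritrix's actual implementation; it only assumes that every discovered URI passes through one shared, synchronized filter before it is scheduled.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: not a Heritrix class.
class AlreadySeenSketch {
    // In-memory set for illustration only; a real crawler would need a disk-backed structure.
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a URI is offered.
    // Worker threads call this; only accepted URIs would be queued for fetching.
    synchronized boolean offer(String uri) {
        return seen.add(uri);
    }

    public static void main(String[] args) {
        AlreadySeenSketch filter = new AlreadySeenSketch();
        System.out.println(filter.offer("http://example.com/")); // true
        System.out.println(filter.offer("http://example.com/")); // false
    }
}
```

Because the check and the insert happen in one synchronized call, two threads offering the same URI at the same time cannot both be accepted.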
Hi, I am running the crawler with 50 threads. But many times in the console I see "Active count thread : 0 of 50", and I found the crawler making no progress....
Due to politeness rules, if you are crawling only a few hosts, the crawler will often be idling, waiting before it can go fetch the next document. This is...
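To see why threads can sit idle with only a few hosts, here is a rough sketch of one common form of politeness rule: wait a multiple of the last fetch time for that host, clamped between a minimum and a maximum. The method name, parameter names, and the numbers in `main` are illustrative assumptions, not values taken from this thread.

```java
// Hypothetical sketch of a per-host politeness delay.
class PolitenessSketch {
    // Delay before the next request to the same host:
    // lastFetchMs * delayFactor, clamped to [minDelayMs, maxDelayMs].
    static long politenessDelayMs(long lastFetchMs, double delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        long delay = Math.round(lastFetchMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }

    public static void main(String[] args) {
        // Assumed example settings: factor 5, min 2 s, max 30 s.
        System.out.println(politenessDelayMs(1000, 5.0, 2000, 30000)); // 5000
    }
}
```

With such a rule each host yields at most one fetch per delay period, so a crawl over a handful of hosts cannot keep 50 threads busy no matter how many are configured.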
Hi! I'm going to do further work on the DominnameQueueAssignmentPolicy that Bjarne posted earlier, which splits the host name down to the last two parts (need...
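For readers following along, the "last two parts" idea can be sketched as below. This is a hypothetical helper, not the policy code that was posted to the list, and it shows only the naive split.

```java
import java.util.Locale;

// Hypothetical sketch: derive a queue key from the last two labels of a host name.
class DomainQueueKeySketch {
    static String lastTwoLabels(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        if (labels.length <= 2) {
            return lower; // already a bare domain or a single-label host
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.example.com"));   // example.com
        System.out.println(lastTwoLabels("news.example.co.uk")); // co.uk
    }
}
```

The second call shows the weakness of a plain two-label split: hosts under suffixes such as `co.uk` all collapse into one queue, and numeric IP addresses are mangled as well, so a real policy would need extra handling for those cases.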
... Ok. You did nothing out-of-the-ordinary. ... That's what I'd look at next only I have no winxp box on my end. Would be great if you could figure what...
Hey Tom: Is HERITRIX_HOME set? Otherwise, missing from your stdout/stderr output are the usual: 23:20:43.884 EVENT Starting Jetty/4.2.23 23:20:44.956 EVENT...
... Queue names can be arbitrary Strings -- the exact format depends on the QueueAssignmentPolicy in use. Other parts of the code are not looking into the...
Hello, Going through the code I get the feeling that the URLs in pendingUrisDB (present in BdbMultipleWorkQueues) have been organized from the...
Hi St.Ack / Michael Hansen, As far as my limited experience with winxp and Heritrix 1.4.0 goes, it doesn't take the default profiles from the jar file. This works fine...
... Yeah. I'd guess the problem is here: http://crawler.archive.org/xref/org/archive/crawler/admin/CrawlJobHandler.html#335. We're using File.separator when...
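If the problem is indeed a platform separator ending up in a resource lookup, the general Java rule is that classpath and jar resource names always use '/', whatever `File.separator` is on the host. A minimal sketch follows; the class name, helper names, and the example path are hypothetical and are not taken from CrawlJobHandler.

```java
import java.io.File;

// Hypothetical sketch: build resource names that work inside a jar on any platform.
class ResourcePathSketch {
    // Join path parts with '/', which is what ClassLoader.getResource expects.
    static String resourceName(String... parts) {
        return String.join("/", parts);
    }

    // Convert a path that was built with the platform separator into a resource name.
    static String toResourceName(String platformPath) {
        return platformPath.replace(File.separatorChar, '/');
    }

    public static void main(String[] args) {
        // Example path is illustrative only.
        System.out.println(resourceName("profiles", "default", "order.xml"));
    }
}
```

On Windows `File.separator` is a backslash, so a name built with it would match no jar entry, while the same code would appear to work on Linux where the separator is already '/'.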
Trying to build CVS head with Maven 1.0.2 on JDK 1.5 under Solaris 10 is giving me fits. I'll assume that others can build with Maven 1.0.2 on JDK 1.5 on Linux...
Dear all, I have been running heritrix (1.4.0) for about 10 days, with about 10,000 seeds, broad scope, Tom Emerson's "HTML only" filters, 150 threads, and ...
... In my experience the crawl is pretty much dead at that point. I have yet to succeed in doing anything beyond shutting it down. I'm re-running a crawl that...
Folks, I am having trouble crawling where moderately large numbers of seeds are involved. Some of the seeds are accepted, but most defer for reasons I cannot...
... The version complaint usually happens when you mix classfiles made with different versions of the jdk (1.4 vs. 1.5). What happens if you do a 'maven...
Thanks Tom. This one has been around for a while (See http://sourceforge.net/tracker/index.php?func=detail&aid=1218961&group_id=73833&atid=539099. Kris also...