At Thu, 30 Apr 2009 12:32:37 -0400, ... Hi Eric - I think that most tools depend to some extent on hijacking the display system and turning that into a raster...
Cool! It's one of those things that seems like everybody wants it, but no one has quite figured out. And the various "services" like thumbshots all feel kinda...
At Sun, 3 May 2009 11:19:17 -0400, ... Here is the text (thanks to Mark Phillips for this): Khtml2png - http://khtml2png.sourceforge.net/ “Khtml2png is a ...
Hi all, How can I tweak Heritrix to crawl only within a seed? E.g., if my seed is www.espn.com, I would like to retrieve/download only links within espn.com. I...
... If you're just starting with Heritrix, we recommend using 1.14.3. With the default configuration, crawling will generally stay on the sites defined by the...
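To illustrate what "staying on the sites defined by the seeds" means in practice, here is a minimal Python sketch. It is not Heritrix code or configuration: `in_seed_scope` is a hypothetical helper, and the matching rule (same host or a subdomain of it, ignoring a leading "www.") is an assumption made only for this example.

```python
from urllib.parse import urlsplit


def _host(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    # Seeds such as "www.espn.com" carry no scheme; add one so urlsplit finds the host.
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_seed_scope(url, seeds):
    """Hypothetical scope test: True if url is on a seed's host or one of its subdomains."""
    host = _host(url)
    if not host:
        return False
    for seed in seeds:
        seed_host = _host(seed)
        if seed_host and (host == seed_host or host.endswith("." + seed_host)):
            return True
    return False
```

With the seed www.espn.com, a link to http://sports.espn.com/nba/ would be kept under this rule, while a link to http://www.example.com/ would be dropped.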
Hi Gordan, Thanks for the reply. I have been using Heritrix 2.0.0 for a couple of months. Is the process you mentioned the same for 2.0.0? Thanks...
Hey everyone, I am experimenting with Heritrix to try out some simple search algorithms that I have designed. Unfortunately, my bandwidth sucks hence I will...
... I have the same question. I have a bunch of URLs resolved through handle.net. How do I configure Heritrix 2 to follow the redirect from hdl and then crawl only the...
raffaele messuti
raffaele@...
May 9, 2009 4:43 pm
5829
Hey everyone, I have been experimenting with heritrix over the weekend but was not able to obtain any fruitful results. I have read the documentation quite...
I always recommend starting with the default rules, then making individual changes that are each understood. You've left off the PrerequisiteAcceptDecideRule...
I used the default regex expression with Heritrix 2.0.2 to display remaining URIs without a problem. I used the same expression in 1.14.3 and get no results....
Make sure there are no stray spaces in your 1.14.3 regex. Use the frontier report or other status info to make sure there are still URIs queued. Also, I don't...
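The effect of a stray space can be shown with a small Python sketch. It assumes the expression must match the whole URI, as Java's `Matcher.matches()` requires; whether the 1.14.3 report applies the regex that way is an assumption here, and both the pattern and the URI are made up for illustration.

```python
import re


def matches_whole(pattern, uri):
    """True only if the pattern matches the entire URI (whole-string semantics)."""
    return re.fullmatch(pattern, uri) is not None


uri = "http://www.espn.com/nba/"          # illustrative URI
clean = r".*espn\.com.*"                  # illustrative pattern
stray = r".*espn\.com.* "                 # same pattern with a trailing space

print(matches_whole(clean, uri))          # pattern without stray space
print(matches_whole(stray, uri))          # trailing space must match a literal space
```

Under whole-string matching, the trailing space has to match a literal space at the end of the URI, so an expression that looks identical on screen can silently match nothing.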
Hi all, I would like to know if Heritrix has some modules or functions to support sitemaps[1] and the sitemap protocol[2]. Especially if Heritrix is parsing...
juergen@...
May 14, 2009 4:54 pm
5839
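For readers unfamiliar with the sitemap protocol asked about above, here is a minimal Python sketch of pulling the listed URLs out of a sitemap file. It is not a Heritrix module; `sitemap_locs` is a hypothetical name, and the sketch only reads `<loc>` entries in the sitemaps.org 0.9 namespace, ignoring fields such as `lastmod` or `priority`.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org, schema version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    locs = []
    for el in root.iter(SITEMAP_NS + "loc"):
        text = (el.text or "").strip()
        if text:
            locs.append(text)
    return locs
```

The returned URLs could then be fed to a crawler as extra seeds or candidate links; how that would be wired into Heritrix is left open here.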
Hello. Could somebody help me use this tool? It won't work on my computer running Windows XP. I have tried hard, but it still won't work. I have followed...
I'm running a job seeded with 7M urls. It ran to about 25% completion, then died. I'm now trying to restart the job (new job based on recovery-log), but the...
One simple thing I've done is move the seedlist file from the job directory to something else, and put in a one site seed list. After all, by now, the whole...
It seems you need to change the permissions of the file named "jmxremote.password" so that only the owner can change it (in Windows you can use the command: cacls). If...
The jobs and profiles are under the directory ./jobs. The completed job should be named "completed-randomNumber"; copy this directory and rename it to...
Can you provide some details? A screenshot of what the page looks like? Here is what the Plone project recommends as an approach for asking for help: ...
To avoid the message about password file protection, I couldn't simply set the file as read-only. After right-clicking it and selecting the Properties option, I...
My search engine course just released a Java Sitemap Parser on SourceForge, available at http://sourceforge.net/projects/sitemap-parser/ . This could be...
Hi, I am still wondering if there's any publicly available ARC writer for integrating Heritrix with Solr. How did you integrate both systems? Thanks! Tony -- ...