Hi, I'm using Heritrix 2.0.2 with a few custom processor classes which I use to extract relevant data from the HTML (by replacing ARCWriterProcessor with one...
Enrico Detoma
enrico.detoma@...
Jul 1, 2009 7:35 am
5911
cross-posted from [Archive-access-discuss] On Tue, 30 Jun 2009 13:47:51 -0400 Zhenzhen Xue <zjuzhenzhen@...> ... interesting. is this really necessary? ...
Hi, we have found the cause of the speed problem: we had enabled the mirrorwriter processor, which caused a lot of disk operations, which slowed the disk down and used a lot...
Hi, I am trying to read a compressed archive with C# and not having any luck. I tried to use the GZipStream, but it stops after reading in the first 10 lines...
Hi people! I need to write a crawler that could find all the RSS links on a domain and store them in a database. However, I wouldn't like to crawl whole domains...
Hello nfoscarini, Each record in a compressed [w]arc is individually gzipped. A series of gzipped chunks concatenated together is itself a valid gzipped file....
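To illustrate the point about concatenated gzip members, here is a minimal Python sketch (not taken from the thread). The record contents and the helper names `make_multi_member`, `read_all`, and `split_members` are made up for the example; real [w]arc records carry headers that are not modelled here. It only shows that a reader has to keep going after the first member ends, which is a likely reason a decompressor appears to "stop" early.

```python
import gzip
import zlib


def make_multi_member(records):
    """Gzip each record on its own and concatenate the resulting members."""
    return b"".join(gzip.compress(record) for record in records)


def read_all(data):
    """Decompress the whole blob; gzip.decompress walks every member."""
    return gzip.decompress(data)


def split_members(data):
    """Yield the decompressed bytes of each gzip member separately."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunk = member.decompress(data) + member.flush()
        yield chunk
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data


if __name__ == "__main__":
    # Hypothetical stand-ins for archive records.
    records = [b"record one\n", b"record two\n", b"record three\n"]
    blob = make_multi_member(records)
    print(read_all(blob))
    for i, chunk in enumerate(split_members(blob)):
        print(i, chunk)
```

A reader that handles only a single member would return just the first record from `blob`; looping on the leftover bytes, as `split_members` does, recovers the rest.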
Hi, any link / suggestion on how to embed/reuse the Heritrix engine in another Java app? I just want to call the crawler with an initial seed URL, and get HTML...
Hi, I've been running a selective crawl for two days with 300 toe threads. After this time Heritrix unexpectedly crashed with this message in heritrix_out.log # #...
Hi people! I'm running a broad crawl, and after about 40-50 minutes getting a TMOE exception, followed by "ROS/RIS Already open for ToeThread ##" exceptions. ...
Hello, We are trying to set up dedupe in our crawls right now but we are having some issues launching the Recrawl jobs. We have no problems adding the...
Ignacio Garcia
igc.csmail@...
Jul 14, 2009 4:23 pm
5921
Hi, how can I crawl only those URLs which satisfy a regular expression? For example, I have one domain "www.example.com". I want to crawl only those URLs which contain...
Hello, I think you want to try the regular MatchesRegExpDecideRule instead of the MatchesFilePatternDecideRule which I believe looks at just the suffix of the...
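As a rough illustration of matching against the whole URL rather than just its suffix, here is a small Python sketch. It is not Heritrix code: it assumes the rule requires the regular expression to match the entire URL string, and the `/news/` path and the helper name `url_accepted` are hypothetical, since the original question is truncated.

```python
import re


def url_accepted(url, pattern):
    """Return True only if the pattern matches the entire URL (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    # Hypothetical pattern: accept URLs that contain "/news/" anywhere.
    pattern = r".*/news/.*"
    for url in (
        "http://www.example.com/news/item1.html",
        "http://www.example.com/about.html",
    ):
        print(url, url_accepted(url, pattern))
```

Under whole-string matching, a bare pattern such as `/news/` would reject every URL, so the leading and trailing `.*` are needed when the goal is "contains".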
At Tue, 14 Jul 2009 09:14:24 -0000, ... See: <http://webarchive.jira.com/browse/HER-1126> Your java version is somewhat out of date, it seems that upgrading to...
... This looks like the same issue as discussed in this previous message: http://tech.groups.yahoo.com/group/archive-crawler/message/5859 Increasing the...
That's generally the right approach. Can you find more of the exception stack, perhaps in heritrix_out.log? (The 'IllegalStateException' and anything that...
... You can see an example of a ReplayInputStream ('RIS') acquired and then properly closed in ARCWriterProcessor.innerProcessResult(). It's acquired via... ...
Hello Gordon, Here's the complete error log from heritrix_out.log: 2009-07-13 20:33:02.524::WARN: /crawler_area/do_launch.jsp: ...
Ignacio Garcia
igc.csmail@...
Jul 15, 2009 12:37 pm
5928
Hi Gordon! Thanks a lot for the attention and suggestions! Well, I have checked the whole code, and none of the custom processors (I have only 2 of them) use...
Gordon, you are the best! :) Today I have seen updates on svn, especially those for closing ReplayInputStream and the closeQuietly() utility. After an update,...
You're welcome; I wasn't sure the changes would make a difference, good to hear they have. (It's probably only the change in RecordingOutputStream that's...
Hi, I have been trying to build the 1.4.3 source code taken from the SourceForge page for the past 2 days without any luck. Something or the other comes up with...
I've never used your crawler, but I used to write my own simple solutions in Perl for crawling and scraping pages. I would like to ask you for a comment on the...
At Fri, 17 Jul 2009 04:23:16 -0000, ... Hi Utsav - Your message highlights one of the (408) warnings that the build produced, but not an error. Do you know...
Hi Erik, The error is in the import sun.net.www.protocol.fileUrlConnection. It says the sun.net cannot be resolved. So I commented out the portion using ...
At Sat, 18 Jul 2009 10:40:10 +0800, ... Hi Utsav - Heritrix 1.14.3 uses Maven 1.0.2 so be sure that you are using that version of Maven. The sun.net code is...
Hi Erik, Firstly, thanks for taking the time to look into this. I have removed maven 1.1 and installed 1.0.2 from the archive. I am using Ubuntu and jdk 1.5 ...
The use case you describe is not web crawling, it is just downloading a set of files. While Heritrix could be (using some custom bean shell scripts) configured...
Thanks a lot Erik, that worked. Also I reinstalled java and the advice on the link pointed at by you only worked after that. Now I want to add Javascript...