Respected sir, Our problem is this. We need to create one TXT file that contains only the crawled URIs. There are lots of other things in the crawl.log file, like ...
Note: forwarded message attached. Respected sir, Our problem is this. We need to create one TXT file...
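One way to approach the question above is sketched here in Python. It assumes each crawl.log line is whitespace-separated and that the URI is the fourth field (after timestamp, status code, and size); check this against your own log before relying on it. The function names `extract_uris` and `write_uri_list` are hypothetical and are not part of Heritrix.

```python
def extract_uris(log_lines, uri_field=3):
    """Return the URI column from crawl.log-style lines.

    Assumption: fields are whitespace-separated and the URI is at
    zero-based index `uri_field` (3, i.e. the fourth column).
    Lines with too few fields are skipped.
    """
    uris = []
    for line in log_lines:
        fields = line.split()
        if len(fields) > uri_field:
            uris.append(fields[uri_field])
    return uris


def write_uri_list(log_path, out_path):
    """Read a crawl log and write one URI per line to out_path."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        uris = extract_uris(log)
    with open(out_path, "w", encoding="utf-8") as out:
        for uri in uris:
            out.write(uri + "\n")
    return len(uris)
```

Calling `write_uri_list("crawl.log", "uris.txt")` would then produce the plain list of URIs; both file names here are placeholders.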
Hi Guys, I want to use Heritrix to crawl Google search results and harvest the URLs; these URLs are then going to be put into a directory, as in a listing. So...
David Cullen
mrdavidcullen@...
Apr 11, 2005 3:11 pm
1715
... Look at the crawl.log. This will tell you if you are crawling or not. Are you only fetching the Google robots.txt and then it goes no further? Google bars...
Hi Folks, I am crawling a site and it has been running for 54 hrs, with the crawl job at 99%. Normally it runs about 24 hrs, as in the last couple of crawls. I am pretty...
I just updated my sources from HEAD and get the following failure in the unit tests: [junit] Running org.archive.crawler.frontier.AdaptiveRevisitHostQueueTest ...
... I suggest first pausing the crawl, then using the Frontier report or 'view or edit frontier URIs' option (which only appears in a paused crawl) to verify...
Hi all, I received a couple of NPEs in the new UbiCrawler-integrated code. The crawl continues, apparently just fine. But I thought I'd pass one along: Title:...
Hi, I've found that for some reason or another some webmasters want to allow robots to access scripts, but not with certain parameters, i.e., in...
... Looks like an improvement to me. I'll apply it. Thanks, good catch! More comment: The various robots.txt specifications all use the term 'path' to describe...
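To illustrate the point under discussion, here is a minimal Python sketch of Disallow matching that treats the query string as part of the path being matched, so a rule can block a script only for certain parameters. It is an assumption-laden illustration, not the patch from this thread and not Heritrix's implementation; `is_disallowed` and the example rule are hypothetical.

```python
from urllib.parse import urlsplit


def is_disallowed(uri, disallow_rules):
    """Prefix-match Disallow rules against path plus query string.

    Assumption: rules are plain prefixes (no wildcards), and an
    empty rule disallows nothing.
    """
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(rule and target.startswith(rule) for rule in disallow_rules)
```

With the hypothetical rule `/cgi/search?mode=print`, the URI `http://example.com/cgi/search?mode=print` is blocked while `http://example.com/cgi/search?mode=view` is not; matching on the path alone could not tell the two apart.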
Hi, I ran Heritrix 1.2.0 from the command line on a Debian machine to do a broad-scope crawl on a set of 200 seed URLs. I used 15 threads and allocated 1280 Megs of...
Broad-scope memory issues are a known problem. First off, see http://crawler.archive.org/faq.html#oome_broadcrawl Secondly, I'd grab a 1.3 snapshot and try...
What is the reason that the number of writers in the ARC writer pool cannot be increased mid-crawl? I increased the number of toe-threads from 50 to 90, and all 90...
... There's an old RFE asking for this feature: http://sourceforge.net/tracker/index.php?func=detail&aid=919727&group_id=73833&atid=539102. It's of priority 3....
... This is interesting reading, thanks. Could I ask you to elaborate why increasing the number of writers seemed to lower throughput? -- Tom Emerson...
... Embarrassingly, we didn't spend enough time looking. More writers would mean more contention for the disk head possibly holding threads in the write step...
... That would make sense. Now that you can specify multiple ARC directories, you could spread writers across multiple drives. One thing that I haven't done in...
... Current status: Run time: 175h., 7 min. and 25 sec. Processed docs/sec: 9.57 (6.74) KB/sec: 175 (107) Active Thread Count: 150 of 150 Total data received:...
... Paste it into a message, Tom, and send it over. ConcurrentModificationExceptions trying to view Frontier report are a known issue but you shouldn't be...
Thanks for your reply, Tom. I tried running a broad-scope crawl using the current 1.3 snapshot of the CVS head. I also used BdbFrontier with 50 threads and...
Once an OutOfMemoryError is hit, a crawl attempts to pause, but will often be unable to proceed in any meaningful way, depending on what operation was ...
Also, was the crawl job that OOME'd started after another job had run w/o a restart of Heritrix between the running of the jobs? Is this a case of '[ 1123230...
Thanks Michael and Gordon for your inputs, ... I am running my crawls from the command line, so I don't think the above is the case. Once the crawl stops due to OOME I...
Thanks a lot for your reply, Michael. ... particular, ... I am not overriding the heritrix.properties file apart from specifying a higher value of heap in...