Thanks a lot for your reply Michael. ... particular, ... I am not overriding the heritrix.properties file apart from specifying a higher value of heap in...
1739
stack
stackarchiveorg
Apr 18, 2005 4:48 pm
... Are you overriding the default heritrix.properties file? In particular, this line: 'org.archive.crawler.datamodel.BigMapFactory.class = ...
1738
Saurabh Pathak
sau_pathak
Apr 18, 2005 1:27 pm
Thanks Michael and Gordon for your inputs, ... I am running my crawls from the command line so I don't think the above might be the case. Once crawl stops due to OOME I...
1737
stack
stackarchiveorg
Apr 15, 2005 4:11 pm
Also, was the crawl job that OOME'd started after another job had run w/o a restart of Heritrix between the running of the jobs? Is this a case of '[ 1123230...
1736
Gordon Mohr (@Interne...
gojomo
Apr 15, 2005 3:59 pm
Once an OutOfMemoryError is hit, a crawl attempts to pause, but will often be unable to proceed in any meaningful way, depending on what operation was ...
1735
Saurabh Pathak
sau_pathak
Apr 15, 2005 4:50 am
Thanks for your reply Tom. I tried running a broad-scope crawl using the current 1.3 snapshot of the CVS head. I also used BdbFrontier with 50 threads and...
1734
Tom Emerson
tree02139
Apr 15, 2005 12:47 am
... [...] That's the one. The crawl continues, so no worries. -- Tom Emerson Basis Technology Corp. Software Architect...
1733
stack
stackarchiveorg
Apr 15, 2005 12:42 am
... Paste it into a message Tom and send it over. ConcurrentModificationExceptions trying to view Frontier report are a known issue but you shouldn't be...
1732
Tom Emerson
tree02139
Apr 15, 2005 12:31 am
... Current status: Run time: 175h., 7 min. and 25 sec. Processed docs/sec: 9.57 (6.74) KB/sec: 175 (107) Active Thread Count: 150 of 150 Total data received:...
1731
Tom Emerson
tree02139
Apr 15, 2005 12:22 am
... That would make sense. Now that you can specify multiple ARC directories, you could spread writers across multiple drives. One thing that I haven't done in...
1730
stack
stackarchiveorg
Apr 14, 2005 9:41 pm
... Embarrassingly, we didn't spend enough time looking. More writers would mean more contention for the disk head possibly holding threads in the write step...
1729
Igor Ranitovic
iranitovic
Apr 14, 2005 9:15 pm
Hi Tom, That is pretty good! When you get a chance could you please check how many queues (hosts) are part of the crawl? Thanks. i....
1728
Tom Emerson
tree02139
Apr 14, 2005 8:31 pm
... This is interesting reading, thanks. Could I ask you to elaborate why increasing the number of writers seemed to lower throughput? -- Tom Emerson...
1727
stack
stackarchiveorg
Apr 14, 2005 8:23 pm
... There's an old RFE asking for this feature: http://sourceforge.net/tracker/index.php?func=detail&aid=919727&group_id=73833&atid=539102. It's of priority 3....
1726
Tom Emerson
tree02139
Apr 14, 2005 8:03 pm
What is the reason that the number of writers in the ARC writer pool cannot be increased mid-crawl? I increased the number of toe-threads from 50 to 90, and all 90...
1725
Tom Emerson
tree02139
Apr 14, 2005 7:34 pm
Broad-scope memory issues are a known problem. First off, see http://crawler.archive.org/faq.html#oome_broadcrawl. Secondly, I'd grab a 1.3 snapshot and try...
1724
Saurabh Pathak
sau_pathak
Apr 14, 2005 7:24 pm
Hi, I ran Heritrix 1.2.0 from the command line on a Debian machine to do a broad-scope crawl on a set of 200 seed URLs. I used 15 threads and allocated 1280 Megs of...
1723
Gordon Mohr (Internet...
gojomo
Apr 13, 2005 7:17 pm
... Looks like an improvement to me. I'll apply it. Thanks, good catch! More comment: The various robots.txt specifications all use the term 'path' to describe...
1722
ogrenholm
Apr 13, 2005 12:22 pm
Hi, I've found that for some reason or another some webmasters want to allow robots to access scripts, but not with some certain parameters, i.e., in...
1721
Tom Emerson
tree02139
Apr 12, 2005 3:58 pm
Hi all, I received a couple of NPEs in the new UbiCrawler-integrated code. The crawl continues, apparently just fine. But I thought I'd pass one along: Title:...
1720
stack
stackarchiveorg
Apr 11, 2005 9:56 pm
... I just did a fresh checkout and all built fine (Did you do a 'cvs update -Pd' to get all newly added items?). St.Ack...
1719
Mr. J
bighead007us
Apr 11, 2005 8:56 pm
... Thanks, works like a charm....
1718
Gordon Mohr (Internet...
gojomo
Apr 11, 2005 8:36 pm
... I suggest first pausing the crawl, then using the Frontier report or 'view or edit frontier URIs' option (which only appears in a paused crawl) to verify...
1717
Tom Emerson
tree02139
Apr 11, 2005 8:23 pm
I just updated my sources from HEAD and get the following failure in the unit tests: [junit] Running org.archive.crawler.frontier.AdaptiveRevisitHostQueueTest ...
1716
Mr. J
bighead007us
Apr 11, 2005 8:15 pm
Hi Folks, I am crawling a site and the crawl job has been running for 54 hrs and is at 99%. Normally it runs about 24 hrs, as in the last couple of crawls. I am pretty...
1715
stack
stackarchiveorg
Apr 11, 2005 3:45 pm
... Look at the crawl.log. This will tell you if you are crawling or not. Are you only fetching the Google robots.txt and then it goes no further? Google bars...
1714
David Cullen
mrdavidcullen@...
Apr 11, 2005 3:11 pm
Hi Guys, I want to use Heritrix to crawl Google search results and harvest the URLs. These URLs are then going to be put into a directory as in a listing. So...
1713
stackarchiveorg
Apr 4, 2005 6:12 pm
... There's a bunch of ways to do this. One way would be to register a new logger -- study how loggers are registered here ...
1712
chirag chauhan
chirag_299
Apr 4, 2005 9:06 am
Note: forwarded message attached. Respected sir, Our problem is this. We need to create one TXT file...
1711
chirag chauhan
chirag_299
Apr 4, 2005 5:59 am
Respected sir, Our problem is this. We need to create one TXT file that contains only the crawled URIs. There are lots of things in the crawl.log file, like ...
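For the request in message 1711 (a text file holding only the crawled URIs), one possible approach is to post-process crawl.log outside of Heritrix rather than register a new logger. The Python sketch below is an illustration only. It assumes the Heritrix 1.x crawl.log layout of whitespace-separated fields with the URI in the fourth column; the function names and file paths are hypothetical and not part of Heritrix.

```python
def extract_uris(log_lines, uri_column=3):
    """Yield the URI field from each whitespace-separated crawl.log line.

    Assumption: the URI is the fourth field (index 3), after the
    timestamp, fetch status code, and document size.
    """
    for line in log_lines:
        fields = line.split()
        # Skip blank or truncated lines that have no URI field.
        if len(fields) > uri_column:
            yield fields[uri_column]


def write_uri_list(log_path, out_path):
    """Write one crawled URI per line to out_path (hypothetical helper)."""
    with open(log_path, encoding="utf-8", errors="replace") as log, \
            open(out_path, "w", encoding="utf-8") as out:
        for uri in extract_uris(log):
            out.write(uri + "\n")
```

Calling `write_uri_list("crawl.log", "uris.txt")` would produce the plain list. If the log layout differs in a given Heritrix version, `uri_column` can be adjusted.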