archive-crawler

Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English

Messages

Messages 1711 - 1740 of 8125   Newest  |  < Newer  |  Older >  |  Oldest
1740 Saurabh Pathak
sau_pathak
Apr 18, 2005
7:37 pm
Thanks a lot for your reply Michael. ... particular, ... I am not overriding the heritrix.properties file apart from specifying a higher value of heap in...
1739 stack
stackarchiveorg
Apr 18, 2005
4:48 pm
... Are you overriding the default heritrix.properties file? In particular, this line: 'org.archive.crawler.datamodel.BigMapFactory.class = ...
1738 Saurabh Pathak
sau_pathak
Apr 18, 2005
1:27 pm
Thanks Michael and Gordon for your inputs, ... I am running my crawls from the command line, so I don't think the above is the case. Once the crawl stops due to an OOME I...
1737 stack
stackarchiveorg
Apr 15, 2005
4:11 pm
Also, was the crawl job that OOME'd started after another job had run w/o a restart of Heritrix between the running of the jobs? Is this a case of '[ 1123230...
1736 Gordon Mohr (@Interne...
gojomo
Apr 15, 2005
3:59 pm
Once an OutOfMemoryError is hit, a crawl attempts to pause, but will often be unable to proceed in any meaningful way, depending on what operation was ...
1735 Saurabh Pathak
sau_pathak
Apr 15, 2005
4:50 am
Thanks for your reply Tom. I tried running a broad-scope crawl using the current 1.3 snapshot of the CVS head. I also used BdbFrontier with 50 threads and...
1734 Tom Emerson
tree02139
Apr 15, 2005
12:47 am
... [...] That's the one. The crawl continues, so no worries. -- Tom Emerson Basis Technology Corp. Software Architect...
1733 stack
stackarchiveorg
Apr 15, 2005
12:42 am
... Paste it into a message Tom and send it over. ConcurrentModificationExceptions trying to view Frontier report are a known issue but you shouldn't be...
1732 Tom Emerson
tree02139
Apr 15, 2005
12:31 am
... Current status: Run time: 175h., 7 min. and 25 sec. Processed docs/sec: 9.57 (6.74) KB/sec: 175 (107) Active Thread Count: 150 of 150 Total data received:...
1731 Tom Emerson
tree02139
Apr 15, 2005
12:22 am
... That would make sense. Now that you can specify multiple ARC directories, you could spread writers across multiple drives. One thing that I haven't done in...
1730 stack
stackarchiveorg
Apr 14, 2005
9:41 pm
... Embarrassingly, we didn't spend enough time looking. More writers would mean more contention for the disk head possibly holding threads in the write step...
1729 Igor Ranitovic
iranitovic
Apr 14, 2005
9:15 pm
Hi Tom, That is pretty good! When you get a chance could you please check how many queues (hosts) are part of the crawl? Thanks. i....
1728 Tom Emerson
tree02139
Apr 14, 2005
8:31 pm
... This is interesting reading, thanks. Could I ask you to elaborate why increasing the number of writers seemed to lower throughput? -- Tom Emerson...
1727 stack
stackarchiveorg
Apr 14, 2005
8:23 pm
... There's an old RFE asking for this feature: http://sourceforge.net/tracker/index.php?func=detail&aid=919727&group_id=73833&atid=539102. It's of priority 3....
1726 Tom Emerson
tree02139
Apr 14, 2005
8:03 pm
What is the reason that the number of writers in the ARC writer pool cannot be increased mid-crawl? I increased the number of toe-threads from 50 to 90, and all 90...
1725 Tom Emerson
tree02139
Apr 14, 2005
7:34 pm
Broad-scope memory issues are a known problem. First off, see http://crawler.archive.org/faq.html#oome_broadcrawl Secondly, I'd grab a 1.3 snapshot and try...
1724 Saurabh Pathak
sau_pathak
Apr 14, 2005
7:24 pm
Hi, I ran Heritrix 1.2.0 from the command line on a Debian machine to do a broad-scope crawl on a set of 200 seed URLs. I used 15 threads and allocated 1280 Megs of...
1723 Gordon Mohr (Internet...
gojomo
Apr 13, 2005
7:17 pm
... Looks like an improvement to me. I'll apply it. Thanks, good catch! More comment: The various robots.txt specifications all use the term 'path' to describe...
1722 ogrenholm
Apr 13, 2005
12:22 pm
Hi, I've found that for some reason or another some webmasters want to allow robots to access scripts, but not with certain parameters, i.e., in...
1721 Tom Emerson
tree02139
Apr 12, 2005
3:58 pm
Hi all, I received a couple of NPEs in the new UbiCrawler-integrated code. The crawl continues, apparently just fine. But I thought I'd pass one along: Title:...
1720 stack
stackarchiveorg
Apr 11, 2005
9:56 pm
... I just did a fresh checkout and all built fine (Did you do a 'cvs update -Pd' to get all newly added items?). St.Ack...
1719 Mr. J
bighead007us
Apr 11, 2005
8:56 pm
... Thanks, works like a charm....
1718 Gordon Mohr (Internet...
gojomo
Apr 11, 2005
8:36 pm
... I suggest first pausing the crawl, then using the Frontier report or 'view or edit frontier URIs' option (which only appears in a paused crawl) to verify...
1717 Tom Emerson
tree02139
Apr 11, 2005
8:23 pm
I just updated my sources from HEAD and get the following failure in the unit tests: [junit] Running org.archive.crawler.frontier.AdaptiveRevisitHostQueueTest ...
1716 Mr. J
bighead007us
Apr 11, 2005
8:15 pm
Hi Folks, I am crawling a site and it has been running for 54 hrs with the crawl job at 99%. Normally it runs about 24 hrs, as in the last couple of crawls. I am pretty...
1715 stack
stackarchiveorg
Apr 11, 2005
3:45 pm
... Look at the crawl.log. This will tell you if you are crawling or not. Are you only fetching the google robots.txt and then it goes no further? Google bars...
1714 David Cullen
mrdavidcullen@...
Apr 11, 2005
3:11 pm
Hi Guys, I want to use Heritrix to crawl google search results and harvest the URLs, these URLs are then going to be put into a directory as in a listing. So...
1713 stackarchiveorg
Apr 4, 2005
6:12 pm
... There's a bunch of ways to do this. One way would be to register a new logger -- study how loggers are registered here ...
1712 chirag chauhan
chirag_299
Apr 4, 2005
9:06 am
Note: forwarded message attached. Respected sir, Our problem is this. We need to create one TXT file...
1711 chirag chauhan
chirag_299
Apr 4, 2005
5:59 am
Respected sir, Our problem is this. We need to create one TXT file that contains only the crawled URIs. There are lots of things in the crawl.log file, likewise ...