archive-crawler
Messages

Messages 4648 - 4677 of 6141
4648
Hi Folks, I am using Heritrix 1.12.1 to crawl 500,000 thousand URLs. The setup has 2 machines with 1 GB RAM, each running a crawler instance with 10 Toe...
Goel, Ankur
ankur_goel79
Nov 2, 2007
6:52 am
4649
If you have permission/assurance that it's OK to crawl the target sites with no politeness delays, I'd say try it: set the delays to 0 and see how fast things...
Gordon Mohr
gojomo
Nov 2, 2007
11:14 pm
4650
Is there a way to figure out the "parent" URI that the current URI you are crawling was discovered from? I want to create a "breadcrumb" type trail of the...
Andrew Serff
andrewserff
Nov 5, 2007
4:19 pm
4651
Hello Andrew, Take a look at org.archive.crawler.datamodel.CandidateURI.getVia() method. Note that the 'via' URI itself will not have information about its own...
Igor Ranitovic
iranitovic
Nov 5, 2007
5:24 pm
4652
So, I have a simple class to invoke crawl jobs on Heritrix via JMX. It seems to work fine for about 600 total jobs submitted, but eventually I cannot connect...
acidbluebriggs
Nov 5, 2007
6:10 pm
4653
Sorry, subject should have read "Connecting to Heritrix via JMX eventually fails after many calls"...
acidbluebriggs
Nov 5, 2007
6:29 pm
4654
Can you get thread dumps from the heritrix instance and try and figure why it's stopped accepting connections and what it's doing w/ that 100% of CPU (Would be...
Michael Stack
stackarchiveorg
Nov 5, 2007
6:57 pm
4655
I'll submit my jobs to reproduce it. It will take quite a while but when it happens again I'll dump the threads. The UI is dead when this problem occurs so I...
acidbluebriggs
Nov 5, 2007
7:16 pm
4656
Well, I may have jumped the gun again. I was thinking that this symptom might be due to another post I wrote. The issue might be that my single domain crawl...
acidbluebriggs
Nov 5, 2007
8:03 pm
4657
... You are your own worst enemy. Smile. ... Could be because the code that prints out that message ain't too smart. The UI is looking for an instance named...
Michael Stack
stackarchiveorg
Nov 5, 2007
8:11 pm
4658
The issue is in the web application. The /admin/include/handler.jsp contains the code: snip ... if (handler == null) { if(Heritrix.isSingleInstance()) { ...
acidbluebriggs
Nov 5, 2007
8:36 pm
4659
I began a crawl with 10 or so seed items. I am using a surt_source_file for Decision: ACCEPT and another surt_source_file for Decision: REJECT. For both...
mjjjhjemj
Nov 6, 2007
4:04 pm
4660
Hi Mike, Once a URI is finished, crawled successfully or not, it will not be crawled again. In order to crawl a finished URI you will have to re-fetch it....
Igor Ranitovic
iranitovic
Nov 6, 2007
4:58 pm
4661
Hi Andrew, I just realized that you can import URIs from the GUI as well. Once the crawl is paused, you can follow the "View or Edit Frontier URIs" link in the...
Igor Ranitovic
iranitovic
Nov 6, 2007
5:42 pm
4662
Hi, I am running into this weird problem while fetching pages from a website http://www.vanillasoft.com The url I am trying to fetch is : ...
molzbh
Nov 7, 2007
1:00 am
4663
Hi, Can anyone suggest a sample configuration and typical crawling speed observed using Heritrix 1.12.1 on a single box setup having 1 GB of RAM trying to crawl...
Goel, Ankur
ankur_goel79
Nov 7, 2007
6:19 am
4664
hi, modified my job management script in a way that will not allow more than 4 instances at the same time and changed the bdb-cache-percent value to 35 in...
hinoglu
Nov 7, 2007
10:12 am
4665
I recently paused my crawl and added several seeds to the seed list. Viewing the seed report under the Reports tab I see the following displayed under 'Status...
mjjjhjemj
Nov 7, 2007
3:58 pm
4666
600Kbps isn't very much bandwidth for crawling: it would allow 75KB to be collected per second. We often find content to average near 20KB per URL (larger for...
Gordon Mohr
gojomo
Nov 7, 2007
4:08 pm
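The bandwidth arithmetic in message 4666 can be sketched in a few lines. This is only an illustration: the function name and the per-day projection are assumptions added here, and only the 600Kbps, 75KB and 20KB figures come from the message itself.

```python
def urls_per_second(bandwidth_kbps, avg_kb_per_url):
    """Estimate a fetch-rate ceiling from link speed and average document size.

    bandwidth_kbps: link speed in kilobits per second (8 bits per byte).
    avg_kb_per_url: assumed average size of one fetched URL, in kilobytes.
    """
    kb_per_second = bandwidth_kbps / 8  # kilobits -> kilobytes
    return kb_per_second / avg_kb_per_url


# Figures quoted in the message: 600Kbps link, roughly 20KB per URL.
rate = urls_per_second(600, 20)
per_day = rate * 86400  # seconds in a day; a best case that ignores politeness delays

print("URLs per second:", rate)
print("URLs per day:", per_day)
```

The result is an upper bound set by bandwidth alone; politeness delays and slow target servers would lower the rate actually observed.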
4667
... As 4*35% = 140% of heap, if at any time the 4 contemporaneous crawls try to use their maximum bdb-cache allocation, you will get an OOME. ... It depends on...
Gordon Mohr
gojomo
Nov 7, 2007
4:38 pm
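The 4*35% = 140% point in message 4667 can be checked with a short sketch. It assumes, as the message implies, that all concurrent crawls draw on one shared heap; the function names are hypothetical and are not part of Heritrix.

```python
def total_cache_percent(concurrent_crawls, bdb_cache_percent):
    """Worst-case share of the heap requested if every crawl fills its BDB cache."""
    return concurrent_crawls * bdb_cache_percent


def ceiling_percent_per_crawl(concurrent_crawls):
    """Largest whole-number percent per crawl that keeps the total at or below 100."""
    return 100 // concurrent_crawls


# Settings described in the thread: 4 simultaneous crawls, bdb-cache-percent of 35.
total = total_cache_percent(4, 35)

print("worst-case cache demand (% of heap):", total)
print("over-committed:", total > 100)
print("ceiling per crawl (%):", ceiling_percent_per_crawl(4))
```

The ceiling leaves nothing for the rest of the crawler's memory needs, so a workable setting would presumably sit well below it.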
4668
... try ... i have a single machine/multiple crawl jobs system, that value was the best i have without killing the system so far :o( ... my changes on...
hinoglu
Nov 7, 2007
9:29 pm
4669
Hi, I was wondering if there is anywhere a single resource for the various important settings when doing a larger crawl, and/or experience with ...
Holger Lausen
holger.lausen@...
Nov 8, 2007
11:35 am
4670
I recently paused my crawl and added several seeds to the seed list. Viewing the seed report under the Reports tab I see the following displayed under 'Status...
mjjjhjemj
Nov 8, 2007
5:33 pm
4671
I am performing a very large domain crawl that is going extremely slow as many of the pages are dynamically created pages from database driven sites. As a...
mjjjhjemj
Nov 8, 2007
5:54 pm
4672
I'm using Heritrix to crawl sites, download software packages and get signatures for the packages. One of the sites I need to crawl is Red Hat's "network" site....
kbhanor
Nov 9, 2007
9:13 pm
4673
Well, doesn't it just figure... After two days working on this problem, I decide to post... And then I find something on Red Hat's site. They had an...
kbhanor
Nov 9, 2007
9:54 pm
4674
... If your politeness (or the target server's performance) is the gating factor, and you're going to adjust dual crawlers to be just as polite, what's the...
Gordon Mohr
gojomo
Nov 9, 2007
10:23 pm
4675
... Are the affected seeds DNS URLs or IP dotted-quad (eg. 208.70.27.35) hosts? Is the separate crawl on the same machine? Can the success/failure machines...
Gordon Mohr
gojomo
Nov 9, 2007
11:38 pm
4676
The affected entries are DNS URLs. The separate crawl is on the same machine utilizing another heritrix instance under Setup. I don't have access to ping on...
mikej
mjjjhjemj
Nov 12, 2007
5:18 pm
4677
Dear IA-Team, looking at JIRA and SVN I'm well aware that you guys are really busy with work on Heritrix 2.x, but I'm wondering what chances are to see issue...
pandae667
Nov 13, 2007
3:03 pm