archive-crawler
Messages
Messages 5031 - 5060 of 6140
5031
My name is Shannon, I'm a student and I want to master Heritrix. I want to enable FTP support for my crawls and I have change...
shenyan yang
ysheny668
Mar 1, 2008
10:12 am
5032
Hi, I have a problem starting Heritrix 2.0 under Windows from Java main, using: Heritrix.main(new String[]{"-a testPassword"}); When I run this, I get the...
nathaliest
Mar 3, 2008
5:53 pm
5033
... When you say 'nothing more happens' -- does that mean the browser hangs, spinning, waiting for a response? Is there any output to the JVM's standard out? ...
Gordon Mohr
gojomo
Mar 3, 2008
9:30 pm
5034
Hi Nathalie. The following code works for me on my Ubuntu Linux machine. Try launching this code in your favorite IDE, setting the classpath to the libraries...
Christian Krumm
chuk_ol
Mar 3, 2008
10:47 pm
5035
Hi Christian, The problem was indeed the String[] that I submitted as a parameter. Using new String[] {"-a", "password"} everything works fine. Thanks a lot for...
nathaliest
Mar 3, 2008
11:02 pm
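The fix reported in message 5035 comes down to how arguments reach a Java main method: when main is called directly from code, no shell splits the string on spaces, so "-a testPassword" arrives as one token rather than two. A minimal sketch of that effect follows. The class ArgsDemo and the method adminPassword are hypothetical stand-ins written for illustration; they are not Heritrix's actual option parsing, which is not shown in the thread.

```java
public class ArgsDemo {
    // Hypothetical helper (not Heritrix code): returns the token that
    // follows "-a", or null when "-a" is not present as its own token.
    static String adminPassword(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-a")) {
                return args[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // One token: the flag and its value are fused, so "-a" is never seen.
        System.out.println(adminPassword(new String[] {"-a testPassword"}));
        // Two tokens: the flag and its value are separate array elements.
        System.out.println(adminPassword(new String[] {"-a", "testPassword"}));
    }
}
```

Running main prints null for the single-token array and testPassword for the two-token array, which mirrors the difference between the failing and the working call quoted above.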
5036
I'd like to be able to crawl with the following 4 filters: 1 - Crawl by path (e.g. http://sub.example.com/foo/ ) 2 - Crawl by host (e.g. http://sub.example.com )...
Micah Wedemeyer
mwedeme@...
Mar 4, 2008
9:44 pm
5037
Hi, We are currently deploying the Portuguese web archive and the next step is to start using the deduplicator. I read the paper "Managing duplicates across...
Daniel Gomes
daniel.gomes@...
Mar 5, 2008
3:12 pm
5038
Still having trouble here... I've tried adding the SURT prefixes to the seeds file, and it doesn't seem to limit the crawl scope. In addition, I see the...
Micah Wedemeyer
mwedeme@...
Mar 5, 2008
10:37 pm
5039
(Sorry for spamming, but I wanted to head off the inevitable reply...) I looked deeper and found the notes about starting each SURT prefix line with a "+"...
Micah Wedemeyer
mwedeme@...
Mar 5, 2008
10:58 pm
5040
Typically (and in the rule progression you've shown), SURT prefixes are used to rule things in, but not out. The general operation of your rules in plain...
Gordon Mohr
gojomo
Mar 6, 2008
1:56 am
5041
To also answer some of your earlier questions: ... The 'implied conversion' which is done automatically if you choose to use your seeds as SURT prefixes is...
Gordon Mohr
gojomo
Mar 6, 2008
2:08 am
5042
Gordon, This really clears things up, especially the transclusion rules. For our purposes, we're only doing text analysis, not archiving, so losing images and...
Micah Wedemeyer
mwedeme@...
Mar 6, 2008
3:04 pm
5043
My crawl jobs are slowing down drastically at the end of the crawl despite several threads being active according to the web UI. I would understand if the active...
Daniel Clark
daniel_a_clark
Mar 6, 2008
11:10 pm
5044
Hi! I'm new to Heritrix and was wondering what resource is mainly responsible for the amount of memory Heritrix uses. Is it the number of queued links? Gerwin...
Gerwin van Doorn
gvd78
Mar 7, 2008
8:30 am
5045
It's a combination of active threads, queue size, and number of seeds. If you're trying to crawl a large domain, then I'd recommend breaking the domain down into...
nfoscarini
Mar 7, 2008
2:42 pm
5046
At Thu, 6 Mar 2008 18:09:22 -0500, ... This is not unusual (others may correct me if I’m wrong). Often times you have basically run out of URLs except for a...
Erik Hetzner
e_hetzner
Mar 7, 2008
9:35 pm
5047
I don't know either, but I was thinking that maybe you have come across some URLs that are triggering an error (such as a timeout). The default for Heritrix is...
nfoscarini
Mar 7, 2008
9:39 pm
5048
Example: schedule with high priority if the URI matches /.*?\.(pdf|ppt)/. It appears like BdbMultipleWorkQueues could be used to achieve this. What's the...
robotjunior
Mar 9, 2008
7:44 am
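To make the example in message 5048 concrete, here is a rough sketch of how the quoted pattern classifies URIs. It uses only java.util.regex and assumes full-string matching via Matcher.matches(). The class PriorityDemo and the method isHighPriority are hypothetical names; nothing here touches BdbMultipleWorkQueues or any other Heritrix class, and how Heritrix would actually act on the result is not covered.

```java
import java.util.regex.Pattern;

public class PriorityDemo {
    // The pattern from the message, without the surrounding slashes.
    static final Pattern HIGH_PRIORITY = Pattern.compile(".*?\\.(pdf|ppt)");

    // Hypothetical helper: true when the whole URI matches the pattern,
    // i.e. the URI ends in ".pdf" or ".ppt".
    static boolean isHighPriority(String uri) {
        return HIGH_PRIORITY.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://example.com/report.pdf",
            "http://example.com/slides.ppt",
            "http://example.com/index.html",
            "http://example.com/report.pdf?download=1"
        };
        for (String uri : uris) {
            System.out.println(uri + " -> " + isHighPriority(uri));
        }
    }
}
```

Under the full-match assumption, a URI with a query string after the extension (the last one above) does not match, so a real rule might need a looser pattern.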
5049
It’s definitely not just a large number of URLs on one site nor the retry settings. There are at least 7 threads and 7 sites and my retry settings are set...
Daniel Clark
daniel_a_clark
Mar 11, 2008
4:13 pm
5050
Hello, I would like to download a single page (and all of its images, javascript,...) with Heritrix, but I do not know the right settings for this. Either only...
winter_lamm
Mar 11, 2008
8:15 pm
5051
This might be no help, but based on my (very limited) experience, here's what I would look at: Set up your decision rules as follows: 1) REJECT all...
Micah Wedemeyer
mwedeme@...
Mar 11, 2008
9:28 pm
5052
Hi All, As part of my research work I have been developing a Focused Crawler that uses Heritrix as its foundation. I just wanted to ask a couple of questions ...
Seamus Lawless
seamuslawless
Mar 11, 2008
9:44 pm
5053
Hi all, I'm interested in writing a new extractor, but I wanted to get people's input first before I get started. Maybe this has already been done by someone,...
nfoscarini
Mar 11, 2008
10:00 pm
5054
Here is the decision chain that we use to collect a single page (Heritrix 1.12.0): 1) REJECT by default (RejectDecideRule) 2) ACCEPT if SURT prefixed...
Bert Wendland
bwendland42
Mar 12, 2008
10:25 pm
5055
I have just deployed Heritrix and am in the process of trying to optimize it. We are running it on an 8-core/16GB RAM server and in the past we have ...
Jordan Mendler
jmendler
Mar 13, 2008
6:06 pm
5056
If you have access to a DNS server on your network, you could configure multiple domains to point to that external domain, so that Heritrix would see it as...
nfoscarini
Mar 13, 2008
6:20 pm
5057
What you're describing could look like a DoS attack from the server side. Do you have permission from the admins of the site to hit them that hard? I'd...
Micah Wedemeyer
mwedeme@...
Mar 13, 2008
8:00 pm
5058
Hi, I'm using Heritrix 1.12.1. In a standard frontier report, what does the "active balance" mean? In the following queue, what's the meaning of the different...
João Miranda
miranda_fccn
Mar 14, 2008
2:22 am
5059
Hi, I'm using Heritrix 1.12.1. Under which conditions does the code "-4001 Too many link hops away from seed" appear in logs? I intended to log the links not...
João Miranda
miranda_fccn
Mar 14, 2008
2:22 am
5060
Hello, Thank you for your hints, but when I use your settings, I always get only the HTML page. (In the crawl report only the MIME type text/html is...
winter_lamm
Mar 14, 2008
2:18 pm
Copyright © 2009 Yahoo! Inc. All rights reserved.