archive-crawler
Messages 4485 - 4514 of 6147
4485
... ...
tztwh
Aug 1, 2007
8:28 pm
4486
Hi, all: I have a question. I want to use Heritrix to do incremental crawling, but I don't know how to do it. Does anyone know how to solve...
vretr
Aug 2, 2007
7:09 am
4487
This is a somewhat loaded question as 'incremental crawling' can be a vague term. It may help to explain what you hope to achieve (reduce data volume etc.) For...
Kristinn Sigurðsson
kristsi25
Aug 2, 2007
10:40 am
4488
I am crawling a large (prominent) website and I am encountering an issue with their links. During a crawl I discovered what I believe is a bug in HttpClient,...
Jeff
destroyr2
Aug 3, 2007
4:00 pm
4489
Hi all, I am trying to run a crawl with scope=DecidingScope and rule=TooManyHopsDecideRule with max-hop=3. I get the alert mentioned in the subject of the...
verbalkint81
Aug 6, 2007
7:13 am
4490
OK - I agree - this is not a bug. The manual should, though, mention that the minimum setting for max-retries should be 3. Best, Bjarne ... -- Bjarne Andersen ...
Bjarne Andersen
bjarne_dk2000
Aug 6, 2007
10:46 am
4491
I have downloaded the latest Heritrix source from Sourceforge - 1.12.1. It mentions in the manual somewhere that the .project and .classpath files are included...
mjjjhjemj
Aug 6, 2007
3:33 pm
4492
The eclipse dot files must not be bundled into the src tarball. Download them individually here from the heritrix_1_12 src branch -- ...
Michael Stack
stackarchiveorg
Aug 6, 2007
3:53 pm
4493
The .classpath and .project files are not included in the source archives. If you have Subclipse installed on your box I recommend just using it to checkout...
Tom Emerson
TEmerson@...
Aug 6, 2007
4:22 pm
4494
Hello Michael and others, I gathered the .project and .classpath files from the link below and dropped these in the top level directory of the heritrix-1.12.1...
mikej
mjjjhjemj
Aug 6, 2007
10:54 pm
4495
The yellow exclamations are code style warnings; they can be safely ignored (or turned off by type in Eclipse preferences). - Gordon @ IA...
Gordon Mohr
gojomo
Aug 6, 2007
11:04 pm
4496
When I try setting up credentials for a HTML-login on some page I have a problem with the first seed not getting crawled. I have 2 seeds: http://www.foo.dk/ ...
Bjarne Andersen
bjarne_dk2000
Aug 7, 2007
10:21 am
4497
Looking at your log below, it seems that www.foo.dk is http://www.arto.dk/ and the login page is http://www.arto.dk/login.asp. If that...
Michael Stack
stackarchiveorg
Aug 8, 2007
6:22 pm
4498
I want to add (or find) functionality that allows for notifications to be received when a crawl is completed. Does anyone know where the best place to...
acidbluebriggs
Aug 9, 2007
9:42 pm
4499
After running for a few seconds, Heritrix crashed. I found the following URL error message in the logs/uri-error file: 2007-08-10T15:01:27.978Z...
ahmed ghouzia
ghouzia
Aug 10, 2007
7:53 pm
4500
Ahmed - The "Contains non-LDH characters" URL error is minor and advisory; it won't stop a crawl. The "Stream closed" error is more serious; the recovery log...
Gordon Mohr
gojomo
Aug 10, 2007
8:03 pm
4501
I checked my configuration and all the things you mentioned, but finally I found that these errors did not appear when I removed a preprocessor that I...
ahmed ghouzia
ghouzia
Aug 12, 2007
10:14 pm
4502
Oh - my manual scrambling of hostname was ineffective ;-) arto.dk should have been foo.dk in the crawl.log extract. So - the correct crawl.log is: ...
Bjarne Andersen
bjarne_dk2000
Aug 13, 2007
11:00 am
4503
I am using 1.10.2 and have paused the crawl. I then clicked the 'Logs' tab and scrolled down to the bottom where I see 'Rotate crawler logs'. I clicked on this...
mjjjhjemj
Aug 15, 2007
4:36 pm
4504
Hi all, I am using Heritrix 1.12.1 for crawling. Could anyone tell me how to exclude some URLs from the crawl? I.e., I don't want to crawl the contactus,...
chandubigc
Aug 19, 2007
8:04 pm
4505
Hi there I am running Heritrix 1.12.1 and I have a very strange problem. When I submit a job via JMX (details to come) the seed URL causes an internal error,...
joeyfreund
Aug 22, 2007
5:13 pm
4506
I have run into this. Did you create your own XML file for submitting jobs? Did you add anything (new nodes) to the document? Does the document have any xml...
acidbluebriggs
Aug 22, 2007
6:03 pm
4507
First of all, thanks for the response. As for your questions: 1. I am using my profile's order.xml file, and I modify only the name, description and date (I set...
joeyfreund
Aug 22, 2007
6:24 pm
4508
I've got Heritrix running in Amazon EC2, and I'm still mucking around with the configuration. I let a crawl run for a few days, and the job's state/ directory...
Ted Dziuba
ted@...
Aug 22, 2007
7:38 pm
4509
Nope. No steady state. 8G is chump change for a long-running crawler. It's a state db of where you have been. So, of course it grows as you run longer. I...
lekash
Aug 22, 2007
8:56 pm
4510
So is the size of the state directory a function of number of URLs visited, and not amount of data downloaded? Tangentially, I think that EC2 will do for our...
Ted Dziuba
ted@...
Aug 22, 2007
11:03 pm
4511
While I haven't taken the code apart, I think so. EC2 can work for a number of small crawls, no problem. I just hate running systems that fall apart cause they...
lekash
Aug 22, 2007
11:20 pm
4512
The state directory is the home of the BerkeleyDB-JE environment used by the crawler. The three main things stored there are: (1) A series of disk-backed maps...
Gordon Mohr
gojomo
Aug 23, 2007
12:13 am
4513
Hello All, I have the latest Eclipse IDE running on my Windows PC. I would like to get to the point where I can run Heritrix within the Eclipse environment so I can...
mjjjhjemj
Aug 23, 2007
8:08 pm
4514
... There's the developer's guide: http://crawler.archive.org/articles/developer_manual/building.html#eclipse ... You'll need to copy the /lib directory jar...
Paul Jack
poetbeware
Aug 23, 2007
8:45 pm