archive-crawler
Messages

Messages 5501 - 5530 of 6147
5501
I only crawl my own test URLs. Heritrix first needs to fetch robots.txt and DNS, which costs a lot of time. I do not need Heritrix to crawl the robots...
happyxinglele
Oct 8, 2008
8:29 am
5502
I don't know about the robots.txt part, but you will still be forced to "crawl" (i.e., contact) the DNS to obtain the IP address of the server. This is mandatory. ...
Jean-Noël Rivasseau
elvanor@...
Oct 8, 2008
8:34 am
5503
Oh, I know. But could anyone tell me how to strip the robots.txt? Thanks ...
happyxinglele
Oct 8, 2008
8:37 am
5504
Hi, Could we please not publish this? I don't know if the original writer is real, or not. But it is suspicious. Eliminating one text file from a crawl to save...
John Lekashman
lekash
Oct 8, 2008
3:47 pm
5505
I also don't know the reason for not downloading the robots.txt, but if it's because of not wanting to follow the "advice" in them, then there's a simple ...
Bart Kiers
prometheuzz
Oct 8, 2008
3:55 pm
5506
As you said, I realize robots.txt is a very important file for limiting illegal crawlers. I'll start to learn more about it. I prefer to let it remain. Thanks a lot. ...
happyxinglele
Oct 9, 2008
2:11 am
5507
As you said, I realize robots.txt is a very important file for limiting illegal crawlers. I'll start to learn more about it. I prefer to let it remain. Thanks a lot. ...
happyxinglele
Oct 9, 2008
2:14 am
5508
Hi, I tried to start a crawl job with Heritrix 2.0.1 and the AdaptiveRevisitFrontier, but unfortunately I ran into some NullPointerExceptions (the sheet config...
Juergen Umbrich
juergen@...
Oct 10, 2008
4:29 pm
5509
Hi Juergen, as far as I know the AdaptiveRevisitFrontier is broken in Heritrix 2.X. Adaptive revisit was an experiment for Heritrix 1.0x and is in the current...
Christian Krumm
chuk_ol
Oct 10, 2008
11:19 pm
5510
Hi, We had a problem with Heritrix not writing any crawl reports in the following case: The job was first paused due to a Low Disk Pause (caused by the ...
Nathalie Steinmetz
nathaliest
Oct 10, 2008
11:34 pm
5511
Hi, During our last (broad) crawl we stumbled upon the following fact: directly after the start of the crawl the download rate (kb/s as tracked in the...
Nathalie Steinmetz
nathaliest
Oct 11, 2008
12:22 am
5512
I think this is an FAQ, but I could not completely understand the current status of incremental crawling with Heritrix. I read the documents, the mailing...
Takeshi Kobayakawa
tskoba@...
Oct 14, 2008
8:13 pm
5513
Hi all, Is there a tool built in to the Heritrix 2 package that will create DAT files? If not, are there open source tools available? I appreciate any...
laurendko
Oct 15, 2008
11:42 pm
5514
Err, DAT <http://www.fileinfo.net/extension/dat> file? Regards, Bart....
Bart Kiers
prometheuzz
Oct 16, 2008
10:04 am
5515
Hi Group, Does anyone have any code to automate monitoring of the crawler? I would like to see stats like how long a job is taking and how many...
Vaijanath N. Rao
vaiju1981
Oct 16, 2008
10:53 am
5516
Sorry, I mean DAT <http://www.archive.org/web/researcher/dat_file_format.php> files like the Internet Archive uses as a type of index for Arcs. thanks, Lauren...
laurendko
Oct 16, 2008
1:19 pm
5517
Hi, Heritrix users may be interested in a project we just released as open source. CloudBase is a data warehouse system built on top of Hadoop. It is developed by...
Leo Dagum
leo_dagum
Oct 17, 2008
6:48 pm
5518
I don't know if you can do that in Heritrix 2, but in Heritrix 1.14.1 you can monitor all of these stats with the cmdline-jmxclient. Check in this forum a...
pilotboy_84
Oct 17, 2008
6:50 pm
5519
Hi everybody. I'm using heritrix-1.14.1 and I want to use all of my bandwidth with the crawler. I'm using a profile that only downloads the HTML code by ...
pilotboy_84
Oct 17, 2008
7:12 pm
5520
Hi, we are using munin [1] to plot e.g. number of docs, MB/s, etc. Integrating with munin is simple: you can use any scripting language to produce the value you want...
Holger Lausen
hlausen
Oct 18, 2008
7:29 am
5521
Hi everyone. I am new to the Heritrix project and looking forward to using this software for a personal project. However, I have been trying to set up my Eclipse...
o.lalonde
Oct 20, 2008
12:24 am
5522
I just realized that compiling from the root pom.xml didn't create any lib folder in dist/target/... I just tried running a build (which fails) from...
o.lalonde
Oct 20, 2008
10:51 am
5523
Hi, In Heritrix 1.12 you could read popup help messages in the Configure Settings pages when you clicked the question mark beside a property. In Heritrix 2.0...
nfoscarini
Oct 21, 2008
1:17 pm
5524
Hi, I really need some help, because I think I'm stuck in Heritrix 2.0.1. I want to crawl a list of seeds, and any URI that contains a keyword I want to rank...
nfoscarini
Oct 22, 2008
1:47 pm
5525
Hi all, I wrote a module which detects the "real" media type (MIME type) of a file based on the magic number approach. At the moment I am comparing the Apache...
Juergen Umbrich
juergen@...
Oct 22, 2008
4:34 pm
5526
I wrote a DecideRule that checks the header of the content to see what media type it is, but this requires the file to be downloaded first. Is that sort of...
Mathew Nik Foscarini
nfoscarini
Oct 22, 2008
4:45 pm
5527
I've asked two Java devs to compile Heritrix 2.0.1 and neither of them was able to do so... Anyone else getting errors? I feel it has something to do with Maven...
o.lalonde
Oct 22, 2008
5:01 pm
5528
Hi ... ah ok, behind my approach is the requirement that the crawler should avoid unnecessary HTTP lookups. Given resource limitations and assuming that over...
Juergen Umbrich
juergen@...
Oct 22, 2008
5:07 pm
5529
I tried for 2 days to get it to compile. I was never able to get it to work inside Eclipse. The Maven plugin will not work for me. I did have better results...
Mathew Nik Foscarini
nfoscarini
Oct 22, 2008
5:08 pm
5530
Many thanks for the reply. I am relieved to know that I am not the only one having difficulties compiling Heritrix. I thought I was doing something wrong, but...
o.lalonde
Oct 22, 2008
5:23 pm