archive-crawler
Messages 2771 - 2800 of 6147
2771
... Exactly this happened to me a year or two back (different site and different crawler, of course). I can understand it too. Chances are that the admin...
Gordon Paynter
Gordon.Paynter@...
Apr 5, 2006
6:33 am
2772
... Looks like I'd already set up auto-building of the hcc jar. Get latest version from here: ...
Michael Stack
stackarchiveorg
Apr 10, 2006
10:25 pm
2773
Hi, I'm using Heritrix 1.2 on Windows. I also created a job which takes a long time to crawl, like 2 to 3 days, but it's still not completed, and after all I have to...
hiraldesai77
Apr 12, 2006
8:52 am
2774
... Upgrade. Latest is 1.6 (1.8 release is imminent). It can be obtained here: http://crawler.archive.org/downloads.html. ... Recovery mechanisms are...
Michael Stack
stackarchiveorg
Apr 12, 2006
4:24 pm
2775
Crawler stuck on the links and cannot be paused (hence cannot be checkpointed) ================= LINK 1 ========================= [ToeThread #15:...
joehung302
Apr 14, 2006
5:12 pm
2776
Hi guys, when using NutchWax to search the ARC file, if I want to save the search result of a query string, which part of the code should I modify? I mean which...
Andy Lee
lqy_nku
Apr 15, 2006
7:14 am
2777
... You want to save the search result page every time a query is run? Study the search.jsp in a checkout of the nutchwax source or study the ...
Michael Stack
stackarchiveorg
Apr 15, 2006
5:06 pm
2778
Ok, I see. Thank you very much, St.Ack!...
Andy Lee
lqy_nku
Apr 16, 2006
5:44 am
2779
Hi my friends, there are many websites now using forms to generate dynamic webpages. How can I use Heritrix to crawl this kind of dynamic webpage? Thanks for...
Andy Lee
lqy_nku
Apr 16, 2006
5:48 am
2780
Any suggestions on the easiest way of rerunning a form login? There's a key input value being generated on the first hit and I couldn't figure a way to...
Samuel
samendonca
Apr 17, 2006
1:43 pm
2781
... Check out http://crawler.archive.org/articles/user_manual.html#credentials. Does this help? St.Ack...
Michael Stack
stackarchiveorg
Apr 17, 2006
3:36 pm
2782
Can anyone give me a clue? Or does Heritrix support this kind of function? Where can I find this kind of information or examples? Sorry, I am a newbie of...
Andy Lee
lqy_nku
Apr 17, 2006
6:12 pm
2783
... Did you see this mail from this morning Andy? http://groups.yahoo.com/group/archive-crawler/message/2781 Might be what you want. St.Ack...
Michael Stack
stackarchiveorg
Apr 17, 2006
8:54 pm
2784
Yes, Michael. I saw the mail. But I still don't know how to configure Heritrix to do that. Because there are some name/value pairs I need to assign for...
Andy Lee
lqy_nku
Apr 17, 2006
10:50 pm
2785
... Well, I've already managed to login successfully on several hosts using HTML form credentials, but lately I've been getting some ...
Samuel
samendonca
Apr 18, 2006
1:45 pm
2786
Hello, Samuel, I am trying to use HTML form credentials too, but I don't know how to do it. Can you tell me where you set your username/password pairs for...
AndyFrun
lqy_nku
Apr 18, 2006
3:12 pm
2787
Regarding the below, Andy wrote me off list clarifying what it is that he wants to do. He wants to be able to POST arbitrary data to arbitrary HTML forms. The...
stack@...
stackarchiveorg
Apr 18, 2006
3:53 pm
2788
Andy, have you read the link cited below? If it is an insufficient description of the login functionality, I would like to know so I can redress. Thanks, St.Ack...
stack@...
stackarchiveorg
Apr 18, 2006
3:55 pm
2789
... Is there a reproducible sequence of events that leads to your getting the exception below? Is it with HEAD of Heritrix or a released version? Sounds like a bug...
Michael Stack
stackarchiveorg
Apr 18, 2006
7:03 pm
2790
St.Ack, Time to start testing HCC. Any tips? I have seen that now the sources and jar can be downloaded directly, but the "Getting Started" section in the...
tizo_trico
Apr 18, 2006
10:49 pm
2791
Hi, occasionally we want to be able to restart the crawler and start a new job and carry across the already-seen list, but nothing else (such as the frontier,...
Greg Kempe
gregkza
Apr 19, 2006
1:07 am
2792
... Another option would be to use the 'recover' log; all of the 'F+' lines are URLs that the crawl considered 'seen'. A typical recover-from-log scans the log...
Gordon Mohr
gojomo
Apr 19, 2006
6:29 am
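
Editor's note on 2792: a minimal sketch of pulling the 'seen' URLs out of a recover log, based only on the statement above that the 'F+' lines are URLs the crawl considered seen. The line layout is an assumption (a tag, a space, then the URL as the next whitespace-delimited token, with any trailing fields ignored), and the function names and file path are hypothetical. Check a few lines of your own log before relying on it.

```python
import gzip


def seen_urls(lines):
    """Yield the URL from each line tagged 'F+'.

    Assumed layout (not confirmed by the thread): 'F+ <url> [other fields]'.
    Lines with any other tag, or with no URL token, are skipped.
    """
    for line in lines:
        if line.startswith("F+ "):
            parts = line.split()
            if len(parts) >= 2:
                yield parts[1]


def seen_urls_from_file(path):
    """Read a recover log (plain or gzip-compressed) and yield its F+ URLs."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from seen_urls(fh)


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "F+ http://example.com/a\n",
        "Fs http://example.com/b\n",
        "F+ http://example.com/c extra-field\n",
    ]
    for url in seen_urls(sample):
        print(url)
```

Something like seen_urls_from_file("recover.gz") (path hypothetical) would give the list of already-seen URLs to carry into a new job; how to feed that list back into the crawler is not covered here.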
2793
Ok, I see. Yes, this is a somewhat complex operation. Will this one be one of those features of Heritrix in the future? I think this could be useful for...
AndyFrun
lqy_nku
Apr 19, 2006
6:35 pm
2794
Hi everyone, I'm using an up-to-date CVS version of Heritrix and came across the following problem: I'm trying to spider the site...
pandae667
Apr 19, 2006
8:00 pm
2795
... Check it out now Tizo. Our Danny Bernstein, the main developer and user of hcc, filled out the overview doc: ...
stack@...
stackarchiveorg
Apr 19, 2006
8:12 pm
2796
Thanks for the report -- my investigation reveals this is actually a bug in OnHostsDecideRule, in how it updates itself when a new seed is added. It fails to...
Gordon Mohr
gojomo
Apr 19, 2006
10:52 pm
2797
Thanks a lot Gordon, your proposed fix works like a charm. The only thing I'm wondering is why this won't make it into CVS before the 1.8.0 release as this is a...
pandae667
Apr 20, 2006
8:27 am
2798
... I'm running released version 1.6.0 of Dec 2, 2005. I still haven't quite figured out under what circumstances the Exception is thrown. Most of the time I only got...
Samuel
samendonca
Apr 20, 2006
1:33 pm
2799
Hi you gurus, I am new to Heritrix and ran into some errors. My box is Windows XP, JDK1.4.2_11, Heritrix1.6.0. I can get Heritrix running but cannot get it to crawl...
libsoft
n2ket
Apr 22, 2006
2:51 pm
2800
Hi, I am using Heritrix 1.6.0, and I have experienced a strange behaviour: When setting send-range to true, Heritrix doesn't honor the entries in robots.txt...
Thimo Eichstaedt
abc@...
Apr 24, 2006
1:04 am