archive-crawler
Messages

Messages 934 - 963 of 6147
934
Looks like I'm finally getting through the login pages. Thanks for your help. Now I have a bunch of ARC files. It'd be easy enough to write a diff between...
Phil White
CERisE8192
Sep 1, 2004
9:16 pm
935
... No, there isn't. Scan through the mailing list archives for the subject "recraw" [sic]. -- Tom Emerson Basis...
Tom Emerson
tree02139
Sep 1, 2004
9:31 pm
936
... Responding back to Kristinn's comment request, I want to be as polite as nice to my bandwidth as possible. Being polite to crawled web sites is a nice...
Phil White
CERisE8192
Sep 1, 2004
10:17 pm
937
... It's not the highest priority for the crew out here at the Archive (scaling up is our main priority). We're hoping it'll come in as a contribution (Hopefully...
stack
stack@...
Sep 1, 2004
11:33 pm
938
Phil White writes: [...] ... [...] There has been a lot of research done on how to select URLs for subsequent crawling: the major search engines certainly...
Tom Emerson
tree02139
Sep 2, 2004
12:53 am
939
I will be working on the add-on proposed earlier this summer. Expect something useful to be available before Christmas. A beta might be available earlier. -...
Kristinn Sigurdsson
kristsi25
Sep 2, 2004
8:49 am
940
I'd have nega-problem testing stuff and helping you develop 8) -Phil/CERisE...
Phil White
CERisE8192
Sep 2, 2004
11:13 am
941
... If the 'If-Modified-Since' header add doesn't work, we did a little planning yesterday and the feature '[ 941072 ] Allow operator-configured mid-HTTP-fetch...
stack
stack@...
Sep 2, 2004
2:43 pm
942
... That sounds great and brings me to a related topic. There really should be a nice way of marking CrawlURIs as "duplicate" or something similar. This flag ...
Kristinn Sigurdsson
kristsi25
Sep 2, 2004
2:57 pm
943
... How do you decide what constitutes a duplicate? Exact content match? Same URI path? Some subset of similarity? Doing this in a way that will work across languages...
Tom Emerson
tree02139
Sep 2, 2004
3:12 pm
944
... be a ... flag ... extraction. ... to ... it ... That would be implementation dependent. For what Stack was talking about you just take the if-modified bit...
Kristinn Sigurdsson
kristsi25
Sep 2, 2004
3:17 pm
945
... Ah, OK. Understood. I wasn't placing things in the right context. ... Well, for revisiting I would just use the If-Modified-Since header coupled with the...
Tom Emerson
tree02139
Sep 2, 2004
4:02 pm
946
... I'd like to see some data on whether servers actually do the Right Thing there. If their If-Modified-Since check is similar to what they used for...
Lars Clausen
lrclause
Sep 3, 2004
7:48 am
947
... That sometimes works, but unfortunately not always. It is especially likely to fail in database-driven websites where the page is in fact created on demand...
Kristinn Sigurdsson
kristsi25
Sep 3, 2004
8:00 am
948
... I added a URIRegExp filter. Go to the Heritrix settings page and add the regexp: ^.*(?i)\. (a|ai|aif|aifc|aiff|asc|au|avi|bcpio|bin|bmp|bz2|c|cdf|cgi|cgm|class| cpio|cpp? ...
crawlerobo
Sep 3, 2004
8:21 am
949
... [...] Impossible to say without seeing the crawl log. -tree -- Tom Emerson Basis Technology Corp. Software...
Tom Emerson
tree02139
Sep 3, 2004
11:27 am
950
Can you tell me why the classes load in this order? I don't understand the difference between the two ways. Thanks ... the file ... add the ... ...
agoodman_rgd
Sep 3, 2004
11:41 am
951
... When loading classes, Java takes the first match it finds in CLASSPATH. Ansi is having you make sure that the heritrix jar appears first in the CLASSPATH...
stack
stack@...
Sep 3, 2004
3:21 pm
952
I downloaded the source code of Heritrix 1.0 and built it successfully with maven jar; the OS is Red Hat Linux 9.0. But when I use maven dist, the build is...
bjhong02
Sep 5, 2004
7:31 am
953
... The maven changelog report needs to log into cvs to get a list of recent changes otherwise its generation fails. Do a cvs login or, just read the report...
stack
stack@...
Sep 5, 2004
7:21 pm
954
stack, hello! Yes, I saw the httpclient code in heritrix. When I delete HttpConnection.class & HttpParser.class of...
agoodman_rgd
Sep 6, 2004
12:51 am
955
... Sorry you are having a tough time getting it going. You shouldn't need to edit the httpclient.jar. You tried what Ansi suggested? Putting the heritrix.jar...
stack
stack@...
Send Email
Sep 6, 2004
1:32 am
956
I have no net connection problem. I've set maven.proxy.host, maven.proxy.port, and maven.repo.remote=http://public.planetmirror.com/pub/maven. And correctly...
bjhong02
Sep 6, 2004
2:46 am
957
stack, hello! Yes, Ansi's suggestion is very good, and it can run in that mode! The reason that I deploy it with tomcat is I want to...
agoodman_rgd
Sep 6, 2004
3:45 am
958
... So you are going by a proxy? Sounds like you've successfully told maven how to download jars over the proxy but the resolver used fetching ...
stack
stack@...
Send Email
Sep 6, 2004
7:10 pm
959
... We develop heritrix using eclipse and are able to run it inside the eclipse environment without problem. You must set the java system property...
stack
stack@...
Send Email
Sep 6, 2004
7:21 pm
960
To disable the user/developer manual generation, do the following: Index: maven.xml =================================================================== RCS...
stack
stack@...
Send Email
Sep 6, 2004
8:18 pm
961
I disabled the user/developer manual generation as you suggested, and I think the following attainGoal should also be erased. <postGoal...
bjhong02
Sep 7, 2004
6:42 am
962
... It is a default setup except for the addition of a URIRegExp filter. Seed.txt contains only (http://mobile.yahoo.co.jp) 20040907085003065 1 56 dns:mobile.yahoo.co.jp P...
crawlerobo
Sep 7, 2004
9:18 am
963
Is there any reason you're not using the 'jar' target for Maven instead of mucking around trying to hack the build configuration to not build anything else? -- ...
Tom Emerson
tree02139
Sep 7, 2004
10:53 am