Looks like I'm finally getting through the login pages. Thanks for your help. Now I have a bunch of ARC files. It'd be easy enough to write a diff between...
... Responding to Kristinn's request for comments, I want to be as nice to my bandwidth as possible. Being polite to crawled web sites is a nice...
... It's not the highest priority for the crew out here at the Archive (scaling up is our main priority). We're hoping it'll come in as a contribution (hopefully...
stack
stack@...
Sep 1, 2004 11:33 pm
938
Phil White writes: [...] ... [...] There has been a lot of research done on how to select URLs for subsequent crawling: the major search engines certainly...
I will be working on the add-on proposed earlier this summer. Expect something useful to be available before Christmas. A beta might be available earlier. -...
... If adding the 'If-Modified-Since' header doesn't work, we did a little planning yesterday and the feature '[ 941072 ] Allow operator-configured mid-HTTP-fetch...
stack
stack@...
Sep 2, 2004 2:43 pm
942
... That sounds great and brings me to a related topic. There really should be a nice way of marking CrawlURIs as "duplicate" or something similar. This flag ...
... How do you decide what constitutes a duplicate? Exact content match? Same URI path? Some subset of similarity? Doing this in a way that will work across languages...
... That would be implementation dependent. For what Stack was talking about you just take the if-modified bit...
... Ah, OK. Understood. I wasn't placing things in the right context. ... Well, for revisiting I would just use the If-Modified-Since header coupled with the...
... I'd like to see some data on whether servers actually do the Right Thing there. If their If-Modified-Since check is similar to what they used for...
... That sometimes works, but unfortunately not always. It is especially likely to fail in database-driven websites where the page is in fact created on demand...
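The revisit discussion above can be sketched as a two-step check: trust a 304 response to a conditional GET when the server gives one, and otherwise compare a digest of the fetched body against the digest stored from the previous crawl. This is only a minimal illustration, not Heritrix code; the class `RevisitCheck`, its method names, and the choice of SHA-1 over the raw body are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Heritrix: decides whether a revisited
// page is unchanged since the previous crawl.
class RevisitCheck {

    /** Returns the hex-encoded SHA-1 digest of a response body. */
    static String sha1Hex(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(body)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /**
     * status:         HTTP status of the conditional GET (sent with If-Modified-Since)
     * body:           response body, may be null for a 304
     * previousDigest: digest recorded on the previous crawl, or null if none
     */
    static boolean isUnchanged(int status, byte[] body, String previousDigest) {
        if (status == 304) {
            return true; // server says Not Modified
        }
        if (status != 200 || body == null || previousDigest == null) {
            return false; // nothing to compare against, treat as new
        }
        // Fallback for servers that always answer 200 (e.g. pages built on demand).
        return sha1Hex(body).equals(previousDigest);
    }

    public static void main(String[] args) {
        byte[] page = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        String stored = sha1Hex(page);
        System.out.println(isUnchanged(304, null, stored)); // server honoured the header
        System.out.println(isUnchanged(200, page, stored)); // server ignored it, body identical
    }
}
```

An exact digest only catches byte-identical pages; a page that embeds a timestamp or session id will still look changed, which is the similarity question raised earlier in the thread.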
... When loading classes, Java takes the first match it finds in the CLASSPATH. Ansi is having you make sure that the heritrix jar appears first in the CLASSPATH...
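One way to confirm which jar actually won the CLASSPATH race is to ask the JVM where a class was loaded from. The sketch below is a generic diagnostic, not something from the Heritrix distribution; the class name `ClassOrigin` and its `whereFrom` method are made up for the example.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Heritrix.
class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, if the JVM knows it. */
    static String whereFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            // Classes from the JDK itself typically report no code source.
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, or default to this class.
        String name = args.length > 0 ? args[0] : ClassOrigin.class.getName();
        System.out.println(name + " loaded from " + whereFrom(Class.forName(name)));
    }
}
```

Run with the same CLASSPATH used to start the crawler and a class name such as `org.apache.commons.httpclient.HttpConnection`; if the printed location is not the jar you expected, the ordering needs changing.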
stack
stack@...
Sep 3, 2004 3:21 pm
952
I downloaded the source code of heritrix 1.0 and built it successfully with maven jar; the OS is Red Hat Linux 9.0. But when I use maven dist, the build is...
... The maven changelog report needs to log into cvs to get a list of recent changes; otherwise its generation fails. Do a cvs login or just read the report...
stack
stack@...
Sep 5, 2004 7:21 pm
954
stack, hello! Yes, I saw the httpclient code in heritrix. When I delete HttpConnection.class & HttpParser.class of...
... Sorry you are having a tough time getting it going. You shouldn't need to edit the httpclient.jar. You tried what Ansi suggested? Putting the heritrix.jar...
stack
stack@...
Sep 6, 2004 1:32 am
956
I have no net connection problem. I've set maven.proxy.host, maven.proxy.port, and maven.repo.remote=http://public.planetmirror.com/pub/maven. And correctly...
... So you are going through a proxy? Sounds like you've successfully told maven how to download jars over the proxy but the resolver used fetching ...
stack
stack@...
Sep 6, 2004 7:10 pm
959
... We develop heritrix using eclipse and are able to run it inside the eclipse environment without problem. You must set the java system property...
stack
stack@...
Sep 6, 2004 7:21 pm
960
To disable the user/developer manual generation, do the following:
Index: maven.xml
===================================================================
RCS...
stack
stack@...
Sep 6, 2004 8:18 pm
961
I disabled the user/developer manual generation as you suggested, and I think the following attainGoal should also be removed. <postGoal...
... It is a default setup except for the addition of a URIRegExp filter. Seed.txt contains only http://mobile.yahoo.co.jp. 20040907085003065 1 56 dns:mobile.yahoo.co.jp P...
Is there any reason you're not using the 'jar' target for Maven instead of mucking around trying to hack the build configuration to not build anything else? -- ...