... Follow the template given above: heritrix-0.7.1 (+http://bruce.earthlinke.com) Put an email into the from field. St.Ack...
Michael Stack
stack@...
Jun 1, 2004 3:29 pm
500
... The template says info-url. This is a URL that someone can visit to learn more about you, should they want to know who is crawling the site (The 'from'...
Michael Stack
stack@...
Jun 1, 2004 3:30 pm
501
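As a rough illustration of the two replies above, the sketch below checks that a user-agent string follows the quoted template (a product token followed by an info URL in the form `(+http://...)`) and that the 'from' field looks like an email address. The regular expressions and the function names `info_url` and `is_valid_from` are assumptions made for this example; they are not Heritrix's own validation code.

```python
import re

# Hypothetical pattern (not taken from Heritrix): a product token at the start,
# followed somewhere later by "(+<http or https URL>)".
USER_AGENT_PATTERN = re.compile(r"^\S+.*\(\+(https?://[^\s)]+)\)")

# Hypothetical, deliberately loose shape check for an email address.
FROM_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def info_url(user_agent):
    """Return the info URL embedded in the user-agent, or None if it is missing."""
    match = USER_AGENT_PATTERN.search(user_agent)
    return match.group(1) if match else None


def is_valid_from(address):
    """Return True if the 'from' value loosely resembles an email address."""
    return FROM_PATTERN.fullmatch(address) is not None
```

For the template quoted in message 500, `info_url("heritrix-0.7.1 (+http://bruce.earthlinke.com)")` yields the URL inside the parentheses, while a string without the `(+...)` part yields `None`.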
... I don't understand. Please say more about what you are looking for. Heritrix does an HTTP GET. It will pull down the page. Page source usually lists all...
Michael Stack
stack@...
Jun 1, 2004 3:32 pm
502
... Excellent summary of the options for advanced JavaScript extraction. Another idea we've kicked around, for cases where the site wants to be crawled (or is...
... Well, I've narrowed it down, and wow, do I not like the answer: the problem is with version 2.6 of the Linux kernel. Specifically, I've tried three JDKs...
... Thanks for persevering, Andy. I think the onus is now on us to figure out what's up with JVM+2.6 or JVM+2.6+heritrix. St.Ack...
Michael Stack
stack@...
Jun 1, 2004 8:47 pm
505
What happened to the continuous build at: http://crawltools.archive.org:8080/cruisecontrol/buildresults/ArchiveOpenCrawler?tab=buildResults best Bjarne Andersen...
... Try now, Bjarne (they were rebooting the cluster yesterday). St.Ack...
Michael Stack
stack@...
Jun 3, 2004 4:09 pm
507
Instead of writing to an ARC file, I'd like to create a method that puts the URI info, content, headers, etc. into a MySQL database. Does anyone have any...
What you are going to want to do is write your own processor to replace Heritrix’s ARC writing processor. Please consult the user manual for information...
... Are you interested in that specifically to get away from ARC, or more simply because you're interested in being able to issue queries on the crawl results...
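To make the idea in this thread concrete, here is a minimal sketch of storing one fetched record (URI, fetch time, headers, content) in a database table instead of an ARC file. It is an illustration only: it does not use Heritrix's processor API, it substitutes Python's built-in `sqlite3` for MySQL so that it runs without a server, and the table name `fetched` and the functions `open_store` and `store_record` are invented for this example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; SQLite stands in for MySQL in this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fetched ("
        "uri TEXT PRIMARY KEY, fetched_at TEXT, headers TEXT, content BLOB)"
    )
    return conn


def store_record(conn, uri, fetched_at, headers, content):
    """Insert one fetched document, replacing any earlier row for the same URI."""
    conn.execute(
        "INSERT OR REPLACE INTO fetched VALUES (?, ?, ?, ?)",
        (uri, fetched_at, json.dumps(headers), content),  # headers kept as JSON text
    )
    conn.commit()
```

A row-per-URI layout like this is what makes the crawl results queryable, which is the motivation raised in the last reply above; a real replacement processor would call something like `store_record` where the ARC writer currently writes a record.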
I think I might have been, until now, misunderstanding a subtlety of Domain Scope. Specifically, in this contrast: Case 1: Seed: "www.domain.com" IN scope:...
Release for second heritrix workshop, Copenhagen 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte...
stack
stack@...
Jun 5, 2004 11:25 pm
512
Dear developers, when I access the CVS, the server reports an error message ... The server reported an error: Permission denied ... ...
zhousp
zhousp@...
Jun 7, 2004 7:50 am
513
I just downloaded the .10 version of heritrix. I haven't had a problem building other versions but this one fails. I run maven 1.0-rc2. The build fails...
... We had considered that as well initially, as a quick way of importing the data into the DB: going the ARC route after a crawl. But we have a search engine we...
I should add, as a side note to all of this, that even if you write your own DB insertion processor you can still have the crawler write ARC files. Heritrix is...
Documentation is now generated using a Maven DocBook plugin. You need to add it to your Maven install. It in turn depends on a Sun jar that you will ...
stack@...
Jun 8, 2004 7:43 am
517
Unfortunately, I cannot try it myself at the moment. Can you check out other SourceForge projects OK? Yours, St.Ack...
stack@...
Jun 8, 2004 7:47 am
518
Are you clear on where to start making your changes? That you would put in place an alternate ARCWriterProcessor, one that did effectively what the current one...
stack@...
Jun 8, 2004 7:53 am
519
A version of jimi.jar can also be downloaded from here (at your own risk): http://rsb.info.nih.gov/ij/plugins/download/jimi/jimi.jar best Bjarne Andersen...
Is there any way to set limits on each seed? For example setting a timeout of 10 minutes, or bytes downloaded, or number of documents per seed instead of a...
... No, not yet. See RFE #952241: Enhancement of per host settings. http://sourceforge.net/tracker/index.php?func=detail&aid=952241&group_id=73833&atid=539102 ...
... I just tried an anonymous checkout and it worked fine. Is it still broken for you? St.Ack...
stack@...
Jun 9, 2004 8:06 am
525
stack, hello! It works fine now, thank you! ... Regards! ...
zhousp
zhousp@...
Jun 9, 2004 8:18 am
526
... On further consideration, look at ARCWriterPool. In particular, the inner class ARCWriterFactory. See how it is responsible for the manufacture of the ...
stack@...
Jun 9, 2004 8:59 am
527
... Bjarne: Yes. This changed recently. I don't think there is a way of getting back the old behavior. Kris or Igor might have some comments to make here. ...
stack@...
Jun 9, 2004 9:18 am
528
I imagine there's some discussion going on at this week's workshop about what's needed before Heritrix 1.0. (Hi, everyone at the workshop!) In that vein,...