archive-crawler
Messages

Messages 499 - 528 of 6142
499
... Follow the template given above: heritrix-0.7.1 (+http://bruce.earthlinke.com) Put an email into the from field. St.Ack...
Michael Stack
stack@...
Jun 1, 2004
3:29 pm
500
... The template says info-url. This is a URL that someone can visit to learn more about you, should they want to know who is crawling the site (The 'from'...
Michael Stack
stack@...
Jun 1, 2004
3:30 pm
501
... I don't understand. Please say more about what you are looking for. Heritrix does an HTTP GET. It will pull down the page. Page source usually lists all...
Michael Stack
stack@...
Jun 1, 2004
3:32 pm
502
... Excellent summary of the options for advanced JavaScript extraction. Another idea we've kicked around, for cases where the site wants to be crawled (or is...
Gordon Mohr (Internet...
gojomo
Jun 1, 2004
6:17 pm
503
... Well, I've narrowed it down, and wow, do I not like the answer: the problem is with version 2.6 of the Linux kernel. Specifically, I've tried three JDKs...
Andy Boyko
andyboyko
Jun 1, 2004
8:03 pm
504
... Thanks for persevering, Andy. I think the onus is now on us to figure out what's up with JVM+2.6 or JVM+2.6+heritrix. St.Ack...
Michael Stack
stack@...
Jun 1, 2004
8:47 pm
505
What happened to the continuous build at: http://crawltools.archive.org:8080/cruisecontrol/buildresults/ArchiveOpenCrawler?tab=buildResults best Bjarne Andersen...
bja@...
bjarne_dk2000
Jun 3, 2004
7:05 am
506
... Try now, Bjarne (they were rebooting the cluster yesterday). St.Ack...
Michael Stack
stack@...
Jun 3, 2004
4:09 pm
507
Instead of writing to an ARC file, I'd like to create a method that puts the URI info, content, headers, etc. into a MySQL database. Does anyone have any...
Ahnu Nahki
ahnunahki
Jun 4, 2004
3:24 pm
508
What you are going to want to do is write your own processor to replace Heritrix’s ARC writing processor. Please consult the user manual for information...
Kristinn Sigurdsson
kristsi25
Jun 4, 2004
3:34 pm
509
... Are you interested in that specifically to get away from ARC, or more simply because you're interested in being able to issue queries on the crawl results...
Andy Boyko
andyboyko
Jun 4, 2004
6:06 pm
510
I think I might have been, until now, misunderstanding a subtlety of Domain Scope. Specifically, in this contrast: Case 1: Seed: "www.domain.com" IN scope:...
Andy Boyko
andyboyko
Jun 4, 2004
8:26 pm
511
Release for second heritrix workshop, Copenhagen 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte...
stack
stack@...
Jun 5, 2004
11:25 pm
512
Dear developers, when I access the CVS, the server reports an error message ... The server reported an error: Permission denied ... ...
zhousp
zhousp@...
Jun 7, 2004
7:50 am
513
I just downloaded the 0.10 version of Heritrix. I haven't had a problem building other versions, but this one fails. I run maven 1.0-rc2. The build fails...
jirleech
Jun 7, 2004
1:09 pm
514
... We had initially considered that as well, as a quick way of importing the data into the DB: going the ARC route after a crawl. But we have a search engine we...
Ahnu Nahki
ahnunahki
Jun 7, 2004
2:35 pm
515
I should add, as a side note to all of this, that even if you write your own DB insertion processor you can still have the crawler write ARC files. Heritrix is...
Kristinn Sigurdsson
kristsi25
Jun 7, 2004
2:43 pm
516
Documentation is now generated using a maven docbook plugin. You need to add it to your maven install. It in turn depends on a sun jar that you will ...
stack@...
Jun 8, 2004
7:43 am
517
Unfortunately, I cannot try it myself at the moment. Can you check out other SourceForge projects OK? Yours, St.Ack...
stack@...
Jun 8, 2004
7:47 am
518
Are you clear on where to start making your changes? That you would put in place an alternate ARCWriterProcessor, one that did effectively what the current one...
stack@...
Jun 8, 2004
7:53 am
519
A version of jimi.jar can also be downloaded from here (at your own risk): http://rsb.info.nih.gov/ij/plugins/download/jimi/jimi.jar best Bjarne Andersen...
Bjarne Andersen
bjarne_dk2000
Jun 8, 2004
8:14 am
520
Hi! Since upgrading from 0.7.1 to 0.10.0, I have problems when running Heritrix without the GUI. I used to simply start the crawler with ...
Bjarne Andersen
bjarne_dk2000
Jun 8, 2004
9:27 am
521
Is there any way to set limits on each seed? For example setting a timeout of 10 minutes, or bytes downloaded, or number of documents per seed instead of a...
jirleech
Jun 8, 2004
4:31 pm
522
... No, not yet. See RFE #952241: Enhancement of per host settings. http://sourceforge.net/tracker/index.php?func=detail&aid=952241&group_id=73833&atid=539102 ...
Igor Ranitovic
iranitovic
Jun 9, 2004
12:56 am
523
Hi Bjarne Andersen, ... This sounds like a bug to me. I will take a look at it and will file a bug report at SourceForge. Thanks. i....
Igor Ranitovic
iranitovic
Jun 9, 2004
1:00 am
524
... I just tried an anonymous checkout and it worked fine. Is it still broken for you? St.Ack...
stack@...
Jun 9, 2004
8:06 am
525
Hello stack! It works fine now, thank you! ... Regards! ...
zhousp
zhousp@...
Jun 9, 2004
8:18 am
526
... On further consideration, look at ARCWriterPool. In particular, the inner class ARCWriterFactory. See how it is responsible for the manufacture of the ...
stack@...
Jun 9, 2004
8:59 am
527
... Bjarne: Yes. This changed recently. I don't think there is a way of getting back the old behavior. Kris or Igor might have some comments to make here. ...
stack@...
Jun 9, 2004
9:18 am
528
I imagine there's some discussion going on at this week's workshop about what's needed before Heritrix 1.0. (Hi, everyone at the workshop!) In that vein,...
Andy Boyko
andyboyko
Jun 10, 2004
9:51 pm