archive-crawler
Messages

Messages 4951 - 4980 of 6147
4951
Hi, I've tried something similar (integrating Heritrix with a Tomcat webapp) and succeeded up to a point. Unfortunately I wanted to use the MirrorWriter instead...
xavier_sautejeau
Feb 1, 2008
9:58 am
4952
I forgot to mention that I used a combination of the code in the selftest package and in the Heritrix webapp to figure out the relevant sequence of calls. X. ...
xavier_sautejeau
Feb 1, 2008
10:02 am
4953
Hi, Has anyone established a mapping of the wget parameters (the ones that are relevant) to Heritrix settings/modules? X....
xavier_sautejeau
Feb 1, 2008
10:32 am
4954
No one has any ideas on how to do this? Can it be done with Heritrix 2.0?...
nfoscarini
Feb 1, 2008
2:59 pm
4955
Thanks Xavier. I've been poking around in the WebUI class in order to get an idea of how this stuff is set up. This is the first time I've ever heard of ...
Micah Wedemeyer
mwedeme@...
Feb 1, 2008
3:15 pm
4956
One way to do this is to use alist on the CrawlURI level. Your extractor can add key-value pairs to the CrawlURI's alist for each image discovered from that...
Igor Ranitovic
iranitovic
Feb 1, 2008
7:13 pm
4957
That is a good idea. I hadn't thought of using a map to look up the extra data at a later point. How can I make my list work with a recovery of a job?...
nfoscarini
Feb 1, 2008
7:16 pm
4958
That might be tricky. Is checkpointing something you can use?...
Igor Ranitovic
iranitovic
Feb 1, 2008
7:28 pm
4959
It is, but I'm a little unclear as to how it works. I basically run a job order for about 4 hours, and then checkpoint and terminate the job. The next day I...
nfoscarini
Feb 1, 2008
7:31 pm
4960
If you recover from a checkpoint, you will not lose your additional information. If you recover from the recover log, then yes, you will need to save the ...
Igor Ranitovic
iranitovic
Feb 1, 2008
7:42 pm
4961
Ok, this sounds workable. Will I have to create my own Frontier to get this to work with recovery, because that's the only module that handles ...
nfoscarini
Feb 1, 2008
7:54 pm
4962
... Yes, an embedded Heritrix should be controllable via the objects' APIs and not only via JMX. I just committed some changes to enable that. These changes...
pjack@...
poetbeware
Feb 1, 2008
8:24 pm
4963
If you implement a module that supports the CrawlStatusListener and register it with the controller, you get all the start/stop notifications. This is really...
nfoscarini
Feb 1, 2008
9:03 pm
4964
... Aaarghh!!! I somehow missed this e-mail and spent the last couple of days poring through the WebUI code trying to puzzle out how the Remote, JMX, BeanProxy,...
Micah Wedemeyer
mwedeme@...
Feb 5, 2008
3:17 pm
4965
Version: trunk/heritrix2@5733 As far as I can tell, addCrawlStatusListener has been removed. How can I register to receive the notifications? Note: I'm...
Micah Wedemeyer
mwedeme@...
Feb 5, 2008
9:25 pm
4966
Hi, I've been struggling with 2.0.1 snapshot for the past few days, and I just do not like all the JMX integration. From the perspective of someone just...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008
4:59 pm
4967
... Heritrix is not a developer's library. It's a service designed to run as a daemon with an optional WebUI for the administrator. Someone can correct me if...
nfoscarini
Feb 6, 2008
5:12 pm
4968
... The current event dispatch stuff in CrawlController is quite messy, sorry. We were going to standardize on JMX notifications but ended up just porting...
pjack@...
poetbeware
Feb 6, 2008
6:01 pm
4969
version: trunk/heritrix2@5733 ... EngineConfig config = new EngineConfig(); config.setJobsDirectory("/tmp/hjobs"); Engine engine = new EngineImpl(config); ...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008
7:54 pm
4970
Paul, Thanks. This looks a little less confusing now that you've explained it. In full disclosure, I have used 1.x very little, but I often compare the ...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008
7:56 pm
4971
I figured out that launching the job does make the difference. I guess I was doing something wrong. I think I was trying to get the SheetManager for...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008
10:13 pm
4972
... They don't exist until the job is active. Prior to that, modules are stored as org.archive.settings.Stub placeholders. So instead of root:controller...
pjack@...
poetbeware
Feb 6, 2008
10:31 pm
4973
... Perfect! Thanks Paul....
Micah Wedemeyer
mwedeme@...
Feb 6, 2008
10:40 pm
4974
At some point in 2008, the Internet Archive's Heritrix team would like to transition the Heritrix project to the Apache License (version 2), rather than the...
Gordon Mohr
gojomo
Feb 7, 2008
5:59 pm
4975
That would be just swell. You get one positive vote here. John...
John Lekashman
lekash
Feb 7, 2008
6:13 pm
4976
+1...
Sean Timm
timmscgroups
Feb 7, 2008
6:26 pm
4977
What does this error mean: Heritrix(-61)-Robots prerequisite failure?...
nt_bdr
Feb 12, 2008
3:32 pm
4978
It means the necessary robots.txt prerequisite fetch (even getting a 404) couldn't complete before giving up on the URI in question. Do you see preceding...
Gordon Mohr
gojomo
Feb 12, 2008
6:20 pm
4979
Hi, I am new to Heritrix and I am trying the deduplicator. I am on version 1.12.1. I downloaded the deduplicator version 0.3. According to the instructions, I...
blackduck_llau
Feb 15, 2008
2:54 am
4980
Gordon, here are the last few lines in the crawl log. I have set the max-retries to 5. 2008-02-15T15:08:08.223Z -2 - http://chat.aol.de/robots.txt P...
nt_bdr
Feb 15, 2008
3:23 pm