archive-crawler
Controlling heritrix from the command prompt
Message #1182 of 6188
Re: [archive-crawler] Re: Controlling heritrix from the command prompt

list1dism writes:
[...]
> Another thing I'd like to do is crawl more than one job at a time.
> But by reading some messages from the list archive, I understood that
> I need to have different heritrix installations to do this. I cannot
> do it with one. Is that true? Is there an easier way to do this?

No, you don't need separate installations. Rather, you just need a
different heritrix.properties file for each instance you want to
run. At a minimum you will want to specify the jobs directory for each
one (though you could use a single directory). I run multiple
instances with the webui on different ports, with different job
directories and different login credentials. I name the properties
files with the convention 'heritrix-PORT.properties', e.g.,
'heritrix-8080.properties'. I wrote a little script that will start a
Heritrix instance on the given port, including redirecting
heritrix_out.log to a unique file:

--
#!/bin/sh
# Start a Heritrix instance whose web UI listens on the given port.

export HERITRIX_HOME=/usr/local/heritrix
export JAVA_HOME=/usr/local/java

if [ "$#" -eq 0 ]
then
    echo "Usage: $0 PORT" >&2
    exit 1
fi

# Per-instance properties file, named heritrix-PORT.properties.
PROPS_FILE="$HERITRIX_HOME/conf/heritrix-${1}.properties"

if [ ! -f "$PROPS_FILE" ]
then
    echo "Error: the properties file for port $1 does not exist." >&2
    exit 1
fi

# Give each instance its own heritrix_out.log.
export HERITRIX_OUT="$HERITRIX_HOME/heritrix_${1}_out.log"
export JAVA_OPTS="-Xmx512m -Dheritrix.properties=$PROPS_FILE"
"$HERITRIX_HOME/bin/heritrix"
--

This requires Heritrix 1.2 (because of the use of HERITRIX_OUT).
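
With a wrapper like the one above, bringing up several instances could be a
simple loop over the ports. This is only a rough sketch: the wrapper name
'start-heritrix.sh' is hypothetical (save the script above under whatever
name you like), and it assumes bin/heritrix returns once it has launched the
instance in the background.

```shell
#!/bin/sh
# Sketch: start one Heritrix instance per port via the wrapper script.
# START_SCRIPT is a hypothetical name for the wrapper shown above.
START_SCRIPT=${START_SCRIPT:-./start-heritrix.sh}

start_all() {
    for port in "$@"
    do
        # The wrapper exits non-zero when heritrix-PORT.properties is missing.
        if "$START_SCRIPT" "$port"
        then
            echo "started instance on port $port"
        else
            echo "failed to start instance on port $port" >&2
        fi
    done
}
```

Calling 'start_all 8080 8081' would then expect heritrix-8080.properties and
heritrix-8081.properties to exist in the conf directory, and would report on
stderr any port whose instance could not be started.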

Sorry if this isn't clear --- I haven't had my morning coffee yet.

-tree

--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"



Thu Nov 18, 2004 11:33 am

tree02139

Is there any way to create jobs and feed them to the crawler by the command prompt? (that is without using the WUI) To be more specific I want to create jobs...
list1dism
Nov 18, 2004
10:07 am

This is possible. You need to have a valid order.xml and then you specify it on the command line. ... Usage: heritrix --help Usage: heritrix --nowui ORDER.XML ...
Kristinn Sigurdsson
kristsi25
Nov 18, 2004
10:32 am

Thanks for that info! Of course I don't mind sharing. I'm trying to do the following: grab plain-text-only content out of 8 specific web sites, then parse the...
list1dism
Nov 18, 2004
10:47 am

OK, thanks for the tip. It's pretty clear, I'll manage. I encountered another problem a while ago: it logs too much to the file heritrix_out.log (not the one in...
list1dism
Nov 18, 2004
1:17 pm

Look in the conf directory (under HERITRIX_HOME), there should be a file called heritrix.properties where you can set the level of logging on various modules...
Kristinn Sigurdsson
kristsi25
Nov 18, 2004
1:22 pm

Yeah, right. I installed the new one and it looks good now. I think I can automate the process now. Thanks to both of you for helping me out...
list1dism
Nov 18, 2004
2:10 pm

Hi, As you suggest, we can run heritrix from the command prompt. But how can we check the status of a running crawl and the different reports, as we can see in the WUI? ...
callforshadab
Jul 17, 2006
8:30 am

... See http://crawler.archive.org/articles/user_manual.html#mon_com. St.Ack P.S. Did you see Kris's P.S. below?...
Michael Stack
stackarchiveorg
Jul 17, 2006
4:24 pm