Now it can crawl something.
Thank you very much!
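
For anyone who finds this thread later: the fix quoted below was to stop using java.ext.dirs and instead put heritrix-1.0.0.jar first on the classpath, followed by every jar in %HERITRIX_HOME%\lib. Here is a rough sketch of that idea in Python, only as an illustration. The function names `build_classpath` and `java_command` are my own and are not part of Heritrix, and I pass the classpath with the standard `-cp` option instead of setting the CLASSPATH variable, which should have the same effect.

```python
import os


def build_classpath(heritrix_home, main_jar="heritrix-1.0.0.jar", sep=os.pathsep):
    """Return a classpath string: the Heritrix jar first, then every jar in lib."""
    lib_dir = os.path.join(heritrix_home, "lib")
    # Sort the jar names so the order is stable from run to run.
    lib_jars = sorted(
        os.path.join(lib_dir, name)
        for name in os.listdir(lib_dir)
        if name.lower().endswith(".jar")
    )
    # The main Heritrix jar goes first, as advised in the thread.
    return sep.join([os.path.join(heritrix_home, main_jar)] + lib_jars)


def java_command(heritrix_home, sep=os.pathsep):
    """Build the argument list for starting Heritrix without java.ext.dirs."""
    return [
        "java",
        "-Xmx256m",
        "-cp",
        build_classpath(heritrix_home, sep=sep),
        "org.archive.crawler.Heritrix",
    ]
```

On the Windows box, something like `subprocess.call(java_command(r"D:\Crawler\heritrix-1.0.0"))` should then start the crawler with that classpath; the path is the one from my script below, so adjust it to your own install.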
--- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> I notice you use "-Djava.ext.dirs=%HERITRIX_HOME%\lib" to include the files
> in %HERITRIX_HOME%\lib.
>
> Java searches the jars in java.ext.dirs first, and only then the
> CLASSPATH set in the environment.
>
> Try adding all the .jar files in %HERITRIX_HOME%\lib to the CLASSPATH, like
>
> classpath=%HERITRIX_HOME%\heritrix-1.0.0.jar;%HERITRIX_HOME%\lib\xxxxx.jar.........
>
> then use
>
> java -Xmx256m org.archive.crawler.Heritrix
>
> to start heritrix.
>
>
>
> >D:\Crawler\heritrix-1.0.0\bin\heritrix.cmd
> >-----------------------------------------------
> >set HERITRIX_HOME=D:\Crawler\heritrix-1.0.0
> >set classpath=%HERITRIX_HOME%\heritrix-1.0.0.jar
> >java -Djava.ext.dirs=%HERITRIX_HOME%\lib -Xmx256m org.archive.crawler.Heritrix
> >----------------------------------------------------
> >
> >D:\>cd D:\Crawler\heritrix-1.0.0
> >
> >D:\Crawler\heritrix-1.0.0>bin\heritrix.cmd
> >
> >
> >--- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> >>
> >> Please post the script you use to start Heritrix, not the order.xml.
> >>
> >>
> >> Hello, zhousp!
> >>
> >> ------------------------------------------
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
> >> <meta>
> >> <name>Simple</name>
> >> <description>Profile: Simple crawl</description>
> >> <operator>Admin</operator>
> >> <organization/>
> >> <audience/>
> >> <date>20040808043635</date>
> >> </meta>
> >> <controller>
> >> <string name="settings-directory">settings</string>
> >> <string name="disk-path"/>
> >> <string name="logs-path">logs</string>
> >> <string name="checkpoints-path">checkpoints</string>
> >> <string name="state-path">state</string>
> >> <string name="scratch-path">scratch</string>
> >> <long name="max-bytes-download">0</long>
> >> <long name="max-document-download">0</long>
> >> <long name="max-time-sec">0</long>
> >> <integer name="max-toe-threads">50</integer>
> >> <newObject name="scope" class="org.archive.crawler.scope.DomainScope">
> >> <boolean name="enabled">true</boolean>
> >> <string name="seedsfile">seeds.txt</string>
> >> <integer name="max-link-hops">25</integer>
> >> <integer name="max-trans-hops">5</integer>
> >> <newObject name="exclude-filter" class="org.archive.crawler.filter.OrFilter">
> >> <boolean name="enabled">true</boolean>
> >> <boolean name="if-matches-return">true</boolean>
> >> <map name="filters">
> >> <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
> >> <boolean name="enabled">true</boolean>
> >> <integer name="max-path-depth">20</integer>
> >> <boolean name="path-less-or-equal-return">false</boolean>
> >> </newObject>
> >> <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
> >> <boolean name="enabled">true</boolean>
> >> <integer name="repetitions">3</integer>
> >> </newObject>
> >> </map>
> >> </newObject>
> >> <newObject name="additionalScopeFocus" class="org.archive.crawler.filter.FilePatternFilter">
> >> <boolean name="enabled">true</boolean>
> >> <boolean name="if-match-return">true</boolean>
> >> <string name="use-default-patterns">All</string>
> >> <string name="regexp"/>
> >> </newObject>
> >> <newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
> >> <boolean name="enabled">true</boolean>
> >> <integer name="max-speculative-hops">1</integer>
> >> <integer name="max-referral-hops">2147483647</integer>
> >> <integer name="max-embed-hops">2147483647</integer>
> >> </newObject>
> >> </newObject>
> >> <map name="http-headers">
> >> <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.0.0 +http://www.sohu.com)</string>
> >> <string name="from">netsoldier@1...</string>
> >> </map>
> >> <newObject name="robots-honoring-policy" class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
> >> <string name="type">classic</string>
> >> <boolean name="masquerade">false</boolean>
> >> <text name="custom-robots"/>
> >> <stringList name="user-agents">
> >> </stringList>
> >> </newObject>
> >> <newObject name="frontier" class="org.archive.crawler.frontier.Frontier">
> >> <float name="delay-factor">5.0</float>
> >> <integer name="max-delay-ms">5000</integer>
> >> <integer name="min-delay-ms">500</integer>
> >> <integer name="max-retries">30</integer>
> >> <long name="retry-delay-seconds">900</long>
> >> <boolean name="hold-queues">false</boolean>
> >> <integer name="preference-embed-hops">1</integer>
> >> <integer name="host-valence">1</integer>
> >> <integer name="total-bandwidth-usage-KB-sec">0</integer>
> >> <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
> >> <integer name="host-queues-memory-capacity">200</integer>
> >> </newObject>
> >> <map name="pre-fetch-processors">
> >> <newObject name="Preselector" class="org.archive.crawler.prefetch.Preselector">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> <boolean name="recheck-scope">true</boolean>
> >> <boolean name="block-all">false</boolean>
> >> <string name="block-by-regexp"/>
> >> </newObject>
> >> <newObject name="Preprocessor" class="org.archive.crawler.prefetch.PreconditionEnforcer">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> <integer name="ip-validity-duration-seconds">21600</integer>
> >> <integer name="robot-validity-duration-seconds">86400</integer>
> >> </newObject>
> >> </map>
> >> <map name="fetch-processors">
> >> <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> <integer name="timeout-seconds">1200</integer>
> >> <integer name="sotimeout-ms">20000</integer>
> >> <long name="max-length-bytes">9223372036854775807</long>
> >> <string name="load-cookies-from-file"/>
> >> <string name="save-cookies-to-file"/>
> >> <string name="trust-level">open</string>
> >> <stringList name="accept-headers">
> >> </stringList>
> >> <string name="http-proxy-host"/>
> >> <string name="http-proxy-port"/>
> >> <string name="default-encoding">ISO-8859-1</string>
> >> <boolean name="sha1-content">true</boolean>
> >> </newObject>
> >> </map>
> >> <map name="extract-processors">
> >> <newObject name="ExtractorHTTP" class="org.archive.crawler.extractor.ExtractorHTTP">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="ExtractorHTML" class="org.archive.crawler.extractor.ExtractorHTML">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="ExtractorCSS" class="org.archive.crawler.extractor.ExtractorCSS">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="ExtractorJS" class="org.archive.crawler.extractor.ExtractorJS">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="ExtractorSWF" class="org.archive.crawler.extractor.ExtractorSWF">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> </map>
> >> <map name="write-processors">
> >> <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> <boolean name="compress">false</boolean>
> >> <string name="prefix">IAH</string>
> >> <string name="suffix">${HOSTNAME}</string>
> >> <integer name="max-size-bytes">100000000</integer>
> >> <string name="path">arcs</string>
> >> <integer name="pool-max-active">5</integer>
> >> <integer name="pool-max-wait">300000</integer>
> >> </newObject>
> >> </map>
> >> <map name="post-processors">
> >> <newObject name="Updater" class="org.archive.crawler.postprocessor.CrawlStateUpdater">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> </newObject>
> >> <newObject name="Postselector" class="org.archive.crawler.postprocessor.Postselector">
> >> <boolean name="enabled">true</boolean>
> >> <map name="filters">
> >> </map>
> >> <boolean name="seed-redirects-new-seed">true</boolean>
> >> </newObject>
> >> </map>
> >> <map name="loggers">
> >> <newObject name="crawl-statistics" class="org.archive.crawler.admin.StatisticsTracker">
> >> <integer name="interval-seconds">20</integer>
> >> </newObject>
> >> </map>
> >> <string name="recover-path"/>
> >> <newObject name="credential-store" class="org.archive.crawler.datamodel.CredentialStore">
> >> <map name="credentials">
> >> </map>
> >> </newObject>
> >> </controller>
> >> </crawl-order>
> >>
> >> -------------------------------------------
> >>
> >> ======== 2004-08-08 16:10:00 You wrote in your message: ========
> >>
> >> Please post your script here.
> >>
> >> -----Original Message-----
> >> From: "agoodman_rgd" <renguodong@2...>
> >> To: archive-crawler@yahoogroups.com
> >> Date: Sun, 08 Aug 2004 08:09:40 -0000
> >> Subject: [archive-crawler] Re: Why crawler nothing ???
> >>
> >> > Thanks.
> >> > I rewrote the script, and now I can start Heritrix, create a
> >> > job, and start the job. It reports "Successfully", but nothing
> >> > is downloaded to my disk.
> >> >
> >> > Why?
> >> > How do I configure the job to download pages to my disk?
> >> > Thanks.
> >> >
> >> >
> >> > --- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> >> > > Hi, if you install Heritrix on a Windows box, you must write your own
> >> > > script to start Heritrix, because bin/heritrix is meant to run on Linux.
> >> > >
> >> > > If you write your own script to run Heritrix, you should put
> >> > > heritrix.jar first on the classpath.
> >> > >
> >> > > -----Original Message-----
> >> > > From: "agoodman_rgd" <renguodong@2...>
> >> > > To: archive-crawler@yahoogroups.com
> >> > > Date: Sun, 08 Aug 2004 04:45:53 -0000
> >> > > Subject: [archive-crawler] Why crawler nothing ???
> >> > >
> >> > > > I installed Heritrix on Windows 2000 Pro.
> >> > > > I started a job and got no exceptions.
> >> > > > It reported that the job finished, but it crawled nothing.
> >> > > >
> >> > > > Please help.
> >> > > >
> >> > > > reports:
> >> > > > -----------------------------------
> >> > > > Job name: Simple
> >> > > > Status: Finished
> >> > > > Time: 13 sec.
> >> > > > Processed docs/sec: 0.23
> >> > > > Processed KB/sec: 0.0
> >> > > > Total data written: 53 B
> >> > > >
> >> > > >
> >> > > > URIs
> >> > > >
> >> > > > Pending: 0 ?
> >> > > > Discovered: 3 ?
> >> > > > Queued: 0 ?
> >> > > > In progress: 0 ?
> >> > > > Total Successfully Failed Disregarded
> >> > > > Finished: 3 3 0 0
> >> > > >
> >> > > >
> >> > > > Status code Documents
> >> > > > HTTP-200-Success-OK 2
> >> > > > DNS-1-OK 1
> >> > > >
> >> > > > File type Documents Data
> >> > > > text/dns 1 1 B
> >> > > > text/html 1 1 B
> >> > > > text/plain 1 1 B
> >> > > >
> >> > > >
> >> > > > Hosts Documents Data
> >> > > > www.loc.gov 2 1 B
> >> > > > dns: 1 1 B
> >> > > >
> >> > > > log:
> >> > > > ---------------------------------
> >> > > > crawl.log for Simple Displaying: 100.0% of 352 B
> >> > > >
> >> > > > 20040808043635868 1 53 dns:www.loc.gov P http://www.loc.gov/ text/dns #000 300 - -
> >> > > > 20040808043638011 200 0 http://www.loc.gov/robots.txt P http://www.loc.gov/ text/plain #001 641 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ -
> >> > > > 20040808043642978 200 0 http://www.loc.gov/ - - text/html #027 1762 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 3t
> >> > > >
> >> = = = = = = = = = = = = = = = = = = = = = =
> >> Regards,
> >>
> >> Ren Guodong
> >> renguodong@2...
> >> 2004-08-09