Message #772 of 6151
[archive-crawler] Re: Why crawler nothing ???

Now it can crawl something.
Thank you very much!

--- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> I notice you use "-Djava.ext.dirs=%HERITRIX_HOME%\lib" to include the files
> in %HERITRIX_HOME%\lib.
>
> Java loads classes from java.ext.dirs first, and only then from the
> CLASSPATH setting.
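
One way to check this on your own machine is to ask the JVM where a given class was actually loaded from. The following is only a minimal sketch: the class name `WhichJar` and the method `locationOf` are made up for illustration and are not part of Heritrix. Note also that the `java.ext.dirs` extension mechanism exists only in older JVMs (it was removed in Java 9).

```java
import java.security.CodeSource;

public class WhichJar {
    // Returns the jar or directory a class was loaded from.
    // Core JDK classes have no code source, so a placeholder is returned.
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a core class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

Running it with the same options as the start script, and with the name of a class that exists in more than one jar, should show which copy wins.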
>
> Try adding all the .jar files in %HERITRIX_HOME%\lib to CLASSPATH, like
>
> classpath=%HERITRIX_HOME%\heritrix-1.0.0.jar;%HERITRIX_HOME%\lib\xxxxx.jar.........
>
> then use
>
> java -Xmx256m org.archive.crawler.Heritrix
>
> to start Heritrix.
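
The ordering rule in the advice above (main jar first, library jars after it) can be sketched as a small helper. This is an illustration under assumptions, not something shipped with Heritrix: `ClasspathBuilder`, `build`, and the library jar names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathBuilder {
    // Joins classpath entries so that `first` always comes before the others.
    static String build(String first, List<String> others, String separator) {
        List<String> entries = new ArrayList<>();
        entries.add(first);
        for (String jar : others) {
            if (!jar.equals(first)) {   // avoid listing the main jar twice
                entries.add(jar);
            }
        }
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // Hypothetical library jar names, for illustration only.
        List<String> libs = Arrays.asList("lib\\a.jar", "lib\\b.jar");
        System.out.println(build("heritrix-1.0.0.jar", libs, ";"));
    }
}
```

On Windows the separator is `;`; the resulting string is what would go into the `classpath` variable before calling `java`.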
>
>
>
> >D:\Crawler\heritrix-1.0.0\bin\heritrix.cmd
> >-----------------------------------------------
> >set HERITRIX_HOME=D:\Crawler\heritrix-1.0.0
> >set classpath=%HERITRIX_HOME%\heritrix-1.0.0.jar
> >java -Djava.ext.dirs=%HERITRIX_HOME%\lib -Xmx256m org.archive.crawler.Heritrix
> >----------------------------------------------------
> >
> >D:\>cd D:\Crawler\heritrix-1.0.0
> >
> >D:\Crawler\heritrix-1.0.0>bin\heritrix.cmd
> >
> >
> >--- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> >>
> >> Please post the script you use to start Heritrix, not the order.xml.
> >>
> >>
> >> Hello, zhousp!
> >>
> >> ------------------------------------------
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
> >>   <meta>
> >>     <name>Simple</name>
> >>     <description>Profile: Simple crawl</description>
> >>     <operator>Admin</operator>
> >>     <organization/>
> >>     <audience/>
> >>     <date>20040808043635</date>
> >>   </meta>
> >>   <controller>
> >>     <string name="settings-directory">settings</string>
> >>     <string name="disk-path"/>
> >>     <string name="logs-path">logs</string>
> >>     <string name="checkpoints-path">checkpoints</string>
> >>     <string name="state-path">state</string>
> >>     <string name="scratch-path">scratch</string>
> >>     <long name="max-bytes-download">0</long>
> >>     <long name="max-document-download">0</long>
> >>     <long name="max-time-sec">0</long>
> >>     <integer name="max-toe-threads">50</integer>
> >>     <newObject name="scope" class="org.archive.crawler.scope.DomainScope">
> >>       <boolean name="enabled">true</boolean>
> >>       <string name="seedsfile">seeds.txt</string>
> >>       <integer name="max-link-hops">25</integer>
> >>       <integer name="max-trans-hops">5</integer>
> >>       <newObject name="exclude-filter" class="org.archive.crawler.filter.OrFilter">
> >>         <boolean name="enabled">true</boolean>
> >>         <boolean name="if-matches-return">true</boolean>
> >>         <map name="filters">
> >>           <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
> >>             <boolean name="enabled">true</boolean>
> >>             <integer name="max-path-depth">20</integer>
> >>             <boolean name="path-less-or-equal-return">false</boolean>
> >>           </newObject>
> >>           <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
> >>             <boolean name="enabled">true</boolean>
> >>             <integer name="repetitions">3</integer>
> >>           </newObject>
> >>         </map>
> >>       </newObject>
> >>       <newObject name="additionalScopeFocus" class="org.archive.crawler.filter.FilePatternFilter">
> >>         <boolean name="enabled">true</boolean>
> >>         <boolean name="if-match-return">true</boolean>
> >>         <string name="use-default-patterns">All</string>
> >>         <string name="regexp"/>
> >>       </newObject>
> >>       <newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
> >>         <boolean name="enabled">true</boolean>
> >>         <integer name="max-speculative-hops">1</integer>
> >>         <integer name="max-referral-hops">2147483647</integer>
> >>         <integer name="max-embed-hops">2147483647</integer>
> >>       </newObject>
> >>     </newObject>
> >>     <map name="http-headers">
> >>       <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.0.0 +http://www.sohu.com)</string>
> >>       <string name="from">netsoldier@1...</string>
> >>     </map>
> >>     <newObject name="robots-honoring-policy" class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
> >>       <string name="type">classic</string>
> >>       <boolean name="masquerade">false</boolean>
> >>       <text name="custom-robots"/>
> >>       <stringList name="user-agents">
> >>       </stringList>
> >>     </newObject>
> >>     <newObject name="frontier" class="org.archive.crawler.frontier.Frontier">
> >>       <float name="delay-factor">5.0</float>
> >>       <integer name="max-delay-ms">5000</integer>
> >>       <integer name="min-delay-ms">500</integer>
> >>       <integer name="max-retries">30</integer>
> >>       <long name="retry-delay-seconds">900</long>
> >>       <boolean name="hold-queues">false</boolean>
> >>       <integer name="preference-embed-hops">1</integer>
> >>       <integer name="host-valence">1</integer>
> >>       <integer name="total-bandwidth-usage-KB-sec">0</integer>
> >>       <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
> >>       <integer name="host-queues-memory-capacity">200</integer>
> >>     </newObject>
> >>     <map name="pre-fetch-processors">
> >>       <newObject name="Preselector" class="org.archive.crawler.prefetch.Preselector">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>         <boolean name="recheck-scope">true</boolean>
> >>         <boolean name="block-all">false</boolean>
> >>         <string name="block-by-regexp"/>
> >>       </newObject>
> >>       <newObject name="Preprocessor" class="org.archive.crawler.prefetch.PreconditionEnforcer">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>         <integer name="ip-validity-duration-seconds">21600</integer>
> >>         <integer name="robot-validity-duration-seconds">86400</integer>
> >>       </newObject>
> >>     </map>
> >>     <map name="fetch-processors">
> >>       <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>         <integer name="timeout-seconds">1200</integer>
> >>         <integer name="sotimeout-ms">20000</integer>
> >>         <long name="max-length-bytes">9223372036854775807</long>
> >>         <string name="load-cookies-from-file"/>
> >>         <string name="save-cookies-to-file"/>
> >>         <string name="trust-level">open</string>
> >>         <stringList name="accept-headers">
> >>         </stringList>
> >>         <string name="http-proxy-host"/>
> >>         <string name="http-proxy-port"/>
> >>         <string name="default-encoding">ISO-8859-1</string>
> >>         <boolean name="sha1-content">true</boolean>
> >>       </newObject>
> >>     </map>
> >>     <map name="extract-processors">
> >>       <newObject name="ExtractorHTTP" class="org.archive.crawler.extractor.ExtractorHTTP">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="ExtractorHTML" class="org.archive.crawler.extractor.ExtractorHTML">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="ExtractorCSS" class="org.archive.crawler.extractor.ExtractorCSS">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="ExtractorJS" class="org.archive.crawler.extractor.ExtractorJS">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="ExtractorSWF" class="org.archive.crawler.extractor.ExtractorSWF">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>     </map>
> >>     <map name="write-processors">
> >>       <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>         <boolean name="compress">false</boolean>
> >>         <string name="prefix">IAH</string>
> >>         <string name="suffix">${HOSTNAME}</string>
> >>         <integer name="max-size-bytes">100000000</integer>
> >>         <string name="path">arcs</string>
> >>         <integer name="pool-max-active">5</integer>
> >>         <integer name="pool-max-wait">300000</integer>
> >>       </newObject>
> >>     </map>
> >>     <map name="post-processors">
> >>       <newObject name="Updater" class="org.archive.crawler.postprocessor.CrawlStateUpdater">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>       </newObject>
> >>       <newObject name="Postselector" class="org.archive.crawler.postprocessor.Postselector">
> >>         <boolean name="enabled">true</boolean>
> >>         <map name="filters">
> >>         </map>
> >>         <boolean name="seed-redirects-new-seed">true</boolean>
> >>       </newObject>
> >>     </map>
> >>     <map name="loggers">
> >>       <newObject name="crawl-statistics" class="org.archive.crawler.admin.StatisticsTracker">
> >>         <integer name="interval-seconds">20</integer>
> >>       </newObject>
> >>     </map>
> >>     <string name="recover-path"/>
> >>     <newObject name="credential-store" class="org.archive.crawler.datamodel.CredentialStore">
> >>       <map name="credentials">
> >>       </map>
> >>     </newObject>
> >>   </controller>
> >> </crawl-order>
> >>
> >> -------------------------------------------
> >>
> >> ======== 2004-08-08 16:10:00, you wrote: ========
> >>
> >> Please post your script here.
> >>
> >> -----Original Message-----
> >> From: "agoodman_rgd" <renguodong@2...>
> >> To: archive-crawler@yahoogroups.com
> >> Date: Sun, 08 Aug 2004 08:09:40 -0000
> >> Subject: [archive-crawler] Re: Why crawler nothing ???
> >>
> >> > Thanks,
> >> > but I rewrote the script, and now I can start up Heritrix,
> >> > create a job, and start the job. It says "Successfully", but
> >> > nothing was downloaded to my disk.
> >> >
> >> > Why?
> >> > How do I configure the job to download pages to my disk?
> >> > Thanks
> >> >
> >> >
> >> > --- In archive-crawler@yahoogroups.com, "zhousp" <zhousp@6...> wrote:
> >> > > Hi, if you install Heritrix on a Windows box, you must write your own
> >> > > script to start Heritrix, because bin/heritrix is meant to run on Linux.
> >> > >
> >> > > If you write your own script to run Heritrix, you should add
> >> > > heritrix.jar as the first entry on the classpath.
> >> > >
> >> > > -----Original Message-----
> >> > > From: "agoodman_rgd" <renguodong@2...>
> >> > > To: archive-crawler@yahoogroups.com
> >> > > Date: Sun, 08 Aug 2004 04:45:53 -0000
> >> > > Subject: [archive-crawler] Why crawler nothing ???
> >> > >
> >> > > > I installed Heritrix on Windows 2000 Pro.
> >> > > > I started a job and got no exceptions.
> >> > > > It said the job finished, but it crawled nothing.
> >> > > >
> >> > > > Help me.
> >> > > >
> >> > > > reports:
> >> > > > -----------------------------------
> >> > > > Job name: Simple
> >> > > > Status: Finished
> >> > > > Time: 13 sec.
> >> > > > Processed docs/sec: 0.23
> >> > > > Processed KB/sec: 0.0
> >> > > > Total data written: 53 B
> >> > > >
> >> > > >
> >> > > > URIs
> >> > > >
> >> > > > Pending:     0
> >> > > > Discovered:  3
> >> > > > Queued:      0
> >> > > > In progress: 0
> >> > > >
> >> > > >            Total  Successfully  Failed  Disregarded
> >> > > > Finished:  3      3             0       0
> >> > > >
> >> > > > Status code          Documents
> >> > > > HTTP-200-Success-OK  2
> >> > > > DNS-1-OK             1
> >> > > >
> >> > > > File type   Documents  Data
> >> > > > text/dns    1          1 B
> >> > > > text/html   1          1 B
> >> > > > text/plain  1          1 B
> >> > > >
> >> > > > Hosts        Documents  Data
> >> > > > www.loc.gov  2          1 B
> >> > > > dns:         1          1 B
> >> > > >
> >> > > > log:
> >> > > > ---------------------------------
> >> > > > crawl.log for Simple Displaying: 100.0% of 352 B
> >> > > >
> >> > > > 20040808043635868 1 53 dns:www.loc.gov P http://www.loc.gov/ text/dns #000 300 - -
> >> > > > 20040808043638011 200 0 http://www.loc.gov/robots.txt P http://www.loc.gov/ text/plain #001 641 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ -
> >> > > > 20040808043642978 200 0 http://www.loc.gov/ - - text/html #027 1762 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 3t
> >>
> >> = = = = = = = = = = = = = = = = = = = = = =
> >> Regards,
> >>
> >> Ren Guodong
> >> renguodong@2...
> >> 2004-08-09




Mon Aug 9, 2004 2:06 am

agoodman_rgd

agoodman_rgd (Aug 8, 2004, 4:45 am)
I installed Heritrix on Windows 2000 Pro. I started a job and got no exceptions. It said the job finished, but it crawled nothing. Help me. ... Job name:...

zhousp (Aug 8, 2004, 5:47 am)
Hi, if you install Heritrix on a Windows box, you must write your own script to start Heritrix, because bin/heritrix is meant to run on Linux. If you write your own...

agoodman_rgd (Aug 8, 2004, 8:09 am)
Thanks, but I rewrote the script, and now I can start up Heritrix, create a job, and start the job. It says "Successfully", but nothing was downloaded to my disk...

zhousp (Aug 8, 2004, 8:16 am)
Please post your script here. ... From: "agoodman_rgd" <renguodong@...> To: archive-crawler@yahoogroups.com Date: Sun, 08 Aug 2004 08:09:40 -0000 Subject:...

agoodman_rgd (Aug 9, 2004, 12:53 am)
Hello, zhousp! ... <?xml version="1.0" encoding="UTF-8"?> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"...

zhousp (Aug 9, 2004, 1:02 am)
Please post the script you use to start Heritrix, not the order.xml. Hello, zhousp! ... <?xml version="1.0" encoding="UTF-8"?> <crawl-order...

agoodman_rgd (Aug 9, 2004, 1:22 am)
D:\Crawler\heritrix-1.0.0\bin\heritrix.cmd ... set HERITRIX_HOME=D:\Crawler\heritrix-1.0.0 set classpath=%HERITRIX_HOME%\heritrix-1.0.0.jar java...

zhousp (Aug 9, 2004, 1:40 am)
I notice you use "-Djava.ext.dirs=%HERITRIX_HOME%\lib" to include the files in %HERITRIX_HOME%\lib. Java loads classes from java.ext.dirs first...

agoodman_rgd (Aug 9, 2004, 2:06 am)
Now it can crawl something. Thank you very much! ...

agoodman_rgd (Sep 3, 2004, 11:41 am)
Can you tell me why classes are loaded in this order? I don't understand the difference between the two ways. Thanks ...

stack (Sep 3, 2004, 3:21 pm)
... When loading classes, Java takes the first one it finds in CLASSPATH. Ansi is having you make sure that the Heritrix jar appears first in the CLASSPATH...

agoodman_rgd (Sep 6, 2004, 12:51 am)
Hello, stack! Yes, I saw the httpclient code in Heritrix. When I delete HttpConnection.class & HttpParser.class of...

stack (Sep 6, 2004, 1:32 am)
... Sorry you are having a tough time getting it going. You shouldn't need to edit the httpclient.jar. You tried what Ansi suggested? Putting the heritrix.jar...

agoodman_rgd (Sep 6, 2004, 3:45 am)
Hello, stack! Yes, Ansi's suggestion is very good, and it can run in that mode! The reason I deploy it with Tomcat is that I want to...

stack (Sep 6, 2004, 7:21 pm)
... We develop Heritrix using Eclipse and are able to run it inside the Eclipse environment without problem. You must set the Java system property...