archive-crawler

Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English
#5593 From: "goblin_cz" <adam.brokes@...>
Date: Wed Dec 3, 2008 10:48 pm
Subject: 74 millions docs - Out Of Memory
 
Hi,

the Czech crawl again :) . I started with the default profile, set some
specific rules (100MB limit etc.) and ran the crawl again. You can
find the order.xml here:

http://raptor.webarchiv.cz/heritrix/order.xml

Tech spec:
Heritrix 1.14.2
8 core Xeon
8GB RAM
64bit sun java
3GB java heap
Debian 4.1

Everything went without any serious trouble. The only exception that
was thrown was:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
Stacktrace: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:707)
    at org.archive.io.GenericReplayCharSequence.getReadOnlyMemoryMappedBuffer(GenericReplayCharSequence.java:277)
    at org.archive.io.GenericReplayCharSequence.decodeToFile(GenericReplayCharSequence.java:219)
    at org.archive.io.GenericReplayCharSequence.<init>(GenericReplayCharSequence.java:164)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:559)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:515)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:314)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:295)
    at org.archive.crawler.extractor.ExtractorHTML.extract(ExtractorHTML.java:540)
    at org.archive.crawler.extractor.Extractor.innerProcess(Extractor.java:67)
    at org.archive.crawler.framework.Processor.process(Processor.java:112)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:302)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:151)

This happened on a few pages, for example
http://dermoporadkyne.cz/poradenstvi.pdf, and I guess it is caused by
never-ending scripts.

After six days of crawling (I had 74 million documents and 4 TB of
data) something much more serious happened. Heritrix threw about 150
exceptions and paused itself.
All exceptions look like this:
Serious error occured trying to process 'CrawlURI http://www.sperky.vltava.cz/produkt-sperky.esp?product-id=860134354&seo-hint:product-name=Zlat%C4%82%CB%9D%20prsten%20Danfil%20DF1565&category-id=17543 LLL http://www.sperky.vltava.cz/DANFIL/vyrobce=1001130574/?category-id=16706 in Scheduler'
[ToeThread #132: http://www.sperky.vltava.cz/produkt-sperky.esp?product-id=860134354&seo-hint:product-name=Zlat%C4%82%CB%9D%20prsten%20Danfil%20DF1565&category-id=17543
  CrawlURI http://www.sperky.vltava.cz/produkt-sperky.esp?product-id=860134354&seo-hint:product-name=Zlat%C4%82%CB%9D%20prsten%20Danfil%20DF1565&category-id=17543 LLL http://www.sperky.vltava.cz/DANFIL/vyrobce=1001130574/?category-id=16706
    0 attempts
     in processor: Scheduler
     ACTIVE for 1h44m26s863ms
     step: ABOUT_TO_BEGIN_PROCESSOR for 1h44m5s663ms
     java.lang.Thread.getStackTrace(Thread.java:1436)
     org.archive.crawler.framework.ToeThread.reportTo(ToeThread.java:514)
     org.archive.crawler.framework.ToeThread.reportTo(ToeThread.java:592)
     org.archive.util.DevUtils.extraInfo(DevUtils.java:65)
     org.archive.crawler.framework.ToeThread.seriousError(ToeThread.java:230)
     org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:325)
     org.archive.crawler.framework.ToeThread.run(ToeThread.java:151)
]
timestamp             discovered  queued     downloaded  doc/s(avg)  KB/s(avg)  dl-failures  busy-thread  mem-use-KB  heap-size-KB  congestion  max-depth  avg-depth
2008-11-30T19:49:37Z  226231290   147999609  74002484    0(121.9)    0(6828)    293362       197          2339025     2831488       7,213.49    320897     104
  (in thread 'ToeThread #132: http://www.sperky.vltava.cz/produkt-sperky.esp?product-id=860134354&seo-hint:product-name=Zlat%C4%82%CB%9D%20prsten%20Danfil%20DF1565&category-id=17543'; in processor 'Scheduler')

java.lang.OutOfMemoryError: Java heap space
Stacktrace: java.lang.OutOfMemoryError: Java heap space
    at java.lang.Class.getDeclaredFields0(Native Method)
    at java.lang.Class.privateGetDeclaredFields(Class.java:2291)
    at java.lang.Class.getDeclaredField(Class.java:1880)
    at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1610)
    at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:425)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:413)
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:310)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1106)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
    at com.sleepycat.bind.serial.SerialBinding.objectToEntry(SerialBinding.java:171)
    at com.sleepycat.collections.DataView.useValue(DataView.java:548)
    at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:824)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:758)
    at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:319)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:257)
    at org.archive.util.CachedBdbMap.expungeStaleEntry(CachedBdbMap.java:562)
    at org.archive.util.CachedBdbMap.expungeStaleEntries(CachedBdbMap.java:533)
    at org.archive.util.CachedBdbMap.get(CachedBdbMap.java:358)
    at org.archive.crawler.datamodel.ServerCache.getHostFor(ServerCache.java:146)
    at org.archive.crawler.datamodel.ServerCache.getHostFor(ServerCache.java:175)
    at org.archive.crawler.framework.WriterPoolProcessor.getHostAddress(WriterPoolProcessor.java:344)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:132)
    at org.archive.crawler.framework.Processor.process(Processor.java:112)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:302)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:151)

When I resume the crawl, only new exceptions are thrown and the crawl
goes back to the paused state.
I've searched and browsed this mailing list and there are a lot of
topics about OOM, but I am not 100% sure which one describes exactly
the same situation and is still current.

Can I modify the crawl to be able to resume it?
If not, how can I crawl again without storing already-downloaded data?
(checkpoint and recovery?)
What can I do to avoid this situation?

Thank you,

Best regards

Adam

#5594 From: Noah Levitt <nlevitt@...>
Date: Wed Dec 3, 2008 11:12 pm
Subject: Re: 74 millions docs - Out Of Memory
 
Hello Adam, thanks for the report.

I can't claim to know anything about the more serious second issue. But
the first issue appears to be
http://webteam.archive.org/jira/browse/HER-1482, which has been fixed in
svn on the 2.2.x line. If this is the case, the files that triggered
this exception would have to be larger than 2gb. Can you confirm? I
believe the only effect of this bug is to prevent link extraction from
the affected urls. I don't think it is related to your second issue.

Noah



#5595 From: Gordon Mohr <gojomo@...>
Date: Fri Dec 5, 2008 12:30 am
Subject: Re: 74 millions docs - Out Of Memory
 
Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

As Noah notes, this is a known issue which should be fixed in a future
2.x release. In the meantime, its effects should be limited to
preventing >2GB HTML documents from having their links discovered.

For comparison, some other broad crawlers have a policy of not even
looking for links past the first ~100KB of a resource. If upon
examination the affected pages have crucial outlinks you need to
discover, we could try backporting the 2.x fix or some other workaround
to a future 1.14.x patch release.

I am a little confused by the combination of the stack trace you
forwarded (which was in ExtractorHTML) and the example URL you gave
(http://dermoporadkyne.cz/poradenstvi.pdf). The ExtractorPDF uses a
different link-extraction method that should be unaffected by the 2GB
mapping limit -- so I would *not* expect a similar exception for a PDF
file.... unless perhaps it misidentifies itself as HTML?

In our crawls we usually set some max-size limit for individual
resources in FetchHTTP (such as 100MB, 700MB, or 1GB) -- so content will
have already been truncated before link-extraction occurs.
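
For background on where the exception comes from: FileChannel.map() rejects any single mapping larger than Integer.MAX_VALUE bytes, because the resulting buffer is indexed by int. The following is only a minimal sketch of mapping a large file as several smaller windows. It is not the HER-1482 fix, and the class and method names (WindowedMapping, windows, mapReadOnly) are made up for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix.
final class WindowedMapping {

    /** Splits [0, fileSize) into {offset, length} windows of at most maxWindow bytes. */
    static List<long[]> windows(long fileSize, long maxWindow) {
        if (maxWindow <= 0 || maxWindow > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("maxWindow must be in 1..Integer.MAX_VALUE");
        }
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += maxWindow) {
            out.add(new long[] { offset, Math.min(maxWindow, fileSize - offset) });
        }
        return out;
    }

    /** Maps the file read-only, one buffer per window, so no single map() call is too large. */
    static List<MappedByteBuffer> mapReadOnly(Path file, long maxWindow) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            List<MappedByteBuffer> buffers = new ArrayList<>();
            for (long[] w : windows(channel.size(), maxWindow)) {
                buffers.add(channel.map(FileChannel.MapMode.READ_ONLY, w[0], w[1]));
            }
            return buffers; // mappings stay valid after the channel is closed
        }
    }

    public static void main(String[] args) {
        // A 5 GiB file needs three windows when each is capped at Integer.MAX_VALUE bytes.
        for (long[] w : windows(5L << 30, Integer.MAX_VALUE)) {
            System.out.println("offset=" + w[0] + " length=" + w[1]);
        }
    }
}
```

A consumer that treats the windows as one character sequence still has to handle multi-byte characters that straddle a window boundary, which this sketch leaves out.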

Re: OutOfMemoryErrors

These are a concern because I don't know of any outstanding bugs which
could account for them, given your apparent configuration.

When an OOME happens, we pause the crawl in the hope some interaction
and info-extraction will be possible (via the web UI or other tools).
However, since the OOME may have occurred anywhere, crucial in-memory
structures may be corrupt, and even the 'pause' may not have completed
cleanly. Thus the current crawl-launch may be unstable/unresumable. (I
discuss some other options for continuing the crawl below.)

To get a better idea of why the OOME happened, it could help to see more
of the "Serious error" stacks. Feel free to send them to me off-list, if
you'd like.

Also, the 'jmap' tool, especially as 'jmap -histo JVM_PID', shows the
distribution of live objects, and so might suggest what's gone wrong by
what objects are overrepresented (compared to other runs where the crawl
progresses indefinitely). So if the JVM is still alive, please forward
the first ~30 lines of 'jmap -histo' output.
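
To make the comparison against a healthy run less manual, here is a rough sketch that parses two 'jmap -histo' dumps and ranks classes by how many more bytes they hold in the suspect run. It assumes the usual row layout of "num: #instances #bytes class-name"; the class name HistoDiff and the sample rows are invented for illustration and are not real output from this crawl.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix or the JDK tools.
final class HistoDiff {
    // Assumed row shape: "   1:      123456     7890123  [C"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+).*$");

    /** Returns bytes per class name; header, separator and total lines are skipped. */
    static Map<String, Long> parseBytes(List<String> lines) {
        Map<String, Long> bytes = new HashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                bytes.merge(m.group(3), Long.parseLong(m.group(2)), Long::sum);
            }
        }
        return bytes;
    }

    /** Classes of the suspect run, sorted by byte growth over the baseline (largest first). */
    static List<Map.Entry<String, Long>> growth(Map<String, Long> baseline,
                                                Map<String, Long> suspect) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : suspect.entrySet()) {
            long delta = e.getValue() - baseline.getOrDefault(e.getKey(), 0L);
            out.add(new AbstractMap.SimpleEntry<String, Long>(e.getKey(), delta));
        }
        out.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for two saved histogram files.
        List<String> healthy = Arrays.asList(
                " num     #instances         #bytes  class name",
                "   1:        100000        8000000  [C",
                "   2:         90000        2160000  java.lang.String");
        List<String> stuck = Arrays.asList(
                " num     #instances         #bytes  class name",
                "   1:       2000000      160000000  [C",
                "   2:         95000        2280000  java.lang.String");
        for (Map.Entry<String, Long> e : growth(parseBytes(healthy), parseBytes(stuck))) {
            System.out.println(e.getKey() + " grew by " + e.getValue() + " bytes");
        }
    }
}
```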

To try continuing the crawl, the two main options are: (1) resuming from
a checkpoint; or (2) launching a new crawl which is initialized with the
previous crawl's 'recovery log'. These are discussed a bit in the
Heritrix User Manual:

http://crawler.archive.org/articles/user_manual/outside.html#recover

You *might* be able to make a valid checkpoint after the OOMEs, if the
crawler has paused cleanly -- but as noted above, after any OOME all
bets are off as to the stability of current in-memory structures.

You may have already been making regular checkpoints; if so you can
resume from one of the older ones, and the crawl should essentially
proceed from the moment of the checkpoint. That might mean some number
of URIs are repeat-crawled.

Using the recovery log is really launching an all-new crawl, but
preinitializing the crawler's 'already-seen' set and pending queues
based on two passes over the previous crawl's recovery log. As a result
you have a reasonable simulation of the first crawl's queues/seen-set at
the time the recovery log ended. (Ordering may be somewhat different;
state other than the queues/seen-set is not restored; richer URI state
like 'source-tagging' is lost.)
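
The two-pass idea can be sketched roughly as follows. This is a simplified illustration, not the crawler's actual recovery code: it assumes each log line starts with a three-character tag followed by the URI, with "F+ " marking a URI added to the frontier and "Fs " marking one that finished successfully, and the class name RecoveryReplaySketch is made up.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a two-pass replay; tags "F+ " and "Fs " are assumptions.
final class RecoveryReplaySketch {
    final Set<String> alreadySeen = new HashSet<>();
    final Deque<String> pending = new ArrayDeque<>();

    /** Extracts the URI that follows the three-character tag. */
    static String uriOf(String line) {
        return line.substring(3).trim().split("\\s+", 2)[0];
    }

    static RecoveryReplaySketch replay(List<String> logLines) {
        RecoveryReplaySketch state = new RecoveryReplaySketch();
        // Pass 1: everything that finished goes into the already-seen set.
        for (String line : logLines) {
            if (line.startsWith("Fs ")) {
                state.alreadySeen.add(uriOf(line));
            }
        }
        // Pass 2: anything that was scheduled but never finished is queued again, once.
        for (String line : logLines) {
            if (line.startsWith("F+ ")) {
                String uri = uriOf(line);
                if (state.alreadySeen.add(uri)) {
                    state.pending.addLast(uri);
                }
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // Made-up log lines for illustration.
        RecoveryReplaySketch state = replay(Arrays.asList(
                "F+ http://example.org/",
                "F+ http://example.org/a",
                "Fs http://example.org/"));
        System.out.println("seen=" + state.alreadySeen.size() + " pending=" + state.pending);
    }
}
```

Because the first pass runs over the whole log before anything is queued, a URI that was discovered early and completed late is never re-queued, which is why two passes are needed.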

Hope this helps,

- Gordon @ IA


#5596 From: "Tomas Ukkonen" <tomas.ukkonen@...>
Date: Mon Dec 8, 2008 4:03 pm
Subject: Crawling YouTube video pages?
 
Hi to all,


Is there a way to fetch YouTube videos with Heritrix (1.14.x)?

We have been trying to download selected youtube videos
in our thematic crawls but haven't succeeded (yet).


There is the ExtractorImpliedURI extractor, which solved the problem
with the previous version of YouTube pages
(http://webteam.archive.org/jira/browse/HER-1050).

However, with updated YouTube pages the construction of the
actual flash video url is slightly more complicated:
http://www.marksanborn.net/howto/using-wget-to-download-youtube-videos/


I can write a regular expression that creates the correct URI.

A line from the HTML video page:
var fullscreenUrl = '/watch_fullscreen?fs=1&BASE_YT_URL=http%3A%2F%2Fyoutube.com%2F&vq=None&video_id=eBGIQ7ZuuiU&l=210&sk=QHSGy-6-eiJ6h5qtrj1kR2OJanvL6tLlC&fmt_map=&t=OEgsToPDskIjlCX_1shTRBRDv2UQzVb-&hl=en&plid=AAROOIUyvpMFnjvzAAAAoARsYAg&tk=6J3d54cPnMJP7mUN2BtMkd2OmbgxTvB_GPcrV3ckGUbzd7KVMiH1kA%3D%3D&title=Rick Roll';

Matching regular expression (with the literal '?' escaped):
(.*)/watch_fullscreen\?(.*)video_id=(.*)&title(.*)

And replacement pattern:
http://www.youtube.com/get_video?video_id=$3
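
To check what the pattern pair produces, here is a small standalone sketch using plain java.util.regex, with the literal '?' in the trigger escaped. It assumes the trigger has to match the whole candidate string before the replacement is built; the class name ImpliedUriCheck and the shortened sample value are invented for illustration. Note that the greedy third group captures everything between "video_id=" and "&title", not just the video id.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness, not Heritrix code.
final class ImpliedUriCheck {

    /** Returns the built URI if the trigger matches the whole input, otherwise null. */
    static String implied(String trigger, String build, String candidate) {
        Matcher m = Pattern.compile(trigger).matcher(candidate);
        return m.matches() ? m.replaceFirst(build) : null;
    }

    public static void main(String[] args) {
        String build = "http://www.youtube.com/get_video?video_id=$3";
        // Shortened, made-up stand-in for the fullscreenUrl value.
        String sample = "/watch_fullscreen?fs=1&video_id=abc123&l=210&t=TOKEN&title=Some Title";

        // Greedy group 3 keeps the parameters that follow the id.
        String greedy = "(.*)/watch_fullscreen\\?(.*)video_id=(.*)&title(.*)";
        System.out.println(implied(greedy, build, sample));

        // Variant that stops at the next '&' and so keeps only the id.
        String idOnly = "(.*)/watch_fullscreen\\?(.*)video_id=([^&]*)&(.*)";
        System.out.println(implied(idOnly, build, sample));
    }
}
```

Which of the two variants is wanted depends on whether the get_video request needs the extra parameters (such as t) that follow the id on the page.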


But the 'var fullscreenUrl=/watch_..' line doesn't seem to be recognized as
a potential URI, so the implied URI extractor doesn't work.


So, should I try to extend the implied URI extractor to process whole files, or
is there some other way to extract YouTube video URIs from the pages?


The extractor queue I'm using is:

org.archive.crawler.extractor.ExtractorHTTP
org.archive.crawler.extractor.AggressiveExtractorHTML
org.archive.crawler.extractor.ExtractorCSS
org.archive.crawler.extractor.ExtractorJS
org.archive.crawler.extractor.ExtractorSWF
org.archive.crawler.extractor.ExtractorURI
org.archive.crawler.extractor.ExtractorImpliedURI
is.hi.bok.deduplicator.DeDuplicator
(+ don't follow robots.txt)



Thanks in advance,

--
Tomas Ukkonen
Information Systems Specialist
Kansalliskirjasto /
The National Library of Finland
phone +358-50-4150557
email tomas.ukkonen@...
www   http://www.kansalliskirjasto.fi
       http://www.nationallibrary.fi

#5597 From: "alihoaliho" <alihoaliho@...>
Date: Tue Dec 9, 2008 3:12 pm
Subject: How to make domain filter work?
 
Hi, we have a custom profile to crawl a small subset of pages from
each site in a seed file, and we want to limit our crawler so that
it only downloads pages within the seed host domain.

However, the crawler keeps leaving the seed domains. I made
the adjustment suggested in
http://tech.groups.yahoo.com/group/archive-crawler/message/5497
but the crawler still downloads pages that are not within the seed domain.

Can someone take a look at our config file and tell me what I need to
change to make domain filtering work?

Thanks,


====================Config=====================
<?xml version="1.0" encoding="UTF-8"?><crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
   <meta />
   <controller>
     <string name="settings-directory">settings</string>
     <string name="disk-path"></string>
     <string name="logs-path">logs</string>
     <string name="checkpoints-path">checkpoints</string>
     <string name="state-path">state</string>
     <string name="scratch-path">scratch</string>
     <long name="max-bytes-download">0</long>
     <long name="max-document-download">0</long>
     <long name="max-time-sec">0</long>
     <integer name="max-toe-threads">50</integer>
     <integer name="recorder-out-buffer-bytes">4096</integer>
     <integer name="recorder-in-buffer-bytes">65536</integer>
     <integer name="bdb-cache-percent">0</integer>
     <newObject name="scope"
class="org.archive.crawler.deciderules.DecidingScope">
       <boolean name="enabled">true</boolean>
       <string name="seedsfile">seeds.txt</string>
       <boolean name="reread-seeds-on-config">false</boolean>
       <newObject name="decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
         <map name="rules">
           <newObject name="rejectByDefault"
class="org.archive.crawler.deciderules.RejectDecideRule">
           </newObject>
           <newObject name="acceptIfSurtPrefixed"
class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
             <string name="decision">ACCEPT</string>
             <string name="surts-source-file"></string>
             <boolean name="seeds-as-surt-prefixes">true</boolean>
             <string name="surts-dump-file"></string>
             <boolean name="also-check-via">false</boolean>
             <boolean name="rebuild-on-reconfig">true</boolean>
           </newObject>
           <newObject name="rejectIfTooManyHops"
class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
             <integer name="max-hops">10</integer>
           </newObject>
           <newObject name="rejectIfPathological"
class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
             <integer name="max-repetitions">2</integer>
           </newObject>
           <newObject name="rejectIfTooManyPathSegs"
class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
             <integer name="max-path-depth">20</integer>
           </newObject>
           <newObject name="htmlContentTypeFilter"
class="org.archive.crawler.deciderules.ContentTypeMatchesRegExpDecideRule">
             <string name="decision">ACCEPT</string>
             <string name="regexp">(?i)text/html.*</string>
           </newObject>
           <newObject name="acceptIfPrerequisite"
class="org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule">
           </newObject>
           <newObject name="nonHtmlFilter"
class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
             <string name="decision">REJECT</string>
             <string name="regexp">.*(?i)\.(a|ai|aif|aifc|aiff|asc|avi|bcpio|bin|bmp|bz2|c|cdf|cgi|cgm|class|cpio|cpp?|cpt|csh|css|cxx|dcr|dif|dir|djv|djvu|dll|dmg|dms|doc|dtd|dv|dvi|dxr|eps|etx|exe|ez|gif|gram|grxml|gtar|h|hdf|hqx|ice|ico|ics|ief|ifb|iges|igs|iso|jnlp|jp2|jpe|jpeg|jpg|js|kar|latex|lha|lzh|m3u|mac|man|mathml|me|mesh|mid|midi|mif|mov|movie|mp2|mp3|mp4|mpe|mpeg|mpg|mpga|ms|msh|mxu|nc|o|oda|ogg|pbm|pct|pdb|pdf|pgm|pgn|pic|pict|pl|png|pnm|pnt|pntg|ppm|ppt|ps|py|qt|qti|qtif|ra|ram|ras|rdf|rgb|rm|roff|rpm|rtf|rtx|s|sgm|sgml|sh|shar|silo|sit|skd|skm|skp|skt|smi|smil|snd|so|spl|src|srpm|sv4cpio|sv4crc|svg|swf|t|tar|tcl|tex|texi|texinfo|tgz|tif|tiff|tr|tsv|ustar|vcd|vrml|vxml|wav|wbmp|wbxml|wml|wmlc|wmls|wmlsc|wrl|xbm|xht|xhtml|xls|xml|xpm|xsl|xslt|xwd|xyz|z|zip)$</string>
           </newObject>
         </map>
       </newObject>
     </newObject>
     <map name="http-headers" />
     <newObject name="robots-honoring-policy"
class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
       <string name="type">most-favored-set</string>
       <boolean name="masquerade">false</boolean>
       <text name="custom-robots"></text>
       <stringList name="user-agents">
       </stringList>
     </newObject>
     <newObject name="frontier"
class="org.archive.crawler.frontier.BdbFrontier">
       <float name="delay-factor">4.0</float>
       <integer name="max-delay-ms">10000</integer>
       <integer name="min-delay-ms">3000</integer>
       <integer name="respect-crawl-delay-up-to-secs">10</integer>
       <integer name="max-retries">5</integer>
       <long name="retry-delay-seconds">900</long>
       <integer name="preference-embed-hops">1</integer>
       <integer name="total-bandwidth-usage-KB-sec">0</integer>
       <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
       <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string>
       <string name="force-queue-assignment"></string>
       <boolean name="pause-at-start">false</boolean>
       <boolean name="pause-at-finish">false</boolean>
       <boolean name="source-tag-seeds">true</boolean>
       <boolean name="recovery-log-enabled">true</boolean>
       <boolean name="hold-queues">true</boolean>
       <integer name="balance-replenish-amount">3000</integer>
       <integer name="error-penalty-amount">100</integer>
       <long name="queue-total-budget">-1</long>
       <string name="cost-policy">org.archive.crawler.frontier.UnitCostAssignmentPolicy</string>
       <long name="snooze-deactivate-ms">300000</long>
       <integer name="target-ready-backlog">50</integer>
       <string
name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string>
       <boolean name="dump-pending-at-close">false</boolean>
     </newObject>
     <map name="uri-canonicalization-rules">
       <newObject name="Lowercase"
class="org.archive.crawler.url.canonicalize.LowercaseRule">
         <boolean name="enabled">true</boolean>
       </newObject>
       <newObject name="Userinfo"
class="org.archive.crawler.url.canonicalize.StripUserinfoRule">
         <boolean name="enabled">true</boolean>
       </newObject>
       <newObject name="WWW[0-9]*"
class="org.archive.crawler.url.canonicalize.StripWWWNRule">
         <boolean name="enabled">true</boolean>
       </newObject>
       <newObject name="SessionIDs"
class="org.archive.crawler.url.canonicalize.StripSessionIDs">
         <boolean name="enabled">true</boolean>
       </newObject>
       <newObject name="SessionCFIDs"
class="org.archive.crawler.url.canonicalize.StripSessionCFIDs">
         <boolean name="enabled">true</boolean>
       </newObject>
       <newObject name="QueryStrPrefix"
class="org.archive.crawler.url.canonicalize.FixupQueryStr">
         <boolean name="enabled">true</boolean>
       </newObject>
     </map>
     <map name="pre-fetch-processors">
       <newObject name="Preselector"
class="org.archive.crawler.prefetch.Preselector">
         <boolean name="enabled">true</boolean>
         <newObject name="Preselector#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="override-logger">false</boolean>
         <boolean name="recheck-scope">true</boolean>
         <boolean name="block-all">false</boolean>
         <string name="block-by-regexp"></string>
         <string name="allow-by-regexp"></string>
       </newObject>
       <newObject name="Preprocessor"
class="org.archive.crawler.prefetch.PreconditionEnforcer">
         <boolean name="enabled">true</boolean>
         <newObject name="Preprocessor#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <integer name="ip-validity-duration-seconds">21600</integer>
         <integer name="robot-validity-duration-seconds">86400</integer>
         <boolean name="calculate-robots-only">false</boolean>
       </newObject>
       <newObject name="QuotaEnforcer"
class="org.archive.crawler.prefetch.QuotaEnforcer">
         <boolean name="enabled">true</boolean>
         <newObject name="QuotaEnforcer#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="force-retire">false</boolean>
         <long name="server-max-fetch-successes">-1</long>
         <long name="server-max-success-kb">-1</long>
         <long name="server-max-fetch-responses">-1</long>
         <long name="server-max-all-kb">-1</long>
         <long name="host-max-fetch-successes">500</long>
         <long name="host-max-success-kb">-1</long>
         <long name="host-max-fetch-responses">-1</long>
         <long name="host-max-all-kb">-1</long>
         <long name="group-max-fetch-successes">-1</long>
         <long name="group-max-success-kb">-1</long>
         <long name="group-max-fetch-responses">-1</long>
         <long name="group-max-all-kb">-1</long>
       </newObject>
     </map>
     <map name="fetch-processors">
       <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS">
         <boolean name="enabled">true</boolean>
         <newObject name="DNS#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="accept-non-dns-resolves">false</boolean>
         <boolean name="digest-content">true</boolean>
         <string name="digest-algorithm">sha1</string>
       </newObject>
       <newObject name="HTTP"
class="org.archive.crawler.fetcher.FetchHTTP">
         <boolean name="enabled">true</boolean>
         <newObject name="HTTP#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <newObject name="midfetch-decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <integer name="timeout-seconds">1200</integer>
         <integer name="sotimeout-ms">20000</integer>
         <integer name="fetch-bandwidth">0</integer>
         <long name="max-length-bytes">0</long>
         <boolean name="ignore-cookies">false</boolean>
         <boolean name="use-bdb-for-cookies">true</boolean>
         <string name="load-cookies-from-file"></string>
         <string name="save-cookies-to-file"></string>
         <string name="trust-level">open</string>
         <stringList name="accept-headers">
         </stringList>
         <string name="http-proxy-host"></string>
         <string name="http-proxy-port"></string>
         <string name="default-encoding">ISO-8859-1</string>
         <boolean name="digest-content">true</boolean>
         <string name="digest-algorithm">sha1</string>
         <boolean name="send-if-modified-since">true</boolean>
         <boolean name="send-if-none-match">true</boolean>
         <boolean name="send-connection-close">true</boolean>
         <boolean name="send-referer">true</boolean>
         <boolean name="send-range">false</boolean>
         <string name="http-bind-address"></string>
       </newObject>
     </map>
     <map name="extract-processors">
       <newObject name="ExtractorHTTP"
class="org.archive.crawler.extractor.ExtractorHTTP">
         <boolean name="enabled">true</boolean>
         <newObject name="ExtractorHTTP#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
       <newObject name="ExtractorHTML"
class="org.archive.crawler.extractor.ExtractorHTML">
         <boolean name="enabled">true</boolean>
         <newObject name="ExtractorHTML#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="extract-javascript">true</boolean>
         <boolean name="treat-frames-as-embed-links">true</boolean>
         <boolean name="ignore-form-action-urls">false</boolean>
         <boolean name="extract-only-form-gets">true</boolean>
         <boolean name="extract-value-attributes">true</boolean>
         <boolean name="ignore-unexpected-html">true</boolean>
       </newObject>
       <newObject name="ExtractorCSS"
class="org.archive.crawler.extractor.ExtractorCSS">
         <boolean name="enabled">true</boolean>
         <newObject name="ExtractorCSS#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
       <newObject name="ExtractorJS"
class="org.archive.crawler.extractor.ExtractorJS">
         <boolean name="enabled">true</boolean>
         <newObject name="ExtractorJS#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
       <newObject name="ExtractorSWF"
class="org.archive.crawler.extractor.ExtractorSWF">
         <boolean name="enabled">true</boolean>
         <newObject name="ExtractorSWF#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
     </map>
     <map name="write-processors">
       <newObject name="Archiver2"
class="org.archive.crawler.writer.ARCWriterProcessor2">
         <boolean name="enabled">true</boolean>
         <newObject name="Archiver2#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="compress">true</boolean>
         <string name="prefix">IAH</string>
         <string name="suffix">${HOSTNAME}</string>
         <long name="max-size-bytes">100000000</long>
         <stringList name="path">
           <string>arcs</string>
         </stringList>
         <integer name="pool-max-active">5</integer>
         <integer name="pool-max-wait">300000</integer>
         <long name="total-bytes-to-write">0</long>
         <boolean name="skip-identical-digests">false</boolean>
       </newObject>
       <newObject name="MirrorWriter"
class="org.archive.crawler.writer.MirrorWriterProcessor">
         <boolean name="enabled">true</boolean>
         <newObject name="MirrorWriter#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="case-sensitive">true</boolean>
         <stringList name="character-map">
         </stringList>
         <stringList name="content-type-map">
         </stringList>
         <string name="directory-file">index.html</string>
         <string name="dot-begin">%2E</string>
         <string name="dot-end">.</string>
         <stringList name="host-map">
         </stringList>
         <boolean name="host-directory">true</boolean>
         <string name="path">mirror</string>
         <integer name="max-path-length">1023</integer>
         <integer name="max-segment-length">255</integer>
         <boolean name="port-directory">false</boolean>
         <boolean name="suffix-at-end">true</boolean>
         <string name="too-long-directory">LONG</string>
         <stringList name="underscore-set">
         </stringList>
       </newObject>
     </map>
     <map name="post-processors">
       <newObject name="Updater"
class="org.archive.crawler.postprocessor.CrawlStateUpdater">
         <boolean name="enabled">true</boolean>
         <newObject name="Updater#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
       <newObject name="LinksScoper"
class="org.archive.crawler.postprocessor.LinksScoper">
         <boolean name="enabled">true</boolean>
         <newObject name="LinksScoper#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
         <boolean name="override-logger">false</boolean>
         <boolean name="seed-redirects-new-seed">true</boolean>
         <integer name="preference-depth-hops">-1</integer>
         <newObject name="scope-rejected-url-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
       <newObject name="Scheduler"
class="org.archive.crawler.postprocessor.FrontierScheduler">
         <boolean name="enabled">true</boolean>
         <newObject name="Scheduler#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
           <map name="rules">
           </map>
         </newObject>
       </newObject>
     </map>
     <map name="loggers">
       <newObject name="crawl-statistics"
class="org.archive.crawler.admin.StatisticsTracker">
         <integer name="interval-seconds">20</integer>
       </newObject>
     </map>
     <string name="recover-path"></string>
     <boolean name="checkpoint-copy-bdbje-logs">true</boolean>
     <boolean name="recover-retain-failures">false</boolean>
     <boolean name="recover-scope-includes">true</boolean>
     <boolean name="recover-scope-enqueues">true</boolean>
     <newObject name="credential-store"
class="org.archive.crawler.datamodel.CredentialStore">
       <map name="credentials">
       </map>
     </newObject>
   </controller>
</crawl-order>

#5598 From: Gordon Mohr <gojomo@...>
Date: Tue Dec 9, 2008 6:44 pm
Subject: Re: How to make domain filter work?
 
As also mentioned in the referenced message, your
"htmlContentTypeFilter" makes no sense in a scope -- there's no
content-type yet to compare.

When there is no content-type, the rule's default is to apply its
decision (in this case ACCEPT), which overrides the rejections made by
the earlier rules in the scope. Remove the meaningless rule and you
should see the expected behavior.
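
As a rough sketch only (not a tested profile), the scope's rules map might
look like this once "htmlContentTypeFilter" is dropped. Every other rule is
copied from the posted config and left in its original order; the long
"nonHtmlFilter" regexp is not repeated here and is assumed to stay exactly
as posted.

```xml
<map name="rules">
  <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule">
  </newObject>
  <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
    <string name="decision">ACCEPT</string>
    <string name="surts-source-file"></string>
    <boolean name="seeds-as-surt-prefixes">true</boolean>
    <string name="surts-dump-file"></string>
    <boolean name="also-check-via">false</boolean>
    <boolean name="rebuild-on-reconfig">true</boolean>
  </newObject>
  <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
    <integer name="max-hops">10</integer>
  </newObject>
  <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
    <integer name="max-repetitions">2</integer>
  </newObject>
  <newObject name="rejectIfTooManyPathSegs" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
    <integer name="max-path-depth">20</integer>
  </newObject>
  <!-- "htmlContentTypeFilter" (ContentTypeMatchesRegExpDecideRule) removed:
       no content-type is known at scope time, so it would ACCEPT by default -->
  <newObject name="acceptIfPrerequisite" class="org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule">
  </newObject>
  <!-- "nonHtmlFilter" (MatchesRegExpDecideRule, decision REJECT) follows here,
       unchanged from the posted config -->
</map>
```

With the rule gone, the only ACCEPT for ordinary links should come from
"acceptIfSurtPrefixed", so URIs outside the seed-derived SURT prefixes
would keep the REJECT from "rejectByDefault".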

- Gordon @ IA

alihoaliho wrote:
> Hi, we have a custom profile to crawl a small subset of pages from
> each site in a seed file, and we want to limit our crawler so that
> it only downloads pages within the seed host's domain.
>
> However, the crawler keeps wandering outside the seed domains. I made
> the adjustment suggested in
> http://tech.groups.yahoo.com/group/archive-crawler/message/5497
> but the crawler still downloads pages that are not within the seed domain.
>
> Can someone take a look at our config file and tell me what I need to
> change to make domain filtering work?
>
> Thanks,
>
>

#5599 From: "mjjjhjemj" <bosoxchamps@...>
Date: Tue Dec 9, 2008 11:25 pm
Subject: How to retry seed items with 404 status
mjjjhjemj
 
I found a thread here ("How to add a URL into the retry list?") that
describes how to use the JMX client to have a URL retried. The excerpt
describing the JMX command line is below.

My questions:
1) Is there a way to have URLs (seed items) retried using the Heritrix
web interface?
If not, then:
2) If I create the Java app from the code below, do I just run it from
the command line while my current crawl keeps running? Do
I need to pause the crawl? ...
2a) Is the crawlerhostname (shown below) 'Heritrix' in my case, if
the following is displayed on the 'Setup' --> 'Local Instances' page?
guiport=9090, host=lnx2, jmxport=8849, name=Heritrix, type=CrawlService

Thanks for any help provided,
Mike

Excerpt from 'How to add a URL into the retry list'
---------------------------------------------------
"Another solution is to prepare a cookies file and load it before crawling
starts. See the HTTP Fetcher load-cookies-from-file and save-cookies-to-file
options.

If you want to re-fetch a URI, you can do that via the JMX command-line
tool. Example (I did not try it):

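# Paths and credentials for the command-line JMX client
JAVA_BIN=/usr/bin/java
JMX_JAR=/heritrix/bin/cmdline-jmxclient-0.10.5.jar
JMX_CONTROLROLE=controlRole:letmein
JMX_PORT=8849

JMX_CMD="$JAVA_BIN -jar $JMX_JAR $JMX_CONTROLROLE"

# Host name of the machine running the crawler
crawler=crawlerhostname

# List the registered beans and keep the one whose name contains 'mother'
mother=$($JMX_CMD $crawler:$JMX_PORT 2>&1 | grep 'mother')

# Operation to invoke on that bean, with its arguments
cmd="importUri=http://www.myjones.com/code/,true,true"

$JMX_CMD $crawler:$JMX_PORT $mother $cmd"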

#5600 From: Juergen Umbrich <juergen@...>
Date: Wed Dec 10, 2008 12:35 am
Subject: Disabling Seed logging
juergen@...
 
Hi all

referring to the post with the subject "Broad-scope 10M seeds Xmx6G 64-Bit
JVM: OOME: GC overhead limit exceeded" and the statement that the OOME
is related to the seed URI logging:

>> As I mentioned, we haven't typically run crawls with 10 million
>> seeds. (I think our largest has been closer to 1 million.) It's also
>> rare for us to request a seeds report on large crawls while they are
>> running. I suspect it's the building of the seeds report in memory
>> that's either triggering or contributing to your issue.
>> Do you need the report, or can you get the necessary info from the logs?
> No, we don't need the seed report. The information in the log files
> and our additional logging is more than enough.

>> We don't want any valid request for a seeds report to crash the
>> crawler, so there's something for us to fix here. However, the answer
>> might be to cap the size of a seed report viewable by the web UI --
>> that is, protect by limiting risky functionality.
> Please note that our seed list contains some erroneous lines (the
> lines have an RDF N-Quads format).
> (Just now I noticed all the warnings from
> org.archive.util.iterator.RegexpLineIterator, and I recall that
> regex handling in Java can cause serious problems. Could this be
> the reason for the GC issue?)

>> As noted above, the rescanning of the seeds file to compile the full
>> report is likely related. There's nothing inherently troubling about
>> Java regex handling, especially in this line-by-line scanning of
>> well-formed input (no deep recursion, even given your error lines). It's
>> more the size of the report data being assembled at the same time the
>> rest of the crawl is trying to proceed.

We discovered that we need to crawl a large list of URIs without
traversing outgoing links.
Is there a way to disable the seed reports, or does this involve changes
in the code? (We do not want to split the list into batches of 1M URIs.)

Best
   juergen

#5601 From: Gordon Mohr <gojomo@...>
Date: Wed Dec 10, 2008 5:10 am
Subject: Re: Disabling Seed logging
gojomo
 
If my guesswork on the previous post was correct, it was the requests to
display seed reports (via the web UI) that created the problem -- not
the mere background collection of seed disposition info. So I expect
your crawl should just work.

(Be sure not to use a scope rule, like SurtPrefixedDecideRule, that is
defined by the seeds.)

Other ways you could pre-load a crawler with URIs that are not
specifically marked as seeds:

* start it with 'pause-at-start'; load URIs via the JMX addUris
operations (though that may prove awkward with millions of URIs)

* synthesize a fake 'recovery.log' with the desired URIs, and specify
the crawl to begin from that log. (View the log from a prior crawl to
get a sense of the format; the lines beginning 'F+ ' will cause URIs to
be queued, as long as there is no other 'Fs ' or 'Ff ' line indicating
they already completed.)

I think I'd lean towards the recovery.log approach if there were
problems with just specifying the URIs as seeds.
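
A minimal sketch of the recovery.log idea, in Python. Only the 'F+ ' prefix comes from the description above; the assumption that a bare URI after the prefix is enough, and the helper names, are hypothetical. Compare the output with a log from a real crawl (including whether your version expects the file compressed) before relying on it.

```python
def recovery_lines(uris):
    """Yield one 'F+ <uri>' line per non-blank input URI.

    Assumption: a bare URI after the 'F+ ' prefix is sufficient;
    check a recovery log from a prior crawl for the exact format.
    """
    for uri in uris:
        uri = uri.strip()
        if uri:
            yield "F+ " + uri


def write_recovery_log(uris, path):
    """Write the synthesized lines to a plain-text file at `path`."""
    with open(path, "w", encoding="utf-8") as out:
        for line in recovery_lines(uris):
            out.write(line + "\n")


# Example input: raw lines as they might come from a URI list file.
sample = ["http://example.com/a\n", "", " http://example.org/b "]
lines = list(recovery_lines(sample))
```

Because `recovery_lines` is a generator, a list of millions of URIs can be streamed from a file into the log without holding it all in memory.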

- Gordon @ IA


#5602 From: "takeru sasaki" <sasaki.takeru@...>
Date: Thu Dec 11, 2008 3:40 pm
Subject: enter regexp escape in web interface
sasaki.takeru@...
 
Hi,

I am using version 2.0.2.

I am filtering the crawl.log view with a regexp.

I entered this:
http://.*¥.html

but the text field value was changed to:
http://.*?.html

Can I use escapes in a regexp?



takeru

#5603 From: Gordon Mohr <gojomo@...>
Date: Thu Dec 11, 2008 7:56 pm
Subject: Re: enter regexp escape in web interface
gojomo
 
This is probably a web UI encoding issue we could fix, BUT...

I don't think any '¥' characters, exactly as such, will be found in the
crawl.log. Instead, it will have been changed to a series of %-escaped
characters before having been tried and logged.

Testing in my (US-english) Firefox, a request for an URL ending
'/¥.html' actually generated an HTTP request line (viewed using the
HttpFox extension) of:

GET /%C2%A5.html HTTP/1.1

So I would suggest changing your regex to look for the %-encoded version.
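
To illustrate the %-encoding above, here is a small Python sketch (Python's standard library stands in for what the browser did; the helper name and example.com URLs are only for illustration). It percent-encodes the UTF-8 bytes of the yen sign and then matches the encoded form with a regex.

```python
import re
from urllib.parse import quote


def request_path(path):
    """Percent-encode a path the way a UTF-8 request line would carry it."""
    return quote(path)  # default: UTF-8, '/' left as-is


# U+00A5 is the yen sign; written as an escape to avoid editor encoding issues.
encoded = request_path("/\u00a5.html")

# A regex that looks for the %-encoded form, with the dot escaped.
pattern = re.compile(r"http://.*%C2%A5\.html")
hit = pattern.fullmatch("http://example.com/%C2%A5.html")
miss = pattern.fullmatch("http://example.com/\u00a5.html")
```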

Hope this helps,

- Gordon @ IA


#5604 From: "takeru sasaki" <sasaki.takeru@...>
Date: Fri Dec 12, 2008 2:05 am
Subject: Re: enter regexp escape in web interface
sasaki.takeru@...
 
Thank you for your help.

I want to escape "." (dot), not "¥" (back slash),
and other regex metacharacters, such as ".()[]?".

I will debug and build Heritrix myself, from 2.0.2 or trunk.
If anyone knows how, please show me how to set up a build and debug
environment with Eclipse.

Sorry for my poor English.
Thank you.

takeru


#5605 From: Gordon Mohr <gojomo@...>
Date: Fri Dec 12, 2008 2:46 am
Subject: Re: enter regexp escape in web interface
gojomo
 
I'm sorry I didn't understand. Heritrix uses the Java regex syntax, as
described at:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html

So, a single backslash ('\') should be sufficient to escape regex meta
characters.
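
A quick sketch of what the single backslash changes. Python's `re` module is used here as a stand-in for Java's regex engine (the two agree on this point), and the example URLs are made up.

```python
import re

# Unescaped dot: matches ANY character before "html".
unescaped = re.compile(r"http://.*.html")
# Escaped dot: matches only a literal "." before "html".
escaped = re.compile(r"http://.*\.html")

url_ok = "http://example.com/index.html"
url_bad = "http://example.com/indexXhtml"

loose_hit = unescaped.fullmatch(url_bad) is not None
strict_hit = escaped.fullmatch(url_bad) is not None
strict_ok = escaped.fullmatch(url_ok) is not None

# Escaping a whole run of metacharacters at once
# (Java offers Pattern.quote for a similar purpose).
quoted = re.escape(".()[]?")
```

The unescaped pattern accepts `indexXhtml` because `.` matches the `X`; the escaped one does not.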

However, in your message, when you write...

> not "¥" (back slash)

...I am actually seeing a 'yen' symbol inside your quotes -- not a
backslash.

According to <http://en.wikipedia.org/wiki/Backslash>, the Japanese ISO
646 encoding changes the ASCII backslash character ('\') to a yen mark
('¥'), and some fonts compound the confusion.

Perhaps by forcing your web browser to use another encoding, you can
generate a backslash that will be understood as an escape by Heritrix?
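
If changing the browser encoding is not practical, one workaround is to turn yen signs back into backslashes before the pattern is used. The sketch below is a hypothetical helper, not a Heritrix feature, and it would break any pattern that really is meant to contain a yen sign.

```python
import re

YEN = "\u00a5"  # U+00A5, what the Japanese keyboard/font produced


def normalize_escapes(pattern):
    """Replace each yen sign with an ASCII backslash (U+005C)."""
    return pattern.replace(YEN, "\\")


typed = "http://.*" + YEN + ".html"   # what reached the text field
fixed = normalize_escapes(typed)      # http://.*\.html

matches_dot = re.fullmatch(fixed, "http://example.com/a.html") is not None
matches_other = re.fullmatch(fixed, "http://example.com/aXhtml") is not None

# The two characters are distinct code points with different UTF-8 bytes.
yen_bytes = YEN.encode("utf-8")
backslash_bytes = "\\".encode("utf-8")
```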

- Gordon @ IA


#5606 From: "takeru sasaki" <sasaki.takeru@...>
Date: Fri Dec 12, 2008 3:20 am
Subject: Re: enter regexp escape in web interface
sasaki.takeru@...
 
Thank you.
You are right!!

I am using Mac OS X 10.4 and Firefox 3 in Japanese.
I typed a backslash "\" into an editor (CarbonEmacs), copied it, and pasted
it into Firefox. It worked.

I will try other browsers and changing the encoding.

Thank you.

takeru




#5607 From: "adam.taylor78" <adam.taylor78@...>
Date: Tue Dec 16, 2008 5:16 pm
Subject: Re: Crawling YouTube video pages?
adam.taylor78
 
Hi Tomas,

Here is a script that can be used to download videos with Heritrix.
Maybe this will help you? It is now on the Heritrix wiki:

http://webteam.archive.org/confluence/display/Heritrix/BeanShell+Script+For+Downloading+Video

This has only been tested with Heritrix 2.0.2, so I'm not sure, but you
may need to adjust some of the script for v1.x.

Adam

--- In archive-crawler@yahoogroups.com, "Tomas Ukkonen"
<tomas.ukkonen@...> wrote:
>
> Hi to all,
>
>
> Is there a way to fetch YouTube videos with Heritrix (1.14.x)?
>
> We have been trying to download selected youtube videos
> in our thematic crawls but haven't succeeded (yet).
>
>
> There is the ExtractorImpliedURI extractor, which solved the problem
> with the previous version of YouTube pages
> (http://webteam.archive.org/jira/browse/HER-1050).
>
> However, with updated YouTube pages the construction of the
> actual flash video url is slightly more complicated:
> http://www.marksanborn.net/howto/using-wget-to-download-youtube-videos/
>
>
> I can write a regular expression that creates the correct URI.
>
> A line from HTML video page:
> var fullscreenUrl = '/watch_fullscreen?fs=1&BASE_YT_URL=http%3A%2F%2Fyoutube.com%2F&vq=None&video_id=eBGIQ7ZuuiU&l=210&sk=QHSGy-6-eiJ6h5qtrj1kR2OJanvL6tLlC&fmt_map=&t=OEgsToPDskIjlCX_1shTRBRDv2UQzVb-&hl=en&plid=AAROOIUyvpMFnjvzAAAAoARsYAg&tk=6J3d54cPnMJP7mUN2BtMkd2OmbgxTvB_GPcrV3ckGUbzd7KVMiH1kA%3D%3D&title=Rick Roll';
>
> Matching regular expression:
> (.*)/watch_fullscreen?(.*)video_id=(.*)&title(.*)
>
> And replacement regex:
> http://www.youtube.com/get_video?video_id=$3
>
>
> But the 'var fullscreenUrl=/watch_..' line doesn't seem to be recognized
> as a potential URI, so the implied URI extractor doesn't work.
>
>
> So, should I try to extend the implied URI extractor to process whole
> files, or is there some other way to extract YouTube video URIs from
> the pages?
>
>
> The extractor queue I'm using is:
>
> org.archive.crawler.extractor.ExtractorHTTP
> org.archive.crawler.extractor.AggressiveExtractorHTML
> org.archive.crawler.extractor.ExtractorCSS
> org.archive.crawler.extractor.ExtractorJS
> org.archive.crawler.extractor.ExtractorSWF
> org.archive.crawler.extractor.ExtractorURI
> org.archive.crawler.extractor.ExtractorImpliedURI
> is.hi.bok.deduplicator.DeDuplicator
> (+ don't follow robots.txt)
>
>
>
> Thanks in advance,
>
> --
> Tomas Ukkonen
> Information Systems Specialist
> Kansalliskirjasto /
> The National Library of Finland
> phone +358-50-4150557
> email tomas.ukkonen@...
> www   http://www.kansalliskirjasto.fi
>       http://www.nationallibrary.fi
>

#5608 From: "jpeimecke" <jpeimecke@...>
Date: Wed Dec 17, 2008 3:09 pm
Subject: NotMatchesListRegExpDecideRule
jpeimecke
Send Email Send Email
 
Hi!

I have a problem with the NotMatchesListRegExpDecideRule.
My aim is to crawl the following sites :

http://tennis.fr/outils
http://tennis.fr/breves
http://tennis.fr/actualites
http://foot.fr/outils
http://foot.fr/breves
http://foot.fr/actualites

I think that the NotMatchesListRegExpDecideRule is the best choice
for crawling only these sites. So I added this DecideRule to the scope
with the following parameters:

global  root:scope:rules:7  object  org.archive.modules.deciderules.NotMatchesListRegExpDecideRule
global  root:scope:rules:7:decision  enum  org.archive.modules.deciderules.DecideResult-REJECT
global  root:scope:rules:7:enabled  boolean  true
global  root:scope:rules:7:list-logical-or  boolean  true
global  root:scope:rules:7:regexp-list  list  java.util.regex.Pattern
global  root:scope:rules:7:regexp-list:0  pattern  .*//[^/]+/(actualites|breves|outils)/.*

When I run a crawl, it stops without having crawled anything.

Can you help me?

Thanks in advance

Cheers

JP Eimecke

#5609 From: Gordon Mohr <gojomo@...>
Date: Wed Dec 17, 2008 7:39 pm
Subject: Re: NotMatchesListRegExpDecideRule
gojomo
 
Your rule can only REJECT not-matching URIs; whether any URIs are
ACCEPTed depends on the other rules. What are your other rules?
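
As a rough check of which URIs such a rule would REJECT, the pattern can be tested on its own. This sketch uses Python's `re` in place of Java regexes and assumes the rule requires the whole URI to match; the candidate URIs are illustrative.

```python
import re

# The pattern from the rule's regexp-list.
RULE_PATTERN = re.compile(r".*//[^/]+/(actualites|breves|outils)/.*")


def rejected_by_rule(uri):
    """True when the URI does not fully match, i.e. the rule would REJECT it."""
    return RULE_PATTERN.fullmatch(uri) is None


candidates = [
    "http://tennis.fr/outils",            # seed as listed, no trailing slash
    "http://tennis.fr/outils/",
    "http://tennis.fr/breves/page.html",
    "http://tennis.fr/robots.txt",        # typical prerequisite fetch
    "dns:tennis.fr",                      # DNS prerequisite URI
]

rejected = [uri for uri in candidates if rejected_by_rule(uri)]
```

Under these assumptions the seeds without a trailing slash, the robots.txt fetch, and the DNS URI all fail to match, which would be enough to stop a crawl before it fetches anything.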

- Gordon @ IA


#5610 From: "jpeimecke" <jpeimecke@...>
Date: Thu Dec 18, 2008 9:04 am
Subject: Re: NotMatchesListRegExpDecideRule
jpeimecke
 
Thanks for your answer.
Here are my DecideRules:

global  root:scope:rules  list  org.archive.modules.deciderules.DecideRule
global  root:scope:rules:0  object  org.archive.modules.deciderules.RejectDecideRule
global  root:scope:rules:1  object  org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
global  root:scope:rules:1:also-check-via  boolean  false
global  root:scope:rules:1:decision  enum  org.archive.modules.deciderules.DecideResult-ACCEPT
global  root:scope:rules:1:rebuild-on-reconfig  boolean  true
global  root:scope:rules:1:seeds-as-surt-prefixes  boolean  true
global  root:scope:rules:2  object  org.archive.modules.deciderules.TooManyHopsDecideRule
global  root:scope:rules:3  object  org.archive.modules.deciderules.TransclusionDecideRule
global  root:scope:rules:4  object  org.archive.modules.deciderules.PathologicalPathDecideRule
global  root:scope:rules:5  object  org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
global  root:scope:rules:6  object  org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
global  root:scope:rules:7  object  org.archive.modules.deciderules.MatchesListRegExpDecideRule
global  root:scope:rules:7:decision  enum  org.archive.modules.deciderules.DecideResult-REJECT
global  root:scope:rules:7:enabled  boolean  true
global  root:scope:rules:7:list-logical-or  boolean  true
global  root:scope:rules:7:regexp-list  list  java.util.regex.Pattern
global  root:scope:rules:7:regexp-list:0  pattern  .*//[^/]+/(contenu|author)/.*
global  root:scope:rules:7:regexp-list:1  pattern  .*\.(jpg|png|gif|css|ico|JPG|PNG|GIF|CSS|ICO)$
global  root:scope:rules:7:regexp-list:2  pattern  .*//[^/]+\.\d+/.*
global  root:scope:rules:8  object  org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
global  root:scope:rules:8:also-check-via  boolean  true
global  root:scope:rules:8:decision  enum  org.archive.modules.deciderules.DecideResult-REJECT
global  root:scope:rules:8:rebuild-on-reconfig  boolean  true
global  root:scope:rules:9  object  org.archive.modules.deciderules.NotMatchesListRegExpDecideRule
global  root:scope:rules:9:decision  enum  org.archive.modules.deciderules.DecideResult-REJECT
global  root:scope:rules:9:enabled  boolean  true
global  root:scope:rules:9:list-logical-or  boolean  true
global  root:scope:rules:9:regexp-list  list  java.util.regex.Pattern
global  root:scope:rules:9:regexp-list:0  pattern  .*//[^/]+/(actualites|breves|outils)/.*

Cheers

#5611 From: Gordon Mohr <gojomo@...>
Date: Thu Dec 18, 2008 9:32 am
Subject: Re: Re: NotMatchesListRegExpDecideRule
gojomo
Send Email Send Email
 
Thanks.

Your rule #8, NotOnDomainsDecideRule, may be unnecessary. Based on the
earlier rules, only on-domain and inline-linked URIs will have been
ACCEPTed (and if you want to trim wandering a few hops off your target
domains, limiting or removing TransclusionDecideRule makes more sense).

Also, as it comes after both TransclusionDecideRule and
PrerequisiteAcceptDecideRule, it could be rejecting necessary
prerequisite fetches (like DNS URIs) that those rules usually ACCEPT. I
recommend always keeping PrerequisiteAcceptDecideRule in the last position.
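
The ordering point can be seen in a toy model. The sketch below assumes a rule sequence in which every rule is consulted in order and the last non-PASS decision wins; the rule functions are simplified stand-ins (for example, the real PrerequisiteAcceptDecideRule does not key on a "dns:" prefix), not Heritrix code.

```python
ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


def decide(rules, uri):
    """Consult every rule in order; the last non-PASS decision wins."""
    result = PASS
    for rule in rules:
        decision = rule(uri)
        if decision != PASS:
            result = decision
    return result


def reject_all(uri):                 # stand-in for RejectDecideRule
    return REJECT


def accept_on_domain(uri):           # stand-in for a seed/SURT-based accept
    return ACCEPT if uri.startswith("http://tennis.fr/") else PASS


def accept_prerequisite(uri):        # stand-in for a prerequisite accept
    return ACCEPT if uri.startswith("dns:") else PASS


def reject_off_domain(uri):          # stand-in for a later REJECT rule
    return PASS if uri.startswith("http://tennis.fr/") else REJECT


prereq_early = [reject_all, accept_on_domain, accept_prerequisite, reject_off_domain]
prereq_last = [reject_all, accept_on_domain, reject_off_domain, accept_prerequisite]

dns_early = decide(prereq_early, "dns:tennis.fr")
dns_last = decide(prereq_last, "dns:tennis.fr")
page_last = decide(prereq_last, "http://tennis.fr/outils/")
```

With the prerequisite rule in the middle, the later REJECT overrides it and the DNS lookup is ruled out; moving it to the end restores the ACCEPT without affecting ordinary pages.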

If that doesn't solve your issue, there may be more hints about what's
going wrong in your seed list or crawl.log.

- Gordon @ IA


#5612 From: "jpeimecke" <jpeimecke@...>
Date: Thu Dec 18, 2008 11:01 am
Subject: Re: NotMatchesListRegExpDecideRule
jpeimecke
 
It works fine. Thank you very much.
I have seen that if I add a DecideRule, I must run a first crawl in
order for the settings to appear. Is that normal?

#5613 From: Gordon Mohr <gojomo@...>
Date: Thu Dec 18, 2008 6:48 pm
Subject: Re: Re: NotMatchesListRegExpDecideRule
gojomo
Send Email Send Email
 
jpeimecke wrote:
> It works fine. Thanks you very much.
> I have seen that if I had a DecideRule, I must run a first crawl in
> order to the settings appear. Is that normal?

I'm sorry, I don't understand the question.

I would say that when building a custom scope with new DecideRules, it
is good to work incrementally, only adding one rule at a time and
verifying its expected effect before adding other rules.

- Gordon

#5614 From: "takeru sasaki" <sasaki.takeru@...>
Date: Fri Dec 19, 2008 5:04 am
Subject: stop access to original site (wayback)
sasaki.takeru@...
 
Hello!

I am trying out Wayback.
http://archive-access.sourceforge.net/projects/wayback/

I have a question.

My Wayback instance has many HTML pages.
If an image file referenced by an img tag is not in the archive, Firefox
seems to request it from the original site (I observed this using Firebug).
I want to stop these requests to the original site.
Can I configure Wayback to work like that?

thank you,

takeru

#5615 From: "ckannanck" <ckannanck@...>
Date: Sat Dec 27, 2008 12:12 am
Subject: Debugging/Stepping through a Crawl process
ckannanck
 
Hi All,

I am a newbie to Heritrix. I checked out the latest stable version of
Heritrix 2 and tried to debug the crawl process step by step.

My attempts and the end results:
1) I tried setting up the debugger (using the NetBeans IDE; I can use
Eclipse if necessary) on the web UI, but this limits me to stepping
through only the UI code. I am unable to step through the actual
crawl process.
2) I tried to debug the main method of
org.archive.crawler.Heritrix.java. The debugger ends its session at
the end of the main method (since the main thread starts
sleeping). Subsequent actions in the web UI are not followed.

Can somebody help me with this problem? I am also wondering what the
best way is to understand the program flow (besides the developer and
user manuals).

Thanks
Kannan

#5616 From: Gordon Mohr <gojomo@...>
Date: Sat Dec 27, 2008 5:53 am
Subject: Re: Debugging/Stepping through a Crawl process
gojomo
 
ckannanck wrote:
> Hi All,
>
> I am a newbie to heritrix..I checked out the latest stable version of
>  Heritrix 2...and I tried to debug the crawl process step by step...

For a beginner, the 1.14.x code could be a better place to start -- the
documentation is better, and there are still significant
UI/configuration changes planned in the 2.x line.

> My Attempts & and the end results:
> 1)I tried setting up the debugger ( using netbeans ide..can use
> eclipse if necessary) on the WebUI...but this limits me to step
> through only the UI code... I am unable to step through the actual
> crawl process..
> 2)Tried to debug the main method of the
> org.archive.crawler.Heritrix.java ... the debugger ends its session at
> the end of the main method ( since the main thread starts
> sleeping).subsequent actions on the webui does'nt get followed...
>
> Can somebody  help me with this problem and also wondering what is the
> best way to understand the program flow....(besidees the developer and
> user manuals)

Heritrix developers use step debugging in Eclipse all the time, in
1.14.x and 2.x, so it's definitely possible.

When I launch from Eclipse for debugging, for Heritrix 1.14.x, my
command-line options include as a VM option "-Dheritrix.development",
setting a system property that prevents Heritrix from redirecting
STDOUT/STDERR to a log file and changes slightly how configuration info
is found.

When I launch from Eclipse for debugging, for Heritrix 2.x, my
command-line options to the Heritrix class include "-w
webui/src/main/webapp" to point to my IDE-compiled classes for the webui
(rather than the default WAR).

In either case, because Heritrix uses a pool of worker threads to handle
each URI in turn, and crawls are typically launched/controlled via short
transactions in webui webserver threads or other short-lived
special-purpose threads, attempting to step through activity from the
main() thread will not be helpful. Instead, I would add breakpoints at
other key methods to catch and observe operations of interest.

To summarize, for your purposes, I would recommend:

- As a newbie, start with Heritrix 1.14.x -- for now it's better for
getting started. (Though, you can still use 2.x if you want.)
- Use Eclipse -- I expect NetBeans would work but I'm sure Eclipse will,
and the source from project SVN is already arranged as an Eclipse project
- Set breakpoints rather than trying to step from the main() launch
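
To illustrate why stepping from main() dead-ends, here is a minimal, self-contained Java sketch. It is not Heritrix code: the names WorkerBreakpointDemo, processUri and crawl are made up for illustration. The launching thread only hands work to a pool, and the per-URI work runs on the pool's threads, so a breakpoint inside the per-URI method is the one that actually gets hit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class WorkerBreakpointDemo {
    // Counts URIs handled by worker threads (illustration only).
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for per-URI processing. It runs on a pool
    // thread, never on the launching thread, so set the breakpoint here.
    static void processUri(String uri) {
        System.out.println(Thread.currentThread().getName() + " processing " + uri);
        processed.incrementAndGet();
    }

    // Hands each URI to the pool, waits for completion, returns the count.
    static int crawl(List<String> uris, int workers) {
        processed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String uri : uris) {
            pool.submit(() -> processUri(uri));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/");
        // Stepping from here only shows the hand-off to the pool.
        System.out.println("processed " + crawl(uris, 2) + " URIs");
    }
}
```

In Heritrix itself the analogous spot would be a method that runs on a worker thread for each URI, for example a processor method such as ExtractorHTML.extract(). Which method is most useful depends on what you want to observe.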

Hope this helps,

- Gordon @ IA

#5617 From: "ckannanck" <ckannanck@...>
Date: Sun Dec 28, 2008 8:38 am
Subject: Re: Debugging/Stepping through a Crawl process
ckannanck
 
Thanks a lot! That helps. I am looking for continuous crawling
functionality, which I believe is being developed on 2.x (correct me
if I am wrong). I am trying to set up my dev environment in
NetBeans; I will post the instructions if I am able to figure it out :-)


A few more questions:
1) Is there a way to add new URLs/seeds to a crawl job without
stopping the crawl? If yes, when will the seeds get crawled? Any
pointers in the documentation or code would be helpful.

2) Are you talking about the UI changes mentioned at
http://webteam.archive.org/confluence/display/Heritrix/Continuous+Recrawling+Phase+A+Design+Notes
or are there more changes?


Thanks
Kannan


#5618 From: "pandya.bhavin@..." <pandya.bhavin@...>
Date: Mon Dec 29, 2008 6:25 am
Subject: All threads are waiting forever - ABOUT TO GET URI
pandya.bhavi...
 
Hi,

I am new to Heritrix and am trying to run a sample job, but I am facing
a problem.

I have configured "max-toe-thread" to 100, but it never starts
all the threads. The console panel always shows 2 or 3, or at most 10, out
of 100 Toe threads. I don't know why it is not able to use all threads.

I checked my "toe-thread" report; almost all the time it shows:
100 threads: 99 ABOUT_TO_GET_URI; 1 ABOUT_TO_BEGIN_PROCESSOR

Here are some other statistics it shows for the same job:
6287224 deepest queue
1419879 average depth
9061143 total downloaded and queued
52 KB/sec (76 avg)
1.85 URIs/sec (2.26 avg)

Can anybody give some pointers on why it is so slow, and
why it is not using all threads when the machine has enough bandwidth?

I am using Heritrix 1.14 with Java 1.5 on a Linux machine.

Could it be a problem with the DNS server?
Here are the entries from /etc/resolv.conf:
nameserver 192.168.101.1
nameserver 10.70.1.243
nameserver 10.70.1.230
search  192.168.101.1
retry 2
retrans 1000

Any pointer will really help!

- Bhavin

#5619 From: Gordon Mohr <gojomo@...>
Date: Tue Dec 30, 2008 7:01 am
Subject: Re: Re: Debugging/Stepping through a Crawl process
gojomo
 
ckannanck wrote:
> Thanks a lot !!! That helps.... I am looking for continuous crawling
> functionality which I believe is being developed on 2.x ( correct me
> if I am wrong ) ...I am trying to set up my dev environment in
> netbeans..I will post the instructions if I am able to figure it out :-)

The plan is for continuous and adaptive crawling to be implemented by
the 2.4 release, in early 2009. That means within operator-set limits
and preferences, the crawler will set its own schedule for revisiting
URIs based on their observed history of changes.

However, only some steps towards that currently exist in the 2.x
releases so far, the 2.x trunk, and a working branch with configuration
changes. So there's not yet any continuous functionality that demands
the use of 2.x.

> Few more questions:
> 1) Is there a way to add new urls/seeds to a crawl job without
> stopping the crawl ? If yes, when will the seeds get crawled...Any
> pointers in the documentation or code would be helpful...

I can think of two ways:

(1) If you edit the seed list while the crawl is paused, then trigger a
rescan of the settings (which may happen when editing the seeds via the
web UI, or definitely happens if you tweak any other single-field
setting), the seeds will be rescanned. Any that are not-yet-discovered
will be queued for crawling. (I'm sure this works in 1.x, and I think it
works in 2.x.)

(2) Use the JMX operations 'importUri' or 'importUris' on the 'CrawlJob'
object. Some info that could help:

Manual section on remote monitoring/control (1.x):
http://crawler.archive.org/articles/user_manual/outside.html#mon_com

Old post mentioning importUri operation:
http://tech.groups.yahoo.com/group/archive-crawler/message/4589

Page for the bundled 'cmdline-jmxclient' utility:
http://crawler.archive.org/cmdline-jmxclient/

(In 2.0.x importUris moved to being an operation on the Frontier object.)

--
In both cases, the URIs will be scheduled as if just discovered -- so
generally each will go at the end of the relevant per-host queue. (Of
course, if it's an all-new site with no other URIs, it may be first in
that queue.) When each queue then comes up for active crawling depends
on other frontier settings.
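
As a rough picture of that scheduling behaviour, the following Java sketch keeps one FIFO queue per host and appends each newly added URI to the end of its host's queue. It is a simplified model under my own assumptions, not the actual Heritrix frontier: the class PerHostQueues and its methods are hypothetical, and the real queue keys, ordering and activation depend on frontier settings.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class PerHostQueues {
    // One FIFO queue of pending URIs per host name (simplified model).
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    // Schedules a URI "as if just discovered": it goes to the END of the
    // queue for its host; a queue is created if the host is new.
    void schedule(String uri) {
        String host = URI.create(uri).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(uri);
    }

    // Returns the next pending URI for a host, or null if none is queued.
    String next(String host) {
        Deque<String> q = queues.get(host);
        return (q == null) ? null : q.pollFirst();
    }

    public static void main(String[] args) {
        PerHostQueues frontier = new PerHostQueues();
        // URIs already pending before the import.
        frontier.schedule("http://old.example/1");
        frontier.schedule("http://old.example/2");
        // Imported mid-crawl: one for a known host, one for a new host.
        frontier.schedule("http://old.example/imported");
        frontier.schedule("http://new.example/");

        System.out.println(frontier.next("old.example")); // earlier URI first
        System.out.println(frontier.next("new.example")); // first in its own queue
    }
}
```

The imported URI for the already-known host waits behind the two URIs queued earlier, while the URI for the all-new host is first in its own queue.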

> 2) Are you talking about the UI changes as mentioned @
> http://webteam.archive.org/confluence/display/Heritrix/Continuous+Recrawling+Phase+A+Design+Notes
> or are there more changes

Yes. The primary change is to a Spring-based configuration and crawl
lifecycle system, but this will have ripple effects through the web UI.

- Gordon @ IA


#5620 From: Gordon Mohr <gojomo@...>
Date: Tue Dec 30, 2008 7:07 am
Subject: Re: All threads are waiting forever - ABOUT TO GET URI
gojomo
 
All threads will only be used if there are many separate sites to crawl.
Heritrix will only fetch a single URI from a site at a time, and will
pause between fetches according to configurable parameters to be polite
to target sites.

Thus, a single thread can juggle the polite, one-at-a-time fetches to
many sites -- anywhere from a handful to dozens of sites (depending on
the politeness pauses). So unless you have thousands of target sites
with pending URIs, you are unlikely to see 100 threads all busy at once.

I suspect you are crawling just a handful of large sites, in which case
crawling politely is the main factor controlling pace.
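
A back-of-the-envelope version of that reasoning, as a small Java sketch. The model and all numbers are assumptions chosen for illustration (a fixed fetch time and a fixed politeness pause per site); they are not Heritrix defaults, and the class PolitenessEstimate is hypothetical. Each site yields at most one fetch per (fetch time + pause), and a site keeps a thread busy only while its fetch is in flight.

```java
class PolitenessEstimate {
    // Upper bound on crawl rate: every site contributes one fetch per
    // (fetchSeconds + delaySeconds) cycle.
    static double maxUrisPerSecond(int sites, double fetchSeconds, double delaySeconds) {
        return sites / (fetchSeconds + delaySeconds);
    }

    // Average number of busy threads: each site occupies a thread for
    // fetchSeconds out of every cycle, capped by the configured thread count.
    static double busyThreads(int sites, double fetchSeconds,
                              double delaySeconds, int maxThreads) {
        double wanted = sites * fetchSeconds / (fetchSeconds + delaySeconds);
        return Math.min(maxThreads, wanted);
    }

    public static void main(String[] args) {
        // Assumed values: 0.5 s per fetch, 2 s politeness pause, 100 threads.
        System.out.println(maxUrisPerSecond(5, 0.5, 2.0));      // few large sites
        System.out.println(busyThreads(5, 0.5, 2.0, 100));
        System.out.println(busyThreads(1000, 0.5, 2.0, 100));   // many sites
    }
}
```

With these assumed values, five sites cap out at about 2 URIs/sec and keep roughly one thread busy on average, which is the same order of magnitude as the reported 1.85 URIs/sec; only with hundreds of sites or more would the model fill all 100 threads.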

- Gordon @ IA


#5621 From: Bhavin Pandya <pandya.bhavin@...>
Date: Tue Dec 30, 2008 8:50 am
Subject: Re: All threads are waiting forever - ABOUT TO GET URI
pandya.bhavi...
 
Hi,
 
Thanks for the reply. I tried with a large set of websites and it's working as expected.
 
Thanks.
Bhavin



#5622 From: "pandya.bhavin@..." <pandya.bhavin@...>
Date: Tue Dec 30, 2008 12:38 pm
Subject: Distributed crawling - Filter domains by regular expression
pandya.bhavi...
 
Hi,

I have started using Heritrix on a single machine. I was just thinking
about the different ways we can achieve distributed crawling using
Heritrix.

I can think of one way, but I don't know how good it is from a
performance point of view.

What I want to do is feed the same seed list to all five servers
and configure each server to crawl only specific domains, e.g. server A
will crawl pages only from domains that start with [a-m],
and server B will crawl pages only from domains that start with [n-z].

I tried to achieve this using "MatchesRegExpDecideRule", but in the seeds
report, every seed shows the message "Blocked by user".

Preselector#decide-rules:
allow-by-regexp:
^(http|https|dns)\\://[a-zA-Z0-9][^!]+\\.[a-kA-K1-3][^!]+\\.(com|org|net|mil|edu|gov|in|uk|biz|COM|ORG|NET|MIL|EDU|GOV|IN|UK|BIZ)[^!]?

I don't know what's wrong.
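
One way to narrow this down is to test the expression outside the crawler. The Java sketch below is only a diagnostic aid under two assumptions that have not been checked against the Heritrix source: that the rule requires the whole URI to match (as Matcher.matches() does), and that the expression reaches the regex engine exactly as typed. It compares the expression as posted, where each doubled backslash means a literal backslash in the URI, with a single-escaped variant. The class name RegexCheck and the sample URIs are made up.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // Single-escaped variant, written as a Java string literal
    // (each "\\" here is ONE backslash in the regex text).
    static final String SINGLE_ESCAPED =
            "^(http|https|dns)\\://[a-zA-Z0-9][^!]+\\.[a-kA-K1-3][^!]+\\."
            + "(com|org|net|mil|edu|gov|in|uk|biz|COM|ORG|NET|MIL|EDU|GOV|IN|UK|BIZ)[^!]?";

    static final Pattern INTENDED = Pattern.compile(SINGLE_ESCAPED);

    // Regex text exactly as posted: every backslash doubled, which the
    // regex engine reads as "a literal backslash character".
    static final Pattern AS_POSTED =
            Pattern.compile(SINGLE_ESCAPED.replace("\\", "\\\\"));

    // Whole-string match (assumption: this mirrors how the rule is applied).
    static boolean matchesFully(Pattern p, String uri) {
        return p.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/",
            "http://www.example.com/index.html",
            "http://example.com/",
            "dns:www.example.com"
        };
        for (String uri : samples) {
            System.out.println(uri
                    + "  as-posted=" + matchesFully(AS_POSTED, uri)
                    + "  single-escaped=" + matchesFully(INTENDED, uri));
        }
    }
}
```

Under those assumptions, the as-posted form rejects every sample, since no URI contains a backslash after the scheme. The single-escaped form accepts http://www.example.com/ but rejects http://example.com/ (it needs two dots in front of the top-level domain), http://www.example.com/index.html (at most one character may follow the top-level domain), and dns:www.example.com (no "//" after the scheme).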

Is there a better way to achieve the same thing?

Any pointers or suggestions would be really helpful.

Thanks.
Bhavin
