archive-crawler

#6169 From: Pranay Pandey <sspranay@...>
Date: Tue Nov 24, 2009 3:36 pm
Subject: Re: (subject edited) Recrawling In Heritrix3.0.0-RC1
 


Hello Matt and Gordon,

Following Gordon's advice and assuming HER-1706 to be fixed, I am using only two of the persistProcessors: load and store.
As suggested, I am using the same BdbModule/'history' dir for both the processors.
However, I don't see any deduplication happening, and I am not getting any exception or error to look into. I am using heritrix-3.0.0-RC1.
This is my config file. Have I missed something?
http://cs.odu.edu/~pramo_p/crawler-beans.cxml

I would appreciate any help on this.

Thanks,
Pranay

--- On Fri, 11/13/09, Matthew Warhaftig <mwarhaftig@...> wrote:

From: Matthew Warhaftig <mwarhaftig@...>
Subject: Re: [archive-crawler] Recrawling In Heritrix3
To: archive-crawler@yahoogroups.com
Date: Friday, November 13, 2009, 9:04 PM

 

Good advice, thank you Gordon.  Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload) got the recrawl working.


Thanks,
Matt


On Nov 10, 2009, at 5:33 PM, Gordon Mohr wrote:

 

Your setup looks generally correct. Are you perhaps forgetting to both
declare the beans by name, *and* insert them by <ref> into the <list> of
the chain bean?
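
One quick way to check for the omission described above is to scan the job's crawler-beans.cxml for beans that are declared by id but never named in any <ref bean="..."/>. The sketch below is a standalone Python helper, not part of Heritrix. The function names and the sample XML (including the 'exampleChain' bean and the 'example.Chain' class) are made up for illustration, and it treats a <ref> anywhere in the file as sufficient rather than checking which chain's <list> contains it.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop an XML namespace such as '{http://...}bean' -> 'bean'."""
    return tag.rsplit("}", 1)[-1]


def bean_reference_status(xml_text: str, bean_ids: list[str]) -> dict[str, str]:
    """Report, for each id, whether it is declared and whether any <ref> names it."""
    root = ET.fromstring(xml_text)
    declared, referenced = set(), set()
    for element in root.iter():
        name = local_name(element.tag)
        if name == "bean" and element.get("id"):
            declared.add(element.get("id"))
        elif name == "ref" and element.get("bean"):
            referenced.add(element.get("bean"))

    status = {}
    for bean_id in bean_ids:
        if bean_id not in declared:
            status[bean_id] = "not declared"
        elif bean_id not in referenced:
            status[bean_id] = "declared but never referenced"
        else:
            status[bean_id] = "declared and referenced"
    return status


# Hypothetical, trimmed-down configuration used only to exercise the helper.
SAMPLE = """<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="persistLoadProcessor"
        class="org.archive.modules.recrawl.PersistLoadProcessor"/>
  <bean id="persistStoreProcessor"
        class="org.archive.modules.recrawl.PersistStoreProcessor"/>
  <bean id="exampleChain" class="example.Chain">
    <property name="processors">
      <list>
        <ref bean="persistLoadProcessor"/>
      </list>
    </property>
  </bean>
</beans>"""

if __name__ == "__main__":
    report = bean_reference_status(
        SAMPLE, ["persistLoadProcessor", "persistStoreProcessor"]
    )
    for bean_id, state in report.items():
        print(f"{bean_id}: {state}")
```

A bean reported as "declared but never referenced" is the case to look at first: Spring will instantiate it, but no chain will ever run it.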

Some other comments that may help in configuring these features in H3:

Matthew Warhaftig wrote:
> In H3 I am trying to set up crawl jobs that
> use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to
> discard duplicate content. I can get H1 to recrawl correctly but the
> same technique is not storing a history and finding duplicates for me in
> H3 (my job setup is based on these
> postings: https://webarchive.jira.com/wiki/display/Heritrix/Feature+Notes+-+1.12.0 & http://tech.groups.yahoo.com/group/archive-crawler/message/5920).
>
> For the storing job I added the following to the default
> H3 crawler-beans.cxml file. In the Fetch Chain just after
> the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>>       class="org.archive.modules.recrawl.FetchHistoryProcessor">
>>   <property name="historyLength" value="30" />
>> </bean>
>>
>> <bean id="persistStoreProcessor"
>>       class="org.archive.modules.recrawl.PersistStoreProcessor">
>> </bean>

Unrelated to your issues, in the near future it will be better to put
the persistStoreProcessor as the first in the 'dispositionChain' rather
than the last position of the 'fetchChain'. Essentially, the
dispositionChain's activities will be atomic with regard to
checkpointing -- any URI that starts the dispositionChain will finish it
before a checkpoint is stored -- but the fetchChain's activities will not.

Also unrelated but good to know: H3's configuration makes it much easier
for different components to optionally use separate BDB environments on
disk. Two reasons one might want to do so: (1) distribute the IO costs
over different disks; (2) keep distinct data separate on disk for
backup/migration-to-new-jobs, as with the difference between
queues/state of the running crawl, and URI history for deduplication
purposes.

For example, you could declare...

<bean id="persistStoreBdb" class="org.archive.bdb.BdbModule"
      autowire-candidate="false">
  <property name="dir" value="history" />
</bean>

<bean id="persistStoreProcessor"
      class="org.archive.modules.recrawl.PersistStoreProcessor">
  <property name="bdbModule">
    <ref bean="persistStoreBdb"/>
  </property>
</bean>

...and then persistent URI history will collect in a 'history' directory
(which could be a full path to anywhere convenient) rather than being
mixed-in with other crawler state.

(The reason for the 'autowire-candidate="false"' is to prevent this
BdbModule from being a competitor to the default for all the objects
that expect one distinguished instance to be available for autowiring.)

> Then for another job to use this stored history I added the following to
> the default H3 crawler-beans.cxml file. In the Fetch Chain just after the
> "preconditions" bean:
>> <bean id="persistLoadProcessor"
>>       class="org.archive.modules.recrawl.PersistLoadProcessor">
>>   <property name="preloadSource"
>>             value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state" />
>> </bean>
> And just after the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>>       class="org.archive.modules.recrawl.FetchHistoryProcessor">
>>   <property name="historyLength" value="30" />
>> </bean>

Note that 'preloadSource' causes the processor to scan the named log or
directory, and copy all its contents into its current history database.
If using this, make sure you're not pointing to the same path it's
currently

You could also skip the 'preload' and directly point the
PersistLoadProcessor to an existing history store:

<bean id="persistLoadBdb" class="org.archive.bdb.BdbModule"
      autowire-candidate="false">
  <property name="dir"
            value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state"/>
</bean>

<bean id="persistLoadProcessor"
      class="org.archive.modules.recrawl.PersistLoadProcessor">
  <property name="bdbModule">
    <ref bean="persistLoadBdb"/>
  </property>
</bean>

(Except for the 'preload', if specified, the PersistLoadProcessor only
reads from its given BdbModule/environment, so you could reuse a prior
crawl's persist-store target without damaging it.)

There's currently a problem with having both a PersistLoadProcessor and
PersistStoreProcessor in the same crawl using the same
BdbModule/directory [HER-1706], but that should work by the time of the
H3 official release.

Hope this helps,

- Gordon @ IA

> Am I declaring these beans correctly?
>
> Thanks,
> Matt
>
>
>




#6168 From: takeru sasaki <sasaki.takeru@...>
Date: Tue Nov 24, 2009 4:19 am
Subject: Re: IDN support of heritrix
 
Thank you, Matt,

I know about IDN in Heritrix now.
And I know about the problem with seeds.
I am watching it.

takeru


2009/11/22 Matthew Warhaftig <mwarhaftig@...>
 

Hi Takeru,


Heritrix automatically converts Internationalized Domain Name seeds and discovered links to punycode.  However, I recreated your issue of certain IDN seeds getting ignored and in response opened HER-1711.

Non-ASCII characters in the file path to Heritrix are not a problem.  Also, I did not encounter any other IDN-related issues during my testing.

Thanks,
Matt


On Nov 16, 2009, at 6:27 AM, takeru sasaki wrote:

 

Hi,

I want to know about IDN support of heritrix.
("Internationalized domain name"
http://en.wikipedia.org/wiki/Internationalized_Domain_Name)

I was tried to add Japanese-IDN host to seeds, and failed to crawl.
At "seed report" page, I met this message.
----- http://myheritrix/reports/seeds.jsp?job=xxxxxxx
Some items in seed specification were ignored. This may not indicate
any problem, but the ignored items are displayed here for reference:
-----

And I try to add Punycode of failed one.
It is success for seed report. and looks like good for crawling
relative links...

I want to know:
- Can I add IDN url to seed?
- Is Heritrix run correct when IDN-url is found in crawled HTMLs?
- Is Heritrix run correct when not-ASCII chars in path string?
- any other non-ASCII URL problem?
- if NOT, Is it hard to add IDN-support to Heritrix? I need it.

I am using version 1.14.3.

Thank you for read my message.

takeru




#6167 From: Bjarne Andersen <bja@...>
Date: Mon Nov 23, 2009 8:32 pm
Subject: Re: Nb documents per host
 
You should most likely use the QuotaEnforcer module, which lets you set the number of documents (successful and total) and the number of bytes per queue. If at the same time you use the HostnameQueueAssignmentPolicy (if I remember the name correctly), I believe that's exactly what you want.
This is, by the way, what's used by the NetarchiveSuite (http://netarchive.dk/suite), a complete toolset for web archiving using Heritrix as the crawl engine.
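
The idea of a per-queue document quota can be pictured outside Heritrix with a few lines of Python. This is not the QuotaEnforcer API: the class name 'HostQuota', its method and the limit used are invented for illustration, and the sketch assumes one queue per hostname.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostQuota:
    """Allow at most `max_docs` fetches per hostname (one queue per host assumed)."""

    def __init__(self, max_docs: int) -> None:
        self.max_docs = max_docs
        self.fetched = Counter()  # hostname -> documents counted so far

    def allow(self, url: str) -> bool:
        """Return True and count the URL if its host is still under quota."""
        host = urlsplit(url).hostname or ""
        if self.fetched[host] >= self.max_docs:
            return False
        self.fetched[host] += 1
        return True


if __name__ == "__main__":
    quota = HostQuota(max_docs=2)
    for url in ["http://a.example/1", "http://a.example/2",
                "http://a.example/3", "http://b.example/1"]:
        print(url, "allowed" if quota.allow(url) else "over quota")
```

The counter is keyed by whatever the queue key is, so a different queue assignment (per domain, per IP) changes what the quota applies to without changing the counting logic.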
Best
Bjarne andersen

Sent from my HTC Touch Pro


From: bourely <bourely@...>
Sent: 23 November 2009 20:31
To: archive-crawler@yahoogroups.com <archive-crawler@yahoogroups.com>
Subject: [archive-crawler] Nb documents per host

 

Hi,
I recently started using Heritrix 1.14.3 and I do not understand how to limit the number of documents per host.
The max-document-download parameter limits the total number of documents for the job.
But how do I limit this number per host?

Have I missed something in the docs?

Regards,
Dominique


#6166 From: "bourely" <bourely@...>
Date: Mon Nov 23, 2009 6:39 pm
Subject: Nb documents per host
 
Hi,
I recently started using Heritrix 1.14.3 and I do not understand how to limit the number of
documents per host.
The max-document-download parameter limits the total number of documents for the
job.
But how do I limit this number per host?

Have I missed something in the docs?

Regards,
Dominique

#6165 From: Віталій Тимчишин <tivv00@...>
Date: Mon Nov 23, 2009 10:39 am
Subject: TopmostAssignedSurtQueueAssignmentPolicy
 
Hello.

I am using Heritrix 2.0.2 with a configuration based on "Broad but shallow crawl". It seems that I've managed to set everything up OK.
The one problem I currently have is that even with root:queue-assignment-policy set to org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy, it still creates queues for each subdomain, for example:
Queue com,excential,integration,siebel,lexingtonpark, (p1)
1 items
last enqueued: http://lexingtonpark.siebel.integration.excential.com/integration/siebel/text/javascript
last peeked: http://lexingtonpark.siebel.integration.excential.com/integration/siebel/text/javascript
total expended: 4 (total budget: 500)
active balance: 96
last(avg) cost: 1(1)
totalScheduled fetchSuccesses fetchFailures fetchDisregards fetchResponses robotsDenials successBytes totalBytes fetchNonResponses
4              3              0             0               0              0             192368       192368     3
SimplePrecedenceProvider
1
Is this a bug, or have I missed something in the configuration?
P.S. I do not have any "special" sheets, I am using only one global sheet.
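
The queue name in the report above is the host written in SURT order (labels reversed, comma-separated). The Python sketch below shows that transformation and what collapsing such a key to its topmost two labels would look like. It is only an illustration, not Heritrix code: both function names are invented, and the fixed depth of 2 is a simplifying assumption that would be wrong for domains registered under multi-label suffixes.

```python
def surt_authority(host: str) -> str:
    """Reverse the host labels and join with commas: 'a.b.com' -> 'com,b,a,'."""
    labels = host.lower().strip(".").split(".")
    return ",".join(reversed(labels)) + ","


def collapse_to_depth(queue_key: str, depth: int = 2) -> str:
    """Keep only the first `depth` labels of a SURT-style key (assumed depth)."""
    labels = [label for label in queue_key.split(",") if label]
    return ",".join(labels[:depth]) + ","


if __name__ == "__main__":
    key = surt_authority("lexingtonpark.siebel.integration.excential.com")
    print(key)                     # per-subdomain key, as seen in the report
    print(collapse_to_depth(key))  # single key shared by the whole domain
```

With a per-domain policy in effect, one would expect all subdomains of excential.com to share the collapsed key rather than each getting its own queue.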

Best regards, Vitalii Tymchyshyn


#6164 From: Gordon Mohr <gojomo@...>
Date: Mon Nov 23, 2009 11:32 am
Subject: Heritrix 3.0.0-RC1 release candidate now available
 
The first Release Candidate test release of Heritrix 3.0 is now
available, version identifier 3.0.0-RC1.

We encourage expert Heritrix users curious about the new version or
willing to help with testing to try this release and share feedback.

Full information on obtaining and running this release is available on
the project wiki:

http://webarchive.jira.com/wiki/display/Heritrix/Heritrix3

Heritrix 3.0 documentation now includes a draft User Guide and draft API
Guide, about which we also welcome feedback:

https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.0+User+Guide

https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.0+API+Guide

== What's New in this RC1 Release ==

* job checkpoint/resume has been restored, and now proceeds without
requiring a full crawl pause

* a job with defaults can always be created (even without another job
or profile to copy it from), and handling of job directories in custom
locations is improved

* an experimental fixed-interval mechanism for automatically
rescheduling URIs

* support for storing FTP fetch results in WARCs

* Heritrix3 source code has been relicensed under the Apache License
2.0, except for a small number of files remaining under LGPL as noted in
the license notice and their individual headers

* numerous bug fixes

== What's New in Heritrix 3 ==

Heritrix 3 has a new, Spring-based system for configuring and
instantiating/launching crawls. The Spring-originated XML configuration
metadata format is now our format for describing crawls, as well.

The web-based user-interface in Heritrix 3 has been streamlined and
updated to have consistent URLs and simple forms for most actions,
including viewing and editing job files or running arbitrary script code
within the context of a job. Programmatic operations against the web
interface have replaced JMX as the preferred manner to remote-control
Heritrix.

Also, Heritrix 3 moves to a model where a single job, in a single job
directory, may be relaunched in place many times (instead of creating
a new job directory before each launch).

== Limitations ==

The most significant limitation compared to earlier versions of Heritrix
is that crawl configuration requires editing an XML configuration file,
either on disk or via the web UI's large-textarea raw file editor. A
guided, form-of-many-fields crawl configuration editor is not expected
until some time after the 3.0.0 release.

This Release Candidate is considered feature-complete, but work
continues on known bugs and packaging/interface/documentation issues
before a final release in early December. The prioritized list of issues
to be addressed for the 3.0.0 final release, and subsequent releases, is
viewable in the project issue tracker 'Road Map':

https://webarchive.jira.com/browse/HER?report=com.atlassian.jira.plugin.system.project:roadmap-panel

== Downloads ==

Distribution packages (.tar.gz or .zip) may be downloaded directly from
our Maven2 repository:

http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix/3.0.0-RC1/

As always, problem reports, ideas, fix/feature contributions, and other
kinds of feedback are all welcome here on the list and on the project
wiki and JIRA issue tracker:

Heritrix Wiki: http://webarchive.jira.com/wiki/display/Heritrix
Heritrix JIRA: http://webarchive.jira.com/browse/HER

Thanks!

- Gordon @ IA

#6163 From: "cagtat" <cagtat@...>
Date: Mon Nov 23, 2009 12:42 am
Subject: excluding a list of urls
 
Hi all,

Although I searched thoroughly in the group messages, my searches didn't end up
with a solution.

I am looking for a way to exclude a list of urls in the crawl. The reason for
wanting this is that I am doing domain specific crawls to discover new urls for
that domain, and I don't want to spend time/resource with already discovered
urls. The urls are mostly like

http://www.domain.com/path/blabla.aspx?id=111
http://www.domain.com/path/blabla.aspx?id=222

SurtPrefixedDecideRule didn't help me, because what I try to do is not about
prefixes.
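
The distinction between prefix matching and the exact-match exclusion wanted here can be sketched in Python. This runs outside Heritrix and says nothing about which Heritrix rule could do the job; the function names are invented, and the list of known URLs is assumed to be one URL per line.

```python
def load_known_urls(lines):
    """Build a set of already-discovered URLs, skipping blanks and '#' comments."""
    known = set()
    for line in lines:
        url = line.strip()
        if url and not url.startswith("#"):
            known.add(url)
    return known


def filter_new(candidates, known):
    """Keep only candidates that are not exact matches of a known URL."""
    return [url for url in candidates if url.strip() not in known]


if __name__ == "__main__":
    known = load_known_urls([
        "http://www.domain.com/path/blabla.aspx?id=111",
        "http://www.domain.com/path/blabla.aspx?id=222",
    ])
    print(filter_new([
        "http://www.domain.com/path/blabla.aspx?id=111",   # known, dropped
        "http://www.domain.com/path/blabla.aspx?id=1110",  # shares a prefix, kept
        "http://www.domain.com/path/blabla.aspx?id=333",   # new, kept
    ], known))
```

A set lookup keeps the check cheap even for long lists, and a URL that merely begins with a known URL is still treated as new, which is what a prefix rule cannot express.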

Any suggestions?

Thanks in advance

#6162 From: Matthew Warhaftig <mwarhaftig@...>
Date: Sat Nov 21, 2009 5:17 pm
Subject: Re: IDN support of heritrix
 
Hi Takeru,

Heritrix automatically converts Internationalized Domain Name seeds and discovered links to punycode.  However, I recreated your issue of certain IDN seeds getting ignored and in response opened HER-1711.

Non-ASCII characters in the file path to Heritrix are not a problem.  Also, I did not encounter any other IDN-related issues during my testing.
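
Until the seed issue is fixed, the workaround of entering seeds in punycode can be automated. The sketch below uses only Python's standard library (its built-in "idna" codec implements the older IDNA 2003 rules) and is independent of Heritrix; the function name and the example host are made up, and any user:password part of the URL is deliberately dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def to_punycode_url(url: str) -> str:
    """Return the URL with its hostname converted to ASCII (punycode) form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{ascii_host}:{parts.port}"
    # Note: userinfo (user:password@) is not carried over in this sketch.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_punycode_url("http://bücher.example/path?x=1"))
    # -> http://xn--bcher-kva.example/path?x=1
```

Only the hostname is rewritten; non-ASCII characters in the path or query are a separate matter and are left untouched here.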

Thanks,
Matt


On Nov 16, 2009, at 6:27 AM, takeru sasaki wrote:

 

Hi,

I want to know about IDN support of heritrix.
("Internationalized domain name"
http://en.wikipedia.org/wiki/Internationalized_Domain_Name)

I was tried to add Japanese-IDN host to seeds, and failed to crawl.
At "seed report" page, I met this message.
----- http://myheritrix/reports/seeds.jsp?job=xxxxxxx
Some items in seed specification were ignored. This may not indicate
any problem, but the ignored items are displayed here for reference:
-----

And I try to add Punycode of failed one.
It is success for seed report. and looks like good for crawling
relative links...

I want to know:
- Can I add IDN url to seed?
- Is Heritrix run correct when IDN-url is found in crawled HTMLs?
- Is Heritrix run correct when not-ASCII chars in path string?
- any other non-ASCII URL problem?
- if NOT, Is it hard to add IDN-support to Heritrix? I need it.

I am using version 1.14.3.

Thank you for read my message.

takeru



#6161 From: stack <stack@...>
Date: Fri Nov 20, 2009 4:48 pm
Subject: crawler-commons project
 
Hey crawlers:

I was at ApacheCon in Oakland in early November and was present during a meeting of a few of the open source crawler projects (Ken Krugler for Bixo, Andrzej for Nutch, and the Apache Droids fellow via Skype from Spain).  The topic was what can be done to facilitate sharing of common resources and code amongst the open source crawlers; e.g. robots.txt parsing, URL parsing, canonicalization, page similarity, etc.  There was no Heritrix representative, so I'm writing the list because my guess is that you fellas are probably interested in what was discussed and, in particular, in the meeting's outcome.

Here are notes from the meeting: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp.

Andrzej went ahead and started a crawler-commons project over on google code: http://code.google.com/p/crawler-commons/

Write Ken Krugler or the crawler-commons project if interested in adding commonalities or participating (or me offline if you'd like to know more about the meeting).

Thanks,
St.Ack

P.S. Gordon I tried sending you this stuff offlist but my mail must be stuck in your spam filters.  Go easy.






#6160 From: Bjarne Andersen <bja@...>
Date: Thu Nov 19, 2009 4:44 pm
Subject: Re: Re: Avoiding overloading webservers hosting many virtual servers
 
Our main challenge is that we need the queues separated by TLD (foo.com and bar.com) so that the quota-enforcer can limit the number of bytes for each TLD, but at the same time we need politeness calculated by IP number, so that thousands of TLDs on the same IP number do not overload the server.
Will this be easier to implement with Heritrix 3.x?

Best
Bjarne andersen
Netarchive.dk

Sent from my HTC Touch Pro


From: kristsi25 <kris@...>
Sent: 19 November 2009 17:09
To: archive-crawler@yahoogroups.com <archive-crawler@yahoogroups.com>
Subject: [archive-crawler] Re: Avoiding overloading webservers hosting many virtual servers

 

John pretty much sums up the problem.

The way I've dealt with this has been on a case by case basis. Each time we detect this situation, we override the frontier's "force-queue-assignment" setting to ensure that all the domains are dumped in the same politeness bucket.

This is quite simple when you are dealing with a multitude of sub-domains as you can just throw in the override on the parent domain.

- Kris

--- In archive-crawler@yahoogroups.com, John Lekashman <lekash@...> wrote:
>
> Hi,
>
> It is not actually possible to guarantee this.
> There is no real way to distinguish for sure where the actual physical
> hardware that hosts a name is.
>
> There has been some discussion of adding IP addresses to the politeness
> capability, which
> would do some of it.
>
> But, that would be defeated if the provider doing the service has given
> separate IP addresses
> to each virtual server, e.g. as would be done if they had ssl clients on
> each virtual machine.
>
> Another thing that can be considered is to extend politeness behavior
> into parent domains,
> e.g. act polite across all of:
> *.foo.com
>
> Which would again handle some of it, if the virtual servers all had a
> common parent.
> Of course, this would be wrong for large scale domains spread across
> multiple disparate
> locations.
>
> If your scope is relatively small, you can also put those things you
> detect as virtual servers in a separate
> crawl job, turn off in the other job, increase the inter arrival time,
> and limit the number of threads to slow it all down.
>
> These are all workarounds for the problem, that may help a little.
> There are probably others you can try.
>
> Its just that the Internet architecture was designed such that the
> bandwidth limiting
> should happen lower in the stack, like at the border router to that
> actual virtualized machine,
> or on the overloaded host itself, and most ISPs have no incentive to do
> so. (Some guy is using extra bandwidth, so I
> can bill more? Great. . .) Or no clue that they even should.
>
> John
>
>
> Søren Vejrup Carlsen wrote:
>
> > Hi all.
> > Is it possible in Heritrix 1.14.3 to avoid overloading webservers
> > hosting many virtual servers.
> > We currently have the problem that those webservers often choose to
> > ban our harvesters from their site.
> > And we would very much like to avoid that.
> >
> > best regards
> > ----------------------------------------------------------
> > Søren Vejrup Carlsen, NetarchiveSuite developer
> > Department of Digital Preservation, Royal Library, Copenhagen, Denmark
> > tlf: (+45) 33 47 48 41
> > email: svc@... <mailto:svc%40kb.dk>
> > ----------------------------------------------------------
> > Non omnia possumus omnes
> > --- Macrobius, Saturnalia, VI, 1, 35 -------
> >
> >
>


#6159 From: "kristsi25" <kris@...>
Date: Thu Nov 19, 2009 4:08 pm
Subject: Re: Avoiding overloading webservers hosting many virtual servers
 
John pretty much sums up the problem.

The way I've dealt with this has been on a case by case basis. Each time we
detect this situation, we override the frontier's "force-queue-assignment"
setting to ensure that all the domains are dumped in the same politeness bucket.

This is quite simple when you are dealing with a multitude of sub-domains as you
can just throw in the override on the parent domain.
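
Detecting the situation in the first place can be partly automated by grouping hostnames by the address they resolve to. The Python sketch below is a rough aid under stated assumptions, not a Heritrix feature: the function names are invented, the resolver is injectable so the logic can be tried without network access, and, as John notes, a shared IP is only a hint about shared hardware (and separate IPs prove nothing either).

```python
import socket
from collections import defaultdict


def group_hosts_by_ip(hosts, resolve=socket.gethostbyname):
    """Map each resolved IP address to the list of hostnames that point at it."""
    groups = defaultdict(list)
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError:
            continue  # unresolvable hosts are skipped in this sketch
        groups[ip].append(host)
    return dict(groups)


def shared_ip_groups(hosts, resolve=socket.gethostbyname, min_hosts=2):
    """Return only the IPs serving at least `min_hosts` of the given hostnames."""
    return {
        ip: sorted(names)
        for ip, names in group_hosts_by_ip(hosts, resolve).items()
        if len(names) >= min_hosts
    }


if __name__ == "__main__":
    # Hypothetical lookup table standing in for DNS.
    table = {"foo.example": "192.0.2.10", "bar.example": "192.0.2.10"}
    print(shared_ip_groups(table, resolve=table.__getitem__))
```

Each group returned is a candidate for a common queue-assignment override, so that all of its hosts end up in the same politeness bucket.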

- Kris

--- In archive-crawler@yahoogroups.com, John Lekashman <lekash@...> wrote:
>
> Hi,
>
> It is not actually possible to guarantee this.
> There is no real way to distinguish for sure where the actual physical
> hardware that hosts a name is.
>
> There has been some discussion of adding IP addresses to the politeness
> capability, which
> would do some of it.
>
> But, that would be defeated if the provider doing the service has given
> separate IP addresses
> to each virtual server, e.g. as would be done if they had ssl clients on
> each virtual machine.
>
> Another thing that can be considered is to extend politeness behavior
> into parent domains,
> e.g. act polite across all of:
> *.foo.com
>
> Which would again handle some of it, if the virtual servers all had a
> common parent.
> Of course, this would be wrong for large scale domains spread across
> multiple disparate
> locations.
>
> If your scope is relatively small, you can also put those things you
> detect as virtual servers in a separate
> crawl job, turn off in the other job, increase the inter arrival time,
> and limit the number of threads to slow it all down.
>
> These are all workarounds for the problem, that may help a little.
> There are probably others you can try.
>
> Its just that the Internet architecture was designed such that the
> bandwidth limiting
> should happen lower in the stack, like at the border router to that
> actual virtualized machine,
> or on the overloaded host itself, and most ISPs have no incentive to do
> so. (Some guy is using extra bandwidth, so I
> can bill more? Great. . .) Or no clue that they even should.
>
> John
>
>
> Søren Vejrup Carlsen wrote:
>
> > Hi all.
> > Is it possible in Heritrix 1.14.3 to avoid overloading webservers
> > hosting many virtual servers.
> > We currently have the problem that those webservers often choose to
> > ban our harvesters from their site.
> > And we would very much like to avoid that.
> >
> > best regards
> > ----------------------------------------------------------
> > Søren Vejrup Carlsen, NetarchiveSuite developer
> > Department of Digital Preservation, Royal Library, Copenhagen, Denmark
> > tlf: (+45) 33 47 48 41
> > email: svc@... <mailto:svc%40kb.dk>
> > ----------------------------------------------------------
> > Non omnia possumus omnes
> > --- Macrobius, Saturnalia, VI, 1, 35 -------
> >
> >
>

#6158 From: John Lekashman <lekash@...>
Date: Thu Nov 19, 2009 3:40 pm
Subject: Re: Avoiding overloading webservers hosting many virtual servers
 
Hi,

It is not actually possible to guarantee this.
There is no real way to distinguish for sure where the actual physical
hardware that hosts a name is.

There has been some discussion of adding IP addresses to the politeness
capability, which
would do some of it.

But, that would be defeated if the provider doing the service has given
separate IP addresses
to each virtual server, e.g. as would be done if they had ssl clients on
each virtual machine.

Another thing that can be considered is to extend politeness behavior
into parent domains,
e.g. act polite across all of:
*.foo.com

Which would again handle some of it, if the virtual servers all had a
common parent.
Of course, this would be wrong for large scale domains spread across
multiple disparate
locations.

If your scope is relatively small, you can also put those things you
detect as virtual servers in a separate
crawl job, turn them off in the other job, increase the inter-arrival time,
and limit the number of threads to slow it all down.

These are all workarounds for the problem, that may help a little.
There are probably others you can try.

It's just that the Internet architecture was designed such that the
bandwidth limiting
should happen lower in the stack, like at the border router to that
actual virtualized machine,
or on the overloaded host itself, and most ISPs have no incentive to do
so. (Some guy is using extra bandwidth, so I
can bill more? Great. . .) Or no clue that they even should.

John


Søren Vejrup Carlsen wrote:

> Hi all.
> Is it possible in Heritrix 1.14.3 to avoid overloading webservers
> hosting many virtual servers.
> We currently have the problem that those webservers often choose to
> ban our harvesters from their site.
> And we would very much like to avoid that.
>
> best regards
> ----------------------------------------------------------
> Søren Vejrup Carlsen, NetarchiveSuite developer
> Department of Digital Preservation, Royal Library, Copenhagen, Denmark
> tlf: (+45) 33 47 48 41
> email: svc@... <mailto:svc%40kb.dk>
> ----------------------------------------------------------
> Non omnia possumus omnes
> --- Macrobius, Saturnalia, VI, 1, 35 -------
>
>

#6157 From: Søren Vejrup Carlsen <svc@...>
Date: Thu Nov 19, 2009 3:02 pm
Subject: Avoiding overloading webservers hosting many virtual servers
 
Hi all.
Is it possible in Heritrix 1.14.3 to avoid overloading webservers hosting many
virtual servers?
We currently have the problem that those webservers often choose to ban our
harvesters from their site.
And we would very much like to avoid that.

best regards
---------------------------------------------------------------------------
Søren Vejrup Carlsen, NetarchiveSuite developer
Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 41
email: svc@...
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------

#6156 From: "parseram34" <avansssaa@...>
Date: Thu Nov 19, 2009 8:00 am
Subject: Re: wrong document "crawl-order" Heritrix
 
I reinstalled Heritrix and got the same message again after configuring my first
job:
Wrong document type 'crawl-order' in
'file:/home/name/heritrix-1.14.3/jobs/default-20091119071233556/order.xml',
line: 1, column: 160

<?xml version="1.0" encoding="UTF-8"?><crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
   <meta>
     <name>default</name>
     <description>Default Profile</description>


-------------------------------------
--- In archive-crawler@yahoogroups.com, "parseram34" <avansssaa@...> wrote:
>
> I rebooted, and now the http://127.0.0.1:8080 address shows "Failed to
> connect". http://localhost:8080 doesn't work either.
> When I start the terminal it shows these errors:
> bash:export:' maven-1.1' : not a valid identifier.
> bash:export:' home/h/heritrix-1.14.3' : not a valid identifier.
>
> The "wrong crawl-order" (lower-case) error appeared when I started the first crawl.
> I edited the file, just adding the contact email address and the user-agent
> [name] (+[http-url])[optional-etc].
>
> Please, how could I fix it?
>
> --- In archive-crawler@yahoogroups.com, Gordon Mohr <gojomo@> wrote:
> >
> > When and where does this error appear? (For example: at the time
> > Heritrix is launched, at the time you try to start a crawl, at the time
> > you edit settings, etc.)
> >
> > Have you edited the order.xml outside of Heritrix at all?
> >
> > Does the error message really include a capitalized 'Crawl-order' while
> > the actual file includes a lowercase 'crawl-order'?
> >
> > - Gordon @ IA
> >
> > parseram34 wrote:
> > > Could you please help me..I get the following Error message in Heritrix:
> > > Wrong document type 'Crawl-order' in
'file:/heritrix-1.14.3/jobs/default-2009456783456/order.xml', line:1, column:160
> > >
> > > And the order.xml-file looks like this on line 1:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?><crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
> > > <meta>
> > >
> > >
> > >
> > > ------------------------------------
> > >
> > > Yahoo! Groups Links
> > >
> > >
> > >
> >
>

#6155 From: "Tomas Ukkonen" <tomas.ukkonen@...>
Date: Wed Nov 18, 2009 3:52 pm
Subject: RE: Re: Contributing code to Heritrix?
 
Hi Kris,

Thank you for your reply.

I will use JIRA as you suggested and attach patches against the latest 1.14.x
revision from the Heritrix repository.



Regards,
--
Tomas Ukkonen
Information Systems Specialist
The National Library of Finland


From: archive-crawler@yahoogroups.com
[mailto:archive-crawler@yahoogroups.com] On Behalf Of kristsi25
Sent: Monday, November 16, 2009 12:39 PM
To: archive-crawler@yahoogroups.com
Subject: [archive-crawler] Re: Contributing code to Heritrix?

 


I'd suggest opening up an issue (probably multiple issues in your case) in
the crawler issue base (https://webarchive.jira.com/browse/HER) and to then
attach your patch to the issue along with whatever explanation/rationale you
have for it.

- Kris

--- In archive-crawler@yahoogroups.com, "Tomas Ukkonen" <tomas.ukkonen@...>
wrote:
>
> Hi
>
> In National Library of Finland we have made some improvements to Heritrix:
>
> - document classification (content-based) deciderules support
> (e.g. language-based filtering of HTML document made of multiple
> frames/files)
> - fetching WHOIS data (FetchWHOIS fetch-module)
> - some minor bug fixes/enhancements
>
>
> * I wonder which is the preferred way to contribute these changes to
> Heritrix?
> * To whom should I send patches and in what format?
>
>
> The changes have been made to the 1.14.x branch.
>
>
> --
> Tomas Ukkonen
> Information Systems Specialist
> The National Library of Finland
>

#6154 From: "Alexis" <alexis@...>
Date: Tue Nov 17, 2009 10:56 pm
Subject: Internet Archive needs a world wide crawl tech lead
 
Hi,

IA is hiring a tech lead for our new web crawling initiative.  You can read the
job description here:

http://www.archive.org/about/webjobs.php#wwcengineer

If you're interested in the position, feel free to send your resume directly to
me at alexis[at]archive.org.

Thanks,
Alexis Rossi

#6153 From: takeru sasaki <sasaki.takeru@...>
Date: Mon Nov 16, 2009 11:27 am
Subject: IDN support of heritrix
sasaki.takeru@...
 
Hi,

I would like to know about IDN support in Heritrix.
("Internationalized domain name"
http://en.wikipedia.org/wiki/Internationalized_Domain_Name)

I tried to add a Japanese IDN host to the seeds, and it failed to crawl.
On the "seed report" page, I got this message:
-----   http://myheritrix/reports/seeds.jsp?job=xxxxxxx
Some items in seed specification were ignored. This may not indicate
any problem, but the ignored items are displayed here for reference:
-----

Then I tried adding the Punycode form of the failed host.
That succeeded in the seed report, and crawling of relative links
looks fine...

I want to know:
  - Can I add an IDN URL to the seeds?
  - Does Heritrix run correctly when an IDN URL is found in crawled HTML?
  - Does Heritrix run correctly when there are non-ASCII chars in the path string?
  - Are there any other non-ASCII URL problems?
  - If NOT, is it hard to add IDN support to Heritrix? I need it.

I am using version 1.14.3.

Thank you for reading my message.

takeru

#6152 From: "kristsi25" <kris@...>
Date: Mon Nov 16, 2009 10:38 am
Subject: Re: Contributing code to Heritrix?
kristsi25
 
I'd suggest opening up an issue (probably multiple issues in your case) in the
crawler issue base (https://webarchive.jira.com/browse/HER) and to then attach
your patch to the issue along with whatever explanation/rationale you have for
it.

- Kris

--- In archive-crawler@yahoogroups.com, "Tomas Ukkonen" <tomas.ukkonen@...>
wrote:
>
> Hi
>
> In National Library of Finland we have made some improvements to Heritrix:
>
> - document classification (content-based) deciderules support
>   (e.g. language-based filtering of HTML document made of multiple
> frames/files)
> - fetching WHOIS data (FetchWHOIS fetch-module)
> - some minor bug fixes/enhancements
>
>
> * I wonder which is the preferred way to contribute these changes to
> Heritrix?
> * To whom should I send patches and in what format?
>
>
> The changes have been made to the 1.14.x branch.
>
>
> --
> Tomas Ukkonen
> Information Systems Specialist
> The National Library of Finland
>

#6151 From: Matthew Warhaftig <mwarhaftig@...>
Date: Sat Nov 14, 2009 2:04 am
Subject: Re: Recrawling In Heritrix3
matthewwarha...
 
Good advice, thank you Gordon.  Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload) got the recrawl working.

Thanks,
Matt


On Nov 10, 2009, at 5:33 PM, Gordon Mohr wrote:

 

Your setup looks generally correct. Are you perhaps forgetting to both
declare the beans by name, *and* insert them by <ref> into the <list> of
the chain bean?

Some other comments that may help in configuring these features in H3:

Matthew Warhaftig wrote:
> In H3 I am trying to setup crawl jobs that
> use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to
> discard duplicate content. I can get H1 to recrawl correctly but the
> same technique is not storing a history and finding duplicates for me in
> H3 (my job setup is based on these
> postings: https://webarchive.jira.com/wiki/display/Heritrix/Feature+Notes+-+1.12.0 & http://tech.groups.yahoo.com/group/archive-crawler/message/5920).
>
> For the storing job I added the following to the default
> H3 crawler-beans.cxml file. In the Fetch Chain just after
> the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>> class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property
>> name="historyLength" value="30" /> </bean>
>>
>> <bean id="persistStoreProcessor"
>> class="org.archive.modules.recrawl.PersistStoreProcessor" > </bean>

Unrelated to your issues, in the near future it will be better to put
the persistStoreProcessor as the first in the 'dispositionChain' rather
than the last position of the 'fetchChain'. Essentially, the
dispositionChain's activities will be atomic with regard to
checkpointing -- any URI that starts the dispositionChain will finish it
before a checkpoint is stored -- but the fetchChain's activities will not.

Also unrelated but good to know: H3's configuration makes it much easier
for different components to optionally use separate BDB environments on
disk. Two reasons one might want to do so: (1) distribute the IO costs
over different disks; (2) keep distinct data separate on disk for
backup/migration-to-new-jobs, as with the difference between
queues/state of the running crawl, and URI history for deduplication
purposes.

For example, you could declare...

<bean id="persistStoreBdb" class="org.archive.bdb.BdbModule"
autowire-candidate="false">
<property name="dir" value="history"/>
</bean>

<bean id="persistStoreProcessor"
class="org.archive.modules.recrawl.PersistStoreProcessor">
<property name="bdbModule">
<ref bean="persistStoreBdb"/>
</property>
</bean>

...and then persistent URI history will collect in a 'history' directory
(which could be a full path to anywhere convenient) rather than being
mixed-in with other crawler state.

(The reason for the 'autowire-candidate="false"' is to prevent this
BdbModule from being a competitor to the default for all the objects
that expect one distinguished instance to be available for autowiring.)

> Then for another job to use this stored history I added the following to
> the default H3 crawler-beans.cxml file. In the Fetch Chain just after the
> "preconditions" bean:
>> <bean id="persistLoadProcessor"
>> class="org.archive.modules.recrawl.PersistLoadProcessor"> <property
>> name="preloadSource"
>> value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state"
>> /> </bean>
> And just after the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>> class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property
>> name="historyLength" value="30" /> </bean>

Note that 'preloadSource' causes the processor to scan the named log or
directory, and copy all its contents into its current history database.
If using this, make sure you're not pointing to the same path it's
currently

You could also skip the 'preload' and directly point the
PersistLoadProcessor to an existing history store:

<bean id="persistLoadBdb" class="org.archive.bdb.BdbModule"
autowire-candidate="false">
<property name="dir"
value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state"/>
</bean>

<bean id="persistLoadProcessor"
class="org.archive.modules.recrawl.PersistLoadProcessor">
<property name="bdbModule">
<ref bean="persistLoadBdb"/>
</property>
</bean>

(Except for the 'preload', if specified, the PersistLoadProcessor only
reads from its given BdbModule/environment, so you could reuse a prior
crawl's persist-store target without damaging it.)

There's currently a problem with having both a PersistLoadProcessor and
PersistStoreProcessor in the same crawl using the same
BdbModule/directory [HER-1706], but that should work by the time of the
H3 official release.

Hope this helps,

- Gordon @ IA

> Am I declaring these beans correctly?
>
> Thanks,
> Matt
>
>
>



#6150 From: "Tomas Ukkonen" <tomas.ukkonen@...>
Date: Fri Nov 13, 2009 11:38 am
Subject: Contributing code to Heritrix?
tomas.ukkonen@...
 
Hi

At the National Library of Finland we have made some improvements to Heritrix:

- document classification (content-based) deciderules support
   (e.g. language-based filtering of HTML documents made of multiple
frames/files)
- fetching WHOIS data (FetchWHOIS fetch-module)
- some minor bug fixes/enhancements


* I wonder which is the preferred way to contribute these changes to
Heritrix?
* To whom should I send patches and in what format?


The changes have been made to the 1.14.x branch.


--
Tomas Ukkonen
Information Systems Specialist
The National Library of Finland

#6149 From: raffaele messuti <raffaele@...>
Date: Fri Nov 13, 2009 10:20 am
Subject: Re: heritrix as a spider library?
raffaele@...
 
On Nov 12, 2009, at 10:36 AM, pierce403 wrote:
> Could anyone point me at some documentation that might be helpful for what I
> am trying to do (a good tutorial, or maybe just a tip at which classes I should
> be looking into in the javadoc)? Maybe some suggestions for an alternative?
> Thanks for any help you could provide.

not in java, but ruby:  http://anemone.rubyforge.org/

just use from shell:
$ anemone url-list http://crawler.archive.org > seeds.txt




--
raffaele@...

#6148 From: "pierce403" <pierce403@...>
Date: Thu Nov 12, 2009 9:36 am
Subject: heritrix as a spider library?
pierce403
 
I am looking for a simple way to spider web pages from within an app I am
working on.  I know heritrix is not intended to be used as a library, but would
using it as one be feasible?

As a proof of concept, I would like to just write an app that you can point to a
domain, and have it print out all the URLs it can find in the domain.  I've been
digging through the code for a couple hours, and am still not really seeing how
I would do something like that.  Is heritrix useful as a web spider library, or
should I just back away slowly?

Heritrix seems to contain many features that I would like to take advantage of
later, so I really don't want to rewrite all this from scratch, however I also
don't want to spend all my time trying to hammer a square peg into a round hole.

Could anyone point me at some documentation that might be helpful for what I am
trying to do (a good tutorial, or maybe just a tip at which classes I should be
looking into in the javadoc)?  Maybe some suggestions for an alternative? 
Thanks for any help you could provide.

    - DEAN

#6147 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 10, 2009 10:53 pm
Subject: Re: heritrix2 bad html parsing?
gojomo
 
Heritrix cannot execute Javascript, so its link-extraction with respect
to Javascript uses a crude heuristic of trying strings that might be
relative URIs (having internal '/' or '.' characters).

This can be turned off via settings on ExtractorHTML.

Also, the latest Heritrix 1.14.3 release and upcoming Heritrix 3 release
have special cases to suppress some of the most common misinterpreted
Javascript strings, such as "text/javascript", to lessen this problem.

- Gordon @ IA

nukleonrus wrote:
> We got an email from a website owner that encountered many attempts from
> us. Heritrix ran with default configuration
> We searched crawl.log file for the details and this is what we found(I am
> showing one real example):
>
> 1. heritrix started to crawl this url:
> http://www.zlatesipy.cz/oddil/akce/2007-2008/ubikace/
> =>without any problems, html code 200
>
> 2. from that url, it went into this new url, with refer set as a link from
> point no. 1
> http://www.zlatesipy.cz/oddil/akce/2007-2008/ubikace/text/javascript/
> =>heritrix reported html code 302 (Found)
>
> 3. the link in point no. 2 was then set as a referer and also as a "newly"
> crawled link with html error code 404
>
> 4. heritrix continued with new url links
>
> the problem is, that link in point no. 2 doesn't even exists in source code
> from link from no. 1. It is so strange to me, that heritrix found this url even
> that it doesnt exist.
> Maybe the JSExtractor processor made a unexplainable mistake?
>
> thanks for reply

#6146 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 10, 2009 10:43 pm
Subject: Re: warc files left open H3-beta
gojomo
 
If your curl-script has the effect of pushing the 'terminate' button in
the web UI, then the crawl should (after a little time for any pending
fetches to complete/timeout) close cleanly, with all WARCs also closed
cleanly and renamed to lose the '.open' suffix.

If you can reliably create a situation where the WARCs are left '.open',
and the crawler JVM is still running and shows the job run as having
completed, please let us know the steps you took, and any suspicious log
output that might indicate a problem with job-wrapup.

Generally, any job terminated through the UI should, after some number
of timeout minutes, end cleanly, with the only drawback being not
crawling those URIs waiting to be visited/discovered. (Actually killing
the process or powering down the machine could leave other open files
incompletely written, but this typically manifests as just some missing
lines at the end of logs, or a last-record in WARCs that is
truncated/corrupt, with all other records still usable.)

- Gordon @ IA

Pranay Pandey wrote:
>
>
>
> I have written a script that controls the build, launch and termination
> of jobs using curl commands. I pass along a parameter to the script
> telling it how long the job has to run before it terminates it.
>
> I noticed that many of the broad crawl jobs (perhaps due to forced and
> hence abrupt termination) leave the later-timestamped warcs in open
> state. Is there a way to close these warcs? What are the other possible
> drawbacks of prematurely terminating jobs like this?
>
> Thanks,
> Pranay
>
>
>
>
>

#6145 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 10, 2009 10:33 pm
Subject: Re: Recrawling In Heritrix3
gojomo
 
Your setup looks generally correct. Are you perhaps forgetting to both
declare the beans by name, *and* insert them by <ref> into the <list> of
the chain bean?
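
A minimal sketch of those two steps follows. It assumes the fetch chain in the default profile is a bean with id 'fetchProcessors' and class org.archive.modules.FetchChain; both of those names, and the elided neighbouring entries, are assumptions to check against your own crawler-beans.cxml. Only the 'preconditions' and 'fetchHttp' positions come from this thread.

```xml
<!-- Step 1: declare each processor as a named top-level bean. -->
<bean id="persistLoadProcessor"
      class="org.archive.modules.recrawl.PersistLoadProcessor"/>
<bean id="fetchHistoryProcessor"
      class="org.archive.modules.recrawl.FetchHistoryProcessor">
  <property name="historyLength" value="30"/>
</bean>

<!-- Step 2: reference those beans from the chain's list.
     ASSUMPTION: the chain bean id and class shown here match the
     default profile; verify them in your own configuration. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- default entries before this point stay unchanged -->
      <ref bean="preconditions"/>
      <ref bean="persistLoadProcessor"/>
      <!-- default entries between these stay unchanged -->
      <ref bean="fetchHttp"/>
      <ref bean="fetchHistoryProcessor"/>
      <!-- default entries after this point stay unchanged -->
    </list>
  </property>
</bean>
```

A bean that is declared but never referenced from a chain list is instantiated but never run against any URI, which would match the "no errors, no history" symptom.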

Some other comments that may help in configuring these features in H3:

Matthew Warhaftig wrote:
> In H3 I am trying to setup crawl jobs that
> use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to
> discard duplicate content.  I can get H1 to recrawl correctly but the
> same technique is not storing a history and finding duplicates for me in
> H3 (my job setup is based on these
> postings: https://webarchive.jira.com/wiki/display/Heritrix/Feature+Notes+-+1.12.0 &
> http://tech.groups.yahoo.com/group/archive-crawler/message/5920).
>
> For the storing job I added the following to the default
> H3 crawler-beans.cxml file.  In the Fetch Chain just after
> the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>> class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property
>> name="historyLength" value="30" /> </bean>
>>
>> <bean id="persistStoreProcessor"
>> class="org.archive.modules.recrawl.PersistStoreProcessor" > </bean>

Unrelated to your issues, in the near future it will be better to put
the persistStoreProcessor as the first in the 'dispositionChain' rather
than the last position of the 'fetchChain'. Essentially, the
dispositionChain's activities will be atomic with regard to
checkpointing -- any URI that starts the dispositionChain will finish it
before a checkpoint is stored -- but the fetchChain's activities will not.
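
As a sketch only: assuming a build in which the disposition chain exists as a bean with id 'dispositionProcessors' and class org.archive.modules.DispositionChain (both names are assumptions to verify against your crawler-beans.cxml), that placement might look like this.

```xml
<!-- ASSUMPTION: bean id and class of the disposition chain. -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <!-- persistStoreProcessor goes first, so that storing history
           happens inside the checkpoint-atomic part of processing -->
      <ref bean="persistStoreProcessor"/>
      <!-- the profile's existing disposition entries follow unchanged -->
    </list>
  </property>
</bean>
```

If persistStoreProcessor is moved here, its <ref> should be removed from the fetch chain list so that it does not run twice per URI.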

Also unrelated but good to know: H3's configuration makes it much easier
for different components to optionally use separate BDB environments on
disk. Two reasons one might want to do so: (1) distribute the IO costs
over different disks; (2) keep distinct data separate on disk for
backup/migration-to-new-jobs, as with the difference between
queues/state of the running crawl, and URI history for deduplication
purposes.

For example, you could declare...

<bean id="persistStoreBdb" class="org.archive.bdb.BdbModule"
autowire-candidate="false">
    <property name="dir" value="history"/>
</bean>

<bean id="persistStoreProcessor"
class="org.archive.modules.recrawl.PersistStoreProcessor">
    <property name="bdbModule">
      <ref bean="persistStoreBdb"/>
    </property>
   </bean>

...and then persistent URI history will collect in a 'history' directory
(which could be a full path to anywhere convenient) rather than being
mixed-in with other crawler state.

(The reason for the 'autowire-candidate="false"' is to prevent this
BdbModule from being a competitor to the default for all the objects
that expect one distinguished instance to be available for autowiring.)

> Then for another job to use this stored history I added the following to
> the default H3 crawler-beans.cxml file. In the Fetch Chain just after the
> "preconditions" bean:
>> <bean id="persistLoadProcessor"
>> class="org.archive.modules.recrawl.PersistLoadProcessor"> <property
>> name="preloadSource"
>> value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state"
>> /> </bean>
> And just after the "fetchHttp" bean:
>> <bean id="fetchHistoryProcessor"
>> class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property
>> name="historyLength" value="30" /> </bean>

Note that 'preloadSource' causes the processor to scan the named log or
directory, and copy all its contents into its current history database.
If using this, make sure you're not pointing to the same path it's
currently

You could also skip the 'preload' and directly point the
PersistLoadProcessor to an existing history store:

<bean id="persistLoadBdb" class="org.archive.bdb.BdbModule"
autowire-candidate="false">
    <property name="dir"
value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state"/>
</bean>

<bean id="persistLoadProcessor"
   class="org.archive.modules.recrawl.PersistLoadProcessor">
   <property name="bdbModule">
      <ref bean="persistLoadBdb"/>
    </property>
</bean>

(Except for the 'preload', if specified, the PersistLoadProcessor only
reads from its given BdbModule/environment, so you could reuse a prior
crawl's persist-store target without damaging it.)

There's currently a problem with having both a PersistLoadProcessor and
PersistStoreProcessor in the same crawl using the same
BdbModule/directory [HER-1706], but that should work by the time of the
H3 official release.
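
Once that is fixed, a combined load-and-store setup might look like the sketch below, with one BdbModule shared by both processors. The bean id 'historyBdb' is a hypothetical name chosen for illustration; the classes and property names are the ones used earlier in this message.

```xml
<!-- One history environment, kept out of autowiring so it does not
     compete with the crawl's default BdbModule. -->
<bean id="historyBdb" class="org.archive.bdb.BdbModule"
      autowire-candidate="false">
  <property name="dir" value="history"/>
</bean>

<!-- Loads prior fetch history for each URI before it is fetched. -->
<bean id="persistLoadProcessor"
      class="org.archive.modules.recrawl.PersistLoadProcessor">
  <property name="bdbModule">
    <ref bean="historyBdb"/>
  </property>
</bean>

<!-- Stores the updated history after the URI has been processed. -->
<bean id="persistStoreProcessor"
      class="org.archive.modules.recrawl.PersistStoreProcessor">
  <property name="bdbModule">
    <ref bean="historyBdb"/>
  </property>
</bean>
```

Both processors still have to be referenced by <ref> from the <list> of their chain beans, or they will never run.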

Hope this helps,

- Gordon @ IA


> Am I declaring these beans correctly?
>
> Thanks,
> Matt
>
>
>

#6144 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 10, 2009 9:19 pm
Subject: Re: Re: Question about QueueOverbudgetDecideRule
gojomo
 
There are at least 2 ways to limit the number of URIs Heritrix fetches
from a host:

   - QuotaEnforcer, which discards extra URIs when they come up for
fetching once certain tallies reach configured quotas
   - frontier budgeting, which can either (a) retire a queue once its
'expenditure' goes over a budget threshold (even as the discovered URIs
continue to queue up, in case you want to increase the budget later); or
(b) using the contributed QueueOverbudgetDecideRule, discard as
out-of-scope URIs once a queue has exhausted its budget.

Each has different tradeoffs. The budgeting-based techniques, in
particular, may not offer the exact count-enforcement you seem to want,
because the 'queue expenditures' are not exactly 1-per-successful-URI.
(They're close, with some usual queue-assignment and cost-assignment
policies, but not exact.)

Looking at your order.xml, you've configured the
QueueOverbudgetDecideRule in a way where it can't have the desired
effect. Its 'decision' is set to ACCEPT, so if the evaluated URI is
assigned to an overbudget queue, it will return the ACCEPT decision.
And, installed on LinksScoper, any decision other than REJECT simply
means "OK to run this processor". Finally, as a rule on a processor, it
runs against the URI being processed, *not* the outlinks discovered. So
even as a REJECT rule, it would only prevent LinksScoper activity on
URIs that come from already-overbudget queues. That might approximate
what you want -- no more discovered outlinks from such URIs -- but you
would miss URIs destined for other, under-budget queues.

I believe the intended use of the QueueOverbudgetDecideRule is inside a
DecidingScope, like the 'deciding-defaults' profile. (You can convert
such a scope to a 'broad' crawl by making its initial rule an
AcceptDecideRule rather than RejectDecideRule, and discarding the
following SurtPrefixedDecideRule for ACCEPTing some URIs.) Then, it will
apply to all URIs discovered and considered for inclusion, rejecting
those destined for already-overbudget queues.

Hope this helps,

- Gordon @ IA


olintocattaneo wrote:
> Replying to myself just in case anyone competent missed this.
>
> Olinto
>
> --- In archive-crawler@yahoogroups.com, "olintocattaneo"
> <olintocattaneo@...> wrote:
>> Hello
>>
>> I'm trying to get QueueOverbudgetDecideRule to work but I don't
>> seem to be able to do this. Is this module still functional or
>> maybe I have added it to a wrong place?
>>
>> Here is my order file:
>> http://ihave.bushiq.com/stuff/order_20091022065616.xml
>>
>> What I want to accomplish is that I just want to crawl 5 pages from
>> each host. I tried QuotaEnforcer initially but this module is
>> really inefficient since when it finds new links to a host that has
>> reached its quota it will still try to check them out
>> from "already-seen" database, will add them to queue if there are
>> none and when the queue goes active and it doesn't find them it
>> will write them to log file. This means that the crawling is using
>> unnecessary amount of resources.
>>
>> If I want to crawl 5 pages from each domain it should do just that
>> - 1. When extracting links from URL check if domain is already in
>> the already-seen database, if it is check if it has reached quota,
>> when it is not then add the links to queue but if it has then just
>> drop them or write them to log file (would be nice to be able to
>> specify this too).
>>
>> I'm thinking that this is not possible right now and although I
>> have spent weeks researching this very fine crawler it seems that
>> it is not possible, maybe I'm just doing something wrong though. I
>> can achieve this behavior with Mnogosearch but compared to Heritrix
>> it is not scalable and flexible enough for me.
>>
>> I'm sure there are other people too who are interested about
>> configuring Heritrix this way since limiting URL's per host/domain
>> is something everyone would probably want to do and I'm sure that
>> they are already doing this but they might be doing this as
>> inefficiently as me.
>>
>> Regards
>>
>> Olinto
>>

#6143 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 10, 2009 8:25 pm
Subject: Re: SV: Heritrix 3.0.0-beta test release now available
gojomo
 
Søren Vejrup Carlsen wrote:
> Hi Gordon.
> I can't find the tool to migrate 1.X configurations to 3.X style
> configurations.
> I have downloaded the heritrix-3.0.0-beta-dist.tar.gz from
> http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix/3.0.0-beta/

The class is org.archive.crawler.migrate.MigrateH1to3Tool, and basic
notes on its use are available at:

https://webarchive.jira.com/wiki/display/Heritrix/H3+Dev+Notes+for+Crawl+Operators#H3DevNotesforCrawlOperators-MigrateH1to3Tool

Note that initially this tool is very rudimentary -- mirroring changed
settings from a H1 configuration, based on our default profile, to an H3
configuration, based on the new default profile.

However, it should report those H1 settings that it cannot translate.

Please let us know the major configuration customizations you most
commonly use, but are not yet handled, so we can prioritize support for
those. (If you want to send me some example order.xml files and/or
settings/ subdirectories, that'd be welcome, as well.)

- Gordon @ IA

#6142 From: "nukleonrus" <rusnack@...>
Date: Mon Nov 9, 2009 9:19 pm
Subject: heritrix2 bad html parsing?
nukleonrus
 
We got an email from a website owner who saw many request attempts from
us. Heritrix ran with the default configuration.
We searched the crawl.log file for details, and this is what we found (I am
showing one real example):

1. Heritrix started to crawl this URL:
http://www.zlatesipy.cz/oddil/akce/2007-2008/ubikace/
=> without any problems, HTTP status code 200

2. From that URL, it went to this new URL, with the referer set to the link
from point 1:
http://www.zlatesipy.cz/oddil/akce/2007-2008/ubikace/text/javascript/
=> Heritrix reported HTTP status code 302 (Found)

3. The link in point 2 was then set as a referer and also as a "newly"
crawled link, with HTTP error code 404

4. Heritrix continued with new URL links

The problem is that the link in point 2 doesn't even exist in the source code
of the page from point 1. It seems strange to me that Heritrix found this URL
even though it doesn't exist.
Maybe the JSExtractor processor made an unexplainable mistake?

Thanks for any reply

#6141 From: Pranay Pandey <sspranay@...>
Date: Mon Nov 9, 2009 6:53 pm
Subject: warc files left open H3-beta
sspranay
 

I have written a script that controls the build, launch and termination of jobs using curl commands. I pass along a parameter to the script telling it how long the job has to run before it terminates it.

I noticed that many of the broad crawl jobs (perhaps due to forced and hence abrupt termination) leave the later-timestamped warcs in open state. Is there a way to close these warcs? What are the other possible drawbacks of prematurely terminating jobs like this?

Thanks,
Pranay 


#6140 From: Matthew Warhaftig <mwarhaftig@...>
Date: Sun Nov 8, 2009 8:20 pm
Subject: Recrawling In Heritrix3
mwarhaftig@...
 
Hi,

In H3 I am trying to setup crawl jobs that use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to discard duplicate content.  I can get H1 to recrawl correctly but the same technique is not storing a history and finding duplicates for me in H3 (my job setup is based on these postings: https://webarchive.jira.com/wiki/display/Heritrix/Feature+Notes+-+1.12.0 & http://tech.groups.yahoo.com/group/archive-crawler/message/5920).

For the storing job I added the following to the default H3 crawler-beans.cxml file.  In the Fetch Chain just after the "fetchHttp" bean:
<bean id="fetchHistoryProcessor" class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property name="historyLength" value="30" /> </bean>

<bean id="persistStoreProcessor" class="org.archive.modules.recrawl.PersistStoreProcessor" > </bean>

Then for another job to use this stored history I added the following to the default H3 crawler-beans.cxml file. In the Fetch Chain just after the "preconditions" bean:
<bean id="persistLoadProcessor" class="org.archive.modules.recrawl.PersistLoadProcessor"> <property name="preloadSource" value="/Users/mattwarhaftig/Documents/heritrix-3.0.0-SNAPSHOT/jobs/basic/state" /> </bean>
And just after the "fetchHttp" bean:
<bean id="fetchHistoryProcessor" class="org.archive.modules.recrawl.FetchHistoryProcessor" > <property name="historyLength" value="30" /> </bean>

Am I declaring these beans correctly?

Thanks,
Matt
