archive-crawler


Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English
Messages

Messages 180 - 209 of 8125
#180 From: Steen Christensen <steensc42@...>
Date: Mon Nov 24, 2003 1:32 pm
Subject: Crawl of 907 danish seeds
steensc42
 

For the last couple of years, an attempt has been made during week 46 to crawl approx. 900 Danish URLs deemed interesting.

This year we tried to use the Heritrix crawler.

 

The following is a brief summary of some of the experiences:

 

Total number of seed URLs: 907

 

Harvest start: 11.11.2003-17

Harvest end: 14.11.2003-12

 

Harvest time: 67 hours

Total amount of data harvested: 4 GB (compressed), 14 GB (uncompressed)

Measured compression factor: 3.5.

Harvest bandwidth: 14 GB/67 hours = 464 Kbit/sec
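
The compression factor and bandwidth figures can be re-derived as a quick check. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes, 1 Kbit = 1000 bits); the function and variable names are illustrative only.

```python
# Re-derive the reported compression factor and average bandwidth.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 Kbit = 1000 bits).
UNCOMPRESSED_GB = 14
COMPRESSED_GB = 4
HARVEST_HOURS = 67


def compression_factor(uncompressed_gb, compressed_gb):
    """Ratio of uncompressed to compressed size."""
    return uncompressed_gb / compressed_gb


def bandwidth_kbit_per_sec(gigabytes, hours):
    """Average transfer rate in Kbit/sec for a given volume and duration."""
    bits = gigabytes * 10**9 * 8
    seconds = hours * 3600
    return bits / seconds / 1000


print(compression_factor(UNCOMPRESSED_GB, COMPRESSED_GB))
print(round(bandwidth_kbit_per_sec(UNCOMPRESSED_GB, HARVEST_HOURS)))
```

With binary units (1 GB = 2^30 bytes) the bandwidth would come out somewhat higher, so the reported 464 Kbit/sec is consistent with decimal units.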

 

Coverage analysis:

 

367 (40% of the 907) seed URLs were investigated more closely.

265 of these were valid sites that should be harvested.

101 (38%) were completely missed by the harvester.

 58 (22%) were partially harvested (typically only front-page harvested).

106 (40%) were satisfactorily harvested.
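
The percentages above are taken relative to the 265 valid sites, not the 367 investigated seeds. A small sketch that recomputes them from the reported counts (the dictionary keys are illustrative labels):

```python
# Recompute the coverage percentages from the counts in the report.
VALID_SITES = 265
counts = {"missed": 101, "partial": 58, "satisfactory": 106}


def percentages(counts, total):
    """Return each count as a whole-number percentage of total."""
    return {name: round(100 * n / total) for name, n in counts.items()}


# The three categories should account for every valid site.
assert sum(counts.values()) == VALID_SITES
print(percentages(counts, VALID_SITES))
```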

 

 

Observed problems:

 

Relative image URLs seem not to be stored

Example:

<img src="../Images/logo.gif" alt="Logo">
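
For reference, a relative link like this is normally resolved against the URL of the page that contains it. The following is a minimal sketch using Python's standard library, not Heritrix code; the page URL www.example.dk is a made-up placeholder and not one of the seeds.

```python
from urllib.parse import urljoin


def resolve_link(page_url, link):
    """Resolve a possibly relative link against the URL of the page it was found on."""
    return urljoin(page_url, link)


# Hypothetical page URL, used only for illustration.
print(resolve_link("http://www.example.dk/sub/index.html", "../Images/logo.gif"))
```

If the resolved URL never shows up in the crawl log, the link was either not extracted or not resolved; if it shows up with an error, the problem lies in fetching or storing.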

 

Complex framesets:

Example:

                      http://www.centrumdemokraterne.dk (center frame not crawled)

http://www.chilinet.dk

 

 

A large number of city-sites using the same framework were not harvested:

                      http://www.bynet.dk

                      http://esbjerg.bynet.dk/

 

 

Explicit port definitions seem to confuse Heritrix

                      http://dk.yahoo.com:80
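
A common way for a crawler to avoid treating such a URL as a different host is to drop a port that equals the scheme's default. The sketch below illustrates that general technique only; it is an assumption about a possible fix, not a description of what Heritrix does, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def drop_default_port(url):
    """Return url without an explicit port when it is the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Rebuild the authority from the host name alone.
        # Simplification: userinfo and IPv6 literals are not handled here.
        return urlunsplit((parts.scheme, parts.hostname, parts.path,
                           parts.query, parts.fragment))
    return url


print(drop_default_port("http://dk.yahoo.com:80"))
```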

 

                     

PDF files seem not to be stored correctly (Word files are OK).

 



#181 From: Gordon Mohr <gojomo@...>
Date: Tue Nov 25, 2003 12:50 am
Subject: Re: Crawl of 907 danish seeds
gojomo
 
Thanks for this report; these are curious results, and we'll
want to track down every place where an expected URI/resource was
missed.

Was the code grabbed from CVS and built just prior to the run?
(We are still making significant destabilizing changes regularly.)

What configuration options were used? (Can you forward your crawl-order
file?) What command-line was used to launch the crawler?

Without listing all the seeds, can you say whether they were always
site roots (http://www.site.org) or sometimes other entry pages
(http://www.site.org/subsection/)?

Steen Christensen wrote:
  > Total number of seed URL’s: 907
  >
  > Harvest start: 11.11.2003-17
  > Harverst end: 14.11.2003-12
  >
  > Harvest time: 67 hours
  >
  > Total amount of data harvested: 4 GB (compressed), 14 GB (uncompressed)

How many total resources were successfully collected?

Did the crawler run out of pages to crawl, or hit an error or user-abort?

The current version of Heritrix will still eventually hit, and be
stopped, by memory-footprint limits.

How soon these limits are hit depends on the available memory, diversity
of URIs crawled, and whether you've enabled the experimental disk-based
structure for tracking "alreadyIncluded" items. (This is currently done
in code, in the Frontier class initialization method.)

Using the in-memory only implementation (MemLongFPSet), on a 2GB crawl
machine, we recently ran a crawl of ~250 sites which gathered 4.8 million
URIs over 3 days before hitting implementation problems.
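
To illustrate why memory use grows with the diversity of URIs, here is a generic sketch of an in-memory "already included" set that stores one 64-bit fingerprint per distinct URI. It is not the MemLongFPSet implementation, whose internals are not described in this thread; the class name and the use of MD5 for fingerprinting are assumptions.

```python
import hashlib


class FingerprintSet:
    """In-memory set of 64-bit URI fingerprints (illustrative, not Heritrix code)."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(uri):
        # Use the first 8 bytes of an MD5 digest as a 64-bit fingerprint.
        digest = hashlib.md5(uri.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, uri):
        """Return True if the URI was new, False if it was already included."""
        fp = self.fingerprint(uri)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

    def __len__(self):
        return len(self._seen)
```

Every distinct URI adds one entry that is never removed, so a structure like this eventually exhausts a fixed heap unless part of it is moved to disk.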

However, in order to run that long, we had to disable the ExtractorDOC
and ExtractorPDF processors (which have unresolved memory-overuse bugs)
and set the expiration of IP and robots info to 3 days (because the
refetching of this info after it expires is currently unreliable).

  > Coverage analysis:
  >
  > 367 (40% of the 907) seed URLs were investigated more closely.
  > 265 of these were valid sites that should be harvested.
  > 101 (38%) were completely missed by the harvester.

This is the most surprising result; any URI listed in the seeds
should definitely be visited by the crawler. Were there errors
evident in the logs for the seed sites missed?

(I would check the crawl.log first. If a negative error code is
associated with the URI of interest, you have to look up its
meaning in the FetchStatusCodes class -- at least until we
move to better, more symbolic/mnemonic error codes. Barring any
useful info there, you can also check the runtime-errors.log
and the local-errors.log to see if unexpected errors thwarted
normal processing.)
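
A quick way to find such entries is to filter the log for negative status codes. The sketch below assumes a whitespace-separated layout with the status code in the second field and the URI in the fourth; that layout is an assumption and should be checked against the crawl.log actually produced. The sample lines are invented.

```python
def failed_fetches(log_lines):
    """Yield (status, uri) for log lines whose status code is negative.

    Assumed layout (not confirmed for this Heritrix version):
    timestamp, status code, size, URI, then further fields.
    """
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            status = int(fields[1])
        except ValueError:
            continue  # not a status code; skip the line
        if status < 0:
            yield status, fields[3]


# Made-up sample lines in the assumed layout.
sample = [
    "20031111170000 200 5120 http://www.example.dk/ -",
    "20031111170005 -2 0 http://www.example.org/ -",
]
print(list(failed_fetches(sample)))
```

The meaning of each negative code still has to be looked up in the FetchStatusCodes class, as described above.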

  > Observed problems:
  >
  > Relative image URLs seem not to be stored
  >
  > Example:
  > <img src="../Images/logo.gif" alt="Logo">

I will look into this further. Keep in mind that when you find
specific, reproducible bugs, you may enter them directly into
the project bug-tracking system at:

     https://sourceforge.net/tracker/?func=browse&group_id=73833&atid=539099

  > Complex framesets:
  >
  > Example:
  >
  >                       http://www.centrumdemokraterne.dk
  > <http://www.centrumdemokraterne.dk/> (center frame not crawled)
  > http://www.chilinet.dk <http://www.chilinet.dk/>

I suspect this might be other problems (perhaps evident in
the logs) affecting the given frame URIs, rather than a problem
with frameset parsing, but I will investigate further.


  > A large number of city-sites using the same framework were not harvested:
  >
  >                       http://www.bynet.dk <http://www.bynet.dk/>
  >                       http://esbjerg.bynet.dk/

Were there errors in the logs explaining why the seed/roots
failed?

  > Explicit port definitions seem to confuse Heritrix
  >
  >                       http://dk.yahoo.com:80 <http://dk.yahoo.com/>

Do you mean sites were collected twice, or not at all with
some error?

  > PDF files seem not to be stored correctly (Word files are OK.).

Before a bug in the ReplayInputStream was fixed on November 17, all
content byte values > 127 were being corrupted when sent to
ARC files.

- Gordon

#182 From: Bjarne Andersen <bja@...>
Date: Tue Nov 25, 2003 8:21 am
Subject: Re: Crawl of 907 danish seeds
bjarne_dk2000
 
The large number of missing portals using the same frameset (www.bynet.dk)
were not harvested due to:

<meta name="robots" content="index,nofollow,noimageindex">

which cannot be overridden in Heritrix! (As I pointed out in the small test
of HTTrack vs. Heritrix; results uploaded to Yahoo.)
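
To check a seed page for this tag before crawling, the directives can be extracted with a small parser. This is a minimal sketch using Python's standard library; the class and function names are illustrative and have nothing to do with Heritrix's own extractor.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<meta name="robots" content="index,nofollow,noimageindex">'
print("nofollow" in robots_directives(page))
```

A page that reports nofollow on its front page will yield no further links for a crawler that honors the tag, which matches the "only front page harvested" symptom.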



best
Bjarne Andersen

Steen Christensen wrote:

> The last couple of years an attempt has been made to crawl approx. 900
> danish URLs deemed interesting during the week 46.
>
> This year we tried to use the Heritrix crawler.
>
> The following is a brief summary of some of the experiences:
>
> Total number of seed URL’s: 907
>
> Harvest start: 11.11.2003-17
>
> Harverst end: 14.11.2003-12
>
> Harvest time: 67 hours
>
> Total amount of data harvested: 4 GB (compressed), 14 GB (uncompressed)
>
> Compression factor measured to 3.5.
>
> Harvest bandwidth: 14 GB/67 hours = 464 Kbit/sec
>
> Coverage analysis:
>
> 367 (40% of the 907) seed URLs were investigated more closely.
>
> 265 of these were valid sites that should be harvested.
>
> 101 (38%) were completely missed by the harvester.
>
> 58 (22%) were partially harvested (typically only front-page harvested).
>
> 106 (40%) were satisfactorily harvested.
>
> Observed problems:
>
> Relative image URLs seem not to be stored
>
> Example:
>
> <img src="../Images/logo.gif" alt="Logo">
>
> Complex framesets:
>
> Example:
>
> http://www.centrumdemokraterne.dk <http://www.centrumdemokraterne.dk/>
> (center frame not crawled)
>
> http://www.chilinet.dk <http://www.chilinet.dk/>
>
> A large number of city-sites using the same framework were not harvested:
>
> http://www.bynet.dk <http://www.bynet.dk/>
>
> http://esbjerg.bynet.dk/
>
> Explicit port definitions seem to confuse Heritrix
>
> http://dk.yahoo.com:80 <http://dk.yahoo.com/>
>
> PDF files seem not to be stored correctly (Word files are OK.).
>

#183 From: Steen Christensen <steensc42@...>
Date: Tue Nov 25, 2003 9:09 am
Subject: ARC Storage format extension
steensc42
 

We are currently discussing the storage format we will use for our archive.

Currently we are inclined to use the ARC format because

- it is simple

- a lot of data are already stored in this format

- we do not know of better alternatives

We have been discussing how to extend the ARC format to accommodate our requirements with respect to

- Meta Data

- Transformation Data

One proposal for the storage of meta-data is to introduce a new protocol header: metadata:

metadata:<protocol> <url> <ip> <timestamp>

<mime-type> <data-size>

<DATA – metadata for url>

 

The metadata for a collected URL is stored using the same value for the <protocol> <url> <ip> and <timestamp> fields as the block containing the collected data:

Example:

http://www.kb.dk/index.html 130.226.231.14 20031103164106

text/plain 299

HTTP/1.1 200 OK

...

metadata:http://www.kb.dk/index.html 130.226.231.14

20031103164106 text/plain 420

<MD5>0123456789</>

<RESPONSETIME>100ms</>

<HARVESTREASON>DKDOMAIN</>

...

Checking whether any metadata is available for a data block thus simply amounts to prefixing the original protocol header with metadata:

The metadata section should at a minimum contain an MD5 fingerprint of the data section it references.
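
The proposed layout can be sketched in a few lines. This is only an illustration of the scheme described here under stated assumptions: the function names are hypothetical, the body holds just the MD5 element, and the size is taken to be the byte length of that body.

```python
import hashlib


def metadata_header(url, ip, timestamp):
    """Header key for the metadata block of a collected URL."""
    return f"metadata:{url} {ip} {timestamp}"


def metadata_record(url, ip, timestamp, data):
    """Build a metadata block carrying the MD5 of the referenced data section."""
    body = f"<MD5>{hashlib.md5(data).hexdigest()}</>\n"
    size = len(body.encode("utf-8"))
    return f"{metadata_header(url, ip, timestamp)}\ntext/plain {size}\n{body}"


print(metadata_record("http://www.kb.dk/index.html", "130.226.231.14",
                      "20031103164106", b"HTTP/1.1 200 OK\r\n\r\n"))
```

Because the header differs from the data block's header only by the metadata: prefix, a reader that knows the original header can build the lookup key without consulting any index.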

 

We anticipate that we will need to perform a number of format transformations of the data as data formats become obsolete or unsupported.

Example: GIF -> JPEG

We propose to store this information in ARC files using a transform: protocol header.

 

transform:<protocol> <url> <ip> <timestamp>

<mime-type> <data-size>

<DATA –the transformed data>

Information about the transformation process is stored as metadata using the format:

 

metadata:transform:<protocol> <url> <ip> <timestamp>

<mime-type> <data-size>

<DATA – metadata for the transformation>

 

Example:

ARC1:

http://www.kb.dk/logo.gif 130.226.231.14 20031103164106

image/gif 299

HTTP/1.1 200 OK

...

metadata:http://www.kb.dk/logo.gif 130.226.231.14

20031103164106 image/gif 420

<MD5>8888</>

-----------------------------------------------------

ARC2:

transform:http://www.kb.dk/logo.gif 130.226.231.14

20031103164106 image/png 420

...

transform:http://www.kb.dk/logo.gif 130.226.231.14

20031103164106 image/jpeg 68

...

metadata:transform:http://www.kb.dk/logo.gif 130.226.231.14

20031103164106 image/gif 420

<MD5>42424242</>

<original_format>gif</>

<original_MD5>8888</>

<new_format>png</>

...

metadata:transform:http://www.kb.dk/logo.gif 130.226.231.14

20031103164106 image/gif 420

<MD5>12474567812</>

<original_format>png</>

<original_MD5>42424242</>

<new_format>jpeg</>

...

 

These entries correspond to the transformations gif -> png -> jpeg
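
The chain can be reconstructed by following the MD5 references from one metadata:transform entry to the next. A minimal sketch, assuming the entries have already been parsed into dictionaries; the keys are normalized spellings of the tags in the example and the function name is hypothetical.

```python
# Each dict mirrors one metadata:transform entry from the example.
entries = [
    {"MD5": "42424242", "original_format": "gif",
     "original_MD5": "8888", "new_format": "png"},
    {"MD5": "12474567812", "original_format": "png",
     "original_MD5": "42424242", "new_format": "jpeg"},
]


def transformation_chain(entries, source_md5):
    """Follow original_MD5 links starting from the MD5 of the harvested data."""
    by_original = {e["original_MD5"]: e for e in entries}
    chain = []
    current = source_md5
    while current in by_original:
        entry = by_original.pop(current)  # pop guards against cycles
        if not chain:
            chain.append(entry["original_format"])
        chain.append(entry["new_format"])
        current = entry["MD5"]
    return chain


print(" -> ".join(transformation_chain(entries, "8888")))
```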

The basic idea is to add information to the ARC files without breaking existing code.

What do you think of this approach?

 



#184 From: "johnerikhalse" <johnh@...>
Date: Tue Nov 25, 2003 8:03 pm
Subject: Suggestions for per host settings
johnerikhalse
 
SUGGESTIONS FOR PER HOST SETTINGS

This document aims to describe a solution for the per host settings as
required by the "Memorandum of understanding" between the Internet
Archive and the Nordic National Libraries. The requirement is to be
able to specify settings for each site, domain defaults and global
defaults. This document does not address the possibility of setting
configuration settings on a per document basis.


HIERARCHY OF SETTINGS

Settings should be settable in a hierarchy with the default or global
settings on top. It should then be possible to set settings on a
per domain basis and on a per host basis. The latter should always have
precedence over the former.

In addition to this we might want to have two sets of hierarchies: one
set of settings that could be shared among different crawls, and
another that could override the first set. The use for this
could be a set of configurations shared by different
institutions doing similar crawls, for example different national
libraries doing almost the same crawl but for different domains. For a
host A in the domain B, the settings hierarchy will then look something
like this, where we call the first configuration hierarchy "common" and
the second "local":

1. read the order file which contains the default settings for the
    crawl and also points to the common and local configurations.
2. override if there are global settings in common
3. override if there are global settings in local
4. override if there are domain B settings in common
5. override if there are domain B settings in local
6. override if there are host A settings in common
7. override if there are host A settings in local

All settings except for the initial order file should only contain
those fields which are to be overridden.
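
The seven steps amount to applying sparse override layers in a fixed order, with later layers winning. A minimal sketch of that idea; the setting names and values are invented for illustration and are not taken from an actual order file.

```python
def effective_settings(order, layers):
    """Apply override layers to the order-file defaults; later layers win."""
    settings = dict(order)
    for layer in layers:
        settings.update(layer)
    return settings


# Hypothetical setting names and values for host A in domain B.
order = {"max-delay": 5000, "min-delay": 500, "max-link-hops": 25}  # step 1
layers = [
    {},                      # 2. global settings in common
    {"min-delay": 1000},     # 3. global settings in local
    {"max-link-hops": 10},   # 4. domain B settings in common
    {},                      # 5. domain B settings in local
    {"max-delay": 20000},    # 6. host A settings in common
    {"max-link-hops": 3},    # 7. host A settings in local
]
print(effective_settings(order, layers))
```

Each layer only contains the fields it overrides, which is why an empty layer is a valid (and common) case.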


SETTINGS THAT SHOULD BE MODIFIABLE PER HOST

Not all settings should be settable on a per domain or per host
basis. This is especially the case for settings such as the
number of threads and which directories are to be used for log and arc
files. This list shows some of the settings that should be
overridable by domain or host settings.

* Scope settings
   - max link hops
   - max trans hops
* Behavior settings
   - robots honoring policy
   - user-agent and from settings
* Politeness settings
   - delay factor
   - min delay
   - max delay
* Processor settings
   - max file size to fetch
   - which filters for inclusion and exclusion to apply

In addition we might want to change which processors should handle URIs
from a specific host.


CONFIGURATION FILES

Probably the best way of extending how the configuration is read
is by adding two new fields in the order file which point to
directories containing the common and the local configuration
hierarchies. These directories hold three different kinds of files, all of
which are optional: one global file with the global settings, one or more
domain files with settings for one domain each, and one or more host files
with settings for one host each.


IMPLEMENTATION

* FRONTIER

To achieve these goals we need a place where the frontier should get the
configuration settings for each host.

In the current implementation there is a CrawlServer class which keeps
track of the robots.txt for a host and also makes sure that we only
have one fetch in progress from a certain host. I think this class
might be a good candidate for keeping the per host configuration as
well. The CrawlServer should at instantiation ask the CrawlController
for its configuration by issuing its host name. Then it is up to the
controller to figure out if it should stick with the default settings
or if some or all settings should be overridden for this host.

The frontier should be altered to associate a CrawlServer with the URI
as soon as it is fetched from the pending queue and turned into a
CrawlURI. This way the URI could be asked for the configuration
settings it should use.

The big implication this has on the design is that the different
modules that currently extend the XMLConfig class to get
their own configuration settings should instead ask other components for
their settings, so that the settings are more dynamic. This way a module will
not be as dependent on the structure of the order file as it is
today. The configuration will be a data structure of its own and most
modules should not extend the XMLConfig class.


* CONTROLLER

The second issue to be considered is how the controller keeps track of
the configurations for each host. One way of doing this is to compile
the hierarchy of settings for a host into one order object which is then
delivered to the CrawlServer class upon request. Another approach is to
keep the hierarchy of settings in memory and let the CrawlServer get
the configuration object with the finest granularity for a certain
host. When a setting is requested, the request is passed up the
hierarchy until some level can answer it. Both solutions can be
implemented with a late-initialization approach so that the actual
configuration is not read into memory before it is needed.
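
Both approaches can be sketched with one small class: get() passes a request up the hierarchy, while compile() flattens the chain into a single object. The class and method names are hypothetical and only illustrate the two options discussed here.

```python
class SettingsNode:
    """One level of the hierarchy (global, domain or host); names are illustrative."""

    def __init__(self, overrides=None, parent=None):
        self.overrides = dict(overrides or {})
        self.parent = parent

    def get(self, key):
        """Second approach: pass the request up until some level answers."""
        node = self
        while node is not None:
            if key in node.overrides:
                return node.overrides[key]
            node = node.parent
        raise KeyError(key)

    def compile(self):
        """First approach: flatten the chain into one settings dict."""
        flat = self.parent.compile() if self.parent else {}
        flat.update(self.overrides)
        return flat


global_node = SettingsNode({"min-delay": 500, "max-delay": 5000})
domain_node = SettingsNode({"max-delay": 10000}, parent=global_node)
host_node = SettingsNode({"min-delay": 2000}, parent=domain_node)
print(host_node.get("max-delay"), host_node.compile())
```

The trade-off is the usual one: compiled objects answer lookups faster but duplicate data per host, whereas the linked hierarchy shares the upper levels between all hosts.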

The main problem is that with a large broad crawl there might be a lot
of CrawlServers active and that will demand a lot of memory to be
allocated. Several things could be done to reduce that problem.

The frontier could be constrained to only work with a limited set of
hosts at a time. When it thinks it has finished a host, either because of
constraints given to it or because there are no more pending URIs, it should
throw the CrawlServer away and start working on new ones. If the host
has to be revisited, that is, if other hosts link to pages inside
this host which haven't been visited, then the configuration has to be
reconstructed.

A variant of the above is to keep track of when a CrawlServer object
was last accessed and dispose of the object after a certain time.
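
A time-based variant could look roughly like the following. This is a sketch under assumptions: the cache class is hypothetical, the factory stands in for whatever builds a CrawlServer and its configuration, and the clock is injectable so the behaviour can be tested.

```python
import time


class IdleEvictingCache:
    """Keeps per-host objects and drops those idle longer than max_idle seconds."""

    def __init__(self, factory, max_idle, clock=time.monotonic):
        self._factory = factory    # builds the object for a host on demand
        self._max_idle = max_idle
        self._clock = clock
        self._entries = {}         # host -> (object, last access time)

    def get(self, host):
        """Return the object for host, (re)constructing it if necessary."""
        entry = self._entries.get(host)
        obj = entry[0] if entry else self._factory(host)
        self._entries[host] = (obj, self._clock())
        return obj

    def evict_idle(self):
        """Drop entries not accessed within max_idle; return the evicted hosts."""
        now = self._clock()
        stale = [h for h, (_, last) in self._entries.items()
                 if now - last > self._max_idle]
        for host in stale:
            del self._entries[host]
        return stale
```

An evicted host is simply rebuilt by the factory on its next access, which corresponds to reconstructing the configuration when a host is revisited.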

Another optimization would be to not use the XML DOM as the underlying
data structure for configuration settings. The DOM creates a lot of
unnecessary objects. By building a custom data structure created by a
SAX parser, the burden in terms of both memory and processing could be
reduced.
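
As an illustration of the SAX idea, the handler below streams through a settings file and keeps only a flat dictionary of leaf values instead of a DOM tree. The element names in the example are invented and do not reflect the actual order-file schema.

```python
import xml.sax


class FlatSettingsHandler(xml.sax.ContentHandler):
    """Collects leaf element text into a flat dict keyed by slash-joined path."""

    def __init__(self):
        super().__init__()
        self.settings = {}
        self._path = []
        self._text = []

    def startElement(self, name, attrs):
        self._path.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        value = "".join(self._text).strip()
        if value:  # only leaf elements carry non-blank text
            self.settings["/".join(self._path)] = value
        self._path.pop()
        self._text = []


def parse_settings(xml_text):
    """Parse an XML string into a flat {path: value} dict."""
    handler = FlatSettingsHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.settings


# Hypothetical override file for one host.
doc = """<override>
  <politeness><min-delay>2000</min-delay><max-delay>10000</max-delay></politeness>
</override>"""
print(parse_settings(doc))
```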

#185 From: archive-crawler@yahoogroups.com
Date: Tue Nov 25, 2003 8:05 pm
Subject: New file uploaded to archive-crawler
archive-crawler@yahoogroups.com
 
Hello,

This email message is a notification to let you know that
a file has been uploaded to the Files area of the archive-crawler
group.

   File        : /requirementsupdate.doc
   Uploaded by : kristsi25 <kris@...>
   Description : Clarifications / progress report on requirements in the
"Memorandum of understanding" between the Internet Archive and the Nordic
National Libraries, that have already been met or are close to completion

You can access this file at the URL

http://groups.yahoo.com/group/archive-crawler/files/requirementsupdate.doc

To learn more about file sharing for your group, please visit

http://help.yahoo.com/help/us/groups/files

Regards,

kristsi25 <kris@...>

#186 From: archive-crawler@yahoogroups.com
Date: Tue Nov 25, 2003 8:06 pm
Subject: New file uploaded to archive-crawler
archive-crawler@yahoogroups.com
 
Hello,

This email message is a notification to let you know that
a file has been uploaded to the Files area of the archive-crawler
group.

   File        : /requirementsupdate2.doc
   Uploaded by : kristsi25 <kris@...>
   Description : Definitions of the requirements in the "Memorandum of
understanding" between the Internet Archive and the Nordic National Libraries,
except those that have already been met or are close to completion

You can access this file at the URL

http://groups.yahoo.com/group/archive-crawler/files/requirementsupdate2.doc

To learn more about file sharing for your group, please visit

http://help.yahoo.com/help/us/groups/files

Regards,

kristsi25 <kris@...>

#187 From: archive-crawler@yahoogroups.com
Date: Tue Nov 25, 2003 8:07 pm
Subject: New file uploaded to archive-crawler
archive-crawler@yahoogroups.com
 
Hello,

This email message is a notification to let you know that
a file has been uploaded to the Files area of the archive-crawler
group.

   File        : /IA_NORDIC_Memorandum of understanding.doc
   Uploaded by : kristsi25 <kris@...>
   Description : Memorandum of understanding - Agreement between the Internet
Archive and the Nordic National Libraries

You can access this file at the URL

http://groups.yahoo.com/group/archive-crawler/files/IA_NORDIC_Memorandum%20of%20\
understanding.doc

To learn more about file sharing for your group, please visit

http://help.yahoo.com/help/us/groups/files

Regards,

kristsi25 <kris@...>

#188 From: Aaron Krowne <akrowne@...>
Date: Tue Nov 25, 2003 10:15 pm
Subject: state of the crawler
aaronkrowne
 
Hi,

We are considering using Heritrix on the MetaCombine project.  I can see
it is in early development; what can it do at the moment?  (I.e., can it
do crawls, can it do focusing, is the infrastructure there but not the
focusing algorithm plug-ins, etc.)  I apologize in advance if I missed any
documentation explaining this.

Cheers,

Aaron

--
Aaron Krowne
Head of Digital Library Research
Emory University General Libraries
Office: 404-712-2810
Cell: 404-405-5766
akrowne@...

#189 From: "steensc42" <steensc42@...>
Date: Thu Nov 27, 2003 9:59 am
Subject: Re: Crawl of 907 danish seeds
steensc42
 
I will produce a list of the seeds that caused problems and try a
recrawl with a newer version of Heritrix, with pdf and doc link
extraction disabled. I will report the results of that crawl when we
have them.

--- In archive-crawler@yahoogroups.com, Gordon Mohr <gojomo@a...>
wrote:
> Thanks for this report; these are curious results, and we'll
> want to track down every place where an expected URI/resource was
> missed.
>
> Was the code grabbed from CVS and built just prior to the run?
> (We are still making significant destabilizing changes regularly.)
>

We did not use the most recent CVS version as we had some minor
problems with that version - instead we used an earlier apparently
more stable version.

> What configuration options were used? (Can you forward your crawl-order
> file?) What command-line was used to launch the crawler?
>
> Without listing all the seeds, can you say whether they were always
> site roots (http://www.site.org) or sometimes other entry pages
> (http://www.site.org/subsection/)?

Mixture of both

>
> Steen Christensen wrote:
>  > Total number of seed URL's: 907
>  >
>  > Harvest start: 11.11.2003-17
>  > Harverst end: 14.11.2003-12
>  >
>  > Harvest time: 67 hours
>  >
>  > Total amount of data harvested: 4 GB (compressed), 14 GB (uncompressed)
>
> How many total resources were successfully collected?
Approx. 800 000
>
> Did the crawler run out of pages to crawl, or hit an error or user-abort?
>

It ran out of pages

> The current version of Heritrix will still eventually hit, and be
> stopped, by memory-footprint limits.
>
> How soon these limits are hit depend on the available memory, diversity
> of URIs crawled, and whether you've enabled the experimental disk-based
> structure for tracking "alreadyIncluded" items. (This is currently done
> in code, in the Frontier class initialization method.)
>

We used the disk-based method; the memory-based one ran out of memory
after approx. 24 hours.

> Using the in-memory only implementation (MemLongFPSet), on a 2GB crawl
> machine, we recently ran a crawl of ~250 sites which gathered 4.8 million
> URIs over 3 days before hitting implementation problems.
>
> However, in order to run that long, we had to disable the ExtractorDOC
> and ExtractorPDF processors (which have unresolved memory-overuse bugs)

We did not disable DOC and PDF extraction. We will do this in our
next trial.

> and set the expiration of IP and robots info to 3 days (because the
> refetching of this info after it expires is currently unreliable).
>
>  > Coverage analysis:
>  >
>  > 367 (40% of the 907) seed URLs were investigated more closely.
>  > 265 of these were valid sites that should be harvested.
>  > 101 (38%) were completely missed by the harvester.
>
> This is the most surprising result; any URI listed in the seeds
> should definitely be visited by the crawler. Were there errors
> evident in the logs for the seed sites missed?
>
The term "missed" is not accurate. A site was classified as missed if
the seed URL page could not be retrieved from the archive; often
part of a frame structure would be stored.


As Bjarne pointed out a number of "missed" sites were due to a robots
meta-tag on the frontpage.

> (I would check the crawl.log first. If a negative error code is
> associated with the URI of interest, you have to look up its
> meaning in the FetchStatusCodes class -- at least until we
> move to better, more symbolic/mnemonic error codes. Barring any
> useful info there, you can also check the runtime-errors.log
> and the local-errors.log to see if unexpected errors thwarted
> normal processing.)
>
>  > Observed problems:
>  >
>  > Relative image URLs seem not to be stored
>  >
>  > Example:
>  > <img src="../Images/logo.gif" alt="Logo">
>
> I will look into this further. Keep in mind that when you find
> specific, reproduceable bugs, you may enter them directly into
> the project bug-tracking system at:
>
>     https://sourceforge.net/tracker/?func=browse&group_id=73833&atid=539099
We will report specific bugs directly to sourceforge in the future.
>
>  > Complex framesets:
>  >
>  > Example:
>  >
>  >                       http://www.centrumdemokraterne.dk
>  > <http://www.centrumdemokraterne.dk/> (center frame not crawled)
>  > http://www.chilinet.dk <http://www.chilinet.dk/>
>
> I suspect this might be other problems (perhaps evident in
> the logs) affecting the given frame URIs, rather than a problem
> with frameset parsing, but I will investigate further.
>
>
>  > A large number of city-sites using the same framework were not harvested:
>  >
>  >                       http://www.bynet.dk <http://www.bynet.dk/>
>  >                       http://esbjerg.bynet.dk/
>
> Were there errors in the logs explaining why the seed/roots
> failed?
>
>  > Explicit port definitions seem to confuse Heritrix
>  >
>  >                       http://dk.yahoo.com:80 <http://dk.yahoo.com/>
>
> Do you mean sites were collected twice, or not at all with
> some error?
>
Not at all
>  > PDF files seem not to be stored correctly (Word files are OK.).
>
> Before a bug in the ReplayInputStream was fixed on November 17, all
> content byte values > 127 were being corrupted when sent to
> ARC files.
>
Ok - we will try a more recent version
> - Gordon
/Steen

#190 From: John Erik Halse <johnh@...>
Date: Mon Dec 1, 2003 10:43 pm
Subject: [Fwd: Re: First draft of per host settings document]
johnerikhalse
 
I'm forwarding this response from Gordon to the group.

John

-----Forwarded Message-----
From: Gordon Mohr <gojomo@...>
To: John Erik Halse <johnh@...>
Cc: Igor Ranitovic <igor@...>, Kristinn Sigurðsson <kris@...>
Subject: Re: First draft of per host settings document
Date: Tue, 25 Nov 2003 09:31:19 -0800

Initial reactions:

Good proposal! Your approach makes sense.

We should make a list of common tasks a crawl operator would want to
achieve with such a capability, and test the proposal against those
tasks.

More comments interspersed below...

John Erik Halse wrote:
> This is my first draft on the per host settings document. I would like
> to get some comments on this before I post it on the yahoogroups.
>
> John
>
> ------------------------------------------------------------------------
>
> SUGGESTIONS FOR PER HOST SETTINGS
>
> This document aims to describe a solution for the per host settings as
> required by the "Memorandum of understanding" between the Internet
> Archive and the Nordic National Libraries. The requirement is to be
> able to specify settings for each site, domain defaults and global
> default. This document does not address the possibility to set
> configuration settings on a per document basis.
>
>
> HIERARCHY OF SETTINGS
>
> Settings should be settable in a hierarchy with the default or global
> settings on top. Then there should be possible to set settings on a
> per domain basis and on a per host basis. The latter should always has
> precedence over the former.
>
> In addition to this we might want to have two sets of hierarchies so
> that a set of settings could be shared among different crawls. And
> another that would eventually override the first set. The use for this
> could be that there is a set of configurations shared by different
> institutions doing similar crawls for example different national
> libraries doing almost the same crawl but for different domains. The
> settings hierarchy will then look something like this for a host A in
> the domain B, we call the first configuration hierarchy for common and
> the second for local:
>
> 1. read the order file which contains the default settings for the
>    crawl and also points to the common and local configurations.
> 2. override if there are global settings in common
> 3. override if there are global settings in local
> 4. override if there are domain B settings in common
> 5. override if there are domain B settings in local
> 6. override if there are host A settings in common
> 7. override if there are host A settings in local
>
> All settings except for the initial order file should only contain
> those fields which are to be overridden.

The idea of nested hierarchies makes sense to me; however the
common<->local override dimension seems to add a lot of complexity
for minimal benefit.

I can't think of a specific need, and making the overrides "zig-zag"
between two hierarchies rather than "telescope" up a single hierarchy
adds implementation and comprehension difficulties. I would defer
such a capability until the more basic override capabilities are
working, and a need is expressed.

Do you expect that top-level-domains like ".com" can also have
settings?

>
> SETTINGS THAT SHOULD BE MODIFIABLE PER HOST
>
> Not all settings should be settable on a per domain or per host
> basis. Especially this is the case with certain settings as the
> number of threads and which directories is to be used for log and arc
> files. This list shows some of the settings that should be
> overrideable by domain or host settings.
>
> * Scope settings
>   - max link hops
>   - max trans hops

Hadn't previously thought of these as per-host changeable -- need
to consider further (or see example usage).

> * Behavior settings
>   - robots honoring policy
>   - user-agent and from settings
> * Politeness settings
>   - delay factor
>   - min delay
>   - max delay

These, together with a generic "don't fetch anything with this
site/prefix/pattern", seem to be the typical cases motivating this
functionality.

When our scheduling capabilities are more advanced we might also add "only
fetch this site during these specific times".

Also, site-specific rules for reinterpreting away superficial URI info
(like session-IDs) may be a common use of this facility.
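
As a rough illustration of such a rule, the sketch below removes query parameters whose names look like session identifiers. The parameter list is a hypothetical example; in practice it would be one of the per-site settings being discussed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of session parameter names; a real rule would be per site.
SESSION_PARAMS = frozenset({"phpsessid", "jsessionid", "sid"})


def strip_session_ids(url, session_params=SESSION_PARAMS):
    """Return url with session-like query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in session_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_session_ids("http://www.example.org/page?id=7&PHPSESSID=abc123"))
```

Applying such a rule before the already-included check keeps the same page from being fetched once per session ID.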

> * Processor settings
>   - max file size to fetch
>   - which filters for inclusion and exclusion to apply
>
> In addition we might want to change which processors should handle URIs
> from a specific host.

Suppressing or enabling specific processors sounds interesting but
potentially dangerous; operators might expect that they have a
clear view of what runs from their global setup alone, and lingering
obscure per-site settings could lead to confusion. Seems like we should
be careful here. Perhaps in this case (and others?) the global settings
should indicate whether local overrides are allowed?

The per-site configuration facility should support whatever arbitrary
settings future processor modules need.

>
> CONFIGURATION FILES
>
> Probably the best way of extending how the configuration is read
> is to add two new fields to the order file which point to
> directories containing the common and the local configuration
> hierarchies. These directories hold three different kinds of files,
> all of which are optional: one global file with the global settings,
> one or more domain files with settings for one domain each, and one or
> more host files with settings for one host each.

This seems like it could map well to a directory-hierarchy matching
the domain levels (eg: /org/archive/movies) and possibly even subpaths
(eg /com/yahoo/geocities/TelevisionCity/studio/9999/) -- provided we do
sensible things to limit both excessive depth when there are just a small
number of configuration items, and excessive fanout in other cases.
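
As a rough illustration of the directory mapping described above, here is a
minimal Java sketch. The class and method names (SettingsPathSketch,
candidateDirs, hostToPath) are hypothetical and not part of Heritrix. It only
shows how a host name could be reversed into a path such as
org/archive/movies, with one candidate directory per domain level; it does
nothing about limiting depth or fanout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only: maps a host name to
// reversed-domain directory paths where settings files might live.
final class SettingsPathSketch {
    private SettingsPathSketch() {}

    /** Candidate directories, from most general (TLD) to most specific (host). */
    static List<String> candidateDirs(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        List<String> dirs = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        // Walk the labels right to left: "movies.archive.org" -> org, archive, movies.
        for (int i = labels.length - 1; i >= 0; i--) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(labels[i]);
            dirs.add(path.toString());
        }
        return dirs;
    }

    /** The most specific directory for a host, e.g. "org/archive/movies". */
    static String hostToPath(String host) {
        List<String> dirs = candidateDirs(host);
        return dirs.get(dirs.size() - 1);
    }
}
```

A settings loader could check each candidate directory in turn, which would
also cover the question of settings for a top-level domain like ".com".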

A key factor is how well the storage scales when there are potentially
millions of custom overrides, each potentially including lengthy
directives, and the crawler can only keep some subset in memory at a
time.

>
> IMPLEMENTATION
>
> * FRONTIER
>
> To achieve these goals we need a place where the frontier should get the
> configuration settings for each host.
>
> In the current implementation there is a CrawlServer class which keeps
> track of the robots.txt for a host and also makes sure that we only
> have one fetch in progress from a certain host. I think this class
> might be a good candidate for keeping the per host configuration as
> well. The CrawlServer should at instantiation ask the CrawlController
> for its configuration by issuing its host name. Then it is up to the
> controller to figure out if it should stick with the default settings
> or if some or all settings should be overridden for this host.
>
> The frontier should be altered to associate a CrawlServer with the URI
> as soon as it is fetched from the pending queue and turned into a
> CrawlURI. This way the URI could be asked for the configuration
> settings it should use.
>
> The big implication this has on the design is that the different
> modules that currently extend the XMLConfig class for getting
> their own configuration settings should instead ask other components for
> their settings so that they are more dynamic. This way a module will
> not be as dependent on the structure of the order file as it is
> today. The configuration will be a data structure on its own and most
> modules should not extend the XMLConfig class.

Asking the CrawlURI, for its CrawlServer, which supplies the
"current" settings is a promising technique.

For the common case where there are no overrides, though, it seems
beneficial to retain the idea that modules have their own
configuration -- sort of the global default. Then the CrawlURI's
info can be seen as overrides when present.

> * CONTROLLER
>
> The second issue to be considered is how the controller keeps track of
> the configurations for each host. One way of doing this is to compile
> the hierarchy of settings for a host into one order object which is then
> delivered to the CrawlServer class upon request. Another approach is to
> keep the hierarchy of settings in memory and let the CrawlServer get
> the configuration object with the finest granularity for a certain
> host. When a setting is requested the request will be thrown up the
> hierarchy until it finds something to respond. Both solutions can be
> implemented with a late-initialization approach so that the actual
> configuration is not read into memory before it is needed.
>
> The main problem is that with a large broad crawl there might be a lot
> of CrawlServers active and that will demand a lot of memory to be
> allocated. Several things could be done to reduce that problem.
>
> The frontier could be constrained to only work with a limited set of
> hosts at a time. When it thinks it has finished a host, either because
> of constraints given to it or because there are no more pending URIs, it
> should throw the CrawlServer away and start working on new ones. If the
> host is to be revisited, that is, if other hosts point to links inside
> this host which haven't been visited, then the configuration has to be
> reconstructed.
>
> A variant of the above is to keep track of when a CrawlServer object
> was last accessed and dispose of the object after a certain time.

Yes, I think the approaches in order of increasing sophistication would
be:
    (1) Keep all settings in memory (fast but not scalable)
    (2) Keep all settings on disk, load every time it's needed, discard upon
        each completed CrawlURI (scales up, but very inefficient)
    (3) Keep it all on disk, use in-memory cache so that many settings
        loads are served from cache (scales up, may tend towards (2)
        for certain patterns of use)
    (4) Keep it all on disk, but constrain the crawler to work on URIs in
        related batches which tend to either improve the efficiency of
        a (3)-style cache, or which provide clear/natural signals as
        to when different settings-groups should be paged in or out of
        fast memory.
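
To make approach (3) concrete, here is a minimal Java sketch of a bounded,
least-recently-used settings cache. Everything named here
(SettingsCacheSketch, the loader function standing in for a disk read,
loadCount) is an assumption made for illustration, not existing crawler code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of approach (3): settings live "on disk" (here, behind a
// loader function) and a bounded in-memory LRU cache serves repeated lookups.
final class SettingsCacheSketch<V> {
    private final int capacity;
    private final Function<String, V> loader; // stands in for reading a settings file
    private final LinkedHashMap<String, V> cache;
    private int loads = 0;

    SettingsCacheSketch(int capacity, Function<String, V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > SettingsCacheSketch.this.capacity;
            }
        };
    }

    /** Returns cached settings for the host, loading them on a miss. */
    synchronized V get(String host) {
        V settings = cache.get(host);
        if (settings == null) {
            settings = loader.apply(host);
            loads++;
            cache.put(host, settings); // may evict the least recently used host
        }
        return settings;
    }

    /** Number of times the loader was invoked, i.e. cache misses. */
    synchronized int loadCount() {
        return loads;
    }
}
```

If the crawl touches more distinct hosts than the capacity in a round-robin
pattern, nearly every lookup misses, which is the "tends towards (2)" case;
batching URIs by host as in (4) is what would keep the hit rate up.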

It seems there may be another role-module here -- a "ConfigurationProvider"
or some such -- which would initially use the configuration-files-and-
directories approach above but could be swapped with a relational-DB-based
implementation someday if someone prefers that.

> Another optimization would be to not use the XML DOM as the underlying
> data structure for configuration settings. The DOM creates a lot of
> unnecessary objects. By building a custom data structure created by a
> SAX parser the burden both in terms of memory and processing, could be
> reduced.

Agree, we'll have to limit our use of full DOM/XPath operations if extending
this mechanism down to fine-grained per-CrawlURI control.
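
For illustration, a minimal Java sketch of building a small custom structure
from a SAX parse instead of a DOM. The handler name, the XPath-like key
format, and the sample element and attribute names are all assumptions; the
real order file layout may differ. Only attribute values are collected, since
the current XML leans heavily on attributes.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects attribute values into a flat map keyed by
// XPath-like strings, without ever building a DOM tree.
final class FlatSettingsHandlerSketch extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final Map<String, String> settings = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        String prefix = "/" + String.join("/", path);
        for (int i = 0; i < atts.getLength(); i++) {
            settings.put(prefix + "/@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.removeLast();
    }

    /** Parses the XML text and returns the flat key -> value map. */
    static Map<String, String> parse(String xml) {
        FlatSettingsHandlerSketch handler = new FlatSettingsHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse settings XML", e);
        }
        return handler.settings;
    }
}
```

Parsing a (made-up) fragment such as
<order><politeness min-delay="2000"/></order> would yield a single entry
with the key /order/politeness/@min-delay.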

A possibility that's similar to something Kris and I discussed in the
context of the Admin UI would be a "pseudo-DOM" -- where we use a small
fixed subset of the XPath strings that would be used to access a rich DOM
as keys into a flat HashMap instead.

Then the global settings might still come from a hierarchical DOM, but
overrides would be specified as:

    XPath->alternate value
    XPath->alternate value
    etc.
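
A minimal Java sketch of that "pseudo-DOM" idea follows. The class name, the
method names, and the XPath strings in the usage note are hypothetical; a
plain map stands in for the global DOM so the sketch stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: XPath strings are used only as opaque keys into flat
// maps. Per-host overrides are consulted first, then the global values.
final class PseudoDomSettingsSketch {
    private final Map<String, String> globals = new HashMap<>();
    private final Map<String, Map<String, String>> overridesByHost = new HashMap<>();

    void setGlobal(String xpath, String value) {
        globals.put(xpath, value);
    }

    void setOverride(String host, String xpath, String value) {
        overridesByHost.computeIfAbsent(host, h -> new HashMap<>()).put(xpath, value);
    }

    /** Returns the host's override for this XPath key, else the global value. */
    String get(String host, String xpath) {
        Map<String, String> overrides = overridesByHost.get(host);
        if (overrides != null && overrides.containsKey(xpath)) {
            return overrides.get(xpath);
        }
        return globals.get(xpath);
    }
}
```

For example, with a global value set for "//politeness/@min-delay" and an
override of that same key for one host, get() returns the override for that
host and the global value for every other host.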

- Gordon

#191 From: John Erik Halse <johnh@...>
Date: Thu Dec 4, 2003 10:16 pm
Subject: Second draft of per host settings document
johnerikhalse
 
The document has evolved a bit, but still needs comments.

John Erik

#192 From: Gordon Mohr <gojomo@...>
Date: Mon Dec 8, 2003 7:06 pm
Subject: Re: state of the crawler
gojomo
 
Aaron Krowne wrote:
> We are considering using Heritrix on the MetaCombine project.  I can see
> it is in early development, what can it do at the moment?  (I.e. can it
> do crawls, can it do focusing, is the infrastructure there but not
> focusing algorithm plug-ins, etc.).  I apologize in advance if I missed any
> documentation explaining this.

Hi, Aaron. Sorry for the delay in responding.

Heritrix currently takes a seed list, typically up to hundreds of sites,
and then crawls according to a number of configurable parameters, including...

   - within a limited number of hops
   - within the seed domains, hosts, or path-prefixes

The configuration of the crawler is achieved via a number of pluggable
Java implementation classes, so other forms of focusing the crawl, including
forms based on content analysis, can be added by coding the appropriate
new classes.

As the internal interfaces continue to evolve and the documentation outside
of source code is thin, Heritrix is still best suited for use by adventurous
users with Java development skills, especially those eager to contribute to
the core design with feedback or functionality.

Can you tell us more about your project?

Please don't hesitate to send any and all questions you have to the list;
responses will usually come much more quickly!

- Gordon @ IArchive

#193 From: Aaron Krowne <akrowne@...>
Date: Mon Dec 8, 2003 9:08 pm
Subject: Re: state of the crawler
aaronkrowne
 
> Can you tell us more about your project?

Sure.

MetaCombine is a Mellon-funded project which takes the form of a series
of experiments exploring digital library "combinations" technologies.
We are interested in how to make federations of digital library
resources work better together, and work with the web better.   Our
experiments can be grouped into three phases:

A: Semantic clustering experiments (classification, metatagging,
    building category hierarchies).
B: Combined search (web + OAI/digital library native)
C: Federated framework (federated digital library services beyond OAI)

We are considering Heritrix chiefly for items in phase B.  Here we are
studying two approaches: combined search (and potentially other
services) via publishing digital library resources in OAI format to the
web, then unifying them with other crawled web resources and building
the search on top of this.  The second approach is to use a focused
crawl to discover salient digital library resources on the web, then use
metatagging (from phase A) to publish these resources to the digital
library, upon which services such as search can be built.

The two approaches are basically the converse of each other.  What we
are doing is studying the feasibility of them with open source software
in particular.

The better the crawler, the better the results we can get in phase B.  A
focused crawler is necessary to ensure subject relevance, as our test
platform and digital libraries in general tend to specialize.  Thus we
eagerly anticipate focusing capabilities in Heritrix.  We will of course
be happy to assist in any way that we can in getting Heritrix to the
point of filling this role in our project.

Best,

Aaron Krowne

--
Aaron Krowne
Head of Digital Library Research
Emory University General Libraries
Office: 404-712-2810
Cell: 404-405-5766
akrowne@...

#194 From: Kristinn Sigurðsson <kris@...>
Date: Tue Dec 9, 2003 7:26 pm
Subject: Heritrix: statistics
kristsi25
 
Hi all,
 
We are currently addressing the statistics gathering facilities of Heritrix (that is, those gathered at runtime, not compiled after the fact).  The attachment to this post discusses this in detail.  Any comments/observations/remarks etc. are very welcome at this stage.
 
- Kristinn Sigurðsson

#195 From: Michael Stack <stack@...>
Date: Wed Dec 10, 2003 8:57 pm
Subject: Re: [Fwd: Second draft of per host settings document]
stack@...
 
Proposal looks good to me.  The model reminds me of the apache '.htaccess'
file scheme, where you can insert directory-specific config. to override
the core httpd.conf.  I like it.  Below are some general
comments/questions:

+ Does the order file change as you do a crawl?  Where is state kept?
What if you want to tweak a setting during a crawl?  Where would that be
done?  Would you change the order file, or would you instead change it in
the UI and have its value saved to a state file subsequently used for
crawling?  Would such a change be one of the 'dynamic settings' from the
global configuration file saved at the head of the configuration dir
('overrides' for the order file) mentioned in your doc.?
+ Why can't I set any value?  What if during a crawl I want to up the
number of running threads or change the logging levels or the way in
which statistics are being gathered (e.g. move from 'pedestrian' crawl
to debug mode) or even instantiate new processors and change the order
in which the processor chain works?  Some settings won't be changeable
mid-crawl and these might fail w/ a polite message (e.g. changing heap
size mid-run) but otherwise, I'd suggest no constraint on what can be
changed in configuration (Keep in mind I don't know much about crawler
deploys).
+ I'd imagine, in the scheme of things, the reading of a new
configuration file into memory, whether per host or per domain, is a rare
event.  Do you agree?  At what time are the domain/host configuration
files read?  On initially crossing into a new domain or on first seeing
a host?  Will we ever refresh what we have in mem?
+ What are the plans for having multiple crawler instances working off
the one Frontier instance?  If the frontier makes a CrawlServer to
associate w/ a URI, how will we ensure that one crawler instance does
one server only (Maybe frontier can do distribution between crawlers).
+ Won't there be only one instance of the configuration in
memory? It won't be duplicated per CrawlServer -- just a reference to
the single instance?  Why then are we worried about XML DOM cost?  Won't
the configuration just be read into an object tree, w/ an xml serializer
used to create the objects and periodic visits to the file on disk to
check mod time?  The serializer would do something like this: if a value
exists at a certain node in the configuration hierarchy, a getter
returns the value at that location in the hierarchy; else,
we call super.  Or are we talking of doing, per host, an aggregation of
all xml snippets to write a new configuration file to feed the crawler
accessing a particular host?
+ Author, date and section numbering make it easier to refer to a
document and its sections.
+ Later we might use an ldap server implementation behind your
'Configuration' reader class to store config.  The hierarchical nature
of the configuration w/ a rare write would make it a natural match.

Other comments:

+ Is CrawlServer class to do w/ crawl accessing a particular server?
Maybe rename it Server?
+ I have comments on the xml -- no schema nor dtd and extensive use of
element attributes -- but they can go elsewhere.

St.Ack

#196 From: John Erik Halse <johnh@...>
Date: Wed Dec 10, 2003 10:40 pm
Subject: Re: Re: [Fwd: Second draft of per host settings document]
johnerikhalse
 
Thanks, a lot of good points here.

On Wed, 2003-12-10 at 12:57, Michael Stack wrote:
> Proposal looks good to me.  Model reminds of me of apache '.htaccess'
> file scheme where you can insert directory specific config. to override
> the core httpd.conf.  I like it.  Below are some general
> comments/questions:
>
> + Does the order file change as you do a crawl?  Where is state kept?
> What if you want to tweak a setting during a crawl.  Where would that be
> done?  Would you change the order file or would you instead change UI
> and its value would be save to a state file subsequently used crawling?
> Would such a change be one of the 'dynamic settings' from the global
> configuration file saved at the head of the configuration dir
> ('overrides' for the order file) mentioned in your doc.?
It is, with the current code, possible to change the order file during a
crawl. This is done through the UI and the changes are stored to disk.

> + Why can't I set any value?  What if during a crawl I want to up the
> number of running threads or change the logging levels or the way in
> which statistics are being gathered (e.g. move from 'pedestrian' crawl
> to debug mode) or even instantiate new processors and change the order
> in which the processor chain works.  Some settings won't be changeable
> mid-crawl and these might fail w/ a polite message (e.g. changing heap
> size mid-run) but otherwise, I'd suggest no constraint on what can be
> changed in configuration (Keep in mind I don't know much about crawler
> deploys).
Yes, most things should be changeable during a crawl. The distinction
between the order file and the configuration hierarchy is that settings
that should be possible to override on a per host basis go in the
crawl configuration, while settings that only make sense for the
crawler as a whole go in the order file. The names of these two kinds of
configuration should be changed to better reflect this.

> + I'd imagine in the scheme of things, the reading of a new
> configuration file into memory, whether per host or per domain, a rare
> event.  Do you agree?  At what time are the domain/host configuration
> files read?  On initially crossing into a new domain or on first seeing
> a host?  Will we ever refresh what we have in mem?
My thought is that we read the configuration file when crossing into a
new domain. However, it is possible to imagine an arbitrarily large number
of per host configuration files. In that case we might not be able to
keep them all in memory at all times.

> + What are the plans for having multiple crawler instances working off
> the one Frontier instance?  If the frontier makes a CrawlServer to
> associate w/ a URI, how will we ensure that one crawler instance does
> one server only (Maybe frontier can do distribution between crawlers).
If by multiple crawler instances you mean a crawler spanning multiple
machines (or VMs), then this has not been addressed yet.

> + Won't there only be one instance of the configuration instance in
> memory? It won't be duplicated per CrawlerServer -- just a reference to
> the single instance?  Why then are we worried about XML DOM cost?  Won't
> the configuration just be read into an object tree w/ an xml serializer
> used creating the objects w/ periodic visits to the file on disk to
> check mod time?  The serializer would do something like, if a value
> exists at a certain node in the configuration hierarchy, a getter that
> returns the value at that location in the hierarchy is returned, else,
> we call super.  Or are we talking of doing per host an aggregation of
> all xml snippets to write a new configuration file to feed the crawler
> accessing a particular host?
When I started writing the document the DOM was used as the internal
data structure and XPaths were used to get the different settings. This
is costly if the values have to be looked up for every URI. These days
the settings are kept in a HashMap after the initial lookup, which
greatly decreases the cost. But if we are working with an internal data
structure distinct from the DOM, it shouldn't be much harder to use SAX
instead of the DOM to build it. Then, as mentioned before, there could
possibly be thousands of per host configuration files, and then
performance becomes an issue. I'd like each Configuration object to
represent a file on disk. If there is a configuration for a host, the
CrawlServer points to this. If there isn't, the CrawlServer will point
to the configuration for its domain if present, or to the global
configuration. When a module asks for a setting it will ask the
configuration object which the CrawlServer references. Each configuration
object knows its parent, so if the setting isn't there it will push the
request up to that.
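
A minimal Java sketch of that lookup chain, purely for illustration: the
class name ConfigNodeSketch and the setting keys in the usage note are
hypothetical, and each node stands in for one settings file on disk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one node per settings file (global, domain, host).
// A lookup that finds nothing locally is pushed up to the parent node.
final class ConfigNodeSketch {
    private final String name;
    private final ConfigNodeSketch parent; // null for the global configuration
    private final Map<String, String> local = new HashMap<>();

    ConfigNodeSketch(String name, ConfigNodeSketch parent) {
        this.name = name;
        this.parent = parent;
    }

    ConfigNodeSketch set(String key, String value) {
        local.put(key, value);
        return this;
    }

    /** Returns the nearest value walking host -> domain -> global, or null. */
    String get(String key) {
        for (ConfigNodeSketch node = this; node != null; node = node.parent) {
            String value = node.local.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    String name() {
        return name;
    }
}
```

A CrawlServer would hold a reference to the most specific node that exists
for it. A host node with no local "min-delay" would answer with its domain's
value, and with the global value if the domain has none either.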

> + Author, date and section numbering make it easier to refer to a
> document and its sections.
Agree.

> + Later we might use an ldap server implementation behind your
> 'Configuration' reader class to store config.  The hierarchical nature
> of the configuration w/ a rare write would make it a natural match.
That makes sense. You could even keep it in an RDBMS, but I think the
default implementation shouldn't need external servers to be set up, so
the concept of a hierarchy of XML files should probably be the default
implementation.

>
> Other comments:
>
> + Is CrawlServer class to do w/ crawl accessing a particular server?
> Maybe rename it Server?
> + I have comments on the xml -- no schema nor dtd and extensive use of
> element attributes -- but they can go elsewhere.
>
> St.Ack

#197 From: Gordon Mohr <gojomo@...>
Date: Wed Dec 10, 2003 11:28 pm
Subject: Re: Re: [Fwd: Second draft of per host settings document]
gojomo
 
Michael Stack wrote:
> Proposal looks good to me.  Model reminds of me of apache '.htaccess'
> file scheme where you can insert directory specific config. to override
> the core httpd.conf.  I like it.  Below are some general
> comments/questions:
>
> + Does the order file change as you do a crawl?  Where is state kept?

Originally, the order file did not change: it was only edited
(by hand or the web UI) before the crawl began, and then consulted at
the start of a crawl to fill in per-instance variables holding chosen
values.

Recently, Kris made several changes:
   (1) Most places where order settings are used, the order object (or
       portion thereof) is directly consulted each time the value is
       needed -- rather than only at startup -- so that changes during
       a crawl can have an effect.
   (2) The crawl can be paused in memory, and various crawl-order fields
       changed during the pause. The updated order is written to disk,
       and the new values should affect the crawl when it is resumed.

(I'm not sure everything behaves as might be desirable for all the
crawler objects which inherit from XMLConfig, and retain direct
references to their 'home node' inside the overall order DOM.)

The state of the running crawl is in memory, and reflected in the various
logs, but never yet captured as a consistent/resumable whole to disk.

> What if you want to tweak a setting during a crawl.  Where would that be
> done?  Would you change the order file or would you instead change UI
> and its value would be save to a state file subsequently used crawling?

Currently, tweaks can only happen via the web UI, while the crawl is
paused in memory. We could conceivably allow hand editing of
the on-disk files -- and then somehow signal the running crawler
to revisit them.

> Would such a change be one of the 'dynamic settings' from the global
> configuration file saved at the head of the configuration dir
> ('overrides' for the order file) mentioned in your doc.?

I think most such adjustments to global values would be reflected
in the global config file, under John's proposal.

> + Why can't I set any value?  What if during a crawl I want to up the
> number of running threads or change the logging levels or the way in
> which statistics are being gathered (e.g. move from 'pedestrian' crawl
> to debug mode) or even instantiate new processors and change the order
> in which the processor chain works.  Some settings won't be changeable
> mid-crawl and these might fail w/ a polite message (e.g. changing heap
> size mid-run) but otherwise, I'd suggest no constraint on what can be
> changed in configuration (Keep in mind I don't know much about crawler
> deploys).

Changing some values mid-crawl may imply consequences that are hard
to define or implement. For example, does changing the paths for
crawler output mid-crawl mean the crawler should copy its existing
data to the new location? Or simply write new data to the new location?
And if parts of the UI need to consult the full data set to operate,
must it remember all old paths?

Thus I think the decision on what settings are dynamically changeable
must be made on a case-by-case basis. Outside of the (large) class
of settings which affect individual CrawlURI processing, I expect
the default to be "not changeable until need is demonstrated" -- then
we can consider the particulars of the situation.

> + I'd imagine in the scheme of things, the reading of a new
> configuration file into memory, whether per host or per domain, a rare
> event.  Do you agree?  At what time are the domain/host configuration
> files read?  On initially crossing into a new domain or on first seeing
> a host?  Will we ever refresh what we have in mem?

Clearly per-host (or per-domain) settings must be read when any URI
within that category is being processed in a way that could be affected
by those settings. But such per-host or per-domain info will also often
be flushed from memory when no such URIs are being processed, and then
re-read as necessary. So depending on the breadth and scheduling policy
of a crawl, such files might be read only once or many times.

I suspect only changes of settings through the web UI, or some other
manual kick/flush of existing settings, would cause a refresh of
settings held in memory.

> + What are the plans for having multiple crawler instances working off
> the one Frontier instance?  If the frontier makes a CrawlServer to
> associate w/ a URI, how will we ensure that one crawler instance does
> one server only (Maybe frontier can do distribution between crawlers).

Not yet considered in depth. It's quite likely that when independent
crawlers are cooperating, they'll each have a firm idea of which
remote servers (host:port combos) they are to crawl, and which they
should delegate to their siblings.

> + Won't there only be one instance of the configuration instance in
> memory? It won't be duplicated per CrawlerServer -- just a reference to
> the single instance?  Why then are we worried about XML DOM cost?  Won't
> the configuration just be read into an object tree w/ an xml serializer
> used creating the objects w/ periodic visits to the file on disk to
> check mod time?  The serializer would do something like, if a value
> exists at a certain node in the configuration hierarchy, a getter that
> returns the value at that location in the hierarchy is returned, else,
> we call super.  Or are we talking of doing per host an aggregation of
> all xml snippets to write a new configuration file to feed the crawler
> accessing a particular host?

If many domains/hosts/servers have their own custom settings, and
each settings file is XML, and that XML is always loaded into a
generic DOM and then referenced by the CrawlServer instance, the DOM
overhead might be a concern.

> + Author, date and section numbering make it easier to refer to a
> document and its sections.

Good point!

> + Later we might use an ldap server implementation behind your
> 'Configuration' reader class to store config.  The hierarchical nature
> of the configuration w/ a rare write would make it a natural match.

Or, other relational DBs.

> Other comments:
>
> + Is CrawlServer class to do w/ crawl accessing a particular server?
> Maybe rename it Server?

Yes, CrawlServer represents one remote server (host:port) that should
be treated as a unit for various reasons during crawling. Like with
"CrawlURI", "Crawl" is prepended to help distinguish the particular
work-unit here from other senses/versions of the word.

> + I have comments on the xml -- no schema nor dtd and extensive use of
> element attributes -- but they can go elsewhere.

The motivation for extensive use of attributes was overall compactness of
representation and ease of access using short, unique XPaths. We could
consider other better or more common XML syntax practices; I look forward
to additional comments.

- Gordon

#198 From: Michael Stack <stack@...>
Date: Wed Dec 10, 2003 11:25 pm
Subject: Re: Re: [Fwd: Second draft of per host settings document]
stack@...
 
John Erik Halse wrote:

>Thanks, a lot of good points here.
>
>On Wed, 2003-12-10 at 12:57, Michael Stack wrote:
>
>
>>Proposal looks good to me.  Model reminds of me of apache '.htaccess'
>>file scheme where you can insert directory specific config. to override
>>the core httpd.conf.  I like it.  Below are some general
>>comments/questions:
>>
>>+ Does the order file change as you do a crawl?  Where is state kept?
>>What if you want to tweak a setting during a crawl.  Where would that be
>>done?  Would you change the order file or would you instead change UI
>>and its value would be save to a state file subsequently used crawling?
>>Would such a change be one of the 'dynamic settings' from the global
>>configuration file saved at the head of the configuration dir
>>('overrides' for the order file) mentioned in your doc.?
>>
>>
>It is, with the current code, possible to change the order file during a
>crawl. This is done through the UI and the changes are stored to disk.
>
>
>
So the UI updates the order file?

>>+ Why can't I set any value?  What if during a crawl I want to up the
>>number of running threads or change the logging levels or the way in
>>which statistics are being gathered (e.g. move from 'pedestrian' crawl
>>to debug mode) or even instantiate new processors and change the order
>>in which the processor chain works.  Some settings won't be changeable
>>mid-crawl and these might fail w/ a polite message (e.g. changing heap
>>size mid-run) but otherwise, I'd suggest no constraint on what can be
>>changed in configuration (Keep in mind I don't know much about crawler
>>deploys).
>>
>>
>Yes, most things should be changeable during a crawl. The distinction
>between the order file and the configuration hierarchy is that settings
>that should be possible to override on a per host basis goes to the
>crawl configuration, while settings that only makes sense for the
>crawler as a whole goes to the order file. The names of these two kind
>configurations should be changed to better reflect this.
>
>
>

How about saving changes to the order file to the configuration file
that is at the root of the configuration hierarchy?  (Perhaps this is
what you already propose?).

+1 on clearer naming.  'order' seems like a good name considering what
it does, though it might strike newbies as odd.   'newbies' might find
'crawler' or 'configuration' more natural?  Or call all of the files
'crawler' or 'configuration'  and it's their location that designates
what they do?   Just suggestions.

Sounds like you might want to keep in mind the moving of configurations
amongst crawlers.  For instance, when the archive does broad crawls of
the internet, I can see us codifying our experience going up against the
grand disparity of sites that are out on the internet as configuration
in the configuration directory hierarchy.  We might be constantly adding
to it.  There might be a baseline config. we'd hand out to others doing
broad internet crawls.

>>+ I'd imagine in the scheme of things, the reading of a new
>>configuration file into memory, whether per host or per domain, a rare
>>event.  Do you agree?  At what time are the domain/host configuration
>>files read?  On initially crossing into a new domain or on first seeing
>>a host?  Will we ever refresh what we have in mem?
>>
>>
>My thought is that we read the configuration file when crossing into a
>new domain. However it is possible to imagine an arbitrary large number
>of per host configuration files. Then it might be possible that we
>cannot keep them all in memory at all times.
>
>
>
Ok.  Later the 'Configuration' class could take on memory management
interposing a cache that was cognizant of the configuration's tree
hierarchy (so it favored the flushing of leaf nodes ahead of nodes
higher in the hierarchy).
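
As a rough Java sketch of such a hierarchy-aware cache: when the cache is
over capacity it evicts the deepest cached path first, so host-level (leaf)
settings go before domain-level or global ones. The class name, the
reversed-domain path keys, and the depth-only eviction rule are all
assumptions for illustration; a real cache would presumably also weigh
recency of use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: entries are keyed by reversed-domain paths such as
// "dk/bynet/esbjerg". On overflow the deepest path, which is necessarily a
// leaf among the cached entries, is flushed first.
final class LeafFirstCacheSketch {
    private final int capacity;
    private final Map<String, String> entries = new HashMap<>();

    LeafFirstCacheSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    private static int depth(String path) {
        return path.split("/").length;
    }

    void put(String path, String settings) {
        entries.put(path, settings);
        while (entries.size() > capacity) {
            String victim = null;
            for (String candidate : entries.keySet()) {
                if (candidate.equals(path)) {
                    continue; // never evict the entry that was just added
                }
                // Prefer the deepest path; break ties by name for determinism.
                if (victim == null
                        || depth(candidate) > depth(victim)
                        || (depth(candidate) == depth(victim)
                            && candidate.compareTo(victim) > 0)) {
                    victim = candidate;
                }
            }
            entries.remove(victim);
        }
    }

    boolean contains(String path) {
        return entries.containsKey(path);
    }
}
```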

>>+ What are the plans for having multiple crawler instances working off
>>the one Frontier instance?  If the frontier makes a CrawlServer to
>>associate w/ a URI, how will we ensure that one crawler instance does
>>one server only (Maybe frontier can do distribution between crawlers).
>>
>>
>If you by multiple crawler instances mean a crawler spanning multiple
>machines (or VMs), then this has not been addressed yet.
>
>
>
Ok.

>>+ Won't there only be one instance of the configuration instance in
>>memory? It won't be duplicated per CrawlerServer -- just a reference to
>>the single instance?  Why then are we worried about XML DOM cost?  Won't
>>the configuration just be read into an object tree w/ an xml serializer
>>used creating the objects w/ periodic visits to the file on disk to
>>check mod time?  The serializer would do something like, if a value
>>exists at a certain node in the configuration hierarchy, a getter that
>>returns the value at that location in the hierarchy is returned, else,
>>we call super.  Or are we talking of doing per host an aggregation of
>>all xml snippets to write a new configuration file to feed the crawler
>>accessing a particular host?
>>
>>
>When I started writing the document the DOM was used as the internal
>data structure and XPaths were used to get the different settings. This
>is costly if the values should be looked up for every URI. These days
>the settings are kept in a HashMap after the initial lookup which
>greatly decreases the cost. But if we are working with an internal data
>structure distinct from the DOM, it shouldn't be much harder to use SAX
>instead of the DOM to build it. Then as mentioned before there could
>possibly be thousands of per host configuration files and then
>performance becomes an issue. I'd like the Configuration objects to
>represent a file on disk. If there is a configuration for a host the
>CrawlServer points to this. If there isn't, the CrawlServer will point
>to the configuration for its domain if present, or to the global
>configuration. When a module asks for a setting it will ask the
>configuration object which the CrawlServer references. Each configuration
>object knows its parent, so if the setting isn't there it will push the
>request up to the parent.
>
>
>
I see Processor inherits from XMLConfig.  On first reading, this looks a
little odd, though I don't know enough yet about how things work.  I'd
think we might want to move configuration out to classes that encapsulate
the mechanics of how configuration is done, hiding its XML nature, caching
of configuration (distributed?), etc.
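
A minimal sketch of the parent-fallback lookup described above (host, then
domain, then global), kept behind a plain class so callers never see XML.
The Settings class and the setting names are hypothetical, not existing
crawler code, and loading from disk is left out:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical settings holder: one instance per configuration file (global, domain or host). */
class Settings {
    private final Settings parent;  // null for the global configuration
    private final Map<String, String> values = new HashMap<>();

    Settings(Settings parent) {
        this.parent = parent;
    }

    /** Records an override at this level of the hierarchy. */
    void put(String key, String value) {
        values.put(key, value);
    }

    /** Returns the value set at this level, else asks the parent, else null. */
    String get(String key) {
        String value = values.get(key);
        if (value != null) {
            return value;
        }
        return parent == null ? null : parent.get(key);
    }
}
```

A CrawlServer would hold a reference to the most specific Settings that
exists for it, so a host without its own file costs nothing extra.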

>>+ Author, date and section numbering make it easier to refer to a
>>document and its sections.
>>
>>
>Agree.
>
>
>
>>+ Later we might use an ldap server implementation behind your
>>'Configuration' reader class to store config.  The hierarchical nature
>>of the configuration w/ a rare write would make it a natural match.
>>
>>
>That makes sense. You could even keep it in an RDBMS, but I think the
>default implementation shouldn't need external servers to be set up, so
>the concept of a hierarchy of XML files should probably be the default
>implementation.
>
>
>
Yes.  I agree.

St.Ack

>>Other comments:
>>
>>+ Is CrawlServer class to do w/ crawl accessing a particular server?
>>Maybe rename it Server?
>>+ I have comments on the xml -- no schema nor dtd and extensive use of
>>element attributes -- but they can go elsewhere.
>>
>>St.Ack
>>
>>
>>

#199 From: John Erik Halse <johnh@...>
Date: Thu Dec 11, 2003 12:03 am
Subject: Re: Checkpointing
johnerikhalse
 
Since Gordon wrote this mail, he has added a recover log which could be
used to "replay" the crawl. The actual replaying functionality isn't
added though.

The replaying approach is simple and by adding a little more information
to the recover log (timestamps) it should be possible to reconstruct the
statistics as well. For focused crawls this should work very well. But
with very broad crawls, the recover log grows to a size that is not easy
to handle. And if the crawler at some point in the future supports
infinite/incremental crawls, then the recover log will not work.
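
To make the replay idea concrete, here is a rough sketch that rebuilds the
set of pending URIs from a log. The "ADD <uri>" / "DONE <uri>" line format
and the class name are made up for illustration and are not the format of
the existing recover log:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: the "ADD <uri>" / "DONE <uri>" line format is made up for illustration. */
class RecoverLogReplay {

    /** Replays log lines in order and returns the URIs that were queued but never finished. */
    static Set<String> pendingAfterReplay(List<String> logLines) {
        Set<String> pending = new LinkedHashSet<>();  // keeps discovery order
        Set<String> finished = new HashSet<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue;  // blank or truncated line, e.g. cut off by the crash
            }
            String uri = parts[1];
            if (parts[0].equals("ADD")) {
                if (!finished.contains(uri)) {
                    pending.add(uri);  // re-adding a pending URI is a no-op
                }
            } else if (parts[0].equals("DONE")) {
                pending.remove(uri);
                finished.add(uri);
            }
        }
        return pending;
    }
}
```

Both sets live in memory here, which is exactly where a very broad crawl
would run into the size problem mentioned above.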

When thinking of different approaches for doing checkpoints I came up
with some questions that should be answered before we try to design
it.

* How often are we supposed to do a checkpoint (i.e. how costly is a
checkpoint allowed to be)?
If the checkpoints are very expensive, we could do a combination of
checkpoint and recovery log. The recovery log should then be reset at
every checkpoint.
* Is it ok to pause the crawler for a checkpoint? It might take some
time to wait for all the threads to finish. Is this acceptable?
* Is checkpointing just for recovering from crashes?
If not:
   - Should it be possible to manipulate queues in a suspended state? For
example adding or removing URIs in the pending queue.
   - Should it be possible to change implementation of modules between
suspend and resume? For example fixing bugs.
   - Should it be possible to alter the configuration in suspended state?
* Is it ok to insert a checkpoint mark in the working files or should
everything be copied to a safe location to make sure that a crash would
not corrupt files?

If we add the possibility of running a multiple-machine crawl, should the
checkpoint span all the crawler instances or should the checkpoint be
local to a single instance?


Comments/answers to these questions? Other questions that should be
asked?

John

On Wed, 2003-11-19 at 15:01, Gordon Mohr wrote:
> The top frustration during our recent evaluation crawl was
> that we don't yet have a working system for resuming a crawl
> in progress from disk-based state, aka "checkpointing".
>
> There are many ways we could remedy this shortcoming,
> some incremental, some comprehensive.
>
> Two extremes of checkpoint functionality would be "just enough
> to ensure coverage" and "total crawler state".
>
> In "just enough to ensure coverage", a resumption might not
> have internal state and running totals that closely match
> those that the crawler would have had, if not interrupted.
> However, it would be trusted to still visit every URI that
> the original crawler would have. (In some ways, this could be
> considered a "checkpoint Frontier only" or "checkpoint
> simplified view of Frontier".)
>
> In "total crawler state", a resumption would reliably reload
> the state of any component which accumulates information --
> such as the statistics tracker or a hypothetical postprocessor
> which tallies relative proportions of document features -- such
> that a resumption makes the crawler work exactly like it had
> never stopped.
>
> Several different approaches fall along a continuum of
> increasing sophistication at crawl/checkpoint time:
>
> A "forensic" approach would require almost no crawl-time
> support. A resume would simply look at the output generated
> by the crawler, essentially "replaying" the crawl at an
> accelerated rate (perhaps even repeating the extraction
> steps against ARCed resources), eventually winding up at
> a state closely mimicking that of the previously-suspended
> crawler. This requires little to no support at crawl/checkpoint
> time, but a sophisticated resumption routine.
>
> A "transaction log" approach would generate extra logs
> at crawl-time to assist in rapid resumption -- for example,
> all inserts to the frontier would be logged, to save the
> resumption from having to re-scan source material. This
> remains very straightforward, if we're only considering
> the matter of including/excluding URIs for visitation, and
> could have other debugging benefits.
>
> A "snapshot" approach would, at certain intervals or at
> operator request, dump some or all of the crawler's state
> to files. A resumption could occur quickly, and mimic the
> dumped state as accurately as we care to enable.
>
> ==
> There are merits and tradeoffs to each approach. My understanding
> is that Mercator implements a "total crawler state"/"snapshot"
> approach, giving every module a chance to persist itself to
> disk at checkpoint time.
>
> I think our initial implementation should be a "just enough for
> coverage"/"transaction log" approach, as this requires minimal
> invasiveness and effectively tackles the largest issue: unbroken
> coverage across crawler hiccups. We can re-synthesize any
> running statistics post-crawl.
>
> Later, we should enable the Mercator-style "total state"/"snapshot"
> approach.
>
> Agree or disagree? Comments? Other ideas?
>
> - Gordon
>
>
>
>

#200 From: Gordon Mohr <gojomo@...>
Date: Thu Dec 11, 2003 7:39 pm
Subject: Re: Checkpointing
gojomo
 
John Erik Halse wrote:
> When thinking of different approaches for doing checkpoints I came up
> with some questions that should be answered before we try to design
> it.
>
> * How often are we supposed to do a checkpoint (aka how costly is a
> checkpoint allowed to be).

For recovery from major crawl problems, the checkpoint resolution is
how much time/work will be lost when the crawl restarts. We'd like
to be able to do checkpoints every few hours on most focused crawls.
We might want them as infrequently as once per day on broad crawls.

> If the checkpoints are very expensive, we could do a combination of
> checkpoint and recovery log. The recovery log should then be reset at
> every checkpoint.

> * Is it ok to pause the crawler for a checkpoint? It might take some
> time to wait for all the threads to finish. Is this acceptable?

Some disruption/pause to full crawler throughput is inevitable.

An approach that has been recommended is to divide up the processing
of a CrawlURI into a first phase (including the potentially lengthy
network fetch), which has no persistent effect on the checkpointed
state (of the Frontier or other modules), and then a second phase,
which if started must be completed before a checkpoint occurs.

So the problem of waiting for all the threads to finish becomes in
fact just waiting for them to finish their critical second phases,
which are more likely to be guaranteed to finish in a bounded
amount of time. Also, progress can continue on first phase network
download activity -- indeed new URIs can even be begun during a
checkpoint, they just can't proceed to the second phase processing.

In our design, this might entail making some subset of the Processors
after the fetchers, including the ARCWriter and others with lasting
effect on in-memory structures and running statistics, into the critical
second phase.
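
One possible way to sketch this two-phase split is a read/write lock: each
worker takes the shared lock only for its second phase, and the checkpoint
takes the exclusive lock. CheckpointGate is a hypothetical name, and the
use of java.util.concurrent is an assumption of this sketch, not part of
the current design:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical gate separating state-changing work from checkpoints. */
class CheckpointGate {
    // Fair mode, so a waiting checkpoint is not overtaken indefinitely
    // by a steady stream of new second phases.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    /** Called by a worker after its fetch (phase one) has finished outside the gate. */
    void runSecondPhase(Runnable stateChangingWork) {
        lock.readLock().lock();  // many workers may be in phase two at once
        try {
            stateChangingWork.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Waits for in-flight second phases to finish and holds back new ones while saving. */
    void checkpoint(Runnable saveState) {
        lock.writeLock().lock();
        try {
            saveState.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Fetches never touch the gate, so network activity continues during a
checkpoint; only the hand-off into second-phase processing waits.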

> * Is checkpointing just for recovering from crashes?
> If not:
>   - Should it be possible to manipulate queues in a suspended state? For
> example adding or removing URIs in the pending queue.

Yes -- though perhaps this is just the same capability as we would want
for any paused crawl. (That is, the operator might not directly edit
state on disk, but rather (1) load checkpoint without restarting active
crawl; then (2) use other admin options to edit standing queues.)

>   - Should it be possible to change implementation of modules between
> suspend and resume? For example fixing bugs.

Definitely.

>   - Should it be possible to alter the configuration in suspended state

Yes -- though again this may just be the same capability as would
be wanted for any paused crawl, whether it has been completely
checkpointed or not.

> * Is it ok to insert a checkpoint mark in the working files or should
> everything be copied to a safe location to make sure that a crash would
> not corrupt files?

To be determined; I think some files which change in arbitrary ways
between checkpoints would need to be duplicated in full. A "safe location"
might just be another filename in the same working directory.
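
As an illustration of "another filename in the same working directory",
the sketch below writes the new state to a sibling temporary file and then
renames it over the old one. This is a common write-then-rename technique
that I am substituting here, not something the thread has settled on;
SafeStateWriter and the ".tmp" suffix are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: never leaves a half-written state file under the real name. */
class SafeStateWriter {

    static void write(Path target, String state) throws IOException {
        // Write the full contents under a temporary name in the same directory.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, state.getBytes(StandardCharsets.UTF_8));
        // Rename into place. Whether an existing target is replaced by an atomic
        // move is implementation-specific; POSIX rename does replace it.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A crash before the rename leaves the previous checkpoint file untouched.
The sketch does not force the data to disk before renaming, which a real
implementation would probably want.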

> If we add the possibility to run a multiple machine crawl; Should the
> checkpoint span all the crawler instances or should the checkpoint be
> local to a single instance?

It must span all instances to the extent required to prevent any URIs
from falling through the cracks or major discrepancies in expected
behavior between the original run and a subsequent resume-from-checkpoint.

For example, if cooperating crawler A sends a URI to B for crawling, then
makes checkpoint 0001, while B makes checkpoint 0001, then receives
the URI, there would be a problem upon resuming both from their respective
checkpoints 0001: A would think the URI was already handled, while B will
not have received it.

- Gordon

#201 From: Michael Stack <stack@...>
Date: Thu Dec 11, 2003 10:14 pm
Subject: Re: Checkpointing
stack@...
 
Is there a definition of the term 'checkpoint' anywhere?  A description
of how it currently works?

See also inline below.

Gordon Mohr wrote:

>John Erik Halse wrote:
>
>
>>When thinking of different approaches for doing checkpoints I came up
>>with some questions that should be answered before we try to design
>>it.
>>
>>* How often are we supposed to do a checkpoint (aka how costly is a
>>checkpoint allowed to be).
>>
>>
>
>For recovery from major crawl problems, the checkpoint resolution is
>how much time/work will be lost when the crawl restarts. We'd like
>to be able to do checkpoints every few hours on most focused crawls.
>We might want them as infrequently as once per day on broad crawls.
>
>
>>If the checkpoints are very expensive, we could do a combination of
>>checkpoint and recovery log. The recovery log should then be reset at
>>every checkpoint.
>>
>>
>
>
>
>>* Is it ok to pause the crawler for a checkpoint? It might take some
>>time to wait for all the threads to finish. Is this acceptable?
>>
>>
>
>Some disruption/pause to full crawler throughput is inevitable.
>
>An approach that has been recommended is to divide up the processing
>of a CrawlURI into a first phase (including the potentially lengthy
>network fetch), which has no persistent effect on the checkpointed
>state (of the Frontier or other modules), and then a second phase,
>which if started must be completed before a checkpoint occurs.
>
>So the problem of waiting for all the threads to finish becomes in
>fact just waiting for them to finish their critical second phases,
>which are more likely to be guaranteed to finish in a bounded
>amount of time. Also, progress can continue on first phase network
>download activity -- indeed new URIs can even be begun during a
>checkpoint, they just can't proceed to the second phase processing.
>
>In our design, this might entail making some subset of the Processors
>after the fetchers, including the ARCWriter and others with lasting
>effect on in-memory structures and running statistics, into the critical
>second phase.
>
>
>
Would be grand if we could minimize the number (or type) of
objects/processors that need to serialize.

>>* Is checkpointing just for recovering from crashes?
>>If not:
>>  - Should it be possible to manipulate queues in a suspended state? For
>>example adding or removing URIs in the pending queue.
>>
>>
>
>Yes -- though perhaps this is just the same capability as we would want
>for any paused crawl. (That is, the operator might not directly edit
>state on disk, but rather (1) load checkpoint without restarting active
>crawl; then (2) use other admin options to edit standing queues.)
>
>
>
>>  - Should it be possible to change implementation of modules between
>>suspend and resume? For example fixing bugs.
>>
>>
>
>Definitely.
>
>
>
We'd need to play w/ class-loaders to implement such a "hot-deploy"
feature so they'd check disk periodically, or when kicked, for new class
instances.

>>  - Should it be possible to alter the configuration in suspended state
>>
>>
>
>Yes -- though again this may just be the same capability as would
>be wanted for any paused crawl, whether it has been completely
>checkpointed or not.
>
>
>
>>* Is it ok to insert a checkpoint mark in the working files or should
>>everything be copied to a safe location to make sure that a crash would
>>not corrupt files?
>>
>>
>
>To be determined; I think some files which change in arbitrary ways
>between checkpoints would need to be duplicated in full. A "safe location"
>might just be another filename in the same working directory.
>
>
>
>>If we add the possibility to run a multiple machine crawl; Should the
>>checkpoint span all the crawler instances or should the checkpoint be
>>local to a single instance?
>>
>>
>
>It must span all instances to the extent required to prevent any URIs
>from falling through the cracks or major discrepancies in expected
>behavior between the original run and a subsequent resume-from-checkpoint.
>
>For example, if cooperating crawler A sends a URI to B for crawling, then
>makes checkpoint 0001, while B makes checkpoint 0001, then receives
>the URI, there would be a problem upon resuming both from their respective
>checkpoints 0001: A would think the URI was already handled, while B will
>not have received it.
>
>
>
Yeah.  Checkpointing would have to span the cluster.  Sounds like it'd
be nice if the checkpointing was transactional ("Are you ready to
checkpoint?", "Ok, commit").
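
A bare-bones sketch of that ready/commit exchange, assuming some
coordinator can call every crawler instance. The interface and class names
are hypothetical, and the sketch ignores failures between the two rounds:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of one crawler instance as seen by the coordinator. */
interface CheckpointParticipant {
    boolean prepare(String checkpointId);  // "Are you ready to checkpoint?"
    void commit(String checkpointId);      // "Ok, commit"
    void abort(String checkpointId);       // discard whatever prepare() staged
}

class ClusterCheckpoint {

    /** Returns true only if every crawler prepared and was then told to commit. */
    static boolean run(String checkpointId, List<CheckpointParticipant> crawlers) {
        List<CheckpointParticipant> prepared = new ArrayList<>();
        for (CheckpointParticipant crawler : crawlers) {
            if (!crawler.prepare(checkpointId)) {
                // One refusal cancels the round for everyone who already said yes.
                for (CheckpointParticipant p : prepared) {
                    p.abort(checkpointId);
                }
                return false;
            }
            prepared.add(crawler);
        }
        for (CheckpointParticipant crawler : prepared) {
            crawler.commit(checkpointId);
        }
        return true;
    }
}
```

For the A-sends-URI-to-B case above, prepare() would be the place where
each instance stops handing off URIs and confirms that everything already
sent has been received.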

St.Ack

>- Gordon
>
>

#202 From: Gordon Mohr <gojomo@...>
Date: Thu Dec 11, 2003 8:23 pm
Subject: Re: Checkpointing
gojomo
 
I should qualify the following:

Gordon Mohr wrote:
> John Erik Halse wrote:

>> If we add the possibility to run a multiple machine crawl; Should the
>> checkpoint span all the crawler instances or should the checkpoint be
>> local to a single instance?
>
>
> It must span all instances to the extent required to prevent any URIs
> from falling through the cracks or major discrepancies in expected
> behavior between the original run and a subsequent resume-from-checkpoint.
>
> For example, if cooperating crawler A sends a URI to B for crawling, then
> makes checkpoint 0001, while B makes checkpoint 0001, then receives
> the URI, there would be a problem upon resuming both from their respective
> checkpoints 0001: A would think the URI was already handled, while B will
> not have received it.

In practice, we might choose to accept this risk rather than tackle the
problem of checkpointing a whole cluster of machines in unison.

In very broad (or continuous) crawls, it's quite likely the same URI would be
discovered by other paths and retrieved. Even if it is not discovered, or is
not discovered quickly, missing it might not rise above the level of inherent
imperfection in any broad crawl, because...
   - the web changes even as it is being crawled
   - the crawler cannot follow every path, because..
      - time/bandwidth is limited
      - many paths are of vanishingly small value and/or infinite
      - other important volatile pages must be revisited

So this sort of error might be in the noise with regard to our development
priorities.

- Gordon

#203 From: Michael Stack <stack@...>
Date: Thu Dec 11, 2003 10:33 pm
Subject: Re: Checkpointing
stack@...
 
Ignore my last request for definition of checkpoint.  I see it below in
Gordon's original mail.

Comments inline below.

John Erik Halse wrote:

>Since Gordon wrote this mail, he has added a recover log which could be
>used to "replay" the crawl. The actual replaying functionality isn't
>added though.
>
>The replaying approach is simple and by adding a little more information
>to the recover log (timestamps) it should be possible to reconstruct the
>statistics as well. For focused crawls this should work very well. But
>with very broad crawls, the recover log grows to a size that is not easy
>to handle. And if the crawler sometime in the future will support
>infinite/incremental crawls, then the recover log will not work.
>
>
>When thinking of different approaches for doing checkpoints I came up
>with some questions that should be answered before we try to design
>it.
>
>* How often are we supposed to do a checkpoint (aka how costly is a
>checkpoint allowed to be).
>If the checkpoints are very expensive, we could do a combination of
>checkpoint and recovery log. The recovery log should then be reset at
>every checkpoint.
>
>
As you say above, sounds like crawler has to do a 'total crawler state'
checkpoint either on a period or on the crossing of a threshold -- size
of replay log, size of data crawled so far, etc. -- just so it can
safely throw away the recover log.
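
The trigger could be as small as the sketch below. CheckpointTrigger and
both thresholds are hypothetical; the actual limits would be a matter of
configuration:

```java
/** Hypothetical policy: checkpoint on a timer or when the recover log gets too big. */
class CheckpointTrigger {
    private final long maxIntervalMillis;
    private final long maxRecoverLogBytes;

    CheckpointTrigger(long maxIntervalMillis, long maxRecoverLogBytes) {
        this.maxIntervalMillis = maxIntervalMillis;
        this.maxRecoverLogBytes = maxRecoverLogBytes;
    }

    /** True when either threshold has been reached. */
    boolean isDue(long millisSinceLastCheckpoint, long recoverLogBytes) {
        return millisSinceLastCheckpoint >= maxIntervalMillis
                || recoverLogBytes >= maxRecoverLogBytes;
    }
}
```

After a successful full checkpoint the recover log can be reset, so its
size is bounded by the second threshold.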

>* Is it ok to pause the crawler for a checkpoint? It might take some
>time to wait for all the threads to finish. Is this acceptable?
>* Is checkpointing just for recovering from crashes?
>If not:
>  - Should it be possible to manipulate queues in a suspended state? For
>example adding or removing URIs in the pending queue.
>
>
Shouldn't we be able to do this anyways?  While the crawler is running?

>  - Should it be possible to change implementation of modules between
>suspend and resume? For example fixing bugs.
>  - Should it be possible to alter the configuration in suspended state
>* Is it ok to insert a checkpoint mark in the working files or should
>everything be copied to a safe location to make sure that a crash would
>not corrupt files?
>
>If we add the possibility to run a multiple machine crawl; Should the
>checkpoint span all the crawler instances or should the checkpoint be
>local to a single instance?
>
>
>
Can we list out what state needs to be saved on a 'total crawler state'
checkpoint?

>Comments/answers to these questions? Other questions that should be
>asked?
>
>John
>
>On Wed, 2003-11-19 at 15:01, Gordon Mohr wrote:
>
>
>>The top frustration during our recent evaluation crawl was
>>that we don't yet have a working system for resuming a crawl
>>in progress from disk-based state, aka "checkpointing".
>>
>>There are many ways we could remedy this shortcoming,
>>some incremental, some comprehensive.
>>
>>Two extremes of checkpoint functionality would be "just enough
>>to ensure coverage" and "total crawler state".
>>
>>In "just enough to ensure coverage", a resumption might not
>>have internal state and running totals that closely match
>>those that the crawler would have had, if not interrupted.
>>However, it would be trusted to still visit every URI that
>>the original crawler would have. (In some ways, this could be
>>considered a "checkpoint Frontier only" or "checkpoint
>>simplified view of Frontier".)
>>
>>In "total crawler state", a resumption would reliably reload
>>the state of any component which accumulates information --
>>such as the statistics tracker or a hypothetical postprocessor
>>which tallies relative proportions of document features -- such
>>that a resumption makes the crawler work exactly like it had
>>never stopped.
>>
>>

We might send out a checkpoint event.  Anyone interested -- the above
hypothetical postprocessor -- would have a listener out to receive the
event.
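
Roughly, as a sketch with hypothetical names (CheckpointListener and
CheckpointNotifier do not exist in the code today):

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Implemented by any module that has state worth saving at checkpoint time. */
interface CheckpointListener {
    void onCheckpoint(File checkpointDir);
}

/** Hypothetical event source; the crawl controller might own one of these. */
class CheckpointNotifier {
    private final List<CheckpointListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(CheckpointListener listener) {
        listeners.add(listener);
    }

    /** Calls every registered listener, in registration order. */
    void fireCheckpoint(File checkpointDir) {
        for (CheckpointListener listener : listeners) {
            listener.onCheckpoint(checkpointDir);
        }
    }
}
```

Modules with nothing to persist simply never register, which keeps down
the number of objects that need to serialize.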

>>Several different approaches fall along a continuum of
>>increasing sophistication at crawl/checkpoint time:
>>
>>A "forensic" approach would require almost no crawl-time
>>support. A resume would simply look at the output generated
>>by the crawler, essentially "replaying" the crawl at an
>>accelerated rate (perhaps even repeating the extraction
>>steps against ARCed resources), eventually winding up at
>>a state closely mimicking that of the previously-suspended
>>crawler. This requires little to no support at crawl/checkpoint
>>time, but a sophisticated resumption routine.
>>
>>
>>
A 'journaling' mechanism such as that described would be best but would
probably be hard to develop?

>>A "transaction log" approach would generate extra logs
>>at crawl-time to assist in rapid resumption -- for example,
>>all inserts to the frontier would be logged, to save the
>>resumption from having to re-scan source material. This
>>remains very straightforward, if we're only considering
>>the matter of including/excluding URIs for visitation, and
>>could have other debugging benefits.
>>
>>
>>

How does this differ from the 'forensic' approach?  Are you suggesting
here that we wouldn't replay the 'transaction' log, but that we'd just
jump to the last complete transaction in the log and resume there?

>>A "snapshot" approach would, at certain intervals or at
>>operator request, dump some or all of the crawler's state
>>to files. A resumption could occur quickly, and mimic the
>>dumped state as accurately as we care to enable.
>>
>>

We should do this too.   In the UI, somehow you could force a 'total
crawler state' checkpoint.

St.Ack

>>==
>>There are merits and tradeoffs to each approach. My understanding
>>is that Mercator implements a "total crawler state"/"snapshot"
>>approach, giving every module a chance to persist itself to
>>disk at checkpoint time.
>>
>>I think our initial implementation should be a "just enough for
>>coverage"/"transaction log" approach, as this requires minimal
>>invasiveness and effectively tackles the largest issue: unbroken
>>coverage across crawler hiccups. We can re-synthesize any
>>running statistics post-crawl.
>>
>>Later, we should enable the Mercator-style "total state"/"snapshot"
>>approach.
>>
>>Agree or disagree? Comments? Other ideas?
>>
>>- Gordon
>>
>>
>>
>>

#204 From: Michael Stack <stack@...>
Date: Thu Dec 11, 2003 10:43 pm
Subject: Re: Re: [Fwd: Second draft of per host settings document]
stack@...
 
Gordon Mohr wrote:

>Michael Stack wrote:
>
>
>>Proposal looks good to me.  Model reminds me of apache '.htaccess'
>>file scheme where you can insert directory specific config. to override
>>the core httpd.conf.  I like it.  Below are some general
>>comments/questions:
>>
>>+ Does the order file change as you do a crawl?  Where is state kept?
>>
>>
>
>Originally, the order file did not change: it was only edited
>(by hand or the web UI) before the crawl began, and then consulted at
>the start of a crawl to fill in per-instance variables holding chosen
>values.
>
>Recently, Kris made several changes:
>  (1) Most places where order settings are used, the order object (or
>      portion thereof) is directly consulted each time the value is
>      needed -- rather than only at startup -- so that changes during
>      a crawl can have an effect.
>  (2) The crawl can be paused in memory, and various crawl-order fields
>      changed during the pause. The updated order is written to disk,
>      and the new values should affect the crawl when it is resumed.
>
>
>(I'm not sure everything behaves as might be desirable for all the
>crawler objects which inherit from XMLConfig, and retain direct
>references to their 'home node' inside the overall order DOM.)
>
>
>
I'm guessing they don't.

....

>I suspect only changes of settings through the web UI, or some other
>manual kick/flush of existing settings, would cause a refresh of
>settings held in memory.
>
>
>
That sounds grand.

>>+ What are the plans for having multiple crawler instances working off
>>the one Frontier instance?  If the frontier makes a CrawlServer to
>>associate w/ a URI, how will we ensure that one crawler instance does
>>one server only (Maybe frontier can do distribution between crawlers).
>>
>>
>
>Not yet considered in depth. It's quite likely that when independent
>crawlers are cooperating, they'll each have a firm idea of which
>remote servers (host:port combos) they are to crawl, and which they
>should delegate to their siblings.
>
>
>
So there'll be a Controller -- the Frontier? -- that hands out the work,
but it sounds like the crawlers will need to communicate (e.g. for
checkpointing).

>>+ Won't there only be one instance of the configuration instance in
>>memory? It won't be duplicated per CrawlerServer -- just a reference to
>>the single instance?  Why then are we worried about XML DOM cost?  Won't
>>the configuration just be read into an object tree w/ an xml serializer
>>used creating the objects w/ periodic visits to the file on disk to
>>check mod time?  The serializer would do something like, if a value
>>exists at a certain node in the configuration hierarchy, a getter that
>>returns the value at that location in the hierarchy is returned, else,
>>we call super.  Or are we talking of doing per host an aggregation of
>>all xml snippets to write a new configuration file to feed the crawler
>>accessing a particular host?
>>
>>
>
>If many domains/hosts/servers have their own custom settings, and
>each settings file is XML, and that XML is always loaded into a
>generic DOM and then referenced by the CrawlServer instance, the DOM
>overhead might be a concern.
>
>
>
For sure.  We can avoid keeping settings in a DOM, and in particular in a
generic DOM.

St.Ack

#205 From: Michael Stack <stack@...>
Date: Thu Dec 11, 2003 10:46 pm
Subject: On JMX: [Fwd: Unable to deliver your message]
stack@...
 
My mail sent this morning was rejected by yahoo groups because I sent it
from home.  I'm resending, though the below is missing the gifs that were
the main reason I added the attachment.  I can show you the gifs on my
machine, John, if you're interested.

St.Ack
We are unable to deliver the message from <stack@...>
to <archive-crawler@yahoogroups.com>.

The email address used to send your message is not subscribed to this
group. If you are a member of this group, please be aware that you may
only send messages to this group using the email address(es) you have
registered with Yahoo! Groups.  Yahoo! Groups allows you to send messages
using the email address you originally used to register, or an alternate
email address you specify in your personal settings.

If you would like to subscribe to this group:
1. visit
    http://groups.yahoo.com/group/archive-crawler/join
-OR-
2. send email to archive-crawler-subscribe@yahoogroups.com

If you would like to specify an alternate email address:
1. visit
    http://groups.yahoo.com/myprefs?edit=2
2. type your alternate email address in the area labeled "Alternate
    posting addresses".
3. click the "Save Changes" button
4. wait approximately 10 minutes for the change to take effect

After you follow these steps, you will be able to send messages
to all your groups using this alternate email address.

For further assistance, please email support@yahoogroups.com
or visit http://help.yahoo.com/help/us/groups/
John:

When I suggested JMX, I was thinking of the screens on the bottom of
this page -- how we'd have an admin interface for free and of how the
MBean server would manage instantiation of beans.  Per-host
configuration would probably be done by making Processors whose
attributes can be changed on a per-host basis implement the Dynamic
MBean interface.
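
To give a feel for the amount of code involved, below is a rough sketch of
a Dynamic MBean whose attributes are string-valued settings. PerHostSettings
and the idea of one bean per host are assumptions made for illustration;
only the javax.management types are real:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.ReflectionException;

/** Hypothetical per-host settings bean; attribute names come from the defaults map. */
class PerHostSettings implements DynamicMBean {
    private final Map<String, String> settings = new ConcurrentHashMap<>();

    PerHostSettings(Map<String, String> defaults) {
        settings.putAll(defaults);
    }

    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        String value = settings.get(name);
        if (value == null) {
            throw new AttributeNotFoundException(name);
        }
        return value;
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        if (!settings.containsKey(attribute.getName())) {
            throw new AttributeNotFoundException(attribute.getName());
        }
        settings.put(attribute.getName(), String.valueOf(attribute.getValue()));
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        AttributeList found = new AttributeList();
        for (String name : names) {
            String value = settings.get(name);
            if (value != null) {
                found.add(new Attribute(name, value));
            }
        }
        return found;
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        AttributeList applied = new AttributeList();
        for (Attribute attribute : attributes.asList()) {
            try {
                setAttribute(attribute);
                applied.add(attribute);
            } catch (AttributeNotFoundException e) {
                // Unknown names are skipped; only applied attributes are returned.
            }
        }
        return applied;
    }

    @Override
    public Object invoke(String action, Object[] params, String[] signature)
            throws ReflectionException {
        // This sketch exposes attributes only, no operations.
        throw new ReflectionException(new NoSuchMethodException(action));
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attributes = settings.keySet().stream().sorted()
                .map(key -> new MBeanAttributeInfo(
                        key, "java.lang.String", "crawl setting", true, true, false))
                .toArray(MBeanAttributeInfo[]::new);
        return new MBeanInfo(getClass().getName(), "Per-host settings (sketch)",
                attributes, null, null, null);
    }
}
```

One such bean per host would then be registered with the MBean server via
registerMBean under an ObjectName of our choosing, for example something
like "crawler:type=Settings,host=..." (a made-up naming scheme).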

As to MBean initial state and serialization of state, there is no canned
soln. if we use the Reference Implementation; we'd have to develop it
ourselves:  On construction a bean would pick up its config. from disk
-- a config per bean (JBoss which at its core is nought but an MBean
server has such a system).

JMX looks like it'd be a large development investment that could be
overkill considering where we're currently at.

St.Ack

Java™ Management Extensions
By J. Steven Perry
Table of Contents
Chapter 1.  Java Management Extensions Concepts


1.3 The Sample Producer/Consumer Application

In the remainder of this chapter, we will build and run a sample application that demonstrates each MBean instrumentation approach. The sections that follow look at the design of the application, where to obtain the source code, how to actually build and run the application, and how to monitor the application via a web browser.

1.3.1 Design

In this section, we will take a look at how the sample application is designed, so that you can better understand what is going on when you see it run. First, we will look at the pattern that is fundamental to the application's design. Then we will see how the pattern is implemented and what classes constitute the source code for the application.

The design pattern used in the application is a monitor. A monitor is a construct that coordinates activity between multiple threads in the system. In this pattern, the monitor coordinates activities between two categories of threads: producer threads and consumer threads. As you might imagine, a producer thread provides something that the consumer uses. That "something" is generically defined as a unit of work. This can be physically realized as anything relevant to a problem that is solved by this pattern.

For example, the unit of work might be an email message that is sent to the email system (the monitor) by the producer (an email client) and removed by the consumer (some agent on the incoming email server side). The producer might perform additional processing on the message before sending it to the email system, such as checking the spelling. By the same token, the consumer may perform additional processing of the message after removing it from the queue, such as applying an anti-virus check. For this reason, we will refer to the pattern as "value-added producer/consumer." This pattern is shown in UML notation in Figure 1-9.

Figure 1-9. UML diagram showing the "value-added producer/consumer" pattern
figs/jmx_0109.gif

As you can see in Figure 1-9, the producer and consumer are separated (decoupled) by the monitor. This pattern is best applied to systems that are inherently asynchronous in nature, where the producer and consumer are decoupled by varying degrees. This decoupling can be a separation of location as well as of synchronicity.

The implementation of the value-added producer/consumer pattern is shown in Figure 1-10. The classes in the diagram are implemented as Java classes. The stereotypes shown in the diagram are named according to the pattern shown in Figure 1-9.

Figure 1-10. UML diagram showing the implementation of the pattern in the form of the application
figs/jmx_0110.gif

Basic is the base class for all of the classes that make up the implementation (with the exception of WorkUnit, which represents the unit of work that is exchanged between Supplier and Consumer). Controller is a class that acts as the JMX agent and is responsible for creating the producer and consumer threads that run inside the application. Queue is a thread-safe queue that acts as the monitor. Producer threads place items in the queue in a thread-safe way, and consumer threads remove them. Worker is the base class for Supplier and Consumer, because much of their behavior is common.
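The book's Queue source is not reproduced here, so the following is only a minimal sketch of a thread-safe queue acting as a monitor, assuming a hypothetical MonitorQueueSketch class built on Java's intrinsic locks. The accumulated wait times mirror the AddWaitTime and RemoveWaitTime attributes discussed later in the section; everything else is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded queue; producers block when it is full, consumers when it is empty.
class MonitorQueueSketch<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;
    private long addWaitMillis;     // total time producers spent blocked
    private long removeWaitMillis;  // total time consumers spent blocked

    MonitorQueueSketch(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    synchronized void add(T item) throws InterruptedException {
        if (items.size() >= capacity) {
            long start = System.currentTimeMillis();
            // Loop guards against spurious wakeups and competing producers.
            while (items.size() >= capacity) wait();
            addWaitMillis += System.currentTimeMillis() - start;
        }
        items.addLast(item);
        notifyAll(); // wake any consumer waiting for an item
    }

    synchronized T remove() throws InterruptedException {
        if (items.isEmpty()) {
            long start = System.currentTimeMillis();
            while (items.isEmpty()) wait();
            removeWaitMillis += System.currentTimeMillis() - start;
        }
        T item = items.removeFirst();
        notifyAll(); // wake any producer waiting for free space
        return item;
    }

    synchronized long getAddWaitTime() { return addWaitMillis; }

    synchronized long getRemoveWaitTime() { return removeWaitMillis; }
}
```

Both methods call notifyAll() rather than notify() because producers and consumers wait on the same lock, and waking only one thread could wake the wrong kind.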

In the sample application, the following resources can be managed:

  • Controller

  • Queue

  • Supplier

  • Consumer

I encourage you to look at the source code to see exactly what attributes and operations are on each of the management interfaces for these resources.

1.3.2 Source Code

The source code for the application is standalone with respect to each type of instrumentation approach. There are three versions of the application, each in its own package. The name of the package corresponds to the instrumentation approach. For example, with the exception of common classes such as GenericException, the application source code for standard MBeans is entirely contained in the standard package; thus, if you install the source code to c:\jmxbook, the path to the application source code for standard MBeans will be c:\jmxbook\sample\standard. All of the source code shares the contents of the exception package. Other than that, however, the application can be built and run independently of the other packages.

For each type of MBean, there is a Windows batch file and a Unix (Korn shell) script that builds and runs the code for that instrumentation strategy. The name of the script or batch file matches the instrumentation strategy: for example, the shell script for dynamic MBeans is called dynamic.sh, and the corresponding batch file is called dynamic.bat. The major differences between the application versions are in the source code. The console output and the management view will show very little difference (other than output from the Ant build script) between the versions of the application.

1.3.3 Building and Running the Application

Before you can build and run the sample application (see Section P.5 in the Preface for details on how to obtain the application's source code), you must download the JMX RI and Jakarta Ant. For this book, I used JMX RI 1.0.1 and Ant 1.4. You can obtain the JMX RI at http://java.sun.com/products/JavaManagement/ and Jakarta Ant at http://jakarta.apache.org/ant/index.html.

The name of the build file Ant uses to build the application for all of the instrumentation strategies is build.xml. The build scripts are designed to work with very little modification on your part. However, you may have to modify either the build script or the Ant build file, depending on where you installed the JDK, the JMX RI, and Ant itself. Example 1-5 shows an excerpt from build.xml.

Example 1-5. Selected portions of the Ant build file for the application, build.xml
.
.
.
<project name="jmxbook" default="standard" basedir=".">
  
<!-- Set global properties -->
<property name="source_root" value="c:\jmxbook\sample"/>
<property name="jmx_home" value="c:\jmx1.0.1"/>
  
<path id="project.general.class.path">
  <pathelement path="${jmx_home}\jmx\lib\jmxri.jar"/>
  <pathelement path="${jmx_home}\jmx\lib\jmxtools.jar"/>
  <pathelement path="."/>
</path>
  
<!-- Build the init target -->
<target name="init">
  <!-- create the time stamp -->
  <tstamp>
    <format property="build.start.time" pattern="MM/dd/yyyy hh:mm:ss aa"/>
  </tstamp>
  <echo message="Build started at ${build.start.time}..."/>
</target>
  
<!-- Build the exception target -->
<target name="build-exception" depends="init">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="exception\*"/>
  </javac>
</target>
  
<!-- Build the "standard" target -->
<target name="build-standard" depends="build-exception">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="standard\*"/>
  </javac>
</target>
  
<!-- Build the "dynamic" target -->
<target name="build-dynamic" depends="build-exception">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="dynamic\*"/>
  </javac>
</target>
  
<!-- Build the "model" target -->
<target name="build-model" depends="build-exception">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="model\*"/>
  </javac>
</target>
.
.
.
</project>

As you can see, the Ant build file is an XML document. This is what sets Ant apart from other build utilities, such as make. Each component to be built using Ant is called a target. A target may depend on one or more other targets that must be built first, each of which may in turn depend on other targets, and so on. Ant resolves these dependencies for you. A target is specified in an Ant build file as an XML tag called target and has the following format:

<target name="mytarget" depends="d1,d2">

in which case mytarget depends on targets d1 and d2, or:

<target name="mytarget">

if mytarget does not depend on any other targets. Let's look at the build-standard target from Example 1-5:

<!-- Build the "standard" target -->
<target name="build-standard" depends="build-exception">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="standard\*"/>
  </javac>
</target>

You can see that the build-standard target depends on the build-exception target. Ant knows that there may be other dependencies, so it looks at build-exception:

<!-- Build the exception target -->
<target name="build-exception" depends="init">
  <javac>
    <classpath refid="project.general.class.path"/>
    <src path="${source_root}"/>
    <include name="exception\*"/>
  </javac>
</target>

and notices that build-exception depends on init. Ant then looks at init:

<target name="init">
  <!-- create the time stamp -->
  <tstamp>
    <format property="build.start.time" pattern="MM/dd/yyyy hh:mm:ss aa"/>
  </tstamp>
  <echo message="Build started at ${build.start.time}..."/>
</target>

Ant sees that init has no dependencies, so it begins the build. init is built first, followed by build-exception and finally build-standard. Notice the javac tag within build-standard and build-exception. This is known as an Ant task. A task is a Java class that executes within the JVM in which Ant is running (unless you tell Ant to fork a new process when executing the task). The javac task invokes the Java compiler. The classpath, src, and include tags nested within the javac task tell the Java compiler what the CLASSPATH is, the root location of the .java files, and the packages (directories) to compile, respectively.
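The ordering just described can be sketched as a depth-first walk over the depends attributes. This is a hedged illustration, not Ant's actual implementation; the class TargetOrderSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: computes the order in which targets would be built.
class TargetOrderSketch {
    // dependsOn maps a target name to the targets named in its depends attribute.
    static List<String> buildOrder(Map<String, List<String>> dependsOn, String target) {
        List<String> order = new ArrayList<>();
        visit(target, dependsOn, new HashSet<String>(), new HashSet<String>(), order);
        return order;
    }

    private static void visit(String target, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(target)) return; // already scheduled, build it only once
        if (!inProgress.add(target)) {
            throw new IllegalStateException("circular dependency at " + target);
        }
        List<String> deps = dependsOn.get(target);
        if (deps == null) deps = Collections.emptyList();
        for (String dep : deps) {
            visit(dep, dependsOn, done, inProgress, order); // dependencies come first
        }
        inProgress.remove(target);
        done.add(target);
        order.add(target);
    }
}
```

Given the three targets from Example 1-5, asking for build-standard yields init, then build-exception, then build-standard.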

The application classes for each chapter in this book are built and run using either a batch file or a shell script. If you are running the application on Windows (as I did to produce the screen shots for this chapter), use the batch file (i.e., the .bat file). If you are running the application on Unix, use the shell script (i.e., the .sh file). Throughout the rest of this chapter, the examples will be Windows-based. There are two reasons for this. First, because of the popularity of Windows, it is likely that most developers will be running this operating system. Second, the differences in the behavior of the application when it is run on Windows versus Unix are negligible.

To build and run the application, type in the name of the batch file you want to run, based on the type of MBean instrumentation strategy you want to see in action. Regardless of the instrumentation strategy, you will notice no detectable difference in either the output of the build/run batch file or what you see in your browser (discussed in the next section). Suppose we want to run the standard MBean batch file, which will build and run the application as standard MBeans. Example 1-6 shows the batch file that builds the application.

Example 1-6. standard.bat, the batch file that builds the application as standard MBeans
@set TARGET_NAME=build-standard
@set JAVA_HOME=c:\jdk1.3.1
@set ANT_VERSION=1.4
@set ANT_HOME=c:\ant%ANT_VERSION%
  
@echo Starting Build ...
  
call %ANT_HOME%\bin\ant %TARGET_NAME%
  
if NOT "%ERRORLEVEL%"=="0" goto DONE
  
%JAVA_HOME%\bin\java sample.standard.Controller 100 150
  
:DONE

This batch file is very simple. Aside from setting a few environment variables, it does only two things: it builds the application by calling Ant, and, if that succeeds, it starts the application. Figure 1-11 shows the output of running the batch file. Recall our earlier discussion of how Ant resolves target dependencies; you'll see that the targets are built in the order described there.

Figure 1-11. Running the build/run batch file for standard MBeans
figs/jmx_0111.gif

All of the batch files (standard.bat, dynamic.bat, and model.bat) operate as described below, but I've used standard.bat here for the purposes of illustration.

In each version of the application, Controller contains the main( ) method that starts the producer and consumer threads and is itself an MBean that can be managed and monitored. There are two command-line arguments to Controller's main( ) method: the work factor for the producer thread and the work factor for the consumer thread. Notice that in standard.bat values of 100 and 150, respectively, are specified for these arguments. I set these values for a reason: it is unlikely that you will find an application of the value-added producer/consumer pattern where the producer and consumer perform an equal amount of work. These command-line parameters to Controller allow you to simulate this asymmetry. When Controller is started, one producer thread and one consumer thread are created. However, Controller has a management method that allows you to start additional threads to balance out the workload (we will see how to do this later).

Figure 1-10 illustrates the relationship between the various classes in the application, where there is a single Queue object into which Supplier threads place WorkUnit objects and from which Consumer threads remove them. For a single unit of work, here is the flow of control:

  1. The Supplier performs an amount of work N—where N is specified on the command line to Controller—and places a single WorkUnit object into the Queue.

  2. The Consumer removes a single WorkUnit object from the Queue and performs an amount of work M—again, where M is specified on the command line to Controller.

These steps are repeated for each work unit.

The work that is performed by Supplier and Consumer threads is to calculate prime numbers. The amount of work specified on the command line to Controller is the number of prime numbers to calculate for each WorkUnit. The Supplier calculates N primes, then places a WorkUnit object into the Queue. The Consumer removes a WorkUnit object from the Queue and then calculates M primes.
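The book's worker code is not shown in this section, so the following is only a sketch of what "calculate N primes" could look like, assuming it means finding the first N primes by trial division. The class PrimeWorkSketch and the method firstPrimes are hypothetical names.

```java
// Hypothetical stand-in for the per-WorkUnit computation.
class PrimeWorkSketch {
    // Returns the first `count` primes, found by trial division against earlier primes.
    static int[] firstPrimes(int count) {
        if (count < 0) throw new IllegalArgumentException("count must not be negative");
        int[] primes = new int[count];
        int found = 0;
        for (int candidate = 2; found < count; candidate++) {
            boolean isPrime = true;
            // Only primes up to the square root of the candidate need checking.
            for (int i = 0; i < found && (long) primes[i] * primes[i] <= candidate; i++) {
                if (candidate % primes[i] == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) primes[found++] = candidate;
        }
        return primes;
    }
}
```

Under this reading, a Supplier with a work factor of 100 would call firstPrimes(100) before each add to the Queue, and a Consumer with a work factor of 150 would call firstPrimes(150) after each remove.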


This section looked at how to run the sample application and briefly discussed what it is doing internally to simulate the production and consumption of units of work. I strongly encourage you to examine the source code for yourself to see the various attributes and operations available on the management interfaces of each resource in the application.

In the next section, we will look at how to use a web browser to monitor and manage the sample application's MBeans.

1.3.4 Monitoring and Managing the Application

Once the application is running, you can point your web browser to port 8090 (the default—you can change this, but if you do so, remember to point your browser to the new port number). Figure 1-12 shows the result of pointing my web browser (which happens to be Internet Explorer) to port 8090 after running standard.bat.

Figure 1-12. The management view of the application in Internet Explorer
figs/jmx_0112.gif

Remember the work factors that we specified on the command line to Controller for the producer and consumer threads? Because they are different (100 and 150, respectively), and the producer thread does less work than the consumer thread for each work unit, I expect the Queue to always be full once the application reaches a steady state.

If I click on the Queue MBean in my browser, I see the screen shown in Figure 1-13. There are several interesting things about Figure 1-13. First, the AddWaitTime attribute is much larger than the RemoveWaitTime attribute. After processing 72 units of work (according to the NumberOfItemsProcessed attribute), the Supplier thread has waited a total of 3,421 milliseconds to add items to the Queue because it was full, whereas the Consumer thread has not had to wait at all to remove items (although, depending on which thread actually starts first, you may see a small amount of Consumer wait time). This is pretty much what we would expect, as the Supplier thread does only two-thirds the work of the Consumer thread.

Figure 1-13. The management view of the Queue object
figs/jmx_0113.gif

Suppose we want to start another Consumer thread to pick up some of the slack of the other Consumer thread and balance things out a bit. For the moment, let's ignore the fact that we can control the amount of work each type of Worker thread can perform. In a real-world application, we would not have that luxury. As I mentioned earlier in this chapter, Controller acts as the JMX agent for the application, but it is also itself a managed resource (i.e., an MBean). If we look at the management interface of Controller, we'll see that there is a management operation to start new Worker threads, called createWorker( ). Figure 1-14 shows the management view of the Controller MBean and its createWorker( ) operation.

Figure 1-14. The management view of Controller showing the createWorker( ) operation
figs/jmx_0114.gif

There are two parameters to createWorker( ): the first is a string that contains the worker type, and the second is the work factor that worker is to have (i.e., the number of primes calculated per unit of work). The valid values for the worker type are "Supplier" and "Consumer". We want to create a new Consumer thread with the same work factor as the currently running Consumer thread, so we set these parameters to Consumer and 150, respectively. Once we have entered the parameters for the management operation into the text boxes, as shown in Figure 1-14, we click the createWorker button to invoke the management operation. If the operation succeeds, we will see a screen that looks like Figure 1-15.
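The same operation can also be invoked programmatically instead of through the browser. The following is a hedged sketch, not the book's Controller: it assumes a hypothetical standard MBean with a createWorker(String, int) operation and an invented WorkerCount attribute, and it records the request rather than starting a thread. It uses an in-process MBeanServer from the JDK's javax.management package instead of the HTML adaptor shown in the figures.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

class CreateWorkerSketch {
    // Standard MBeans pair a class X with a public management interface named XMBean.
    public interface ControllerMBean {
        void createWorker(String workerType, int workFactor);
        int getWorkerCount(); // exposed as the attribute "WorkerCount" (invented here)
    }

    public static class Controller implements ControllerMBean {
        private final List<String> workers = new ArrayList<>();

        @Override
        public synchronized void createWorker(String workerType, int workFactor) {
            if (!"Supplier".equals(workerType) && !"Consumer".equals(workerType)) {
                throw new IllegalArgumentException("unknown worker type: " + workerType);
            }
            // A real controller would start a thread here; the sketch only records it.
            workers.add(workerType + ":" + workFactor);
        }

        @Override
        public synchronized int getWorkerCount() {
            return workers.size();
        }
    }

    // Registers a Controller, invokes createWorker("Consumer", 150), returns the count.
    static int demo() throws Exception {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        ObjectName name = new ObjectName("sketch:type=Controller"); // assumed name
        server.registerMBean(new Controller(), name);
        server.invoke(name, "createWorker",
                new Object[] {"Consumer", 150},
                new String[] {String.class.getName(), int.class.getName()});
        return (Integer) server.getAttribute(name, "WorkerCount");
    }
}
```

The signature array passed to invoke names the parameter types, which is how the MBean server selects the operation to call.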

Figure 1-15. The screen we see once createWorker( ) has successfully been invoked
figs/jmx_0115.gif

We would now expect that activity in the Queue has balanced out somewhat, and we would expect to start seeing the Consumer threads wait, as we now have two Consumer threads at work. Figure 1-16 shows the management view of the Queue after we start the second Consumer thread.

Figure 1-16. The management view of the Queue after starting a second Consumer thread
figs/jmx_0116.gif

Notice that after processing 1,013 units of work (as we see from the NumberOfItemsProcessed attribute), the Consumer threads have waited nearly 7 times as long as the Supplier thread. Through the use of management operations, we can give an operator at a management console the ability to tune our application at runtime.



    URL http://safari.oreilly.com/0596002459/javamngext-CHP-1-SECT-3

     

    Copyright © 2003 O'Reilly & Associates, Inc. All rights reserved.
     

    #206 From: Gordon Mohr <gojomo@...>
    Date: Thu Dec 11, 2003 10:47 pm
    Subject: Re: Checkpointing
    gojomo
     
    Michael Stack wrote:
    
    > Is there definition of term checkpoint anywhere?  A description of how
    > it currently works?
    
    Heritrix currently has no checkpointing facility. However, our conception
    of what we need is heavily influenced by what Mercator provided. In one
    of the papers on Mercator, it is described this way:
    
    # Checkpointing is an important part of any long-running process
    # such as a web crawl. By checkpointing we mean writing a
    # representation of the crawler’s state to stable storage that, in
    # the event of a failure, is sufficient to allow the crawler to
    # recover its state by reading the checkpoint and to resume crawling
    # from the exact state it was in at the time of the checkpoint. By
    # this definition, in the event of a failure, any work performed
    # after the most recent checkpoint is lost, but none of the work up
    # to the most recent checkpoint. In Mercator, the frequency with
    # which the background thread performs a checkpoint is
    # user-configurable; we typically checkpoint anywhere from 1 to 4
    # times per day.
    
    (http://citeseer.nj.nec.com/najork01highperformance.html)
    
    >>> - Should it be possible to change implementation of modules between
    >>>suspend and resume? For example fixing bugs.
    >>
    >>Definitely.
    >
    > We'd need to play w/ class-loaders to implement such a "hot-deploy"
    > feature so they'd check disk on a period or when kicked for new class
    > instances.
    
    Not with resume-from-checkpoint: we can quit the VM, relaunch it
    with new classes, and then load the checkpoint, avoiding the need
    for any class-loading trickiness.
    
    To the extent we use Java serialization to capture the state, we
    will need to be careful about serialVersionUIDs and class
    evolution.
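To make the serialization point concrete, here is a minimal sketch under stated assumptions. CheckpointSketch, CrawlState, and their fields are hypothetical names, not Heritrix classes. The state class pins an explicit serialVersionUID so that a checkpoint written before a compatible class change (such as an added field) can still be read after the VM is relaunched with new classes.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class CheckpointSketch {
    // Hypothetical crawler state; a real crawler would hold far more.
    static class CrawlState implements Serializable {
        // Explicit UID: without it, recompiling a changed class can yield a
        // different default UID and make older checkpoints unreadable.
        private static final long serialVersionUID = 1L;
        List<String> pendingUris = new ArrayList<>();
        long fetchedCount;
    }

    // Writes to a temporary file first so a crash mid-write leaves the old checkpoint intact.
    static void write(CrawlState state, File file) throws IOException {
        File tmp = new File(file.getPath() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject(state);
        }
        Files.move(tmp.toPath(), file.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }

    static CrawlState read(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (CrawlState) in.readObject();
        }
    }
}
```

Incompatible changes, such as changing a field's type, would still break old checkpoints even with a fixed UID, so class evolution needs care either way.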
    
    - Gordon

    #207 From: Michael Stack <stack@...>
    Date: Fri Dec 12, 2003 5:29 pm
    Subject: Re: On JMX: [Fwd: Unable to deliver your message]
    stack@...
     
    Here's a PDF version of the page I tried to include below. It has the missing
    GIFs.
    St.Ack
    
    Michael Stack wrote:
    
    >My mail sent this morning was rejected by yahoo groups because I sent it
    >from home.  I'm resending though the below is missing gifs, the main
    >reason I added the attachment.  I can show you the gifs on my machine
    >John if you're interested.
    >
    >St.Ack
    >
    > ------------------------------------------------------------------------
    >
    > Subject:
    > Unable to deliver your message
    > From:
    > Yahoo! Groups <notify@yahoogroups.com>
    > Date:
    > 11 Dec 2003 19:51:41 -0000
    > To:
    > stack@...
    >
    >
    >We are unable to deliver the message from <stack@...>
    >to <archive-crawler@yahoogroups.com>.
    > ------------------------------------------------------------------------
    >
    > <http://safari.oreilly.com/>
    >
    > <http://oreilly.com/>  <http://oreillynet.com/>
    > <http://conferences.oreilly.com/>
    >
    > Home
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=Home&sortKey=rank&sortOrder=desc&v\
    iew=book&xmlid=&open=false&g=&srchText=&code=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=1&c\
    =1&u=1&r=&o=1>
    >  My Safari
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=MySafari&sortKey=rank&sortOrder=de\
    sc&view=&xmlid=&open=false&g=&srchText=&code=&h=&m=0&l=1&catid=&s=1&b=1&f=1&t=1&\
    c=1&u=1&r=&o=1>
    >  My Account
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=MyAccount&sortKey=rank&sortOrder=d\
    esc&view=&xmlid=&open=false&g=&srchText=&code=&h=&m=0&l=1&catid=&s=1&b=1&f=1&t=1\
    &c=1&u=1&r=&o=1>
    >  Logout <http://safari.oreilly.com/JVXSL.asp?mode=Logout>
    >
    > 
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1-SECT-3&open=false&g=&srchText=j\
    ava+jmx&code=&h=&m=&l=1&catid=&s=0&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    > Current Book
    >
    > Code Fragments only
    > Advanced Search
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=search&sortKey=rank&sortOrder=desc\
    &view=book&xmlid=&open=false&g=&srchText=&code=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=1\
    &c=1&u=1&r=&o=1>
    >
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1-SECT-3&open=false&g=&srchText=j\
    ava+jmx&code=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=0&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9&open=false&g=&srchText=&code=&h=&m=&l=1&catid=&s\
    =1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Java™ Management Extensions
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9&open=false&g=&srchText=&code=&h=&m=&l=1&catid=&s\
    =1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >  Copyright
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/preface_1_0&open=true&g=&srchText=&code=&h=&m=&l\
    =1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-PREF&open=true&g=&srchText=&code=&h=&\
    m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Preface
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-PREF&open=true&g=&srchText=&code=&h=&\
    m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1&open=false&g=&srchText=&code=&h\
    =&m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Java Management Extensions Concepts
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    > 	 Introducing JMX
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1-SECT-1&open=true&g=&srchText=&c\
    ode=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    > 	 JMX Architecture
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1-SECT-2&open=true&g=&srchText=&c\
    ode=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    > 	 The Sample Producer/Consumer Application
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-1-SECT-3&open=true&g=&srchText=&c\
    ode=&h=&m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-2&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Standard MBeans
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-2&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-3&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Dynamic MBeans
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-3&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-4&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Model MBeans
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-4&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-5&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Open MBeans
    >
    <http://safari.oreilly.com/JVXSL.asp?x=1&mode=section&sortKey=rank&sortOrder=des\
    c&view=book&xmlid=0-596-00245-9/javamngext-CHP-5&open=true&g=&srchText=&code=&h=\
    &m=&l=1&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1>
    >  Java™ Management Extensions
    > By J. Steven Perry
    >
    >
    > Chapter 1. Java Management Extensions Concepts
    >
    > ------------------------------------------------------------------------
    >
    >
    >       1.3 The Sample Producer/Consumer Application
    >
    > In the remainder of this chapter, we will build and run a sample
    > application that demonstrates each MBean instrumentation approach. The
    > sections that follow look at the design of the application, where to
    > obtain the source code, how to actually build and run the application,
    > and how to monitor the application via a web browser.
    >
    >
    >         1.3.1 Design
    >
    > In this section, we will take a look at how the sample application is
    > designed, so that you can better understand what is going on when you
    > see it run. First, we will look at the pattern that is fundamental to
    > the application's design. Then we will see how the pattern is
    > implemented and what classes constitute the source code for the
    > application.
    >
    > The design pattern used in the application is a monitor. A monitor is
    > a construct that coordinates activity between multiple threads in the
    > system. In this pattern, the monitor coordinates activities between
    > two categories of threads: producer threads and consumer threads. As
    > you might imagine, a producer thread provides something that the
    > consumer uses. That "something" is generically defined as a unit of
    > work. This can be physically realized as anything relevant to a
    > problem that is solved by this pattern.
    >
    > For example, the unit of work might be an email message that is sent
    > to the email system (the monitor) by the producer (an email client)
    > and removed by the consumer (some agent on the incoming email server
    > side). The producer might perform additional processing on the message
    > before sending it to the email system, such as checking the spelling.
    > By the same token, the consumer may perform additional processing of
    > the message after removing it from the queue, such as applying an
    > anti-virus check. For this reason, we will refer to the pattern as
    > "value-added producer/consumer." This pattern is shown in UML notation
    > in Figure 1-9.
    >
    >
    >           Figure 1-9. UML diagram showing the "value-added
    >           producer/consumer" pattern
    >
    > As you can see in Figure 1-9, the producer
    > and consumer are separated (decoupled) by the monitor. This pattern is
    > best applied to systems that are inherently asynchronous in nature,
    > where the producer and consumer are decoupled by varying degrees. This
    > decoupling can be a separation of location as well as of synchronicity.
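    >
    > As a rough illustration of the pattern (not code from the book), the
    > sketch below uses java.util.concurrent.BlockingQueue as the monitor.
    > The class name ValueAddedPipeline, the DONE sentinel, and the
    > "scanned:" tag are assumptions made for this example; the book's JDK
    > 1.3.1 setup predates java.util.concurrent.
    >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of a "value-added producer/consumer"; not the book's code. */
public class ValueAddedPipeline {
    // Sentinel telling the consumer that no more work is coming (an assumed convention).
    private static final String DONE = "__DONE__";

    /** Runs one producer and one consumer over a bounded queue; returns what was consumed. */
    public static List<String> run(List<String> messages) {
        // The bounded queue plays the monitor: put() blocks when full, take() when empty.
        BlockingQueue<String> monitor = new ArrayBlockingQueue<>(2);
        List<String> delivered = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String m : messages) {
                    monitor.put(m.trim());          // value added before hand-off
                }
                monitor.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (String m = monitor.take(); !m.equals(DONE); m = monitor.take()) {
                    delivered.add("scanned:" + m);  // value added after removal
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();    // after join, the consumer's writes are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        // prints [scanned:hello, scanned:world]
        System.out.println(run(Arrays.asList(" hello ", "world ")));
    }
}
```

    > Neither thread refers to the other; each only knows the queue, which
    > is the decoupling the text describes.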
    >
    > The implementation of the value-added producer/consumer pattern is
    > shown in Figure 1-10. The classes in the
    > diagram are implemented as Java classes. The stereotypes shown in the
    > diagram are named according to the pattern shown in Figure 1-9.
    >
    >
    >           Figure 1-10. UML diagram showing the implementation of the
    >           pattern in the form of the application
    >
    > Basic is the base class for all of the classes that make up the
    > implementation (with the exception of WorkUnit, which represents the
    > unit of work that is exchanged between Supplier and Consumer).
    > Controller is a class that acts as the JMX agent and is responsible
    > for creating the producer and consumer threads that run inside the
    > application. Queue is a thread-safe queue that acts as the monitor.
    > Producer threads place items in the queue in a thread-safe way, and
    > consumer threads remove them. Worker is the base class for Supplier
    > and Consumer, because much of their behavior is common.
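    >
    > To make "a thread-safe queue that acts as the monitor" concrete, here
    > is a minimal sketch built on synchronized, wait(), and notifyAll().
    > It is an assumption about how such a queue could look, not the book's
    > Queue class; the name WorkQueue, the capacity bound, and the
    > transfer() demo are invented for illustration.
    >

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded queue acting as the monitor; not the book's actual Queue class. */
public class WorkQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    public WorkQueue(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Producer side: waits while the queue is full, then appends and wakes waiters. */
    public synchronized void add(T workUnit) throws InterruptedException {
        while (items.size() == capacity) {
            wait();             // releases the lock until another thread calls notifyAll()
        }
        items.addLast(workUnit);
        notifyAll();
    }

    /** Consumer side: waits while the queue is empty, then removes the oldest item. */
    public synchronized T remove() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        T workUnit = items.removeFirst();
        notifyAll();
        return workUnit;
    }

    /** Demo: one supplier thread adds 1..n while the calling thread consumes and sums. */
    public static int transfer(int n) {
        WorkQueue<Integer> queue = new WorkQueue<>(2);
        Thread supplier = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    queue.add(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        supplier.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += queue.remove();
            }
            supplier.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(10));   // prints 55
    }
}
```

    > The wait() calls sit inside while loops so that a woken thread
    > rechecks its condition before proceeding.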
    >
    > In the sample application, the following resources can be managed:
    >
    >    * Controller
    >    * Queue
    >    * Supplier
    >    * Consumer
    >
    > I encourage you to look at the source code to see exactly what
    > attributes and operations are on each of the management interfaces for
    > these resources.
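    >
    > The excerpt does not list those attributes and operations, so the
    > following is only a guess at the general shape of a standard MBean
    > management interface. SampleQueue, its QueueSize and Capacity
    > attributes, and the reset operation are hypothetical names; only the
    > javax.management classes are real API.
    >

```java
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

/** Hypothetical standard MBean sketch; attribute and operation names are assumptions. */
public class QueueMBeanSketch {

    /** Management interface: a standard MBean's interface is named <ClassName>MBean. */
    public interface SampleQueueMBean {
        int getQueueSize();                 // read-only attribute "QueueSize"
        int getCapacity();                  // read/write attribute "Capacity"
        void setCapacity(int capacity);
        void reset();                       // operation "reset"
    }

    /** The managed resource; it exposes itself by implementing the interface above. */
    public static class SampleQueue implements SampleQueueMBean {
        private int queueSize;
        private int capacity;

        public SampleQueue(int capacity) {
            this.capacity = capacity;
        }

        /** Not part of the management interface, so it is invisible to the MBean server. */
        public synchronized void enqueue() {
            if (queueSize < capacity) {
                queueSize++;
            }
        }

        @Override public synchronized int getQueueSize() { return queueSize; }
        @Override public synchronized int getCapacity() { return capacity; }
        @Override public synchronized void setCapacity(int capacity) { this.capacity = capacity; }
        @Override public synchronized void reset() { queueSize = 0; }
    }

    /** Registers the MBean, then reads, writes, and invokes it through the MBean server. */
    public static int[] registerAndRead() {
        try {
            MBeanServer server = MBeanServerFactory.newMBeanServer();
            ObjectName name = new ObjectName("sample:type=SampleQueue");
            SampleQueue queue = new SampleQueue(8);
            queue.enqueue();
            queue.enqueue();
            queue.enqueue();
            server.registerMBean(queue, name);

            int sizeBefore = (Integer) server.getAttribute(name, "QueueSize");
            server.setAttribute(name, new Attribute("Capacity", 16));
            int capacity = (Integer) server.getAttribute(name, "Capacity");
            server.invoke(name, "reset", new Object[0], new String[0]);
            int sizeAfter = (Integer) server.getAttribute(name, "QueueSize");
            return new int[] {sizeBefore, capacity, sizeAfter};
        } catch (JMException e) {
            throw new IllegalStateException("JMX call failed", e);
        }
    }

    public static void main(String[] args) {
        int[] r = registerAndRead();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);   // prints 3 16 0
    }
}
```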
    >
    >
    >         1.3.2 Source Code
    >
    > The source code for the application is standalone with respect to each
    > type of instrumentation approach. There are three versions of the
    > application, each in its own package. The name of the package
    > corresponds to the instrumentation approach. For example, with the
    > exception of common classes such as GenericException, the application
    > source code for standard MBeans is entirely contained in the standard
    > package; thus, if you install the source code to c:\jmxbook, the path
    > to the application source code for standard MBeans will be
    > c:\jmxbook\sample\standard. All of the source code shares the contents
    > of the exception package. Other than that, however, the application
    > can be built and run independently of the other packages.
    >
    > For each type of MBean, there is a Windows batch file and a Unix (Korn
    > shell) script that builds and runs the code for that instrumentation
    > strategy. The name of the script or batch file matches the
    > instrumentation strategy: for example, the build script for dynamic
    > MBeans is called /dynamic.sh/, and the batch file for building the
    > source code for the version of the application instrumented as dynamic
    > MBeans is called /dynamic.bat/. The major differences between the
    > application versions are in the source code. The console output and
    > the management view will show very little difference (other than
    > output from the Ant build script) between the versions of the
    > application.
    >
    >
    >         1.3.3 Building and Running the Application
    >
    > Before you can build and run the sample application (see Section P.5
    > in the Preface for details on how to obtain the application's source
    > code), you must download the JMX RI and Jakarta Ant. For this book, I
    > used JMX RI 1.0.1 and Ant 1.4. You can obtain the JMX RI at
    > http://java.sun.com/products/JavaManagement/ and Jakarta Ant at
    > http://jakarta.apache.org/ant/index.html.
    >
    > The name of the build file Ant uses to build the application for all
    > of the instrumentation strategies is /build.xml/. The build scripts
    > are designed to work with very little modification on your part.
    > However, you may have to modify either the build script or the Ant
    > build file, depending on where you installed the JDK, the JMX RI, and
    > Ant itself. Example 1-5 shows an excerpt from
    > /build.xml/.
    >
    >
    >           Example 1-5. Selected portions of the Ant build file for the
    >           application, build.xml
    >
    >.
    >.
    >.
    ><project name="jmxbook" default="standard" basedir=".">
    >
    ><!-- Set global properties -->
    ><property name="source_root" value="c:\jmxbook\sample"/>
    ><property name="jmx_home" value="c:\jmx1.0.1"/>
    >
    ><path id="project.general.class.path">
    >  <pathelement path="${jmx_home}\jmx\lib\jmxri.jar"/>
    >  <pathelement path="${jmx_home}\jmx\lib\jmxtools.jar"/>
    >  <pathelement path="."/>
    ></path>
    >
    ><!-- Build the init target -->
    ><target name="init">
    >  <!-- create the time stamp -->
    >  <tstamp>
    >    <format property="build.start.time" pattern="MM/dd/yyyy hh:mm:ss aa"/>
    >  </tstamp>
    >  <echo message="Build started at ${build.start.time}..."/>
    ></target>
    >
    ><!-- Build the exception target -->
    ><target name="build-exception" depends="init">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="exception\*"/>
    >  </javac>
    ></target>
    >
    ><!-- Build the "standard" target -->
    ><target name="build-standard" depends="build-exception">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="standard\*"/>
    >  </javac>
    ></target>
    >
    ><!-- Build the "dynamic" target -->
    ><target name="build-dynamic" depends="build-exception">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="dynamic\*"/>
    >  </javac>
    ></target>
    >
    ><!-- Build the "model" target -->
    ><target name="build-model" depends="build-exception">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="model\*"/>
    >  </javac>
    ></target>
    >.
    >.
    >.
    ></project>
    >
    > As you can see, the Ant build file is an XML document. This is what
    > sets Ant apart from other build utilities, such as make. Each
    > component to be built using Ant is called a target. A target may
    > depend on one or more other targets that must be built first, each of
    > which may in turn depend on further targets, and so on. Ant resolves these
    > dependencies for you. A target is specified in an Ant build file as an
    > XML tag called target and has the following format:
    >
    ><target name="mytarget" depends="d1,d2">
    >
    > in which case mytarget depends on targets d1 and d2, or:
    >
    ><target name="mytarget">
    >
    > if mytarget has no dependencies. Let's look at the build-standard
    > target from Example 1-5:
    >
    ><!-- Build the "standard" target -->
    ><target name="build-standard" depends="build-exception">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="standard\*"/>
    >  </javac>
    ></target>
    >
    > You can see that the build-standard target depends on the
    > build-exception target. Ant knows that there may be other
    > dependencies, so it looks at build-exception:
    >
    ><!-- Build the exception target -->
    ><target name="build-exception" depends="init">
    >  <javac>
    >    <classpath refid="project.general.class.path"/>
    >    <src path="${source_root}"/>
    >    <include name="exception\*"/>
    >  </javac>
    ></target>
    >
    > and notices that build-exception depends on init. Ant then looks at init:
    >
    ><target name="init">
    >  <!-- create the time stamp -->
    >  <tstamp>
    >    <format property="build.start.time" pattern="MM/dd/yyyy hh:mm:ss aa"/>
    >  </tstamp>
    >  <echo message="Build started at ${build.start.time}..."/>
    ></target>
    >
    > Ant sees that init has no dependencies, so it begins the build. init
    > is built first, followed by build-exception and finally
    > build-standard. Notice the javac tag within build-standard and
    > build-exception. This is known as an Ant task. A task is a Java class
    > that executes within the JVM in which Ant is running (unless you tell
    > Ant to fork a new process when executing the task). The javac task
    > invokes the Java compiler. The classpath, src, and include tags nested within
    > the javac task tell the Java compiler what the CLASSPATH is, the root
    > location of the /.java/ files, and the packages (directories) to
    > compile, respectively.
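    >
    > The dependency walk described above can be sketched as a depth-first
    > traversal that emits a target only after everything it depends on.
    > This is an illustration of the idea, not Ant's source code; the class
    > TargetOrder and its method names are made up for the example.
    >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of resolving "depends" chains; not taken from Ant itself. */
public class TargetOrder {

    /** Returns the targets to build, dependencies first, ending with the requested target. */
    public static List<String> buildOrder(Map<String, List<String>> dependsOn, String target) {
        List<String> order = new ArrayList<>();
        visit(target, dependsOn, new HashSet<String>(), new HashSet<String>(), order);
        return order;
    }

    private static void visit(String target, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(target)) {
            return;                             // already scheduled via another path
        }
        if (!inProgress.add(target)) {
            throw new IllegalStateException("Circular dependency involving " + target);
        }
        List<String> deps = dependsOn.getOrDefault(target, Collections.<String>emptyList());
        for (String dep : deps) {
            visit(dep, dependsOn, done, inProgress, order);
        }
        inProgress.remove(target);
        done.add(target);
        order.add(target);                      // emitted only after all of its dependencies
    }

    public static void main(String[] args) {
        // Mirrors the depends attributes shown in Example 1-5.
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("init", Collections.<String>emptyList());
        dependsOn.put("build-exception", Arrays.asList("init"));
        dependsOn.put("build-standard", Arrays.asList("build-exception"));
        // prints [init, build-exception, build-standard]
        System.out.println(buildOrder(dependsOn, "build-standard"));
    }
}
```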
    >
    > The application classes for each chapter in this book are built and
    > run using either a batch file or a shell script. If you are running
    > the application on Windows (as I did to produce the screen shots for
    > this chapter), use the batch file (i.e., the /.bat/ file). If you are
    > running the application on Unix, use the shell script (i.e., the /.sh/
    > file). Throughout the rest of this chapter, the examples will be
    > Windows-based. There are two reasons for this. First, because of the
    > popularity of Windows, it is likely that most developers will be
    > running this operating system. Second, the differences in the behavior
    > of the application when it is run on Windows versus Unix are negligible.
    >
    > To build and run the application, type in the name of the batch file
    > you want to run, based on the type of MBean instrumentation strategy
    > you want to see in action. You will notice that there is no detectable
    > difference between what you see when you run the build/run batch file
    > and what you see in your browser (discussed in the next section),
    > regardless of the instrumentation strategy. Suppose we want to run the
    > standard MBean batch file, which will build and run the application as
    > standard MBeans. Example 1-6 <#javamngext-CHP-1-EX-6> shows the batch
    > file that builds the application.
    >
    >
    >           Example 1-6. standard.bat, the batch file that builds the
    >           application as standard MBeans
    >
    >@set TARGET_NAME=build-standard
    >@set JAVA_HOME=c:\jdk1.3.1
    >@set ANT_VERSION=1.4
    >@set ANT_HOME=c:\ant%ANT_VERSION%
    >
    >@echo Starting Build ...
    >
    >call %ANT_HOME%\bin\ant %TARGET_NAME%
    >
    >if NOT "%ERRORLEVEL%"=="0" goto DONE
    >
    >%JAVA_HOME%\bin\java sample.standard.Controller 100 150
    >
    >:DONE
    >
    > This batch file is very simple. Aside from setting a few environment
    > variables, it does only two things: it builds the application by
    > calling Ant, and, if that succeeds, it starts the application. Figure
    > 1-11 <#javamngext-CHP-1-FIG-11> shows the output of running the batch
    > file. Recall our earlier discussion of how Ant resolves target
    > dependencies; you'll see that the targets are built in the order
    > described there.
    >
    >
    >           Figure 1-11. Running the build/run batch file for standard
    >           MBeans
    >
    > All of the batch files (/standard.bat/, /dynamic.bat/, and
    > /model.bat/) operate as described below, but I've used /standard.bat/
    > here for the purposes of illustration.
    >
    > In each version of the application, Controller contains the main( )
    > method that starts the producer and consumer threads and is itself an
    > MBean that can be managed and monitored. There are two command-line
    > arguments to Controller's main( ) method: the work factor for the
    > producer thread and the work factor for the consumer thread. Notice
    > that in /standard.bat/ values of 100 and 150, respectively, are
    > specified for these arguments. I set these values for a reason: it is
    > unlikely that you will find an application of the value-added
    > producer/consumer pattern where the producer and consumer perform an
    > equal amount of work. These command-line parameters to Controller
    > allow you to simulate this asymmetry. When Controller is started, one
    > producer thread and one consumer thread are created. However,
    > Controller has a management method that allows you to start additional
    > threads to balance out the workload (we will see how to do this later).
    >
    > Figure 1-10 <#javamngext-CHP-1-FIG-10> illustrates the relationship
    > between the various classes in the application, where there is a
    > single Queue object into which Supplier threads place WorkUnit objects
    > and from which Consumer threads remove them. For a single unit of
    > work, here is the flow of control:
    >
    >   1. The Supplier performs an amount of work N (where N is specified
    >      on the command line to Controller) and places a single WorkUnit
    >      object into the Queue.
    >
    >   2. The Consumer removes a single WorkUnit object from the Queue and
    >      performs an amount of work M (again, where M is specified on the
    >      command line to Controller).
    >
    > These steps are repeated for each work unit.
    >
    >
    >
    > The work that is performed by Supplier and Consumer threads is to
    > calculate prime numbers. The amount of work specified on the command
    > line to Controller is the number of prime numbers to calculate for
    > each WorkUnit. The Supplier calculates N primes, then places a
    > WorkUnit object into the Queue. The Consumer removes a WorkUnit object
    > from the Queue and then calculates M primes.
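The flow of control above can be sketched in plain Java as follows. This is not the book's source code: the class WorkQueueSketch, the queue capacity of 10, the use of a bare Object in place of WorkUnit, and the trial-division prime search are all assumptions made for illustration. The queue records how long each side blocks, which is the kind of figure the management view reports later.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the sample application's Queue, Supplier and Consumer.
class WorkQueueSketch {

    // Bounded queue that records how long callers block, in milliseconds.
    static class BoundedQueue {
        private final Deque<Object> items = new ArrayDeque<>();
        private final int capacity;
        private long addWaitMillis;
        private long removeWaitMillis;
        private long processed;

        BoundedQueue(int capacity) {
            this.capacity = capacity;
        }

        synchronized void add(Object workUnit) throws InterruptedException {
            long start = System.currentTimeMillis();
            while (items.size() == capacity) {
                wait(); // queue is full, so the supplier waits
            }
            addWaitMillis += System.currentTimeMillis() - start;
            items.addLast(workUnit);
            notifyAll();
        }

        synchronized Object remove() throws InterruptedException {
            long start = System.currentTimeMillis();
            while (items.isEmpty()) {
                wait(); // queue is empty, so the consumer waits
            }
            removeWaitMillis += System.currentTimeMillis() - start;
            processed++;
            notifyAll();
            return items.removeFirst();
        }

        synchronized long getAddWaitMillis() { return addWaitMillis; }
        synchronized long getRemoveWaitMillis() { return removeWaitMillis; }
        synchronized long getProcessed() { return processed; }
    }

    // Finds the first `count` primes by trial division and returns the last one.
    static int calculatePrimes(int count) {
        int found = 0;
        int candidate = 1;
        while (found < count) {
            candidate++;
            boolean prime = true;
            for (int d = 2; (long) d * d <= candidate; d++) {
                if (candidate % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                found++;
            }
        }
        return candidate;
    }

    // Runs one supplier and one consumer for a fixed number of work units.
    static long run(int supplierFactor, int consumerFactor, int units) {
        BoundedQueue queue = new BoundedQueue(10); // capacity is an assumption
        Thread supplier = new Thread(() -> {
            try {
                for (int i = 0; i < units; i++) {
                    calculatePrimes(supplierFactor); // work N, then enqueue
                    queue.add(new Object());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < units; i++) {
                    queue.remove();                  // dequeue, then work M
                    calculatePrimes(consumerFactor);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        supplier.start();
        consumer.start();
        try {
            supplier.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return queue.getProcessed();
    }

    public static void main(String[] args) {
        System.out.println("Units processed: " + run(100, 150, 50));
    }
}
```

The work factors 100 and 150 passed in main mirror the values used in standard.bat; the number of units is arbitrary.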
    >
    >
    > This section looked at how to run the sample application and briefly
    > discussed what it is doing internally to simulate the production and
    > consumption of units of work. I strongly encourage you to examine the
    > source code for yourself to see the various attributes and operations
    > available on the management interfaces of each resource in the
    > application.
    >
    > In the next section, we will look at how to use a web browser to
    > monitor and manage the sample application's MBeans.
    >
    >
    >         1.3.4 Monitoring and Managing the Application
    >
    > Once the application is running, you can point your web browser to
    > port 8090 (the default—you can change this, but if you do so, remember
    > to point your browser to the new port number). Figure 1-12
    > <#javamngext-CHP-1-FIG-12> shows the result of pointing my web browser
    > (which happens to be Internet Explorer) to port 8090 after running
    > /standard.bat/.
    >
    >
    >           Figure 1-12. The management view of the application in
    >           Internet Explorer
    >
    > Remember the work factors that we specified on the command line to
    > Controller for the producer and consumer threads? Because they are
    > different (100 and 150, respectively), and the producer thread does
    > less work than the consumer thread for each work unit, I expect the
    > Queue to always be full once the application reaches a steady state.
    >
    > If I click on the Queue MBean in my browser, I see the screen shown in
    > Figure 1-13 <#javamngext-CHP-1-FIG-13>. There are several interesting
    > things about Figure 1-13 <#javamngext-CHP-1-FIG-13>. First, the
    > AddWaitTime attribute is much larger than the RemoveWaitTime
    > attribute. After processing 72 units of work (according to the
    > NumberOfItemsProcessed attribute), the Supplier thread has waited a
    > total of 3,421 milliseconds to add items to the Queue because it was
    > full, whereas the Consumer thread has not had to wait at all to remove
    > items (although, depending on which thread actually starts first, you
    > may see a small amount of Consumer wait time). This is pretty much
    > what we would expect, as the Supplier thread does only two-thirds the
    > work of the Consumer thread.
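The attributes named above can also be read programmatically rather than through a browser. The following is a minimal sketch, assuming a standard MBean registered with the platform MBeanServer (a facility of later Java versions, not the HTML adaptor on port 8090 that the text uses). The class names, the ObjectName "sketch:type=Queue", and the fixed values are hypothetical; the numbers are copied from the description of Figure 1-13, not measured.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical illustration of exposing and reading Queue attributes over JMX.
class QueueStatsSketch {

    // Standard MBean interface: each getter becomes a read-only attribute.
    public interface QueueStatsMBean {
        long getAddWaitTime();
        long getRemoveWaitTime();
        long getNumberOfItemsProcessed();
    }

    // Implementation holding fixed values for the purpose of the example.
    public static class QueueStats implements QueueStatsMBean {
        private final long addWaitTime;
        private final long removeWaitTime;
        private final long itemsProcessed;

        public QueueStats(long addWaitTime, long removeWaitTime, long itemsProcessed) {
            this.addWaitTime = addWaitTime;
            this.removeWaitTime = removeWaitTime;
            this.itemsProcessed = itemsProcessed;
        }

        public long getAddWaitTime() { return addWaitTime; }
        public long getRemoveWaitTime() { return removeWaitTime; }
        public long getNumberOfItemsProcessed() { return itemsProcessed; }
    }

    // Registers the MBean on first use, then reads one attribute by name.
    static long readAttribute(String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("sketch:type=Queue");
            if (!server.isRegistered(name)) {
                // Values taken from the text: 3,421 ms waited, 0 ms, 72 units.
                server.registerMBean(new QueueStats(3421, 0, 72), name);
            }
            return (Long) server.getAttribute(name, attribute);
        } catch (Exception e) {
            throw new IllegalStateException("JMX access failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("AddWaitTime=" + readAttribute("AddWaitTime"));
        System.out.println("RemoveWaitTime=" + readAttribute("RemoveWaitTime"));
        System.out.println("NumberOfItemsProcessed=" + readAttribute("NumberOfItemsProcessed"));
    }
}
```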
    >
    >
    >           Figure 1-13. The management view of the Queue object
    >
    > Suppose we want to start another Consumer thread to pick up some of
    > the slack of the other Consumer thread and balance things out a bit.
    > For the moment, let's ignore the fact that we can control the amount
    > of work each type of Worker thread can perform. In a real-world
    > application, we would not have that luxury. As I mentioned earlier in
    > this chapter, Controller acts as the JMX agent for the application,
    > but it is also itself a managed resource (i.e., an MBean). If we look
    > at the management interface of Controller, we'll see that there is a
    > management operation to start new Worker threads, called createWorker(
    > ). Figure 1-14 <#javamngext-CHP-1-FIG-14> shows the management view of
    > the Controller MBean and its createWorker( ) operation.
    >
    >
    >           Figure 1-14. The management view of Controller showing the
    >           createWorker( ) operation
    >
    > There are two parameters to createWorker( ): the first is a string
    > that contains the worker type, and the second is the work factor that
    > worker is to have (i.e., the number of primes calculated per unit of
    > work). The valid values for the worker type are "Supplier" and
    > "Consumer". We want to create a new Consumer thread with the same work
    > factor as the currently running Consumer thread, so we set these
    > parameters to Consumer and 150, respectively. Once we have entered the
    > parameters for the management operation into the text boxes, as shown
    > in Figure 1-14 <#javamngext-CHP-1-FIG-14>, we click the createWorker
    > button to invoke the management operation. If the operation succeeds,
    > we will see a screen that looks like Figure 1-15
    > <#javamngext-CHP-1-FIG-15>.
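Clicking the createWorker button amounts to invoking a management operation with two arguments. A rough programmatic equivalent is sketched below, assuming a standard MBean on the platform MBeanServer. Only the operation name, its two parameters, and the valid worker types come from the text; the NumberOfWorkers attribute, the ObjectName, and the class names are hypothetical, and this Controller merely records the request instead of starting a thread.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical illustration of invoking createWorker() as a JMX operation.
class ControllerSketch {

    // Standard MBean interface: createWorker is an operation, NumberOfWorkers an attribute.
    public interface ControllerMBean {
        void createWorker(String workerType, int workFactor);
        int getNumberOfWorkers();
    }

    public static class Controller implements ControllerMBean {
        private final List<String> workers = new ArrayList<>();

        public synchronized void createWorker(String workerType, int workFactor) {
            if (!"Supplier".equals(workerType) && !"Consumer".equals(workerType)) {
                throw new IllegalArgumentException("Unknown worker type: " + workerType);
            }
            // The real application would start a thread here; the sketch only records it.
            workers.add(workerType + ":" + workFactor);
        }

        public synchronized int getNumberOfWorkers() {
            return workers.size();
        }
    }

    // Invokes createWorker through the MBeanServer and returns the worker count.
    static int invokeCreateWorker(String workerType, int workFactor) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("sketch:type=Controller");
            if (!server.isRegistered(name)) {
                server.registerMBean(new Controller(), name);
            }
            server.invoke(name, "createWorker",
                    new Object[] {workerType, workFactor},
                    new String[] {String.class.getName(), int.class.getName()});
            return (Integer) server.getAttribute(name, "NumberOfWorkers");
        } catch (Exception e) {
            throw new IllegalStateException("JMX invocation failed", e);
        }
    }

    public static void main(String[] args) {
        // Same parameters as entered in the text boxes: Consumer and 150.
        System.out.println("Workers created: " + invokeCreateWorker("Consumer", 150));
    }
}
```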
    >
    >
    >           Figure 1-15. The screen we see once createWorker( ) has
    >           successfully been invoked
    >
    > We would now expect that activity in the Queue has balanced out
    > somewhat, and we would expect to start seeing the Consumer threads
    > wait, as we now have two of them at work. Figure 1-16
    > <#javamngext-CHP-1-FIG-16> shows the management view of the Queue
    > after we start the second Consumer thread.
    >
    >
    >           Figure 1-16. The management view of the Queue after starting
    >           a second Consumer thread
    >
    > Notice that after processing 1,013 units of work (as we see from the
    > NumberOfItemsProcessed attribute), the Consumer threads have waited
    > nearly 7 times as long as the Supplier thread. Through the use of
    > management operations, we can give an operator at a management console
    > the ability to tune our application at runtime.
    >
    >
    > *URL* http://safari.oreilly.com/0596002459/javamngext-CHP-1-SECT-3
    >
    >
    >
    > Copyright © 2003 O'Reilly & Associates, Inc. All rights reserved.
    > 1005 Gravenstein Highway North
    > Sebastopol, CA 95472
    >

    #208 From: Gordon Mohr <gojomo@...>
    Date: Wed Dec 17, 2003 12:45 am
    Subject: archive-crawler@yahoogroups changes
    gojomo
     
    Starting tomorrow, we intend to make the 'archive-crawler@yahoogroups.com'
    message archives visible to non-list members, and allow people to subscribe
    themselves. (Previously, subscription-requests were held for admin approval.)
    
    If there are any concerns about this transition -- for example, concern
    that project info that should not be made public is in the message or
    file archives -- please let me know.
    
    We are also considering moving project development discussion to a
    different listserver, perhaps at Sourceforge. Possible benefits
    would include:
    
        (1) More familiar/open to walk-up open-source contributors
        (2) Lower delays before messages are propagated to list
            (We've seen delays of several hours.)
        (3) Less advertising
    
    However, the Sourceforge list service itself is currently going through
    growing pains, with the last couple of weeks of archives offline
    after a hardware migration, so any move on our part will wait until their
    services stabilize.
    
    Comments on this potential move welcome.
    
    - Gordon @ IArchive

    #209 From: Michael Stack <stack@...>
    Date: Thu Dec 18, 2003 9:59 pm
    Subject: Re: archive-crawler@yahoogroups changes
    stack@...
     
    +1 on the below.
    St.Ack
    
    Gordon Mohr wrote:
    
    >Starting tomorrow, we intend to make the 'archive-crawler@yahoogroups.com'
    >message archives visible to non-list members, and allow people to subscribe
    >themselves. (Previously, subscription-requests were held for admin approval.)
    >
    >If there are any concerns about this transition -- for example, concern
    >that project info that should not be made public is in the message or
    >file archives -- please let me know.
    >
    >We are also considering moving project development discussion to a
    >different listserver, perhaps at Sourceforge. Possible benefits
    >would include:
    >
    >   (1) More familiar/open to walk-up open-source contributors
    >   (2) Lower delays before messages are propagated to list
    >       (We've seen delays of several hours.)
    >   (3) Less advertising
    >
    >However, the Sourceforge list service itself is currently going through
    >growing pains, with the last couple of weeks of archives offline
    >after a hardware migration, so any move on our part will wait until their
    >services stabilize.
    >
    >Comments on this potential move welcome.
    >
    >- Gordon @ IArchive
    >


    Copyright © 2010 Yahoo! Inc. All rights reserved.