enterprise-information-integration · EII

Seán on Information Entity Definition/Classification
Message #24 of 282
Duck modelling in commercial IT systems

By Sean McGrath

Some time ago, I wrote an article entitled 'How to model a bishop'[1].
The thrust of the article is the obvious-yet-profound fact that people
model information from their own point of view. That 'point of view'
is a smorgasbord of influences that include training, culture,
language, hubris, sloth and so on. This article approaches the same
problem of information modelling from another perspective and ends, I
hope, with an equally obvious-yet-profound fact.

Let's get started.

One of the great mysteries of human intelligence is the fact that we
can somehow synthesize wildly different models of information into
crisp sounding concepts such as an 'invoice', or 'the letter A'.

Flick through a good book of fonts and wildly different variations on
a theme known as 'the letter A' will hit your retinas. Yet somehow,
our brains have no difficulty coalescing these diverse images into a
single concept - the letter A - that we recognize with astonishing
speed and reliability.

Faced with one hundred different invoice formats, from one hundred
different suppliers, our brains (nobody knows how this works) know
they are all basically the same thing. Variations on a theme we call
'invoice'.

Douglas Hofstadter has written an engaging analysis of 'the letter A'
problem[2] which I would exhort anybody interested in the problem to
read. Here, we will concentrate on the more commercially interesting
problem of recognizing an invoice when we see one.

What is an invoice? Don't all shout together now! Write down, on a
mental notepad, the concepts you believe need to be present for
something to qualify as an invoice. Let's go through it. Does your
model have a sender? It does? Great. What did you call the invoice
sender in your model? What is it comprised of? Is it the name of a
person, a process or a thing? All of the above? None of the above?
Some of the above? Let's look at 'person' for a moment. What is a
person from the point of view of an invoice? A real person or a
business role? Does it have a name? How is the name structured? Does
it have an address? A title? What is a 'given name' anyway? How is
the address structured? Is it a place, a geo-code or a business
process name? Is a zip code mandatory or optional? What about city?
What is a city anyway?

And so it goes. Down and down we go into the subatomic world of
invoice particle physics. The closer we look, the more we find that
the model refuses to bottom out, refuses to condense into something
solid that we can model rigorously. And yet, paradoxically, our
brains can recognize an invoice as an invoice in a split second.

All around the world, as I write this, developers are struggling to
build models of invoices and other "simple" business documents into
IT systems. All around the world, multiple efforts new and old
continue to attempt to zoom in on a definitive model of what it is to
be an invoice.

Also all around the world, retired developers in Zimmer frames and
comfy shoes remember the good old days when they too chased such
modelling rainbows.

The classical approach to data modelling - as enshrined in techniques
such as data dictionaries, object models and XML schemas - is to model
the data rigorously from the top down. Every thing in the model has a
name. Each thing is either a simple lump of data or a complex thing.
Complex things themselves have names and models. And so it goes.

The latest silver bullet of the classical approach - XML schemas -
illustrates the genre very well. You start at the top
concept "invoice". You break it down into its component parts known
as "elements", say, "header" and "body". You break these down further
into more elements. For example, a "header" has a sender element, a
receiver element, a date element. A sender element is comprised
of...and so it goes.
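
As a rough illustration of the top-down style just described, here is
a minimal Python sketch. The class and field names (Party, Header,
Line, Invoice, parse_invoice) are invented for this example and are
not taken from any real schema. It is an assumption about what such a
model typically looks like, not a definitive one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Party:
    # Hypothetical fields: every real system picks its own.
    name: str
    address: str


@dataclass
class Header:
    sender: Party
    receiver: Party
    issued: date


@dataclass
class Line:
    description: str
    amount: float


@dataclass
class Invoice:
    header: Header
    body: List[Line]


def parse_invoice(raw: dict) -> Invoice:
    """Build an Invoice from a nested dict shaped exactly like the model."""
    head = raw["header"]
    return Invoice(
        header=Header(
            sender=Party(**head["sender"]),
            receiver=Party(**head["receiver"]),
            issued=date.fromisoformat(head["date"]),
        ),
        body=[Line(**item) for item in raw["body"]],
    )
```

The point of the sketch is its brittleness. A supplier who calls the
sender "supplier" triggers a KeyError, and one who adds a field the
model never anticipated triggers a TypeError, even though a human
reader would still call both documents invoices.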

The trouble is, this modelling exercise never ends. The essence of an
invoice refuses to be modelled. Every model, to paraphrase Oscar
Wilde, becomes a work of art that is never finished, simply
abandoned. If this were not the case then surely we would have a
definitive invoice model by now? How come our planet is so
chockablock with mutually incompatible, application-specific models
of invoices? How come new ones appear every second day?

I have a suggestion that explains the situation. It is radical
sounding at first but please bear with me.

There is *no such thing as an invoice* in the classical data modelling
sense.

What does your brain do when it sees an invoice? It scans the page
picking up clues as to what is on the page. Something that looks like
a sender's details in one place, something that looks like a
receiver's details in another. A bunch of what look like
products/services with amounts in another. Some sort of total near
the bottom. Perhaps some terms and conditions on the back. The more
of these subsidiary structures you recognize, the higher your
confidence level that you are dealing with an invoice.

The important thing is that you never actually decide that it is
definitely an invoice, rather, you decide that it is more like an
invoice than any other document type you know. Invoice-ness is
statistical, fuzzy, fluid. It is not solid, not deterministic.
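
The clue-gathering described above can be sketched in a few lines of
Python. Everything here is an assumption made for illustration: the
clue names, the weights and the three document types in PROFILES are
invented, and real recognition is surely far subtler. The sketch only
shows the shape of the idea, namely a score per document type and a
winner that is "most like", never "definitely is".

```python
# Hypothetical profiles: each document type is a bag of weighted clues.
PROFILES = {
    "invoice": {"sender": 1.0, "receiver": 1.0, "line_items": 2.0,
                "total": 2.0, "terms": 0.5},
    "letter": {"sender": 1.0, "receiver": 1.0, "salutation": 2.0,
               "signature": 2.0},
    "receipt": {"line_items": 2.0, "total": 2.0, "payment_method": 2.0},
}


def likeness(clues_found, profile):
    """Share of a profile's total weight covered by the clues found."""
    matched = sum(w for clue, w in profile.items() if clue in clues_found)
    return matched / sum(profile.values())


def classify(clues_found):
    """Return the best-matching type plus the score for every type."""
    scores = {name: likeness(clues_found, profile)
              for name, profile in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify({"sender", "receiver", "line_items", "total"})
print(best, scores)
```

With no terms and conditions spotted, the invoice score stays below
1.0, yet "invoice" still wins because it beats every other profile.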

There seems to me to be a fundamental mismatch here between the
classical software model of an invoice and the reality of real world
invoices. A mismatch that goes to the heart of our attempts to model
data in machine readable form. A mismatch that is reminiscent of the
classical 'billiard ball'[3] model of particle physics versus the
quantum model.

This mismatch is not of mere intellectual interest. It runs to the
heart of the difference between a truly flexible computer system and
one that just pretends to be flexible. True flexibility comes from
the ability to adapt to changing business needs. Invariably, this
involves the ability to adapt to changing data models. Systems
designed with the classical data model approach are highly resistant
to change. These systems forge rock hard data models based on rigid
top down analysis of the elements present in, say, invoices. Then
they bury these rigid models deep into the core of the system by
compiling programs in Java or C# or whatever that are intimately tied
to these models.

Worrying, isn't it? The standard approach to data modelling would seem
to be antithetical to flexibility. The answer to this conundrum is
not yet on the horizon. Indeed, recognition of the problem is not yet
widespread in the industry in my experience. We hear lots of nice
words like 'flexibility' and 'loose coupling' but they tend to have
vague, abstract definitions.

Amongst those who do recognize the problem, an amusing and very
useful phrase can often be heard - 'duck typing'[4]. So called after
the way we humans recognize ducks. Namely, if it walks like a duck
and quacks like a duck, it *is* a duck.

Think back to how you recognized that invoice as an invoice. You
found a bunch of attributes which you associate with invoices. You
found enough of them in your analysis of the piece of paper to
conclude that it was, statistically speaking, more likely than not to
be an invoice. If it walks like a duck...
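
For readers who have not met duck typing in code, a small Python
sketch may help. The two classes and the amount_payable function are
hypothetical names made up for this example. The thing to notice is
that the function never asks what type it was given. It only uses the
attributes it needs.

```python
class SupplierInvoice:
    """One supplier's idea of an invoice: a list of line amounts."""

    def __init__(self, sender, amounts):
        self.sender = sender
        self.amounts = amounts

    def total(self):
        return sum(self.amounts)


class ConsultantBill:
    """Another supplier's idea: hours worked at an hourly rate."""

    def __init__(self, sender, hours, rate):
        self.sender = sender
        self.hours = hours
        self.rate = rate

    def total(self):
        return self.hours * self.rate


def amount_payable(doc):
    # No isinstance() check: anything with a sender and a total() will do.
    return f"{doc.sender}: {doc.total():.2f}"


print(amount_payable(SupplierInvoice("Acme", [10, 20.5])))
print(amount_payable(ConsultantBill("Jo", 3, 40)))
```

The two classes share no base class and no declared interface. An
object lacking total() fails only at the moment it is used, with an
AttributeError, which is the price paid for the flexibility.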

This obvious-yet-profound idea is the route to our salvation. In the
XML world, which seems set to be a hotbed of work in this area,
there is an increasing recognition that so-called 'grammar-based'
approaches to data modelling have weaknesses as well as strengths.
Alternative approaches that are closer to duck typing, such as
Schematron[5], are growing in popularity because of the business
benefits that accrue from the extra flexibility they provide. At the
same time, so called 'dynamically typed' programming languages such
as Python/Jython[6] are becoming increasingly popular for XML
processing. Again because of the flexibility that their duck-typing
provides.
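
To make the contrast with grammar-based validation concrete, here is
a small Python sketch using only the standard library. It is not
Schematron and does not use Schematron syntax. It merely imitates the
rule-based spirit: a list of independent assertions, each checked
anywhere in the document, with no grammar dictating nesting or order.
The element names (sender, receiver, line, total) and the RULES list
are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Each rule is (description, test). Element names are hypothetical.
RULES = [
    ("has some sender details",
     lambda doc: doc.find(".//sender") is not None),
    ("has some receiver details",
     lambda doc: doc.find(".//receiver") is not None),
    ("has at least one priced line",
     lambda doc: len(doc.findall(".//line[@amount]")) > 0),
    ("has a total",
     lambda doc: doc.find(".//total") is not None),
]


def failed_rules(xml_text):
    """Return the descriptions of the rules the document does not meet."""
    doc = ET.fromstring(xml_text)
    return [description for description, test in RULES if not test(doc)]


sample = """
<bill>
  <from><sender>Acme Ltd</sender></from>
  <receiver>Bob</receiver>
  <items><line amount="10.00">Widget</line></items>
</bill>
"""
print(failed_rules(sample))
```

The sample nests its sender inside an element the rules have never
heard of and still passes three of the four checks. The fewer rules
fail, the more invoice-like the document is, and a new supplier layout
usually means adjusting a rule rather than rebuilding a schema.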

I hope that long before I take delivery of my own Zimmer frame,
dynamic typing will be the standard way to model data. I can see no
other way to keep up with the incessant demands for flexibility
required of commercial IT systems.


[1] http://www.itworld.com/nl/xml_prac/08082002/

[2] Metafont, metamathematics and metaphysics: Comments on Donald
Knuth's article "The concept of a Meta-Font".
http://itc.fgg.uni-lj.si/data/cumincad/robots/aed6.htm

[3] http://www.meta-library.net/ghc-obs/colatoms-body.html

[4] http://c2.com/cgi/wiki?DuckTyping

[5] http://www.ascc.net/xml/resource/schematron/schematron.html


[6] http://www.python.org / http://www.jython.org







Tue Feb 24, 2004 10:02 am

gervasdouglas
