This might be of interest to those working on categorisation of
channels:
This is a forwarded message
From: Mike Jackson <m.s.jackson@...>
To: 'dc-general@...' <dc-general@...>
Date: Friday, July 09, 1999, 10:23:57 AM
Subject: Automatic Classification
===8<==============Original message text===============
For a couple of years now part of the work of the SEED research group
(http://www.scit.wlv.ac.uk/seed/) at the University of Wolverhampton has been
the automatic classification of web pages. The initial work revolved around the
generation of a set of Dewey Decimal Classifications for each page. This has
recently been extended to the generation of Dublin Core metadata expressed in
RDF. This work was described in a paper presented at WWW8:
http://www.scit.wlv.ac.uk/~ex1253/rdf_paper/.
We have now made the classifier publicly accessible at
http://scitsd.wlv.ac.uk:8080/metadata.html. Please feel free to use it and
address any criticisms you have of it to us after reading Charlotte Jenkins's
caveat page at http://www.scit.wlv.ac.uk/~ex1253/excuses.html
Mike
Mike Jackson M.Sc. B.Sc. C.Eng. FBCS
Professor of Data Engineering
School of Computing and Information Technology
35/49 Lichfield Street
Wolverhampton WV1 1EL
United Kingdom
Phone: +44 (0)1902 321429 Fax: +44 (0)1902 321491
Email: m.s.jackson@...
For anyone interested in a prototype web log viewer, check it out at
http://myhome.mystuff.net/myweblog.cgi
basically, a cross between a headline viewer and a mail/news reader.
comments welcome.
-paul
Version 0.8.5 of Carmen's Headline Viewer (for Windows) is now
available for download at
http://www.vertexdev.com/HeadlineViewer
This program reads XML and text-based "backend" headline syndication
files produced by several hundred producers of syndicated news
content. It gathers up the headlines from each producer and presents
them in a nice scrolling list. Clicking on a headline opens the full
story in your browser. It is a clean and straightforward way to stay
in touch with the latest news from hundreds of news producers without
spending all day clicking.
Access to nearly 300 news sites is provided:
* 34 are built in
* 233 can be downloaded from the list at UserLand.com
* 34 can be downloaded from the RSS Maker list
The user can add new sites, and the two syndication lists add more
content nearly every day. Preliminary support for the
http://www.moreover.com site (with hundreds more news channels) is
also provided.
This version adds the following new features:
* Added 18 new built in providers.
* Added support for text-based (non XML) providers.
* Faster loading of the UserLand provider list
* Improved context menus.
* More tooltips.
* Lots of small tweaks and bug fixes (details are on the
web page)
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer
> Well, a true portal might have thousands or tens of thousands of
> channels. When (not if :-) we get to this point things will get
> unwieldy. We'll need a meta-OCSF which contains a list of
> OCSF files. Imagine any significant fraction of the sites listed
> in Yahoo in a single file. Ugly, eh?
Yeah, ugly. This is where categorisation comes into play. I thought about this
some time ago when deciding how to open the contents of xmlTree on display. I
reckoned that it was sensible to provide a channel listing by category, limiting
the number of channels in any one OCS-like file to around 20-100. At least all
of the channels are on a similar topic.
An obvious point: including categorisation at the OCS level would imply that it
applied to all of the channels in the list, and would preclude us from compiling
lists based on something other than categorisation, like keywords. So if we
include categorisation in the metadata, it should only be at the individual
channel level.
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
My notes below...
James> a) to define a channel metadata format (when the channel is updated,
James> who the editor is etc.), or
James> b) to define the channel content format (news item 1, news item 2
James> etc.), or
James> c) to define both separately, or
James> d) to define both rolled into one.
Ian> I'd like to see a and b, and I think they are separate but
Ian> the syndication format probably needs some additional metadata such as
Ian> date published, and author.
Let's definitely keep a and b separate. Without further broadening
our charter I should mention that there is a lot of syndication
going on that is not XML-based. The channel metadata format could
possibly address these. In my Headline Viewer I store a text/XML
flag, and then handle them separately:
* For XML the top-level node defines the format
(RSS, ultraMode, scriptingNews, or MoreOver).
* For Text, I store a "format" string with each item.
It defines the number of header lines, then the number
of lines per item, then the content of individual lines.
I store the entire format as a single string which
might look something like "00,04,TUD3X". This means
that there are 0 header lines, 4 lines per item, then
a Title, an URL, a Date (in format 3), and then a line
to be ignored. Ugly yet effective.
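For anyone who wants to play with the idea, here is a rough Python sketch of how a format string like "00,04,TUD3X" could be decoded and applied to a text feed. It is not the Headline Viewer code. The function names are made up, and since the meaning of date "format 3" is not spelled out above, the date line is kept as a raw string.

```python
import re

# Hypothetical mapping from line codes to field names; "X" means "ignore".
FIELD_NAMES = {"T": "title", "U": "url", "D": "date"}

def parse_format(fmt):
    """Split a format string such as "00,04,TUD3X" into its three parts."""
    header, per_item, codes = fmt.split(",")
    # Each line code is one letter optionally followed by digits (e.g. "D3").
    fields = re.findall(r"([A-Z])(\d*)", codes)
    if len(fields) != int(per_item):
        raise ValueError("line codes do not match the lines-per-item count")
    return int(header), int(per_item), fields

def read_items(text, fmt):
    """Apply a format string to the text of a feed and return a list of dicts."""
    header, per_item, fields = parse_format(fmt)
    lines = text.splitlines()[header:]          # skip the header lines
    items = []
    for start in range(0, len(lines) - per_item + 1, per_item):
        item = {}
        for (code, _arg), line in zip(fields, lines[start:start + per_item]):
            if code in FIELD_NAMES:             # "X" lines are dropped
                item[FIELD_NAMES[code]] = line.strip()
        items.append(item)
    return items

sample = (
    "Story one\nhttp://example.com/1\n1999-07-09\n--\n"
    "Story two\nhttp://example.com/2\n1999-07-10\n--\n"
)
items = read_items(sample, "00,04,TUD3X")
```

The digits after a code (the "3" in "D3") are parsed but not interpreted here; a real reader would use them to pick a date parser.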
James> I also believe that we should not rush into defining another
James> channel format too early. Netscape are rumoured
James> to be releasing version 1.0 of RSS later this month and no matter how
James> many flaws the current format has it is the most widely supported so
James> backwards compatibility is a must.
I am in violent agreement here.
> Does anyone feel that the current OCS format is too cumbersome from
> either an authoring or a parsing point of view?
Not me. It's easy to parse.
> Do you think that OCS can scale from single channel sites to channel
> portals such as my.userland?
Well, a true portal might have thousands or tens of thousands of
channels. When (not if :-) we get to this point things will get
unwieldy. We'll need a meta-OCSF which contains a list of
OCSF files. Imagine any significant fraction of the sites listed
in Yahoo in a single file. Ugly, eh?
> Do the Dublin Core elements provide enough flexibility to describe the
> channel and its owner?
They seem to, although I found it more satisfying to parse a file
with distinctively named elements. It's hard to say why.
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer
Hi Dan,
On Tuesday, July 06, 1999, 10:29:15 PM, Dan wrote:
>> However from what I've seen we shouldn't hold out too much hope of
>> Netscape adding everything that we need to the format. Personally, I'd
>> be looking for at least the following: language, publication date,
>> author/copyright/credits, optional content with each item, optional
>> date with each item.
> language is there - per channel, not item. no date stuff yet. no attribution
> stuff. The latter two are really just laziness and time constraints, not
> because we don't want to. In fact, I could probably just make those
> "undocumented tags" that would be ignored by our validator. There is an
> optional description available for each item now. What sort of
> "optional content with each item" did you have in mind?
The optional content would be an element that would allow the provider
to add a paragraph or so of text including some limited HTML markup
such as <i>, <b>, <em>, <strong>, <a>, <span>.
I'm not sure what value the description element has. I guess it could
be useful for more static listings, such as mini sitemaps. But for
content syndication it's not enough - we want to be able to include
links in the text à la scriptingNews format (but simpler to write and
backwards compatible with RSS).
> When I get a chance, I will post the RSS 1.0 spec + DTD here.
That would be useful, my copy is dated 21 June 1999 but there may have
been changes since then.
> If someone could post a proposal for the exact syntax of above tags, that
> would be useful.
> Also, even if these things don't make it into 1.0, I don't see any reason why
> we couldn't easily create a 1.1 with some of these added as optional
> attributes.
This list is a good place to start opening up the channel format to
outside developers. RSS is a public standard and it would be nice to
be able to work with Netscape openly to enrich the format.
> -dan
.id.
--
weblog - http://alchemy.openjava.org/
me - http://www.fdc.co.uk/people/iand/
email - iand@... | icq - 4423828
> However from what I've seen we shouldn't hold out too much hope of
> Netscape adding everything that we need to the format. Personally, I'd
> be looking for at least the following: language, publication date,
> author/copyright/credits, optional content with each item, optional
> date with each item.
language is there - per channel, not item. no date stuff yet. no attribution
stuff. The latter two are really just laziness and time constraints, not
because we don't want to. In fact, I could probably just make those
"undocumented tags" that would be ignored by our validator. There is an
optional description available for each item now. What sort of "optional
content with each item" did you have in mind?
When I get a chance, I will post the RSS 1.0 spec + DTD here.
If someone could post a proposal for the exact syntax of above tags, that would
be useful.
Also, even if these things don't make it into 1.0, I don't see any reason why we
couldn't easily create a 1.1 with some of these added as optional attributes.
-dan
On Tuesday, July 06, 1999, 8:47:02 AM, james wrote:
> From: "james@..." <james@...>
> Thanks, Ian, for the work that you have put in on OCS. I am, though, a little
> unclear on the objectives of the group. Is it:
> a) to define a channel metadata format (when the channel is updated, who the
> editor is etc.), or
> b) to define the channel content format (news item 1, news item 2 etc.), or
> c) to define both separately, or
> d) to define both rolled into one.
I'd like to see a and b, and I think they are separate but
the syndication format probably needs some additional metadata such as
date published, and author. I also believe that we should not rush
into defining another channel format too early. Netscape are rumoured
to be releasing version 1.0 of RSS later this month and no matter how
many flaws the current format has it is the most widely supported so
backwards compatibility is a must.
However from what I've seen we shouldn't hold out too much hope of
Netscape adding everything that we need to the format. Personally, I'd
be looking for at least the following: language, publication date,
author/copyright/credits, optional content with each item, optional
date with each item.
For the channel metadata I obviously see OCS as the way forward. While
we're all here I'd like to canvass some opinion from this list:
Does anyone feel that the current OCS format is too cumbersome from
either an authoring or a parsing point of view?
Do you think that OCS can scale from single channel sites to channel
portals such as my.userland?
Do the Dublin Core elements provide enough flexibility to describe the
channel and its owner?
> One solution would be -
> <rdf:description about="http://www.windowscepower.com/shares/netscape-rss/headlines.rdf">
> <ocs:language>en</ocs:language>
> <ocs:content-type>text/xml</ocs:content-type>
> <ocs:schema>http://my.netscape.com/rdf/simple/0.9/</ocs:schema>
> <ocs:updatePeriod>daily</ocs:updatePeriod>
> <ocs:updateFrequency>1</ocs:updateFrequency>
> </rdf:description>
This is definitely a better way to do it, although I believe the
default XML MIME type should be application/xml. I'd like to add this
to the next version of the OCS spec.
.id.
--
weblog - http://alchemy.openjava.org/
me - http://www.fdc.co.uk/people/iand/
email - iand@... | icq - 4423828
Thanks, Ian, for the work that you have put in on OCS. I am, though, a little
unclear on the objectives of the group. Is it:
a) to define a channel metadata format (when the channel is updated, who the
editor is etc.), or
b) to define the channel content format (news item 1, news item 2 etc.), or
c) to define both separately, or
d) to define both rolled into one.
> BTW, when dealing with the HTML and text versions in the CE Power
> example I used the MIME types as the format entries. Normally the
> format element points to a DTD or Schema for the format, but for
> syndication formats that are essentially designed to be pulled
> straight into a web page I considered text/html and text/plain to be
> adequate. Any thoughts on that?
I think that it is a pity that the <format> element is being used to store two
types of information, making it harder for parsers. The mime type is probably
necessary anyway for completeness (can we ever consider a scenario where .pdf
files are used as one of the channel formats?)
One solution would be -
<rdf:description about="http://www.windowscepower.com/shares/netscape-rss/headlines.rdf">
<ocs:language>en</ocs:language>
<ocs:content-type>text/xml</ocs:content-type>
<ocs:schema>http://my.netscape.com/rdf/simple/0.9/</ocs:schema>
<ocs:updatePeriod>daily</ocs:updatePeriod>
<ocs:updateFrequency>1</ocs:updateFrequency>
</rdf:description>
In the case of text/plain, the schema element would be left out. In the case of
text/html, the schema could be the dtd for the version of html used.
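To illustrate why splitting the two pieces of information helps a parser, here is a rough Python sketch that reads the content type and the optional schema from a description like the one above. The ocs namespace URI is a placeholder I made up so the fragment parses on its own; it is not the real OCS namespace, and the enclosing rdf:RDF element is also an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ocs": "http://example.org/ocs#",   # placeholder, not the real OCS namespace
}

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:ocs="http://example.org/ocs#">
  <rdf:description about="http://www.windowscepower.com/shares/netscape-rss/headlines.rdf">
    <ocs:language>en</ocs:language>
    <ocs:content-type>text/xml</ocs:content-type>
    <ocs:schema>http://my.netscape.com/rdf/simple/0.9/</ocs:schema>
    <ocs:updatePeriod>daily</ocs:updatePeriod>
    <ocs:updateFrequency>1</ocs:updateFrequency>
  </rdf:description>
</rdf:RDF>"""

def describe_formats(xml_text):
    """Return one dict per description, with the schema left as None if absent."""
    root = ET.fromstring(xml_text)
    formats = []
    for desc in root.findall("rdf:description", NS):
        formats.append({
            "url": desc.get("about"),
            "content_type": desc.findtext("ocs:content-type", namespaces=NS),
            "schema": desc.findtext("ocs:schema", namespaces=NS),
        })
    return formats

formats = describe_formats(SAMPLE)
```

A text/plain entry would simply come back with "schema" set to None, so the parser never has to guess whether a value is a MIME type or a schema URL.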
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
On Saturday, July 03, 1999, 4:20:28 PM, Dave wrote:
> What about the format for the channel list? Anyone want to work on that
> this weekend? I can create an experimental URL for quick
> additions/revisions. Right now our channel list is just a list of URLs.
> What additional info should we provide for each channel? This is what OCS
> is about, correct?
Correct. OCS is currently at version 0.3 and lets you define multiple
channels for a site, with each channel having multiple formats.
The spec is at http://alchemy.openjava.org/ocs/. The current version
lets you specify any of the Dublin Core (http://purl.org/dc) elements
as metadata about the channel provider and individual channel. The
channel formats then have several custom elements to describe
language, format and publishing schedule.
Did any of you see the Windows CE Power article about syndication? (I
know Carmen did ;) They produce three different channels: news,
articles and combined, each in three different formats: RSS, HTML and
plaintext. I wrote to the author and suggested that OCS would be ideal
for this and even supplied an example of what it could look like for
them ( http://alchemy.openjava.org/ocs/cepowerdirectory.rdf). I've not
heard back yet.
BTW, when dealing with the HTML and text versions in the CE Power
example I used the MIME types as the format entries. Normally the
format element points to a DTD or Schema for the format, but for
syndication formats that are essentially designed to be pulled
straight into a web page I considered text/html and text/plain to be
adequate. Any thoughts on that?
.id.
--
weblog - http://alchemy.openjava.org/
me - http://www.fdc.co.uk/people/iand/
email - iand@... | icq - 4423828
On Monday, July 05, 1999, 9:00:08 AM, james wrote:
> From: "james@..." <james@...>
> Dear Mark and Carmen and Dave (and whoever else is interested in
> classification),
>> So the actual categories that the publisher can choose from, how does that
>> get decided? I don't think it's a good idea to allow channel publishers to
>> create new categories; that would be a mess. A centrally-managed list would
>> be ideal, but probably impractical - do we want to give this a try?
> The centrally-managed list is not impractical if it already exists and is well
> understood and documented. Candidates for such a central list might be:
> Group I
> 1) Newsgroups structure (like comp.text.xml)
Ugh! These are a complete mess because of all the politics, infighting
and general ill-will amongst usenet members.
> 2) Yahoo
> 3) Open Directory Project
This group gives the greatest potential IMVHO, they've evolved to meet
the need of the users of the sites and reflect 'real-world'
categorisations.
> Group II
> 4) Library of Congress classification
> 5) Dewey Decimal System
These are the traditional contenders but I feel that they're overly
complex for our needs. Of the entire Dewey classification I believe
that we'd only be using a very small subset, very far down into the
hierarchy. Plus, how do we cope with more general purpose sites?
GeneHack is a popular weblog that is interested in Computers,
Technology, Films and _Microbiology_!!! How do we even begin to classify
an individual's tastes? See below for what I think is the best
approach.
>> Hopefully, selecting the channel from a large list at the aggregator will be
>> only one method of adding channels. Netscape had the right approach in
> I agree - there should be alternative ways of finding channels, such as by
> traditional free-text search, or by keyword navigation (see an example at
> http://www.xmltree.com/metadata/search.cfm and excuse the quality of keywords;
> inspired by http://www.aeiwi.com)
I think the keyword search is excellent and presented in the right way
it can be a very intuitive way of narrowing a search. The Dublin Core
Subject element as used in OCS is freeform text and putting specific
keywords there would make a lot of sense.
.id.
--
weblog - http://alchemy.openjava.org/
me - http://www.fdc.co.uk/people/iand/
email - iand@... | icq - 4423828
( this is a bit of elementary stuff that would be good to cover early on )
As a starting point, is there any comment on the basic structure of:
<formatname>
<header>
...
</header>
<item>
...
</item>
...
</formatname>
(with the possibility of other item-like entities)
If we agree on this, then we can deconstruct the header and item formats. I
know people have spoken of RDF, etc, but the feeling I get from many people
is that it's too complex. Comments?
Also, it would be nice to have a formalised definition for the scope of the
format. I've scribbled:
> A format that Internet content sources can use to make
> lists of references to or summaries of resources (local
> or otherwise) that are bound by an area of interest available
> to third parties in a timely, predictable and intelligent manner.
This obviously needs work.
Mark Nottingham, Melbourne Australia
mnot@... http://www.mnot.net/
Dear Mark and Carmen and Dave (and whoever else is interested in
classification),
> So the actual categories that the publisher can choose from, how does that
> get decided? I don't think it's a good idea to allow channel publishers to
> create new categories; that would be a mess. A centrally-managed list would
> be ideal, but probably impractical - do we want to give this a try?
The centrally-managed list is not impractical if it already exists and is well
understood and documented. Candidates for such a central list might be:
Group I
1) Newsgroups structure (like comp.text.xml)
2) Yahoo
3) Open Directory Project
Group II
4) Library of Congress classification
5) Dewey Decimal System
The following comments are my opinions only.
Group I advantages: Uses easy-to-understand terms and has been modelled on the
categorisation needs of the web.
Group I disadvantages: The categorisation is ad-hoc and has not been refined
over many years of use. Either the categorisation structure is not published
in any way (Yahoo; ODP publishes its structure in RDF) or the logic behind the
categorisation is not explained in any way (Open Directory Project). No
algorithms exist for automated classification (as far as I know).
Group II advantages: Refined over a hundred years of use by trained
categorisation experts (librarians). The categorisation structure has been
studied and understood by millions of classifiers, and used by everyone who has
visited a library.
Group II disadvantages: The categorisation is rigid and cannot be changed by
individual decisions, but only by peer review. The structure is not (IMO)
'optimised' for the subjects that people think of when dealing with the web.
My own feeling is that the Group I structures only came into being on an ad-hoc
and reactive basis, and it would be a waste to ignore the collective thought
that went into refining the Group II structures. Of these, the Dewey System is
preferable to the Library of Congress since the classification is extensible
and remains logical:
Quote from David Mundie:
"The fundamental difference is that DDC defines a conceptual space, and
the LOC does not. That is, the decimal nature of Dewey conveys a very rich
set of relationships among categories, and permits "chaining", where the LOC
does not. To take an example chosen completely at random, the Dewey code for
French birds is 598.2944, which is broken down as follows:
500 - Science
590 - Zoological Science
598 - Birds
598.29 - Geographical treatment
598.294 - Europe
598.2944 - France
In this system, the conceptual space is a continuum: it is immediately
apparent from the code that 598.29 is a subdivision of 598, and 598.2944 a
subdivision of 598.29, and so on. Contrast this with the LOC code
"QL683.W4P4", which is an unanalyzable, arbitrary code for "French Birds",
bearing no relationship to "QL682.W4P4" or "QL683.W4P5" or "QK683.W4P4".
That is, the LOC is an enumerated system, and DDC is (largely) faceted - and
it is generally acknowledged that faceted systems are the way to go."
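The "chaining" property in the quote can be sketched in a few lines of Python: because a Dewey code carries its own ancestry, a program can work out the broader classes by string handling alone. This is only an illustration under simplifying assumptions (codes in the 100-999 range, no trailing zeros after the decimal point), and the function names are mine.

```python
def dewey_digits(code):
    """Return the significant digits of a Dewey code, e.g. "590" -> "59"."""
    whole, _, frac = code.partition(".")
    if frac:
        return whole + frac
    # Trailing zeros in a three-digit class are padding, not extra detail.
    return whole.rstrip("0")

def is_subdivision(child, parent):
    """True if `child` sits somewhere below `parent` in the hierarchy."""
    c, p = dewey_digits(child), dewey_digits(parent)
    return c != p and c.startswith(p)

def ancestors(code):
    """List every broader class implied by the digits of `code`."""
    digits = dewey_digits(code)
    result = []
    for n in range(1, len(digits)):
        prefix = digits[:n]
        if n <= 3:
            result.append(prefix.ljust(3, "0"))       # "5" -> "500"
        else:
            result.append(prefix[:3] + "." + prefix[3:])
    return result

chain = ancestors("598.2944")
```

Note that this mechanical chain also yields 598.2, a step the quoted breakdown skips; the sketch only shows that the hierarchy is recoverable from the code itself, which is not possible with an arbitrary identifier.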
> Maybe another way to go about it is to allow the publisher to decide which
> classification system they want to be considered part of. For instance:
>
> <FooFormat>
> <header>
> ...
> <category authority="http://bar.com/categorydefinition.xml">Widgets/FrobNobs</category>
> ...
> </header>
> <item>
> ...
>
> This is nice and flexible, but may lead to problems; if an aggregator didn't
> want to support too many categorization methods, or there was a lot of
> overlap between categorisation schemes, there would be trouble.
The danger is that we get a proliferation of schemes, but each "My." portal will
only offer navigation through one of them rather than risk overwhelming its
users. This would mean that the content provider might provide classification
through scheme "http://bar.com" but the portal might offer the chance to browse
channels by the scheme http://foo.com - the portal would need to re-categorise
to foo.com, which would mean either doing it manually or writing a schema
mapping algorithm for every combination of bar1.com, bar2.com, bar3.com etc. to
foo.com.
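To make the cost concrete, here is a small Python sketch of what such re-categorisation looks like when done with lookup tables. Every table entry and the "Hardware/Gadgets" category are invented for illustration; the point is only that the portal needs one table per foreign scheme, and anything missing from a table falls back to manual work.

```python
# One mapping table per (source scheme, portal scheme) pair. Hypothetical data.
MAPPINGS = {
    ("http://bar.com", "http://foo.com"): {
        "Widgets/FrobNobs": "Hardware/Gadgets",
    },
}

def recategorise(scheme, category, portal_scheme, mappings=MAPPINGS):
    """Translate a category into the portal's scheme, or None if unknown."""
    if scheme == portal_scheme:
        return category                      # nothing to translate
    table = mappings.get((scheme, portal_scheme))
    if table is None:
        return None                          # no table for this scheme at all
    return table.get(category)               # None means "classify by hand"

mapped = recategorise("http://bar.com", "Widgets/FrobNobs", "http://foo.com")
unmapped = recategorise("http://bar2.com", "Widgets", "http://foo.com")
```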
> Hopefully, selecting the channel from a large list at the aggregator will be
> only one method of adding channels. Netscape had the right approach in
I agree - there should be alternative ways of finding channels, such as by
traditional free-text search, or by keyword navigation (see an example at
http://www.xmltree.com/metadata/search.cfm and excuse the quality of keywords;
inspired by http://www.aeiwi.com)
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
------------------------------------------------------------------------
Wow, I am honored. All these messages with my name in the Subject!
Surely we are not the first ones to try and solve this problem.
I'd like to hear more about James' plan to use the Dewey Decimal
system.
If you don't want to allow publishers to create new categories,
what if we have a nice handful of top-level categories and
then allow publishers to create sub-categories within those?
If we had, say, a general "news" category then it makes sense
to sub-divide it arbitrarily deep:
news.business
news.business.microsoft
news.business.microsoft.win2K
This would be nice and organized but a particular article
might need several different tags to fully describe it.
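A quick Python sketch of how an aggregator might select articles under such a dotted scheme, assuming each article carries a list of tags. The article titles and the second tag are made up, and the helper names are mine.

```python
def in_category(tag, category):
    """True if `tag` is `category` itself or lies anywhere beneath it."""
    return tag == category or tag.startswith(category + ".")

def select(articles, category):
    """Return the titles of articles with at least one tag under `category`."""
    return [
        title
        for title, tags in articles
        if any(in_category(tag, category) for tag in tags)
    ]

# Hypothetical articles; the second one needs two tags to describe it fully.
articles = [
    ("Win2K ships", ["news.business.microsoft.win2K"]),
    ("Antitrust ruling", ["news.business.microsoft", "news.legal"]),
    ("Local weather", ["news.weather"]),
]

business = select(articles, "news.business")
legal = select(articles, "news.legal")
```

Matching on `category + "."` rather than a bare prefix keeps "news.business" from accidentally matching something like "news.businessweek".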
Or we could use a keyword scheme. We can start with a reasonable
fixed list, let providers augment it with their own, and apply
social pressure to nudge nonconformists into line.
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer
-----Original Message-----
From: Mark Nottingham [mailto:mnot@...]
Sent: Saturday, July 03, 1999 5:37 PM
To: syndication@onelist.com
Subject: Re: [syndication] channel classification (was: Hello, I am Carmen)
From: "Mark Nottingham" <mnot@...>
So the actual categories that the publisher can choose from, how does that
get decided? I don't think it's a good idea to allow channel publishers to
create new categories; that would be a mess. A centrally-managed list would
be ideal, but probably impractical - do we want to give this a try?
Maybe another way to go about it is to allow the publisher to decide which
classification system they want to be considered part of. For instance:
<FooFormat>
<header>
...
<category authority="http://bar.com/categorydefinition.xml">Widgets/FrobNobs</category>
...
</header>
<item>
...
This is nice and flexible, but may lead to problems; if an aggregator didn't
want to support too many categorization methods, or there was a lot of
overlap between categorisation schemes, there would be trouble.
Hopefully, selecting the channel from a large list at the aggregator will be
only one method of adding channels. Netscape had the right approach in
allowing channel publishers to put a 'subscribe to our channel' button on
their home page; the problem here of course is that the aggregator could
(and should) be anywhere. A simple cut-n-paste URL for the XML file will
probably have to suffice for this...
So the actual categories that the publisher can choose from, how does that
get decided? I don't think it's a good idea to allow channel publishers to
create new categories; that would be a mess. A centrally-managed list would
be ideal, but probably impractical - do we want to give this a try?
Maybe another way to go about it is to allow the publisher to decide which
classification system they want to be considered part of. For instance:
<FooFormat>
<header>
...
<category authority="http://bar.com/categorydefinition.xml">Widgets/FrobNobs</category>
...
</header>
<item>
...
This is nice and flexible, but may lead to problems; if an aggregator didn't
want to support too many categorization methods, or there was a lot of
overlap between categorisation schemes, there would be trouble.
Hopefully, selecting the channel from a large list at the aggregator will be
only one method of adding channels. Netscape had the right approach in
allowing channel publishers to put a 'subscribe to our channel' button on
their home page; the problem here of course is that the aggregator could
(and should) be anywhere. A simple cut-n-paste URL for the XML file will
probably have to suffice for this...
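As a rough illustration of the authority idea from the aggregator's side, the Python sketch below reads category elements and keeps only those from schemes the aggregator has chosen to support. The sample document fills in the elided parts with invented content: the second category and its foo.com authority URL are hypothetical, as is the set of supported authorities.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<FooFormat>
  <header>
    <category authority="http://bar.com/categorydefinition.xml">Widgets/FrobNobs</category>
    <category authority="http://foo.com/categories.xml">Hardware</category>
  </header>
</FooFormat>"""

# Schemes this (imaginary) aggregator knows how to browse.
SUPPORTED = {"http://bar.com/categorydefinition.xml"}

def supported_categories(xml_text, supported=SUPPORTED):
    """Return (authority, path) pairs for categories in supported schemes."""
    root = ET.fromstring(xml_text)
    found = []
    for cat in root.findall("./header/category"):
        authority = cat.get("authority")
        if authority in supported:
            # Split "Widgets/FrobNobs" into its hierarchy levels.
            found.append((authority, (cat.text or "").split("/")))
    return found

categories = supported_categories(SAMPLE)
```

Categories from unsupported authorities are silently dropped here, which is exactly the trouble described above: the publisher's classification is lost unless the aggregator happens to speak that scheme.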
Niel
> >I agree completely here. The providers understand better than any machine
> >how their content should be categorised.
>
> Sounds like keywords to me.
IMHO, categories are not like keywords because:
1) categories imply a structured hierarchy (in the sense that I am using the
word), and suggest a way of navigating resources
2) keywords do not necessarily impart the right meaning - taken as atomic
elements, the semantic meaning is lost. This is why traditional searches based
on the non-noise words (keywords) that spiders and bots pick from web pages are
notoriously ineffective.
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
>> Re categories, I'd like to allow the channel-providers to be able to seed
>> them.
>I agree completely here. The providers understand better than any machine
>how their content should be categorised.
Sounds like keywords to me.
Niel
(Introduction coming soon)
>>The only question is whether and in what cases each channel needs more
>>than one category.
I think that should be up to the editor too. The format should allow for a
list of categories.
Also, I thought the Motley Fool's addition of ticker symbols is brilliant
and perfect. It says that the <items> should be able to carry additional
information that the aggregator might want to pick up. I can tell you this
much, we'd be very happy to create a database indexed by ticker symbol to
support people who want to follow stocks thru our aggregator. It's good
business.
What about the format for the channel list? Anyone want to work on that
this weekend? I can create an experimental URL for quick
additions/revisions. Right now our channel list is just a list of URLs.
What additional info should we provide for each channel? This is what OCS
is about, correct?
Dave
> Re categories, I'd like to allow the channel-providers to be able to seed
> them. Let them declare the categories they want to belong to, and then the
> aggregators can group them. This grouping should be dynamic, so it's easy
> for a CP to take themselves out of a category.
I agree completely here. The providers understand better than any machine how
their content should be categorised. Likewise, the grouping should be dynamic,
so that categorisation reflects any changes in the content and focus of the
channel. The only question is whether and in what cases each channel needs more
than one category.
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
Carmen, I'm having the same problem with the Choose Your News page. How do
you wade thru all the channels? My current idea is that you sort them based
on update frequency, so the most active channels rise to the top of the
list and the most inactive ones fall to the bottom. If there will be many
more channels, what's coming next will be even more challenging.
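One possible way to express that sort in Python, purely as a sketch: count how many updates each channel had in a recent window and rank on that. The seven-day window, the data layout (channel name mapped to a list of update times) and the function names are all assumptions of mine, not a description of how any existing page works.

```python
from datetime import datetime, timedelta

def updates_per_day(timestamps, now, window_days=7):
    """Average number of updates per day over the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in timestamps if t >= cutoff]
    return len(recent) / window_days

def rank_channels(channels, now):
    """Return channel names, most active first."""
    return sorted(
        channels,
        key=lambda name: updates_per_day(channels[name], now),
        reverse=True,
    )

now = datetime(1999, 7, 9, 12, 0)
channels = {
    "quiet": [now - timedelta(days=6)],
    "busy": [now - timedelta(hours=1), now - timedelta(hours=2), now - timedelta(days=1)],
    "dormant": [now - timedelta(days=30)],
}
ranking = rank_channels(channels, now)
```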
Re categories, I'd like to allow the channel-providers to be able to seed
them. Let them declare the categories they want to belong to, and then the
aggregators can group them. This grouping should be dynamic, so it's easy
for a CP to take themselves out of a category.
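A minimal Python sketch of that kind of dynamic grouping, assuming the aggregator rebuilds its groups from whatever categories each provider currently declares. The channel names and categories below are invented.

```python
from collections import defaultdict

def group_by_category(declared):
    """Build {category: [channels]} from {channel: [declared categories]}."""
    groups = defaultdict(list)
    for channel, categories in declared.items():
        for category in categories:
            groups[category].append(channel)
    return {category: sorted(names) for category, names in groups.items()}

# Hypothetical declarations read from each channel's own file.
declared = {
    "Example Weblog": ["technology", "films"],
    "Example Stocks": ["business"],
    "Example Gadgets": ["technology"],
}
groups = group_by_category(declared)

# A provider drops a category simply by no longer declaring it;
# the next rebuild reflects the change with no central bookkeeping.
declared["Example Weblog"] = ["films"]
regrouped = group_by_category(declared)
```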
BTW, half of my intro message appears to have been cut off, right at the
point where I had pasted a URL into the message. I wonder if there's a bug
in this listserv??
Also, I'm looking for a way for readers to address their complaints about
channels to the people who are responsible for the channel, not to me. I
get a fair amount of email about broken links, bad style, etc, when it's
completely not in my power to do anything about it.
Dave
Dear Carmen and all
> * The format of a list of syndicated web sites. Right now there
> are two one-instance formats, Dave's and Ian's (what a great
> community -- we are on a first name basis). It sounds to me
> like James is coming up with something too. Ian's OCSF is a great
> start.
James is not coming up with a competing standard. I'll publish the contents of
xmlTree in whatever format the group decides to standardise on. I have at the
moment only one feeling which I would like others to comment on - that whatever
standard we agree on should try to separate the metadata of the content (i.e.
what hours it is available, when it is updated, who the editor is and so on)
from the content itself (the actual news items). Otherwise this metadata is
repeated at the top of every news document.
> I am very interested in ways to sort and organize the lists
> of providers. Right now between my built-in list, Ian's list,
> and Dave's list I have about 250. This is way too many for
> a normal user to sort through in order to decide what kinds
> of news they want. We need standard (yet extensible)
> categories. Either exclusive or non-exclusive.
I am changing the categorisation structure for xmlTree from my own anarchistic
hodge-podge to an international standard, the Dewey Decimal System. This is
used by hundreds of thousands of libraries around the world (it has around 80%
market share). The reason for this is that I want to be able to import and
export channel registrations to and from other sites which already have their
own ad-hoc category structure, and the mapping involved is complex, tedious and
error-prone. If we can agree on a globally used (and well publicised) standard
then at least there is common understanding and other sites can use my
categorisation easily.
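As a rough illustration of why a shared scheme simplifies import and export, here is a small Python sketch. The ad-hoc labels and the function names are hypothetical; only the ten top-level Dewey classes are taken from the standard. Each site maps its own labels to a Dewey class once, and two sites can then compare categories without a site-to-site mapping.

```python
# The ten main classes of the Dewey Decimal System.
DEWEY_MAIN_CLASSES = {
    "000": "Computer science, information & general works",
    "100": "Philosophy & psychology",
    "200": "Religion",
    "300": "Social sciences",
    "400": "Language",
    "500": "Science",
    "600": "Technology",
    "700": "Arts & recreation",
    "800": "Literature",
    "900": "History & geography",
}

# Hypothetical ad-hoc labels, as two imaginary directories might use them.
AD_HOC_TO_DEWEY = {
    "software": "000",
    "computing": "000",
    "economics": "300",
    "finance": "300",
    "sport": "700",
    "sports": "700",
}


def to_dewey(label):
    """Return the Dewey class for an ad-hoc label, or None if unmapped."""
    return AD_HOC_TO_DEWEY.get(label.strip().lower())


def same_category(label_a, label_b):
    """True when both labels map to the same known Dewey class."""
    code_a, code_b = to_dewey(label_a), to_dewey(label_b)
    return code_a is not None and code_a == code_b
```

A real mapping would need the finer-grained Dewey divisions, not just the main classes; the sketch only shows the shape of the lookup.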
> A way to get stock quotes and weather would be cool. I expect
> to add filtering and notification in a post-1.0 release.
You can obtain XML marked-up stock quotes - see this link for details:
http://www.xmltree.com/resource/detail.cfm/ContainerID/42/ResourceID/226
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
Hello,
I am Carmen. I got interested in syndication so that I could
read more web sites without spending all day clicking around.
Working with my husband (who stays in the background but
helps me out a little bit :-) we've been having some fun.
We started by writing a very simple Windows client which could
read and display SlashDot's UltraMode text format. Then Dave
started talking about publishing his channel list, and we resolved
to be the first to support it. So we added in support for several
XML syndication formats (RSS, scriptingNews, and Ultramode XML),
and waited for Dave to announce his list. Just as soon as it
was ready we wrote the last bits of code and announced our
baby, "Carmen's Headline Viewer." It's been available for download
at http://www.vertexdev.com/HeadlineViewer; so far I have
gotten a lot of good feedback.
It has evolved over the last 2 months into a pretty nice generic
client. I can read the syndication lists from Dave's userland.com
and from Ian's Internet Alchemy (http://alchemy.openjava.org/rss/).
In order to satisfy my news craving I have also added support
for general text-based "backend" formats.
I would definitely prefer to see fewer formats, not more.
Realistically, it is really not a big deal for me to add support for
new formats; often just 20-45 minutes of fiddling around. But I would
prefer to work on features instead. I have added some support
for the moreover (www.moreover.com) format; however, their
URLs are not persistent so this is not yet a useful feature.
We should make it clear that there are two interesting formats
to talk about:
* The format of an individual syndicated web site. Basically a
list of articles/titles/URLs (with interesting details thrown
in for good measure).
* The format of a list of syndicated web sites. Right now there
are two one-instance formats, Dave's and Ian's (what a great
community -- we are on a first name basis). It sounds to me
like James is coming up with something too. Ian's OCSF is a great
start.
So, what should we do? Here are my suggestions:
0. Let's stay friendly here. This is a small group and it is
way too easy for innocent misunderstandings to turn into
acrimonious debate. We don't need that.
1. Let's get as much wood behind one arrow as possible.
2. Let's focus on practical stuff (counter-example: the XMLDev
list, where some very abstract concerns end up scaring off
many interested XML newbies).
3. Let's keep things simple. In most situations we would want
to simplify things for the client program. Here, however,
I believe that we want to have clean, simple, formats with
many optional pieces of data. That way, new syndicators
can start simple (with a small investment) and build from
there.
4. Let's pay attention to scalability. Both to the very
low-end, simple situations and to high-end textually
rich ones. Let's also worry about the list format so that
if (as I fully expect) there are thousands of syndicated
providers within a short time it will be possible to sort
through the interesting ones.
5. Let's start actively soliciting sites to syndicate their
content. Once Ian's FAQ is ready to go we can point
interested sites to it and say "here is what you need
to do to syndicate your content."
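Suggestion 3 above (clean, simple formats with many optional pieces of data) might look like this on the client side. This is only a sketch: the element names are hypothetical and loosely modelled on RSS, and which fields count as required is an assumption made for the example, not something the group has agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: the first item is minimal, the second adds
# an optional element that a richer provider might supply.
SAMPLE = """\
<channel>
  <title>Example News</title>
  <item>
    <title>First story</title>
    <link>http://example.com/1</link>
  </item>
  <item>
    <title>Second story</title>
    <link>http://example.com/2</link>
    <description>Optional detail a richer provider might add.</description>
  </item>
</channel>
"""

REQUIRED = ("title", "link")   # assumed minimum for a usable headline
OPTIONAL = ("description",)    # anything else is optional


def parse_items(xml_text):
    """Return one dict per <item>; skip items missing a required field."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        fields = {name: node.findtext(name) for name in REQUIRED + OPTIONAL}
        if any(fields[name] is None for name in REQUIRED):
            continue  # not usable as a headline without title and link
        items.append(fields)
    return items


headlines = parse_items(SAMPLE)
```

A new syndicator can start with just a title and a link per item, and the same client code keeps working when optional fields are added later.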
I am very interested in ways to sort and organize the lists
of providers. Right now between my built-in list, Ian's list,
and Dave's list I have about 250. This is way too many for
a normal user to sort through in order to decide what kinds
of news they want. We need standard (yet extensible)
categories. Either exclusive or non-exclusive.
I have looked at XMLNews and XMLNews-Meta but so far there
is no content for them (not that I know of anyway). It would
not be a big deal to support them. They do seem to be on
the verge of violating my rule #3.
A way to get stock quotes and weather would be cool. I expect
to add filtering and notification in a post-1.0 release.
I am also trying to think ahead to ways to advertise and
generate some revenue. Imagine if I had a small area on
my UI which could display banner ads. Some of these ads
would be my ads, but many would be drawn from the syndicated
content providers. This is kind of like the way US television
producers sell some ad time themselves and then leave open
slots for the local affiliates. I am getting way ahead of
reality here but it is worth thinking about.
That's enough for now!
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer
Hello all,
My name is Ian Davis and I run the Internet Alchemy weblog. I run a
script called RSS Maker that crawls selected web sites and generates
RSS files from the headlines it discovers. There are currently 25
channels generated in this way.
I'm also working on the Open Content Syndication directory format for
creating channel listings. The current spec is at
http://alchemy.openjava.org/ocs/
I'm working on a content syndication FAQ directed primarily at new
users who perhaps are overwhelmed by the variety of syndication formats
and need help getting started.
The basic outline of the FAQ is as follows, comments are very welcome.
1. What is Content Syndication?
2. What formats exist for syndication?
2.1 What is RSS?
2.2 What is scriptingNews?
2.3 What is XMLNews?
2.4 What is Ultramode?
2.5 What is Avantgo?
2.6 What is PQA?
3. Where can I advertise my content?
4. Where can I find content for my site?
5. What software is available to help create channels?
Look forward to working with you all
.id.
--
weblog - http://alchemy.openjava.org/
me - http://www.fdc.co.uk/people/iand/
email - iand@... | icq - 4423828
Dear Mark and all on the list,
Many thanks for the welcome.
My name is James, and I run what I believe is the first directory of XML
content and interfaces which generate XML.
My goal is to match content providers with content consumers. This means
providing an infrastructure where providers can adequately describe their
wares, and which has search and discovery facilities for content consumers.
I'm amazed at the growth of channel-type content syndication, but feel that the
burden of generating content in multiple channel formats will discourage
providers, and also raise the barrier for potential consumers who want to
access content regardless of its format.
I agree totally with the aims of the list, trying to find a Goldilocks standard
which is not too light (like the current RSS?) and not too heavy.
My main thrust at the moment is categorisation of content, in the expectation
that in the near future we will see thousands of channels. I am standardising
the categorisation schema of xmlTree, refining automated categorisation
algorithms, and I have offered to My.Userland and anyone else to export the
categories, along with other details, of all of the channels listed in xmlTree
(currently around 250).
I'm looking forward to contributing in any way I can.
Best regards,
James Carlyle
james@...
www.xmltree.com - directory of XML content on the web
My name is Dave Winer, I run UserLand Software, and we do My.UserLand.Com,
which is a public content aggregator for RSS and <scriptingNews> format. We
are also doing editorial, writing and system management tools, as well as
servers around these syndication formats. UserLand also has content that is
distributed thru both formats. So we're on both sides of the issue, really
on all sides.
Here's a snapshot in time of our position on evolution of web syndication
formats:
I'd like to introduce myself and let you know what brought me to this
list and how I think I can contribute.
I'm Director of Research at Innovision Corporation. We aren't in the
business of syndicating content, but we are developing (and sell)
products (mostly server side) that work with XML. I have several
applications running in-house that utilize both RSS and ScriptingNews
formats as data sources. I'm also personally very interested in
widespread use of standard syndication formats because, like many of you
I'm sure, I dream of having an application that consolidates, filters,
archives and prioritizes all the information available from various
"weblogs" and sites like Scripting News, and presents it to me with an
easy to use interface. This will only happen if content providers can
agree to use (and formally define) standard formats.
I hope that I can help with technical XML issues. I've had experience
working with other organizations that are creating XML formats (like
FIXML <http://www.fixprotocol.org> and OFX <http://www.ofx.net>...both
protocols for exchanging financial information). I've created many XML
DTDs, lots of software that works with XML and, I think, have a pretty
good working knowledge of XML.
I look forward to working with everyone!
-Matt
Hey, all!
First of all, let me say "thanks" to Mark for getting this list started. I
feel this will be a great resource that will hopefully lead to something to
make all of our lives easier: an XML-based syndication standard!
> Existing Work
> There is a lot of work already out there. Other formats that are similar,
> and designed/promoted by primarily one entity include:
>
> * RSS (Netscape) - Started it all. Netscape currently working on a new
> version.
> * ScriptingNews - WebLog. Started with RSS, moved to their own format. Also
> working on new version.
> * MoreOverNews - proprietary news aggregation site.
> and others (list if you know of one)
I've looked at XMLNews (http://www.xmlnews.org/) and found it interesting;
there may be some good ideas there that could add to this discussion.
XMLNews is designed for syndicating news articles and information about news
articles. They have 2 formats, one for delivering the meta content about an
article (author, title, etc.), and another for delivering the actual
article.
It appears to me that XMLNews strikes a middle ground between formats like
RSS and ScriptingNews and "heavier" formats like ICE.
I also have been interested in the Dublin Core subset of RDF:
http://purl.org/dc. There has been a lot of work done by the Dublin Core
folks that could be useful for a syndication discussion.
Personal info:
Just to explain where I'm coming from and what my interests are in
syndication: I work for The Motley Fool, a personal finance web site with a
large amount of editorial content. The biz dev folks are constantly working
on deals with portals, etc. to carry our content. In the past, every time
they inked a deal a new means of delivering the content was developed. Some
content is emailed to partner A; partner B gets their info FTP'd; partner C
gets it flown in by carrier pigeon; etc. I'm now trying to convince everyone
that we need an XML solution. Everyone (partners included for the most part)
agrees that XML is the solution, the question is "which format?" Hence, my
interest in a list such as this one.
I've been trying to develop a format "behind closed doors" much like many
other people. The current iteration is here:
http://www.fool.com/foolhq/markk/xml/foolnews.xml It uses RDF (specifically
the Dublin Core Subset) as well as some Fool-centric tags for describing the
last week's worth of articles on the Fool site. A key feature that we need
is a way to describe which stock tickers are associated with each article.
One thought: There are (at least) two sides to the syndication game, content
providers and content distributors (for lack of a better word). Sites like
The Motley Fool are providers; portals are distributors. I very much hope
that this list gets a mixture of both breeds of organizations, because if a
bunch of providers come up with a solution that is ignored by the
distributors, no one wins. The reciprocal is true as well.
Anyway, enough from me. I look forward to the input of others!
[ There are now 16 people on this list, with a very good representation of
the people who should be here. I'll put a modified version of this in the
welcome message, so that newcomers are uptospeed-ish. ]
There are a lot of separate efforts to define XML
news/announcement/syndication/resource discovery formats. All of them are
happening behind closed doors, despite people's best intentions. As a
result, we're going to end up with multiple, incompatible overlapping
formats.
This list is here to get it out in the open. There are obviously enough
people interested. If we can get everybody to hash out a core, then
individual gathering places (like NetCenter, ScriptingNews and others) can
add features to a common standard. The trick is to separate the format from
the delivery/presentation mechanism.
I think everybody would agree the first thing we have to figure out is what
the scope of a syndication format is. 'Syndication' may not even be an
appropriate term.
Existing Work
There is a lot of work already out there. Other formats that are similar,
and designed/promoted by primarily one entity include:
* RSS (Netscape) - Started it all. Netscape currently working on a new
version.
* ScriptingNews - WebLog. Started with RSS, moved to their own format. Also
working on new version.
* MoreOverNews - proprietary news aggregation site.
and others (list if you know of one)
On the heavier side, there's:
* ICE - true syndication format; request/response format layered over HTTP
POST. Probably too excessive for the uses above.
* RDF - resource description framework. A standard way of describing
metadata. May be good to incorporate this, or it may confuse the issue.
Could someone more familiar summarise?
List Activity and Scope
This is what I had in mind when I suggested the list:
* look into current standards, decide what's usable, where there's overlap
* define the problem - what kinds of things do we want the format to do, and
what is out of scope?
* hammer out a specification. Two major parts are:
- what features a [insert-format here] file must and may have
- which of those features a parser must support
* document it
* produce a reference implementation of the spec, and tools to enable people
to translate to/from other formats, as well as read and write it.
I've already been playing on my own with tools and ideas to get content into
a format, and then drag it back out. Getting everybody to agree on the bit in
the middle is what this is all really about for me...
Let's get going!
Mark Nottingham, Melbourne Australia
mnot@...
http://www.mnot.net/