I was running some feed tests this weekend and I thought some of you might
be interested in the results. I know the sample set is far too small and the
testing methodology is far from perfect, but it may have entertainment value
if nothing else.

First, some background. I started with a sample set of 50000 feeds
from syndic8.com including only those that had changed in the past 180 days.
The choice of time was fairly arbitrary - I was just trying to get feeds
that were still likely to be alive. I eliminated any feeds that had no
title, description or HTML URL, which suggested they were probably invalid
(about 800). More controversially, I eliminated any duplicates from the same
host (about 39000). There were some hosts with huge numbers of feeds
(rss.topix.net has around 18000, izynews.de 5000) and I thought these would
likely skew the results. I also wasn't too keen to hit a single host with
18000 downloads.
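
For anyone wanting to repeat the per-host filtering, here is a minimal sketch in Python. It assumes the feeds are available as a plain list of URLs and that the first feed seen for each host is the one kept; which feed was actually kept per host isn't stated above, and the function name and example paths are hypothetical.

```python
from urllib.parse import urlsplit


def one_feed_per_host(feed_urls):
    """Keep the first feed URL seen for each host, preserving input order."""
    kept = {}
    for url in feed_urls:
        # hostname is already lower-cased by urlsplit; it is None when the
        # string has no network location (e.g. a corrupted export entry).
        host = urlsplit(url).hostname
        if host is None:
            continue
        # Assumption: keep the first URL per host and ignore later ones.
        kept.setdefault(host, url)
    return list(kept.values())
```

Something like one_feed_per_host(urls) would then cut a host with thousands of feeds down to a single download.
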
After all that, my total set of test feeds was only 9899 of which 557 didn't
connect and 213 weren't valid feeds (some, I discovered later, were
just URLs that had been corrupted by the syndic8 export). That left 9129 of
which 3416 were RDF, 5162 were RSS and 551 were Atom. Of the RSS feeds, 3179
were RSS 2.0 and 1974 were RSS 0.9x (mostly 0.91).

Now for some tag usage. Of the 5162 RSS feeds, only 53 used the textInput
element (25 in 2.0 feeds, 28 in 0.9x feeds). 129 used skipHours (123 in
2.0), 27 used skipDays (20 in 2.0), 604 used ttl (589 in 2.0), 37 used cloud
(36 in 2.0), and 21 used rating (9 in 2.0).
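
A count like the one above could be produced roughly as follows. This is only a sketch using Python's standard xml.etree.ElementTree, not the code used for these tests; it assumes un-namespaced RSS 0.9x/2.0 documents already downloaded as strings, and the function and constant names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Optional channel-level elements tallied in the figures above.
OPTIONAL_TAGS = ("textInput", "skipHours", "skipDays", "ttl", "cloud", "rating")


def count_tag_usage(feed_documents):
    """Count how many feeds use each optional channel element at least once."""
    usage = Counter()
    for xml_text in feed_documents:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # not well-formed XML, so not counted
        channel = root.find("channel")
        if channel is None:
            continue  # no <channel> child, so not treated as RSS here
        for tag in OPTIONAL_TAGS:
            if channel.find(tag) is not None:
                usage[tag] += 1
    return usage
```

Each feed contributes at most one to each tag's count, which matches counting feeds rather than element occurrences.
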
My main interest, though, was testing for markup in item title elements
(actually I tested a bunch of different elements, but I'm not going to go
through them all). Basically I checked for any occurrence of a left angle
bracket or an ampersand in the content (obviously after XML escaping had
been removed). In the RSS feeds I matched 1530 titles in 756 unique feeds.
After sorting the titles into those which appeared to be markup and those
which appeared not to be markup, I was left with 289 "markup" feeds, and 492
"plain text" feeds.
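
The basic check described above (a left angle bracket or an ampersand surviving in the title once XML escaping is removed) can be sketched like this. It assumes un-namespaced RSS and relies on ElementTree having already undone the XML escaping when it returns the element text; flagged_titles is a hypothetical name, not the original test code.

```python
import xml.etree.ElementTree as ET


def flagged_titles(xml_text):
    """Return item titles that still contain '<' or '&' after XML unescaping."""
    root = ET.fromstring(xml_text)
    flagged = []
    for item in root.iter("item"):
        # findtext returns the parsed (already unescaped) text of <title>.
        title = item.findtext("title") or ""
        if "<" in title or "&" in title:
            flagged.append(title)
    return flagged
```

Note that this only flags candidates; deciding whether a flagged title is really markup is a separate step.
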
You'll notice, however, that 492+289=781 which is 25 more than my original
756 feeds. This is because some feeds contain entries with HTML titles as
well as entries with titles that appear to be plain text. The ones that
appear to be plain text have usually been marked as such because they contain
a solitary unescaped ampersand; however, this is quite likely to be just an
HTML encoding error. If you disregard these ampersand entries,
the figures look more like 289 definitely markup, 68 definitely plain text,
and 399 uncertain.
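
To illustrate the three-way split, here is a rough heuristic in Python. The regular expressions and category names are my own assumptions for the sake of the example (the sorting above was done by judging what "appeared to be" markup), so treat it as a sketch rather than the method used.

```python
import re

# Assumed heuristics: something shaped like an HTML tag or an entity reference.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def classify_title(title):
    """Classify an already-unescaped title as 'markup', 'uncertain' or 'plain'."""
    if TAG_RE.search(title) or ENTITY_RE.search(title):
        return "markup"
    if "&" in title:
        # A solitary ampersand: plain text, or an HTML encoding error.
        return "uncertain"
    return "plain"


def classify_feed(titles):
    """Return every category found among a feed's titles."""
    return {classify_title(t) for t in titles}
```

Because classify_feed returns a set, a single feed can land in more than one category, which is the same effect that made 492+289 exceed 756.
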
What conclusions can we draw from this? Nothing really. As I said at the
beginning, the sample set is far too small. Also I don't think the feeds on
syndic8 are very representative of feeds on the Internet in general. When I
have more time I'll try to get a bigger/better set of feeds to test with
and see if the results differ significantly. Most of the results were close
to what I'd expected though.

Regards
James