Some useless RSS statistics
Message #139 of 1975
I was running some feed tests this weekend and I thought some of you might
be interested in the results. I know the sample set is far too small and the
testing methodology is far from perfect, but it may have entertainment value
if nothing else.

First some background. I initially started with a sample set of 50000 feeds
from syndic8.com including only those that had changed in the past 180 days.
The choice of time was fairly arbitrary - I was just trying to get feeds
that were still likely to be alive. I eliminated any feeds that had no
title, description or HTML URL, which suggested they were probably invalid
(about 800). More controversially, I eliminated any duplicates from the same
host (about 39000). There were some hosts with huge numbers of feeds
(rss.topix.net has around 18000, izynews.de 5000) and I thought these would
likely skew the results. I also wasn't too keen to hit a single host with
18000 downloads.
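A minimal sketch of the per-host de-duplication described above. The post does not say which feed was kept for each host, so keeping the first one seen is an assumption, as are the function name `one_feed_per_host` and the sample URLs.

```python
from urllib.parse import urlsplit

def one_feed_per_host(feed_urls):
    """Keep the first feed URL seen for each host and drop the rest.

    "First seen wins" is an assumption; the post only says that
    duplicates from the same host were eliminated.
    """
    seen_hosts = set()
    kept = []
    for url in feed_urls:
        host = (urlsplit(url).hostname or "").lower()
        if not host or host in seen_hosts:
            continue  # unparseable URL, or a host we already have
        seen_hosts.add(host)
        kept.append(url)
    return kept

# Hypothetical sample URLs, for illustration only.
sample = [
    "http://rss.topix.net/rss/a.xml",
    "http://rss.topix.net/rss/b.xml",
    "http://example.org/feed.xml",
]
print(one_feed_per_host(sample))
```

Filtering this way also caps the crawl at one download per host, which matches the concern about hitting a single server thousands of times.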

After all that, my total set of test feeds was only 9899, of which 557 didn't
connect and 213 weren't valid feeds (actually, some I discovered later were
just URLs that had been corrupted by the syndic8 export). That left 9129, of
which 3416 were RDF, 5162 were RSS and 551 were Atom. Of the RSS feeds, 3179
were RSS 2.0 and 1974 were RSS 0.9x (mostly 0.91).
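The counts above can be cross-checked with a few lines of arithmetic. All numbers are taken from the post; the variable names are just labels. One thing the check shows is that the two RSS version counts add up to 5153, so 9 of the 5162 RSS feeds are not assigned to either version in the figures as posted.

```python
total = 9899
no_connect, invalid = 557, 213

usable = total - no_connect - invalid
by_format = {"RDF": 3416, "RSS": 5162, "Atom": 551}
rss_versions = {"2.0": 3179, "0.9x": 1974}

# The per-format counts account for every usable feed.
assert usable == sum(by_format.values())

# The per-version counts fall slightly short of the RSS total.
unaccounted_rss = by_format["RSS"] - sum(rss_versions.values())

print(usable, unaccounted_rss)  # 9129 9
```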

First some tag usage. Of the 5162 RSS feeds, only 53 used the textInput
element (25 in 2.0 feeds, 28 in 0.9x feeds). 129 used skipHours (123 in
2.0), 27 used skipDays (20 in 2.0), 604 used ttl (589 in 2.0), 37 used cloud
(36 in 2.0), and 21 used rating (9 in 2.0).
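For anyone wanting to reproduce this kind of tally, here is a rough sketch using Python's standard `xml.etree.ElementTree`. It is not the tooling used for the numbers above (the post does not say what that was). It assumes well-formed RSS 0.9x/2.0 documents where these elements appear as direct, un-namespaced children of `channel`, and it matches element names case-sensitively; the helper names and the sample feed are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TAGS = ("textInput", "skipHours", "skipDays", "ttl", "cloud", "rating")

def channel_tags_used(feed_xml):
    """Return the subset of TAGS present as direct children of <channel>."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return set()
    return {tag for tag in TAGS if channel.find(tag) is not None}

def tally(feeds):
    """Count how many feeds use each tag (one count per feed, not per element)."""
    counts = Counter()
    for xml_text in feeds:
        counts.update(channel_tags_used(xml_text))
    return counts

# Hypothetical feed, for illustration only.
sample = """<rss version="2.0"><channel>
<title>Example</title><ttl>60</ttl>
<skipHours><hour>3</hour></skipHours>
</channel></rss>"""

print(tally([sample]))
```

Splitting the counts by version, as in the figures above, would only need the `version` attribute of the root `rss` element as an extra key.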

My main interest, though, was testing for markup in item title elements
(actually I tested a bunch of different elements, but I'm not going to go
through them all). Basically I checked for any occurrence of a left angle
bracket or an ampersand in the content (obviously after XML escaping had
been removed). In the RSS feeds I matched 1530 titles in 756 unique feeds.
After sorting the titles into those which appeared to be markup and those
which appeared not to be markup, I was left with 289 "markup" feeds, and 492
"plain text" feeds.

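The title check described above (a left angle bracket or an ampersand remaining once XML escaping has been removed) might be sketched like this. An XML parser already undoes one level of escaping, so anything that still looks like `<` or `&` in the parsed text was escaped in the feed source. The function name and the sample feed are hypothetical, and the path assumes RSS 0.9x/2.0 layout (`channel/item/title`).

```python
import xml.etree.ElementTree as ET

def suspicious_titles(feed_xml):
    """Return item titles that still contain '<' or '&' after XML parsing."""
    root = ET.fromstring(feed_xml)
    hits = []
    for title in root.iterfind("./channel/item/title"):
        text = "".join(title.itertext())  # parsed, i.e. XML-unescaped, text
        if "<" in text or "&" in text:
            hits.append(text)
    return hits

# Hypothetical feed, for illustration only.
sample = """<rss version="2.0"><channel>
<item><title>Plain title</title></item>
<item><title>&lt;b&gt;Bold&lt;/b&gt; news</title></item>
<item><title>AT&amp;T results</title></item>
<item><title>Caf&amp;eacute; review</title></item>
</channel></rss>"""

for t in suspicious_titles(sample):
    print(t)
# <b>Bold</b> news
# AT&T results
# Caf&eacute; review
```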
You'll notice, however, that 492+289=781, which is 25 more than my original
756 feeds. This is because some feeds contain entries with HTML titles as
well as entries with titles that appear to be plain text. The ones that
appear to be plain text have usually been marked as such because they contain
a solitary unescaped ampersand; however, this is quite likely to just be an
HTML encoding error. Now, if you were to disregard these ampersand entries,
the figures look more like 289 definitely markup, 68 definitely plain text,
and 399 uncertain.
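The post does not say how titles were sorted into "markup" and "plain text", so the following is only one possible heuristic, not the method actually used. It treats anything that looks like a tag or an entity reference as markup, a lone ampersand as uncertain, and everything else as plain text. The regular expressions and the name `classify_title` are assumptions. The last lines re-derive the overlap figures from the numbers quoted above.

```python
import re

# Something shaped like an HTML tag, e.g. <b> or </a>.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# A named or numeric entity reference, e.g. &eacute; or &#8217;.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def classify_title(text):
    """Guess whether an XML-unescaped title is markup, plain text or uncertain."""
    if TAG_RE.search(text) or ENTITY_RE.search(text):
        return "markup"
    if "&" in text and "<" not in text:
        return "uncertain"  # bare ampersand: plain text or an HTML encoding error
    return "plain"

print(classify_title("<b>Bold</b> news"))    # markup
print(classify_title("Caf&eacute; review"))  # markup
print(classify_title("AT&T results"))        # uncertain
print(classify_title("x < y"))               # plain

# Feeds counted in both groups, from the figures in the post.
both_kinds = 289 + 492 - 756
print(both_kinds)                    # 25
print(492 - both_kinds == 68 + 399)  # True
```

The final check shows the two sets of figures agree: the 467 feeds with no markup titles split into 68 definitely plain text and 399 uncertain.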

What conclusions can we draw from this? Nothing really. As I said at the
beginning, the sample set is far too small. Also I don't think the feeds on
syndic8 are very representative of feeds on the Internet in general. When I
have more time I'll try and get a bigger/better set of feeds to test with
and see if the results differ significantly. Most of the results were close
to what I'd expected though.

Regards
James



Mon Feb 13, 2006 6:28 am

James Holderness (james_holder...)

... James, I could definitely cook up some more interesting and/or representative data for you if you would like. For example I could collect up several days...
Jeff Barr
jeffbarr_2000
Feb 13, 2006
7:04 am

... Good stuff. When all is said and done, I simply want a spec on which I can base a decision as to which of these titles the Feed Validator can flag. - Sam...
Sam Ruby
sa3ruby
Feb 13, 2006
12:31 pm

... Thanks for the offer. That would be really great. ... There are two main concerns for me. One is making sure the feeds are alive and valid - I was assuming...
James Holderness
james_holder...
Feb 13, 2006
11:53 pm