webanalytics · The Web Analytics Forum
Message #7930 of 24284
Re: [webanalytics] Re: Wanted: a tool that can scan a site to report on integrity of tracking code

On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
> I checked with one of our Unix experts, who says that you would be
> able to use wget and grep to download a page and look for page tags
> and links and check the page tagging integrity, assuming that:

Well not quite... :-)


> - Your page tags are always in the same format (i.e. the code is not
> split across lines differently)

Do it in Perl, and slurp the entire file in as a "single line".
Regexes that ignore line splits are trivial to write. Common, even. A
"while loop" to cycle through the matches is a fairly common construct.
:-)
Even multi-megabyte HTML files would be easy this way. And Perl is
*designed* to do this.
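
A minimal sketch of the slurp-and-match idea, written in Python rather than Perl. The tag format is an assumption made for illustration: a script element whose src ends in a hypothetical "tracker.js". The whole page is held in one string, and because \s and negated character classes match newlines, a tag split across lines is found like any other.

```python
import re

# Hypothetical page tag: <script ... src=".../tracker.js">.
# [^>]* and \s both match newlines, so line splits inside the tag are ignored.
TAG_RE = re.compile(
    r"<script\b[^>]*\bsrc\s*=\s*[\"']([^\"']*tracker\.js)[\"'][^>]*>",
    re.IGNORECASE,
)


def find_page_tags(html):
    """Return the src value of every matching tag in the whole document."""
    # finditer plays the role of the "while loop" over successive matches.
    return [match.group(1) for match in TAG_RE.finditer(html)]


sample = """<html><head>
<script
   type="text/javascript"
   src="/js/tracker.js"></script>
</head><body>
<script src='http://stats.example.com/tracker.js'></script>
</body></html>"""

for src in find_page_tags(sample):
    print(src)
```

A real tag may be an inline script or an image beacon instead, in which case only the pattern changes.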

Or run the page through tidy first.
Or something similar. Plenty of cool libraries on CPAN to check out.


> - Your site only uses plain links, no JavaScript menus or onClick links

Should be very easy to handle those. *Something* can interpret them
programmatically (i.e. a web browser), so reversing that wouldn't be hard
at all. It's not like these are undocumented standards. Look for the
pattern(s). Match them.
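
As a rough illustration of "look for the pattern(s), match them", the Python sketch below pulls URLs out of onclick handlers. The two patterns it recognises (an assignment to location or location.href, and a window.open call) are assumptions about how a site might write its links; anything built up dynamically in script would still need a real JavaScript engine.

```python
import re

# An onclick attribute and its quoted value (either quote style).
ONCLICK_RE = re.compile(r"""onclick\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)

# Assumed link idioms inside the handler:
#   location = '...'   location.href = '...'   window.open('...')
URL_IN_JS_RE = re.compile(
    r"""(?:location(?:\.href)?\s*=\s*|window\.open\s*\(\s*)(["'])(.*?)\1""",
    re.IGNORECASE,
)


def onclick_urls(html):
    """Return URLs referenced by the recognised onclick idioms, in page order."""
    urls = []
    for handler in ONCLICK_RE.finditer(html):
        for match in URL_IN_JS_RE.finditer(handler.group(2)):
            urls.append(match.group(2))
    return urls


sample = """<a href="#" onclick="window.location='/a.html'; return false;">A</a>
<span onClick='window.open("/b.html")'>B</span>"""

for url in onclick_urls(sample):
    print(url)
```

Handlers that use HTML-escaped quotes (&quot;) are not covered by this sketch.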

If you wanted to get super sophisticated, you could grab the
JavaScript code libraries from the Firefox codebase and use those to
handle any and all JavaScript and HTML translation issues.


> - You can navigate your site without using flash

Never looked inside Flash myself, so you could well be correct. :-)
Though the GNU Flash program (Gnash) may have appropriate libs that could
be borrowed or interfaced to, to do this. It's a lot harder, but still
achievable. The joy of Open Source: you don't have to reinvent the
wheel.


> - You can confidently assume that all of the other JavaScript in your
> pages will not cause the page tagging JavaScript to fail

Well that's just normal debugging. But point taken and agreed.
Additional: Your acceptance testing should be picking this up.
Key word: "Should". :-)


> - Your site doesn't use <base> html elements

If a spidering tool can't handle the base tag, or any legitimate HTML
tag, then the tool is broken. Submit bug report, get fix overnight.
Problem solved. :-)
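
For reference, honouring a base element takes only a few lines. The Python sketch below uses the standard html.parser and urllib.parse.urljoin; the class and function names are made up for this example. It resolves every link against the base once one has been seen, and it also picks up frame and iframe sources. It is simplified: it lets any later base element override an earlier one, whereas browsers use only the first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets, honouring a <base href> if present."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url  # default base is the page's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = urljoin(self.base, attrs["href"])
        elif tag in ("a", "frame", "iframe"):
            target = attrs.get("href") if tag == "a" else attrs.get("src")
            if target:
                self.links.append(urljoin(self.base, target))


def collect_links(page_url, html):
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


sample = """<html><head><base href="/shop/"></head><body>
<a
  href="item.html">split over lines</a><a href="../about.html">second on a line</a>
<iframe src="promo.html"></iframe>
</body></html>"""

for link in collect_links("http://example.com/docs/page.html", sample):
    print(link)
```

Because a parser is used rather than a line-oriented match, elements split across lines and several elements on one line need no special handling.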


> - Your grep command handles <a> elements that are split over multiple
> lines, and multiple <a> elements per line

See above. Being able to split up multiple things from a single chunk
of data is a common task. Perhaps grep is not the best tool, but awk,
sed and perl certainly are more than capable.


> - Your grep command handles other navigation elements, such as <frame>
> or <iframe>

Two-step it. Slurp the site, and glob all files in the resulting tree
with find or some such.
Or: I'm sure there are some simple spidering libraries on CPAN for
libwww. I seem to recall I came across some when I wrote our internal
Perl-based link checker a few years ago.
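
The second step of the two-step approach might look like the Python sketch below. It assumes the site has already been mirrored into a local directory (for example with wget's mirroring mode) and that the page tag is a script element referencing a hypothetical "tracker.js". It walks the tree and lists the HTML files in which no tag was found.

```python
import re
from pathlib import Path

# Hypothetical page tag: a script element whose src ends in tracker.js.
TAG_RE = re.compile(
    r"<script\b[^>]*\bsrc\s*=\s*[\"'][^\"']*tracker\.js[\"']",
    re.IGNORECASE,
)


def untagged_pages(root):
    """Return relative paths of .html/.htm files under root that lack the tag."""
    root = Path(root)
    missing = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            # Slurp the whole file; undecodable bytes are replaced, not fatal.
            text = path.read_text(encoding="utf-8", errors="replace")
            if not TAG_RE.search(text):
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    import sys

    # Usage (assumed layout): python check_tags.py ./mirrored-site
    for page in untagged_pages(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(page)
```

Framed pages are covered as long as the mirroring step downloaded the frame sources, since each one ends up as its own file in the tree.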


> - Your site doesn't need a login or session cookies to be viewed.

The advice is incorrect: wget can handle both. Or use curl, which is
scarily sophisticated and powerful.
From the wget man page:
--user=user
--password=password
--keep-session-cookies

All of which do pretty much what they say (--keep-session-cookies
takes effect together with --save-cookies). :-)
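
The same two capabilities are available to a home-grown checker. The Python sketch below builds a urllib opener that keeps cookies in memory for the length of the run and answers HTTP Basic authentication challenges. The function name and the example URL are placeholders, and a site with a form-based login would additionally need a POST to its login page, which is not shown.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_opener(site_url, user, password):
    """Return (opener, cookie_jar) for fetching pages behind Basic auth."""
    jar = CookieJar()  # holds session cookies in memory while the script runs
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None means "any realm" for URLs at or below site_url.
    passwords.add_password(None, site_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),
        urllib.request.HTTPBasicAuthHandler(passwords),
    )
    return opener, jar


# Placeholder values; nothing is fetched until opener.open(url) is called.
opener, jar = make_opener("http://www.example.com/", "user", "password")
```

Every page requested through opener.open() then sends back whatever cookies earlier responses set, which is what a session-based site needs.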


> These are some conditions - not a complete list - that would keep this
> from being an optimal approach. The best alternative is to utilize a
> product that can handle all those conditions, regardless of the site's
> operating system, web server, content creation solution or
> methodology, or web analytics vendor.

Sure. No real disagreement. But there are also heaps of trivial
solutions that could be used to make life easier and avoid many of
these issues.


Aside: The big plus that Unix (as a collective) has over many other
systems is the most amazing array of simple but highly powerful tools
that can be easily glued together to do tasks that would be several
days of effort in any programming language.


Maybe wget and grep won't do it. But a combo of wget, find, sed, tidy,
egrep, uniq, sort and wc may. It may not be perfect, but it may get
you easily 70% of the way there. And 70% is a huge improvement on 0%.


The flip side is that if the problems raised don't exist for a
simple site, then they are not problems, and Tim's suggestion still
stands. And if you ain't got the money, you ain't got the money. :-)


I'm really quite tempted to accept the thrown gauntlet just to truly
satisfy my own burning curiosity as to how hard or easy the problem
actually is, vs. making an educated guess. Sounds like an interesting
challenge to burn a few hours or so. And it's been a few months since
I've done any serious Perl hacking. Hmmmm. And I could finally have a
decent excuse to try out the Perl pthread libraries. Have had lots of
fun with those in C programs I've written. Hmmmmmmmmmmm....


It's not too unreasonable to assume that JavaScript page tagging will
become more sophisticated in Open Source analysis packages in the
future. AWStats has a very simple one already. And with that
increasing sophistication, a matching solution to verify the tagging
becomes necessary too. A solution will follow as night surely
follows day.


I would argue not so much about how hard or easy the problem is, but
rather about the additional value that Maxamine brings to solving
the problems. Technical solutions are achievable; value add is harder.
Marketing and sustaining that value add is something else again.


>
> Debbie Pascoe
> MAXAMINE, Inc.
>
> >
> > If you are UNIX-literate (or have such folks available to you), a simple
> > 'wget/grep' command should be enough.
> >


Cheers!


- Steve, Unix Guru.

Tho I believe my actual position title, for what little meaning or
even relevance a position title holds, is: "Senior Unix Systems
Administrator".

Guru sums it up nicely and clears away the clutter. :-)






Fri Sep 22, 2006 5:27 am

nuilvows
