On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
> I checked with one of our Unix experts, who says that you would be
> able to use wget and grep to download a page and look for page tags
> and links and check the page tagging integrity, assuming that:
Well not quite... :-)
> - Your page tags are always in the same format (i.e. the code is not
> split across lines differently)
Do it in Perl, and slurp the entire file in as a "single line".
Regexes that ignore line splits are trivial to write. Common even. A
"while loop" to cycle thru the matches is a fairly common construct.
:-)
Even multi-megabyte HTML files would be easy this way. And Perl is
*designed* to do this.
Or run it through tidy first.
Or something similar. Plenty of cool libraries on CPAN to check out.
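To make the slurp idea concrete, here is a rough sketch. It is in Python rather than Perl, but the idea is the same: read the whole page as one string and loop over the regex matches. The tag pattern (a script element loading "tag.js") and the function names are made up for illustration; substitute whatever your page tag really looks like.

```python
import re

# Hypothetical page tag: a <script> element whose src ends in "tag.js".
# [^>] and \s both match newlines, so a tag split across lines still matches.
TAG_RE = re.compile(
    r"<script\b[^>]*\bsrc\s*=\s*[\"'][^\"']*tag\.js[\"'][^>]*>",
    re.IGNORECASE,
)

def find_page_tags(html):
    """Return every page-tag match in a page held as one string."""
    return [m.group(0) for m in TAG_RE.finditer(html)]

def find_page_tags_in_file(path):
    # Slurp the entire file in one read instead of going line by line.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_page_tags(fh.read())
```

A page with no matches gives back an empty list, which is the "tag missing" case you would report.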
> - Your site only uses plain links, no JavaScript menus or onClick links
Should be very easy to handle those. *Something* can interpret them
programmatically (i.e. a web browser), so reversing that wouldn't be hard at
all. It's not like these are undocumented standards. Look for the
pattern(s). Match them.
If you wanted to get super sophisticated, you could grab the
javascript code libraries from the firefox codebase and use those to
handle any and all javascript and html translation issues.
> - You can navigate your site without using flash
Never looked inside Flash myself, so you could be very correct. :-)
Tho the GNU Flash player (Gnash) may have appropriate libs that could
be borrowed or interfaced to, to do this. It's a lot harder, but still
achievable. The joy of Open Source: you don't have to reinvent the
wheel.
> - You can confidently assume that all of the other JavaScript in your
> pages will not cause the page tagging JavaScript to fail
Well that's just normal debugging. But point taken and agreed.
Additional: Your acceptance testing should be picking this up.
Key word: "Should". :-)
> - Your site doesn't use <base> html elements
If a spidering tool can't handle the base tag, or any legitimate HTML
tag, then the tool is broken. Submit bug report, get fix overnight.
Problem solved. :-)
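For what it's worth, honouring the base tag is not much work. A minimal Python sketch, assuming the page is already in memory as a string; the class and function names are my own invention, not from any particular spidering tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class BaseAwareLinks(HTMLParser):
    """Collect <a href> targets, resolved against <base href> if present."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url      # default base is the page's own URL
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and not self.base_seen:
            # Only the first <base href> counts; it may itself be relative.
            self.base = urljoin(self.base, attrs["href"])
            self.base_seen = True
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def links(self):
        # Resolve at the end, once the base is known.
        return [urljoin(self.base, h) for h in self.hrefs]

def resolve_links(page_url, html):
    parser = BaseAwareLinks(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()
```

Without a base element the links simply resolve against the URL the page was fetched from.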
> - Your grep command handles <a> elements that are split over multiple
> lines, and multiple <a> elements per line
See above. Being able to split up multiple things from a single chunk
of data is a common task. Perhaps grep is not the best tool, but awk,
sed and perl certainly are more than capable.
> - Your grep command handles other navigation elements, such as <frame>
> or <iframe>
Two step it. Slurp the site, and glob all files in the resulting tree
with find or some such.
Or: I'm sure there's some simple spidering libraries in CPAN for
libwww. I seem to recall I came across some when I wrote our internal
Perl based link checker a few years ago.
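As a rough illustration of how little is involved, here is a Python sketch that pulls navigation targets out of a page with a real HTML parser instead of grep. It copes with several <a> elements on one line, elements split over lines, and <frame>/<iframe>. The attribute table and names are my own assumptions, and it is not a complete list of everything that can navigate.

```python
from html.parser import HTMLParser

# Which attribute carries the navigation target for each element.
NAV_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}

class NavTargets(HTMLParser):
    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        wanted = NAV_ATTRS.get(tag)   # the parser lowercases tag names
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.targets.append((tag, value))

def nav_targets(html):
    """Return (element, target) pairs in document order."""
    parser = NavTargets()
    parser.feed(html)
    parser.close()
    return parser.targets
```

Feed each target back into the fetch queue and you have the core of a spider.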
> - Your site doesn't need a login or session cookies to be viewed.
That advice is incorrect: wget can handle both. Or use curl which is
scarily sophisticated and powerful.
From the wget man page:
    --user=user
    --password=password
    --keep-session-cookies
All of which do pretty much what they say. :-) (The last one works
together with --save-cookies, so that session cookies get saved too.)
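If you end up scripting the fetch yourself rather than using wget, the same two features exist in most HTTP libraries. A minimal Python sketch using only the standard library, assuming the site uses plain HTTP authentication (a form-based login would need a POST instead); the function name is made up:

```python
import http.cookiejar
import urllib.request

def make_opener(top_url, user, password):
    """Build an opener that sends HTTP auth and remembers cookies."""
    jar = http.cookiejar.CookieJar()
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None means "any realm" for URLs under top_url.
    passwords.add_password(None, top_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(passwords),
        urllib.request.HTTPCookieProcessor(jar),
    )
    return opener, jar
```

Every page fetched with opener.open(url) then shares the same cookie jar, session cookies included.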
> These are some conditions - not a complete list - that would keep this
> from being an optimal approach. The best alternative is to utilize a
> product that can handle all those conditions, regardless of the site's
> operating system, web server, content creation solution or
> methodology, or web analytics vendor.
Sure. No real disagreement. But there's also heaps of trivial
solutions that could be used to make life easier too and avoid many of
these issues.
Aside: The big plus that Unix (as a collective) has over many other
systems is the most amazing array of simple but highly powerful tools
that can be easily glued together to do tasks that would be several
days of effort in any programming language.
Maybe wget and grep won't do it. But a combo of wget, find, sed, tidy,
egrep, uniq, sort and wc may. It may not be perfect, but it may get
you easily 70% of the way there. And 70% is a huge improvement on 0%.
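To give a feel for the glue involved, here is the "find, grep, sort, wc" half of that pipeline as a short Python sketch. It assumes the site has already been mirrored to a local directory (with wget, say) and that a page counts as tagged if it contains a marker string; both the marker and the function name are hypothetical.

```python
import os

def untagged_pages(root, marker, suffixes=(".html", ".htm")):
    """Return sorted paths of HTML files under root that lack marker."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue   # skip images, stylesheets and so on
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                if marker not in fh.read():
                    missing.append(path)
    return sorted(missing)
```

The length of the returned list is the "wc -l" number: how many pages still need a tag.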
The flip side is that if the problems raised are nonexistent for a
simple site, then they are not problems, and Tim's suggestion still
stands. And if you ain't got the money, you ain't got the money. :-)
I'm really quite tempted to accept the thrown gauntlet just to truly
satisfy my own burning curiosity as to how hard or easy the problem
actually is. Vs making an educated guess. Sounds like an interesting
challenge to burn a few hours or so. And it's been a few months since
I've done any serious Perl hacking. Hmmmm. And I could finally have a
decent excuse to try out the Perl thread libraries. Have had lots of
fun with those in C programs I've written. Hmmmmmmmmmmm....
It's not too unreasonable to assume that JavaScript page tagging will
become more sophisticated in Open Source analysis packages in the
future. AWStats has a very simple one already. And with that
increasing sophistication, the need for a matching solution to verify
same becomes necessary too. A solution will follow as night surely
follows day.
I would argue less about how hard or easy the problem is, and more
about the additional value that Maxamine brings to solving
the problems. Technical solutions are achievable; value add is harder.
Marketing and sustaining that value add is something else again.
>
> Debbie Pascoe
> MAXAMINE, Inc.
>
> >
> > If you are UNIX-literate (or have such folks available to you), a simple
> > 'wget/grep' command should be enough.
> >
Cheers!
- Steve, Unix Guru.
Tho I believe my actual position title, for what little meaning or
even relevance a position title holds, is: "Senior Unix Systems
Administrator".
Guru sums it up nicely and clears away the clutter. :-)