cms-vendor · Where vendors of Content Management Systems can discuss common services, standards and features.

Group Information

  • Members: 92
  • Category: Software
  • Founded: Oct 6, 2000
  • Language: English

The problem, and why the "solution" may be hard
Message #70 of 2215
Re: The problem, and why the "solution" may be hard

We've been looking at this issue a while, and our preferred option
would be to use something like ICE to allow "crawlers" to ask us
what's changed since their last check, and we can simply send them a
list of all changed data. Of course, this requires the crawlers to do
more work, but in return I'd give them nice, consistent, easy to
process XML data. Since all of our data is in a database (i.e. we can
easily know what's changed, and ship only changes), and we run ICE,
this is an extremely easy and efficient combination. The motivation
isn't just to keep load off of our servers and database, it's more
driven by a desire to provide timely and accurate information to users
even if they're on other sites.
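
A minimal sketch of the "what's changed since your last check" exchange described above, in Python. This is not the ICE protocol itself: the Item record, the changes_since and changed_ids functions, and the XML element names are all hypothetical, chosen only to illustrate the idea of shipping a small, consistent list of changes instead of being re-crawled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from xml.etree import ElementTree as ET


@dataclass
class Item:
    # Hypothetical record; stands in for a row in the content database.
    item_id: str
    title: str
    modified: datetime  # last time the row changed


def changes_since(items, last_check):
    """Publisher side: XML listing only items modified after last_check."""
    root = ET.Element("changes", {"since": last_check.isoformat()})
    changed = sorted(
        (i for i in items if i.modified > last_check),
        key=lambda i: i.modified,
    )
    for item in changed:
        node = ET.SubElement(
            root,
            "item",
            {"id": item.item_id, "modified": item.modified.isoformat()},
        )
        node.text = item.title
    return ET.tostring(root, encoding="unicode")


def changed_ids(xml_text):
    """Crawler side: parse the feed and collect the ids worth refetching."""
    return [node.get("id") for node in ET.fromstring(xml_text).iter("item")]


def utc(year, month, day):
    """Small helper for building timezone-aware example timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc)
```

A crawler that last checked on Dec 20 would call changes_since(items, utc(2000, 12, 20)) and refetch only the ids that come back, rather than walking every page on the site.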

Given that the web spiders that we've talked to are, oddly enough, more
comfortable processing HTML than XML (least common denominator -- it's
more work, and fragile, but it works everywhere and they already know
how to do it) it looks like it would be up to us as the owner of the
content to force the issue and block all spiders (that we know of) to
force them to use the mechanism that we prefer because it places
almost no load on our systems.
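
One way the blocking step might look, sketched in Python: generate a robots.txt that shuts known spiders out of the HTML pages while leaving a change feed reachable. The agent names (ExampleBot, SampleSpider) and the /changes path are made up for illustration, and the Allow line is an extension to the original robots.txt convention that not every crawler honors. The check uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents, feed_path):
    """Block each known spider from everything except the change feed."""
    lines = [f"# Known crawlers: please poll {feed_path} instead of crawling HTML."]
    for agent in blocked_agents:
        lines += [
            f"User-agent: {agent}",
            f"Allow: {feed_path}",  # extension; listed first so it wins
            "Disallow: /",
            "",
        ]
    # Spiders we do not know about are left alone.
    lines += ["User-agent: *", "Disallow:", ""]
    return "\n".join(lines)


def is_allowed(robots_txt, agent, path):
    """Ask the standard-library parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With build_robots_txt(["ExampleBot", "SampleSpider"], "/changes"), a listed spider is refused ordinary pages but can still reach the feed, and any spider not on the list is unaffected. Only well-behaved crawlers obey robots.txt, so this is a request rather than enforcement.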

It hasn't come to the top of our lists of things to do (silly things
like "features" keep coming up) but it's on the list of things to do
some day.

--- In cms-vendor@egroups.com, "Jeff Barr" <jeff@v...> wrote:
> I've got two "bombardment" war stories on my EditThisPage
> site right now:
>
> http://jeffbarr.editthispage.com/discuss/msgReader$49
>
> I am still working to resolve both issues. The bad thing is
> that stuff like this happens. The good thing is that it is
> actually possible to find the responsible parties.
>
> Jeff;
>
> -----Original Message-----
> From: Peter Friedman [mailto:peter@c...]
> Sent: Wednesday, November 29, 2000 5:15 AM
> To: cms-vendor@egroups.com
> Subject: RE: [cms-vendor] Re: The problem, and why the "solution" may be
> hard
>
>
> Time to start putting together a 'crawler bombardment response' FAQ yet?
>
> -----Original Message-----
> From: Dave Winer [mailto:dave@u...]
> Sent: Tuesday, November 28, 2000 10:57 PM
> To: cms-vendor@egroups.com
> Subject: Re: [cms-vendor] Re: The problem, and why the "solution" may be
> hard
>
>
> We have many domains per IP address. That's what we'd like the search
> engine guys to account for. Dave
>
>
>
> To unsubscribe from this group, send an email to:
> cms-vendor-unsubscribe@egroups.com




Wed Dec 27, 2000 2:00 pm

laird.popkin@...


First I think we need to clarify exactly what the problem is that we are trying to solve. Is it "search engine spiders are causing too much load to my site", ...
Stephen Tyler
tuesday@...
Oct 8, 2000
2:27 pm

Stephen, thanks for posting this excellent summary of the issues. It's the "bursty" traffic that hurts our performance. And I *want* Googlebot to visit, and my...
Dave Winer
dave@...
Oct 8, 2000
3:02 pm

... It dawns on me that places like Userland, geocities, members.aol.com, and so on, have this problem in a special way that is qualitatively different from...
Tim Bray
tbray@...
Oct 8, 2000
8:33 pm

Since the search engine typically "announce" themselves -- would it be possible to create a server side application that "remembers" what pages a large scale...
Ben Prater
bprater@...
Oct 9, 2000
2:34 pm

From the type of content on my site, there are two scenarios: 1) Editorial content, which is relatively static, both in terms of update rate and delivery...
Laird Popkin
laird.popkin@...
Oct 10, 2000
5:47 am

Dave wrote on SN today ... and I've been thinking, shouldn't it be possible to create with a callback a copy of robots.txt that contains a list of all the...
sdevore@...
Nov 20, 2000
8:47 pm

It seems to me the problem of informing search engines of updated pages is very similar to the problem of syndicating content. Couldn't robots.txt be extended...
Gary Teter
bigdog@...
Nov 20, 2000
9:00 pm

... There's been some discussion of this on the robots mailing list, but nothing ever happened. I don't know why, it would save everyone a lot of time and...
Avi Rappoport
nets@...
Nov 21, 2000
12:42 am

Sometimes I wonder if the guys who are having their sites hit so horribly hard are the same guys who set their web servers not to provide any valid HTTP...
Tom Thomson
tthomson@...
Nov 28, 2000
10:50 pm

Time to start putting together a 'crawler bombardment response' FAQ yet? ... From: Tom Thomson [mailto:tthomson@...] Sent: Tuesday, November 28, 2000...
Peter Friedman
peter@...
Nov 29, 2000
1:15 pm

... All of the above. ... Right. But I think this is not fatal at all. What we're asking for is just pointers to the pages that should be crawled and info ...
Tim Bray
tbray@...
Oct 8, 2000
8:29 pm

Actually our problem would be solved if the robots were aware of the machine they're hitting so mercilessly. We map thousands of virtual domains to a single...
Dave Winer
dave@...
Nov 21, 2000
12:47 am

... Do you have any suggestions as to how they should figure this out? A quick spot-check of a couple editthispage.com sites shows them mapping to different IP...
Gary Teter
bigdog@...
Nov 21, 2000
6:06 pm

We have many domains per IP address. That's what we'd like the search engine guys to account for. Dave...
Dave Winer
dave@...
Nov 28, 2000
10:57 pm

Time to start putting together a 'crawler bombardment response' FAQ yet? ... From: Dave Winer [mailto:dave@...] Sent: Tuesday, November 28, 2000 10:57...
Peter Friedman
peter@...
Nov 29, 2000
1:15 pm

I've got two "bombardment" war stories on my EditThisPage site right now: http://jeffbarr.editthispage.com/discuss/msgReader$49 I am still working to resolve...
Jeff Barr
jeff@...
Dec 2, 2000
8:04 am

We've been looking at this issue a while, and our preferred option would be to use something like ICE to allow "crawlers" to ask us what's changed since their...
Laird Popkin
laird.popkin@...
Dec 27, 2000
2:00 pm

Copyright © 2010 Yahoo! Inc. All rights reserved.