Great idea! Google *should* open their index!

July 15, 2010
Raised bridge on the Chicago River by Art Hill

tl;dr: Serving dozens (hundreds?) of crawlers is expensive. We could use an open index. Google’s?

Just read Tom Foremski’s “Google Exec Says It’s A Good Idea: Open The Index And Speed Up The Internet” article. And I have to say, it’s a great idea!

I don’t have hard numbers handy, but I would estimate close to 50% of our web server CPU resources (and related data access layers) go to serving crawler robots. Stop and think about that for a minute. SmugMug is a Top 300 website with tens of millions of visitors, more than half a billion page views, and billions of HTTP / AJAX requests (we’re very dynamic) each month. As measured by both Google and Alexa, we’re extremely fast (faster than 84% of sites) despite being very media heavy. We invest heavily in performance.

And maybe 50% of that is wasted on crawler robots. We have billions of ‘unique’ URLs since we have galleries, timelines, keywords, feeds, etc. Tons of ways to slice and dice our data. Every second of every day, we’re being crawled by Google, Yahoo, Microsoft, etc. And those are the well-behaved robots. The startups who think nothing of just hammering us with crazy requests all day long are even worse. And if you think about it, the robots are much harder to optimize for – they’re crawling the long tail, which totally annihilates your caching layers. Humans are much easier to predict and optimize for.
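To make the long-tail point concrete, here’s a rough sketch in Python. Everything in it is an assumption for illustration, not our real traffic or our real caching stack: a toy LRU cache, “human” requests that follow a 1/rank popularity curve, and a “crawler” that sweeps through URLs one by one and never asks for the same one twice.

```python
import random
from collections import OrderedDict


def simulate(requests, capacity):
    """Run requests through a toy LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for url in requests:
        total += 1
        if url in cache:
            hits += 1
            cache.move_to_end(url)  # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


def human_requests(n, num_urls, rng):
    """Assumption: humans favor popular pages, with weight 1/rank."""
    urls = list(range(num_urls))
    weights = [1.0 / (rank + 1) for rank in urls]
    return rng.choices(urls, weights=weights, k=n)


def crawler_requests(n):
    """Assumption: the crawler walks the long tail, one new URL per request."""
    return range(n)


def compare_hit_rates(capacity=1000, num_urls=100_000, n=50_000, seed=42):
    rng = random.Random(seed)
    human = simulate(human_requests(n, num_urls, rng), capacity)
    crawler = simulate(crawler_requests(n), capacity)
    return human, crawler


if __name__ == "__main__":
    human, crawler = compare_hit_rates()
    print(f"human hit rate:   {human:.2f}")
    print(f"crawler hit rate: {crawler:.2f}")
```

In this toy model the crawler never gets a cache hit, because every request is for a URL it hasn’t touched before, while the human traffic keeps landing on the same popular pages. Real crawlers do revisit pages, so the real picture is less extreme, but the shape of the problem is the same.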

Worst part about the whole thing, though? We’re serving the exact same data to Google. And to Yahoo. And to Microsoft. And to Billy Bob’s Startup. You get the idea. For every new crawler, our costs go up.

We spend significant effort attempting to serve the robots quickly and well, but the duplicated effort is getting pretty insane. I wouldn’t be surprised if that was part of the reason Facebook revised their robots.txt policy, and I wouldn’t be surprised to see us do something similar in the near future, which would allow us to devote our resources to the crawlers that really matter.
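For the curious, here’s roughly what an allowlist-style robots.txt could look like, checked with Python’s standard `urllib.robotparser`. This is a hypothetical policy I made up for illustration. It is not Facebook’s actual file, and it is not something we’ve deployed. The `example.com` URL and the “BillyBobBot” user agent are placeholders too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist policy: a couple of named crawlers may fetch
# everything, and every other robot is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    url = "https://example.com/gallery/123"  # placeholder URL
    for agent in ("Googlebot", "bingbot", "BillyBobBot"):
        print(agent, parser.can_fetch(agent, url))
```

Keep in mind robots.txt is only a request. The well-behaved crawlers honor it, and the ones hammering us all day long may not.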

Anyway, if a vote were held to decide whether the world needs an open-to-all index, rather than all this duplicated crawling, I’d vote YES! And SmugMug would get even faster than it is today.

On a totally separate, but sorta related issue, Google shouldn’t have to do anything at all to their algorithms. Danny Sullivan has some absolutely brilliant satire on that subject.
