
Great idea! Google *should* open their index!

July 15, 2010
Raised bridge on the Chicago River by Art Hill

tl;dr: Serving dozens (hundreds?) of crawlers is expensive. We could use an open index. Google’s?

Just read Tom Foremski’s Google Exec Says It’s A Good Idea: Open The Index And Speed Up The Internet article. And I have to say, it’s a great idea!

I don’t have hard numbers handy, but I would estimate close to 50% of our web server CPU resources (and related data access layers) go to serving crawler robots. Stop and think about that for a minute. SmugMug is a Top 300 website with tens of millions of visitors, more than half a billion page views, and billions of HTTP / AJAX requests (we’re very dynamic) each month. As measured by both Google and Alexa, we’re extremely fast (faster than 84% of sites) despite being very media heavy. We invest heavily in performance.

And maybe 50% of that is wasted on crawler robots. We have billions of ‘unique’ URLs since we have galleries, timelines, keywords, feeds, etc. Tons of ways to slice and dice our data. Every second of every day, we’re being crawled by Google, Yahoo, Microsoft, etc. And those are the well-behaved robots. The startups who think nothing of just hammering us with crazy requests all day long are even worse. And if you think about it, the robots are much harder to optimize for – they’re crawling the long tail, which totally annihilates your caching layers. Humans are much easier to predict and optimize for.

Worst part about the whole thing, though? We’re serving the exact same data to Google. And to Yahoo. And to Microsoft. And to Billy Bob’s Startup. You get the idea. For every new crawler, our costs go up.

We spend significant effort attempting to serve the robots quickly and well, but the duplicated effort is getting pretty insane. I wouldn’t be surprised if that was part of the reason Facebook revised their robots.txt policy, and I wouldn’t be surprised to see us do something similar in the near future, which would allow us to devote our resources to the crawlers that really matter.
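For what it's worth, that kind of policy is expressible in plain robots.txt: allow a short list of crawlers and shut everyone else out. A hypothetical sketch (the user-agent tokens are the ones Google, Yahoo, and Microsoft crawlers used around this time; which crawlers make the cut is purely illustrative):

```
# Well-behaved crawlers we choose to serve
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# Everyone else (Billy Bob's Startup included): stay out
User-agent: *
Disallow: /
```

An empty Disallow line means "crawl everything"; the catch-all record at the bottom only applies to robots that don't match an earlier, more specific record.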

Anyway, if a vote were held to decide whether the world needs an open-to-all index, rather than all this duplicated crawling, I’d vote YES! And SmugMug would get even faster than it is today.

On a totally separate, but sorta related issue, Google shouldn’t have to do anything at all to their algorithms. Danny Sullivan has some absolutely brilliant satire on that subject.

  1. Dan
    July 16, 2010 at 2:42 am | #1

    Maybe an open standard for self-indexing would be a first step. Rather than a robot crawling a site and pulling down all the media, the site of interest could just hand the crawler (or your open public index site) a "here are the most recent updates" file?

    • July 16, 2010 at 1:26 pm | #2

      The big problem with this is trust. Most sites, given a chance to feed robots data that would enhance their rankings, would take it. That’s the world we live in. There’s value in having the crawler actually do the work and validate that it’s the same or at least similar to what end-users will experience.

      • August 2, 2010 at 2:37 pm | #3

        Not really. Even if spammers were a problem, the search engine would still build a graph, compute rank, etc. It doesn't have to use the documents that are sent flat out by the origin site.

        Even if your goal were to force pages on Google, they could truncate you if they don't think your domain is that valuable.

  2. July 16, 2010 at 5:19 am | #4

    Don – why doesn’t SmugMug just limit the crawlers to the top 10 (random number), and then prohibit crawling from Billy Bob’s Startup et al.?

    Eliminate the source(s) of the problem. If they ask why, tell them to play nice and maybe you'll let them back in.

    • July 16, 2010 at 1:25 pm | #5

      It's something we're seriously considering. But even limiting it to 10 crawlers means we're doing 10X the work a single shared crawl would need. An entire order of magnitude. That's not trivial.

      Doing a single crawl that’s shared would make a lot more sense from our point of view (I realize the crawlers probably have a different one… :)

      • August 2, 2010 at 2:38 pm | #6

        It's a bad idea and would have a harsh chilling effect on the search industry. Would every new search startup have to go out on the web and beg hundreds or thousands of websites for permission?

        It would mean the death of the search industry.

        When Google and Yahoo started crawling, they just had to beg for forgiveness. Newcomers would have to beg for permission.

  3. Steven Roussey
    July 16, 2010 at 9:44 am | #7

    Or perhaps a way for a site to export its index deltas, something beyond Sitemaps. Oh, and don't forget the AdSense bots: they have to be up to date immediately, so they will always add a large load to a system as well.

    • July 16, 2010 at 1:27 pm | #8

      Deltas for a site like ours are very difficult to produce, but it's probably easier than supporting a hojillion crawlers. Wonder how that'd work…

      • Cabbey
        July 16, 2010 at 4:56 pm | #9

        Could just add a changedSince argument to the sitemap request, and then only report things that have changed since then in the map. Just need to get everyone to agree on the details. :)
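As a sketch of how that might look server-side, assuming a hypothetical changedSince parameter (the URLs, dates, and delta_sitemap name are made up for illustration):

```python
from datetime import datetime, timezone

# Toy in-memory "site": URL path -> last-modified timestamp.
PAGES = {
    "/gallery/1": datetime(2010, 7, 1, tzinfo=timezone.utc),
    "/gallery/2": datetime(2010, 7, 14, tzinfo=timezone.utc),
    "/keyword/sunset": datetime(2010, 6, 20, tzinfo=timezone.utc),
}

def delta_sitemap(changed_since):
    """Return sitemap <url> entries only for pages modified after changed_since."""
    entries = []
    for path, modified in sorted(PAGES.items()):
        if modified > changed_since:
            entries.append(
                "<url><loc>http://example.com%s</loc>"
                "<lastmod>%s</lastmod></url>" % (path, modified.date().isoformat())
            )
    return entries

# A crawler that last visited on July 10 gets only the one page that changed.
print(delta_sitemap(datetime(2010, 7, 10, tzinfo=timezone.utc)))
```

A crawler passing its last visit time would download a handful of entries instead of re-fetching the whole map.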

      • August 2, 2010 at 2:40 pm | #10

        This is what you want: every time there's a mutation, append it to a write-ahead log and publish that with something like Sitemaps.

        This protocol would be VERY easy to stream, cache, etc.

        The BIGGEST problem with serving robots is the random hits: they miss the cache, so you end up regenerating page content and burning CPU.
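A minimal sketch of that write-ahead-log idea, with a made-up ChangeLog class (the record format and names are illustrative, not a proposed standard):

```python
import json

class ChangeLog:
    """Append-only log of site mutations, in the spirit of a write-ahead log.

    Each mutation gets a monotonically increasing sequence number, so a
    crawler can ask for "everything after seq N" and the response is a pure
    suffix of the log.
    """

    def __init__(self):
        self.entries = []

    def append(self, url, action):
        seq = len(self.entries) + 1
        self.entries.append({"seq": seq, "url": url, "action": action})
        return seq

    def since(self, seq):
        # A suffix of an append-only log never changes once written, so
        # this response can be cached or even served as a static file.
        return self.entries[seq:]

log = ChangeLog()
log.append("/gallery/1", "created")
log.append("/gallery/1", "updated")
log.append("/gallery/2", "created")
print(json.dumps(log.since(1)))  # everything after seq 1
```

Because old entries are immutable, the "give me everything since N" responses are exactly the kind of thing that streams and caches well.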

  4. July 16, 2010 at 3:47 pm | #11

    Thanks for your support for this idea, Don. I've mentioned it before at various times, but sometimes it takes time for ideas to resonate with others. I'm glad I'm not off-base and that others see this as a significant problem.

    Also, it's a problem with a fairly easy fix, plus the benefits are a faster Internet for all without any new infrastructure. And zero carbon costs for a faster Internet.

  5. John Friend
    July 16, 2010 at 5:11 pm | #12

    It’s an interesting idea and I can certainly see the issue from your side of the fence. But, from Google’s perspective, what do they have to gain by doing this?

    And, are Microsoft and Yahoo actually going to take Google’s index? No way. They’ll each claim they have some of their own secret sauce in digesting the pages that they can’t afford to give up.

    The startups would love to take Google’s index and we’d get an extra amount of innovation on using the index because the huge hurdle of building a useful index before you can add value on top of it would be removed. But, again I’m not sure how that helps Google.

    • July 17, 2010 at 8:39 am | #13

      The secret sauce is in interpreting the data, not crawling it.

      How does it help Google? I don’t know, that’s up to Google to decide, but the obvious one is to simply charge for access to the index. Even simply breaking even on their crawler cost has got to be beneficial, let alone profiting from it.

      I’m positive if there was a will, Google could find a benefit. Luckily, Google often does things for benefits other than profits, so maybe this will resonate with one of those other motivations.

      • August 2, 2010 at 2:44 pm | #14

        Why just Google? We'd love to charge other companies money to access our index.

        Of course, that's our business model :)

    • August 2, 2010 at 2:43 pm | #15

      Also, let’s imagine you’re a new startup … you’re going head to head to compete with Google. Are you going to just blindly trust that they won’t yank the rug out from under you at some point in the future?

      While shared crawling can be interesting, so can making crawling more efficient.

      If it were hyper-efficient to serve crawlers, having Google host the whole thing wouldn't be part of the discussion.

      Sitemaps and protocols like PubSubHubbub are helping solve this problem.

  6. July 16, 2010 at 9:43 pm | #16

    Writing a great crawler is still an art, and Google seems to be the only one that has perfected it. It's a competitive advantage for them. Why would they give away that edge to others on a platter?

    • July 17, 2010 at 8:41 am | #17

      It sure ain't perfect with us. Their crawler frequently crawls "useless" stuff and ignores the "good" stuff. Part of this is our fault, for not providing them with better Sitemaps, etc., but if I have to do the work, then clearly the crawler itself isn't an "art".

      I haven't detected any extra brilliance in Google's crawler. Their brilliance seems to be in interpreting the data and providing the results, which they excel at.

      • July 22, 2010 at 4:24 am | #18

        I think the responsibility for efficiency and relevance lies with the bot and with your CMS, not with the publisher.

        My proposal (detailed in my blog) is this: run your site in the cloud with a vendor-provided CMS, so the cloud vendor guarantees it won't play SEO tricks. This custom CMS is able to serve bots just the "good" stuff when it's updated, but it only serves a single bot running in the cloud, which updates a "trusted" mirror of your site. Search engines index that mirror as they see fit (hopefully ignoring the "useless" stuff), and bandwidth's on you if they are big players.

        Larger shops roll their own "bot-friendly" CMS using a common API and document schemas, and sign agreements ensuring its trustworthiness.

        This way you save on bandwidth and CPU, search engines avoid duplicate crawling on irrelevant changes, and startups get a shot at indexing just the parts they want with their own algorithms, or at licensing the common index.

  7. George
    July 17, 2010 at 5:30 pm | #19

    Google is already working on changing the way it crawls the web. Projects like PubSubHubbub go a long way.

    Until then, the only thing you can do is pre-render/cache and foot the network bill.
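For reference, a PubSubHubbub publish ping is just a small form-encoded POST to the hub; a minimal sketch (the function names are made up, but hub.mode=publish and hub.url are the spec's parameters):

```python
import urllib.parse
import urllib.request

def publish_ping_body(topic_url):
    """Form-encoded body for a PubSubHubbub publish notification."""
    return urllib.parse.urlencode({"hub.mode": "publish", "hub.url": topic_url})

def notify_hub(hub_url, topic_url):
    """POST the ping to the hub. The hub then fetches the updated topic and
    pushes it to subscribers, so individual crawlers don't have to poll."""
    req = urllib.request.Request(
        hub_url, data=publish_ping_body(topic_url).encode("ascii")
    )
    return urllib.request.urlopen(req)  # hubs reply 204 No Content on success
```

The publisher only pings when something actually changes, which is the whole point: push replaces the long tail of speculative polling.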

  8. July 28, 2010 at 9:45 am | #20

    Several random ideas…

    1) Google (or other search entities) publishes hashes of sitemaps as crawled. You publish that data; crawlers can pick it up and verify. Google doesn't have to open the whole index, and you get reduced traffic.

    2) Trust-rank websites against their sitemaps. If Google says "I never caught XYZ lying," that's good enough for me.

    3) Change tracking, done similarly to a DVCS, i.e. publish SHAs for entities. If you think about it, robots.txt is primitive: no history, very difficult to do fine-grained control… It sure could do with a replacement. Of course, everybody and their brother would need to adopt it. Wonder if a push system for changes would be a good idea, too.

    Unfortunately, all of those are ultimately people-issues, not engineering issues. So I guess you’ll be stuck with your crawlers for a while. Might want to limit them by the amount of useful referrals they generate…
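Idea (3) above can be sketched as content fingerprints plus a published manifest that a crawler diffs against its last snapshot (the names and data here are illustrative):

```python
import hashlib

def fingerprint(content):
    """SHA-1 of page content, DVCS-style: same bytes -> same hash."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()

def changed_urls(published_manifest, crawler_manifest):
    """URLs whose published hash differs from what the crawler last saw."""
    return sorted(
        url for url, sha in published_manifest.items()
        if crawler_manifest.get(url) != sha
    )

# The site publishes a manifest of current fingerprints...
site = {
    "/a": fingerprint("alpha v2"),
    "/b": fingerprint("beta v1"),
}
# ...and the crawler compares it against what it saw last time.
seen = {
    "/a": fingerprint("alpha v1"),  # stale
    "/b": fingerprint("beta v1"),   # current
}
print(changed_urls(site, seen))  # only /a needs a re-crawl
```

The manifest is cheap to serve compared to the pages themselves, so the crawler only fetches what actually changed.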

  9. Kevin Olson
    July 28, 2010 at 9:56 am | #21

    From what I remember, Facebook changed their crawler policy because it exposed just how absolutely horrible their privacy (or lack thereof) practices were. Instead of actually closing privacy/security holes, they’re going the “security through obscurity” route.

Comments are closed.