The 18th annual International World Wide Web Conference, WWW 2009, was held this past April 20-24 in Madrid, Spain. It has become the premier venue for publishing research and development on the evolution of our favorite medium.
A fascinating (if you're a web geek) paper called "Sitemaps: Above and Beyond the Crawl of Duty" was presented by Uri Schonfeld of UCLA and Narayanan Shivakumar of Google. The main thrust of the paper was that the traditional web crawlers employed by search engines are becoming overwhelmed by the number of new websites and pages appearing daily on the web; by one count, there are more than 3 trillion (!) pages that need to be indexed, deduplicated, and tracked for inbound and outbound links.
The Sitemaps protocol is becoming more and more important to search engines as they try to prioritize and filter this mound of information. Part of the problem is the rise of large-scale content management systems, which dynamically generate pages whether or not there is any real content to put in them. The authors used the example of Amazon.com, where any given product can have dozens of subsidiary pages, such as reader reviews, excerpts, images, and specifications. Even when there is nothing to show, the link to a dynamically generated page still returns an empty page, creating literally tens of millions of unique URLs at amazon.com that "dumb" crawlers must follow and index.
The Sitemaps protocol defines an XML file format that not only lists all the URLs on a site that should be indexed, but also tells search engines how important each page is, how often it is updated, and when it was last updated. A search engine can use this file to rapidly index the important content and skip what isn't there, improving both the accuracy and the speed of indexing a website. Every site should have a sitemap, but as of October 2008 it was estimated that only about 35 million sitemaps had been published, out of the billions of URLs on the web.
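To make the format concrete, here is a minimal sketch (in Python, using only the standard library) that writes a two-entry sitemap. The URLs, dates, and priorities are placeholders for illustration, but the element names and namespace come straight from the sitemaps.org specification.

    # Minimal sketch: build a two-entry sitemap with Python's standard library.
    # The URLs, dates, change frequencies, and priorities are placeholders.
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    pages = [
        {"loc": "http://www.example.com/", "lastmod": "2009-04-20",
         "changefreq": "daily", "priority": "1.0"},
        {"loc": "http://www.example.com/about.html", "lastmod": "2009-01-15",
         "changefreq": "yearly", "priority": "0.3"},
    ]

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag, value in page.items():
            ET.SubElement(url, tag).text = value

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8",
                                 xml_declaration=True)

The resulting sitemap.xml is the file a crawler reads first, and the changefreq and priority hints are what let it decide which pages to revisit often and which to leave alone.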
Amazon makes a concerted effort to publish accurate sitemap data, as it dramatically reduces the time required to index new content. Even so, Amazon's robots.txt file lists more than 10,000 sitemap files, each holding between 20,000 and 50,000 URLs, for a total of more than 20 million URLs on amazon.com alone! The authors note that there is still plenty of content duplication and many null-content pages in there, but the number is staggeringly large. After monitoring URLs on another website, they also found that crawlers using sitemaps picked up new content significantly faster than crawlers relying on the simple "discovery" method that applies when there is no sitemap file.
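The protocol caps a single sitemap file at 50,000 URLs, which is why a site the size of Amazon ends up splitting its listings across thousands of files and pointing to them from robots.txt (or from a sitemap index file). If you're curious what a large site publishes, a quick sketch like the one below simply fetches a robots.txt file and prints its Sitemap: lines; the hostname is a placeholder, so swap in whichever site you want to inspect.

    # Sketch: list the sitemap files a site advertises in its robots.txt.
    # The hostname passed at the bottom is a placeholder for illustration.
    import urllib.request

    def sitemap_urls(host):
        robots = urllib.request.urlopen("http://%s/robots.txt" % host)
        text = robots.read().decode("utf-8", "replace")
        # The Sitemap: directive may appear any number of times in the file.
        return [line.split(":", 1)[1].strip()
                for line in text.splitlines()
                if line.lower().startswith("sitemap:")]

    for sitemap in sitemap_urls("www.example.com"):
        print(sitemap)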
We said before that every website should have a properly constructed sitemap, as it improves the quality and accuracy of search engines as a whole. Beyond creating the sitemap, registering it with the major search engines gives the webmaster valuable feedback on crawl and index rates, along with insight into what the search engine "sees" when it looks at your website. Please create a sitemap for your website today, or just ask us if you need help!
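If you want to automate that last step, the major engines have historically accepted a simple HTTP "ping" carrying the sitemap's address (their webmaster-tools dashboards give far richer feedback). The sketch below shows the general shape of such a ping; both the endpoint and the sitemap URL are assumptions for illustration, so check each engine's webmaster documentation for its current submission method.

    # Sketch: notify a search engine that a sitemap has been published or updated.
    # Both the ping endpoint and the sitemap URL are illustrative assumptions;
    # consult each engine's webmaster documentation for the supported method.
    import urllib.parse
    import urllib.request

    def ping_search_engine(ping_base, sitemap_url):
        query = urllib.parse.urlencode({"sitemap": sitemap_url})
        with urllib.request.urlopen("%s?%s" % (ping_base, query)) as response:
            return response.status  # 200 means the ping was received

    status = ping_search_engine("http://www.google.com/ping",
                                "http://www.example.com/sitemap.xml")
    print("Ping returned HTTP", status)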