Search Engine Optimization and Marketing for E-commerce

Auto-Submitting Sitemaps to Google...Necessary?

by Andrew Kagan 1. May 2009 10:13

Google's Webmaster Tools provides webmasters with a way to upload XML sitemaps to improve the accuracy of Google's index. Registering and maintaining an accurate sitemap (Google, Yahoo, and Microsoft all accept sitemap data) is important to proper indexing of your website pages, and Google provides two methods for notifying them when the sitemap is updated: manually through the Google website, and "semi-automatically" by sending an HTTP request that signals Google to reload the sitemap.

Ping me when you're ready

The second method can be automated through server-side scripting, so that when content on a website or blog is updated, the sitemap file is updated as well, and the update request is sent to Google at the same time. In theory, this should provide rapid updating of Google's index to include the latest content on your website.
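As a minimal sketch of that server-side automation, assuming Python and Google's documented sitemap ping URL (the sitemap address shown is a placeholder):

```python
# Sketch: ask Google to reload a sitemap via its documented "ping" URL.
# The endpoint and "sitemap" parameter follow Google's published ping
# format; the example.com sitemap address is a placeholder.
from urllib.parse import urlencode
from urllib.request import urlopen

GOOGLE_PING = "http://www.google.com/ping"

def build_ping_url(sitemap_url: str) -> str:
    """Return the GET URL that asks Google to reload the sitemap."""
    return GOOGLE_PING + "?" + urlencode({"sitemap": sitemap_url})

def ping_google(sitemap_url: str) -> int:
    """Send the ping; an HTTP 200 means the request was accepted."""
    with urlopen(build_ping_url(sitemap_url)) as resp:
        return resp.status

if __name__ == "__main__":
    # Call this right after regenerating the sitemap file.
    print(build_ping_url("http://www.example.com/sitemap.xml"))
```

The same pattern works for Yahoo and Microsoft, which expose similar ping endpoints; only the base URL changes.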

Depending on a number of factors, Google will automatically reload your sitemap file without your specifically requesting it to do so. One factor is the content of the sitemap itself. Besides a list of URLs on your website, the sitemap file can also hold information about the date each URL was last updated and how frequently it changes. For example, if your homepage content changes every day, you can assign a frequency of "daily" to that URL, telling the search engine it should check that page every day.
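A minimal sitemap entry carrying that information looks like the following (the URL, date, and priority are placeholders; the elements come from the sitemaps.org protocol):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-05-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
```

Only `<loc>` is required; `<lastmod>`, `<changefreq>`, and `<priority>` are optional hints that the search engines may weigh but are not obliged to follow.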

It should be noted that incorrect use (or "abuse") of a sitemap, such as indicating pages are new when the content hasn't changed, can cause problems if the search engine recrawls the page too many times without seeing any new data. Empirical data have shown that pages may be dropped from the search engine index under this scenario, and new pages added to this "unreliable" sitemap may be ignored or crawled more slowly.

It's a popularity contest

Another factor in sitemap reloading is link popularity. If a lot of websites are linking to particular pages on your website, search engine spiders will crawl those pages more often, and if the site is large, the sitemap will help prioritize which pages are crawled first.

To Submit, or Not to Submit...

We have seen that once a sitemap is submitted and indexed by search engines, they will regularly come back and reload the sitemap looking for new URLs, whether you re-submit it or not. As your website's PageRank (on Google) and general link popularity grow, the sitemap will be reloaded more frequently without your taking any action at all. So do you need to submit it manually or automatically?

The answer is "it depends." Google itself warns webmasters not to resubmit sitemaps more than once per hour, probably because that's as fast as it will process the changes and redirect Googlebot to the URLs in the sitemap. If you auto-submit sitemaps more often than once an hour, the "punishment" could range from the search engine ignoring the subsequent re-submits to something more dire...but no one really knows the consequences. It would probably be safer to resubmit sitemaps on a regular schedule, but we do not have any hard data on this at this time.
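One way for an auto-submit script to respect that once-per-hour limit is a simple throttle. This is a hypothetical sketch (the class and method names are our own, not anything Google provides):

```python
# Hypothetical throttle: only re-ping the search engine if at least an
# hour has passed since the last recorded ping.
import time

MIN_INTERVAL = 3600  # seconds; Google asks for at most one resubmit/hour

class SitemapPinger:
    def __init__(self):
        self.last_ping = 0.0  # epoch seconds of the last ping sent

    def should_ping(self, now=None):
        """True if enough time has elapsed to send another ping."""
        now = time.time() if now is None else now
        return now - self.last_ping >= MIN_INTERVAL

    def record_ping(self, now=None):
        """Remember when we last pinged, so later calls are throttled."""
        self.last_ping = time.time() if now is None else now
```

Hook `should_ping()` into whatever code regenerates the sitemap: regenerate as often as you like, but only notify the search engine when the throttle allows it.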

When you should re-submit a sitemap

So when should you re-submit a sitemap? The obvious answer is whenever your content changes, but not more than once an hour. Google does not yet provide an API to query when it last loaded your sitemap, although you can see this data in its Webmaster Tools. If you have some very timely news that the search engine really needs to know about, then resubmit the sitemap: it may not increase the crawl rate, but it may affect which URLs are crawled first.

The bottom line is that sitemaps are becoming increasingly important to search engines in prioritizing the content they crawl. So use them, don't abuse them, and help make the internet a better place!



Mission Control, we have liftoff!

by Andrew Kagan 29. April 2009 04:57

Launching the website was an interesting experiment in measuring Google's crawl rate. The domain had been parked at a registrar for nearly a year, so Googlebot and other crawlers would have known about it but would not have found any content. This may have been a negative factor in the subsequent crawl rate.

Before launching the website, all the appropriate actions were taken to ensure rapid crawling and indexing:


  • Creation of all relevant pages, with informational pages of high quality and narrow focus
  • Implementation of appropriate META data
  • Validation of all links and HTML markup
  • Implementation of crawler support files such as robots.txt and an XML sitemap 
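For the crawler-support step, a minimal robots.txt that admits all crawlers and advertises the sitemap might look like this (the sitemap URL is a placeholder):

```
# robots.txt -- allow all crawlers and point them at the sitemap
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```

The `Sitemap:` directive is supported by Google, Yahoo, and Microsoft, so crawlers can discover the sitemap even before it is registered by hand.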

Finally, a sitemap was registered with Google and the site was brought online...and then the waiting began:
  • It took more than two days (approx. 57 hours) after registering the sitemap for Google to actually parse it. Google found no errors.
  • It took three more days after parsing the sitemap for Googlebot to actually crawl the site. 
  • More than 24 hours after crawling the site, Google had added only three pages to its index.
It seems that the days of "launch today, indexed tomorrow" are past. Even for a website published according to Google's best practices, Google appears somewhat overwhelmed at this point, and crawl rates for new sites are being delayed.

Two unknowns:
  • Does leaving a domain parked for a long time negatively impact the initial crawl rate?
  • Does the TLD -- "COM", "NET", "PRO" -- affect the crawl rate? Does Google give precedence to well-regarded TLDs over new/marginal TLDs?

I will be testing these hypotheses with additional sites in the near future.



