by Andrew Kagan
30. April 2009 08:59
Following up on yesterday's post about Googlebot and crawlers, I see that Googlebot is coming back to read the sitemap on a regular basis without my needing to resubmit it, probably based on the update-frequency (changefreq) parameter specified in the sitemap. That's a good thing, I hope, although Google is not adding any more pages to the index yet.
Registering the sitemap with MSN/Live and Yahoo resulted in immediate crawls by the MSNbot and Slurp crawlers, which of course is a good thing...I'll report on their indexing rates in a future post.
by Andrew Kagan
29. April 2009 11:22
To follow up on my previous post: after the initial long delay before Google first downloaded the sitemap.xml for searchpartner.pro, I resubmitted the sitemap 24 hours later and Google downloaded it within minutes.
A quick review of the server logs showed Googlebot hitting the website shortly thereafter, which corresponds to the behavior reported by Adam at BlogIngenuity. Adam also reported pages quickly appearing in Google's index, but no additional pages appear to have made it in yet for searchpartner.pro. This could be attributable to the time of day as well as to how recently the website was added to Google's index. Presumably the page text is "in the hopper" and being processed (wouldn't a progress bar be a cool webmaster's tool?).
Appropriately tagging the sitemap file with date/frequency/importance data (the lastmod, changefreq and priority tags) for each URL will probably help build the site's reputation in Google's index and hopefully prioritize content indexing. We know that the better a website's reputation, the faster Google adds its pages to the index.
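For reference, here is what those per-URL tags look like in a sitemaps.org-format sitemap; the URL and values below are placeholders for illustration, not the actual searchpartner.pro entries:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- the page address -->
        <loc>http://www.example.com/services.html</loc>
        <!-- date the page last changed, in W3C datetime format -->
        <lastmod>2009-04-28</lastmod>
        <!-- expected update frequency: always, hourly, daily, weekly, monthly, yearly or never -->
        <changefreq>weekly</changefreq>
        <!-- relative importance within this site, 0.0 to 1.0 (default is 0.5) -->
        <priority>0.8</priority>
      </url>
    </urlset>

Keep in mind that changefreq and priority are treated as hints -- crawlers are free to ignore them -- so they influence rather than dictate how often a page is revisited.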
by Andrew Kagan
29. April 2009 04:57
Launching the Searchpartner.pro website was an interesting experiment in measuring Google's crawl rate. The domain had been parked at a registrar for some time -- nearly a year -- so Googlebot and other crawlers would have known about it but would not have found any content. This may have been a negative factor in the subsequent crawl rate.
Before launching the website, all the appropriate actions were taken to ensure a rapid crawl and index rate:
- Creation of all relevant pages, with informational pages of high quality and narrow focus
- Implementation of appropriate META data
- Validation of all links and HTML markup
- Implementation of crawler support files such as robots.txt and an XML sitemap
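For illustration, a minimal robots.txt for a site like this might look like the following (the domain is a placeholder); the Sitemap: line lets Google, Yahoo and MSN/Live discover the XML sitemap automatically, in addition to any manual submission:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/sitemap.xml

An empty Disallow: rule permits all compliant crawlers to fetch every page, which is the intent for a new site trying to get indexed quickly.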
Finally, the sitemap was registered with Google and the site was brought online...and then the waiting began.
- It took more than two days (approx. 57 hours) after registering the sitemap for Google to actually parse it. Google found no errors.
- It took three more days after parsing the sitemap for Googlebot to actually crawl the site.
- More than 24 hours after crawling the site, Google had added only three pages to its index.
It seems that the days of "launch today, indexed tomorrow" are in the past. Even when a website is published according to Google's best practices, Google appears to be somewhat overwhelmed at this point, and crawl rates for new sites are being delayed.
Two unknowns:
- Does leaving a domain parked for a long time negatively impact the initial crawl rate?
- Does the TLD -- "COM", "NET", "PRO" -- affect the crawl rate? Does Google give precedence to well-regarded TLDs over new/marginal TLDs?
I will be testing these questions with additional sites in the near future.