As a developer at redsauce.com, I am always looking for the most efficient way of doing things, whether it be a PHP script or a database schema. While working on a truly huge site recently, I found that Google’s crawling was causing performance issues on the server hosting the site. If Google doesn’t already know what pages your site contains (i.e. if it’s a new site), the Googlebot has to follow every link on your site, possibly visiting certain pages several times.
Of course this is inefficient, but we can solve the problem by helping Google and other search engines find our content with an XML sitemap. An XML sitemap can tell a search engine more about a page, such as when it was last modified, how important the page is and how often it is updated. With this information, search engines can crawl your site more efficiently, wasting neither their resources nor yours.
Creating an XML sitemap is extremely simple. Here is an example:
<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://www.redsauce.com/</loc>
    <lastmod>2009-11-04</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
</urlset>
For each individual page, you specify the URL between the <loc></loc> tags, the date the page was last updated (in YYYY-MM-DD form, as in the example) between the <lastmod></lastmod> tags, how often the page is likely to change between the <changefreq></changefreq> tags (always, hourly, daily, weekly, monthly, yearly or never) and how important the page is between the <priority></priority> tags.
The priority of a page can be any value from 0.0 to 1.0, and defaults to 0.5 if you leave it out. Use 1.0 for your most important page, e.g. your home page; the deeper you go into your site, the lower the priority should be. Priority only ranks your own pages relative to one another. With all this information, search engines can crawl your pages more intelligently without wasting your server’s resources.
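The "deeper means lower priority" rule of thumb can be turned into a small helper. This is only a sketch: the function name priority_for, the 0.2 drop per level and the 0.1 floor are my own assumptions for illustration, not part of the sitemap protocol.

```python
from urllib.parse import urlparse


def priority_for(url, step=0.2, floor=0.1):
    """Return a priority that starts at 1.0 and drops with URL path depth.

    `step` and `floor` are illustrative assumptions, not protocol values.
    """
    path = urlparse(url).path
    # Count non-empty path segments: "/" has depth 0, "/blog/post" has depth 2.
    depth = len([segment for segment in path.split("/") if segment])
    return round(max(floor, 1.0 - step * depth), 1)
```

Under these assumptions the home page gets 1.0, a page one level down gets 0.8, and anything very deep bottoms out at 0.1. Tune the numbers to match how your own site is organised.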
Once you have finished creating your XML file, save it to the root of your site under a name such as sitemap.xml. You can let the search engines know it exists by adding this line to a robots.txt file (placed at the root of your domain):
Sitemap: http://www.redsauce.com/sitemap.xml
or by telling Google the location of your XML sitemap file in Webmaster Tools.
An XML sitemap file can contain a maximum of 50,000 URLs. The protocol originally capped the file size at 10MB; the current version allows up to 50MB (52,428,800 bytes). You can gzip your XML sitemap files to save bandwidth, but the size limit applies to the uncompressed file.
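If your site is large enough to hit these limits, a script can split the URL list and check the size before compressing. The sketch below is one way to do it; the names chunk_urls and compress_sitemap are hypothetical, and the size limit is passed in as a parameter rather than hard-coded.

```python
import gzip

MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol


def chunk_urls(urls, size=MAX_URLS):
    """Yield successive lists of at most `size` URLs, one list per sitemap file."""
    urls = list(urls)
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def compress_sitemap(xml_bytes, max_bytes):
    """Gzip a sitemap, refusing input whose uncompressed size exceeds max_bytes."""
    if len(xml_bytes) > max_bytes:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; limit is {max_bytes}"
        )
    return gzip.compress(xml_bytes)
```

The check runs on the uncompressed bytes because that is the size the limit applies to; compression only saves bandwidth.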
You can write a script to create your sitemap from the pages on your site, or if your site doesn’t contain too many pages, you can use an online service such as xml-sitemaps.com to generate the sitemap file for you.
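As a rough idea of what such a script might look like, here is a minimal sketch in Python using only the standard library. It assumes you already have your pages as a list of dictionaries (in practice they would come from your database or CMS); the function name build_sitemap and that input shape are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML (as UTF-8 bytes) from an iterable of page dicts.

    Each dict needs a 'loc' key and may also carry 'lastmod',
    'changefreq' and 'priority'.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Emit the child tags in the same order as the example above.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # ElementTree escapes characters such as & in URLs for us.
    body = ET.tostring(urlset, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    pages = [
        {
            "loc": "http://www.redsauce.com/",
            "lastmod": "2009-11-04",
            "changefreq": "daily",
            "priority": 1.0,
        },
    ]
    with open("sitemap.xml", "wb") as handle:
        handle.write(build_sitemap(pages))
```

Building the file through an XML library rather than string concatenation means special characters in your URLs are escaped correctly without extra work.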
Tags: crawling, sitemaps, spidering, XML, XML sitemaps