Google has an XML-based sitemap specification that has been adopted by most of the other search engines. If you generate your sitemap in this format, Google can more quickly and accurately traverse your site and return more relevant results. It also gives the googlebot an idea of how frequently to re-index pages (do specific pages change often or rarely?).
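For reference, a minimal sitemap.xml in that format looks something like this (the domain, paths, and dates are placeholders, following the sitemaps.org protocol):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-10-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/articles/2005/10/BreadMaking.php</loc>
    <lastmod>2005-10-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

The `<changefreq>` element is what gives the googlebot that hint about how often to come back and re-index each page.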
To encourage Google and other engines to fully index your website, the “sitemap.php” (or “sitemap.html”, etc.) file, and especially the “sitemap.xml” file, need to be at the root level of the site, and a link to the human-readable one needs to be included near the top of every page (within the top 20% of the page in code view, not just in display view) to be sure the sitemap is easily located and indexed. All 404 error pages and other error documents should include a link to the sitemap to help both people and bots find an alternate choice if they followed a stale link.
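On an Apache server, the custom error page is wired up with an ErrorDocument directive in .htaccess; the path here is just an example, and the page it points to is where you would put that sitemap link:

```apache
# Serve a custom page (which links to the sitemap) instead of the bare 404
ErrorDocument 404 /errors/404.php
```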
Google’s site includes links to some tools that help automate the sitemap.xml creation process. I’ve tried several and haven’t found any I liked. Of course, part of that is because there are certain parts of my website that I don’t want indexed or included in a sitemap. Other websites may not have the same issue.
Some other tips to help get a site indexed, as well as to improve search engine rankings:
- Try to keep the structure as flat as possible. It’s more of a maintenance and filing nightmare to have everything stored in the root and up to 2 levels down, but for many search engines, that’s as far down as they dig for relevant content. It’s presumed that anything buried more than 2 folders deep is not very interesting, so it is either ignored or given a very low ranking.
- Extend the domain name registration period as long as possible. The theory here is that a site that only has a few months left is probably not a long-term web site. The owner may give up in a few months, so why bother to weight it highly? A site that has 1-2 years left on it is average. A site with more years (and the more the better) indicates a long-term investment. It’s presumed that the site will be around for a long time, and that the content will likely remain as well. That makes it more stable, easier to index, easier to link to, and thus ranked higher by some search engines. This is a relatively new development in search engine rankings.
- Avoid moving files around on the server. Develop a good, logical, expandable system (“/articles/2005/10/BreadMaking.php” instead of “/familyfun/BreadMaking.php”, which is still better than “/new/BreadMaking.php”). Then, as you add new articles, you just add the appropriate dated folders, and things are easy to find and maintain. Where I say “articles”, it could be “events”, etc., or the topic/agency/etc. could come first, with the articles added after that. Either way is fine, though the deeper the pathname, the more it violates the flat-structure tip above. That’s the paradox of good content management. As an alternative, you could use dashes instead of slashes (“/articles-2005-10/BreadMaking.php”), which keeps the folder depth low but still gives you easy linking. It still looks ugly to human designers working within Dreamweaver (imagine month after month of article folders with just one document inside; it could look a bit overwhelming), but it is easy for both the bots and the users to access and index.
- If you do move a file, on an Apache web server add a redirect to the .htaccess file in the original folder to point to the new location or to some alternate content (even if it’s just the home page). For example, after an event is over, delete the page and add a redirect inside the .htaccess file pointing to a page that explains that the desired event is past and gives a link to the main Events page so visitors can look for other upcoming events. This helps in two ways: 1) users aren’t flummoxed by a 404 error message, and they can look for new events (which helps promote the new ones), and 2) search engines lower the rankings of sites with error pages; the more errors, the lower your ranking drops. Permanent redirects are not considered errors (they’re just a status code), and while not as nice as finding valid content, they don’t hurt your ranking and could even improve it slightly. Sites that set up permanent redirects like that are deemed by some search engines to be more conscientious and probably better resources. A redirect can also be read as the sign of an active, dynamic site that changes often, not a stodgy one that never needs to redirect because nothing ever changes.
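As a sketch, the .htaccess for a retired event page might contain something like this (the paths are hypothetical):

```apache
# The event is over; permanently redirect visitors to an explanation page
Redirect permanent /events/fall-festival-2005.php /events/past-event.php
```

“Redirect permanent” sends an HTTP 301 status code, which is the “moved permanently” signal search engines treat kindly.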
Within Firefox, there are a ton of developer tools that make testing even easier. The Web Developer Toolbar and FANGS are two of my favorites. With the Web Developer Toolbar you can enable and disable all sorts of things for testing. FANGS emulates what the website would “look” like to a blind person listening to a screen reader read your page. If you’ve never tried it, you will be amazed (and probably on the verge of tears) at how bad your new layout looks and sounds in different scenarios.
You can drive yourself crazy with this stuff, so here’s a goal: determine which browsers and versions account for 75-80% of your site’s traffic. Test those rigorously to be sure the content looks good and doesn’t break horribly in them. Then look at the next 10% and be sure the site still looks reasonably decent (it might be kind of ugly, but it should still be readable and usable). Finally, test any of the remaining browsers and versions that you can, just to make sure they are functional enough that people can at least get contact information and their browser doesn’t bomb. And then check it all over again the next time you make any major change.
- Validate your code! This should probably come before testing. If your code (HTML, XHTML, CSS, XML, WML, etc.) can pass the various code validators out there, then it’s pretty likely your page will render (maybe not really well, but at least it will render). After all your syntax is correct (i.e., it’s HTML 4.01 or XHTML 1.0, CSS 2.0, and/or RSS/Atom compliant) and you have verified your hyperlinks, run the site past Bobby, Cynthia, and other validators that check your site’s accessibility. If it passes those more rigorous tests, then your page should render fairly well on every browser out there. It might even look really, really good in most browsers. And that’s a lot more than can be said for most websites. Firefox has a plugin called HTML Validator that loads HTML Tidy into Firefox’s source code viewer. It helps you quickly and easily correct syntax errors in real time, without sending lots of requests to the validation services. Very cool. While you’re checking things, also check the speed at which your pages load.
You can look at my site for an example of a human-readable sitemap and a Google-friendly one. I have slightly tweaked the XML document to reference a stylesheet that makes it more human-readable than the default XML. This does not reduce the functionality for the googlebot, and it increases the functionality if a human hits the page. Eventually I plan to turn the currently non-clickable web addresses into clickable URLs so that humans can click any link to go straight to that page.
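That stylesheet tweak is a single processing instruction at the top of sitemap.xml (the “sitemap.xsl” filename here is an assumption; use whatever your stylesheet is actually called):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- ...url entries as usual... -->
</urlset>
```

Browsers apply the XSL when a human views the file, while bots that parse the XML directly just ignore the instruction; an XSL template could even wrap each `<loc>` value in an anchor tag, which would take care of making the addresses clickable.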