We’re often asked how Yahoo! Search determines which pages get indexed and which pages are left un-crawled. First and foremost, we honor the industry-standard robots.txt file format, which gives Webmasters several layers of control over which sites, pages and specific URLs should be indexed. Lately we’ve heard from a number of Webmasters asking how best to prevent ad tracking URLs and dead URLs from getting indexed, so we thought we’d respond via this post.
Ad tracking URLs
Ad tracking URLs help Webmasters determine what traffic is coming in from advertisements (e.g., Yahoo! Sponsored Search and Yahoo! Publisher Network), but they don’t need to be included in the Yahoo! Search index. Sometimes you might notice that these URLs still appear in the index. That’s because they’ve appeared on pages that are “crawlable” or may have been copied over to crawlable pages by users. If you don’t want Yahoo! Slurp, our Web crawler, to index these URLs, you can use wildcards in robots.txt. For example, if you are using the parameter ‘ref’ to track ad sources, you can use a rule like the one below to keep your tracking URLs from being Slurped:
User-Agent: Yahoo! Slurp
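To see which URLs a wildcard rule would catch before you deploy it, it can help to test the pattern offline. The Python sketch below is only an illustration: it assumes the common wildcard semantics (‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL, and rules match from the start of the path), and the Disallow pattern `/*?ref=` is a hypothetical example for the ‘ref’ parameter, not a rule quoted from this post. The function names are ours as well.

```python
import re

# Hypothetical Disallow pattern for a 'ref' tracking parameter (an assumption,
# not a rule taken from the post).
HYPOTHETICAL_RULE = "/*?ref="


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule using '*' and a trailing '$' into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule: str, path_and_query: str) -> bool:
    """Return True if the rule would match this path (including any query string)."""
    # Rules are matched from the start of the path, so use match(), not search().
    return rule_to_regex(rule).match(path_and_query) is not None


if __name__ == "__main__":
    for url in ["/products/shoes?ref=ysm", "/products/shoes", "/landing?id=3&ref=ypn"]:
        print(url, is_blocked(HYPOTHETICAL_RULE, url))
```

Under these assumptions, `/*?ref=` only catches URLs where ‘ref’ is the first query parameter. A looser pattern such as `/*ref=` would also catch `&ref=`, but it would match unrelated parameters like ‘pref=’ too, so it is worth checking a rule against a sample of your real URLs.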
Dead URLs
The best way to remove dead URLs from the Yahoo! Search index is to return an HTTP 404 (Not Found) status when our crawler requests the page. If you want to act before the 404 discovery and URL removal process completes, you can use Site Explorer to quickly delete the URLs from the index. One advantage of using Site Explorer is that you can delete multiple URLs at once, including an entire subpath, so long as the URL prefix is the same. As Danny Sullivan points out in his deep-dive post on the delete function, if you delete http://domain.com/subarea1/, then all the pages that begin with “domain.com/subarea1” will get removed.
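Because a delete applies to everything under the same prefix, it is worth checking what a given prefix would cover before you submit it. The Python sketch below models the behavior described above as a plain string-prefix comparison on the host and path. It is not Site Explorer’s actual implementation; the function names and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def strip_scheme(url: str) -> str:
    """Reduce a URL to host + path (+ query), e.g. 'domain.com/subarea1/'."""
    parts = urlsplit(url)
    query = "?" + parts.query if parts.query else ""
    return parts.netloc + parts.path + query


def covered_by_delete(deleted_url: str, indexed_urls: list[str]) -> list[str]:
    """List the indexed URLs that begin with the deleted URL's prefix."""
    # Assumption: the trailing slash is dropped, following the
    # "domain.com/subarea1" wording in the text.
    prefix = strip_scheme(deleted_url).rstrip("/")
    return [u for u in indexed_urls if strip_scheme(u).startswith(prefix)]


if __name__ == "__main__":
    sample_index = [  # hypothetical URLs
        "http://domain.com/subarea1/",
        "http://domain.com/subarea1/page.html",
        "http://domain.com/subarea1/deeper/old.html",
        "http://domain.com/subarea2/page.html",
        "http://domain.com/",
    ]
    for url in covered_by_delete("http://domain.com/subarea1/", sample_index):
        print(url)
```

Read literally, a prefix of “domain.com/subarea1” would also cover a sibling such as “domain.com/subarea10/”. The description above doesn’t say whether Site Explorer treats that case differently, so review the affected URLs if your site has similarly named directories.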
We’ll continue to use the Yahoo! Search blog to give Webmasters like you pointers on how to better manage your sites in the Yahoo! Search index. Be sure to visit us at the Site Explorer Suggestion Board if there are specific areas that you’d like us to address in more detail.