Robots.txt: The Complete Guide to Controlling How Search Engines Crawl Your Site
The robots.txt file is one of the oldest and most fundamental pieces of technical SEO infrastructure on the web. It's a plain text file sitting at the root of your domain — always accessible at yourdomain.com/robots.txt — and its entire purpose is to communicate with search engine crawlers about which parts of your website they're allowed to access. Despite its age and simplicity, it remains one of the most misunderstood files in web development. A misconfigured robots.txt can accidentally block your entire website from Google, and you might not notice for weeks.
Understanding how robots.txt works, what it can and cannot do, and how to audit it properly is essential knowledge for anyone serious about technical SEO.
How robots.txt Works
When a crawler like Googlebot arrives at your domain, it requests your robots.txt file before fetching other pages. It reads the directives in that file and uses them to decide which URLs it may fetch. If no robots.txt exists (the request returns a 404), crawlers assume they have permission to crawl everything.
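Because the file always lives at the root of the host, its location can be derived from any page URL. A minimal Python sketch using only the standard library; the function name robots_url and the example.com address are illustrative assumptions, not part of any crawler's actual code:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# prints https://example.com/robots.txt
```

Note that the scheme and host are preserved, so a subdomain or a non-default port gets its own robots.txt.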
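The file is organized around "User-agent" declarations, which specify which crawler a particular set of rules applies to. The wildcard User-agent: * applies to all crawlers. You can also write rules specific to individual bots: User-agent: Googlebot for Google, User-agent: Bingbot for Bing, and so on. A crawler that finds a group addressed to it by name follows that group only and ignores the wildcard rules, so any shared restrictions must be repeated in the bot-specific group.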
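To see how User-agent groups are selected, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt content, the bot name ExampleBot, and the example.com paths are made up for illustration, and Python's parser is only an approximation of how any particular search engine behaves:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a wildcard group and a Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """True if the named crawler may fetch the given path."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Googlebot matches its own group, so only /drafts/ is off limits to it.
print(allowed("Googlebot", "/private/page"))   # True
print(allowed("Googlebot", "/drafts/page"))    # False
# Any other bot falls back to the wildcard group.
print(allowed("ExampleBot", "/private/page"))  # False
print(allowed("ExampleBot", "/drafts/page"))   # True
```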
Disallow vs. Allow Directives
The two most commonly used directives are Disallow and Allow. Disallow tells a crawler it cannot access a specific path. For example, Disallow: /admin/ blocks all URLs starting with /admin/. Allow overrides a broader Disallow rule for a more specific path. For example, Disallow: /downloads/ combined with Allow: /downloads/catalog.pdf blocks the directory but leaves that one file crawlable. When both rules match a URL, Google and the robots.txt standard (RFC 9309) apply the rule with the longest matching path.
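The interaction between Allow and Disallow can be sketched in a few lines of Python. This assumes the longest-matching-path precedence that Google documents and RFC 9309 specifies, handles plain path prefixes only (no * or $ wildcards), and uses hypothetical paths; other crawlers may resolve conflicts differently:

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled.

    rules is a list of (directive, path_prefix) pairs taken from one
    User-agent group, e.g. ("Disallow", "/downloads/").
    """
    best_len = -1
    verdict = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        allow = directive.lower() == "allow"
        # The longest matching prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, verdict = len(prefix), allow
    return verdict


rules = [("Disallow", "/downloads/"), ("Allow", "/downloads/catalog.pdf")]
print(is_allowed(rules, "/downloads/catalog.pdf"))  # True
print(is_allowed(rules, "/downloads/secret.zip"))   # False
print(is_allowed(rules, "/about"))                  # True
```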
A critical point that many developers misunderstand: Disallow prevents crawling; it does not prevent indexing. If a page has backlinks pointing to it from other websites, Google may still index the URL even if you've disallowed it in robots.txt, because it discovers the URL from those external links. To actually keep a URL out of search results, you need a noindex directive (a robots meta tag or an X-Robots-Tag HTTP header). The crawler has to fetch the page to see that directive, so the URL must not also be disallowed in robots.txt. Robots.txt controls the crawler's door; noindex controls the index itself.
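When auditing a page, it helps to check whether it actually carries a noindex directive. The following Python sketch looks for one in the robots meta tag and in the X-Robots-Tag response header. The names RobotsMetaParser and has_noindex are hypothetical, the headers argument is assumed to be a plain dict keyed exactly as "X-Robots-Tag", and bot-specific meta tags such as name="googlebot" are not handled:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            for token in (attr.get("content") or "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str, headers: dict) -> bool:
    """True if the page asks search engines not to index it."""
    # X-Robots-Tag is the HTTP header equivalent of the meta tag.
    header_value = headers.get("X-Robots-Tag", "").lower()
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "noindex" in header_value


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page, {}))  # True
```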
The Sitemap Directive
Many robots.txt files include a Sitemap: directive giving the full, absolute URL of the XML sitemap. This is a helpful signal for crawlers, letting them discover your sitemap without having to search for it. The directive is independent of any User-agent group, and you can include several Sitemap lines, for example one per sitemap when different sections of your site have separate sitemaps, or a single line pointing to a sitemap index file.
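As an illustration, Sitemap lines can be read back out of a robots.txt file with Python's standard urllib.robotparser (the site_maps() method requires Python 3.8 or later). The file content and sitemap URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt listing two section sitemaps.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-products.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file declares no sitemaps.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```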
Crawl-Delay: Use With Caution
The Crawl-delay directive tells crawlers to wait a specified number of seconds between requests. This can be useful for protecting a low-resource server from being overwhelmed by aggressive crawling. However, Google has publicly stated that it does not honor the Crawl-delay directive in robots.txt. Google Search Console used to offer a crawl rate setting for Googlebot, but Google retired that setting in early 2024; its current guidance for slowing Googlebot in an emergency is to temporarily return 500, 503, or 429 status codes. Other crawlers like Bingbot do respect Crawl-delay.
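A crawler that respects Crawl-delay reads the value from the group that applies to it. A short sketch with Python's standard urllib.robotparser (crawl_delay() is available from Python 3.6); the file content, the 10-second value, and the bot name ExampleBot are assumptions for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that slows Bingbot down but sets no delay for others.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("Bingbot")      # seconds between requests
other_delay = parser.crawl_delay("ExampleBot")  # None: no delay declared

print(bing_delay)   # 10
print(other_delay)  # None
```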
Common Robots.txt Mistakes
The most catastrophic mistake is accidentally disallowing everything with Disallow: /. This happens more often than you'd think, usually when a staging environment is correctly blocked during development and that rule then ships in the production robots.txt at launch. The result is that Googlebot stops crawling your entire site, and your rankings can collapse within days as Google re-evaluates pages it can no longer access.
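A simple guard against this is a check, run as part of a deployment, that fails when the production robots.txt blocks the site root. The sketch below uses Python's standard urllib.robotparser; the function name root_is_blocked is hypothetical, and because Python's parser does not implement wildcards or longest-match precedence, treat it as a rough safety net rather than an exact model of Googlebot:

```python
from urllib.robotparser import RobotFileParser


def root_is_blocked(robots_txt: str, agent: str = "Googlebot") -> bool:
    """True if the given crawler is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


staging_rules = "User-agent: *\nDisallow: /\n"
production_rules = "User-agent: *\nDisallow: /admin/\n"

print(root_is_blocked(staging_rules))     # True
print(root_is_blocked(production_rules))  # False
```

In a deployment pipeline, the text of the file about to be published would be passed in, and a True result would stop the release.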
Another frequent mistake is trying to use robots.txt to hide sensitive content. If something should genuinely be private, robots.txt is not the right tool — authentication and proper server access controls are. Robots.txt is a public file that anyone can read, so it can actually reveal the existence of paths you'd rather keep private.
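To make the point concrete, this short Python sketch lists every path a robots.txt file advertises through its Disallow lines, which is exactly what any curious visitor can do. The function name disclosed_paths and the sample paths are hypothetical:

```python
def disclosed_paths(robots_txt: str) -> list:
    """Return every non-empty Disallow path found in the file, in order."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-reports/  # finance team
Allow: /public/
"""
print(disclosed_paths(sample))
# prints ['/admin/', '/internal-reports/']
```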
Using This Tool
Our Robots.txt Checker fetches and parses the robots.txt file from any domain and presents the key directives in an easy-to-read format. Check your own site regularly, especially after deployments, and audit competitor robots.txt files to see which sections they open to or close off from search engine crawlers.