Robots.txt — a complete guide to the robots.txt file
The robots.txt file is one of the foundations of configuring a website for search engines. Despite its simple syntax, a misconfiguration can block crawling of the entire site or expose sensitive site structure. This guide takes you through everything you need to know, from basic syntax to advanced techniques and common mistakes.
What is a robots.txt file?
Robots.txt is a plain text file placed in the root directory of a domain, at example.com/robots.txt. It defines rules for web robots (crawlers), specifying which parts of the site they may visit. The protocol is known as the Robots Exclusion Protocol (REP), standardized in 2022 as RFC 9309, and is respected by all major search engines: Google, Bing, Yandex, DuckDuckGo, and others.
Important: robots.txt is only a suggestion, not an order. Malicious bots may ignore its rules. It should not be used as the sole mechanism for protecting sensitive resources.
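A minimal robots.txt file can look like this (the blocked path and the sitemap URL are placeholders):

User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml

Every configuration in this guide builds on this pattern: name a bot, then list the paths it may or may not visit.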
Where to place the robots.txt file?
The file must be located in the root directory of the host, not in a subdirectory. Each subdomain needs its own robots.txt file. Available at:
https://example.com/robots.txt
✓ Correct
https://www.example.com/robots.txt
✓ Correct
https://example.com/folder/robots.txt
✗ Incorrect
https://sub.example.com/robots.txt
✓ Separate robots.txt for a subdomain
Basic syntax
The robots.txt file consists of rule groups. Each group starts with one or more User-agent directives, followed by Allow and Disallow directives. Groups are separated by blank lines.
File structure
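As a sketch, a file with two rule groups could look like this (all paths are illustrative; lines starting with # are comments and are ignored by crawlers):

# Group 1: rules for Googlebot only
User-agent: Googlebot
Disallow: /internal/

# Group 2: rules for every other bot
User-agent: *
Disallow: /tmp/
Allow: /tmp/public/

# Sitemap is independent of groups and applies to the whole file
Sitemap: https://example.com/sitemap.xml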
Directives — full list
User-agent
All crawlers
Specifies which bot the group applies to. * means all bots.
Disallow
All crawlers
Blocks access to the given path and everything under it.
Allow
Google, Bing
Allows access even if a parent path is blocked.
Sitemap
All crawlers
Indicates the location of the XML sitemap.
Crawl-delay
Bing, Yandex
Minimum pause between crawler requests, in seconds. Google ignores it.
Host
Yandex
Indicated the preferred domain. Historically used by Yandex, now deprecated in favor of 301 redirects.
Clean-param
Yandex
Tells the bot which URL parameters have no effect on page content.
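Taken together, the Yandex-specific directives might be used like this (the parameter names and the /catalog/ path prefix are illustrative):

User-agent: YandexBot
Crawl-delay: 5
Clean-param: utm_source&utm_medium /catalog/

Here Yandex is asked to wait 5 seconds between requests and to treat /catalog/ URLs that differ only in utm_source or utm_medium as the same page.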
Wildcards and path patterns
Google and Bing support two special wildcard characters in paths:
*
Matches any sequence of characters (zero or more). Example: Disallow: /*.pdf$ blocks every URL ending in .pdf.
$
Matches the end of the URL; the path must end exactly at that point. Example: Disallow: /search$ blocks /search but not /search/results.
Pattern examples
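A few more illustrative patterns (the paths are hypothetical; comments after # explain each rule):

Disallow: /*?sessionid=    # any URL containing a sessionid query parameter
Disallow: /private         # /private, /private/, /private-docs, ...
Disallow: /*/temp/         # /a/temp/, /b/c/temp/, and so on
Allow: /downloads/*.html$  # only URLs ending in .html under /downloads/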
Rule priority — what wins?
When several rules match the same URL, Google applies the longest match rule — the rule with the longest matching pattern wins. In the case of equal length, Allow takes precedence over Disallow.
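For example, with the common WordPress pattern:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Both rules match the URL /wp-admin/admin-ajax.php, but the Allow pattern is longer, so Google will crawl it; every other URL under /wp-admin/ stays blocked.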
Configuration examples
1. Basic configuration — WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /tag/
Disallow: /author/
Sitemap: https://example.com/sitemap.xml
2. E-commerce store
User-agent: *
Disallow: /cart/
Disallow: /order/
Disallow: /my-account/
Disallow: /dashboard/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /products/
Allow: /categories/
Sitemap: https://sklep.pl/sitemap.xml
Sitemap: https://sklep.pl/sitemap-produkty.xml
3. Blocking selected AI bots
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Disallow: /admin/
Allow: /
Sitemap: https://example.com/sitemap.xml
4. Allowing only Googlebot
This configuration lets Googlebot crawl everything while blocking all other bots. For a genuine maintenance break, returning HTTP 503 is the recommended approach rather than editing robots.txt.
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Known crawlers and their User-Agents
Googlebot
Googlebot-Image
AdsBot-Google
Google-Extended
bingbot
YandexBot
DuckDuckBot
Baiduspider
GPTBot
anthropic-ai
CCBot
SemrushBot
AhrefsBot
Most common mistakes in robots.txt
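Some of the most frequent errors, sketched as robots.txt fragments (the paths are illustrative):

# 1. Blocking CSS and JS that Google needs to render the page
Disallow: /assets/

# 2. A leftover staging rule that blocks the whole site
User-agent: *
Disallow: /

# 3. Wrong letter case: paths are case-sensitive
Disallow: /Admin/   # does not block /admin/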
Robots.txt and SEO — what you need to know
Robots.txt directly affects crawl budget: the amount of crawling Google allocates to a site. Used well, it steers crawlers toward important subpages instead of letting them waste requests on insignificant URLs. Keep in mind, though, that Disallow prevents crawling, not indexing: a blocked URL can still appear in search results if other pages link to it. To keep a page out of the index, let it be crawled and use a noindex meta tag instead.