Robots.txt — a complete guide to the robots.txt file

The robots.txt file is one of the foundations of website configuration for search engines. Despite its simple syntax, a misconfigured file can block indexing of an entire site or expose sensitive parts of its structure. This guide takes you through everything you need to know, from basic syntax to advanced techniques and common mistakes.

What is a robots.txt file?

Robots.txt is a plain-text file placed in the root directory of a domain, at example.com/robots.txt. It defines rules for web robots (crawlers), specifying which parts of the site they may visit and index. The protocol, known as the Robots Exclusion Protocol (REP) and standardized in RFC 9309, is respected by all major search engines: Google, Bing, Yandex, DuckDuckGo, and others.

Important: robots.txt is only a suggestion, not an order. Malicious bots may ignore its rules. It should not be used as the sole mechanism for protecting sensitive resources.
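Well-behaved clients check these rules before fetching a URL. Python's standard library ships a parser for exactly this; a minimal sketch (the rules are supplied inline for illustration, whereas a real client would call set_url() and read() to fetch the live file):

```python
# Minimal sketch: checking robots.txt rules with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("*", "https://example.com/admin/panel"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```

Because the protocol is advisory, this check only matters for crawlers that choose to run it; it provides no access control on the server side.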

Where to place the robots.txt file?

The file must be located in the root directory of the host, not in a subdirectory, and each subdomain serves its own copy. Available at:

https://example.com/robots.txt ✓ Correct
https://www.example.com/robots.txt ✓ Correct
https://example.com/folder/robots.txt ✗ Incorrect
https://sub.example.com/robots.txt ✓ Separate robots.txt for a subdomain

Basic syntax

The robots.txt file consists of rule groups. Each group starts with one or more User-agent directives, followed by Allow and Disallow directives. Groups are separated by blank lines.

File structure

# Comment — line starts with #
User-agent: [bot-name]
Disallow: [path]
Allow: [path]
Crawl-delay: [seconds]
User-agent: [other-bot]
Disallow: [path]
Sitemap: [sitemap-URL]

Directives — full list

User-agent (all bots)
User-agent: *

Specifies which bot a group of rules applies to; * means all bots.

Disallow (all bots)
Disallow: /admin/

Blocks crawling of the path and all its subdirectories.

Allow (Google, Bing)
Allow: /public/

Allows access even if the parent path is blocked.

Sitemap (all major bots)
Sitemap: https://example.com/sitemap.xml

Indicates the location of the XML sitemap. The value must be an absolute URL.

Crawl-delay (Bing, Yandex)
Crawl-delay: 10

Minimum pause between crawler requests, in seconds. Google ignores this directive.

Host (Yandex)
Host: example.com

Indicates the preferred domain (mirror). Used only by Yandex.

Clean-param (Yandex)
Clean-param: sid

Tells the bot which URL parameters do not affect page content.
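Several of these directives can be read programmatically. Python's standard urllib.robotparser exposes Crawl-delay via crawl_delay() and Sitemap entries via site_maps() (Python 3.8+); a short sketch with inline rules for illustration:

```python
# Reading Crawl-delay and Sitemap values with Python's stdlib parser.
# A real client would call set_url() and read() to fetch the live file.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
""".splitlines())

print(rp.crawl_delay("*"))  # 10
print(rp.site_maps())       # ['https://example.com/sitemap.xml']
```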

Wildcards and path patterns

Google and Bing support two special wildcard characters in paths:

*
Disallow: /*.pdf$

Matches any string of characters (zero or more).

$
Disallow: /search$

Matches the end of the URL — the path must end exactly at this point.

Pattern examples

# blocks the entire site
Disallow: /
# blocks /admin/ and all subdirectories
Disallow: /admin/
# blocks all URLs ending in .pdf
Disallow: /*.pdf$
# blocks all URLs with query parameters
Disallow: /*?
# blocks only /search, not /search/results
Disallow: /search$
# allows a subdirectory of a blocked directory
Allow: /admin/public/
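The matching behavior of these patterns can be mimicked with regular expressions. A rough sketch, where the helper name pattern_to_regex is ours and not part of any standard library:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters (including none); a trailing '$'
    anchors the match at the end of the URL path. Everything else is
    treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf")))  # True
print(bool(pattern_to_regex("/search$").match("/search/results")))   # False
print(bool(pattern_to_regex("/search$").match("/search")))           # True
```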

Rule priority — what wins?

When several rules match the same URL, Google applies the longest match rule — the rule with the longest matching pattern wins. In the case of equal length, Allow takes precedence over Disallow.

# Example rules:
User-agent: *
Disallow: /folder/
Allow: /folder/public/
# For URL /folder/private/ → Disallow (longer match)
# For URL /folder/public/ → Allow (longer match)
# For URL /folder/ → Disallow
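The precedence logic can be sketched in a few lines. This toy decide() function (our own illustration, not Google's implementation) handles plain path prefixes only, with no wildcards, and breaks length ties in favor of Allow:

```python
# Sketch of longest-match precedence for plain path prefixes.
def decide(url_path, rules):
    """rules: list of (directive, path) tuples, e.g. ("Disallow", "/folder/")."""
    matches = [(len(path), directive == "Allow", directive)
               for directive, path in rules
               if url_path.startswith(path)]
    if not matches:
        return "Allow"  # no rule matched: crawling is permitted
    matches.sort()      # longest match last; Allow beats Disallow on a tie
    return matches[-1][2]

rules = [("Disallow", "/folder/"), ("Allow", "/folder/public/")]
print(decide("/folder/private/", rules))  # Disallow
print(decide("/folder/public/", rules))   # Allow
print(decide("/folder/", rules))          # Disallow
```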

Configuration examples

1. Basic configuration — WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /tag/
Disallow: /author/

Sitemap: https://example.com/sitemap.xml

2. E-commerce store

User-agent: *
Disallow: /cart/
Disallow: /order/
Disallow: /my-account/
Disallow: /dashboard/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /products/
Allow: /categories/

Sitemap: https://sklep.pl/sitemap.xml
Sitemap: https://sklep.pl/sitemap-produkty.xml

3. Blocking selected AI bots

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
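The per-bot blocks above are repetitive, so they are easy to generate. A small sketch in which the bot list, domain, and function name are all illustrative:

```python
# Assemble robots.txt blocks that opt selected AI crawlers out of the
# whole site while keeping normal crawling rules for everyone else.
AI_BOTS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended"]

def block_ai_bots(bots, sitemap="https://example.com/sitemap.xml"):
    blocks = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    blocks.append("User-agent: *\nDisallow: /admin/\nAllow: /")
    blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"

txt = block_ai_bots(AI_BOTS)
print(txt)
```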

4. Page in maintenance mode

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
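You can verify that this maintenance-mode setup behaves as intended with Python's standard urllib.robotparser (the bot name SomeOtherBot is just an example):

```python
# Empty Disallow for Googlebot allows everything; all other bots
# fall through to the "*" group and are blocked site-wide.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/"))      # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))   # False
```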

Known crawlers and their User-Agents

Crawler           Owner           User-agent token
Googlebot         Google          Googlebot
Google Images     Google          Googlebot-Image
Google AdsBot     Google          AdsBot-Google
Google Extended   Google (AI)     Google-Extended
Bingbot           Microsoft       bingbot
Yandex            Yandex          YandexBot
DuckDuckBot       DuckDuckGo      DuckDuckBot
Baidu             Baidu           Baiduspider
GPTBot            OpenAI          GPTBot
Claude            Anthropic       anthropic-ai
CCBot             Common Crawl    CCBot
SemrushBot        Semrush         SemrushBot
AhrefsBot         Ahrefs          AhrefsBot

Most common mistakes in robots.txt

Blocking the entire site
Disallow: / for all bots blocks the entire site — one of the most costly SEO mistakes. Google regularly reports such pages in Search Console.
Blocking pages with noindex
If a page has a noindex meta tag, do not block it in robots.txt. The crawler must visit the page to see the noindex directive. A blocked page may remain in the index if there was a link to it.
Revealing the site structure
Robots.txt is public. By entering Disallow: /secret-panel/, you inform everyone of that directory's existence. Use robots.txt to control crawling, not to hide resources.
Lack of separate files for subdomains
Robots.txt on example.com does not apply to blog.example.com. Each subdomain needs its own robots.txt file.
Blocking CSS and JS resources
Google needs access to CSS and JavaScript to render the page and evaluate its quality. Blocking these resources can harm rankings.
Confusing robots.txt with .htaccess
Robots.txt does not block access to files — it only informs bots not to visit them. A user can still access a blocked URL. For real protection, use .htaccess or server configuration.

Robots.txt and SEO — what you need to know

Robots.txt directly affects crawl budget: the number of URLs Google is willing to crawl on a site within a given period. Used well, it steers crawlers toward important subpages instead of wasting that budget on insignificant URLs.

Block insignificant URLs
Sorting, filtering, and session parameters — block them so that crawlers focus on valuable subpages.
Always add a Sitemap
The Sitemap directive in robots.txt is a quick way to inform all search engines about the location of the sitemap.
Protect admin panels
Block /admin/, /wp-admin/, /phpmyadmin/ — not for security, but to avoid wasting crawl budget.
Verify in Search Console
Alongside our tool, you can use Google Search Console, which includes a built-in robots.txt report showing exactly how Google interprets your rules.