Robots.txt — a complete guide to the robots.txt file

The robots.txt file is one of the foundations of website configuration for search engines. Despite its simple syntax, a misconfigured file can block indexing of an entire site or expose sensitive parts of its structure. This guide takes you through everything you need to know, from basic syntax to advanced techniques and common mistakes.

What is a robots.txt file?

Robots.txt is a plain-text file placed in the root directory of a domain, at example.com/robots.txt. It defines rules for web robots (crawlers), specifying which parts of the site they may visit and index. The protocol, known as the Robots Exclusion Protocol (REP) and standardized in RFC 9309, is respected by all major search engines: Google, Bing, Yandex, DuckDuckGo, and others.

Important: robots.txt is only a suggestion, not an order. Malicious bots may ignore its rules. It should not be used as the sole mechanism for protecting sensitive resources.
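Well-behaved clients check these rules before fetching a URL. Python's standard library ships a parser for exactly this; a minimal sketch (the rules are supplied inline for illustration, whereas a real client would call set_url() and read() to fetch the live file):

```python
# Minimal sketch: checking robots.txt rules with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("*", "https://example.com/admin/panel"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```

Because the protocol is advisory, this check only matters for crawlers that choose to run it; it provides no access control on the server side.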

Where to place the robots.txt file?

The file must be located in the root directory of the host, not in a subdirectory, and each subdomain serves its own copy. Available at:

https://example.com/robots.txt ✓ Correct
https://www.example.com/robots.txt ✓ Correct
https://example.com/folder/robots.txt ✗ Incorrect
https://sub.example.com/robots.txt ✓ Separate robots.txt for a subdomain

Basic syntax

The robots.txt file consists of rule groups. Each group starts with one or more User-agent directives, followed by Allow and Disallow directives. Groups are separated by blank lines.

File structure

# Comment — line starts with #
User-agent: [bot-name]
Disallow: [path]
Allow: [path]
Crawl-delay: [seconds]
User-agent: [other-bot]
Disallow: [path]
Sitemap: [sitemap-URL]

Directives — full list

User-agent (all bots)
User-agent: *

Specifies which bot a group of rules applies to; * means all bots.

Disallow (all bots)
Disallow: /admin/

Blocks crawling of the path and all its subdirectories.

Allow (Google, Bing)
Allow: /public/

Allows access even if the parent path is blocked.

Sitemap (all major bots)
Sitemap: https://example.com/sitemap.xml

Indicates the location of the XML sitemap. The value must be an absolute URL.

Crawl-delay (Bing, Yandex)
Crawl-delay: 10

Minimum pause between crawler requests, in seconds. Google ignores this directive.

Host (Yandex)
Host: example.com

Indicates the preferred domain (mirror). Used only by Yandex.

Clean-param (Yandex)
Clean-param: sid

Tells the bot which URL parameters do not affect page content.
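Several of these directives can be read programmatically. Python's standard urllib.robotparser exposes Crawl-delay via crawl_delay() and Sitemap entries via site_maps() (Python 3.8+); a short sketch with inline rules for illustration:

```python
# Reading Crawl-delay and Sitemap values with Python's stdlib parser.
# A real client would call set_url() and read() to fetch the live file.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
""".splitlines())

print(rp.crawl_delay("*"))  # 10
print(rp.site_maps())       # ['https://example.com/sitemap.xml']
```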

Wildcards and path patterns

Google and Bing support two special wildcard characters in paths:

*
Disallow: /*.pdf$

Matches any string of characters (zero or more).

$
Disallow: /search$

Matches the end of the URL — the path must end exactly at this point.

Pattern examples

# blocks the entire site
Disallow: /
# blocks /admin/ and all subdirectories
Disallow: /admin/
# blocks all URLs ending in .pdf
Disallow: /*.pdf$
# blocks all URLs with query parameters
Disallow: /*?
# blocks only /search, not /search/results
Disallow: /search$
# allows a subdirectory of a blocked directory
Allow: /admin/public/
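The matching behavior of these patterns can be mimicked with regular expressions. A rough sketch, where the helper name pattern_to_regex is ours and not part of any standard library:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters (including none); a trailing '$'
    anchors the match at the end of the URL path. Everything else is
    treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf")))  # True
print(bool(pattern_to_regex("/search$").match("/search/results")))   # False
print(bool(pattern_to_regex("/search$").match("/search")))           # True
```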

Rule priority — what wins?

When several rules match the same URL, Google applies the longest match rule — the rule with the longest matching pattern wins. In the case of equal length, Allow takes precedence over Disallow.

# Example rules:
User-agent: *
Disallow: /folder/
Allow: /folder/public/
# For URL /folder/private/ → Disallow (longer match)
# For URL /folder/public/ → Allow (longer match)
# For URL /folder/ → Disallow
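The precedence logic can be sketched in a few lines. This toy decide() function (our own illustration, not Google's implementation) handles plain path prefixes only, with no wildcards, and breaks length ties in favor of Allow:

```python
# Sketch of longest-match precedence for plain path prefixes.
def decide(url_path, rules):
    """rules: list of (directive, path) tuples, e.g. ("Disallow", "/folder/")."""
    matches = [(len(path), directive == "Allow", directive)
               for directive, path in rules
               if url_path.startswith(path)]
    if not matches:
        return "Allow"  # no rule matched: crawling is permitted
    matches.sort()      # longest match last; Allow beats Disallow on a tie
    return matches[-1][2]

rules = [("Disallow", "/folder/"), ("Allow", "/folder/public/")]
print(decide("/folder/private/", rules))  # Disallow
print(decide("/folder/public/", rules))   # Allow
print(decide("/folder/", rules))          # Disallow
```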

Configuration examples

1. Basic configuration — WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /tag/
Disallow: /author/

Sitemap: https://example.com/sitemap.xml

2. E-commerce store

User-agent: *
Disallow: /cart/
Disallow: /order/
Disallow: /my-account/
Disallow: /dashboard/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /products/
Allow: /categories/

Sitemap: https://sklep.pl/sitemap.xml
Sitemap: https://sklep.pl/sitemap-produkty.xml

3. Blocking selected AI bots

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
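The per-bot blocks above are repetitive, so they are easy to generate. A small sketch in which the bot list, domain, and function name are all illustrative:

```python
# Assemble robots.txt blocks that opt selected AI crawlers out of the
# whole site while keeping normal crawling rules for everyone else.
AI_BOTS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended"]

def block_ai_bots(bots, sitemap="https://example.com/sitemap.xml"):
    blocks = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    blocks.append("User-agent: *\nDisallow: /admin/\nAllow: /")
    blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"

txt = block_ai_bots(AI_BOTS)
print(txt)
```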

4. Page in maintenance mode

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
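You can verify that this maintenance-mode setup behaves as intended with Python's standard urllib.robotparser (the bot name SomeOtherBot is just an example):

```python
# Empty Disallow for Googlebot allows everything; all other bots
# fall through to the "*" group and are blocked site-wide.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/"))      # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))   # False
```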

Known crawlers and their User-Agents

Crawler           Owner           User-agent token
Googlebot         Google          Googlebot
Google Images     Google          Googlebot-Image
Google AdsBot     Google          AdsBot-Google
Google Extended   Google (AI)     Google-Extended
Bingbot           Microsoft       bingbot
Yandex            Yandex          YandexBot
DuckDuckBot       DuckDuckGo      DuckDuckBot
Baidu             Baidu           Baiduspider
GPTBot            OpenAI          GPTBot
Claude            Anthropic       anthropic-ai
CCBot             Common Crawl    CCBot
SemrushBot        Semrush         SemrushBot
AhrefsBot         Ahrefs          AhrefsBot

Most common mistakes in robots.txt

Blocking the entire site
Disallow: / for all bots blocks the entire site — one of the most costly SEO mistakes. Google regularly reports such pages in Search Console.
Blocking pages with noindex
If a page has a noindex meta tag, do not block it in robots.txt. The crawler must visit the page to see the noindex directive. A blocked page may remain in the index if there was a link to it.
Revealing the site structure
Robots.txt is public. By entering Disallow: /secret-panel/, you inform everyone of that directory's existence. Use robots.txt to control crawling, not to hide resources.
Lack of separate files for subdomains
Robots.txt on example.com does not apply to blog.example.com. Each subdomain needs its own robots.txt file.
Blocking CSS and JS resources
Google needs access to CSS and JavaScript to render the page and evaluate its quality. Blocking these resources can harm rankings.
Confusing robots.txt with .htaccess
Robots.txt does not block access to files — it only informs bots not to visit them. A user can still access a blocked URL. For real protection, use .htaccess or server configuration.

Robots.txt and SEO — what you need to know

Robots.txt directly affects crawl budget: the number of URLs Google is willing to crawl on a site within a given period. Used well, it steers crawlers toward important subpages instead of wasting that budget on insignificant URLs.

Block insignificant URLs
Sorting, filtering, and session parameters — block them so that crawlers focus on valuable subpages.
Always add a Sitemap
The Sitemap directive in robots.txt is a quick way to inform all search engines about the location of the sitemap.
Protect admin panels
Block /admin/, /wp-admin/, /phpmyadmin/ — not for security, but to avoid wasting crawl budget.
Verify in Search Console
Alongside our tool, you can use Google Search Console, which includes a built-in robots.txt report showing exactly how Google interprets your rules.