robots.txt

Q: What is a robots.txt file?

A robots.txt file is a plain text file placed at the root of a website. It tells compliant crawlers which pages, directories, or URL patterns they are allowed or not allowed to access.

Q: Does robots.txt affect search rankings?

robots.txt does not directly improve rankings. It can affect SEO indirectly by influencing how search engines crawl a website and whether important pages or resources are accessible.

Q: Can robots.txt prevent a page from appearing in search results?

Not reliably. robots.txt controls crawling, not indexing. If you want to keep a page out of search results, use a noindex directive on a page that crawlers are still allowed to fetch, or protect the page with authentication.

Q: Is robots.txt a security tool?

No. robots.txt is public and should not be used to protect sensitive information. Private content should be protected with authentication, permissions, or server-level restrictions.

Q: Should I include my sitemap in robots.txt?

Yes. Including the sitemap location helps search engines discover the XML sitemap more easily, especially on larger websites.

Q: Can I block specific parameters or dynamic URLs?

Yes. robots.txt can block certain parameter patterns, such as sorting, filtering, or internal search URLs. These rules should be tested carefully to avoid blocking useful crawl paths.

Q: What happens if I don’t have a robots.txt file?

If a website does not have a robots.txt file, compliant crawlers generally assume they can crawl the accessible URLs they discover, unless other restrictions are in place.

Q: Should every website have a robots.txt file?

Most websites should have one, even if it only includes the sitemap location. A simple robots.txt file gives crawlers a predictable place to check for crawl instructions.

Q: What is the difference between robots.txt and noindex?

robots.txt controls whether crawlers are allowed to access a URL. noindex tells search engines not to include a page in search results after they crawl it.

Q: Where should robots.txt be located?

The file should be placed at the root of the site origin, such as https://example.com/robots.txt. Subdomains may need their own separate robots.txt files.

A Practical Guide to Controlling Web Crawlers

SEO, Website, Technical

Author: Steven Hsu
Published: 15/03/2026
Updated: 13/05/2026

The robots.txt file is a small but important part of a website’s technical SEO foundation. It tells compliant search engine crawlers which parts of a website they are allowed to access and which areas they should avoid.

By guiding how automated crawlers interact with a site, robots.txt helps manage crawl behavior, reduce unnecessary crawler requests, and keep search engines focused on the sections of a website that matter most.

robots.txt is not a ranking tool, indexing tool, or security mechanism. Its job is to tell compliant crawlers which URLs they may request before they crawl a site.

Used properly, it keeps crawl paths cleaner. Used carelessly, it can block important pages, hide rendering resources, or create a false sense of control.

What Is robots.txt?

A robots.txt file is a plain text file placed at the root of a website. Search engine crawlers usually check this file before they begin crawling a site.

Its purpose is to provide crawl instructions for bots such as Googlebot, Bingbot, and other web crawlers. These instructions tell crawlers which directories, URL paths, or page patterns they are allowed to access and which ones should be avoided.

Example location:

https://example.com/robots.txt

Because robots.txt sits at the root level of a site origin, it applies to that specific protocol, host, and port.

That means each host or subdomain may need its own robots.txt file:

https://example.com/robots.txt
https://www.example.com/robots.txt 
https://blog.example.com/robots.txt
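As a rough illustration of the origin rule, the sketch below derives the robots.txt URL that governs any given page URL by keeping the scheme, host, and port and discarding everything else. The function name robots_url and the sample URLs are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin (scheme, host, port) of page_url."""
    parts = urlsplit(page_url)
    # netloc carries the host and any explicit port; path, query, and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/1?ref=home"))  # https://blog.example.com/robots.txt
print(robots_url("http://example.com:8080/shop/"))              # http://example.com:8080/robots.txt
```

Two URLs that produce different results here are governed by different robots.txt files.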

Caution

A robots.txt file is publicly accessible. It should never be used to hide private, sensitive, or confidential information.

How robots.txt Works

When a compliant search engine crawler visits a website, it usually checks the robots.txt file first.

The crawler reads the rules, identifies which instructions apply to its user-agent, and then decides which URLs it can request.

In practice, the process looks like this:

1. The crawler requests the robots.txt file from the root of the site.
2. It reads the rules specified in the file.
3. It checks which rule group applies to its user-agent.
4. It determines which paths are allowed or disallowed.
5. It crawls the website based on those instructions.

These instructions apply only to crawlers that choose to respect robots.txt.

Major search engines generally follow robots.txt rules, but malicious bots may ignore them. This is why robots.txt should be treated as crawl guidance, not access control.
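The steps above can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration rather than a description of how any particular search engine is implemented: the rules are supplied as a string instead of being fetched over the network, and the /drafts/ path and the OtherBot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, one fallback group for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    # Read the rules and group them by user-agent.
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Ask before each request. Googlebot is matched to its own group, not the "*" group.
print(parser.can_fetch("Googlebot", "https://example.com/drafts/post"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))       # True
print(parser.can_fetch("OtherBot", "https://example.com/admin/"))        # False
```

The second result is worth noticing: once a crawler finds a group written for its own user-agent, the rules in the generic group no longer apply to it.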

Basic Structure of a robots.txt File

A robots.txt file consists of simple directives that define crawler behavior.

The most common directives are User-agent, Disallow, Allow, and Sitemap.

Basic Structure

User-agent: * 
Disallow: /admin/ 
Disallow: /private/ 
Allow: /public/ 
Sitemap: https://www.example.com/sitemap.xml

User-agent specifies which crawler the rule applies to. The asterisk in User-agent: * means the rule applies to all crawlers.
Disallow tells crawlers not to access specific paths. For example, Disallow: /admin/ asks crawlers not to crawl URLs inside the /admin/ directory.
Allow permits access to a specific path, especially when a broader rule might otherwise block it.
Sitemap specifies the location of the XML sitemap. This helps search engines discover the site’s important URLs more efficiently.
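To make the interaction between Allow and Disallow concrete, here is a small sketch of the longest-match precedence described in RFC 9309 and in Google's documentation: the rule with the longest matching path wins, and Allow wins a tie. It handles plain path prefixes only, other crawlers may resolve conflicts differently, and the /private/press/ path is a hypothetical example.

```python
def is_allowed(path, rules):
    """rules is a list of (directive, path_prefix) pairs from one user-agent group."""
    best_length, verdict = -1, True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        allow = directive.lower() == "allow"
        # Longest matching prefix wins; on a tie, Allow takes precedence.
        if len(prefix) > best_length or (len(prefix) == best_length and allow):
            best_length, verdict = len(prefix), allow
    return verdict

RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/press/"),  # hypothetical exception inside a blocked directory
]

print(is_allowed("/private/notes.html", RULES))     # False
print(is_allowed("/private/press/kit.pdf", RULES))  # True
print(is_allowed("/blog/", RULES))                  # True
```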

Common robots.txt Directives

robots.txt directives are simple, but they need to be written carefully. Small syntax mistakes can cause important sections of a site to be blocked or low-value areas to remain crawlable.

Directive	Example	What It Does
User-agent	User-agent: Googlebot	Applies rules to a specific crawler
User-agent: *	User-agent: *	Applies rules to all crawlers
Disallow	Disallow: /checkout/	Asks compliant crawlers not to crawl a path
Allow	Allow: /images/public/	Allows access to a specific path
Sitemap	Sitemap: https://www.example.com/sitemap.xml	Provides the XML sitemap location

For most websites, robots.txt should remain short and easy to understand.

The more complex the file becomes, the easier it is to create unintended crawl restrictions.

Practical Uses of robots.txt

robots.txt is useful when a website has sections that do not need to be crawled.

This is common on large websites, ecommerce sites, booking engines, internal platforms, and websites that generate many dynamic URLs.

The goal is not to block as much as possible. The goal is to guide crawlers away from low-value or unnecessary paths while keeping important pages and resources accessible.

Controlling Crawl Efficiency

Large websites may have thousands or even millions of URLs.

Some of these URLs are valuable landing pages, articles, product pages, service pages, or category pages. Others may be internal search results, filter combinations, sorting parameters, cart pages, checkout pages, or admin paths.

robots.txt can help search engines avoid wasting crawl resources on areas that do not need to be crawled.

User-agent: *
Disallow: /internal-search/
Disallow: /*?sort=
Disallow: /*?filter=

This kind of setup can be useful for ecommerce websites, travel websites, media libraries, booking systems, and other platforms where URL parameters can create many near-duplicate crawl paths.

However, parameter blocking should be handled carefully.

Some parameterized URLs may still support important content discovery, filtering, pagination, or product access. A broad rule should only be added when the crawl impact is clearly understood.
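One way to understand the crawl impact of a parameter rule is to test the pattern against real URLs first. The sketch below translates a robots.txt path pattern into a regular expression using the wildcard conventions documented by Google and in RFC 9309, where * matches any sequence of characters and a trailing $ anchors the end of the URL. The helper names and sample URLs are assumptions for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern, path_and_query):
    # re.match anchors at the start, which mirrors the prefix behavior of robots.txt rules.
    return pattern_to_regex(pattern).match(path_and_query) is not None

print(matches("/*?sort=", "/shoes?sort=price"))            # True
print(matches("/*?sort=", "/shoes?color=red&sort=price"))  # False
print(matches("/*&sort=", "/shoes?color=red&sort=price"))  # True
```

The second result shows a common gap: /*?sort= only matches when sort is the first parameter in the query string. Covering the other case needs an additional pattern such as /*&sort=, which is exactly the kind of detail that is easy to miss without testing.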

Protecting Non-Public Areas From Crawling

Administrative sections, staging paths, testing folders, and checkout flows often do not need to be crawled by search engines.

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /checkout/

This reduces crawler access to areas that are not meant to appear as search landing pages.

However, this does not make those areas private.

robots.txt Usage Error

robots.txt should not be used as a security mechanism. The file itself is public, and disallowed paths can still reveal where sensitive sections may exist. Private content should be protected with authentication, permissions, or server-level restrictions.

Avoiding Duplicate Crawl Paths

Certain parameters and filtered URLs can generate multiple versions of similar content.

For example, sorting a product list by price, popularity, date, or availability may create different URLs with mostly the same content.

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?view=

Blocking unnecessary parameter patterns can help keep crawling cleaner.

This does not automatically solve duplicate content or indexing issues, but it can reduce crawl waste when those URLs do not need to be accessed.

For indexation control, canonical tags and noindex may still be needed depending on the situation.

robots.txt controls crawling, not whether a known URL is eligible to appear in search results.

robots.txt vs. noindex

A common misunderstanding is confusing robots.txt with noindex.

They solve different problems.

Control Method	Primary Purpose	Best Used When
robots.txt	Controls crawler access before crawling	You want to prevent crawlers from requesting certain paths
noindex	Controls whether a page should appear in search results	You want a crawlable page excluded from the index
Canonical tag	Signals the preferred version of duplicate or similar pages	You want search engines to consolidate signals to one URL
Authentication	Protects private content from public access	You need true access control or security

This distinction matters.

If a page is blocked by robots.txt, Google cannot crawl it and therefore never sees a noindex directive placed on it. That means robots.txt can accidentally hide the very instruction that was supposed to remove the page from the index, and the blocked URL can still appear in search results if other pages link to it.

Use robots.txt when the goal is crawl control.
Use noindex when the goal is index control.
Use authentication or server restrictions when the goal is security.
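The conflict between the two controls can be checked mechanically. The sketch below assumes noindex is delivered in one of its two usual forms, a robots meta tag in the HTML or an X-Robots-Tag response header, and flags the case where a page is both blocked and marked noindex. The function names and the wording of the messages are hypothetical, and the blocked_by_robots flag is assumed to come from a separate robots.txt check.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Records whether the page contains <meta name="robots" content="... noindex ...">."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True

def has_noindex(headers: dict, html: str) -> bool:
    # Header names are case-insensitive, so normalize the keys before the lookup.
    header = {key.lower(): value for key, value in headers.items()}.get("x-robots-tag", "")
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in header.lower() or finder.noindex

def diagnose(blocked_by_robots: bool, headers: dict, html: str) -> str:
    noindex = has_noindex(headers, html)
    if blocked_by_robots and noindex:
        return "conflict: the page is blocked, so crawlers never see its noindex"
    if blocked_by_robots:
        return "crawl blocked: the URL may still be indexed if other pages link to it"
    if noindex:
        return "ok: crawlable and marked noindex"
    return "ok: crawlable and indexable"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(diagnose(True, {}, page))
print(diagnose(False, {}, page))
```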

Common Mistakes to Avoid

robots.txt is simple, but small mistakes can create serious crawling and indexing problems.

Most issues happen when the file is used for the wrong purpose or when broad rules are added without testing.

Common Mistakes

Using robots.txt to hide sensitive information
Assuming Disallow removes a page from search results
Blocking pages that should use noindex instead
Blocking CSS, JavaScript, images, or rendering resources
Accidentally blocking important sections with broad rules
Blocking parameter URLs without checking whether they support useful content discovery
Forgetting that subdomains may need separate robots.txt files
Publishing changes without testing the affected URLs
Blocking the sitemap URL
Using overly complex rules that no one audits
Forgetting to review robots.txt after migrations, redesigns, or CMS changes

A good robots.txt file should be clear, minimal, and intentional.

If a rule does not solve a specific crawl-control problem, it probably should not be there.

Best Practices for robots.txt

A good robots.txt file should be easy to read, easy to test, and easy to maintain. The goal is to help crawlers avoid unnecessary areas without blocking important pages, assets, or rendering resources.

Keep It Simple

robots.txt should be concise and readable.

Overly complex rules increase the chance of accidental blocking, especially when multiple teams manage the same website.

A simple file is easier to review, test, and maintain:

Simple robots.txt Example

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /internal-search/
Sitemap: https://www.example.com/sitemap.xml

Before adding a rule, the question should be simple:

What crawl problem does this solve?

If the answer is unclear, the rule probably should not be added.

Avoid Blocking Important Content

Blocking important pages can prevent search engines from crawling and understanding them.

This is one of the most damaging robots.txt mistakes.

It is also risky to block important page resources such as CSS, JavaScript, images, or frontend assets. If search engines cannot access the resources needed to render a page, they may not understand the page properly.

For modern websites, this matters even more because many pages depend on JavaScript, CSS, media files, and frontend bundles to display content correctly.
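A quick way to catch blocked rendering resources is to collect the script, stylesheet, and image URLs a page references and run each one through the site's rules. The following is a simplified sketch using Python's standard library: it only looks at script, link, and img tags, it skips resources on other hosts (those are governed by their own robots.txt), and the /static/ paths and the cdn.example.net host are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

class ResourceCollector(HTMLParser):
    """Collects absolute URLs of scripts, stylesheets/links, and images on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("script", "img") and attr.get("src"):
            self.resources.append(urljoin(self.base_url, attr["src"]))
        elif tag == "link" and attr.get("href"):
            self.resources.append(urljoin(self.base_url, attr["href"]))

def blocked_resources(html, page_url, robots_txt, agent="Googlebot"):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    collector = ResourceCollector(page_url)
    collector.feed(html)
    host = urlsplit(page_url).netloc
    # Only same-host resources are covered by this robots.txt file.
    same_host = [url for url in collector.resources if urlsplit(url).netloc == host]
    return [url for url in same_host if not parser.can_fetch(agent, url)]

ROBOTS_TXT = "User-agent: *\nDisallow: /static/js/\n"
HTML = (
    '<script src="/static/js/app.js"></script>'
    '<link rel="stylesheet" href="/static/css/site.css">'
    '<img src="https://cdn.example.net/hero.png">'
)
print(blocked_resources(HTML, "https://www.example.com/products/", ROBOTS_TXT))
# ['https://www.example.com/static/js/app.js']
```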

Use robots.txt for Crawl Control, Not Security

robots.txt is publicly accessible.

Anyone can visit it and see which paths are being disallowed.

That means it should not contain sensitive paths that expose private business logic, confidential folders, or internal systems. It also should not be relied on to keep private information away from malicious crawlers.

Sensitive content should be protected with proper access control.

This may include password protection, authentication, server rules, firewall restrictions, role-based permissions, or removing the content from public access entirely.

Include the Sitemap

Adding the sitemap location helps search engines discover the site’s important URLs more efficiently.

The sitemap URL should be fully qualified, including the protocol and host. Larger websites may include multiple sitemap entries or point to a sitemap index.

Sitemaps Example

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog-sitemap.xml
Sitemap: https://www.example.com/product-sitemap.xml

This does not replace internal linking or good site architecture, but it gives crawlers another clear discovery signal.
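As a small sanity check on the fully qualified requirement, the sketch below pulls every Sitemap line out of a robots.txt file and reports whether each value includes a protocol and host. The function name and the sample file are assumptions for illustration.

```python
from urllib.parse import urlsplit

def sitemap_entries(robots_txt):
    """Return (url, is_fully_qualified) for each Sitemap line in the file."""
    results = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "sitemap":
            url = value.strip()
            parts = urlsplit(url)
            qualified = parts.scheme in ("http", "https") and bool(parts.netloc)
            results.append((url, qualified))
    return results

SAMPLE = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
Sitemap: /blog-sitemap.xml
"""

for url, qualified in sitemap_entries(SAMPLE):
    print(url, "ok" if qualified else "not fully qualified")
```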

Test Before Publishing

robots.txt changes should be tested before they go live.

A single broad rule can accidentally block a full section of the website, important assets, or even the entire site.

For example, this rule blocks the whole site from crawling:

User-agent: *
Disallow: /

That may be useful for some staging environments, but it can be disastrous on a production website.

Before publishing, check whether the new rules affect important pages, templates, assets, and sitemap URLs.

After publishing, monitor Google Search Console for crawl, indexing, and robots.txt-related issues.
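A pre-publish check can be as simple as a list of URLs that must stay crawlable, tested against the proposed file before it is deployed. The sketch below uses Python's standard urllib.robotparser, which matches plain path prefixes and does not interpret * wildcards inside paths, so it suits simple rule sets; the URL list and function name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt, must_crawl, agent="Googlebot"):
    """Return the URLs from must_crawl that the proposed rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_crawl if not parser.can_fetch(agent, url)]

MUST_CRAWL = [
    "https://www.example.com/",
    "https://www.example.com/products/blue-shoes",
    "https://www.example.com/sitemap.xml",
]

proposed = "User-agent: *\nDisallow: /admin/\nDisallow: /checkout/\n"
staging_leftover = "User-agent: *\nDisallow: /\n"

print(blocked_urls(proposed, MUST_CRAWL))               # []
print(len(blocked_urls(staging_leftover, MUST_CRAWL)))  # 3
```

Running a check like this in a deployment pipeline and failing the release when the list is not empty is one way to keep a staging rule from reaching production.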

Review After Site Changes

robots.txt should be reviewed whenever the website changes structurally.

This includes migrations, redesigns, CMS changes, staging-to-production launches, URL restructuring, ecommerce filter changes, booking engine updates, and new subdomains.

A rule that was safe in an old structure may become harmful after the site changes.

Technical SEO rules should not be left unattended after the architecture around them changes.

Example of a Well-Structured robots.txt File

A basic robots.txt file for a public website might look like this:

Well-Structured robots.txt Example

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /internal-search/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /assets/
Allow: /images/

Sitemap: https://www.example.com/sitemap.xml

This configuration blocks common low-value crawl paths while allowing important resources such as assets and images.

It also provides the sitemap location so search engines can discover important URLs more efficiently.

The exact rules should always depend on the website.

A small brochure site may need almost no restrictions. A large ecommerce, marketplace, booking, travel, or media website may need more deliberate crawl control.

Final Thoughts

robots.txt is one of the simplest files in a website’s technical architecture, but it can have a major impact on how search engines interact with a site.

When implemented correctly, it helps manage crawl access, reduce unnecessary crawler requests, and guide search engines away from low-value areas.

When implemented poorly, it can block important pages, hide rendering resources, confuse crawl paths, or create a false sense of indexing control.

In the broader context of technical SEO, robots.txt works alongside XML sitemaps, canonical tags, noindex, internal linking, structured data, and clean site architecture.

Each tool has a specific job.

robots.txt should be used for crawl control. noindex should be used for index control. Authentication should be used for security.

That separation is what keeps technical SEO clean, predictable, and safe.
