Robots.txt: what it is and how to use the file on your site

By Tiago Costa. Updated on July 2, 2026

Illustration of a robot in front of a website gate holding a list, with allowed and blocked signs, representing the robots.txt file.
Definition

Robots.txt is a plain text file, placed at the root of a domain (at yourdomain.com/robots.txt), that guides search engine robots on which parts of the site they may crawl. It usually contains:

  • a User-agent line naming the target robot;
  • Disallow rules for paths that should not be crawled;
  • Allow rules that open exceptions;
  • a Sitemap line pointing to the site map.

What the robots.txt file is

Robots.txt is a simple text file, saved at the root of the domain (at yourdomain.com/robots.txt), that tells search engine robots which parts of the site they can or cannot crawl. It is the front door that every well-behaved crawler checks before browsing the pages.

The file follows the Robots Exclusion Protocol (REP), a convention proposed in 1994, adopted by virtually every search engine, and formalized as RFC 9309 in 2022. The rules are public: anyone can open a site's robots.txt by typing the address in the browser.

It helps to understand its role right away: robots.txt controls crawling, not indexing. In other words, it says where the robot may go, but it is not the right tool to hide a page from search results.

How robots.txt works in practice

Whenever a search engine visits a site, the first thing it does is look for the file at /robots.txt. If the file exists, the robot reads the instructions and follows them; if it does not exist, the robot assumes it can crawl everything.

The rules are organized into blocks, each starting with a User-agent (the target robot) followed by Disallow and Allow lines. A typical example:

  • User-agent: * applies the rule to all robots;
  • Disallow: /admin/ asks them not to crawl the admin folder;
  • Allow: /admin/ajax.php opens an exception inside the blocked folder;
  • Sitemap: https://yourdomain.com/sitemap.xml points to the site map.

It is important to remember that these rules are guidance, not a lock. Legitimate robots like Googlebot obey them, but malicious robots can simply ignore the file.
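
The example rules can be checked with a short script. This is only a sketch that uses Python's standard urllib.robotparser module; the sample paths (/admin/settings.php, /blog/) and the helper name build_parser are assumptions made for illustration. One caveat: Python's parser has traditionally applied the first rule that matches, in file order, while Google picks the most specific rule, so the Allow line is listed first here to get the same answer under both behaviors.

```python
from urllib.robotparser import RobotFileParser

# The example rules, with Allow listed before Disallow so that a
# first-match parser and a most-specific-match parser agree.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/ajax.php
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
for path in ("/admin/settings.php", "/admin/ajax.php", "/blog/"):
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(path, verdict)

# site_maps() returns the URLs declared in Sitemap lines (Python 3.8+).
print(parser.site_maps())
```

A script like this only tells you what a rule-abiding robot would do. It says nothing about robots that ignore the file.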

Infographic of the anatomy of a robots.txt file showing the User-agent, Disallow, Allow and Sitemap directives stacked.
Anatomy of a robots.txt file: the directives stacked from top to bottom.

Syntax and main directives

Robots.txt accepts a small set of directives. Knowing each one avoids accidental blocks:

Directive | What it does
User-agent | Defines which robot the following rules apply to (use * for all).
Disallow | Marks a path the robot should not crawl.
Allow | Frees a specific path inside a blocked area.
Sitemap | Provides the URL of the XML sitemap to help page discovery.

There are technical limits to respect. According to the Google Search Central documentation, Googlebot processes only the first 500 kibibytes (512,000 bytes) of a robots.txt file and ignores anything after that limit. Nonstandard directives such as Crawl-delay, by the way, are not interpreted by Google.
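
The size limit is easy to check before publishing. The sketch below assumes the file is saved as UTF-8 and measures its size in bytes; the function names are hypothetical.

```python
# 500 KiB expressed in bytes: 500 * 1024 = 512,000.
GOOGLE_LIMIT_BYTES = 500 * 1024


def within_google_limit(robots_txt: str) -> bool:
    """True when the UTF-8 encoded file fits inside the 500 KiB limit."""
    return len(robots_txt.encode("utf-8")) <= GOOGLE_LIMIT_BYTES


def bytes_ignored(robots_txt: str) -> int:
    """Number of trailing bytes that fall beyond the limit (0 if none)."""
    return max(0, len(robots_txt.encode("utf-8")) - GOOGLE_LIMIT_BYTES)


small_file = "User-agent: *\nDisallow: /admin/\n"
print(within_google_limit(small_file))
print(bytes_ignored(small_file))
```

In practice only files with tens of thousands of rules get close to this size, so an oversized robots.txt is usually a sign that the rules should be consolidated.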

Robots.txt does not block indexing

This is the most expensive misunderstanding about robots.txt. Blocking a page with Disallow stops the robot from reading the content, but does not guarantee it stays out of Google. If other pages link to it, the search engine may index the URL even without crawling it, showing the result with no description.

Google Search Central's own documentation warns that robots.txt is not meant to keep a page out of search results. For that there is the noindex directive, applied on the page itself.

There is one more trap: if you block the page in robots.txt, Google cannot read the noindex tag inside it and the command is never obeyed. The practical rule to remove something from search is the opposite of the intuitive one: let the robot crawl and use noindex. Understanding the difference between crawling and indexing solves most of these cases.
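
The trap can be detected automatically. The sketch below is an illustration rather than a complete audit tool: it looks only for a meta tag named robots (not the HTTP header variant), it relies on Python's standard html.parser and urllib.robotparser modules, and the class name, function name, sample page and sample paths are all assumptions.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = (attributes.get("content") or "").lower()
        if "noindex" in [token.strip() for token in content.split(",")]:
            self.noindex = True


def removal_status(robots_txt: str, path: str, html: str) -> str:
    """Report whether a crawler that obeys robots.txt can see the noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    if not finder.noindex:
        return "no noindex on the page"
    if rules.can_fetch("Googlebot", path):
        return "noindex can be read"
    return "noindex is hidden by Disallow"


# Hypothetical page and rules, used only to show both outcomes.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
BLOCKING = "User-agent: *\nDisallow: /private/\n"
NOT_BLOCKING = "User-agent: *\nDisallow: /admin/\n"

print(removal_status(BLOCKING, "/private/page.html", PAGE))
print(removal_status(NOT_BLOCKING, "/private/page.html", PAGE))
```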

Illustration of a page locked by a padlock that still appears in search results with no description, showing that Disallow is not the same as noindex.

Where robots.txt lives and how to create it on WordPress

Robots.txt always lives in the root of the domain, reachable at yourdomain.com/robots.txt. There is no robots.txt per subfolder: each domain (and subdomain) has its own.
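
Because the location is fixed, the robots.txt address can be derived from any page URL. The helper below is a minimal sketch using Python's standard urllib.parse module; the function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Keep the scheme and host, replace everything else with /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A subfolder never changes the answer; a subdomain does.
print(robots_txt_url("https://yourdomain.com/blog/post?page=2"))
print(robots_txt_url("https://shop.yourdomain.com/cart/"))
```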

On WordPress, the system generates an automatic virtual file when there is no physical one. To take control, you can:

  • use an SEO plugin like Yoast or Rank Math, which offers a robots.txt editor right in the dashboard;
  • create a physical file named robots.txt in the root folder of the installation, via FTP or a file manager;
  • turn on the plugin's custom robots.txt option, which replaces the virtual file with your own rules.

Turning on the custom robots.txt means exactly that: swapping the default file generated by the platform for an editable version, where you manually define what to allow and what to block.

Common mistakes and best practices

Because it is a small and powerful file, robots.txt causes damage when misconfigured. The most frequent slips:

  • Blocking the whole site: a Disallow: / left over after development takes every page out of crawling.
  • Blocking CSS and JavaScript: blocking these files prevents Google from rendering the page properly, which can hurt how it evaluates the page.
  • Trusting robots.txt for privacy: since the file is public, listing sensitive folders in it ends up revealing the path.
  • Forgetting the sitemap: including the Sitemap line helps the search engine discover your URLs faster.

After any change, test the file in Google Search Console and treat robots.txt as part of your technical SEO routine. Good use also helps steer the crawl budget toward the pages that really matter.
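
Some of the slips listed above can be caught by a small check run before each deploy. The sketch below is a simplified, hypothetical linter, not a substitute for testing in Search Console: it reads the file line by line, flags a Disallow: / that applies to all robots, flags Disallow rules ending in .css or .js, and flags a missing Sitemap line. The function name and warning texts are assumptions.

```python
def lint_robots_txt(robots_txt: str) -> list[str]:
    """Return warnings for a few common robots.txt mistakes."""
    warnings = []
    agents = []            # user agents of the group being read
    in_rules = False       # True once the group has Allow/Disallow lines
    has_sitemap = False

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value)
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                if value == "/" and "*" in agents:
                    warnings.append("whole site blocked for all robots")
                if value.lower().rstrip("$").endswith((".css", ".js")):
                    warnings.append(f"CSS or JavaScript blocked: {value}")

    if not has_sitemap:
        warnings.append("no Sitemap line")
    return warnings


# Hypothetical file left over from a development environment.
leftover = "User-agent: *\nDisallow: /\nDisallow: /*.css$\n"
for warning in lint_robots_txt(leftover):
    print(warning)
```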

Frequently asked questions

What does turning on a custom robots.txt mean?

Turning on a custom robots.txt means replacing the automatic file the platform generates with an editable version you create. In plugins like Yoast and Rank Math, this option opens an editor where you manually define which areas of the site to allow and which to block.

Where is the robots.txt file on WordPress?

It lives in the root of the domain, at yourdomain.com/robots.txt. On WordPress, the file is virtual by default (generated by the system). To edit it, use an SEO plugin or create a physical robots.txt file in the root folder of the installation via FTP.

What is the main function of the robots.txt file for SEO?

To tell crawlers which parts of the site they can or cannot access, sparing them from crawling irrelevant pages and helping steer the crawl budget toward important content. It controls crawling, not indexing.

What does Googlebot do?

Googlebot is Google's crawler robot. It browses the web following links, reads each site's robots.txt to learn what it may access, and sends the allowed pages to be processed and, when relevant, indexed in the search results.

Related concepts

Crawler
A crawler is a robot program that travels the web from link to link, downloading and reading pages to feed a search engine's index. Also called a spider, robot or bot, the best known example is Googlebot. The crawler is the first stage of search: before a page can be indexed and ranked, it has to be found and read by one of these crawlers.

XML Sitemap
An XML sitemap is a file in XML format that lists the important URLs of a site to help search engines discover, crawl and prioritize those pages. It works as a site map handed to Google, telling it which addresses exist and, optionally, when they were updated, which is especially useful on large, new sites or ones with pages poorly connected by internal links.

Noindex
Noindex is a directive that tells search engines not to include a page in the search results. It is applied through a robots meta tag in the HTML or through an HTTP header, and it makes Google drop the page from the index even when other sites link to it. Unlike robots.txt, which blocks crawling, noindex requires the page to stay crawlable so the search engine can read the instruction.

Indexing
Indexing is the process by which a search engine adds a page to its index, the huge database it consults to answer queries. After crawling and analyzing the content, Google decides whether to store the page in the index, and only what is indexed can appear in the results. In SEO, ensuring indexing is the mandatory step before any attempt to rank: a page outside the index is, in practice, invisible to searchers.