robots.txt is a plain text file placed at the root of a website that provides instructions to search engine crawlers and other automated bots about which pages and directories they are allowed (or not allowed) to access. When a search engine bot like Googlebot arrives at your site, its first request is typically to https://yoursite.com/robots.txt to check for any crawling restrictions before proceeding.
The file uses a simple set of directives — User-agent, Allow, and Disallow — to communicate these rules. It’s one of the oldest standards on the web (dating to 1994), and while compliance is voluntary rather than enforced, all major search engines including Google, Bing, and others respect its directives as a matter of practice.
[Image: Screenshot of a typical robots.txt file showing User-agent, Disallow, and Sitemap directives]
How robots.txt Works
The file lives at the root of your domain and follows a specific syntax:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yoursite.com/sitemap.xml
Breaking down the key components:
- User-agent — Specifies which crawler the rules apply to. An asterisk (*) applies rules to all crawlers. Named agents (Googlebot, Bingbot) apply rules only to that specific bot.
- Disallow — Tells the specified crawler not to access a given path. Disallow: / blocks the entire site; Disallow: /private/ blocks just that directory.
- Allow — Overrides a Disallow rule for a specific path within a blocked directory. Most commonly used to allow access to specific files within otherwise-blocked folders.
- Sitemap — Points crawlers to your XML sitemap, making it easier for them to discover all your pages.
An important nuance: blocking a URL in robots.txt prevents crawlers from fetching the page, but it doesn’t prevent the URL from appearing in search results if other sites link to it. To prevent a page from being indexed, use the noindex directive directly on the page itself.
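The allow/disallow logic itself can be checked offline with Python's standard-library urllib.robotparser. A minimal sketch using the example file above (yoursite.com is a placeholder; note that Python's parser applies the first matching rule, so Allow is listed before Disallow here, whereas Google uses most-specific-match and is order-insensitive):

```python
from urllib import robotparser

# Rules mirroring the example above; yoursite.com is a placeholder domain.
# Allow precedes Disallow because urllib.robotparser is first-match-wins.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://yoursite.com/blog/post"))               # allowed
print(rp.can_fetch("*", "https://yoursite.com/wp-admin/options.php"))    # blocked
print(rp.can_fetch("*", "https://yoursite.com/wp-admin/admin-ajax.php")) # allowed
```

Remember that a `can_fetch` result of False only means a compliant crawler won't fetch the page; it says nothing about whether the URL can still be indexed.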
Purpose & Benefits
1. Managing Your Crawl Budget
Search engines allocate a limited amount of crawling resources to each site — a concept called crawl budget. For large sites with thousands of pages, a well-configured robots.txt file helps ensure crawlers spend their time on valuable, indexable content rather than wasting crawl budget on admin pages, duplicate content, search results pages, or internal utilities. This is particularly important for technical SEO on large e-commerce or content-heavy sites.
2. Keeping Internal and Administrative Areas Private
Areas like /wp-admin/, staging paths, internal search result URLs, and backend utility directories don’t need to be crawled. Blocking them in robots.txt keeps unnecessary content out of the crawl queue, reduces the risk of administrative URLs appearing in search results, and slightly reduces server load from crawler traffic on pages that add no SEO value.
3. Coordinating with Your XML Sitemap
Pointing crawlers to your sitemap via the Sitemap: directive in robots.txt is a clean way to give search engines a complete map of pages you want indexed. This combination — blocking pages you don’t want crawled and actively pointing to the pages you do — gives you meaningful control over how search engines navigate your site.
Examples
1. Standard WordPress robots.txt
A typical WordPress site configuration balances accessibility with security:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /?s=
Disallow: /search/
Sitemap: https://yoursite.com/sitemap.xml
This blocks the WordPress admin area (while allowing the AJAX endpoint used by plugins), blocks the wp-includes directory, and prevents search result pages from being crawled — a common source of duplicate content.
2. Blocking a Specific Crawler
Some site owners block aggressive bots that crawl frequently but provide no SEO or traffic value — including AI training crawlers that some publishers want to exclude:
User-agent: GPTBot
Disallow: /
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yoursite.com/sitemap.xml
Named user-agent rules can be stacked to give different permissions to different crawlers. Google, Bing, and legitimate crawlers proceed normally; the specified bot is blocked entirely.
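This per-crawler behavior can be verified offline with Python's urllib.robotparser; a sketch assuming the hypothetical rules above (placeholder domain):

```python
from urllib import robotparser

# Hypothetical rules matching the example above; yoursite.com is a placeholder.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot hits its named group and is blocked everywhere; Googlebot has no
# named group, so it falls back to the * rules and may crawl normal pages.
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))    # blocked
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/post")) # allowed
```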
3. E-Commerce Site with Internal Search and Faceted Navigation
Large e-commerce sites often generate thousands of near-duplicate URLs through faceted navigation (filtering products by color, size, price range, etc.) and internal search queries. Blocking these prevents crawl waste and duplicate content issues:
User-agent: *
Disallow: /search?
Disallow: /shop/?filter_
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Sitemap: https://yoursite.com/sitemap.xml
Cart, checkout, and account pages have no SEO value and should never be indexed.
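Disallow values match URL paths by prefix, which is why Disallow: /search? catches every internal search query while leaving other pages untouched. A quick check of a subset of the rules above with Python's urllib.robotparser (placeholder URLs):

```python
from urllib import robotparser

# A subset of the e-commerce rules above; yoursite.com is a placeholder.
rules = """\
User-agent: *
Disallow: /search?
Disallow: /cart/
Disallow: /checkout/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Prefix matching: every internal search query starts with /search?
print(rp.can_fetch("*", "https://yoursite.com/search?q=shoes"))       # blocked
print(rp.can_fetch("*", "https://yoursite.com/cart/"))                # blocked
print(rp.can_fetch("*", "https://yoursite.com/products/blue-shirt"))  # allowed
```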
Common Mistakes to Avoid
- Blocking CSS and JavaScript files — Search engines need to render your pages to evaluate them accurately, which means they need access to your stylesheets and JavaScript. Blocking these files in robots.txt prevents Google from understanding how your pages actually look, which can harm rankings.
- Confusing robots.txt with a noindex directive — Blocking a URL in robots.txt prevents crawling, not indexing. If external sites link to a blocked URL, Google can still discover and index it — just without seeing its content. Use the noindex meta tag on pages you want excluded from search results.
- Accidentally blocking the entire site — A single malformed Disallow: / directive under User-agent: * blocks all crawlers from the entire site. This mistake is more common than it sounds, especially when robots.txt files are edited manually. Always verify changes using Google Search Console's robots.txt report.
- Blocking your sitemap or key pages — Review your robots.txt against your sitemap to confirm that all pages you want indexed are not inadvertently blocked. Blocked sitemap pages remain uncrawled and unindexed regardless of the sitemap submission.
Best Practices
1. Keep the File Simple and Specific
A robots.txt file doesn’t need to be exhaustive. Block only the directories and URL patterns that genuinely provide no SEO value or that you don’t want crawled. Overly aggressive rules create risk of accidental blocking. When in doubt, it’s safer to allow crawling and use noindex on individual pages that shouldn’t appear in search results.
2. Always Include Your Sitemap Reference
Include the full URL of your XML sitemap in your robots.txt file:
Sitemap: https://yoursite.com/sitemap.xml
This helps search engines discover your sitemap even before you’ve submitted it through Google Search Console, and serves as a persistent pointer that remains useful as crawlers re-visit your site.
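Crawlers and SEO tooling can read this pointer directly; for example, Python's urllib.robotparser (3.8+) exposes any Sitemap lines it finds (placeholder domain):

```python
from urllib import robotparser

# Minimal file with a sitemap pointer; yoursite.com is a placeholder.
rules = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(rp.site_maps())
```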
3. Validate Changes Before Publishing
Any edit to robots.txt can have significant crawling consequences. Use Google Search Console’s URL Inspection tool and robots.txt report to verify your rules behave as intended before publishing. After making changes, monitor your crawl budget and indexing stats in Search Console to confirm the impact matches your expectations.
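Part of this validation can be automated with a small pre-publish script that parses the draft file and confirms a list of must-stay-crawlable URLs; a sketch with hypothetical rules and placeholder URLs:

```python
from urllib import robotparser

# Draft robots.txt to validate before publishing (hypothetical rules).
draft = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
"""

# URLs that must remain crawlable; adjust to match your own sitemap.
must_crawl = [
    "https://yoursite.com/",
    "https://yoursite.com/blog/",
    "https://yoursite.com/services/seo/",
]

rp = robotparser.RobotFileParser()
rp.parse(draft.splitlines())

blocked = [url for url in must_crawl if not rp.can_fetch("*", url)]
if blocked:
    print("Do not publish - these URLs would be blocked:", blocked)
else:
    print("All key URLs remain crawlable.")
```

Running a check like this in a deployment pipeline catches the "accidentally blocked the whole site" mistake before crawlers ever see the bad file.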
Frequently Asked Questions
Does robots.txt affect SEO?
Yes, indirectly. Blocking valuable pages prevents them from being crawled and indexed, which means they can’t rank. Blocking low-value pages can improve crawl budget efficiency on large sites. The most common SEO issue with robots.txt is accidentally blocking pages that should be indexed, which removes them from search results entirely.
Where is the robots.txt file located in WordPress?
On a WordPress site, robots.txt is generated dynamically by WordPress and served at yoursite.com/robots.txt. WordPress generates a simple default robots.txt that blocks /wp-admin/ while allowing /wp-admin/admin-ajax.php. To create a custom robots.txt file, you can use an SEO plugin like Yoast or Rank Math, which provides a visual editor, or upload a physical robots.txt file to your server’s root directory.
Can I block Google from my entire site?
Technically yes — Disallow: / under User-agent: Googlebot will prevent Googlebot from crawling your site. But remember: blocked pages can still be indexed if Google discovers them through external links. And completely blocking Googlebot means your site won’t appear in Google search results. This is only appropriate for sites that genuinely should not be publicly indexed (internal tools, staging sites, etc.).
Do all bots respect robots.txt?
Major search engines (Google, Bing, and their various named crawlers) respect robots.txt by convention. However, compliance is voluntary — there’s no technical enforcement. Malicious bots and scrapers that ignore the standard won’t be stopped by robots.txt. For those, server-level rate limiting, IP blocking, and security tools are more appropriate countermeasures.
How often do search engines check robots.txt?
Google typically re-fetches your robots.txt file every 24 hours, though the frequency can vary. Google caches the file and applies it to all crawl decisions during that cache period. If you make an urgent change (such as allowing a previously blocked section), you can request a re-crawl in Google Search Console to speed up the update.
How CyberOptik Can Help
A well-configured robots.txt is a quiet but important part of a healthy SEO foundation. Mistakes here can silently remove pages from search results for weeks before anyone notices. Our team audits robots.txt configurations as part of our technical SEO work, ensuring your site’s crawling rules support your indexing goals rather than working against them. Contact us for a free website review or learn more about our SEO services.


