
When people search for a local service, Google and Bing use crawlers to visit pages and decide what to show. Most small business owners focus on copy and keywords, but technical SEO for small business websites also includes the robots.txt file. This guide explains robots.txt for SEO, how a clean robots.txt setup can protect your crawl budget, and how to block pages from search engines that never need to appear in search results.
Robots.txt is not exciting, and it is not hard to set up, which is why it often gets missed. Yet a small mistake in that file can stop key pages being crawled. A sensible setup can also stop search engines wasting time on pages you never wanted in search results in the first place.
Summary
This guide explains what robots.txt does for SEO, what it does not do, and how a sensible setup helps search engines spend their crawl time on the pages that matter. It covers the difference between blocking crawling and blocking indexing, and why robots.txt is not a security tool or a way to hide private pages.
You’ll learn which URLs are often worth blocking on small business sites, like internal search results, filter and parameter URLs, staging areas, and low value utility pages that do not need crawling. It also highlights common mistakes that can wipe out visibility after a redesign, like blocking entire folders by accident or carrying old rules across to a new site structure.
The guide also shows how to test changes safely in Search Console and how to roll updates out with care. The aim is to treat robots.txt as a small, regular maintenance task, not something you only remember when rankings drop.
What is a robots.txt file for SEO?
A robots.txt file is a plain text file that sits in the root folder of your website. That means it should load at:
https://yourdomain.co.uk/robots.txt
Search engines look for it in that exact place. If it is missing, crawlers assume there are no special rules and they crawl what they can access. If it is present, it gives them guidance on what they should and should not crawl.
Google describes robots.txt as a tool that is mainly used to manage crawler traffic to your site. It can also be used to keep certain files off Google in some cases, depending on file type.
The key point is that it sets out rules for bots. It is not written for people. Most of the time, you only notice it when something goes wrong, like a service page dropping out of search results after a site update.
What robots.txt does not do
Robots.txt is not a security tool. It does not hide data. It does not protect private files. It does not stop somebody from visiting a URL in their browser if they already know it.
It also does not force all crawlers to obey your rules. Good crawlers do, but some do not. Google is clear on this point: robots.txt instructions cannot enforce crawler behaviour; it is up to the crawler to obey them. For private content, it is better to use proper protection such as password controls.
So, treat robots.txt like traffic signs. Useful. Worth having. Not a lock on the door.
Top Tip
“To check your current file, type yourdomain.co.uk/robots.txt into your browser. If it loads, you have one. If it shows a 404, you do not.”
When robots.txt is the right tool (and when it isn’t)
Robots.txt works best when you want to guide crawlers away from low value areas of a site. It is ideal for admin paths, internal search results, filters, and confirmation pages that do not help users coming from search.
It is not the right tool when your goal is to remove a page from Google, hide sensitive content, or fix duplication problems caused by poor site structure. In those cases, noindex tags, redirects, canonical tags, or proper access controls are usually better solutions.
A simple rule helps.
If a page should exist but not be crawled much, robots.txt may help.
If a page should not exist in search at all, robots.txt alone is rarely enough.
This distinction prevents many of the mistakes that cause accidental visibility loss.
Why robots.txt matters for SEO
For a small business site, robots.txt usually does three helpful things.
First, it helps search engines spend time on pages that matter. If your site has lots of thin pages, old test pages, or internal admin paths, you can point crawlers away from them so they focus on service pages, product pages, and useful content.
Second, it reduces the risk of the wrong pages being crawled, and helps limit how often low value URLs are discovered. This is common on sites with plugins, filters, internal search results, or old campaign URLs that still exist.
Third, it can prevent crawl waste on large sites. If you run a property portal, an ecommerce store, or a site with lots of filters and tags, you can accidentally create thousands of crawlable URLs. Robots.txt is one way to keep that under control.
Google refers to the limits of how much time and resource Googlebot can spend crawling a site as crawl budget. Even if you never use the phrase day to day, the idea still matters. You want bots spending their time on the pages you want people to find.
A quick warning before you change anything
A single line in robots.txt can block your whole website. That is not rare. It happens most often during redesigns and platform changes, when a staging rule is copied across to the live site.
If you are going to edit robots.txt, test it first and keep a copy of the old version so you can roll back quickly.
How robots.txt rules work in plain English
Robots.txt is built from groups of rules. Each group begins with a User-agent line, then one or more rules under it.
Here is the simplest possible structure:
User-agent: *
Disallow:
This tells all bots they can crawl everything, because the Disallow line is blank.
Now here is a simple block:
User-agent: *
Disallow: /admin/
This asks all bots not to crawl anything under /admin/.
The Robots Exclusion Protocol is formally specified in RFC 9309 on datatracker.ietf.org, which covers how these directives are interpreted, including placement, supported formats, and handling of errors.
The three directives you will use most
Most small business sites only need these directives:
User-agent
This names the crawler the rules apply to. * means all crawlers.
Disallow
This tells the crawler which path it should not crawl.
Allow
This creates an exception, allowing a specific URL inside an otherwise blocked section.
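As an illustration of how Allow creates an exception (the /downloads/ path here is hypothetical), a group like this blocks a folder while leaving one file inside it crawlable:

User-agent: *
Disallow: /downloads/
Allow: /downloads/brochure.pdf

Google resolves conflicts like this by applying the most specific (longest) matching rule, so the Allow line wins for that one file.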
Google’s robots.txt specification documentation covers the rules and how they are processed.
File format details that trip people up
A robots.txt file must be plain text. Google states it should be UTF-8 encoded, and it also enforces a file size limit of 500 KiB.
For most sites, that will never be a problem. Still, it matters if you have a messy file with hundreds of lines added over time. Keeping it tidy is part of keeping it safe.
Robots.txt for SEO: pages you should block and why
Common pages that usually do not belong in search
For many small business sites, these are the first areas worth reviewing:
- Admin and login areas
- Thank-you and confirmation pages
- Internal search results
- Filter and sort URLs that create duplicates
- Staging or development folders
You do not need to block everything listed here. The goal is to spot obvious crawl waste and deal with it deliberately.
Not every site needs robots.txt beyond the basics. A simple brochure site with five pages and a contact form might not see much difference.
But many small business sites grow in a messy way. A new plugin adds a new folder. A booking tool creates extra URLs. A theme update adds tag archives. Suddenly you have more crawlable URLs than you expected.
Here are the common areas worth reviewing.
Admin, login, and account areas
These pages are not meant for search results. They also tend to create thin content issues if indexed.
If you have paths like these, consider blocking them:
/wp-admin/
/login/
/my-account/
/checkout/
If you are on WordPress, there is a common exception you will see that allows admin-ajax.php, because it is needed for some front-end functions. That tends to look like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Do not copy that blindly. Check your own setup. If you are unsure, test the URL in Search Console and confirm nothing important relies on that file.
Thank-you pages and confirmation screens
A “thank you” page is useful for tracking conversions. It is rarely useful in search.
It also creates awkward search snippets if indexed. Imagine somebody searching for “kitchen fitter in Bristol” and landing on a thank-you page that says “Thanks, we have received your enquiry”. That is not a good experience.
Typical URLs include:
/thank-you/
/booking-confirmed/
/order-confirmation/
Blocking them can keep search results cleaner and reduce crawl waste.
Internal search results pages
Some sites generate URLs when people use the search box. For example:
/search?q=plumber
/search/?s=boiler+repair
These pages are often thin, can create duplicates, and can explode into thousands of combinations over time.
Many sites block them via robots.txt, or handle them with other methods depending on platform. It is worth checking what your CMS does, because some platforms handle this better than others.
Filters, sort pages, and parameter URLs
This is where crawl waste becomes a real issue on larger sites.
An ecommerce store might create URLs like:
/shoes?colour=black
/shoes?size=8
/shoes?colour=black&size=8&sort=price-asc
A directory site might create:
/properties?beds=2
/properties?postcode=E1
/properties?beds=2&postcode=E1&page=4
Each combination is a new URL. Crawlers can spend a lot of time on these, and it can distract them from your main category pages and product pages.
Robots.txt can help here, but you need to be careful. Some filtered pages can be valuable for search, especially if they have strong demand, like “black running shoes size 8”. So the right approach depends on your category size, your product range, and the search demand you can see in data.
A safe first step is to block obviously low value parameters, like tracking tags or endless sort orders, rather than blocking all parameters across the board.
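As a hedged sketch of what that looks like, Google and Bing support the * wildcard in rules (the base standard leaves wildcard support optional, so not every crawler honours it). The parameter names below are examples; check your own URL patterns first:

User-agent: *
Disallow: /*sort=
Disallow: /*utm_

This blocks any URL containing sort= or utm_ anywhere in the path or query, while leaving filter parameters like colour= crawlable.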
Staging sites and development folders
If you have a staging site that is accessible online, it should normally be blocked from crawling. Many developers add a full-site block like:
User-agent: *
Disallow: /
That makes sense on staging.
The danger is when that file ends up on the live site, or a staging rule is copied across by mistake. That single line can remove your visibility faster than almost any other SEO issue.
If you ever do a redesign, add “check robots.txt” to the launch checklist. It is one of the simplest checks and one of the most important.
How to create a robots.txt file
Here is a simple method that works for most small business owners.
Step 1: Start with what you have
Open your current robots.txt in a browser. If it exists, copy it into a plain text editor so you can review it properly.
If you do not have one, create a new file called robots.txt.
Step 2: Decide what needs blocking, and why
Do not start by blocking loads of folders. Start with a short list of pages you know are not meant for search, like admin pages and thank-you pages.
Then look at your analytics and Search Console data. If you see lots of weird URLs being crawled, it is a sign the crawler is spending time where it does not need to.
This is also where a quick crawl with a tool, or a look through server logs, can help. Even if you are not technical, you can often spot patterns just by scanning URL lists.
Step 3: Write rules in clean groups
A clean, basic file for a typical small business site might look like this:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /thank-you/
Sitemap: https://yourdomain.co.uk/sitemap.xml
Adding your sitemap can help crawlers find it easily.
Keep it short. Keep it clear. If you need something more complex later, you can build it up.
Step 4: Upload it to the root folder
Your robots.txt must live in the right place. Google explains that it must be placed in the top-level directory of a site for it to apply.
Most CMS platforms let you edit the file via a file manager, or via SEO plugins. If you are using FTP, upload it to the same folder where your homepage sits.
Step 5: Test before you treat it as finished
Google offers reporting and tools inside Search Console that show errors and warnings related to robots.txt.
Even a simple test, checking a few key URLs, can stop a nasty surprise later. You can use my robots.txt testing tool to see whether particular sections of your site are allowed or blocked by your robots.txt file.
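If you are comfortable with a little scripting, you can also check rules programmatically. This sketch uses Python's standard library robots.txt parser against the basic file shown in Step 3; the domain and paths are placeholders, and in practice you would point the parser at your live file rather than paste the rules in:

```python
from urllib.robotparser import RobotFileParser

# The basic small business file from Step 3 (placeholder paths)
rules = """
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /thank-you/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Key pages should be crawlable; blocked paths should not be
print(rp.can_fetch("*", "https://yourdomain.co.uk/"))           # expect True
print(rp.can_fetch("*", "https://yourdomain.co.uk/services/"))  # expect True
print(rp.can_fetch("*", "https://yourdomain.co.uk/thank-you/")) # expect False
```

Running a check like this before and after each edit gives you a quick, repeatable version of the five-URL routine described later in this guide.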
Top Tip
“Test your homepage, a main service page, and a recent blog post. If any of those are blocked, stop and fix the file before you publish changes.”
Crawl budget, and when you should care about it
If your site has 10 pages, crawl budget is not going to keep you awake at night.
If your site has 10,000 URLs, it starts to matter. Google describes crawl budget as the amount of time and resources Googlebot devotes to crawling a site. If a crawler spends hours going through parameter URLs and thin tag pages, it may crawl key pages less often, or take longer to discover updates.
Signs your site may have crawl waste
Here are a few practical signs:
You see a lot of strange URLs in Search Console, especially ones with parameters you did not plan for.
You publish new pages, but they take longer than expected to show in search results.
You have lots of near-duplicate pages, like tag archives, author pages, or internal search pages, and many of them are being crawled.
You have a shop, directory, or large blog and the number of URLs has grown far beyond the number of pages you actually care about.
Robots.txt is one part of the fix. It is not always the first part. Sometimes the better solution is canonical tags, noindex directives, or cleaning up URL generation at source. Still, robots.txt is a quick way to stop crawlers going down obvious rabbit holes.
Setting rules for specific bots, including AI crawlers
Robots.txt can target specific crawlers by name. That can be useful if you have a reason to treat bots differently.
A common example is letting major search engine bots crawl the full site, while limiting a crawler that is causing heavy load on your server. Bing’s documentation includes guidance on creating a robots.txt file and controlling what crawlers can access.
You may also see AI related crawlers and user agents in logs, such as GPTBot, ClaudeBot, and others. Some businesses choose to set clear rules for these, based on their own policies and comfort level.
A simple structure might look like this:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: GPTBot
Disallow: /private-resources/
For your reference, here are some of the more common crawler names as of 2025 used to crawl sites:
Google: Googlebot
Google Images: Googlebot-Image
Bing: Bingbot
Yahoo: Slurp
Yandex: YandexBot
DuckDuckGo: DuckDuckBot
OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User
Anthropic: ClaudeBot, AnthropicBot
Perplexity AI: PerplexityBot
Google AI: Google-Extended, GoogleOther, Google-CloudVertexBot
Meta AI: Meta-ExternalAgent, FacebookBot
Amazon AI: Amazonbot, NovaAct
ByteDance (parent of TikTok): Bytespider
Apple: Applebot
Be careful not to block parts of the site you still want found in search. Also remember that robots.txt is a request, not a guaranteed block. Google notes that it cannot enforce crawler behaviour. Blocking AI crawlers does not remove your content from Google Search, but it may affect whether your site is used in AI-generated answers, summaries, or previews.
If your goal is data protection, robots.txt is not the right tool. Use proper access controls.
Common robots.txt mistakes that hurt SEO
Robots.txt problems are often spotted after rankings drop, traffic dips, or pages vanish from search results. Here are the mistakes I see most often, with the practical impact.
Mistake 1: Blocking the whole site by accident
This usually happens during a redesign.
A developer blocks crawling on staging with Disallow: /. The site goes live. The rule stays. Within days, important pages stop being crawled. Within weeks, visibility can fall away.
Fixing it is simple. The painful part is the delay. Crawlers still need time to return, recrawl, and rebuild trust in the site.
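A quick launch-time sanity check can catch this before the damage is done. The sketch below, using Python's standard library, simulates the leaked staging rule; the domain is a placeholder, and in a real check you would load the live file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of a live robots.txt after launch:
# the staging block that should never have gone live
live_rules = """
User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(live_rules)

# If a major crawler cannot fetch the homepage, the whole site is blocked
if not rp.can_fetch("Googlebot", "https://yourdomain.co.uk/"):
    print("WARNING: robots.txt blocks the entire site")
```

Adding a check like this to a launch checklist takes minutes and guards against the single most damaging robots.txt mistake.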
Mistake 2: Blocking CSS and JavaScript folders
Some older advice told people to block script folders to “save crawl budget”. That can backfire.
Search engines render pages. If you block resources needed to render the page properly, Google may not see the layout and content as intended. You can end up with indexing issues that look mysterious on the surface.
If you are blocking resource folders, do it on purpose, and check rendering in Search Console.
Mistake 3: Trying to hide private content
Some sites add things like:
Disallow: /customer-data/
Disallow: /invoices/
That is not safe. It is basically a signpost to sensitive URLs. Anyone can view your robots.txt file and see those paths listed.
If content should be private, protect it properly. Use logins, server rules, and correct permissions.
Mistake 4: Not updating robots.txt when the site changes
A site that has been online for five years often has a robots.txt file written for a site structure that no longer exists.
Maybe your blog moved from /news/ to /blog/. Maybe you changed booking systems. Maybe you moved from Magento to Shopify.
If you keep an old robots.txt file without reviewing it, you can block new sections by mistake, or fail to block new low value pages that have appeared.
Practical templates you can adapt
Always check your live pages are not blocked after you paste a template in. One wrong line can cause a visibility drop. These are starting points. Do not copy them without checking your own URL structure.
Template A: a basic small business site
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /thank-you/
Sitemap: https://yourdomain.co.uk/sitemap.xml
This is a simple setup for service businesses, like electricians, cleaners, and trades. It blocks common non-public paths and points to the sitemap.
Template B: WordPress service site with standard admin rules
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /thank-you/
Sitemap: https://yourdomain.co.uk/sitemap.xml
This is common on WordPress, but still needs a quick test. If your theme or plugins use other admin endpoints, check those too.
Template C: site with internal search pages
User-agent: *
Disallow: /search/
Disallow: /?s=
Disallow: /thank-you/
Disallow: /login/
Sitemap: https://yourdomain.co.uk/sitemap.xml
This aims at internal search results. The exact URL pattern depends on your platform, so confirm how your search URLs are created.
How to test robots.txt properly
Testing is not just about checking if the file exists. It is about checking what bots can do with it.
Google’s robots.txt report in Search Console shows which robots.txt files Google found for the top hosts on your site, along with warnings and errors. That is a good place to start if you suspect trouble.
A simple testing routine that works
Pick five URLs and test them:
Your homepage.
A main service page, like /boiler-repair-manchester/.
A location page, if you have them, like /seo-leeds/.
A recent blog post.
A URL you meant to block, like a thank-you page.
If any of the first four are blocked, fix it straight away. If the last one is not blocked and it should be, you can decide if it is a real problem or just a nice tidy-up.
After you publish changes
After edits, monitor the impact.
Watch Search Console for crawl issues and indexing changes.
Keep an eye on organic landing pages in analytics, especially pages that usually bring leads.
If you run a large site, look at crawl stats and server logs, because robots.txt changes can shift crawl patterns quickly.
Three practical tips you can apply this week
1: Audit what Google is wasting time on
Open Search Console and look for URLs that do not help your business. Think login pages, thin tags, filter combinations, and old campaign pages.
Make a short list and group them by folder or pattern. That makes it easier to block cleanly. It also helps you spot causes, like a plugin that is generating pages you never asked for.
If you only do one thing, do this. It stops you guessing.
2: Use robots.txt as a tidy-up tool, not a hiding tool
Robots.txt is at its best when it keeps crawlers focused. It is not a safe way to hide private content, and it is not a fix for poor site structure.
If you have private files, protect them properly. If you have duplicate pages, solve that at source where possible. Robots.txt can support those fixes, but it should not be the main line of defence.
This mindset keeps your rules clean and reduces the risk of blocking something you actually need.
3: Treat robots.txt like a live document
Most SEO work is steady, boring maintenance, and it pays off over time.
Any time you change themes, add major plugins, switch booking tools, or rebuild a section of the site, add “check robots.txt” to the checklist. A two minute review can prevent months of confusion later.
If you work with a developer or agency, ask them to confirm the robots.txt status at launch. It is a simple request that avoids a common mistake.
Frequently Asked Questions About Optimising Your Robots.txt File
Can robots.txt hurt my SEO if it’s wrong?
Yes. A single incorrect rule can block important pages or resources and reduce crawling over time. The risk is not theoretical; it commonly happens during redesigns and platform changes. That’s why testing and version control matter more than complexity.
What happens if I do not have a robots.txt file?
Search engines usually crawl what they can access, as long as there are no other blocks in place. For many small sites, that works fine and you may never notice a problem. The risk shows up when your site creates lots of low value URLs, because bots can spend time crawling pages you never wanted indexed. Adding a simple robots.txt file can help keep things tidy.
Can robots.txt remove a page from Google?
Robots.txt is mainly about crawling, not removal. If a page is already indexed, blocking crawling does not always remove it quickly, and it can even stop Google from seeing signals that would help it understand changes. For removals, you normally look at noindex, redirects, or Search Console removal tools, depending on the case. Robots.txt is still useful, but it is not a removal switch.
Should I block my entire site while it is being rebuilt?
Blocking the whole site on a private staging domain is common, and Disallow: / can be fine there. The bigger risk is accidentally leaving that rule on the live domain when you launch. If the rebuild is happening on the live domain, blocking everything can cause crawling to slow down and pages to drop over time. A better approach is usually to keep the key pages available and protect private work areas with logins or IP restrictions.
Do small local businesses need to worry about crawl budget?
For most local service sites, crawl budget is not the main issue. Still, crawl waste can creep in through plugins, filters, internal search pages, and tag archives. If you notice Search Console full of odd URLs, it is worth cleaning up. Google explains crawl budget as the time and resources Googlebot devotes to crawling a site, so it matters more as sites grow.
How often should I review my robots.txt file?
A quick review every few months is usually enough for a stable site. You should also review it after any big change, like a redesign, platform migration, new booking system, or major plugin change. Those are the moments when URL structures shift and hidden folders appear. Keeping robots.txt up to date reduces the chance of accidental blocks.
The Bottom Line
Robots.txt is a small file with a big job. It helps search engines crawl your site in a sensible way, so the pages that bring in real enquiries are easier to find and revisit. For a local business, that often means keeping bots away from admin areas, login screens, internal search pages, and confirmation URLs.
It is also worth treating it as part of normal site maintenance. A quick check after site changes, plus basic testing in Search Console, can prevent the kind of mistakes that quietly damage visibility over time.
If you’re ready to improve visibility and attract more local customers, get in touch to build a tailored SEO strategy for your business.





