How to Create robots.txt: The No‑Fluff 2026 Guide

✍️ Alex Rivera 📅 May 19, 2026 ⏱️ 9 min read

Illustration: robots.txt directives control search engine access — SMARTCHAINE dashboard ready.

You don’t need developer wizardry to create a robots.txt file. But one misplaced slash? That can block your entire site from Google. In this guide, I’ll walk you through exactly how to build, test, and deploy robots.txt for modern SEO — no fluff, just real-world tactics that work for AI Overviews and classic search alike.

What robots.txt actually does (and what it can’t fix)

The robots.txt file lives at the root of your domain (yoursite.com/robots.txt). It tells compliant crawlers like Googlebot which pages or folders they may or may not crawl. But here’s the nuance: it’s a request, not a firewall. If other sites link to a blocked page, Google might still index its URL (without content). And bad bots ignore the file entirely.

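✅ Quick summary: robots.txt = crawl control, not indexation. Use it to manage server load, keep crawlers out of staging areas, or block useless pages (like internal search results). The file itself is public, so it never hides anything. For sensitive data, use authentication. To keep a page out of the index, use a noindex meta tag or X-Robots-Tag header on a page that crawlers can still fetch.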
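To make “crawl control” concrete, here is a minimal sketch of how a compliant bot decides whether it may fetch a URL, using Python’s standard urllib.robotparser. The /staging/ rule, the ExampleBot name, and the may_crawl helper are my own placeholders, not anything from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every bot out of a staging folder.
RULES = """\
User-agent: *
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_crawl(url, agent="ExampleBot"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(may_crawl("https://yoursite.com/blog/"))
    print(may_crawl("https://yoursite.com/staging/index.html"))
```

Notice that nothing here stops a bot from requesting the staging URL anyway. The check only works because the bot chooses to run it.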

How to create robots.txt — practical walkthrough

Creating the file takes 3 minutes. Let’s do it the right way:

🛠️ Step‑by‑step (works on any site)

Open a plain text editor (Notepad, VS Code, or even TextEdit in plain mode).
Start with the default directive: User-agent: * (applies to all respectful bots).
Add Disallow: to block paths. Example: Disallow: /private/ blocks the /private/ folder.
Optionally add Allow: to override a disallow for sub-paths (more on that later).
Include your sitemap: Sitemap: https://smartchaine.cloud/sitemap.xml — helps Google discover pages.
Save the file as robots.txt (all lowercase, UTF-8 plain text). Check that your editor hasn’t appended a hidden second extension, such as robots.txt.txt.
Upload to your site’s root directory using FTP, cPanel, or your hosting file manager.
Verify by visiting https://yourdomain.com/robots.txt in your browser.
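If you’d rather script the steps above than type them by hand, something like this sketch would do it. The function names (build_robots_txt, write_robots_txt) are hypothetical helpers I made up for illustration, and the upload step is left to your usual FTP or hosting workflow.

```python
from pathlib import Path


def build_robots_txt(disallow, allow=(), sitemap=None, user_agent="*"):
    """Assemble one rule group plus an optional Sitemap line."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    if sitemap:
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


def write_robots_txt(content, directory="."):
    """Save the content as robots.txt (lowercase name, UTF-8 plain text)."""
    target = Path(directory) / "robots.txt"
    target.write_text(content, encoding="utf-8")
    return target


content = build_robots_txt(
    disallow=["/private/"],
    sitemap="https://smartchaine.cloud/sitemap.xml",
)

if __name__ == "__main__":
    print(content)
    # write_robots_txt(content)  # uncomment to create the file locally
```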

💡 Expert note: WordPress? Use Yoast SEO or RankMath — they generate robots.txt dynamically. But I still recommend manual control for custom setups. And never block your CSS/JS files accidentally — that can break Google’s rendering.

Robots.txt syntax: your cheat sheet

Get these rules right, and you’re 90% there. The rest is testing.

User-agent: Googlebot — target a specific crawler. Use * for all.
Disallow: /admin/ — blocks access to the /admin/ folder.
Allow: /public/ — allows a subfolder even if parent is disallowed.
Sitemap: https://... — absolute URL to XML sitemap.
# comment — add notes for your team.

📝 Real example (clean and safe):

User-agent: *
Disallow: /internal/
Disallow: /search-results?
Allow: /internal/landing-page/
Sitemap: https://smartchaine.cloud/sitemap_index.xml

🔍 This blocks the whole “/internal/” folder except a specific landing page, plus blocks all search result URLs with parameters — smart for crawl budget.
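Why does the Allow line win for the landing page? Google picks the most specific (longest) matching rule, and Allow wins a tie. Below is a rough sketch of that precedence logic using plain prefix matching only. The is_allowed function and RULES list are my own illustration, not Google’s parser, and wildcards are deliberately left out.

```python
def is_allowed(path, rules):
    """Longest matching rule wins; on a tie, Allow beats Disallow.

    `rules` is a list of (directive, prefix) pairs. A path that no rule
    matches is allowed by default.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and is_allow
            if longer or tie_for_allow:
                best_len, allowed = len(prefix), is_allow
    return allowed


# The rules from the example file above.
RULES = [
    ("disallow", "/internal/"),
    ("disallow", "/search-results?"),
    ("allow", "/internal/landing-page/"),
]

if __name__ == "__main__":
    for path in ("/internal/reports/", "/internal/landing-page/",
                 "/search-results?q=shoes", "/blog/"):
        print(path, is_allowed(path, RULES))
```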

4 painful robots.txt mistakes (and how to avoid them)

I’ve audited over 200 websites. Here’s what routinely breaks:

Disallow: / → blocks your entire site. One slash, disaster. Use “Disallow: ” (empty) to allow everything.
Case sensitivity: “Disallow: /Images/” won’t block “/images/”. Path matching is case-sensitive, so the rule must use the URL’s exact casing.
Missing sitemap declaration: not an error in itself, but it can slow down discovery of new pages.
Blocking CSS/JS files — Google can’t render your page properly, harming indexing.
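These four mistakes are easy to check for automatically. Here’s a small lint sketch under simplifying assumptions: it only tracks the most recent User-agent line (so grouped user-agents aren’t handled), and the lint_robots_txt name and warning texts are mine, not from any existing tool.

```python
def lint_robots_txt(text):
    """Return a list of warnings for the four common mistakes."""
    warnings = []
    agent = None
    has_sitemap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agent = value
        elif field == "sitemap":
            has_sitemap = True
        elif field == "disallow":
            if value == "/" and agent == "*":
                warnings.append("Disallow: / under User-agent: * blocks the whole site")
            if value != value.lower():
                warnings.append(f"{value}: paths are case-sensitive, check the casing")
            if value.lower().rstrip("$").endswith((".css", ".js")):
                warnings.append(f"{value}: blocking CSS/JS can break rendering")
    if not has_sitemap:
        warnings.append("no Sitemap line found")
    return warnings


if __name__ == "__main__":
    risky = "User-agent: *\nDisallow: /\nDisallow: /Images/\nDisallow: /*.css$\n"
    for warning in lint_robots_txt(risky):
        print("-", warning)
```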

🧠 My take: Most people don’t need a complex robots.txt. Keep it lean. Block only what wastes crawl budget (e.g., faceted navigation, user carts, staging copies).

Don’t guess: test your robots.txt with real tools

Before you break production, run these checks:

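Google Search Console → robots.txt report (under Settings). It replaced the legacy “robots.txt Tester”, which Google retired in 2023, and shows the fetched file, its fetch status, and any parsing warnings.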
curl command: curl -I https://yourdomain.com/robots.txt → ensure HTTP 200 response.
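URL Inspection live test: lists page resources that could not be loaded, including those blocked by robots.txt. (Google retired the standalone Mobile-Friendly Test in 2023.)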

⚡ Pro tip: Google generally caches robots.txt for up to 24 hours, so changes are not picked up instantly. After a critical update, request a recrawl of the file from the robots.txt report in Search Console.
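The curl check can also be scripted. This is a minimal sketch using only the standard library: fetch needs network access and isn’t called automatically, while diagnose is a pure function you can run on any status and body. Both names and the three checks are my own choices, so treat them as a starting point rather than a complete validator.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status, body) for `url`. Requires network access."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def diagnose(status, body):
    """Flag the most common deployment problems for a robots.txt response."""
    if status != 200:
        return f"unexpected HTTP status {status}"
    if "<html" in body.lower():
        return "got an HTML page instead of plain-text rules"
    if "user-agent" not in body.lower():
        return "no User-agent line found"
    return "ok"


if __name__ == "__main__":
    # Replace with your own domain before running.
    print(diagnose(*fetch("https://yourdomain.com/robots.txt")))
```

The HTML check is there because some servers answer a missing robots.txt with a 200 and a custom error page, which a status check alone would miss.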

Advanced tactics: wildcards, parameter blocking, and managing crawl budget

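Google supports two wildcards: * matches any sequence of characters, and $ anchors the end of the URL. For instance, Disallow: /*?sort= blocks any URL containing “?sort=”, which is great for e‑commerce filters. Note that it won’t catch “&sort=” when sort isn’t the first parameter, so add a second rule, Disallow: /*&sort=, to cover that case. And if you have thousands of low-value pages, robots.txt is your first line of defense for crawl budget efficiency.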
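If you want to sanity-check a wildcard rule before shipping it, you can translate it into a regular expression. The sketch below assumes the two wildcards described above (* for any characters, a trailing $ for end of URL) and nothing else. pattern_to_regex and matches are hypothetical helper names, and real crawlers may differ in edge cases such as percent-encoding.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, url_path):
    """True if the rule pattern matches the path (plus query string)."""
    return pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    print(matches("/*?sort=", "/shoes?sort=price"))
    print(matches("/*?sort=", "/shoes?color=red&sort=price"))
    print(matches("/*.pdf$", "/files/guide.pdf"))
```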

🧩 Use case: large blog with tag archives
Disallow: /tag/
Disallow: /author/
✅ Keeps Google focused on money content, not thin tag pages.

robots.txt: your most pressing questions

Will robots.txt remove pages from Google index?

No. It only blocks crawling. If a page is already indexed and you block it via robots.txt, Google may keep the URL in results but without a snippet. For actual removal, use a “noindex” meta tag and leave the page crawlable, because Google has to fetch the page to see the tag.

Can I use robots.txt to block AI bots like GPTBot?

Yes. Add a group with User-agent: GPTBot on one line and Disallow: / on the next to block OpenAI’s crawler. Each AI bot has its own user-agent token, so many SEOs now manage them in separate groups.
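Here’s a sketch of generating those groups and confirming the result with Python’s standard urllib.robotparser. GPTBot comes from the answer above; CCBot is an extra token I added as an example, so check each vendor’s documentation for the current name before relying on it. The ai_block_groups helper is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# GPTBot is from the article; CCBot is an assumed extra example.
AI_BOTS = ["GPTBot", "CCBot"]


def ai_block_groups(bots):
    """Build one 'block everything' group per bot token."""
    return "\n".join(f"User-agent: {bot}\nDisallow: /\n" for bot in bots)


# Everyone else stays fully allowed (an empty Disallow allows everything).
robots = ai_block_groups(AI_BOTS) + "\nUser-agent: *\nDisallow:\n"

parser = RobotFileParser()
parser.parse(robots.splitlines())

if __name__ == "__main__":
    print(robots)
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, parser.can_fetch(agent, "https://yourdomain.com/blog/"))
```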

How often does Google re-read robots.txt?

Typically within 24 hours, but can be longer. Use GSC to request a re-fetch after critical changes.

Alex Rivera
Senior Technical SEO Strategist @ SMARTCHAINE
10+ years crawling the web, former Google Search quality contributor. Alex helps SaaS and enterprise teams turn crawl insights into organic growth.

🐦 @alex_seo | 💼 linkedin/in/alexrivera
