Robots.txt & Sitemap Validator
Validate robots.txt syntax, test crawl rules, verify sitemap structure, and diagnose why URLs aren't being indexed.
What It Validates
- robots.txt syntax: Correct directives, proper formatting
- Crawl rules: Which bots can access which paths
- Sitemap references: Sitemap location specified correctly
- XML sitemaps: Valid structure, URL count, change frequencies
- URL accessibility: Test if specific URL is blocked
- Common errors: Wildcard mistakes, case sensitivity, disallow conflicts
Quick Validation
- Go to Robots & Sitemap Validator
- Enter domain (e.g., example.com)
- Tool fetches /robots.txt and /sitemap.xml automatically (a minimal script mirroring this step is sketched after this list)
- Review syntax errors, warnings, and recommendations
- Test specific URLs against crawl rules
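If you want to reproduce the fetch step outside the tool, a minimal sketch is shown below. It only checks that the two files respond; example.com is a placeholder domain and the status check is a rough availability test, not full validation.

```python
import urllib.error
import urllib.request

def fetch_status(url):
    """Return the HTTP status code for a URL, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:   # 4xx/5xx responses still carry a code
        return err.code
    except urllib.error.URLError:
        return 0

domain = "example.com"  # placeholder: substitute the domain you are checking
for path in ("/robots.txt", "/sitemap.xml"):
    status = fetch_status(f"https://{domain}{path}")
    print(f"{path}: {status or 'unreachable'}")
```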
Robots.txt Basics
Location: must be at the site root: https://example.com/robots.txt
Structure:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
Common Directives
| Directive | Purpose |
|---|---|
| User-agent | Which bot these rules apply to (* = all) |
| Disallow | Block access to paths (e.g., /admin/) |
| Allow | Override a Disallow for specific paths |
| Sitemap | Location of the XML sitemap (can appear multiple times) |
| Crawl-delay | Seconds to wait between requests (not supported by Google) |
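As a rough illustration of how these directives combine, the sketch below feeds the example file from the Structure section above to Python's standard-library parser. Note that urllib.robotparser follows the original robots.txt spec and ignores Google-specific wildcard extensions, so it is an approximation of how the tool evaluates rules.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() accepts an iterable of lines

# /admin/... and /private/... are disallowed; everything else is allowed.
for path in ("/admin/settings", "/private/report", "/public/page", "/blog/post"):
    verdict = "allowed" if rp.can_fetch("*", f"https://example.com{path}") else "blocked"
    print(path, "->", verdict)
```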
Common Mistakes & Fixes
1. Blocking Everything Accidentally
❌ BAD:
User-agent: *
Disallow: /
This blocks ALL pages from all search engines. Your site won't be indexed!
✅ FIX: Remove the disallow or make it specific:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
2. Incorrect Wildcard Usage
❌ WRONG:
Disallow: *.pdf          # Wildcards in robots.txt are not standard
✅ CORRECT:
User-agent: *
Disallow: /*.pdf$        # Supported by Google ($ = end of URL)
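To see what the $-anchored rule actually matches, here is a small illustration that translates a robots.txt pattern into a regular expression following Google's documented * and $ semantics. It is a sketch, not the validator's actual matcher, and Python's built-in robotparser does not implement these extensions.

```python
import re

def rule_to_regex(rule):
    """Convert a robots.txt path rule using Google's extensions into a regex:
    '*' matches any run of characters, a trailing '$' anchors the end of the URL."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

pdf_rule = rule_to_regex("/*.pdf$")
for path in ("/files/report.pdf", "/files/report.pdf?download=1", "/files/report.html"):
    print(path, "->", "blocked" if pdf_rule.match(path) else "allowed")
```

The query-string example shows why the $ anchor matters: /files/report.pdf is blocked, but /files/report.pdf?download=1 is not, because the URL no longer ends in .pdf.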
3. Typo in User-Agent
Spelling matters! The directive is User-agent (with a hyphen); Useragent will be ignored. Capitalization of the directive name itself is forgiving (User-Agent also works), but the paths in Allow and Disallow rules are case-sensitive, so /Admin/ and /admin/ are different rules.
Common user-agents:
- Googlebot: Google Search
- Googlebot-Image: Google Images
- Bingbot: Bing
- Slurp: Yahoo (defunct, but still found in old files)
- `*`: all bots
4. Missing Sitemap Directive
Always add sitemap location (helps Google discover it faster):
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml   # You can list multiple sitemaps
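On Python 3.8+ you can also confirm programmatically whether a live robots.txt declares any Sitemap lines; a minimal sketch, with example.com as a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()  # fetch and parse the live file

# site_maps() returns the list of Sitemap: URLs, or None if none are declared.
print(rp.site_maps() or "No Sitemap directive found")
```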
Sitemap.xml Validation
Basic structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2024-01-10</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
Sitemap Requirements
- Max 50,000 URLs per file
- Max 50MB per file (sitemaps may be gzipped, but the 50MB limit applies to the uncompressed size)
- All URLs must start with same protocol/domain
- Use absolute URLs (https://example.com/page, not /page)
- Dates in ISO 8601 format (YYYY-MM-DD)
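A rough sketch of how some of these requirements could be checked programmatically is shown below. It covers the URL limit, absolute same-host URLs, and lastmod format; the 50MB size check and protocol comparison are left out for brevity, and the function name is just illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_text, expected_host):
    """Return a list of problem descriptions for a sitemap document."""
    problems = []
    root = ET.fromstring(xml_text)
    urls = root.findall("sm:url", NS)
    if len(urls) > 50_000:
        problems.append(f"{len(urls)} URLs exceeds the 50,000-per-file limit")
    for url in urls:
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        parsed = urlparse(loc)
        if not parsed.scheme or parsed.netloc != expected_host:
            problems.append(f"not an absolute URL on {expected_host}: {loc!r}")
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            try:
                date.fromisoformat(lastmod[:10])  # accept full W3C datetimes by date prefix
            except ValueError:
                problems.append(f"invalid lastmod {lastmod!r} for {loc}")
    return problems
```

Called with the fetched XML text and the expected host, e.g. check_sitemap(text, "example.com"), it returns an empty list when everything passes.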
Sitemap Index (for large sites)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2024-01-14</lastmod>
</sitemap>
</sitemapindex>
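Each child sitemap can then be fetched and validated on its own. A minimal sketch of walking a sitemap index (the index URL is a placeholder):

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_url):
    """Return the <loc> values of every child sitemap listed in a sitemap index."""
    with urllib.request.urlopen(index_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

for child in child_sitemaps("https://example.com/sitemap.xml"):  # placeholder URL
    print(child)
```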
Testing Crawl Access
Use the validator to test whether a URL is blocked (a scripted equivalent follows these steps):
- Enter domain
- Click "Test URL Access"
- Enter specific URL (e.g., /products/item-123)
- Select bot (Googlebot, Bingbot, etc.)
- See if URL is allowed or disallowed
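The same test can be approximated with Python's standard-library parser, as long as the file avoids Google-only wildcard rules (which that parser ignores). The domain and paths below are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

for bot in ("Googlebot", "Bingbot"):
    for path in ("/products/item-123", "/admin/settings"):
        ok = rp.can_fetch(bot, f"https://example.com{path}")
        print(f"{'ALLOWED' if ok else 'BLOCKED'}: {path} for {bot}")
```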
Example Results
✅ ALLOWED: /products/item-123 for Googlebot
No matching Disallow rule found.
❌ BLOCKED: /admin/settings for Googlebot
Matched rule:
Disallow: /admin/
Best Practices
- ✅ Block admin/private areas: /admin/, /wp-admin/, /login
- ✅ Block duplicate content: ?sort= and ?filter= parameters
- ✅ Add the sitemap location even if it's submitted via Search Console
- ✅ Update lastmod dates in sitemap when pages change
- ✅ Use sitemap index for sites with 10,000+ pages
- ❌ Don't block CSS/JS files (Google needs them to render pages)
- ❌ Don't use robots.txt for sensitive data (files are still accessible directly)
- ❌ Don't block entire site unless intentionally deindexing
SEO Impact Warnings
The validator checks for (see the sketch after this list):
- Entire site blocked (Disallow: /)
- Homepage blocked (Disallow: /$)
- CSS/JS blocked (hurts mobile ranking)
- Sitemap missing or returns 404
- Sitemap URLs don't match domain
- Conflicting Allow/Disallow rules
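A scripted approximation of a few of these checks (whole-site and asset blocking, plus sitemap reachability) might look like the sketch below. The site URL and asset paths are placeholders, and the standard-library parser again ignores wildcard rules, so treat it as a rough check rather than the validator's logic.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

# Entire site / homepage blocked?
if not rp.can_fetch("Googlebot", f"{SITE}/"):
    print("WARNING: the homepage is blocked for Googlebot")

# CSS/JS blocked? (hypothetical asset paths; use real ones from your pages)
for asset in ("/assets/site.css", "/assets/app.js"):
    if not rp.can_fetch("Googlebot", f"{SITE}{asset}"):
        print(f"WARNING: {asset} is blocked; Google needs it to render pages")

# Sitemap missing or returning an error?
try:
    with urllib.request.urlopen(f"{SITE}/sitemap.xml", timeout=10):
        pass
except urllib.error.URLError as err:
    print(f"WARNING: /sitemap.xml could not be fetched ({err})")
```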
Common Scenarios
Staging Site (Don't Index)
User-agent: *
Disallow: /
E-commerce Site
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort= # Block sorted listings
Allow: /products/
Sitemap: https://shop.example.com/sitemap.xml
Blog with Admin
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php
Sitemap: https://blog.example.com/sitemap.xml