
Robots.txt & Sitemap Validator

Validate robots.txt syntax, test crawl rules, verify sitemap structure, and diagnose why URLs aren't being indexed.

What It Validates

Quick Validation

  1. Go to Robots & Sitemap Validator
  2. Enter domain (e.g., example.com)
  3. Tool fetches /robots.txt and /sitemap.xml automatically
  4. Review syntax errors, warnings, and recommendations
  5. Test specific URLs against crawl rules
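Step 3 can be approximated outside the tool. The sketch below is not the validator's own implementation; it only assumes the two files live at their conventional root locations. The helper names `default_locations` and `fetch_text` are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def default_locations(domain: str) -> dict:
    """Build the conventional robots.txt and sitemap.xml URLs for a domain."""
    # Accept a bare host ("example.com") as well as a full URL.
    if "://" not in domain:
        domain = "https://" + domain
    parts = urlsplit(domain)
    origin = f"{parts.scheme}://{parts.netloc}"
    return {
        "robots": origin + "/robots.txt",
        "sitemap": origin + "/sitemap.xml",
    }


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a URL and decode it as UTF-8 (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(default_locations("example.com"))
```

Calling `fetch_text(default_locations("example.com")["robots"])` would then return the raw robots.txt body for further checks.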

Robots.txt Basics

Location: Must be at root: https://example.com/robots.txt

Structure:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Sitemap: https://example.com/sitemap.xml
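To see how a crawler might read the structure above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `ExampleBot` and the tested paths are made up for illustration. Note that this parser only does plain prefix matching and applies the first matching rule, so it can differ from Google's behavior on more complex files.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the * group.
for path in ("/admin/settings", "/private/report", "/public/index.html", "/blog/"):
    url = "https://example.com" + path
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(f"{path}: {verdict}")
```

Paths with no matching rule, such as /blog/ here, are allowed by default.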

Common Directives

Directive     Purpose
User-agent    Which bot these rules apply to (* = all)
Disallow      Block access to paths (e.g., /admin/)
Allow         Override Disallow for specific paths
Sitemap       Location of XML sitemap (can have multiple)
Crawl-delay   Seconds to wait between requests (not supported by Google)

Common Mistakes & Fixes

1. Blocking Everything Accidentally

❌ BAD:

User-agent: *
Disallow: /

This blocks all crawlers from every page on the site. Search engines cannot read your content, so your pages will not be indexed properly.

✅ FIX: Remove the disallow or make it specific:

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
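A quick automated guard against this mistake could look like the sketch below. It assumes that "the site root is not fetchable" is a good enough signal for a blanket block; `blocks_entire_site` is a hypothetical helper, not part of the validator.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given user agent may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Every path starts with "/", so a blocked root implies "Disallow: /".
    return not parser.can_fetch(user_agent, "/")


bad = "User-agent: *\nDisallow: /\n"
fixed = "User-agent: *\nDisallow: /admin/\nDisallow: /wp-admin/\n"
print(blocks_entire_site(bad), blocks_entire_site(fixed))
```

A check like this is cheap to run in a deploy pipeline so a staging robots.txt does not reach production unnoticed.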

2. Incorrect Wildcard Usage

❌ WRONG:

Disallow: *.pdf  # Invalid: path patterns must start with /

✅ CORRECT:

User-agent: *
Disallow: /*.pdf$  # Supported by Google ($ = end of URL)
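The wildcard semantics can be illustrated by translating a rule into a regular expression. This is a sketch of the matching behavior described above (`*` matches any run of characters, a trailing `$` anchors the end of the URL), not Google's actual parser; `rule_to_regex` and `rule_matches` are hypothetical names.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without '$' the rule is a prefix match, so only the start is anchored.
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path (including query string) matches the rule."""
    return rule_to_regex(pattern).match(path) is not None


print(rule_matches("/*.pdf$", "/docs/guide.pdf"))             # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf?download=1"))  # False
```

The second call returns False because `$` requires the URL to end right after `.pdf`.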

3. Typo in User-Agent

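Spelling matters! Major crawlers treat directive names as case-insensitive, so User-agent and User-Agent both work, but a misspelled directive such as Useragent is not recognized and the rules under it may be ignored. Path values, by contrast, are case-sensitive.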
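Misspelled directives are easy to catch mechanically. The sketch below compares directive names case-insensitively (which is how Python's `urllib.robotparser` and Google's parser read them) against the directives listed in this article; `find_unknown_directives` is a hypothetical helper.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def find_unknown_directives(robots_txt: str) -> list:
    """Return (line_number, directive) pairs for unrecognized directive names."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, _value = line.partition(":")
        # A line without ':' or with an unknown name is reported.
        if not sep or name.strip().lower() not in KNOWN_DIRECTIVES:
            problems.append((number, name.strip()))
    return problems


sample = "User-Agent: *\nUseragent: Googlebot\nDisallow: /admin/\n"
print(find_unknown_directives(sample))
```

For the sample above, only line 2 (`Useragent`) is reported.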

Common user-agents:

4. Missing Sitemap Directive

Always add the sitemap location (it helps search engines discover the sitemap faster):

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml  # Can list multiple
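Whether a robots.txt file declares any sitemaps can be checked with `RobotFileParser.site_maps()` from the standard library (available in Python 3.8 and later). This is a minimal sketch; the robots.txt content is the example from this section plus an assumed Disallow rule.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line is present.
sitemaps = parser.site_maps() or []
if not sitemaps:
    print("warning: robots.txt declares no sitemap")
for url in sitemaps:
    print("sitemap:", url)
```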

Sitemap.xml Validation

Basic structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
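A few structural checks on a sitemap like the one above can be sketched with the standard-library XML parser. The checks reflect the sitemaps.org protocol (absolute `<loc>` URLs, the fixed set of `<changefreq>` values, `<priority>` between 0.0 and 1.0); they are not necessarily the checks this validator runs, and `validate_urlset` is a hypothetical name. As a simplification, `<lastmod>` is only accepted when it starts with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlsplit

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def validate_urlset(xml_bytes: bytes) -> list:
    """Return a list of human-readable problems found in a <urlset> sitemap."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != NS + "urlset":
        return [f"unexpected root element: {root.tag}"]

    problems = []
    for index, url in enumerate(root.findall(NS + "url"), start=1):
        loc = (url.findtext(NS + "loc") or "").strip()
        parts = urlsplit(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"url #{index}: <loc> must be an absolute URL")

        lastmod = url.findtext(NS + "lastmod")
        if lastmod is not None:
            try:
                # Simplification: only the leading YYYY-MM-DD part is checked.
                date.fromisoformat(lastmod.strip()[:10])
            except ValueError:
                problems.append(f"url #{index}: invalid <lastmod> {lastmod!r}")

        changefreq = url.findtext(NS + "changefreq")
        if changefreq is not None and changefreq.strip() not in CHANGEFREQ:
            problems.append(f"url #{index}: invalid <changefreq> {changefreq!r}")

        priority = url.findtext(NS + "priority")
        if priority is not None:
            try:
                in_range = 0.0 <= float(priority) <= 1.0
            except ValueError:
                in_range = False
            if not in_range:
                problems.append(f"url #{index}: <priority> must be 0.0 to 1.0")
    return problems
```

An empty list means none of these particular checks failed; it does not prove the sitemap is fully valid.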

Sitemap Requirements

Sitemap Index (for large sites)

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-14</lastmod>
  </sitemap>
</sitemapindex>
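A crawler has to tell a sitemap index apart from a regular sitemap before following it. The sketch below does this by inspecting the root element; `list_locations` is a hypothetical helper and the sample is a shortened version of the index above.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>"""


def list_locations(xml_bytes: bytes) -> tuple:
    """Return (kind, locations), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_bytes)
    if root.tag == NS + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == NS + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")
    # Collect the <loc> text of every child entry, in document order.
    locations = [
        (entry.findtext(NS + "loc") or "").strip()
        for entry in root.findall(NS + child)
    ]
    return kind, locations


kind, locations = list_locations(INDEX_XML)
print(kind, locations)
```

When `kind` is "index", each returned location is itself a sitemap that needs to be fetched and checked in turn.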

Testing Crawl Access

Use the validator to test if a URL is blocked:

  1. Enter domain
  2. Click "Test URL Access"
  3. Enter specific URL (e.g., /products/item-123)
  4. Select bot (Googlebot, Bingbot, etc.)
  5. See if URL is allowed or disallowed

Example Results

✅ ALLOWED: /products/item-123 for Googlebot
No matching Disallow rule found.
❌ BLOCKED: /admin/settings for Googlebot
Matched rule: Disallow: /admin/
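Results like these can be reproduced with a small rule evaluator. The sketch below assumes longest-match precedence (the most specific matching rule wins, and Allow wins a tie), which is how Google documents its behavior; it uses plain prefix matching without wildcards, and the rule list and the names `check_access` and `report` are made up for illustration.

```python
def check_access(rules, path):
    """Return (allowed, matched_rule) using longest-match precedence.

    rules is a list of (directive, path_prefix) pairs for one user-agent group.
    """
    best = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        directive = directive.lower()
        longer = best is None or len(prefix) > len(best[1])
        tie_for_allow = (
            best is not None
            and len(prefix) == len(best[1])
            and directive == "allow"
        )
        if longer or tie_for_allow:
            best = (directive, prefix)
    if best is None:
        return True, None  # no rule matched, so the URL is allowed
    return best[0] == "allow", f"{best[0].capitalize()}: {best[1]}"


def report(rules, path, bot="Googlebot"):
    """Format the result in the same shape as the example output."""
    allowed, matched = check_access(rules, path)
    status = "ALLOWED" if allowed else "BLOCKED"
    detail = f"Matched rule: {matched}" if matched else "No matching Disallow rule found."
    return f"{status}: {path} for {bot}\n{detail}"


RULES = [
    ("Disallow", "/admin/"),
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]
print(report(RULES, "/products/item-123"))
print(report(RULES, "/admin/settings"))
```

With these rules, /wp-admin/admin-ajax.php is allowed because the Allow rule is longer than the Disallow rule that also matches it.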

Best Practices

SEO Impact Warnings

The validator checks for:

Common Scenarios

Staging Site (Block All Crawling)

User-agent: *
Disallow: /

E-commerce Site

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort=  # Block sorted listings
Allow: /products/

Sitemap: https://shop.example.com/sitemap.xml

Blog with Admin

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php

Sitemap: https://blog.example.com/sitemap.xml
