Robots.txt & Sitemap Validator
Validate robots.txt syntax, test crawl rules, verify sitemap structure, and diagnose why URLs aren't being indexed.
What It Validates
- robots.txt syntax: Correct directives, proper formatting
- Crawl rules: Which bots can access which paths
- Sitemap references: Sitemap location specified correctly
- XML sitemaps: Valid structure, URL count, change frequencies
- URL accessibility: Test if specific URL is blocked
- Common errors: Wildcard mistakes, case sensitivity, disallow conflicts
Quick Validation
- Go to Robots & Sitemap Validator
- Enter domain (e.g., example.com)
- Tool fetches /robots.txt and /sitemap.xml automatically (a minimal script mirroring this step is sketched after this list)
- Review syntax errors, warnings, and recommendations
- Test specific URLs against crawl rules
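If you want to reproduce the fetch step outside the tool, a minimal sketch is shown below. It only checks that the two files respond; example.com is a placeholder domain and the status check is a rough availability test, not full validation.

```python
import urllib.error
import urllib.request

def fetch_status(url):
    """Return the HTTP status code for a URL, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:   # 4xx/5xx responses still carry a code
        return err.code
    except urllib.error.URLError:
        return 0

domain = "example.com"  # placeholder: substitute the domain you are checking
for path in ("/robots.txt", "/sitemap.xml"):
    status = fetch_status(f"https://{domain}{path}")
    print(f"{path}: {status or 'unreachable'}")
```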
Robots.txt Basics
Location: must be at the site root: https://example.com/robots.txt
Structure:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
Common Directives
| Directive | Purpose |
|---|---|
| User-agent | Which bot these rules apply to (* = all) |
| Disallow | Block access to paths (e.g., /admin/) |
| Allow | Override a Disallow for specific paths |
| Sitemap | Location of the XML sitemap (can appear multiple times) |
| Crawl-delay | Seconds to wait between requests (not supported by Google) |
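As a rough illustration of how these directives combine, the sketch below feeds the example file from the Structure section above to Python's standard-library parser. Note that urllib.robotparser follows the original robots.txt spec and ignores Google-specific wildcard extensions, so it is an approximation of how the tool evaluates rules.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() accepts an iterable of lines

# /admin/... and /private/... are disallowed; everything else is allowed.
for path in ("/admin/settings", "/private/report", "/public/page", "/blog/post"):
    verdict = "allowed" if rp.can_fetch("*", f"https://example.com{path}") else "blocked"
    print(path, "->", verdict)
```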
Common Mistakes & Fixes
1. Blocking Everything Accidentally
❌ BAD:
User-agent: *
Disallow: /
This blocks ALL pages from all search engines. Your site won't be indexed!
✅ FIX: Remove the disallow or make it specific:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
2. Incorrect Wildcard Usage
❌ WRONG:
Disallow: *.pdf          # Wildcards in robots.txt are not standard
✅ CORRECT:
User-agent: *
Disallow: /*.pdf$        # Supported by Google ($ = end of URL)
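To see what the $-anchored rule actually matches, here is a small illustration that translates a robots.txt pattern into a regular expression following Google's documented * and $ semantics. It is a sketch, not the validator's actual matcher, and Python's built-in robotparser does not implement these extensions.

```python
import re

def rule_to_regex(rule):
    """Convert a robots.txt path rule using Google's extensions into a regex:
    '*' matches any run of characters, a trailing '$' anchors the end of the URL."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

pdf_rule = rule_to_regex("/*.pdf$")
for path in ("/files/report.pdf", "/files/report.pdf?download=1", "/files/report.html"):
    print(path, "->", "blocked" if pdf_rule.match(path) else "allowed")
```

The query-string example shows why the $ anchor matters: /files/report.pdf is blocked, but /files/report.pdf?download=1 is not, because the URL no longer ends in .pdf.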
3. Typo in User-Agent
Spelling matters! The directive is User-agent (with a hyphen); Useragent will be ignored. Capitalization of the directive name itself is forgiving (User-Agent also works), but the paths in Allow and Disallow rules are case-sensitive, so /Admin/ and /admin/ are different rules.
Common user-agents:
- Googlebot: Google Search
- Googlebot-Image: Google Images
- Bingbot: Bing
- Slurp: Yahoo (defunct, but still found in old files)
- `*`: all bots
4. Missing Sitemap Directive
Always add sitemap location (helps Google discover it faster):
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml   # You can list multiple sitemaps
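On Python 3.8+ you can also confirm programmatically whether a live robots.txt declares any Sitemap lines; a minimal sketch, with example.com as a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()  # fetch and parse the live file

# site_maps() returns the list of Sitemap: URLs, or None if none are declared.
print(rp.site_maps() or "No Sitemap directive found")
```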
Sitemap.xml Validation
Basic structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2024-01-10</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
Sitemap Requirements
- Max 50,000 URLs per file
- Max 50MB per file (sitemaps may be gzipped, but the 50MB limit applies to the uncompressed size)
- All URLs must start with same protocol/domain
- Use absolute URLs (https://example.com/page, not /page)
- Dates in ISO 8601 format (YYYY-MM-DD)
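A rough sketch of how some of these requirements could be checked programmatically is shown below. It covers the URL limit, absolute same-host URLs, and lastmod format; the 50MB size check and protocol comparison are left out for brevity, and the function name is just illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_text, expected_host):
    """Return a list of problem descriptions for a sitemap document."""
    problems = []
    root = ET.fromstring(xml_text)
    urls = root.findall("sm:url", NS)
    if len(urls) > 50_000:
        problems.append(f"{len(urls)} URLs exceeds the 50,000-per-file limit")
    for url in urls:
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        parsed = urlparse(loc)
        if not parsed.scheme or parsed.netloc != expected_host:
            problems.append(f"not an absolute URL on {expected_host}: {loc!r}")
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            try:
                date.fromisoformat(lastmod[:10])  # accept full W3C datetimes by date prefix
            except ValueError:
                problems.append(f"invalid lastmod {lastmod!r} for {loc}")
    return problems
```

Called with the fetched XML text and the expected host, e.g. check_sitemap(text, "example.com"), it returns an empty list when everything passes.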
Sitemap Index (for large sites)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2024-01-14</lastmod>
</sitemap>
</sitemapindex>
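Each child sitemap can then be fetched and validated on its own. A minimal sketch of walking a sitemap index (the index URL is a placeholder):

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_url):
    """Return the <loc> values of every child sitemap listed in a sitemap index."""
    with urllib.request.urlopen(index_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

for child in child_sitemaps("https://example.com/sitemap.xml"):  # placeholder URL
    print(child)
```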
Testing Crawl Access
Use the validator to test whether a URL is blocked (a scripted equivalent follows these steps):
- Enter domain
- Click "Test URL Access"
- Enter specific URL (e.g., /products/item-123)
- Select bot (Googlebot, Bingbot, etc.)
- See if URL is allowed or disallowed
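The same test can be approximated with Python's standard-library parser, as long as the file avoids Google-only wildcard rules (which that parser ignores). The domain and paths below are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

for bot in ("Googlebot", "Bingbot"):
    for path in ("/products/item-123", "/admin/settings"):
        ok = rp.can_fetch(bot, f"https://example.com{path}")
        print(f"{'ALLOWED' if ok else 'BLOCKED'}: {path} for {bot}")
```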
Example Results
✅ ALLOWED: /products/item-123 for Googlebot
No matching Disallow rule found.
❌ BLOCKED: /admin/settings for Googlebot
Matched rule:
Disallow: /admin/
Best Practices
- ✅ Block admin/private areas: /admin/, /wp-admin/, /login
- ✅ Block duplicate content: ?sort= and ?filter= parameters
- ✅ Add the sitemap location even if it's submitted via Search Console
- ✅ Update lastmod dates in sitemap when pages change
- ✅ Use sitemap index for sites with 10,000+ pages
- ❌ Don't block CSS/JS files (Google needs them to render pages)
- ❌ Don't use robots.txt for sensitive data (files are still accessible directly)
- ❌ Don't block entire site unless intentionally deindexing
SEO Impact Warnings
The validator checks for (see the sketch after this list):
- Entire site blocked (Disallow: /)
- Homepage blocked (Disallow: /$)
- CSS/JS blocked (hurts mobile ranking)
- Sitemap missing or returns 404
- Sitemap URLs don't match domain
- Conflicting Allow/Disallow rules
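A scripted approximation of a few of these checks (whole-site and asset blocking, plus sitemap reachability) might look like the sketch below. The site URL and asset paths are placeholders, and the standard-library parser again ignores wildcard rules, so treat it as a rough check rather than the validator's logic.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

# Entire site / homepage blocked?
if not rp.can_fetch("Googlebot", f"{SITE}/"):
    print("WARNING: the homepage is blocked for Googlebot")

# CSS/JS blocked? (hypothetical asset paths; use real ones from your pages)
for asset in ("/assets/site.css", "/assets/app.js"):
    if not rp.can_fetch("Googlebot", f"{SITE}{asset}"):
        print(f"WARNING: {asset} is blocked; Google needs it to render pages")

# Sitemap missing or returning an error?
try:
    with urllib.request.urlopen(f"{SITE}/sitemap.xml", timeout=10):
        pass
except urllib.error.URLError as err:
    print(f"WARNING: /sitemap.xml could not be fetched ({err})")
```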
Common Scenarios
Staging Site (Don't Index)
User-agent: *
Disallow: /
E-commerce Site
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort= # Block sorted listings
Allow: /products/
Sitemap: https://shop.example.com/sitemap.xml
Blog with Admin
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php
Sitemap: https://blog.example.com/sitemap.xml