Web Crawler & Bot Reference Guide
Understanding the different bots (also called “crawlers” or “spiders”) that visit your website is essential for interpreting your crawl data effectively. These automated programs serve various purposes, from search engine indexing to AI training and social media previews.
Key Insight for 2025: AI crawlers now make up over 50% of web traffic on many sites. Some collect data for AI training (which you may want to block), while others fetch content when users ask AI assistants questions (which can drive referral traffic to your site).
🔍 Search Engine Crawlers
These bots index your pages so they appear in search results. Blocking one of them can drop your pages from that engine's results.
Googlebot
Operator: Google
Detection Pattern: googlebot, google-structured-data-testing-tool
Purpose: Google’s primary web crawler that discovers and indexes content for Google Search. Operates in Desktop and Smartphone versions for mobile-first indexing.
Why Track: Directly impacts your Google Search visibility. Higher visit frequency indicates Google sees your content as important. Essential for SEO.
Recommendation: Always Allow
Bingbot
Operator: Microsoft
Detection Pattern: bingbot, msnbot, bingpreview
Purpose: Microsoft’s crawler for Bing search. Also powers Copilot features and Microsoft Edge suggestions. BingPreview generates page previews for search results.
Why Track: Essential for reaching Bing users. Powers Microsoft ecosystem integration including Edge, Cortana, and Office products.
Recommendation: Always Allow
DuckDuckBot
Operator: DuckDuckGo
Detection Pattern: duckduckbot, duckduckgo
Purpose: Crawls for DuckDuckGo’s privacy-focused search engine. No user tracking or data collection during crawling.
Why Track: Reach privacy-conscious users. Growing user base makes this increasingly important for traffic diversification.
Recommendation: Recommended
YandexBot
Operator: Yandex
Detection Pattern: yandexbot, yandex
Purpose: Russia’s leading search engine crawler. Optimized for Russian-language content and Cyrillic text processing.
Why Track: Essential for reaching Russian-speaking audiences and Eastern European markets.
Recommendation: Regional (Allow if targeting these markets)
Baiduspider
Operator: Baidu
Detection Pattern: baiduspider, baidu
Purpose: China’s dominant search engine crawler. Focuses on Chinese-language content with special handling for simplified Chinese characters.
Why Track: Crucial for visibility in China’s massive digital market. Required for Chinese audience reach.
Recommendation: Regional (Allow if targeting China)
Yahoo Slurp
Operator: Yahoo
Detection Pattern: slurp, yahoo
Purpose: Historically Yahoo’s primary crawler. Yahoo search now largely uses Bing’s index, but Slurp still handles specific Yahoo services.
Why Track: Some Yahoo properties still use independent crawling. Powers certain Yahoo-specific features.
Recommendation: Optional
Applebot
Operator: Apple
Detection Pattern: applebot
Purpose: Crawls for Apple’s ecosystem including Siri suggestions, Spotlight search, and Safari’s intelligent features. Separate from Applebot-Extended.
Why Track: Optimize for Siri voice search and iOS device integration. Important for Apple ecosystem users.
Recommendation: Recommended
Sogou Spider
Operator: Sogou (Tencent)
Detection Pattern: sogou
Purpose: Chinese search engine owned by Tencent. Popular in China as an alternative to Baidu.
Why Track: Additional reach in Chinese market. Integrated with Tencent’s ecosystem including WeChat.
Recommendation: Regional (Allow if targeting China)
PetalBot
Operator: Huawei
Detection Pattern: petalbot
Purpose: Huawei’s search engine crawler for Petal Search, used on Huawei devices without Google services.
Why Track: Important for reaching Huawei device users, particularly in markets where Google services are unavailable.
Recommendation: Regional
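The detection patterns listed above can be applied to a raw User-Agent header roughly as follows. This is a minimal sketch that assumes the patterns are meant as case-insensitive substrings; the function name `identify_search_bot` and the dictionary layout are illustrative, not part of any tool.

```python
from typing import Optional

# Detection patterns copied from the search engine entries above.
# Assumption: a pattern matches when it appears anywhere in the
# lowercased User-Agent string.
SEARCH_ENGINE_PATTERNS = {
    "Googlebot": ["googlebot", "google-structured-data-testing-tool"],
    "Bingbot": ["bingbot", "msnbot", "bingpreview"],
    "DuckDuckBot": ["duckduckbot", "duckduckgo"],
    "YandexBot": ["yandexbot", "yandex"],
    "Baiduspider": ["baiduspider", "baidu"],
    "Yahoo Slurp": ["slurp", "yahoo"],
    "Applebot": ["applebot"],
    "Sogou Spider": ["sogou"],
    "PetalBot": ["petalbot"],
}


def identify_search_bot(user_agent: str) -> Optional[str]:
    """Return the name of the first matching search crawler, or None."""
    ua = user_agent.lower()
    for bot, patterns in SEARCH_ENGINE_PATTERNS.items():
        if any(pattern in ua for pattern in patterns):
            return bot
    return None
```

A User-Agent header is easy to spoof, so a match here only says what a request claims to be.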
🤖 AI Training Crawlers
These bots collect content to train AI/LLM models. Blocking them will NOT affect your search rankings.
Important: You can block all AI training bots via robots.txt without any negative impact on your Google, Bing, or other search engine visibility. This is a personal/business choice about whether you want your content used for AI training.
GPTBot
Operator: OpenAI
Detection Pattern: gptbot
Purpose: OpenAI’s crawler that collects data to train GPT models (ChatGPT, GPT-4, etc.). The highest-volume AI crawler, accounting for approximately 30% of all AI bot traffic.
Why Track: Monitor AI training data collection. Highest volume means potential server impact. Respects robots.txt.
Recommendation: Your Choice (Safe to block)
Google-Extended
Operator: Google
Detection Pattern: google-extended
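Purpose: Controls whether Google may use your content to train its Gemini AI models. Google-Extended is a robots.txt control token rather than a separate crawler: fetching is done by Google's existing crawlers, so this name will not appear as a User-Agent in your access logs. Completely separate from Googlebot rules, so blocking it does NOT affect Google Search indexing.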
Why Track: Control whether Google uses your content for Gemini AI training. Safe to block without SEO impact.
Recommendation: Your Choice (Safe to block)
ClaudeBot
Operator: Anthropic
Detection Pattern: claudebot
Purpose: Anthropic’s crawler for training Claude AI models. Respects robots.txt directives.
Why Track: Monitor Anthropic’s data collection for Claude training. Medium traffic volume.
Recommendation: Your Choice (Safe to block)
Meta-ExternalAgent
Operator: Meta (Facebook)
Detection Pattern: meta-externalagent
Purpose: Meta’s crawler for training Llama AI models. Very high volume – accounts for approximately 19% of AI crawler traffic.
Why Track: One of the most aggressive AI training crawlers. High server load potential. Different from Facebook’s social preview bot.
Recommendation: Your Choice (Safe to block)
PerplexityBot
Operator: Perplexity AI
Detection Pattern: perplexitybot
Purpose: Crawls for Perplexity’s AI search engine. The fastest-growing AI crawler with 157,000%+ growth in 2024.
Why Track: Rapidly increasing presence. High growth trajectory means increasing server requests over time.
Recommendation: Your Choice (Safe to block)
CCBot
Operator: Common Crawl Foundation
Detection Pattern: ccbot
Purpose: Creates open datasets used by many AI companies and researchers for training models. Non-profit organization.
Why Track: Data ends up in many AI systems indirectly. Blocking prevents broad AI training data collection.
Recommendation: Your Choice (Safe to block)
Bytespider
Operator: ByteDance (TikTok)
Detection Pattern: bytespider
Purpose: TikTok parent company’s AI crawler. Activity has been declining recently.
Why Track: Track ByteDance data collection. Currently showing declining traffic patterns.
Recommendation: Your Choice (Safe to block)
Grok
Operator: xAI (Elon Musk)
Detection Pattern: grok
Purpose: Powers Grok AI chatbot with real-time web access. Has special integration with X (Twitter).
Why Track: Monitor xAI data collection. Low to medium traffic volume currently.
Recommendation: Your Choice (Safe to block)
Cohere-AI
Operator: Cohere
Detection Pattern: cohere-ai
Purpose: Enterprise AI company’s crawler for training language models focused on business applications.
Why Track: Track enterprise AI data collection. Lower volume than consumer AI crawlers.
Recommendation: Your Choice (Safe to block)
AI2Bot
Operator: Allen Institute for AI
Detection Pattern: ai2bot
Purpose: Non-profit AI research organization’s crawler. Data used for academic AI research.
Why Track: Academic/research focused. Generally lower impact than commercial AI crawlers.
Recommendation: Your Choice (Safe to block)
Diffbot
Operator: Diffbot
Detection Pattern: diffbot
Purpose: AI-powered web scraping and knowledge graph building. Used by enterprises for data extraction.
Why Track: Commercial data extraction service. May be accessing your content for client projects.
Recommendation: Your Choice (Safe to block)
Applebot-Extended
Operator: Apple
Detection Pattern: applebot-extended
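Purpose: Apple’s AI training opt-out token, separate from regular Applebot. It governs whether content already crawled by Applebot may be used for Apple Intelligence features; it is a robots.txt control and does not crawl pages itself.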
Why Track: Different from regular Applebot. Safe to block without affecting Siri or Spotlight.
Recommendation: Your Choice (Safe to block)
Amazonbot
Operator: Amazon
Detection Pattern: amazonbot
Purpose: Amazon’s crawler for Alexa answers and AI training. Used to improve Alexa’s knowledge base.
Why Track: Powers Alexa voice responses. Blocking may affect how Alexa answers questions about your content.
Recommendation: Your Choice
YouBot
Operator: You.com
Detection Pattern: youbot
Purpose: AI search engine You.com’s crawler for indexing and AI training.
Why Track: Smaller AI search player. Lower traffic volume.
Recommendation: Your Choice (Safe to block)
Omgilibot
Operator: Omgili/Webz.io
Detection Pattern: omgili
Purpose: Data collection crawler that powers news and social media monitoring services.
Why Track: Commercial data aggregation. Content may appear in news monitoring products.
Recommendation: Your Choice (Safe to block)
💬 AI Assistant Crawlers (User-Triggered)
These bots fetch pages when real users ask AI assistants questions. They can drive referral traffic when AI cites your content.
Recommendation: Consider ALLOWING these bots. Unlike training crawlers, these represent actual users seeking information. When AI assistants cite your content, users may click through to your site.
ChatGPT-User
Operator: OpenAI
Detection Pattern: chatgpt-user
Purpose: Fetches pages in real-time when ChatGPT users ask questions. Enables ChatGPT’s web browsing feature. Shows 2,825%+ growth in 2024.
Why Track: Your content may be directly quoted in ChatGPT responses with attribution. High referral potential.
Recommendation: Consider Allowing (Can drive traffic)
OAI-SearchBot
Operator: OpenAI
Detection Pattern: oai-searchbot
Purpose: OpenAI’s dedicated search crawler for ChatGPT’s search feature. Separate from GPTBot training crawler.
Why Track: Powers ChatGPT search results. Can drive traffic when users explore cited sources.
Recommendation: Consider Allowing (Can drive traffic)
Claude-User
Operator: Anthropic
Detection Pattern: claude-user
Purpose: Fetches web content when Claude users request current information. Enables Claude’s web search capabilities.
Why Track: Real users asking Claude questions. Potential for content citation and referral traffic.
Recommendation: Consider Allowing (Can drive traffic)
Perplexity-User
Operator: Perplexity AI
Detection Pattern: perplexity-user
Purpose: Fetches pages for Perplexity AI search answers. Note: Has been known to sometimes ignore robots.txt.
Why Track: High referral potential – Perplexity shows sources prominently. May not fully respect robots.txt.
Recommendation: Consider Allowing (High referral potential)
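To see how much of your AI bot traffic is training collection versus user-triggered fetching, the two pattern lists can be tallied separately. The sketch below assumes you have already extracted User-Agent strings from your logs; `categorize` and `tally` are hypothetical helper names, and matching is again assumed to be a case-insensitive substring test.

```python
from collections import Counter

# Detection patterns copied from the AI training and AI assistant entries.
AI_TRAINING = [
    "gptbot", "google-extended", "claudebot", "meta-externalagent",
    "perplexitybot", "ccbot", "bytespider", "grok", "cohere-ai", "ai2bot",
    "diffbot", "applebot-extended", "amazonbot", "youbot", "omgili",
]
AI_ASSISTANT = ["chatgpt-user", "oai-searchbot", "claude-user", "perplexity-user"]


def categorize(user_agent: str) -> str:
    """Label one User-Agent as 'assistant', 'training', or 'other'."""
    ua = user_agent.lower()
    # Check the user-triggered bots first so they are never counted as training.
    if any(pattern in ua for pattern in AI_ASSISTANT):
        return "assistant"
    if any(pattern in ua for pattern in AI_TRAINING):
        return "training"
    return "other"


def tally(user_agents) -> Counter:
    """Count requests per category for an iterable of User-Agent strings."""
    return Counter(categorize(ua) for ua in user_agents)
```

Calling `tally()` on a day's worth of User-Agent strings gives a quick ratio of training requests to assistant requests, which can inform the allow/block choices in these two sections.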
📱 Social Media Crawlers
These bots generate link previews when your URLs are shared on social media. Blocking them breaks your link previews.
facebookexternalhit
Operator: Meta
Detection Pattern: facebookexternalhit, facebot
Purpose: Generates link previews for Facebook, Instagram, Messenger, and WhatsApp. Reads Open Graph metadata for titles, descriptions, and images.
Why Track: Essential for social media marketing. Broken previews significantly reduce engagement and click-through rates.
Recommendation: Always Allow
Twitterbot
Operator: X (Twitter)
Detection Pattern: twitterbot
Purpose: Creates Twitter/X card previews when links are shared. Reads Twitter Card metadata.
Why Track: Essential for X/Twitter engagement. Card previews drive significantly more clicks than plain links.
Recommendation: Always Allow
LinkedInBot
Operator: LinkedIn
Detection Pattern: linkedinbot
Purpose: Generates link previews for LinkedIn posts and messages. Important for B2B and professional content.
Why Track: Critical for business and professional content sharing. LinkedIn is key for B2B marketing.
Recommendation: Always Allow
Pinterestbot
Operator: Pinterest
Detection Pattern: pinterest, pinterestbot
Purpose: Indexes images for Pinterest’s visual discovery platform. Enables Rich Pins with real-time information.
Why Track: Essential for visual content and e-commerce. Pinterest can drive significant referral traffic.
Recommendation: Always Allow
Slackbot
Operator: Slack (Salesforce)
Detection Pattern: slackbot
Purpose: Generates link previews in Slack workspaces when URLs are shared in channels or messages.
Why Track: Important for B2B. Links shared in work conversations benefit from rich previews.
Recommendation: Recommended
Discordbot
Operator: Discord
Detection Pattern: discordbot
Purpose: Creates link previews in Discord servers and direct messages.
Why Track: Important for community engagement. Discord has millions of active communities.
Recommendation: Recommended
TelegramBot
Operator: Telegram
Detection Pattern: telegrambot
Purpose: Generates link previews for Telegram chats and channels.
Why Track: Growing messaging platform. Important for international audiences.
Recommendation: Recommended
WhatsApp
Operator: Meta
Detection Pattern: whatsapp
Purpose: Creates link previews in WhatsApp messages. Part of Meta’s family of apps.
Why Track: Massive user base globally. Link previews important for sharing.
Recommendation: Always Allow
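Because these crawlers build previews from Open Graph and Twitter Card metadata, it can help to check that a page actually exposes those tags. The following is a rough sketch using only the standard library; the class name, function name, and the default list of required tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class PreviewTagParser(HTMLParser):
    """Collect og:* and twitter:* <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property="og:..."; Twitter Cards use name="twitter:...".
        key = attr.get("property") or attr.get("name") or ""
        if key.startswith(("og:", "twitter:")) and attr.get("content"):
            self.tags[key] = attr["content"]


def missing_preview_tags(html, required=("og:title", "og:description", "og:image")):
    """Return the required tags that the page does not define."""
    parser = PreviewTagParser()
    parser.feed(html)
    return [name for name in required if name not in parser.tags]
```

An empty list means the page defines every tag in `required`; any names returned are the ones a preview bot would not find.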
📊 SEO Tool Crawlers
These commercial bots gather data for SEO analysis tools. Generally safe to allow, but can be resource-intensive on some sites.
AhrefsBot
Operator: Ahrefs
Detection Pattern: ahrefsbot, ahrefs
Purpose: Builds Ahrefs’ comprehensive backlink database. Powers backlink analysis, keyword research, and site audits. Very high crawl volume.
Why Track: Influences your metrics in Ahrefs tools. Can be resource-intensive – one of the most active SEO crawlers.
Recommendation: Allow (Consider rate-limiting if needed)
SemrushBot
Operator: Semrush
Detection Pattern: semrushbot, semrush
Purpose: Gathers data for Semrush’s SEO platform including site audits, keyword tracking, and competitive analysis.
Why Track: Powers Semrush metrics and competitive intelligence. High crawl activity.
Recommendation: Allow (Consider rate-limiting if needed)
MJ12bot
Operator: Majestic
Detection Pattern: mj12bot, majestic
Purpose: Creates Majestic’s link index. Powers Trust Flow, Citation Flow, and other link authority metrics.
Why Track: Influences Trust Flow and Citation Flow scores. Medium crawl volume.
Recommendation: Allow
DotBot
Operator: Moz
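Detection Pattern: dotbot, moz, opensiteexplorer (match moz as a whole token only: as a bare case-insensitive substring it also matches the Mozilla/5.0 prefix found in most browser User-Agent strings)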
Purpose: Crawls for Moz’s Link Explorer. Calculates Domain Authority and Page Authority scores.
Why Track: Domain Authority is a widely used SEO metric. Lower crawl volume than Ahrefs/Semrush.
Recommendation: Allow
Rogerbot
Operator: Moz
Detection Pattern: rogerbot
Purpose: Another Moz crawler used for their SEO tools and research.
Why Track: Supplements DotBot data collection. Low to medium volume.
Recommendation: Allow
Screaming Frog
Operator: Screaming Frog Ltd
Detection Pattern: screaming frog
Purpose: Popular desktop SEO crawler. Often used by SEO professionals to audit sites.
Why Track: Usually indicates someone is analyzing your site for SEO purposes (could be your own team or competitors).
Recommendation: Allow
SEOkicks
Operator: SEOkicks
Detection Pattern: seokicks
Purpose: German SEO tool crawler for backlink analysis.
Why Track: Regional SEO tool. Lower volume.
Recommendation: Allow
SISTRIX
Operator: SISTRIX
Detection Pattern: sistrix
Purpose: European SEO tool crawler, particularly popular in Germany.
Why Track: Powers SISTRIX visibility index. Important for European markets.
Recommendation: Allow
SerpstatBot
Operator: Serpstat
Detection Pattern: serpstatbot
Purpose: SEO platform crawler for keyword research and site analysis.
Why Track: Growing SEO tool. Moderate crawl volume.
Recommendation: Allow
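Where an entry above suggests rate-limiting, one simple approach is a sliding-window counter per bot. This is only a sketch of the idea, not how any particular server or plugin does it; `BotRateLimiter` and its parameters are hypothetical, and a real deployment would more likely use the rate-limiting features of the web server or WAF.

```python
import time
from collections import defaultdict, deque


class BotRateLimiter:
    """Allow at most max_requests per bot within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # bot name -> timestamps of recent requests

    def allow(self, bot, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[bot]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # caller would typically answer with HTTP 429
        recent.append(now)
        return True
```

Each bot gets its own window, so throttling AhrefsBot does not slow SemrushBot or any other crawler.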
Quick robots.txt Reference
Here are ready-to-use robots.txt snippets for common scenarios:
Block AI Training Bots
Add this to your robots.txt to prevent AI companies from using your content for training:
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: PerplexityBot
Disallow: /
Allow AI Assistant Bots (for referral traffic)
These bots can send traffic to your site when AI cites your content:
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
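If you maintain these lists in code, the two snippets can be generated and sanity-checked with Python's standard `urllib.robotparser`. This is a minimal sketch: `build_robots_txt` and `can_fetch` are illustrative helper names, and `https://example.com/page` is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the two robots.txt snippets above.
BLOCKED = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot",
           "meta-externalagent", "PerplexityBot"]
ALLOWED = ["ChatGPT-User", "Claude-User", "Perplexity-User", "OAI-SearchBot"]


def build_robots_txt(blocked, allowed):
    """Emit one User-agent group per bot, separated by blank lines."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in blocked]
    groups += [f"User-agent: {name}\nAllow: /" for name in allowed]
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt, agent, url="https://example.com/page"):
    """Check whether the given agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, `can_fetch(rules, "GPTBot")` is false while `can_fetch(rules, "Googlebot")` stays true, because bots without a matching group are unrestricted.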
Note: Not all bots respect robots.txt. For stronger control, use server-level blocking via .htaccess rules or a Web Application Firewall (WAF).
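For sites served by a Python application rather than Apache, the same idea can be approximated in application code. The sketch below is a plain WSGI middleware, a stand-in for .htaccess or WAF rules rather than an equivalent; the class name and the example pattern list are assumptions.

```python
# Example patterns only; substitute the bots you have decided to block.
BLOCKED_PATTERNS = ("gptbot", "ccbot", "bytespider")


class BlockBotsMiddleware:
    """WSGI middleware that answers 403 to User-Agents matching a blocked pattern."""

    def __init__(self, app, blocked_patterns=BLOCKED_PATTERNS):
        self.app = app
        self.blocked = tuple(p.lower() for p in blocked_patterns)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(pattern in user_agent for pattern in self.blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI app as `BlockBotsMiddleware(app)` is enough to apply it. This only stops bots that send an honest User-Agent, so a WAF remains the stronger option for crawlers that disguise themselves.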
—
This reference is part of Linkilo’s Crawler Analyzer feature.