Automatically Validated Crawlers

Security Ninja includes a feature that automatically validates known crawlers, confirming that a request claiming to come from a crawler really does, rather than from a malicious bot impersonating one. This validation reduces false positives and ensures genuine crawlers can access your website without being blocked. By recognizing and allowing legitimate crawlers, we help maintain your site’s visibility on search engines and other essential services.

How Does It Work?

The code in Security Ninja validates the IP addresses of incoming requests against a list of known crawler hostnames. Here’s how the process works, step by step (a short code sketch follows the list):

  1. It first checks if the IP has already been validated. If the IP is found in the list of previously validated crawlers, it immediately returns true.
  2. If the IP has not been validated yet, the code performs a reverse DNS (PTR) lookup to retrieve the hostname associated with the IP address. This step translates the IP into a human-readable hostname.
  3. The retrieved hostname is then checked against a list of known valid crawler hostnames. This list includes popular and trusted search engines and crawlers.
  4. If a match is found, the function performs a forward DNS lookup on that hostname and verifies that it resolves back to the original IP address. This forward-confirmed reverse DNS check ensures the IP is legitimately associated with the trusted hostname and cannot be faked with a forged PTR record.
  5. If the lookups match, the IP is added to the list of validated crawlers. If not, the IP is marked as not belonging to a validated crawler, and the request continues through Security Ninja’s normal security checks.
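
To make these steps concrete, below is a minimal PHP sketch of the same flow. The function name, the in-memory cache, and the suffix list are illustrative assumptions rather than Security Ninja’s actual internals; the point is the forward-confirmed reverse DNS round-trip, done here with PHP’s built-in gethostbyaddr() and gethostbyname():

    <?php
    // Illustrative sketch only - the names and storage here are
    // assumptions, not Security Ninja's real API.
    function sn_is_validated_crawler( $ip, array $valid_suffixes, array &$validated ) {
        // Step 1: return immediately if this IP was validated earlier.
        if ( isset( $validated[ $ip ] ) ) {
            return true;
        }

        // Step 2: reverse DNS (PTR) lookup. gethostbyaddr() returns the
        // unmodified IP when no PTR record exists.
        $hostname = gethostbyaddr( $ip );
        if ( false === $hostname || $hostname === $ip ) {
            return false;
        }
        $hostname = strtolower( $hostname );

        // Step 3: compare the hostname against the known crawler
        // suffixes, e.g. '.googlebot.com' or '.search.msn.com'.
        $matched = false;
        foreach ( $valid_suffixes as $suffix ) {
            if ( substr( $hostname, -strlen( $suffix ) ) === $suffix ) {
                $matched = true;
                break;
            }
        }
        if ( ! $matched ) {
            return false;
        }

        // Step 4: forward DNS lookup - the hostname must resolve back to
        // the original IP, so a forged PTR record alone cannot pass.
        if ( gethostbyname( $hostname ) !== $ip ) {
            return false;
        }

        // Step 5: remember the IP so future requests skip the DNS lookups.
        $validated[ $ip ] = true;
        return true;
    }

A production implementation would persist the validated list (for example in an option or database table) instead of keeping it in memory, and would also handle IPv6 addresses and hostnames with multiple A records, but the DNS round-trip above is the core of the check.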

List of Automatically Validated Crawlers

Below is the list of crawlers that Security Ninja automatically validates:

  • .crawl.baidu.com – Baidu is a major Chinese search engine, and this domain is used for its web crawling activities.
  • .crawl.baidu.jp – Similar to the above, this is Baidu’s Japanese search engine crawler.
  • .search.msn.com – This domain is used by Microsoft’s Bing search engine to crawl the web and index pages.
  • .google.com – A domain used by Google’s various web crawlers, including those for indexing and other search-related tasks.
  • .googlebot.com – Specifically used by Googlebot, the web crawler for Google’s search engine.
  • .crawl.yahoo.net – Yahoo’s search engine crawler, used to index web pages for its search engine.
  • .yandex.ru – Yandex is a Russian search engine, and this domain is used for its web crawling activities.
  • .yandex.net – Another domain used by Yandex for its web crawling operations.
  • .yandex.com – Used by Yandex’s international web crawlers for indexing web pages outside of Russia.
  • .petalsearch.com – Petal Search is a search engine developed by Huawei, and this domain is used for its crawling activities.
  • applebot.apple.com – Applebot is Apple’s web crawler, used primarily for Siri and Spotlight Suggestions.
  • .ahrefs.com – Ahrefs is a popular SEO toolset, and this domain is used by its web crawler for indexing web pages.
  • .semrush.com – SEMrush is another SEO tool, and this domain is used for its web crawling activities to gather data.
  • .duckduckgo.com – DuckDuckGo is a privacy-focused search engine, and this domain is used by its web crawler.
  • facebookexternalhit.com – This domain is used by Facebook to scrape link previews when shared on its platform.
  • .commoncrawl.org – Common Crawl is a non-profit organization that crawls the web to build and maintain an open repository of web data.
  • .googleother.com – Used by various other Google crawlers that do not fall under the primary Googlebot domain.
  • .google-inspectiontool.com – Used by Google’s inspection tool for analyzing and crawling web pages.
  • .swiftype.com – Swiftype is an enterprise search solution, and this domain is used by its web crawler.
  • .sogou.com – Sogou is a Chinese search engine, and this domain is used for its web crawling activities.
  • .yahoo.com – Used by Yahoo’s international search engine crawlers for indexing web pages.
  • .bing.com – Bing is Microsoft’s search engine, and this domain is used by its web crawlers for indexing.

Benefits of Automatically Validated Crawlers

This feature ensures that legitimate crawlers, such as search engine bots, are not blocked by the firewall, allowing them to index your website properly. By validating these crawlers, we prevent unnecessary blocks and ensure that your site remains visible and accessible to search engines.

Troubleshooting

If you encounter issues with legitimate crawlers being blocked, first ensure that your Security Ninja plugin is up to date and that the crawler’s IP addresses resolve to one of the known hostnames listed above. If necessary, you can manually add the IPs to the whitelist.
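
To check the DNS round-trip yourself, you can run a small script like the one below. The sample address is only a placeholder (a widely documented Googlebot address); substitute the IP that is actually being blocked:

    <?php
    // Diagnostic: perform the same reverse/forward round-trip the
    // validation uses. Replace the sample IP with the blocked one.
    $ip       = '66.249.66.1';
    $hostname = gethostbyaddr( $ip ); // reverse (PTR) lookup
    if ( false === $hostname || $hostname === $ip ) {
        exit( "No PTR record for {$ip}\n" );
    }
    $forward = gethostbyname( $hostname ); // forward lookup
    echo "Hostname: {$hostname}\n";
    echo ( $forward === $ip ) ? "Round-trip OK\n" : "Round-trip mismatch: {$forward}\n";

If the hostname does not end in one of the suffixes listed above, or the forward lookup returns a different address, automatic validation cannot succeed and manual whitelisting is the right workaround.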

It’s important to note that not all search engines and crawler operators support this kind of automatic validation, so manual whitelisting may still be necessary in some cases. If you come across a crawler that does support validation but is not yet on our list, please contact our support so we can add it.

For more details on how to manage and troubleshoot crawler validation, visit our documentation on creating or updating database tables.
