Web crawling software.

How does robots.txt work?

The robots file is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and present that content to users. The REP also includes directives such as meta robots, as well as page-level, subdirectory-level, or site-wide instructions for how search engines should treat links (such as "nofollow" or "follow"). Robots.txt example: Below are some examples of robots.txt in action for a site. The URL of the robots file must be.
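Under the REP, the robots file sits at the fixed path /robots.txt at the root of the host. As a minimal sketch of that rule, the helper below builds the robots file URL for the host serving any given page. The function name `robots_url` and the example.com addresses are placeholders chosen for illustration, not something taken from a specific crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain or port), replace
    # the path with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder URLs: each host, including a subdomain, maps to its own file.
print(robots_url("https://www.example.com/shop/item?id=7"))
print(robots_url("https://blog.example.com/2020/post.html"))
```

Because the host name is kept as-is, a page on a subdomain resolves to a robots file on that subdomain rather than on the main domain.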

Quick notions to know about the robots file

Search engines have two main objectives: run the web crawler to discover content, and index that content so that it can be found by information seekers. To crawl websites, search engines follow links from one site to another, working through billions of links and sites. This crawling behavior is also known as "spidering". Once on a website, and before spidering it, crawlers look for a robots file. If there is one, they read it before going on to scan the rest of the site.
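The "read the robots file first, then crawl" order can be sketched with Python's standard `urllib.robotparser` module. This is an illustrative outline, not how any particular search engine is implemented. The rules in `ROBOTS_TXT`, the user agent name `ExampleBot`, and the example.com URLs are all made up, and the rules are inlined so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file contents, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots file text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs this user agent is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Step 1: read the rules. Step 2: filter the crawl queue against them.
parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://www.example.com/",
    "https://www.example.com/private/report.html",
    "https://www.example.com/blog/post.html",
]
for url in allowed_urls(parser, "ExampleBot", candidates):
    print("crawl:", url)
```

With these sample rules, the URL under /private/ is filtered out and the other two remain in the queue. A real crawler would download the file from the site instead of using an inline string, for example with `RobotFileParser.set_url` and `read`.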

If the robots file does not contain any disallow rules, or the website does not have a robots file, the crawlers go on to crawl the other information on the website.

To be found, a robots file must be placed in the top-level directory of a website, also called the root. The file is public: append /robots.txt to the end of any main domain to see the directives for that website (if that site has a robots file!). This means that anyone can see which pages you have set to be crawled or not crawled. So do not use the file to hide sensitive user information. Some robots may decide to ignore your robots file. This is especially common with malicious crawlers, such as email address scrapers or malware robots. Each subdomain on a main domain uses its own separate robots file. This means that both and