Online toolbox

robots.txt file generation

 Paths are relative, but each path must begin with "/"
 Leave blank for none; Google uses the XML format, Baidu uses the HTML format
Popular search engines
Foreign search engines
Special search engines (robots)
Other (unconventional or even malicious crawlers)
Please save the generated result as a plain text file (for example with Notepad), name it robots.txt, and upload it to the root directory of your website
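
For example, a generated file that applies to all spiders and blocks a couple of private directories might look like the sketch below; the paths and the Sitemap URL are placeholders for illustration only:

    User-agent: *
    Disallow: /admin/
    Disallow: /tmp/
    Sitemap: https://www.example.com/sitemap.xml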

What is the robots.txt file?

Search engines use a program called a "spider" (also known as a "robot") to automatically visit pages on the Internet and collect their content. You can create a plain text file named robots.txt on your website to declare which parts of the site you do not want spiders to visit, so that part or all of the site's content is not crawled and indexed by search engines; you can also use robots.txt so that search engines index only the content you specify. robots.txt is the first file a search engine requests when it crawls a website.
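
As a minimal illustration of the two extremes, the first robots.txt below lets every spider crawl the whole site, while the second blocks every spider from the whole site (these are two separate files, since a single file may contain only one "User-agent: *" record):

    # robots.txt that allows all robots to access everything
    User-agent: *
    Disallow:

    # robots.txt that blocks all robots from the entire site
    User-agent: *
    Disallow: /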

How does the robots.txt file work in detail?

  1. File location

    The robots.txt file should be placed in the root directory of the website. When a search engine visits a site, it first checks whether a robots.txt file exists there; if the robot finds the file, it determines the scope of its access rights from the file's content. Taking WordPress as an example: by default no physical robots.txt file is uploaded to the site root, so when a search engine or a user requests the file, WordPress generates a robots.txt response on the fly. If we upload our own robots.txt to the site root, users and search engine spiders receive the uploaded file instead and WordPress no longer generates one; WordPress only generates the file when the server cannot find a physical robots.txt.
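
    To make the location concrete, assume a site at https://www.example.com (a placeholder domain). Spiders only request the file directly under the site root; a robots.txt stored in a subdirectory is ignored:

        https://www.example.com/robots.txt        (checked by spiders)
        https://www.example.com/blog/robots.txt   (ignored)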

  2. File format

    The "robots.txt" file contains one or more records separated by blank lines (ending with CR, CR/NL, or NL). Each record has the following format:"<field>:<optionalspace><value><optionalspace>" You can use #for annotations in this file, in the same way as in UNIX. Records in this file usually start with one or more lines of User-agent followed by several Disallow lines. The details are as follows:User-agent: The value of this item is used to describe the name of the search engine robot. In the "robots.txt" file, if there are multiple User-agent records, it means that multiple robots will be bound by the protocol. Therefore, there must be at least one User-agent record in the "robots.txt" file. If the value of this item is set to *(wildcard), the protocol is valid for any search engine robot. In the "robots.txt" file, there can only be one record such as "User-agent:*". Disallow: The value of this item is used to describe a URL that you do not want to be accessed. This URL can be a complete path or a partial path. Any URL that starts with Disallow will not be accessed by the robot. For example:"Disallow: /help" does not allow search engines to access either/help.html or/help/index.html, while "Disallow: /help/" allows robots to access/help.html but not/help/index.html. Any Disallow record is empty, indicating that all parts of the website are allowed to be accessed. In the "/robots.txt" file, there must be at least one Disallow record. If "/robots.txt" is an empty document, the site is open to all search engines robots.

  3. Commonly blocked content

    Content that is commonly blocked includes privacy pages, back-end login pages, cache pages, image directories, CSS directories, template pages, duplicate-content pages, and low-quality pages such as all of the member user space pages on Jinnet or the dynamic links of dz. All of these can be blocked with the Disallow: directive, as in the sketch below.
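
    The following is a sketch of such blocking rules for a WordPress-style site. The directory names are typical defaults and should be adjusted to your own installation, and the last rule uses the "*" wildcard extension supported by major engines such as Google and Baidu rather than the original robots.txt specification:

        User-agent: *
        # back-end login and admin pages
        Disallow: /wp-admin/
        # cache, template, image and CSS directories
        Disallow: /wp-content/cache/
        Disallow: /wp-content/themes/
        Disallow: /images/
        Disallow: /css/
        # dynamic URLs that tend to produce duplicate content
        Disallow: /*?*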