Audit my seo - robots.txt

Common robots.txt mistakes – how to avoid them? Robots.txt is one of the most vital elements of Search Engine Optimization. It is the first thing a search engine crawler checks when it visits your site. Robots.txt is used to tell a crawler which sections of the site it is allowed and disallowed to crawl. A tiny error in a robots.txt file can lead to poor crawlability, which directly impacts website rankings.


Make sure you use this file correctly. Google’s documentation states that inappropriate use of robots.txt can block crucial pages of your site, undoing the effort that goes into Search Engine Optimization.


Let’s discuss below in this blog post how to avoid common robots.txt file mistakes.


Common Robots.txt Mistakes


Not placing the file in the root directory


One of the most typical mistakes people make is neglecting to place the file in the right location.


A robots.txt file should always be placed in the root directory of your site. Putting it within other subdirectories makes the file unreadable for the search engine crawler when it visits your site.


  • Incorrect way – https://www.example.com/blog/robots.txt
  • Correct way – https://www.example.com/robots.txt
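
Crawlers derive the robots.txt location purely from the scheme and host of a URL, so a file in a subdirectory is never consulted. A minimal sketch of that lookup (example.com is a placeholder domain):

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the only robots.txt location a crawler will check for this URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# No matter how deep the page is, the crawler looks only at the root file:
print(robots_txt_url("https://www.example.com/blog/2021/post.html"))
# → https://www.example.com/robots.txt
# A file uploaded to https://www.example.com/blog/robots.txt is simply ignored.
```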

Improper use of wildcards


Wildcards are special characters used in the directives written for crawlers in a robots.txt file. Two wildcards can be used in the robots file: * and $.


The character * matches zero or more instances of any character, and the character $ marks the end of a website URL.


Let’s understand how wildcards work in the example below and use them intelligently.


Example of correct implementation


  • User-Agent: * (Here * means the rules apply to all user agents)
  • Disallow: /assets* (Here * means any URL whose path begins with “/assets” will be blocked)
  • Disallow: *.pdf$ (This directive blocks any URL ending with the .pdf extension)
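
As a rough sketch, the matching behaviour of these two wildcards can be reproduced with a regular expression (an illustration of the rules above, not Google’s actual implementation — note that Python’s standard-library robots parser does not understand these wildcards):

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Check a URL path against a Google-style robots.txt rule.

    '*' matches zero or more of any character, '$' anchors the end of
    the URL; otherwise a rule matches any path it is a prefix of.
    """
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    if not rule.endswith("$"):
        pattern += ".*"  # without a trailing '$', the rule is a prefix match
    return re.fullmatch(pattern, path) is not None

print(rule_matches("/assets*", "/assets/img/logo.png"))  # True
print(rule_matches("*.pdf$", "/files/report.pdf"))       # True
print(rule_matches("*.pdf$", "/files/report.pdf?x=1"))   # False: does not end in .pdf
```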

Unnecessary use of a trailing slash


One of the common mistakes is adding a trailing slash when allowing or blocking a URL in robots.txt. For instance, suppose you want to block the URL /blog.


What happens if you add an unnecessary trailing slash?


User-Agent: * 


Disallow: /blog/


This tells Googlebot not to crawl any URL inside the “/blog/” folder. However, it will not block the URL “/blog” itself, because that URL has no trailing slash.


The ideal way to block the URL


User-Agent: * 


Disallow: /blog
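
The difference is easy to verify with Python’s built-in urllib.robotparser, which performs the same prefix matching for plain (wildcard-free) rules:

```python
import urllib.robotparser

# Without a trailing slash, /blog and everything under it are blocked.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /blog"])
print(rp.can_fetch("*", "https://www.example.com/blog"))         # False
print(rp.can_fetch("*", "https://www.example.com/blog/post-1"))  # False

# With a trailing slash, /blog itself stays crawlable.
rp2 = urllib.robotparser.RobotFileParser()
rp2.parse(["User-Agent: *", "Disallow: /blog/"])
print(rp2.can_fetch("*", "https://www.example.com/blog"))        # True
```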


Using the NoIndex directive in robots.txt


NoIndex in robots.txt is an old practice that people have now abandoned. Google officially stopped supporting the NoIndex directive in robots.txt on September 1, 2019. If you are still using the NoIndex directive there, you should get rid of it. Instead, specify a noindex value in the robots meta tag on the URLs you don’t want Google to index.


Use the meta robots tag instead


<meta name="robots" content="noindex"/>


Place this tag in the page code of the URLs you need to keep out of Google’s index, rather than using a NoIndex directive in the robots.txt file.
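
A quick way to confirm the tag is actually present is to scan the page’s HTML for it; a minimal sketch using only the standard library (the sample HTML is illustrative):

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Detects <meta name="robots" content="...noindex..."> tags."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            if "noindex" in d.get("content", "").lower():
                self.noindex = True

def has_noindex(html: str) -> bool:
    checker = NoindexChecker()
    checker.feed(html)
    return checker.noindex

print(has_noindex('<head><meta name="robots" content="noindex"/></head>'))  # True
print(has_noindex('<head><title>Indexable page</title></head>'))            # False
```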


Not mentioning the sitemap URL


People often forget to mention the sitemap in the robots.txt file, which is a missed opportunity. Listing the sitemap location lets a search engine crawler discover the sitemap from the robots file itself.


Googlebot won’t have to waste time finding the sitemap, as it is already mentioned upfront. Making things more straightforward for crawlers always helps your site.


How to set the sitemap location in the robots file?


Add the directive below to your robots.txt file to submit your sitemap (example.com is a placeholder domain):


Sitemap: https://www.example.com/sitemap.xml




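Python’s urllib.robotparser (3.8+) can confirm the directive is picked up; a small sketch using the placeholder domain example.com:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /private",
    "Sitemap: https://www.example.com/sitemap.xml",
])

# site_maps() returns the Sitemap URLs found in the file (None if absent).
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```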
Blocking JS and CSS


Blocking JavaScript and CSS files can affect rankings on the search engine results page. People often worry that JavaScript and CSS files will get indexed by search engine bots and therefore end up blocking them in robots.txt. Google’s Senior Webmaster Trends Analyst John Mueller has advised against blocking CSS and JavaScript files, as Googlebot needs to fetch them to render pages properly. If Googlebot is unable to render a page, it is most likely that it will not index or rank that page.
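
A quick sanity check is to test your rules against a few asset URLs; a sketch with urllib.robotparser and a hypothetical rule set (domain and paths are placeholders):

```python
import urllib.robotparser

# A common but harmful rule set that blocks the whole assets folder.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /assets/"])

for url in ("https://www.example.com/assets/app.js",
            "https://www.example.com/assets/site.css"):
    print(url, "crawlable:", rp.can_fetch("Googlebot", url))
# Both print False — Googlebot cannot fetch the files it needs for rendering.
```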


Not creating a dedicated robots.txt file for each subdomain


Every subdomain of a website, including the staging subdomain, should have its own robots.txt file. Not doing so can lead to crawling and indexing of unwanted subdomains and wasteful crawling of important ones. Hence, it is highly recommended to ensure that a robots.txt file is defined for each subdomain.


Ignoring case sensitivity


It is crucial to remember that URLs are case-sensitive for search engine crawlers. Path rules in robots.txt are case-sensitive too, which means you may need multiple rules to match different cases of the same path.


For example, let’s say you want to block the lowercase URL /something.


Incorrect approach – Disallow: /Something


Correct approach – Disallow: /something
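
urllib.robotparser shows the case mismatch directly (a sketch with the placeholder domain example.com):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /Something"])  # capital S: wrong case

# The lowercase URL slips through because the rule only matches "/Something".
print(rp.can_fetch("*", "https://www.example.com/something"))  # True (not blocked)
print(rp.can_fetch("*", "https://www.example.com/Something"))  # False (blocked)
```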


Not blocking crawlers from accessing the staging site


All development work for a site is first tested on staging and then deployed to the main website. But one important thing people neglect is that to Googlebot, a staging website is just like any other normal website.


A crawler can find, crawl, and index your staging site just like any other website. And if you don’t stop crawlers from crawling and indexing your staging website, there is a high chance that its URLs will end up in the search index.
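
To keep crawlers out, the staging subdomain’s robots.txt can disallow everything — a sketch below (for real protection, combine this with HTTP authentication or noindex, since robots.txt only discourages crawling; staging.example.com is a placeholder):

```python
import urllib.robotparser

# The entire staging subdomain is declared off-limits with a single rule.
staging_robots = ["User-Agent: *", "Disallow: /"]

rp = urllib.robotparser.RobotFileParser()
rp.parse(staging_robots)

print(rp.can_fetch("*", "https://staging.example.com/"))             # False
print(rp.can_fetch("*", "https://staging.example.com/blog/post-1"))  # False
```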




These were some common mistakes related to robots.txt files that can harm Search Engine Optimization.


Robots.txt is a small but very vital file that is simple to set up.


Hence, you should take the utmost care when placing a robots.txt file and avoid making any errors.
