Do you want to remove a webpage or an entire website from Google’s search results?
When you are creating a new digital product like a website or a web app, it’s a good idea to get the basics of Google indexing down. To maintain good search engine visibility and protect your rankings, it’s imperative that a development environment does not show up in the index prematurely and compete with the live site. Here is our handy guide to indexing, with advice on how to hide unfinished development environments from Google, plus all you need to know about using robots.txt and the meta robots tag.
What is indexing?
In SEO, indexing means that a page has been picked up by a search engine crawler and stored in the search engine’s index. Websites only appear in search results once they have been indexed, and search engines re-crawl sites regularly to keep the index up to date. It’s important to serve relevant, compelling content to search engine crawlers and to ensure they are not indexing irrelevant pages. You may want to hide pages from crawlers so they do not turn up in the index and cannot be found on search engines, e.g. when the content is behind a paywall or is not public, when a page is not conducive to rankings, or when the page or website is unfinished.
Meta robots tag – block Google from indexing your site
The correct way to remove pages, or even an entire website or development environment from search engines, is to use the meta robots tag. You can also use the meta robots tag retrospectively when content or pages have already been indexed: using this tag will reverse the process and ensure they are promptly de-indexed.
The meta robots tag can be used at page level to prevent a single page from being indexed, or implemented server-side so that it blocks the entire site (this is the method we would recommend for de-indexing a development environment).
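As a sketch of what the two approaches look like: at page level, noindex is a single meta tag in the page’s head; server-side, the equivalent is the X-Robots-Tag HTTP response header. The snippet below shows the page-level tag, plus an Apache example of the header (assuming mod_headers is enabled; the exact configuration depends on your server).

```html
<!-- Page-level: place inside the <head> of the page to be de-indexed -->
<meta name="robots" content="noindex">
```

```apache
# Server-wide (Apache example, requires mod_headers):
# sends the noindex instruction on every response
Header set X-Robots-Tag "noindex"
```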
Detailed instructions for using the tag can be found from Google:
- Block Search indexing with ‘noindex’
- Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
If the situation is urgent, e.g. a development site is rising in the search results and must be de-indexed immediately, you can request that specific URLs be removed from the search results using Google Search Console (but you should also implement the meta robots tag). Once the tag is in place, submit the site to Google via Search Console so that Google re-crawls it, sees that it is blocked, and starts to de-index it.
Robots.txt – improve your site’s indexing
In all its simplicity, robots.txt is a text file added to your root domain that gives search engines more information about how your site should be crawled. It gives instructions to crawlers, telling them whether you want the site to be crawled, whether any areas should be skipped, and so on. You can use robots.txt to ask crawlers not to crawl a customer area or marketing pages, for example.
Like a sitemap, a robots.txt file should be added to every site by default. At the end of your robots.txt file, it’s a good idea to tell crawlers where they can find the site’s sitemap. Robots.txt is not suitable for completely blocking crawlers or removing content from search results; it is a way to control how search engine robots crawl your site. It will also not help you de-index pages that have ‘leaked’ into the index.
Example robots.txt files and their usage
Robots.txt that blocks all crawling:
User-agent: *
Disallow: /
Robots.txt that allows everything to be crawled (a default robots.txt):
User-agent: *
Allow: /
Sitemap: https://www.domain.com/sitemap.xml
A common WordPress robots.txt that blocks crawling of unnecessary WordPress paths:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
On the Semrush website, you will find an example of a very complex robots.txt that blocks several different scripts. That example also blocks URLs with UTM parameters and certain dynamic pages (/language, /results, etc.).
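A simplified sketch of that kind of blocking is shown below. The paths and the wildcard pattern are illustrative only, not Semrush’s actual file; Google’s robots.txt implementation supports the `*` wildcard and `#` comments used here.

```
User-agent: *
Disallow: /*?utm_      # block URLs carrying UTM tracking parameters
Disallow: /language/   # block dynamic language pages
Disallow: /results/    # block dynamic results pages
```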
Robots.txt is a so-called ‘lightweight’ method of controlling search engine robots. Unlike the meta robots tag, robots.txt can, for example, tell crawlers that certain URL paths or parts of a site aren’t important and don’t need to be visited. A typical example of a page that is useless for search engines is a custom URL generated automatically by a membership site whenever a new member creates an account. When a page adds no value for search, it’s a good idea to restrict crawling so that you don’t burn through your crawl budget on low-value pages.
Robots.txt lives in the root of your domain, e.g. https://www.domain.com/robots.txt.
The meta robots tag will not work if robots.txt has disabled crawling, because Google must first access the page to see the noindex tag. In other words, if your meta robots tag doesn’t seem to be working, make sure your robots.txt file is giving Google permission to crawl your site.
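To illustrate the working combination for de-indexing (a sketch; the domain is hypothetical): robots.txt must permit crawling while the pages themselves carry the noindex instruction.

```
# robots.txt — allow crawling so Google can reach the pages
# and see the noindex tag
User-agent: *
Allow: /
```

```html
<!-- on each page to be de-indexed -->
<meta name="robots" content="noindex">
```

Once Google has re-crawled the pages and registered the noindex, they drop out of the index; only then is it safe to block crawling via robots.txt, if you still want to.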
Timehouse provides technical SEO and content strategy SEO services. Read more about our services here!
Do you want to take your website or ecommerce business to a new level? Call us on +358 20 749 1449 or send a message to firstname.lastname@example.org.