Meta Robots Tag 101: Blocking Spiders, Cached Pages & More ~ Pori Shikhi

Monday, 24 November 2014

23:12

A Braveheart

The meta robots tag was an open standard created over a decade ago and designed initially to allow page authors to prevent page indexing. Over the years, various search engines have extended the tag with additional values.

Let me start off by saying that if you DO want your pages in search engines, then DO NOT use the tag. By default, the major search engines will index any page they find. Yes, there is a form of the meta robots tag you can use to explicitly tell search engines to index your pages. It looks like this:
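The example tag did not survive in this copy of the post. In standard meta robots syntax, the explicit "index" form described above would look something like this sketch:

```html
<!-- Reconstruction in standard syntax; the original example is missing from this copy -->
<meta name="robots" content="index">
```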

There’s also a form you can use that adds the command “follow,” which tells the search engines to index your page and also follow any links they find on that page to other pages, which they can then index. It looks like this:
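The original example is missing here as well. Assuming the usual comma-separated syntax, the "index" plus "follow" form would read roughly as follows:

```html
<!-- Reconstruction: "index" plus the added "follow" command -->
<meta name="robots" content="index,follow">
```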

You do NOT need to use either form if you DO want your pages in the search engines. Without either form, they’ll naturally index your pages and follow your links. That’s what they do.

I always joke that putting these forms of the meta robots tag on your web pages is like putting a Post-It note on your chest that says “breathe.” Hey, if you forget to look at that note, you’ll still breathe. That’s what you do, by default. And that’s what the major search engines do. By default, they inhale web pages without you putting up a meta tag telling them to do so.

Now if you DO NOT want your pages in a search engine, then it’s time to perhaps break out the meta robots tag, if for some reason the robots.txt alternative isn’t suitable. Want to keep a particular page out? Then put this on that page:
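The tag itself was lost from this copy. Based on the "noindex" value named in the next paragraph, it would be along these lines:

```html
<!-- Reconstruction: keeps this one page out of search listings -->
<meta name="robots" content="noindex">
```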

See the “noindex” value? That tells the search engines that see this page not to include it in their listings. Remember, as I explained before, this will not prevent the page from being spidered. That’s because search engines have to keep revisiting the page in order to see if the tag is removed. The tag only keeps the page out. Here’s my earlier chart on that topic.

System	Robots.txt	Meta Robots	Yahoo Delete URL Option
Stops Crawling	Yes	No	No
Stops Index Inclusion	Yes	Yes	Yes
Stops Link Only Listing	No	No (Yes, for Google)	Yes
Why Use?	Easy to block many pages at once	Can’t access root domain	Don’t even want URL to appear or need page out fast

What if you don’t want links followed? Sure, you can do this:
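The example is missing from this copy. Since the following paragraph calls "nofollow" an "extra" command, the tag most likely combined it with "noindex". That pairing is an assumption, sketched here:

```html
<!-- Reconstruction; pairing nofollow with noindex is assumed from the word "extra" -->
<meta name="robots" content="noindex,nofollow">
```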

That extra command, “nofollow,” tells the search engines not to follow any links on that page. Google recently covered this option in more detail. But as Google also explained, pages linked from a page with this tag might still get crawled. That’s because if anyone else links to a particular page WITHOUT a nofollow value, then the search engine will follow that link.

So far, I’ve covered all the commands that were originally created with the tag back in May 1996. Since then, more commands (also called values or attributes) have been added. For example, Google published a post today summarizing several options you can use. Quoting Google:

NOINDEX – prevents the page from being included in the index.
NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
NOARCHIVE – prevents a cached copy of this page from being available in the search results.
NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.

At times, you may want to use more than one of these commands. I’ll get back to that. But first, how about another chart? I’ll cover the major commands you may want to use below:

COMMAND	Ask	Google	Microsoft	Yahoo
NOINDEX	Yes	Yes	Yes	Yes
NOFOLLOW	Yes	Yes	Yes	Yes
NOARCHIVE	Yes	Yes	Yes	Yes
NOODP	No	Yes	Yes	Yes
NOYDIR	No	No	No	Yes
NOSNIPPET	No	Yes	No	No
Robot Name	TEOMA	GOOGLEBOT	MSNBOT	SLURP
Does Robot Specific Tag Override All Robots Tag?	???	No	No	No

Several of these are already explained above, in what I quoted from Google. They work the same way for the other major search engines. I’ve also linked to help information from each search engine for more specific advice.

The NOYDIR command is fully explained in my previous Yahoo Provides NOYDIR Opt-Out Of Yahoo Directory Titles & Descriptions post. Only Yahoo supports this, but none of the other major search engines use Yahoo titles and descriptions for listings, so it doesn’t really matter for them.

Now on to the topic of a meta robots tag having multiple values. What if you wanted to keep a page from being cached by all the major search engines and also ensure that neither Open Directory nor Yahoo Directory descriptions are used? First, you need the command values that say this. From the table above, they are:

NOARCHIVE
NOODP
NOYDIR

Next, you need to decide what robots to target. We’ll keep it simple for now. To target ALL robots, you use this value:

ROBOTS

Now to the meta robots format. Without the values, it looks like this:
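The skeleton tag is missing from this copy. The name placeholder below comes from the next paragraph; the content placeholder is a hypothetical stand-in, since the original wording is not recoverable:

```html
<!-- Reconstruction; COMMANDS-GO-HERE is a hypothetical placeholder -->
<meta name="NAME-OF-ROBOTS-TO-TARGET" content="COMMANDS-GO-HERE">
```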

We replace that NAME-OF-ROBOTS-TO-TARGET part with the name of the robots we’re, well, targeting. As explained, that’s ROBOTS, in order to target them all. I’ll put it in bold below:
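With the robot name filled in, the tag would look roughly like this (the bolding did not survive, and the content placeholder is again hypothetical):

```html
<!-- Reconstruction; only the name attribute is filled in at this step -->
<meta name="ROBOTS" content="COMMANDS-GO-HERE">
```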

Now we put in the commands we want to tell the robots, each separated by a comma. The order doesn’t matter. Again, I’ll bold the commands:
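The finished tag is missing from this copy. Using the three values listed earlier and the ROBOTS name, and with no spaces between the values as a later paragraph notes, it would be something like:

```html
<!-- Reconstruction from the values NOARCHIVE, NOODP and NOYDIR named in the text -->
<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
```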

Voila! Put that tag ANYWHERE inside the header area of a web page like this:
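The page example is missing from this copy. A minimal sketch of the placement, with a hypothetical title and body, might be:

```html
<!-- Reconstruction; the title and body text are hypothetical -->
<html>
<head>
<title>Example page</title>
<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
</head>
<body>
<p>Page content goes here.</p>
</body>
</html>
```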

Then you will be telling all major search engines not to cache the page, nor to use Open Directory or Yahoo Directory titles or descriptions for your page listings.

Notice that in the tag above, there are no spaces between the commands. What if I did this?
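The spaced variant is missing from this copy. It presumably differed only in whitespace after each comma, along these lines:

```html
<!-- Reconstruction: same values, with a space after each comma -->
<meta name="ROBOTS" content="NOARCHIVE, NOODP, NOYDIR">
```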

Google writes today that spaces make no difference. Use them if you want or not, the tag means the same thing. Microsoft tells me the same thing, as does Yahoo.

What if you did this, with no commas:
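The comma-free variant is also missing. Assuming the same three values separated only by spaces, it would look like this:

```html
<!-- Reconstruction: values separated by spaces only, which Google says is not valid -->
<meta name="ROBOTS" content="NOARCHIVE NOODP NOYDIR">
```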

Microsoft tells me this is fine. I didn’t ask Yahoo about this, and Google says commas MUST be used. So use commas and don’t be a pain.

Now what if you want to tell search engines different things? Maybe you want Microsoft not to use the ODP descriptions, Google not to cache pages, Yahoo not to follow links on a page and Ask not to index the page at all. Maybe you want to get your head examined for being so strange, too. But aside from your mental health, it is possible to do all this.

You need to have a robots tag for each particular search engine you want to target. See that chart above? At the bottom there’s a “Robot Name” row. That shows you the name of each search engine’s “robot” or “spider” that you’ll issue a command to. With the robot names, we then give each of them their specific commands:
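The per-engine tags are missing from this copy. Taking the robot names from the chart and the four wishes listed earlier (Microsoft no ODP descriptions, Google no caching, Yahoo no link following, Ask no indexing), a sketch would be:

```html
<!-- Reconstruction; robot names come from the "Robot Name" row of the chart -->
<meta name="MSNBOT" content="NOODP">
<meta name="GOOGLEBOT" content="NOARCHIVE">
<meta name="SLURP" content="NOFOLLOW">
<meta name="TEOMA" content="NOINDEX">
```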

You could also tell all robots to do one thing — say not to follow links — while also issuing a second robots-specific command such as telling only Google not to cache the page:
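The example is missing here. For the case described, with all robots told not to follow links and only Google told not to cache, the pair of tags would read roughly:

```html
<!-- Reconstruction: one tag for all robots, one for Googlebot only -->
<meta name="ROBOTS" content="NOFOLLOW">
<meta name="GOOGLEBOT" content="NOARCHIVE">
```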

But wouldn’t a search engine only follow the specific tag written for it? In other words, if you target Google with a specific command in the “GOOGLEBOT” tag, then wouldn’t it follow only that tag and ignore the other?

Google, Microsoft and Yahoo say they will honor them both. I don’t know about Ask. That’s why you see “???” in that “Does Robot Specific Tag Override All Robots Tag?” section of the chart above. I’ll try to get that answered.

What if you had more than one “all” robots tag like this:
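The example is missing from this copy, so the values the author used are unknown. As an illustration only, two tags that both target all robots could look like this:

```html
<!-- Illustrative values; the original tags are not recoverable -->
<meta name="ROBOTS" content="NOARCHIVE">
<meta name="ROBOTS" content="NOODP">
```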

As explained, you could easily do this instead:
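The combined example is missing too. Supposing the two separate tags carried NOARCHIVE and NOODP (illustrative values, not the author's), the single-tag equivalent would be:

```html
<!-- Illustrative: two all-robots commands merged into one tag -->
<meta name="ROBOTS" content="NOARCHIVE,NOODP">
```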

But if for some reason you did do it the other way, Microsoft and Yahoo have told me that’s just fine. They honor the information in BOTH of the robots tags. Google’s post today says the same thing.

Finally, the Google post provides reassurance that capitalization doesn’t make a difference. I’ve shown things in various ways above, sometimes the commands in ALL CAPS, sometimes in lowercase. As Google says, case makes no difference. To quote their post:

Googlebot understands any combination of lowercase and uppercase. So each of these meta tags is interpreted in exactly the same way:
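The tags Google quoted are missing from this copy. The following is not the original quotation, just a sketch of what case variants of one tag look like, using NOODP as an arbitrary value:

```html
<!-- Illustrative case variants; not the tags from Google's post -->
<meta name="ROBOTS" content="NOODP">
<meta name="robots" content="noodp">
<meta name="Robots" content="NoOdp">
```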

Ah, but what about something like this:

Well, Google didn’t go that far. But my experience over the past decade has been that meta tags are not case sensitive at all with the major search engines. So I think you’re safe in whatever case, for all the major search engines.
