You may have noticed that the Google Search Console gives you a link to a new version. Well! This new version is great for at least one thing: it actually tells you about your excluded pages. You get not only a link to each of them but also the reason why Google decided to exclude it from its index. This is super practical because it lets you fix your pages quickly. Before, there was no information about why some pages would not make it, and applying fixes was 100% guesswork. On a large website you did not even know which pages were affected, so there was close to no way to know what was happening. Period.
The picture at the top shows you the four items that Google now offers: Errors, Warnings, Valid, and Excluded. With the old console, they only offered the list of Errors, which was better than nothing, although it really only had the following few errors in that report:
- 403 Forbidden
- 404 Not Found
- 500 Server Error
Not much to go on.
This new console gives you a list of errors as before.
This list includes the same pages as in the old console. For example, if you delete a page and do not add a 301 redirect to a new location for that page, Google will report it as an error.
If you really can’t have a 301 for that page, I suggest you look into your setup and return a 410 Gone error instead. A 404 will stick around in the report forever. A 410 error gets removed after about a month or two.
With Apache, you can add this instruction in your <VirtualHost> or your .htaccess definition:
Redirect gone /the-path/to-the-page/that-was-deleted
The Redirect instruction is pretty much always available, but make sure that your changes work as expected: go to that old page and check that you get the 410 Gone error there, then check that your other pages do not return it.
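If you have more than a handful of deleted pages, checking them by hand gets tedious. Here is a minimal Python sketch of that check, not a definitive tool. The names `fetch_status` and `check_gone` are my own, and it assumes your server answers HEAD requests the same way it answers normal page views. It asks for each URL without following redirects, then reports deleted pages that do not return 410 and live pages that do.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url):
    """Return the HTTP status code for url, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        return conn.getresponse().status
    finally:
        conn.close()


def check_gone(deleted_urls, live_urls, fetch=fetch_status):
    """Return a list of problems; an empty list means everything looks right.

    `fetch` is any callable mapping a URL to a status code, which makes it
    easy to try the logic without touching a real server.
    """
    problems = []
    for url in deleted_urls:
        status = fetch(url)
        if status != 410:
            problems.append(f"{url}: expected 410, got {status}")
    for url in live_urls:
        status = fetch(url)
        if status == 410:
            problems.append(f"{url}: unexpectedly returns 410")
    return problems
```

You would call it with the full URL of each deleted page in the first list and a few pages that must stay reachable (your home page, for example) in the second list.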
If you modify a file in your /etc/apache2 settings, remember you have to reload the files. On a modern Debian or Ubuntu system, you do this:
sudo systemctl reload apache2
You may just reboot your system if you’re not too sure how to do it otherwise.
301 Redirect SEO Penalty
You may have read that using a 301 Redirect can generate an SEO Penalty. This is correct. A direct link from a good website to your website is going to be like gold.
A link on someone’s website that hits one of your 301s is going to be pretty good. It would be better if the link could be updated on their website to point to the new destination, though.
A link that goes through a third party system such as bit.ly, addthis, etc. includes a 301 as well. Those are not too bad, but because they are from a third party website, they carry even less weight than your own 301 Redirects.
Also, a link that first goes through a third party system as mentioned in the previous paragraph, and then hits your website and does another 301 Redirect, is going to be penalized more than a link with a single 301 Redirect. For sure, your own website (which you fully control) should not do a 301 to another 301. The Broken Link Checker plugin can be used to help you find such double redirects on your website. The penalty for multiple 301s in a row increases dramatically, probably exponentially, and the chain length is limited to an undisclosed number (i.e. a 301 to a 301 to a 301 … too many times, and GoogleBot stops en route and any juice left is lost).
However, in all cases, getting a little bit of juice is much better than sending juice to a 404 or a 410, because that juice is lost for sure.
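If you do not use a plugin for this, you can count the hops yourself. Here is a minimal Python sketch, under my own assumptions: the names `http_lookup` and `redirect_chain` are hypothetical, and the limit of 10 hops is an arbitrary safety value, not the undisclosed number GoogleBot uses. It follows one redirect at a time and returns every URL visited, so a list with more than two entries means a double redirect.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_lookup(url):
    """Return (status, Location header or None) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, lookup=http_lookup, max_hops=10):
    """Follow redirects one hop at a time; return all URLs visited, in order."""
    chain = [url]
    while len(chain) <= max_hops:
        status, location = lookup(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break  # reached a final page (or an error page)
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain
```

The number of redirects is `len(chain) - 1`. If that is 2 or more and the extra hop happens on your own website, point the first redirect straight at the final page.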
How to Fix Warnings
Unfortunately, at this point I have not received warnings so I’m not totally sure what they are in the New Google Search Console.
These are likely temporary errors such as a 500 Server Error. In other words, Google lets you know that whenever their spider tried to reach your server it encountered some problem. However, there is nothing to fix in your website per se, only your server needs to be back on its feet.
If you get some warnings, please post a comment about them. I would be grateful if you did.
That page includes an entry of valid pages. These are all the pages that are included in the Google Index and are considered valid and kicking. These are the pages your visitors will eventually find in their search.
Note that this Google Interface limits the number of pages that they show to 1,000. If your website has more than 1,000 pages, you will only see that many in this list. Don’t be alarmed, it’s normal. The number shown at the top is the total number of pages in that list. In any event, it’s probably not very useful to find a certain page in this list. The most important part is to not find it in the other lists (Errors or Excluded especially.)
How to Fix Excluded Pages
The list of pages under the Excluded Pages is the most important one, I think. Errors are very important, but in most cases these are easy and quick to fix. The excluded pages can be harder to fix as they are quite diverse.
Whenever you click on that one button at the top, you get a list of reasons for exclusion below the graph. Note that your list may be shorter or longer depending on which errors you had. If you never had a certain exclusion error, then you won’t get an entry at all for it. That’s normal and probably a good thing. Once you have had an error, though, it looks like its entry sticks around even when the counter drops back to zero.
In the following I describe various errors I got which got some of my pages excluded from the index and explain why it happens and how to fix the problem.
Excluded Page: Page with Redirect
The Google Search Console now tells you of all the pages with a Redirect on them (well, up to 1,000 of them at least).
These are pages you had earlier and changed the URL of, or pages you deleted and are now using a 301 to a new location on your website (or even on another website).
In most cases, there is nothing to do about those. It is totally normal to have redirects and there is really no reason for such pages to be part of the Google Index. Therefore, you should leave these alone.
Yet, there is one case where you could do something about your 301 warnings. If you wanted to write a page which better corresponds to one of those old pages, you could decide to reuse that URL for your new page. This is perfectly legal. I’ve done that before, but never tried to track how the new page would do in search engines. From what I’ve read, though, it’s not a bad idea to reuse old URLs so you should be just fine doing that too.
Excluded Page: Crawled – Currently Not Indexed
This one lets you know of pages that Google knows about, but that it decided not to include in their index.
I have a theory as to why that could happen, but so far I have no real proof. All my pages are decidedly unique (I wrote them all from scratch) but still, I think that Google views these few pages (I have 4 in that list) as duplicates from other websites out there. It could be because the subject is close to another page on the Internet.
In my case, I have these four pages in that list:
- All Your Amazon Affiliate Links Must Be Public and Other Restrictions
- Creating My Niche Website, Step by Step
- HTML and CSS, The Basics so You Become Dangerous
- First Step in Creating a Niche Website
To me it looks like all four pages are rather distinct:
- A page about Internet Affiliate on Amazon and what to be cautious about
- A short page about niche website (really a list of links to various other pages)
- A technical page about HTML and CSS, the basic technology used to create web pages
- The page explaining how to start with creating your Niche Website
So really it does not make much sense to me if it were really a duplication problem. I really created all those pages myself by hand…
Anyway, I’ll update this section if I find out more about this problem. I noticed that new pages often get in there for a little while, but they quickly get indexed (like within one month). These 4, though, have been in that list like forever!
Excluded Pages: Submitted URL not Selected as Canonical
This one is different from the other Canonical error in that the “Submitted URL” refers to a URL that you specified in your sitemap.xml file.
For efficiency, the sitemap.xml file is expected to only include canonical URLs.
The sitemap.xml plugin I used with WordPress would include the /home page which WordPress generates by default. Unfortunately, that default page really is the home page and thus its canonical path is just /. This is viewed as a mismatch and generated this error.
I fixed the problem by making sure /home would not be included in the sitemap.xml file and by redirecting /home to / so GoogleBot sees a 301 on that one and not a duplicated page.
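To double check that the fix took, you can scan the sitemap for paths that you know redirect elsewhere. Here is a minimal Python sketch, assuming a standard sitemap.xml that uses the sitemaps.org namespace. The function names and the idea of keeping your redirected paths in a set are mine, not something the plugin offers.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_paths(xml_text):
    """Return the path of every <loc> entry found in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [
        urlsplit(loc.text.strip()).path or "/"
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def non_canonical_entries(xml_text, redirected_paths):
    """Return the sitemap paths that are known to redirect (such as /home)."""
    return [
        path
        for path in sitemap_paths(xml_text)
        if path.rstrip("/") in redirected_paths
    ]
```

With `redirected_paths={"/home"}`, an empty result means /home no longer appears in the sitemap.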
Excluded Pages: Not Found (404)
Since by default you have the Error tab selected in the graph above, you are not unlikely to see an entry that says Not Found (404).
This entry is about errors, but obviously Google is not going to index pages that return a 404 error. There would really be no point in sending users to such pages (even if they have some fun graphics).
The point here is to let you know that you have some missing pages. It could be a temporary problem such as a database that’s not accessible or software that is currently offline. In such cases, the errors should get resolved once your server is back on track.
Other 404 Not Found errors should be fixed by using either:
- A 301 Redirect to a new location that makes sense. Note that some people don’t bother and redirect everything to their home page. This is not always a good choice, though, because the home page is not always a match for what that old page was about. At a minimum, try to find a match, and only if you really don’t see one in your existing pages, send these users to the home page.
- A 410 Gone error. I show how you can do that above with Apache. The 410 error tells Google that the page is really not there anymore, whereas a 404 will generally stick around for years. That being said, a 404 can be kept as such as long as you’d like. It won’t hurt your website SEO unless you have many of them all at once (when you delete a large section of your website, for example).
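When you have many 404s to go through, a rough first pass at finding a match can be automated. Here is a minimal Python sketch that compares the missing path with your existing paths and falls back to the home page when nothing is close. Comparing URL text is only my assumption of what counts as a match, and the name `suggest_redirect` and the 0.6 cutoff are arbitrary. You should still verify by hand that the suggested page really covers the same subject.

```python
import difflib


def suggest_redirect(missing_path, existing_paths, cutoff=0.6):
    """Return the existing path most similar to missing_path, or "/" if none is close.

    Similarity is computed on the URL text only, not on the page content.
    """
    matches = difflib.get_close_matches(
        missing_path, existing_paths, n=1, cutoff=cutoff
    )
    return matches[0] if matches else "/"
```

A path with no near match gets the home page, which is the last resort described above.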
Excluded Pages: Crawl Anomaly
Whenever the GoogleBot computers crawl your website, they may detect problems that Google reports in this list.
A common problem would be a page that has a redirect loop, that is, a redirect that sends the user (and thus GoogleBot) back to the same page. Attempting to reload the page over and over again would not result in anything. This is a Crawl Anomaly and the redirect loop is probably one of the main ones you’ll ever encounter.
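If you keep your redirects in one place, you can look for loops before GoogleBot does. Here is a minimal Python sketch, assuming you can export your redirects as a simple mapping from old path to new path. That mapping and the name `find_redirect_loops` are hypothetical, and your server or plugin may store its rules differently.

```python
def find_redirect_loops(rules):
    """Return the set of paths whose redirect chain never reaches a final page.

    `rules` maps a source path to the path it redirects to.
    """
    looping = set()
    for start in rules:
        seen = set()
        current = start
        # Walk the chain until we leave the redirect table or revisit a path.
        while current in rules:
            if current in seen:
                looping.add(start)
                break
            seen.add(current)
            current = rules[current]
    return looping
```

A path that redirects to itself is reported, and so is any path that merely leads into a loop further down the chain.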
Excluded Pages: Google Chose Different Canonical Than User
As the error says, GoogleBot read a page with a canonical path which did not match the page.
In my case, the problem came from HTML minification. A beautified HTML code snippet will look like this:
<section class="main">
  <h2>This is a neatly formatted HTML code snippet</h2>
  <p>It has newlines and indentation to make it easy to read.</p>
</section>
The fact is that the nice formatting can be removed and it will have no impact on the resulting page (unless your browser is broken, but nowadays that’s not common!)
The example above ends up on a single long line of HTML code:
<section class="main"><h2>This is a neatly formatted HTML code snippet</h2><p>It has newlines and indentation to make it easy to read.</p></section>
This may not look like much to you, but such minification can actually save you 10% to 20% of the page size. When you receive 1 million hits per month, that is a lot of bandwidth saved.
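To get a feel for the savings on your own pages, you can measure them. Here is a minimal Python sketch, and I want to be clear that it is not what the Minify HTML plugin does internally. It only drops whitespace between tags, which is a naive approach I picked for illustration, and the function names are mine.

```python
import re


def minify_html(html):
    """Naively remove whitespace between tags.

    Not safe for <pre>, <textarea> or inline scripts, and it can change the
    spacing between inline elements. A real minifier handles those cases.
    """
    return re.sub(r">\s+<", "><", html.strip())


def savings_percent(original, minified):
    """Return how much smaller minified is, as a percentage of the original bytes."""
    before = len(original.encode("utf-8"))
    after = len(minified.encode("utf-8"))
    return 100.0 * (before - after) / before
```

Feed it the HTML of one of your real pages to see where you land. A tiny snippet like the one above is not representative of a full page.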
The Minify HTML plugin will also go through the code and remove any reference to the same domain. As I explain about Internal Links, it is possible to shorten them by removing the protocol and domain name part. The trouble is that the Minify HTML plugin does this to the canonical link tag as well. So it transforms:
<link rel="canonical" href="https://www.internetaffiliate.com/my-page"/>
into:
<link rel="canonical" href="/my-page"/>
Which in itself is a great minification, only it’s completely invalid. I entered a bug report for it and the author said that since you can turn off that feature, it was not a bug. It’s just that this specific feature is therefore totally useless, which is sad because many other links would benefit from that minification.
So in my case, the reason I had that error entry was that I had turned on the URL minification feature of the Minify HTML WordPress plugin. Your case may be different. In any event, you need to find out why your canonical would be URL A on page B. The URL of the page and its canonical need to match one to one.
If the canonical link says “https://www.example.com/my-page” then going to that very page must present a canonical of “https://www.example.com/my-page”.
Another page, say “https://www.example.com/my-page-duplicate” may have the canonical of “https://www.example.com/my-page” if indeed it is a duplicate and going to “https://www.example.com/my-page” shows the right canonical.
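The rule is easy to check automatically once you have the HTML of a page. Here is a minimal Python sketch that reads the `href` of each `<link rel="canonical">` tag and compares it with the URL of the page. The names `CanonicalFinder` and `canonical_problem` are my own, and how you download the HTML is left to you.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.canonicals.append(attributes.get("href"))


def canonical_problem(page_url, html):
    """Return a short description of the problem, or None if the page looks fine."""
    finder = CanonicalFinder()
    finder.feed(html)
    hrefs = [href for href in finder.canonicals if href]
    if len(hrefs) != 1:
        return f"expected exactly one canonical link, found {len(hrefs)}"
    href = hrefs[0]
    if not urlsplit(href).scheme:
        return f"canonical {href!r} is not an absolute URL"
    if href != page_url:
        # Only acceptable when this page is a deliberate duplicate.
        return f"canonical points to {href}"
    return None
```

A result of "canonical points to …" is expected for a page such as my-page-duplicate. It is a problem when it shows up on the page that is supposed to be the original.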
This one is simple enough to understand. Your sitemap.xml may not include certain pages on your website. Content Management Systems (CMS) in particular are notorious for adding your pages to the sitemap.xml but not meta-pages, meaning pages created automatically such as lists of other pages, or automatic pages that reference things such as the date and show the corresponding moon. Such pages being fully dynamic, they are often not included in the sitemap.xml.
The fact that such pages are excluded just means that they were not yet fully analyzed for inclusion. There is nothing for you to do. In most cases these will disappear from this list over a short period of time (i.e. about one month.)
If you have not submitted a sitemap.xml for your website to your Google Search Console (and that’s fine), then this entry will include your new pages as they are found by Google.
Note that GoogleBot does not always re-read the sitemap.xml file before checking other pages on your website such as your home page. This means you may end up with a new entry under this label even though your page is indeed included in your sitemap.xml. So don’t be alarmed if you see pages appearing here. It’s common.
This list mainly gives you the ability to see how long it takes for your pages to get indexed.
Valid: Submitted and Indexed
When you select the Valid tab, you see this entry. This shows you 1,000 of your valid pages that were submitted using a sitemap.xml. Whether you submitted the sitemap.xml to your Google Search Console or not, pages in the sitemap.xml will appear under this label.
Valid: Indexed, not submitted in sitemap
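This list shows pages that Google found on its own and decided to index. As I mentioned above, GoogleBot is capable of finding all your public web pages. This list will include pages that it found following your links and that are not part of your sitemap.xml file. If you would rather keep some of them out of the index (your legal pages, for example), you can add this tag inside the <head> of those pages: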
<meta name="robots" content="NOINDEX"/>
But in most cases this does not matter much. It’s actually not such a bad thing to have your legal pages indexed because some people may want to search those pages to find something that Google will be capable of matching even with synonyms or even acronyms.
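If you do decide to keep some pages out of the index with the robots meta tag shown above, it helps to verify which pages actually carry it. Here is a minimal Python sketch, assuming you already have the HTML of the page. The names `RobotsMetaFinder` and `is_noindex` are mine.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    # Directives are case-insensitive, so normalize them.
                    self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if the page asks search engines not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return bool({"noindex", "none"} & finder.directives)
```

Running it over the pages listed under this label tells you which ones are still indexable.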