For posterity, a quick note that today I noticed //ardamis.com/ was back up to a toolbar page rank of 6 after a number of months at 5. A little later in the day, I noticed 3 one-line sitelinks to: Colophon – About – Portfolio. We’ll, I’d have preferred different pages, but these are my first sitelinks, so I’m still thrilled.
I was tired of seeing the majority of my posts’ comments feeds show up in Google’s Supplemental Index, so I changed all the individual posts’ comments RSS links to rel=”nofollow”. This should at least cause Googlebot to stop passing PageRank through those links, but what I really want is for Googlebot to stop spidering the individual posts’ comment feeds, in hopes that they’ll eventually be removed from the index. To see only those pages of a site that are in the Supplemental Index, use this neat little search feature:
site:DOMAIN.com *** -view. For example, to see which pages of Ardamis.com are in the SI, I’d search for:
site:ardamis.com *** -view. This is much easier than the old way of scanning all of the indexed pages and picking them out by hand.
To change all the individual posts’ comments feed links to rel=”nofollow”, open ‘wp-includesfeed-functions.php’ and add rel=”nofollow” to line 84 (in WordPress version 2.0.6), as so:
echo "<a href="$url" rel="nofollow">$link_text</a>";
One could use the robots.txt file to disallow Googlebot from all /feed/ directories, but this would also restrict it from the general site’s feed and the all-inclusive /comments/feed/, and I’d like the both of these feeds to continue to be spidered. Another, minor consequence of using robots.txt to restrict Googlebot is that Google Sitemaps will warn you of “URLs restricted by robots.txt”.
To deny all spiders from any /feed/ directory, add the following to your robots.txt file:
User-agent:* Disallow: /feed/
To deny just Googlebot from any /feed/ directory, use:
User-agent: Googlebot Disallow: /feed/
For whatever reason, the whole-site comments feed at //ardamis.com/comments/feed/ does not appear among my indexed pages, while the nearly empty individual post feeds are indexed. Also, the general site feed at //ardamis.com/feed/ is in the Supplemental Index. It’s a mystery to me why.
I needed to find a satisfactory way of adding WordPress tags and theme elements (such as the sidebar) to pages that exist outside of WordPress. A non-WordPress page could then appear to be seemlessly incorporated into the site, wherein the layout automatically updates with changes to the theme template files, and could use the same header, sidebar, and footer as a normal WordPress page.
The first few solutions that I found involved adding a
line to each non-WordPress page. This does indeed allow the page to incorporate WordPress tags, theme elements and styles, but there is a serious drawback to this method because of the way WP manages a web site.
When you click on the link to a WP page, or enter it into the address bar, you aren’t actually going to a file that resides at that address. Instead, WP uses that address as an instruction to pull various database entries and form an index.php page that resides in the WP installation directory. For example, while the address for this page appears to be https://ardamis.com/2006/07/10/wordpress-googlebot-404-error/ , the actual page is at https://ardamis.com/index.php.
WordPress assumes that it is responsible for every page on the site, and that there are no pages on the site that it doesn’t know about. If you try to browse to a page that WP doesn’t know about, it sends you a 404 error page instead. This is what you want it to do, so long as you don’t create any pages outside of WordPress.
But a problem arises when you do want to create a page that WP doesn’t know about. When you visit the page, WP checks the database and finds that it doesn’t exist, so it puts a 404 error in the http header, but the page does exist, so Apache sends the page with the 404 error. This behavior seemed to cause some problems with some versions of IE but none with Firefox. It did, however, result in a 404 header being given to Googlebot, so that non-WordPress pages would incorrectly show up in Google Sitemaps as Not Found.
To get around this problem and send the correct http header code: HTTP Status Code: HTTP/1.1 200 OK, I needed to require a different file, wp-config.php, and then select specific functions for use in the page. results in a page that can use all of the desired tags and theme elements and also sends the correct header code: HTTP Status Code: HTTP/1.1 200 OK
The following code results in a page that can use all of the tags and theme elements (you may need to adjust the path to wp-config.php):
<?php require('./wp-config.php'); $wp->init(); $wp->parse_request(); $wp->query_posts(); $wp->register_globals(); ?> <?php get_header(); ?> <div id="content"> <div class="post"> <h2>*** Heading Goes Here ***</h2> <div class="entry"> *** Content in Paragraph Tags Goes Here *** </div> </div> </div> <?php get_sidebar(); ?> <?php get_footer(); ?>
Testing the method
Using wp-blog-header.php as the include, I created a GoogleBot/WordPress 404 test page as the index.php file in a /testing/ directory. I added the url https://ardamis.com/testing/ to my Google XML sitemap file, and waited for the sitemap to be downloaded. Sure enough, a few days later, Google Sitemaps was listing the /testing/ url among the Not Found errors.
The next step was to remove what I suspected was the culprit, the include of the WordPress header, wp-blog-header.php, and see if Googlebot could again access the page. A few days after removing the include, and after the sitemap was downloaded, the Not Found error disappeared. I’m interpreting this as Googlebot once again successfully crawling the page.
The third step was to use the above code, including wp-config.php and then testing the HTTP Request and Response Header. The header looks ok, and Googlebot likes it. It looks like this does the trick.