Tag Archives: php

Using timestamps to reduce WordPress comment spam

Update 8.27.11: The method described in this post uses PHP to generate the timestamps. If your site is using a caching plugin, the timestamps in the HTML will be stale, and this method will not work. Please see my updated post at A cache-friendly method for reducing WordPress comment spam for a new method using JavaScript for sites that use page caching.

In this post, I’ll explain how to reduce the amount of comment spam your WordPress blog receives by using an unobtrusive ‘handshake’ between the two files necessary for a valid comment submission to take place. I’ve written a few different articles on reducing comment spam by means of a challenge response test that the visitor must complete before submitting a comment, but I’m now looking for ways to achieve the same results while keeping the anti-spam method invisible to the visitor.

I’m a big fan of Akismet, but I also want to block as much spam as possible before it is caught by Akisment in order to reduce the number of database entries.

One thing this method does not do is rename and hide the path to the form processing script, but it makes that technique obsolete, anyway.

In the timestamp handshake method, a first timestamp is generated and written as a hidden input field when the post page loads. When the comment is submitted, a second timestamp is generated by the comment-processing script and it and the page-load timestamp are saved as variables. If the page-load timestamp variable is blank, which should be the case if the spambot uses any other page to populate the comment, the script will die. The page-load timestamp is then subtracted from the comment-submission timestamp. If the comment was submitted less than 60 seconds after the post page was loaded, the script dies with a descriptive error message. Hopefully, this will separate the bots’ comments from those left by thoughtful human visitors who have taken the time to read your post. If a human visitor does happen to submit a comment within 60 seconds of the page loading, he or she can click his or her browser’s back button and try resubmitting the comment again in a few seconds.

One drawback is that this method does involve editing a core file – wp-comments-post.php. You’ll have to re-edit it each time you upgrade WordPress, which is a nuisance, I know. The good thing is that if you forget to do this, people can still comment – you just won’t have the anti-spam protection.

Note that the instructions in the following steps are based on the code in WordPress version 2.3 and the Kubrick theme included with that release. You may need to adjust for your version of WordPress.

Step 1 – Add the hidden timestamp input field to the comment form

Open the comments.php file in your current theme’s folder and find the following lines:

<p><textarea name="comment" id="comment" cols="100%" rows="10" tabindex="4"></textarea></p>

<p><input name="submit" type="submit" id="submit" tabindex="5" value="Submit Comment" />

Add the following line between them:

<p><input type="hidden" name="timestamp" id="timestamp" value="<?php echo time(); ?>" size="22" /></p>

Step 2 – Modify the wp-comments-post.php file to create the second timestamp and perform the comparison

Open wp-comments-post.php and find the lines:

$comment_author       = trim(strip_tags($_POST['author']));
$comment_author_email = trim($_POST['email']);
$comment_author_url   = trim($_POST['url']);
$comment_content      = trim($_POST['comment']);

Immediately after them, add the following lines:

$comment_timestamp    = trim($_POST['timestamp']);
$submitted_timestamp  = time();

if ( $comment_timestamp == '' )
	wp_die( __('Hello, spam bot!') );
	
if ( $submitted_timestamp - $comment_timestamp < 60 )
	wp_die( __('Error: you must wait at least 1 minute before posting a comment.') );

That’s it; you’re done.

Credits

Thanks to Jonathan Bailey for suggesting the handshake in his post at http://www.plagiarismtoday.com/2007/07/24/wordpress-and-comment-spam/.

A collection of PHP code snippets

This is a collection of php code snippets that seem to come in handy rather often. They are assembled here more for my own organization than anything else.

String: trim and convert to lowercase

A very straightforward but useful snippet. A string is first trimmed of any leading or trailing white space, and then converted to lowercase letters. Good for normalizing user input.

<?php
$string = "Orange";
$string = strtolower(trim($string));
echo $string;
?>

String: truncate and break at word

This will attempt to shorten a string to $length characters, but will then increase the string (if necessary) to break at the next whole word and then append an ellipses to the end of the string. Good for shortening readable text while keeping it looking pretty.

<?php 
function truncate($string, $length) {
	if (strlen($string) > $length) {
		$pos = strpos($string, " ", $length);
		return substr($string, 0, $pos) . "...";
	}
	
	return $string;
}

	echo truncate('the quick brown fox jumped over the lazy dog', 10);
?>

In the above example, the resultant output will be the quick brown…, because the 10th character is the space immediately before the ‘b’ in ‘brown’, which is counted as part of the word ‘brown’.

What season is it?

Note that this is only a very rough approximation of when a season begins and ends. This snippet would be good for rotating a seasonal background or something, but it’s not astronomically correct, and I wouldn’t use it as a calendar. Reckoning a season is rather complex.

<?php echo "It is day " . date('z') . " of the year. <br />"; ?>
<?php $theday = date('z');
	if($theday >= "79" && $theday <= "171") { 
	$season = "Spring";
	} elseif($theday >= "172" && $theday <= "264") { 
	$season = "Summer";
	} elseif($theday >= "265" && $theday <= "355") { 
	$season = "Autumn";
	} else { 
	$season = "Winter";
	}
	echo "It's " . $season . "!";
?>

Get the number of days since something happened

This function takes a date (formatted as a Unix timestamp) and calculates the number of days since that date. The floor() function shouldn’t really be necessary, but it’s a hold-over from a less accurate function that used only the hours elapsed. In that function, the results would vary depending on the time of day the function was called. In this method, the times are normalized to 12:00:00 AM.

function calc_days_ago($date){
	// The function accepts a date formatted as a Unix timestamp
	
	// First, normalize the current date down to the Unix time at 12:00:00 AM (to the second)
	$now = time() - ( (date('G')*(60*60)) + date('i')*60 + date('s') );
	// Second, normalize the given date down to the Unix time at 12:00:00 AM (to the second)
	$then = $date - ( (date('G', $date)*(60*60)) + date('i', $date)*60 + date('s', $date) );
	$diff = $now - $then;
	$days = floor($diff/(24*60*60));
	switch ($days) {
	case 0:
 		$days_ago = "today";
		break;
	case 1:
		$days_ago = $days . " day ago";
		break;
	default:
		$days_ago = $days . " days ago";
	}
	return $days_ago;
}

Get the hours and minutes remaining until something happens

This function takes a time (formatted as a Unix timestamp) and calculates the number of hours and minutes remaining until that time. If the time has already passed, the function returns “historical”. Example outputs would be “7 hours”, “6 hours and 34 minutes”, and “12 minutes”. It could probably be made even more accurate if you changed it to use 3 decimal places and then round to 2 decimal places, but this is good enough for my purposes.

function calc_time_left($date){
	// The function accepts a date formatted as a Unix timestamp

	$now = time();
	$event = $date;
	if ($event >= $now) {
		$diff = $event - $now;
		$unroundedhours = $diff/(60*60);
		// Find the hours, if any, and assemble a string
		$hours = floor($unroundedhours);
		if ($hours > "0") {
			$hourtext = ($hours == "1")? " hour" : " hours";
			$thehours = $hours . $hourtext;
		}else{
			$thehours = "";
		}
		// Find the minutes, if any, and assemble a string
		if (strpos($unroundedhours, '.')) {
			$pos = strpos($unroundedhours, '.') + 1;
			$remainder = substr($unroundedhours, $pos, 2);
			$minutes = floor($remainder * .6);
			$minutetext = ($minutes == "1")? " minute" : " minutes";
			$theminutes = $minutes . $minutetext;
		}elseif ($minutes == "0") {
			$theminutes = "";
		}else{
			$theminutes = "";
		}
		if ($thehours && $theminutes) {
			$sep = " and ";
		}
		$timeleft = $thehours . $sep . $theminutes;
	}else{
		$timeleft = "historical";
	}
	return $timeleft;
}

Get the path of the containing directory

This one really comes in handy. It will give you the URL of the folder where the executing script resides, so you can reference the full path to other files in that folder, no matter where the folder may be located. It works on both Linux and Windows servers, and it adds a trailing slash to the path if one doesn’t already exist, so that root looks the same as a subfolder.

<?php 
function get_path() {
	// Get the path of the folder where the executing script resides, with the trailing slash
	
	// Determine HTTPS or HTTP
	$url = (isset($_SERVER['HTTPS']) && $_SERVER['HTTPS'] == 'on') ? 'https://' : 'http://';
	$url .= $_SERVER['HTTP_HOST'] . dirname($_SERVER['PHP_SELF']);
	// Convert the trailing backslash (on Windows root) to a forward slash
	$url = str_replace('\\', '/', $url);
	// Determine whether the current location is root by looking for a trailing slash (Windows or Linux)
	if (strlen($url) != strrpos($url, '/') +1) {
		$url .= '/';
	}
	return $url;
}
?>

Centering unordered list items

I wrote this script because I wanted to center the thumbnails in the Plogger image gallery while still using an unordered list item to contain each thumbnail. The script figures out how many thumbnails exist on a page and how many will fit in the space provided, then adds sufficient left padding to each to give the appearance of them being centered. It can be easily adapted for other uses. The full explanation and code example is at //ardamis.com/2007/08/05/centering-the-thumbnails-in-plogger/.

Parse .html as .php (Apache .htaccess)

This isn’t actually a PHP script, but it’s still handy. If you need to write pages with a .html or .htm extension but still want to use PHP in those pages, adding the following line to your .htaccess file will force an Apache server to parse .html files as .php files. I have confirmed this to work with GoDaddy’s hosting (GoDaddy runs PHP as CGI).

AddHandler x-httpd-php .php .htm .html

If you are running an Apache server as part of an XAMPP installation on top of Windows, try using this instead:

AddType application/x-httpd-php .html .htm

Fixing warnings in the WordPress Sociable plugin

This plugin is once again actively supported. Please download the latest version from http://wordpress.org/extend/plugins/sociable/.

I’ve fixed some errors that I was experiencing with version 2.0 (dated 2007-02-02) of the Sociable WordPress plugin by Peter Harkins. Specifically, when running WordPress 2.2+, I would get the following warnings when saving changes:

Warning: implode() [function.implode]: Bad arguments. in PATH\wp-content\plugins\sociable\sociable.php on line 651

Warning: Invalid argument supplied for foreach() in PATH\wp-content\plugins\sociable\sociable.php on line 762

For each button, I’d get another error:

Warning: in_array() [function.in-array]: Wrong datatype for second argument in PATH\wp-content\plugins\sociable\sociable.php on line 797

I think I’ve narrowed the cause down to the way the plugin was writing the newly selected options to the database.

In addition to fixing those warnings, I also corrected a few behaviors:

The StumbleUpon button didn’t send the URL to the correct address at http://www.stumbleupon.com/. The button now works correctly, and also sends the page’s title.
In Options -> Sociable, leaving the field for “text displayed in front of the icons” blank now results in no extraneous code being inserted into the page. Leaving the field blank also eliminates the popup “These icons link to social bookmarking sites…” text.
The links to the social networking sites are now all rel="nofollow".

Defeating contact form spam by hiding the webmail script

My clients and I have been receiving increasing amounts of spam sent through our own contact forms. Not being a spammer myself, I’m left to speculate on how one sends spam through a webmail form, but I’ve come up with two ways of preventing it from happening. Both of these methods involve editing the contact form’s HTML and adding a JavaScript file. They also require that legitimate users of the contact form have DOM-compliant browsers with JavaScript enabled.

Defeating human-like robots

For a very long time, I suspected that the spammers’ bots were filling out and submitting forms just like regular human visitors. They would look for input fields with labels like ‘name’ and ’email’, and, of course, for textarea elements. The bots would enter values into the fields and hit the submit button and move on to the next form.

To combat this, one could institute a challenge-response test in the form of a question that must be correctly answered before the form is submitted. Eric Meyer wrote a very inspiring piece at WP-Gatekeeper on the use of easily human-comprehensible challenge questions like “What is Eric’s first name?” as a way to defeat spambots. There are a number of accessibility concerns and limitations with this method, mostly with respect to choosing a challenge question that any human being (of any mental or physical capacity, speaking any language, etc.) could answer, but that a robot would be unable to recognize as a challenge question or be unable to correctly answer. However, these issues also exist with the CAPTCHA method.

In this case, the challenge question will be What color is an orange? If answered correctly, the form is submitted. If answered incorrectly, the user is prompted to try again.

Here’s how to implement a challenge question method of form validation:

First, create a JavaScript file named ‘validate.js’ with the following lines:

function validateForm()
{
    valid = true;

    if ( document.getElementById('verify').value != "orange" )
    {
        alert ( "You must answer the 'orange' question to submit this form." );
		document.getElementById('verify').value = "";
		document.getElementById('verify').focus();
		valid = false;
    }

    return valid;
}

This script gets the value of the input field with an ID of ‘verify’ and if the value is not the word ‘orange’, the script returns ‘false’ and doesn’t allow the form to post. Instead, it pops up a helpful alert, erases the contents of the ‘verify’ field, and sets the cursor at the beginning of the field.

Add the JavaScript to your HTML with something like:

<head>
...
<script src="validate.js" type="text/javascript"></script>
...
</head>

Next, modify the form to call the function with an onSubmit event. This event will be triggered when the form’s Submit button is activated. Add an input field with the ID ‘verify’ and an onChange event to convert the value to lowercase. Add the actual challenge question as a label.

<form id="contactform" action="../webmail.php" method="post" onsubmit="return validateForm();">
...
	<div><input type="text" name="verify" id="verify" value="" size="22" tabindex="1" onchange="javascript:this.value=this.value.toLowerCase();" /></div>
	<label for="verify">What color is an orange?</label>
...
</form>

A visitor to the site who fills out the form but does not correctly answer the challenge question will not be able to submit the form.

Defeating non-human-like robots

I believe that the challenge-response method is becoming less effective, however. According the article ‘Spamming You Through Your Own Forms‘ by Will Bontrager, the spammers’ bots are not using the form as it is intended.

This is what appears to be happening: Spammers’ robots are crawling the web looking for forms. When the robot finds a form:

It makes a note of the form field names and types.

It makes a note of the form action= URL, converting it into an absolute URL if needed.

It then sends the information home where a database is updated.

Dedicated software uses the database information to insert the spammer’s spew into your form and automatically submit it to you.

His response is to stop the process at step 2 by eliminating the bots’ access to the webmail script. He suggests doing this by hiding the URL of the webmail script in an external JavaScript file, then using JavaScript to delay the writing of the form’s action attribute for a moment. The robots parsing just the page’s HTML never locate the URL to the webmail script, so it is never available for the spammers to exploit.

While I like the idea, I think I’ve come up with a better way of implementing it.

First, rename the webmail script, because the spammers already know the name and location of that script. For example, if GoDaddy is your host, contact forms on your site may be handled by ‘gdform.php’, located in the server root. You’ll need to rename that to something else. For purposes of illustration, I’ll rename the script ‘safemail.php’, but a string of random hexadecimal characters would be even better.

Next, give your contact form an ID. If you are running WordPress or other blogging software, be sure to give the contact form a different ID than the comment form, or else the JavaScript will cause the comment form to post to the webmail script. I’ll give my contact form the ID ‘contactform’.

<form id="contactform" action="../gdform.php" method="post">

We want to prevent the spammers from learning about the newly renamed script. This is done by giving the URL to a fake webmail script as the form’s action attribute and using JavaScript to change the action attribute of the form to the real webmail script only after some user interaction has occurred. I’ll use ‘no-javascript.php’ as my fake script.

To accommodate visitors who aren’t using JavaScript, the fake script could instead be a page explaining that JavaScript is required to submit the contact form and offering an alternate way to contact the author.

Edit the contact form’s action attribute to point to the fake script.

<form id="contactform" action="no-javascript.php" method="post">

Create a new, external JavaScript file called ‘protect.js’, with the following lines:

function formProtect() {
	document.getElementById("contactform").setAttribute("action","safemail.php");
}

The function formProtect, when called, finds the HTML element with ID ‘contactform’ and changes its ‘action’ attribute to ‘safemail.php’. Obviously, one could make this script more complex and potentially more difficult for spammers to parse through the use of variables, but I don’t see that as necessary at this point.

Add the JavaScript to your HTML with something like:

<head>
...
<script src="formprotect.js" type="text/javascript"></script>
...
</head>

Finally, call the script at some point during the process of filling out the form. Exactly how you want to do this is up to you, and it’ll be effective longer if you don’t share how you do it. Perhaps the most straight-forward way would be to call the script at the point of submission by adding onsubmit="return formProtect();" to the <form> element.

<form id="contactform" action="no-javascript.php" method="post" onsubmit="return formProtect();">

If you want to use both the challenge question and the action rewriting functions, you may want to combine them into a single file or trigger formProtect separately with an event on one of the required input fields. If you decide to trigger formProtect with an event other than onsubmit, consider usability/accessibility issues—not everyone uses a mouse.

In conclusion

By implementing both of these methods, it is possible to dramatically reduce or even completely stop contact form spam. In the two months since I implemented this system, I haven’t received a single spam email from any of my contact forms.

The challenge-response test should deter or at least hinder human spammers and robots that fill out forms as though they were human. The trade-off is some added work for legitimate users of the form.

The action attribute rewriting method should immediately eliminate all spam sent directly to your form by spammers who have the URL of your webmail script in their databases. It should also prevent the rediscovery of the URL. Visitors with JavaScript enabled won’t be aware of the anti-spam measures.

For WordPress users

Defeating WordPress comment spam explains how to apply the attribute rewriting method to your WordPress site.

A plugin for adding the post date to wp_get_archives

The WordPress function wp_get_archives(‘type=postbypost’) displays a lovely list of posts, but won’t show the date of each post. This plugin adds each post’s date to those ‘postbypost’ lists, like so:

Add dates to wp_get_archives

Usage

Upload and activate the plugin
Edit your theme, replacing wp_get_archives('type=postbypost') with if (function_exists('ard_get_archives')) ard_get_archives();

The function ard_get_archives(); replaces wp_get_archives('type=postbypost'), meaning you don’t need to specify type=postbypost. You can use all of the wp_get_archives() parameters except ‘type’ and ‘show_post_count’ (limit, format, before, and after). In addition, there’s a new parameter: show_post_date, that you can use to hide the date, but the plugin will show the date by default.

show_post_date
(boolean) Display date of posts in an archive (1 – true) or do not (0 – false). For use with ard_get_archives(). Defaults to 1 (true).

Customizing the date

By default, the plugin displays the date as “(MM/DD/YYYY)”, but you can change this to use any standard PHP date characters by editing the plugin at the line:

$arc_date = date('m/d/Y', strtotime($arcresult->post_date));  // new

The date is wrapped in tags, so you can style the date independently of the link.

How does it work?

The plugin replaces the ‘postbypost’ part of the function wp_get_archives, and adds the date to $before. The relevant code is below. You can compare it to the corresponding lines in general-template.php.

	} elseif ( ( 'postbypost' == $type ) || ('alpha' == $type) ) {
		('alpha' == $type) ? $orderby = "post_title ASC " : $orderby = "post_date DESC ";
		$arcresults = $wpdb->get_results("SELECT * FROM $wpdb->posts $join $where ORDER BY $orderby $limit");
		if ( $arcresults ) {
			$beforebefore = $before;  // new
			foreach ( $arcresults as $arcresult ) {
				if ( $arcresult->post_date != '0000-00-00 00:00:00' ) {
					$url  = get_permalink($arcresult);
					$arc_title = $arcresult->post_title;
					$arc_date = date('m/d/Y', strtotime($arcresult->post_date));  // new
					if ( $show_post_date )  // new
						$before = $beforebefore . '<span class="recentdate">' . $arc_date . '</span>';  // new
					if ( $arc_title )
						$text = strip_tags(apply_filters('the_title', $arc_title));
					else
						$text = $arcresult->ID;
					echo get_archives_link($url, $text, $format, $before, $after);
				}
			}
		}
	}

The lines ending in ‘// new’ are the only changes.

So you want the date to appear after the title? Edit the plugin to modify $after, instead:

	} elseif ( ( 'postbypost' == $type ) || ('alpha' == $type) ) {
		('alpha' == $type) ? $orderby = "post_title ASC " : $orderby = "post_date DESC ";
		$arcresults = $wpdb->get_results("SELECT * FROM $wpdb->posts $join $where ORDER BY $orderby $limit");
		if ( $arcresults ) {
			$afterafter = $after;  // new
			foreach ( $arcresults as $arcresult ) {
				if ( $arcresult->post_date != '0000-00-00 00:00:00' ) {
					$url  = get_permalink($arcresult);
					$arc_title = $arcresult->post_title;
					$arc_date = date('j F Y', strtotime($arcresult->post_date));  // new
					if ( $show_post_date )  // new
						$after = '&nbsp;(' . $arc_date . ')' . $afterafter;  // new
					if ( $arc_title )
						$text = strip_tags(apply_filters('the_title', $arc_title));
					else
						$text = $arcresult->ID;
					echo get_archives_link($url, $text, $format, $before, $after);
				}
			}
		}
	}

Download

Get the files here: (Current version: 0.1 beta)

Download the Ardamis DateMe WordPress Plugin

Apricot – A Minimalist WordPress Theme

Apricot is a text-heavy and graphic-light, widget- and tag-supporting minimalist WordPress theme built on a Kubrick foundation. Apricot validates as XHTML 1.0 Strict and uses valid CSS. It natively supports the excellent Other Posts From Cat and the_excerpt Reloaded plugins, should you want to install them.

WordPress version 2.3 introduces native support for ‘tags’, a method of organizing posts according to key words. Apricot has been updated to use this native tag system. The tag cloud will appear in the sidebar and the tags for each post appear above the meta data.

I used Apricot on this site for over a year, making little tweaks and adjustments the whole time, so the theme is pretty thoroughly tested in a variety of different browsers and resolutions. While the markup is derived from the WordPress default theme, Kubrick, I’ve added a few modifications of my own. I’ve listed some of these changes below.

header.php

Title tag reconfigured to display “Page Title | Site Name”

single.php

Post title is now wrapped in H1 tags
Metadata shows when the post was last modified (if ever)
Added links to social bookmarking/blog indexing sites: Del.icio.us, Digg, Furl, Google Bookmarks, and Technorati
I’ve published a fix for the Sociable plugin, which I’m now using instead of hard-coded links
If the Other Posts From Cat plugin is active, the theme will use it
Comments by the post’s author can be styled independently

page.php

Displays the page’s last modified date (instead of date of publication)

index.php

Displays the full text of the latest post and an excerpt from each of the next nine most recent posts
Native support for the_excerpt Reloaded plugin, if active

sidebar.php

Displays tag cloud, if tags are enabled

search.php

If no results found, displays the site’s most recent five posts

404.php

Displays the site’s most recent five posts

footer.php

Archive and index page titles + blog name wrapped in H1 tags

Screen shot

Search engine optimization

Apricot takes care of most of the on-page factors that Google values highly. It places the post’s title at the beginning of the title tag and in a H1 tag near the top of the page. It is free of extraneous markup and the navigation is easily spiderable. It generates what I think is a pretty logical site structure from the various post and category pages, though I have yet to study the effect of the new tagging system.

I’ve had a few top-ranked pages with this and other structurally similar layouts. Your mileage with the search engines may vary, but the layout uses fundamentally sound structural markup, which should give your site a good start.

Download

Download the theme from http://wordpress.org/extend/themes/apricot or from the link below.

Download the Apricot WordPress Theme

What if I want to use an image as a header?

Lots of people would rather use a graphic as a header, including me, but the WordPress guys insist on each theme uploaded to http://wordpress.org/extend/themes/ display the blog title and tag line.

If you want to replace the blog title and tag line with an image, download this zip file and follow these instructions (also included in readme.txt).

1. Make a PNG image, name it “header.png” and upload it to the /wp-content/themes/apricot/images/ folder. It should be 800px wide by 130px tall, or less.

2. Replace the original Apricot theme’s header.php file with the header.php file from this folder.

Download the Apricot Image Header

WordPress comments feed in Google’s Supplemental Index

I was tired of seeing the majority of my posts’ comments feeds show up in Google’s Supplemental Index, so I changed all the individual posts’ comments RSS links to rel=”nofollow”. This should at least cause Googlebot to stop passing PageRank through those links, but what I really want is for Googlebot to stop spidering the individual posts’ comment feeds, in hopes that they’ll eventually be removed from the index. To see only those pages of a site that are in the Supplemental Index, use this neat little search feature: site:DOMAIN.com *** -view. For example, to see which pages of Ardamis.com are in the SI, I’d search for: site:ardamis.com *** -view. This is much easier than the old way of scanning all of the indexed pages and picking them out by hand.

To change all the individual posts’ comments feed links to rel=”nofollow”, open ‘wp-includesfeed-functions.php’ and add rel=”nofollow” to line 84 (in WordPress version 2.0.6), as so:

echo "<a href="$url" rel="nofollow">$link_text</a>";

One could use the robots.txt file to disallow Googlebot from all /feed/ directories, but this would also restrict it from the general site’s feed and the all-inclusive /comments/feed/, and I’d like the both of these feeds to continue to be spidered. Another, minor consequence of using robots.txt to restrict Googlebot is that Google Sitemaps will warn you of “URLs restricted by robots.txt”.

To deny all spiders from any /feed/ directory, add the following to your robots.txt file:

User-agent:*
Disallow: /feed/

To deny just Googlebot from any /feed/ directory, use:

User-agent: Googlebot
Disallow: /feed/

For whatever reason, the whole-site comments feed at //ardamis.com/comments/feed/ does not appear among my indexed pages, while the nearly empty individual post feeds are indexed. Also, the general site feed at //ardamis.com/feed/ is in the Supplemental Index. It’s a mystery to me why.

Showing a WordPress post’s ‘last modified’ date

I’ll occasionally return to a post and revise it for improved methodology, test results or whatever. But the ‘posted on’ date always remains the same, even after the post has been updated. I feel that displaying only the ‘posted on’ date could be somewhat confusing, particularly when I state that I’ve updated the post in a comment dated months later. So, in the interest of full disclosure, I have added a few lines of code to the WordPress ‘single.php’ template file to supplement each post’s meta data with the date it was last modified. If the post has never been modified or if it was last modified within 24 hours of the ‘posted on’ date, only the ‘posted on’ date is shown.

Of course, this could be used anywhere inside the WordPress loop, not just in the meta data section. The code I use to show a WordPress post’s last modified date and time is as follows.

The default Kubrick template’s meta data section:

<p class="postmetadata alt">
<small>
This entry was posted 
...
on <?php the_time('l, F jS, Y') ?> 
at <?php the_time() ?>
and is filed under <?php the_category(', ') ?>.
You can follow any responses to 
this entry through the 
<?php comments_rss_link('RSS 2.0'); ?> feed.

The new code, modified to selectively display the last modified date:

<p class="postmetadata alt">
<small>
This entry was posted
...
on <?php the_time('F jS, Y') ?> 
at <?php the_time() ?>
						
<?php $u_time = get_the_time('U'); 
$u_modified_time = get_the_modified_time('U'); 
if ($u_modified_time >= $u_time + 86400) { 
echo "and last modified on "; 
the_modified_time('F jS, Y'); 
echo " at "; 
the_modified_time(); 
echo ", "; } ?>
	
and is filed under <?php the_category(', ') ?>.
You can follow any responses to 
this entry through the 
<?php comments_rss_link('RSS 2.0'); ?> feed.

You can see how this works in the meta data section of this post.

Further customization

I’m using a grace period of 24 hours from the time the post was published, but you could change this by replacing 86400 with however much time you want, specified in seconds.

A WordPress Plugin for Title Case Capitalization

I’ve written a WordPress plugin that will convert the page title and post title to ‘title case’ capitalization. Title case is also often referred to as “headline style”, and incorrectly as “initial caps” or “init caps”. Title case means that the first letter of each word is capitalized, except for certain small words, such as articles, coordinating conjunctions, and short prepositions. The first and last words in the title are always capitalized.

This plugin may be useful if you’re trying to give the titles on your site a consistent appearance, but it’s no substitute for writing a good title. There are way too many exceptions and rules to make a simple script behave correctly all of the time.

The plugin is smart enough to not capitalize the following:

Coordinating conjunctions (and, but, or, nor, for)
Prepositions of four or fewer letters (with, to, for, at, and so on) (limited)
Articles (a, an, the) unless the article is the first word in the title

But the plugin isn’t perfect. It won’t capitalize an article that is the last word in the title. It fails on subordinating conjunctions. It conservatively de-capitalizes only some of the prepositions, hopefully reducing the chance of incorrect behavior. For example, it leaves the word over caps, because over can be an adverb, an adjective, a noun, or a verb (caps) or a preposition (not caps), and determining how a word is being used in a title is really beyond the scope of a humble plugin.

The plugin requires you to edit it for certain product names, like “iPod”, and cool-people names, like “Olivia d’Abo” or “Jimmy McNulty”. It’s not savvy enough to know that acronyms, like “HTML”, should be all caps unless they’re used in particular ways, such as in the case of “Using the .html Suffix”, unless you tell it. That said, editing the plugin for these particular words is very easy.

Even with all these limitations, it beats using CSS to {text-transform: capitalize} the titles or just applying PHP’s ucwords() to the entire thing. But I’m guessing that dissatisfaction with one or both of those two methods is what brought you to this page in the first place.

On the upside, it capitalizes any word following a semicolon or a colon, e.g.: “Apollo: A Retrospective Analysis”. It also capitalizes any word immediately preceded by a double or single quote, but only if you haven’t bypassed WordPress’s fancy quotes feature.

How it works

The plugin first finds all words that begin with a double or single fancy WordPress quote and adds a space behind the quote. It capitalizes all of the words in the title with ucwords(), then selectively de-capitalizes some of the words using preg_replace(). It then uses str_ireplace(), a case-insensitive string replace function, to correct the odd capitalization of certain other words. Finally, it removes the spaces behind the quotes.

The code

This is what the code looks like. It should be pretty easy to follow what’s happening.

<?php

function ardamis_titlecase($title) {
		$title = preg_replace("/&#8220;/", '&#8220; ', $title); // find double quotes and add a space behind each instance
 		$title = preg_replace("/&#8216;/", '&#8216; ', $title); // find single quotes and add a space behind each instance
		$title = preg_replace("/(?<=(?<!:|;)W)(A|An|And|At|But|By|Else|For|From|If|In|Into|Nor|Of|On|Or|The|To|With)(?=W)/e", 
'strtolower("$1")', ucwords($title));  // de-capitalize certain words unless they follow a colon or semicolon
		$specialwords = array("iPod", "iMovie", "iTunes", "iPhone", " HTML", ".html", " PHP", ".php"); // form a list of specially treated words
		$title = str_ireplace($specialwords, $specialwords, $title); // replace the specially treated words
		$title = preg_replace("/&#8220; /", '&#8220;', $title); // remove the space behind double quotes
		$title = preg_replace("/&#8216; /", '&#8216;', $title); // remove the space behind single quotes

		return $title;
}

add_filter('wp_title', 'ardamis_titlecase');
add_filter('the_title', 'ardamis_titlecase');

?>

Download

Download the plugin, upload it to your site, and activate it.

Download the Title Case Capitalization WordPress plugin

Further customization

The plugin won’t alter words written in all caps or CamelCase. You could use ucwords(strtolower($title)) to convert the entire $title to lowercase before applying ‘ucwords’. This may fix instances where someone has typed in a bunch of titles with the caps lock key on. But you’ll then have to compensate for words that should be all caps, like ‘HTML’, ‘NBC’, or ‘WoW’, in $specialwords.

An alternative using a ‘foreach’ loop

It’s possible to do something similar using a foreach loop. This isn’t as graceful, in my opinion, but I suppose it’s possible that someone may find it works better.

<?php

function ardamis_titlecase($title) {
	$donotcap = array('a','an','and','at','but','by','else','for','from','if','in','into','nor','of','on','or','the','to','with'); 
	// Split the string into separate words 
	$words = explode(' ', $title); 
	foreach ($words as $key => $word) { 
		// Capitalize all but the $donotcap words and the first word in the title
		if ($key == 0 || !in_array($word, $donotcap)) $words[$key] = ucwords($word); 
		if (preg_match("/^&#8220;/", $word))
			$words[$key] = '&#8220;' . ucwords(substr($word, 7));
		elseif (preg_match("/^&#8216;/", $word))
			$words[$key] = '&#8216;' . ucwords(substr($word, 7));
	} 
	// Join the words back into a string 
	$newtitle = implode(' ', $words); 
	return $newtitle; 
}

add_filter('wp_title', 'ardamis_titlecase');
add_filter('the_title', 'ardamis_titlecase');

?>

Credits

Thanks to Chris for insight into the preg_replace code at http://us2.php.net/ucwords. Thanks to Thomas Rutter for insight into the foreach code at SitePoint Blogs » Title Case in PHP.

External Files Causing a WordPress 404 Error?

I needed to find a satisfactory way of adding WordPress tags and theme elements (such as the sidebar) to pages that exist outside of WordPress. A non-WordPress page could then appear to be seemlessly incorporated into the site, wherein the layout automatically updates with changes to the theme template files, and could use the same header, sidebar, and footer as a normal WordPress page.

The first few solutions that I found involved adding a line to each non-WordPress page. This does indeed allow the page to incorporate WordPress tags, theme elements and styles, but there is a serious drawback to this method because of the way WP manages a web site.

When you click on the link to a WP page, or enter it into the address bar, you aren’t actually going to a file that resides at that address. Instead, WP uses that address as an instruction to pull various database entries and form an index.php page that resides in the WP installation directory. For example, while the address for this page appears to be https://ardamis.com/2006/07/10/wordpress-googlebot-404-error/ , the actual page is at https://ardamis.com/index.php.

WordPress assumes that it is responsible for every page on the site, and that there are no pages on the site that it doesn’t know about. If you try to browse to a page that WP doesn’t know about, it sends you a 404 error page instead. This is what you want it to do, so long as you don’t create any pages outside of WordPress.

But a problem arises when you do want to create a page that WP doesn’t know about. When you visit the page, WP checks the database and finds that it doesn’t exist, so it puts a 404 error in the http header, but the page does exist, so Apache sends the page with the 404 error. This behavior seemed to cause some problems with some versions of IE but none with Firefox. It did, however, result in a 404 header being given to Googlebot, so that non-WordPress pages would incorrectly show up in Google Sitemaps as Not Found.

To get around this problem and send the correct http header code: HTTP Status Code: HTTP/1.1 200 OK, I needed to require a different file, wp-config.php, and then select specific functions for use in the page. results in a page that can use all of the desired tags and theme elements and also sends the correct header code: HTTP Status Code: HTTP/1.1 200 OK

The following code results in a page that can use all of the tags and theme elements (you may need to adjust the path to wp-config.php):

<?php require('./wp-config.php');
$wp->init();
$wp->parse_request();
$wp->query_posts();
$wp->register_globals();
?>

<?php get_header(); ?>

<div id="content">
<div class="post">
<h2>*** Heading Goes Here ***</h2>
<div class="entry">
*** Content in Paragraph Tags Goes Here ***
</div>
</div>
</div>

<?php get_sidebar(); ?>

<?php get_footer(); ?>

Testing the method

Using wp-blog-header.php as the include, I created a GoogleBot/WordPress 404 test page as the index.php file in a /testing/ directory. I added the url https://ardamis.com/testing/ to my Google XML sitemap file, and waited for the sitemap to be downloaded. Sure enough, a few days later, Google Sitemaps was listing the /testing/ url among the Not Found errors.

The next step was to remove what I suspected was the culprit, the include of the WordPress header, wp-blog-header.php, and see if Googlebot could again access the page. A few days after removing the include, and after the sitemap was downloaded, the Not Found error disappeared. I’m interpreting this as Googlebot once again successfully crawling the page.

The third step was to use the above code, including wp-config.php and then testing the HTTP Request and Response Header. The header looks ok, and Googlebot likes it. It looks like this does the trick.