Category Archives: Web Site Dev

Posts concerning web site design and development. Examples of PHP, XHTML, and JavaScript code. WordPress-related posts may be cross-categorized here, but also have their own category.

I like Akismet, and it’s undeniably effective in stopping the vast majority of spam, but it adds a huge number of comments to the database and a very small percentage of comments still get through to my moderation queue.

It’s annoying to find comments in my moderation queue, but what I really object to is the thousands of records that are added to the database each month that I don’t see.

In the screenshot below, January through April show very few spam comments being detected by Akismet. This is because I was using my cache-friendly method for reducing WordPress comment spam to block spam comments even before Akismet analyzed them.

Akismet stats

In May, I moved hosting providers to asmallorange.com and started with a fresh install of WordPress without implementing my custom spam method, which admittedly was not ideal because it involved changing core files. This left only Akismet between the spammers and my WordPress database. Since that time, instead of 150 or fewer spam comments per month making it into my WordPress database, Akismet was on pace to let in over 10,000.

So, in the spirit of fresh starts and doing things the right way, I created a WordPress plug-in that uses the same timestamp method. It’s actually exactly the same JavaScript and PHP code, just in plug-in form, so it’s not bound to any core files or theme files.

I am looking for cheap, shared web hosting because the downtime and customer support I’m getting with JustHost are intolerable (and well documented). I’m willing to pay about $10/month for hosting, and I don’t ask for much other than reasonable uptime. I have a blog (ardamis.com) that gets about 1000 visits per day, a few other sites that barely get any traffic, and a small photo gallery that I share only with family and friends. I don’t stream or make available for download any video or audio files, but I do want to be able to upload all of my personal photos and use my web host as an off-site backup. I have an account with Dropbox, but I have gigs of photos, so I would have to purchase extra storage, and I kinda want to keep the Dropbox stuff separate. I’ve also considered using Google Drive, but I have enough photos that I would need to purchase additional space.

I’m currently paying $108/year for hosting and an additional $19.99/year for a dedicated IP address. I’m willing to pay a little more, but not terribly much more.

Before JustHost, I was a GoDaddy customer for something like 5 years, and there was almost no downtime. Customer service was incredibly, surprisingly good and the reps were always knowledgeable and effective. I left because I didn’t like the proprietary admin panel that GoDaddy has developed and wanted to keep my domain registrar and my web host at separate companies. But really, there was nothing wrong with GoDaddy’s hosting at the time, that I could tell.

The problem with researching new web hosts is that unbiased information (if it’s out there) is buried under tons of completely untrustworthy garbage sites that sell reviews and rankings. If you look for personal recommendations based on experiences with multiple hosts, it is extremely hard to tell from the obscure bulletin board threads whether the posts come from shills or actual customers. Even if one finds posts that appear to be from genuine customers, their descriptions of their experiences are usually subjective, anecdotal, and not comparative.

(As a little aside, BlueHost, HostGator, HostMonster, and JustHost are all owned by the same corporate parent, Endurance International Group. An outage affecting one can affect them all, as described in the Mashable article: Bluehost, HostGator and HostMonster Go Down. So, take these things into consideration if you are trying to choose between them.)

This post, then, is just my notes on what I’ve found while researching my next web host.

No Host

One option would be to not go with a hosting company at all and just use Amazon EC2 to self-host my blog. I have seriously considered doing this, but by some accounts, it’s actually more expensive to run an EC2 instance than to purchase shared hosting. And I don’t really want to become my own Linux administrator. As much as I enjoy occasionally tinkering with Ubuntu, I just want my sites to run smoothly and for someone to keep the server up-to-date and to fix problems for me quickly if something breaks.

Bluehost

http://www.bluehost.com/

Everyone has heard of Bluehost. They’re huge. Their tagline is “Trusted by Millions as the Best Web Hosting Solution”. So there you have it. Maybe you want to be one of their millions of customers.

According to the builtwith.com profile for ardamis.com, the site’s hosting provider is not JustHost but BlueHost!

Their website has terribly low production values, for being such a huge company. They have a crappy stock photograph of a dude with a headset providing customer support, below the text “We specialize in customer service. Call or Chat!” I do not want to live chat with this dude, or anyone else. I probably don’t need any customer service from my hosting company unless they screw something up, so prominently featuring their support phone numbers makes me wonder if they get tons of calls.

They do have cPanel, which is nice, as I am wary of proprietary admin panels after using GoDaddy for years.

They also offer unlimited domains, unlimited storage, unlimited bandwidth, unlimited email accounts, and a free domain. They also offer custom php.ini files and SSH access. And they do all of this for just $6.95/month for 1 year.

It actually sounds pretty decent, on paper. But one does get the sense that they are completely driven by the bottom line, and that your site is going to be jammed into an already crowded server.

HostGator

http://www.hostgator.com/

HostGator is the other huge discount web hosting company, competing with BlueHost and GoDaddy for what I picture to be the same sort of confused customers or WordPress blog/Google Adsense scammers.

Like with BlueHost, I’m immediately turned off by the website, which is just ugly as all get out and also conspicuously promotes Live Chat support with a stock photograph of a girl with a (wired) headset. The pricing for the hosting plans is also a bit misleading, as all of the pricing shown is discounted 20%, and that discount is only valid for the first invoice.

Another thing I really dislike is that there are three tiers of shared hosting, named “Hatchling Plan”, “Baby Plan”, and “Business Plan”. I just can’t see myself signing up for the ridiculously named “Hatchling Plan” or “Baby Plan”. Which is probably by design so that people like me upgrade to the more respectable and grown-up sounding “Business Plan”.

They do offer a single free website transfer for shared hosting, whether or not the site being transferred uses cPanel.

Lithium Hosting

http://www.lithiumhosting.com/

I first heard about Lithium Hosting while reading the Ars Technica article How to set up a safe and secure Web server, which also mentioned A Small Orange.

Their site looks pretty good, but it’s a little bit too template-like, and I felt that a quick glance on themeforest.net would turn up about a dozen hosting reseller templates for which the layout and stock placeholder text is an almost exact match. For example, the tagline on their home page is “Why we’re not the best host in the world.” Yeah, that sort of false modesty smells just a bit too contrived here. I was almost ready to sign up with Lithium Hosting, but the cookie-cutter stuff gave me enough pause to keep looking.

One of the things that I didn’t like is that they offer a free month coupon code, but it doesn’t work when you configure your cart to pay a year-at-a-time. Another thing I didn’t like was the $36/year cost of a dedicated IP address – $5 to set up and then $3 per month thereafter. It seems like the price on dedicated IP addresses has dramatically increased in the months since all the breathless news reports about the world running out of IPv4 addresses. They also charge an extra $60/year for shell access via SSH. That’s pretty shameful.

Bailing out of the cart just before purchase doesn’t cause a window to pop up and lure you back with a discount, as I fully expected to happen.

I checked out their Facebook page and the most recent post was from a guy whose website was down. Two other recent posts were about downtime.

At the end of the day, Lithium Hosting just seems more like a hosting reseller itself than a company that will be around for years.

A Small Orange

http://asmallorange.com/

I first heard about A Small Orange while reading the same Ars Technica article, How to set up a safe and secure Web server, that mentioned Lithium Hosting.

My first impression of the site was that it looked pretty much exactly as I wanted my hosting company to look. Again, I’m being pretty superficial here, but I want to be happy with my choice and if the company’s outward appearance is cruddy then I’m not going to be as satisfied. I want to be certain that I made the best choice, and a crappy, thrown-together-looking site injects a small amount of doubt. The site clearly sets out the costs of the different hosting plans, and I knew I would be looking at shared hosting first. There are five tiers of shared hosting, the least expensive being $35/year for 250 MB storage and 5 GB bandwidth. Basically, the only differentiating factor between the different levels of shared hosting is the amount of storage and bandwidth. The standard features for all shared hosting accounts include:

  • Unlimited Parked and Addon Domains
  • Unlimited Subdomains
  • Unlimited POP3/IMAP Mail Accounts
  • Unlimited MySQL Databases
  • Unlimited FTP Accounts
  • Automated Daily Backups
  • cPanel
  • Automatic Script Installation (Softaculous)
  • Jailed shell upon request
  • FTP and SFTP access
  • Cron jobs for scheduled tasks
  • 99.9% uptime guarantee

A Small Orange uses CloudLinux as the server OS.

OK, yes, there is still a Live Chat button, but it’s this inconspicuous little green tab at the left side of the window that states “live help”. That’s it.

I checked out their Facebook page for recent posts, and they were almost entirely positive, with a good amount of interaction from the admin. The page claims that the company is home to 45,000 web sites, which might be exactly what I’m looking for, without even knowing it.

One of the things that ultimately convinced me to go with them was the Inc.com write-up of CEO Douglas Hanna in America’s Coolest College Start-Ups 2012 and the Duke Chronicle article Hanna makes juicy profits with A Small Orange. Hanna worked at HostGator for two years in customer service, so one could reasonably expect that he knows what customers want in affordable hosting.

I have also found a ton of coupon codes for A Small Orange (just Google it) for either $5 or 15% off your order.

saveme$5
saveme15%
save_$5
save_15%

November was a rough month for ardamis.com. What are JustHost’s thoughts on uptime?

The term uptime refers to the amount of time that the website will be accessible. It is important to remember that unforeseen events do occur and that uptime guarantees are not written in stone. That being said, however, any established web hosting provider worthy of your business will strive to guarantee no less than 99.5% uptime.

http://www.justhost.com/web-hosting-articles/2010/12/06/the-uptime-guarantee/

Here’s my Pingdom Monthly Report for 2012-11-01 to 2012-11-30 for ardamis.com. Boy, those 34 outages for a total of 6 hours and 45 minutes (0.94%) sure feel like a lot of downtime.

Uptime    Outages    Response time
99.06%    34         1665 ms

Downtimes

From To Downtime
2012-11-02 06:14:08 2012-11-02 06:29:08 0h 15m 00s
2012-11-02 07:19:08 2012-11-02 07:44:10 0h 25m 02s
2012-11-05 22:54:09 2012-11-05 22:59:08 0h 04m 59s
2012-11-05 23:09:08 2012-11-05 23:19:08 0h 10m 00s
2012-11-07 14:24:09 2012-11-07 14:34:08 0h 09m 59s
2012-11-10 11:49:08 2012-11-10 11:54:08 0h 05m 00s
2012-11-10 12:09:08 2012-11-10 12:14:08 0h 05m 00s
2012-11-10 13:44:08 2012-11-10 13:49:10 0h 05m 02s
2012-11-10 15:24:08 2012-11-10 15:29:08 0h 05m 00s
2012-11-10 16:24:08 2012-11-10 16:29:09 0h 05m 01s
2012-11-10 16:49:08 2012-11-10 16:54:08 0h 05m 00s
2012-11-10 22:29:08 2012-11-10 22:34:08 0h 05m 00s
2012-11-11 22:34:08 2012-11-11 22:39:08 0h 05m 00s
2012-11-12 17:54:08 2012-11-12 17:59:08 0h 05m 00s
2012-11-17 00:49:08 2012-11-17 02:59:08 2h 10m 00s
2012-11-18 14:19:08 2012-11-18 14:24:08 0h 05m 00s
2012-11-19 03:54:08 2012-11-19 04:04:08 0h 10m 00s
2012-11-23 15:09:08 2012-11-23 15:24:08 0h 15m 00s
2012-11-23 15:44:08 2012-11-23 15:49:08 0h 05m 00s
2012-11-23 16:49:08 2012-11-23 16:54:08 0h 05m 00s
2012-11-26 10:04:08 2012-11-26 10:09:08 0h 05m 00s
2012-11-27 07:54:08 2012-11-27 07:59:08 0h 05m 00s
2012-11-27 15:24:08 2012-11-27 15:29:08 0h 05m 00s
2012-11-27 20:29:08 2012-11-27 20:34:08 0h 05m 00s
2012-11-27 21:34:08 2012-11-27 21:39:08 0h 05m 00s
2012-11-27 22:19:08 2012-11-27 22:24:08 0h 05m 00s
2012-11-27 23:54:08 2012-11-27 23:59:08 0h 05m 00s
2012-11-28 06:14:08 2012-11-28 06:19:08 0h 05m 00s
2012-11-28 06:24:08 2012-11-28 06:49:08 0h 25m 00s
2012-11-28 06:54:08 2012-11-28 06:59:08 0h 05m 00s
2012-11-28 07:04:08 2012-11-28 07:24:08 0h 20m 00s
2012-11-30 06:44:08 2012-11-30 06:49:08 0h 05m 00s
2012-11-30 07:04:08 2012-11-30 07:29:08 0h 25m 00s
2012-11-30 12:04:09 2012-11-30 12:09:08 0h 04m 59s
Copyright © 2012 Pingdom AB

That’s a pretty sad report.
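For what it’s worth, the percentages in the report follow directly from the downtime total. Here is a quick sketch of the arithmetic in JavaScript, assuming a 30-day month and taking the 6 hours 45 minutes figure from the Pingdom report. The helper name uptimePercent is mine and purely illustrative.

```javascript
// Uptime percentage for a period, given total downtime in minutes.
// uptimePercent is a hypothetical helper, not part of any Pingdom API.
function uptimePercent(downtimeMinutes, periodDays) {
    var periodMinutes = periodDays * 24 * 60;
    return 100 * (1 - downtimeMinutes / periodMinutes);
}

var downtime = 6 * 60 + 45;              // 405 minutes, from the report
var november = uptimePercent(downtime, 30);

console.log(november.toFixed(2) + '%');  // 99.06%

// Downtime that a 99.5% guarantee would allow over the same 30 days, in minutes
var allowedMinutes = 30 * 24 * 60 * (1 - 0.995);
console.log(Math.round(allowedMinutes)); // 216
```

So the 405 minutes of downtime is nearly double the roughly 216 minutes that a 99.5% guarantee would allow in a 30-day month.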

I have got to be better about catching my contract before it automatically renews. I’m looking at Lithium Hosting and A Small Orange as replacements, as they seem to be well-regarded by Ars Technica readers.

For years, ardamis.com has had a Google rankings nemesis in ardamis.gr. For much of the time that I’ve spent watching the results for the search phrase ‘ardamis’, ardamis.com has consistently ranked #1, and ardamis.gr typically landed in second or third place. But at some point in 2011, and my recollection is that this was occurring pre-Panda, ardamis.gr moved to the top spot and has stayed there since.

Google search results for ardamis on March 15, 2012

The top 10 results returned for ‘ardamis’ as of March 15, 2012, while not signed in to Google, connecting from Chicago, IL, using IE9:

  1. http://www.ardamis.gr/
  2. http://www.ardamis.gr/index.php?lang=en
  3. //ardamis.com/
  4. //ardamis.com/2005/08/11/xampp-apache-namevirtualhost/
  5. https://twitter.com/#!/ardamis
  6. http://www.tripadvisor.com/Hotel_Review-g285708-d274279-Reviews-Ardamis_Hotel-Monemvasia_Peloponnese.html
  7. https://github.com/ardamis
  8. http://www.greeka.com/peloponnese/monemvasia/hotels/monemvasia-ardamis/index.html
  9. http://www.linkedin.com/company/ardamis
  10. http://www.facebook.com/pages/ardamis/90788288272

I can’t really explain why a post from 2005 on configuring a setting in Apache would be the second best page on the site, but I guess I’ll take it. My properties do pretty well, for what isn’t a highly competitive phrase. Items related either to ardamis.com or me personally appear in positions 3, 4, 5, 7, 9, 10.

I’ve done some comparing of these two domains, and I am still unsure why Google is currently favoring ardamis.gr.

Location

I felt pretty confident that geography and Google’s focus on local search would mean that North American users would be returned results that favored ardamis.com, so long as they were not obviously searching for travel information about Greece. But this isn’t proving to be a safe assumption. Even more strange is that it’s the Greek language version of the page that Google is ranking first, even before the English language version. This promotion of a foreign-language page is very odd.

Metrics

Google Toolbar Page Rank (I know, I know, but it’s one of many metrics I’ll use) shows ardamis.com getting a 5 and ardamis.gr getting a 3. I won’t make too much of this, but I wanted to point out that the toolbar PR is not equal.

I ran the list of URLs on the first page of Google through the Open Site Explorer to get a better sense of how strong the pages and domains were, and ardamis.com comes out on top.

URL                                                       Page Authority  Domain Authority  Links
www.ardamis.gr/                                           45.29           34.3              433
www.ardamis.gr/index.php?lang=en                          23.65           34.3              33
www.ardamis.com/                                          74.06           69.38             149621
www.ardamis.com/2005/08/11/xampp-apache-namevirtualhost/  42.18           69.38             29

As the table shows, the home page at ardamis.com has significantly more Page Authority than the home page at ardamis.gr, the ardamis.com domain has more Domain Authority than ardamis.gr, and ardamis.com has roughly 345 times the number of inbound links (149,621 vs. 433). (Although, the vast majority of inbound links come from footer links in the various WordPress and Plogger themes I’ve designed. See below.)

Author attribution

The pages on ardamis.com all contain verified authorship markup linking them to my Google Plus profile, and I get my profile picture next to my pages in the results.

I don’t detect any author markup on ardamis.gr.

Structured markup

The pages on ardamis.com contain structured markup (HTML5 microdata as described at schema.org and hCard microformat). The Rich Snippets Testing Tool returns no warnings for ardamis.com. Rich snippets from the pages at ardamis.com are displayed as part of the page data in Google’s results.

The page at ardamis.gr does not contain authorship or rich snippet markup.

Site links

In July of 2009, ardamis.com had a Toolbar Page Rank of 6 and 3 one-line sitelinks, which later disappeared. Then, in October of 2010, the sitelinks returned for a while before disappearing again. I last noticed the sitelinks in January of 2011.

(I would point out that the site still shows sitelinks when searching for my name.)

Inbound links

I’ve developed and released a WordPress theme and a few Plogger themes, and put links back to ardamis.com and the theme’s post in the footer. These links have helped the home page gain nearly 2 million inbound links, with the Apricot WordPress theme’s page gaining nearly 1.5 million and the most popular Plogger theme’s page gaining just over 70,000. That’s a lot of links.

Page Speed

Google’s Page Speed Online tool awards ardamis.com a Page Speed Score of 96 (out of 100), while ardamis.gr gets a score of 68 (out of 100).

I have put quite a bit of effort into optimizing the performance, and I’m pretty happy with a 96.

Panda

Post-Panda, I combed through ardamis.com and weeded out the posts that I was unsure about.

Other domains

I also own ardamis.net and ardamis.org, and have one-page placeholders at these domains with links back to ardamis.com.

Conclusion

At this point, I wonder if ardamis.com is suffering a penalty somewhere. Maybe all of those footer links are actually hurting the site.

Or maybe the combination of a country code top-level domain and a real geographic location is just incredibly powerful when compared to a random word attached to a .com domain.

Here’s an example script demonstrating how a publicly accessible home page can leverage JavaScript to detect whether a machine is on a corporate intranet and then redirect the browser to an intranet page.

In the example, http://alephstudios.com acts as the corporate intranet site that is not accessible from outside the company’s network, and //ardamis.com acts as the publicly accessible site, which can be accessed both from within and outside the corporate network.

The browser’s home page is set to a page on the public //ardamis.com site that includes some JavaScript, which attempts to load an image from a location on the company intranet. If the browser loads the image successfully, we have established that the machine is on the internal network. The browser can then be redirected via JavaScript to an appropriate intranet page. Otherwise, the browser is redirected to an Internet page.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Intranet Detection Script</title>
<script type="text/javascript">
<!--

var internalURL = 'http://alephstudios.com';
var publicURL = '//ardamis.com';
var detectionCounter = 0;
var detectionTimeOut = 5;
var detectionImage = 'http://alephstudios.com/testing/intranet/transparent.gif?' + (new Date()).getTime();
var detectionElement = document.createElement('img');
detectionElement.src = detectionImage;

function detectIntranet() {
    detectionCounter = detectionCounter + 1;
    //  alert('Attempt ' + detectionCounter + ': Sniffing intranet connection by loading an internal resource at ' + detectionImage);
    if (detectionElement.complete) {
        if (detectionElement.width > 0 && detectionElement.height > 0) {
            //      alert('Attempt ' + detectionCounter + ': The intranet resource was loaded!');
            window.location = internalURL;
        } else {
            //      alert('Attempt ' + detectionCounter + ': The intranet resource could not be loaded!');
            window.location = publicURL;
        }
    } else {
        if (detectionCounter < detectionTimeOut) {
            // Pass the function itself rather than a string to be evaluated
            setTimeout(detectIntranet, 1000);
            //      alert('Attempt ' + detectionCounter + ': Still trying to load: ' + detectionImage);
        } else {
            //      alert('Attempt ' + detectionCounter + ': Gave up trying to load: ' + detectionImage);
            // Timed out: treat the machine as being off the intranet
            window.location = publicURL;
        }
    }
}

window.onload = function () {
    detectIntranet();
}

//-->
</script>
</head>

<body>
</body>
</html>

Setting up an intranet detection/redirection page as the browser’s home page allows IT to display an intranet page while the device is on the network and an Internet page when the device is off the network.
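
As an alternative to polling the image’s complete property once a second, the same idea can be sketched with the image’s onload and onerror events and a single timeout. This is a minimal sketch rather than a drop-in replacement: probeIntranet and its createImage parameter are names I’ve made up for illustration (the latter exists only so the logic can be exercised outside a browser), and the URLs are the same example hosts used above.

```javascript
// Event-driven intranet probe (illustrative sketch).
// onResult is called exactly once with true (image loaded) or false.
// createImage is a hypothetical hook; in a browser it can be omitted.
function probeIntranet(probeUrl, timeoutMs, onResult, createImage) {
    var img = createImage ? createImage() : new Image();
    var settled = false;

    function settle(reachable) {
        if (settled) { return; }
        settled = true;
        clearTimeout(timer);
        onResult(reachable);
    }

    // Give up if neither event fires in time
    var timer = setTimeout(function () { settle(false); }, timeoutMs);

    // Handlers are attached before src is set so no event is missed
    img.onload = function () { settle(img.width > 0 && img.height > 0); };
    img.onerror = function () { settle(false); };

    // Cache-busting query string, as in the original script
    img.src = probeUrl + '?' + (new Date()).getTime();
}

// Example use in a browser:
// probeIntranet('http://alephstudios.com/testing/intranet/transparent.gif', 5000,
//     function (onIntranet) {
//         window.location = onIntranet ? 'http://alephstudios.com' : '//ardamis.com';
//     });
```

The main difference from the polling version is that the redirect happens as soon as the image loads or fails, instead of waiting for the next one-second check.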

7:21 PM 2/26/2012

I recently ran the spider at www.xml-sitemaps.com against www.ardamis.com and it returned a list of URLs that included a few pages with some suspicious-looking parameters. This is the second time I’ve come across these URLs, so I decided to document what was going on. The first time, I just cleared the cache, spidered the site to preload the cache, and confirmed that the spider didn’t encounter the pages. And then I forgot all about it. But now I’m mad.

Normally, a URL list for a WordPress site includes the various pages of the site, like so:

//ardamis.com/
//ardamis.com/page/2/
//ardamis.com/page/3/

But in the suspicious URL list, there are additional URLs for the pages directly off of the site’s root.

//ardamis.com/
//ardamis.com/?option=com_google&controller=..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F%2Fproc%2Fself%2Fenviron%0000
//ardamis.com/page/2/
//ardamis.com/page/2/?option=com_google&controller=..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F%2Fproc%2Fself%2Fenviron%0000
//ardamis.com/page/3/
//ardamis.com/page/3/?option=com_google&controller=..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F%2Fproc%2Fself%2Fenviron%0000

This occurs only for the pagination of the main site’s pages. I did not find URLs containing the parameter ?option=com_google&controller= for any pages that exist under a category or tag, even though those pages also use the /page/2/ convention.

The parameter is the urlencoded version of the text:

?option=com_google&controller=..//..//..//..//..//..//..//..///proc/self/environ00
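
The decoding is easy to check in a browser’s JavaScript console with decodeURIComponent. One detail gets lost in the plain-text version above: the trailing %0000 is %00 (a null byte) followed by a literal 00, so the decoded string contains an invisible character before the last two zeros. The sketch below substitutes a visible marker for it. My reading that the null byte is there to cut off whatever suffix a vulnerable script appends to the path is an assumption on my part.

```javascript
var encoded = '..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F..%2F%2F%2Fproc%2Fself%2Fenviron%0000';
var decoded = decodeURIComponent(encoded);

// Replace the otherwise invisible null byte with a visible marker for printing
console.log(decoded.replace(/\u0000/g, '<NUL>'));
// ..//..//..//..//..//..//..//..///proc/self/environ<NUL>00

console.log(decoded.indexOf('\u0000') !== -1); // true
```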

Exploration

I compared the source code of the pages at the clean URLs vs that of the pages at the bad URLs and found that there was a difference in the pagination code generated by the WP-Paginate plugin.

The good pages had normal-looking pagination links.

<div class="navigation">
<ol class="wp-paginate">
<li><span class="title">Navigation:</span></li>
<li><a href="//ardamis.com/page/2/" class="prev">&laquo;</a></li>
<li><a href='//ardamis.com/' title='1' class='page'>1</a></li>
<li><a href='//ardamis.com/page/2/' title='2' class='page'>2</a></li>
<li><span class='page current'>3</span></li>
<li><a href='//ardamis.com/page/4/' title='4' class='page'>4</a></li>
<li><a href='//ardamis.com/page/5/' title='5' class='page'>5</a></li>
<li><a href='//ardamis.com/page/6/' title='6' class='page'>6</a></li>
<li><a href='//ardamis.com/page/7/' title='7' class='page'>7</a></li>
<li><span class='gap'>...</span></li>
<li><a href='//ardamis.com/page/17/' title='17' class='page'>17</a></li>
<li><a href="//ardamis.com/page/4/" class="next">&raquo;</a></li>
</ol>
</div>    

The bad pages had the suspicious URLs, but were otherwise identical. Other than the URLs in the navigation, there was nothing alarming about the HTML on the bad pages.

I downloaded the entire site and ran a malware scan against the files, which turned up nothing. I also did some full-text searching of the files for the usual base64 decode eval type stuff, but nothing was found. I searched through the tables in my database, but didn’t see any instances of com_google or proc or environ that I could connect to the suspicious URLs.

Google it

Google has turned up a few good links about this problem, including:

  1. http://www.exploitsdownload.com/search/com_/36 – AntiSecurity/Joomla Component Contact Us Google Map com_google Local File Inclusion Vulnerability
  2. http://forums.oscommerce.com/topic/369813-silly-hacker/ – “On a poorly-secured LAMP stack, that would read out your server’s environment variables. That is one step in a process that would grant the hacker root access to your box. Be thankful it’s not working. Hacker is a bad term for this. This is more on the Script Kiddie level.”

    The poster also provided a few lines of code for blocking these URLs in an .htaccess file.

    # Block another hacker
    RewriteCond %{QUERY_STRING} ^(.*)/self/(.*)$ [NC]
    RewriteRule ^.* - [F]
    
  3. http://forums.oscommerce.com/topic/369813-silly-hacker/ – “This was trying for Local File Inclusion vulnerabilities via the Joomla/Mambo script.”
  4. http://core.trac.wordpress.org/ticket/14556 – a bug ticket submitted to WordPress over a year earlier identifying a security hole if the function that generates the pagination isn’t wrapped in the esc_url function, which sanitizes the URL. WP-Paginate’s author submitted a comment to the thread, and the plugin does use esc_url.
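
One caveat on the .htaccess rule quoted in item 2, as far as I understand mod_rewrite: %{QUERY_STRING} is compared as the client sent it, without URL-decoding. The quoted pattern looks for a literal /self/, while the URLs found on this site carry %2Fself%2F. Here is a rough JavaScript sketch of that difference, using an ordinary regular expression as a stand-in for the RewriteCond pattern. The query string is a shortened version of the one above, and the second pattern is my own guess at what an encoding-aware rule might look like, not something from the forum thread.

```javascript
// Shortened version of the query string found on the bad URLs
var query = 'option=com_google&controller=..%2F%2F..%2F%2F%2Fproc%2Fself%2Fenviron%0000';

// Stand-in for: RewriteCond %{QUERY_STRING} ^(.*)/self/(.*)$ [NC]
var literalPattern = /^(.*)\/self\/(.*)$/i;

// Hypothetical variant that also accepts the percent-encoded slash
var encodedAwarePattern = /(\/|%2F)self(\/|%2F)/i;

console.log(literalPattern.test(query));                     // false
console.log(literalPattern.test(decodeURIComponent(query))); // true
console.log(encodedAwarePattern.test(query));                // true
```

If that understanding of mod_rewrite is right, the rule as posted would catch requests with unencoded slashes but let the encoded form through, so it is worth testing against a real request before relying on it.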

So, what would evidence of an old Joomla exploit be doing on my WordPress site? And what is happening within the WP-Paginate plugin to cause these parameters to appear?

Plugins

It seemed prudent to take a closer look at two of the plugins used on the site.

Ardamis uses the WP-Paginate plugin. The business of generating the /page/2/, /page/3/ URLs is a native WordPress function, so it’s strange to see how those URLs become subject to some sort of injection by way of the WP-Paginate plugin. I tried passing a nonsense parameter in a URL (//ardamis.com/page/3/?foobar) and confirmed that the navigation links created by WP-Paginate contained that ?foobar parameter within each link. This happens on category pages, too. This behavior of adding any parameters passed in the URL to the links it is writing into the page, even if they are urlencoded, is certainly unsettling.

The site also uses the WP Super Cache plugin. While this plugin seems to have been acting up lately, in that it’s not reliably preloading the cache, I can’t make a connection between it and the problem. I also downloaded the cache folder and didn’t see cached copies of these URLs. I turned off caching in WP Super Cache but left the plugin activated, cleared the cache, and then sent the spider against the site again. This time, the URL list didn’t contain any of the bad URLs. Otherwise, the lists were identical. I re-enabled the plugin, attempted to preload the cache (it got through about 70 pages and then stopped), and then ran a few spiders against the site to finish up the preloading. I generated another URL list and the bad URLs didn’t appear in it, either.

A simple fix for the WP-Paginate behavior

The unwanted behavior of the WP-Paginate plugin can be corrected by changing a few lines of code to strip off the GET parameters from the URL. The lines to be changed all reference the function get_pagenum_link. I’m wrapping that function in the string tokenizing function strtok to strip the question mark and everything that follows.

The relevant snippets of the plugin are below.

			
$prevlink = ($this->type === 'posts')
? esc_url(strtok(get_pagenum_link($page - 1), '?'))
: get_comments_pagenum_link($page - 1);
$nextlink = ($this->type === 'posts')
? esc_url(strtok(get_pagenum_link($page + 1), '?'))
: get_comments_pagenum_link($page + 1);
			
function paginate_loop($start, $max, $page = 0) {
    $output = "";
    for ($i = $start; $i <= $max; $i++) {
        $p = ($this->type === 'posts') ? esc_url(strtok(get_pagenum_link($i), '?')) : get_comments_pagenum_link($i);
        $output .= ($page == intval($i))
        ? "<li><span class='page current'>$i</span></li>"
        : "<li><a href='$p' title='$i' class='page'>$i</a></li>";
    }
    return $output;
}

Once these changes are made, WP-Paginate will no longer insert any passed GET parameters into the links it’s writing into that page.
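
To make the effect of the change concrete, here is roughly what strtok($url, '?') does to one of the test URLs, sketched in JavaScript rather than PHP so it can be tried in a browser console. stripQuery is a hypothetical helper name, and split('?')[0] is only an approximation of PHP’s strtok for this one use.

```javascript
// Approximate JavaScript analogue of PHP's strtok($url, '?') for this use
function stripQuery(url) {
    // Keep everything before the first question mark, if there is one
    return url.split('?')[0];
}

console.log(stripQuery('//ardamis.com/page/3/?foobar')); // //ardamis.com/page/3/
console.log(stripQuery('//ardamis.com/page/3/'));        // //ardamis.com/page/3/
```

One side effect worth knowing about: this drops every GET parameter, including legitimate ones. As far as I can tell, that makes it unsuitable for a site whose pagination links rely on the query string (for example, WordPress’s default ?paged=2 style permalinks).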

Bandaid

The change to the WP-Paginate plugin is what we tend to call a bandaid – it doesn’t fix the problem, it just suppresses the symptom.

I’ve found that once the site picks up the bad URLs, they can be temporarily cleaned by clearing the cache and then using a spider to recreate it. The only thing left to do is determine where they are coming from in the first place.

The facts

Let’s pause to review the facts.

  1. The http://www.xml-sitemaps.com spider sent against //ardamis.com discovers pages with odd parameters that shouldn’t be naturally occurring on the pages
  2. The behavior of the WP-Paginate plugin is to accept any parameters passed and tack them onto the URLs it is generating
  3. Deleting the cached pages created by WP Super Cache and respidering produces a clean list – the bad URLs are absent

So how is the spider finding pages with these bad URLs? How are they first getting added to a page on the site? It would seem likely that they are originating only on the home page, and the absence of the parameters on other pages that use pagination seems to support that theory.

An unsatisfying ending

Well, the day is over. I’ve added my updated WP-Paginate plugin to the site, so hopefully Ardamis has seen the last of the problem, but I’m deeply unsatisfied that I haven’t been able to get to the root cause. I’ve scoured the site and the database, and I can’t find any evidence of the URLs anywhere. If the bad URLs come back again, I’ll not be so quick to clean up the damage, and will instead try to preserve it long enough to make a determination as to their origin.

Update 07 April 2012: It’s happened again. When I spider the site, two pages have the com_google URL. These pages have the code appended to the end of the URL created by the WordPress function cancel_comment_reply_link(). This function generates the anchor link in the comments area with an ID of cancel-comment-reply-link. This time, though, I see the hijacked URL used in the link even when I visit the clean URL of the page.

This code is somehow getting onto the site in such a way that it only shows up in the WP Super Cache’d pages. Clearing the cache and revisiting the page returns a clean page. My suspicion is that someone is visiting my pages with the com_google code as part of the URL. WordPress puts the code into a self-referencing link in the comment area. WP Super Cache then updates the cache with this page. I don’t think WordPress can help but work this way with nested comments, but WP Super Cache should know better than to create a cached page from anything but the content from the server.
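
The suspected sequence can be modeled in a few lines of JavaScript. This is a toy simulation of my theory, not WP Super Cache’s or WordPress’s actual code: renderPage, handleRequest, and a cache keyed on the path alone are all assumptions made up for illustration.

```javascript
var cache = {};

// Stand-in for WordPress rendering a page with a self-referencing link
function renderPage(requestUrl) {
    return '<a id="cancel-comment-reply-link" href="' + requestUrl + '#respond">Cancel</a>';
}

// Assumed flaw: the cache key ignores the query string,
// but the page is rendered from the full request URL
function handleRequest(requestUrl) {
    var key = requestUrl.split('?')[0];
    if (!cache[key]) {
        cache[key] = renderPage(requestUrl);
    }
    return cache[key];
}

// 1. Someone requests the page with the exploit string attached
handleRequest('/some-post/?option=com_google&controller=..%2F%2Fproc%2Fself%2Fenviron');

// 2. A later visitor requests the clean URL and gets the cached copy
var served = handleRequest('/some-post/');

console.log(served.indexOf('com_google') !== -1); // true
```

In this model, emptying the cache and requesting the clean URL first produces a clean page, which matches what I see after clearing the cache.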

In the end, because I wasn’t using nested comments to begin with, I chose to remove the block of code that was inserting the link from my theme’s comments.php file.

    <div class="cancel_comment_reply">
        <small><?php cancel_comment_reply_link(); ?></small>
    </div>

I expect that this will be the last time I find this type of exploit on ardamis.com, as I don’t think there is any other mechanism that will echo out on the page the contents of a parameter passed in the URL.

Google, what is up with your Google+ profile badges and your Google +1 buttons being different sizes?

There is a neat wizard for creating the code snippet for a Google+ profile badge. We get to pick an image size from one of four options, if we want the image to be hosted on Google’s server. (And why wouldn’t we?)

Small (16px)
Standard (32px)
Medium (44px)
Tall (64px)

OK, those are pretty acceptable, I guess, but I would really like to see something in the 20 to 24 pixel range.

What about the wizard for the Google +1 button?

Small (15px)
Standard (24px)
Medium (20px)
Tall (60px)

Wait, what?!? You’re using the same labeling, but the sizes are totally different. Not only that, but none of the sizes are shared between the two buttons. The Standard profile badge is a third larger than the Standard +1 button (32px versus 24px). Grrr.

The Google +1 button wizard has more options, including a field where you may specify the path to an image, and the button itself is more dynamic. The profile button wizard is very basic, but it is easy to edit the HTML output to use any image. If I were forced to make and host my own image, a custom profile button would clearly be the less involved of the two.

As a third option, the configuration tool for the Google+ brand page badge makes it possible to essentially combine the functions of both buttons. While the badge takes up a large chunk of real estate, it is probably the best choice, as it looks good, has some bold colors, and adds some extra Google+ stuff (thumbnail images, a counter) that you can’t get from the basic generator.

My Google Reader feed is primarily Mashable, SEOmoz, and Smashing Magazine, with a few other sources that tend to come and go. Ideally, I’d like to come up with a way of displaying my starred items on a dedicated page here at ardamis.com, but until then, I guess I’ll just have to do it the old-fashioned way.

Here are a few SEO articles that really are worth reading.

seomoz.org: Find Your Site’s Biggest Technical Flaws in 60 Minutes is a collection of tools and methods suitable for the non-technical site owner who wants to be a little more self-sufficient when it comes to identifying crawling, indexing and potential Panda-threatening issues.

seomoz.org: A New Way of Looking at Ranking Factors includes the really neat Periodic Table of SEO Ranking Factors and some explanation of the thought process behind it, and a short video on SEO basics.

searchengineland.com: The Periodic Table of SEO Ranking Factors is the original table, at full size.

seomoz.org: Set It and Forget It SEO: Chasing the Elusive Passive SEO Dream is a terrific article, both funny and technical, with two scripts to improve your tracking of inbound links and your site’s handling of requests that would normally 404.

seomoz.org: 12 Creative Design Elements Inspiring the Next Generation of UX is a randfish article with some really neat design examples.

sem-group.net: How To Optimize 7 Popular Social Media Profiles For SEO would be a good article to share with someone responsible for setting up profiles, but who doesn’t have a great deal of familiarity with things like H1 tags and nofollow links and their importance.

www.distilled.net: 7 Technical SEO Wins for Web Developers identifies areas where the developer, rather than the designer or content writer, can make improvements to a page’s SEO potential.

smashingmagazine.com: Clear Indications That It’s Time To Redesign isn’t really an SEO article at all, but it could be helpful when making an argument to change the site with the intention of improving bounce rate or other things related to visitor satisfaction.

smashingmagazine.com: Introduction To URL Rewriting does a good job of explaining what URL rewriting is and why you might want to do it.

TechCrunch just posted a story about BuiltWith, a startup that reports on the technology behind a web site. I’ve written before about ways to ID the server, the scripting language, and the CMS technology behind a site, but BuiltWith goes further and provides analysis of that data.

The BuiltWith report for ardamis.com knows that the site is running WordPress (no surprise there) on an Apache server, with the WP Super Cache plugin and Google Analytics and Google Plus One.

For each technology it recognizes on a site (WordPress as the CMS, for example), it provides statistics on the number of other sites using that technology, along with trend information showing whether its use is increasing or decreasing. It’s really pretty sad to see that Apache use continues to steadily trend down, even while it continues to account for the majority of web servers. HTML5 as a Doctype is on the rise, though, along with the use of microdata.

BuiltWith can also tell what other sites are using that technology, but this information is available only as a paid service, and an expensive one, too.

Google’s Panda update and Google+ have motivated me to start using more cutting-edge technology at ardamis.com, starting with a new theme that makes better use of HTML5 and microformats.

I rather like the look of the current theme, but one of the metrics that Panda is supposedly weighting is bounce rate. Google Analytics indicates that the vast majority of my visitors arrive via organic search on Google while looking for answers to a particular problem. Whether or not they find their answer at ardamis.com, they tend not to click to other pages on the site. This isn’t bad, it’s just the way it works. I happen to be the same sort of user – generally looking for specific information and not casually surfing around a web site.

In the prior WordPress theme, I moved my navigation from its traditional location alongside the article to the bottom of the page, below the article. This cleaned up the layout tremendously and focused all the attention on the article, but it also made it even more likely that a visitor would bounce.

For the 2012 redesign, I moved the navigation back to the side and really concentrated on providing more obvious links to the About, Portfolio, Colophon and Contact pages.

I’ve been a fan of the HTML5 Boilerplate template for starting off hand-coded sites, and I’ve once again cherry-picked elements from it to use as a foundation. If you’re interested in a running start, you may try out the very nice Boilerplate WordPress theme by Aaron T. Grogg.

The latest version of the theme also faithfully follows the sometimes idiosyncratic whims of Google Webmaster Tools’ Rich Snippets Testing Tool. Look, no warnings.