Combating Spam for Healthier Websites

by Eli White

Spam is a fact of life on Websites with user-generated content, but there are steps you can take to minimize its volume and impact.

Published April 2011

Talk to anyone who runs a Website where users are allowed to post comments or profiles, and you’ll soon be discussing spam.  It’s become a blight on Websites, including the ones I've personally worked on, such as Digg.com.

In this article, I will present numerous techniques for keeping the spam at bay on your own Web properties.

The What and Why

First, let's define what exactly spam is.  In this article, we define it as any user-generated content at your Website (such as comments, posts, or articles) that is out of context: content posted not to participate, but to promote something unrelated.

So why do people post spam?  The most obvious reason is an attempt to promote something and make money.  Often spam will promote a product or service that someone is trying to sell.  Even more frequently, spam will include links to external Websites where the poster hopes to lure unsuspecting people, thereby driving up ad revenue.  Or spammers may be hoping to influence search engine rankings directly: every link from your legitimate domain to their site lends it credibility with search crawlers.

As we discuss these techniques for handling spam, please do not rush to implement every single option at once.  You will destroy the user experience for your legitimate users and waste a lot of time on solutions that may do nothing for your situation.  Instead, research the kinds of spam you are receiving first.  Analyze them.  Then choose the solution that best suits your situation, gives the lowest chance of false positives (stopping valid posts), and affects the experience of your valid users the least.

Create Hurdles

One effective strategy is simply to make it harder for the spammer to post information.  The harder it is for them to do so, the greater the chance that they will give up and go somewhere else instead.

Moderate
The first technique is actually the most effective one: Implementing moderation, or displaying user-generated content only after someone has reviewed and approved it. 

This comes at a great cost, however, and is therefore rarely employed.  It works fine for a small personal blog where you can handle a few dozen comments yourself.  But it doesn't scale.  If you run a Website that receives millions of pieces of content a day from users, how are you going to scan all of them cost-effectively?  Therefore, you need to explore automated options.

Require Login
If you disallow anonymous commenting and require that user accounts be created to gain the privilege, you actually gain two things.  First, you've created a basic hurdle: a simple script can no longer post directly to your Website.  Now the spammer must create an account, teach their script how to log into your Website, pass cookies back and forth, and otherwise act like a real user.  That alone will stop the simplest of spammers and convince many of them to try a different target.

You also gain a benefit for the future.  Many of the techniques we will discuss later rely on being able to track a single person's actions.  With a login required, it is very easy to keep statistics on each account and track what that user is doing on your Website.  This can help you determine whether that person is legitimate or a spammer.

Make Users Prove They are Human

The next category of techniques revolves around proving that a human, not a computer script, is performing the task at hand.  This is an important step in fighting spam, as it will stop all automated spam from hitting your system.  Of course, this is still only a partial solution.  While it may stop the casual spammer, there is a growing trend of "human spam," in which people are being hired to sit and manually enter spam posts on Websites.  These spammers, since they are real people, appear legitimate to any technique in this section.  

CSRF Protection

Cross Site Request Forgery (CSRF) is a common Website vulnerability, and protecting against it is an important security measure in its own right.  This doesn't directly have anything to do with spam, but the protection you implement happens, by its nature, to stop much automated spam.

The traditional solution to CSRF requires you to store a unique ID in the PHP session for a user. Then, when presenting a submission form to that user, you place the unique ID as a hidden form field. When the form submission is made, the server checks that the session’s copy of the unique ID matches the one submitted with the form.  By doing so, you require that the user has actually loaded the form to have retrieved the correct hidden field value.

So besides closing a bad security hole, this also means that an automated script using your site would need to do the same thing.  It would need to load the Web page, parse the form elements out of it, and then resubmit them along with the appropriate cookies for session tracking.  This is certainly possible, but it is a high barrier for a simple script.
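
As a rough sketch of that token approach (the token format and field name here are illustrative choices, not a fixed standard), the form generation might look like this:

<?php
session_start();

// Generate a token once per session and remember it server-side.
if (empty($_SESSION['csrf_token'])) {
    $_SESSION['csrf_token'] = md5(uniqid(mt_rand(), true));
}
?>
<form method="POST" action="/submission">
    <textarea name="comment"></textarea>
    <input type="hidden" name="csrf_token"
        value="<?php echo $_SESSION['csrf_token']; ?>" />
    <input type="submit" />
</form>

On submission, you then compare the posted value against the session's copy:

<?php
session_start();

// Reject the submission unless the hidden field matches the session copy.
if (empty($_POST['csrf_token'])
        || $_POST['csrf_token'] !== $_SESSION['csrf_token']) {
    // Likely a forged or scripted request; do not process it.
    exit('Invalid form submission.');
}
// The token matches; process the comment as usual.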

CAPTCHA

A CAPTCHA, or Completely Automated Public Turing Test to Tell Computers and Humans Apart, is an attempt at a basic Turing test.  The most common form of CAPTCHA presents some letters or words in an image that a human can easily read but that a computer cannot.  As computers have gotten better at optical character recognition (OCR), turning images of words back into the words themselves, CAPTCHA technology has had to become more sophisticated.  CAPTCHAs now often contain highly distorted letters, making them even harder for people to discern but, one hopes, impossible for scripts.  Some examples of what these CAPTCHAs may look like:

[Figures 1 and 2: two examples of distorted-text CAPTCHA images]

The second example is of interest because it's provided by a free service called reCAPTCHA.  You can use reCAPTCHA easily and let them worry about coming up with new ways to defeat spammers armed with better OCR software.  Plus, reCAPTCHA actually helps to digitize old books in the process, so it's a useful service.

Integrating PHP with reCAPTCHA is easy.  First, sign up for a free account on their Website.  It will provide you with both a Public and a Private key:

[Figure 3: the reCAPTCHA sign-up page, showing the generated Public and Private keys]


You then download the PHP library that they provide and make simple function calls to generate the CAPTCHA, such as:

<form method="POST" action="/submission">
    <textarea name="comment"></textarea>
    <?php
        require_once 'recaptchalib.php';
        // PUBLICKEY is a constant holding the public key from your account.
        echo recaptcha_get_html(PUBLICKEY);
    ?>
    <input type="submit" />
</form>


When the user submits the form, you similarly have a single function that can verify whether they successfully recognized the CAPTCHA text:

<?php
require_once 'recaptchalib.php';

// PRIVATEKEY is a constant holding the private key from your account.
$captcha = recaptcha_check_answer(PRIVATEKEY,
    $_SERVER["REMOTE_ADDR"],
    $_POST["recaptcha_challenge_field"],
    $_POST["recaptcha_response_field"]);

if ($captcha->is_valid) {
    // Successful CAPTCHA; process the submission.
} else {
    // Invalid; offer them the CAPTCHA again.
}
?>


If you are inside a firewall, you will need to modify recaptchalib.php to use a proxy. See the example on the unofficial reCAPTCHA wiki.

It should be mentioned that while most people think of a CAPTCHA as letters in an image, there are numerous other options that I've seen used successfully.  One was a set of pictures of cat faces with one dog, where you had to pick the dog.  Another involves a simple math problem: asking the user to calculate the answer to something like 2+3 and enter it into a field.  (While a script could solve this, it would have to be custom-written for your own Website.)  I've even seen a CAPTCHA that just asked you to enter the name of the blog owner.  The answer never changed but, according to the blog, it reduced spam greatly.
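
To illustrate how little code the math-problem variant needs, here is a minimal sketch (the field and session names are placeholders of my choosing):

<?php
session_start();

// Pick two small numbers and remember the expected answer.
$a = mt_rand(1, 9);
$b = mt_rand(1, 9);
$_SESSION['captcha_answer'] = $a + $b;
?>
<form method="POST" action="/submission">
    <textarea name="comment"></textarea>
    <p>What is <?php echo $a; ?> + <?php echo $b; ?>?
        <input type="text" name="captcha" /></p>
    <input type="submit" />
</form>

On the server, you would then compare (int)$_POST['captcha'] against $_SESSION['captcha_answer'] before accepting the post.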

When implementing a CAPTCHA solution, you should also keep in mind that many solutions are bad for accessibility.  reCAPTCHA addresses this by also offering an audio CAPTCHA.

Require a User Agent

One of the simplest things you can do is to require a user agent header.  All valid Web browsers include a user agent string in the headers they send to the Web server.  However, many scripts don't bother setting an agent.  In fact, PHP itself doesn't set a user agent by default when making HTTP requests, so any basic spam script written in PHP will be identifiable by this omission.

In a PHP application, you can simply block all posts that don't include a value for $_SERVER["HTTP_USER_AGENT"] and immediately stop many scripts in their tracks.  This blocking could be done at your firewall or load balancer instead.
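
A minimal sketch of that check in plain PHP might look like the following:

<?php
// Block any POST that arrives without a user agent header.
if ($_SERVER['REQUEST_METHOD'] === 'POST'
        && empty($_SERVER['HTTP_USER_AGENT'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}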

CSS Hidden Field

Another trick that attempts to catch over-exuberant scripts is to leave a "honeypot field."  The idea is to have a separate field in your HTML form that appears to be a real field and even has a name attribute matching commonly requested information, such as the user's location or website.  You then hide this field with CSS.  For example:

<html>
     <head>
       <style>
         .honeypot { display: none; }
       </style>
     </head>
     <body>
       <form method="POST" action="/submission">
         <p>Comment: <textarea name="comment"></textarea></p>
         <p class="honeypot">URL: <input name="url" type="text" /></p>
         <p><input type="submit" /></p>
       </form>
     </body>
</html>


This means that a valid user will not see the field and therefore won't ever fill it in, whereas a script that's attempting to act like a human will fill out every field possible.  On the back end, you can simply check whether that field was filled out.  If it was, you can assume the submission was made by a script.
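
For the form above, that back-end check is only a few lines:

<?php
// A real user never sees the hidden "url" field, so any value here
// strongly suggests that an automated script filled out the form.
if (!empty($_POST['url'])) {
    // Flag the submission as probable spam (but see the caveat below).
}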

It should be pointed out that this particular test isn't very accessible.  Someone using a browser with CSS turned off, or simply an older text-based browser, might in fact see that field and try to fill it out.  Therefore it should probably be used only as an indicator of possible spam, and not the sole determination.

If you look at the above example, you might note that a script could be smart enough to look for any CSS that has "display: none;".  To that end, I've seen people who use this technique get far more creative with their CSS: hiding the text from view by changing the z-index, making the font color match the background, or using any other technique that makes the field invisible to the human eye but hard for a script to detect.

Require JavaScript

I will not focus on this approach much, as it's a fairly drastic measure with serious consequences.  But one additional solution is to build your submission form so that it requires JavaScript to function.  Most scripts will not have a complete JavaScript engine built into them and therefore will not be able to submit data to your server.

Examples of this include building the <form> completely on the fly via DOM manipulation; not using a form at all, but instead using live editable HTML text via the DOM designMode; or fancier methods, such as including JavaScript math problems whose answers must be computed by JavaScript, passed back as part of the submission, and matched on the server (a complicated variation of CSRF protection).

The problem with all of these solutions is that, while effective, they will also stop anyone who is browsing without JavaScript capability.  Perhaps more importantly, they will stop everyone from submitting data if a JavaScript error occurs for any reason (perhaps only on certain browsers).

IP-based Solutions

The next class of solutions revolves around the use of blacklists.  In these cases you are going to completely block all access to your Website (or at least to the submission process) if the user’s IP address matches a blacklist of ‘known-bad’ hosts.  This is an effective way of stopping specific users from spamming you.

Public Blacklists

There are a number of publicly available blacklists, curated by others, that you can access programmatically.  It should be pointed out that these lists are often designed around email spammers, not Website spammers.  However, if a host is being used for one kind of abuse, it's often safe to assume that it is being used for others as well.

How you interface with each blacklist varies, and it is beyond the scope of this article to go into details.  Two of the most commonly known are SpamCop and Spamhaus.  You can also find a large list of servers at DNSBL.info.  While some of these services are offered for free, they are often free only when used as part of an email filtering system; getting direct access to the raw list in order to use it as an anti-comment-spam measure can cost money.

As a quick example, you can check any IP address against the SpamCop database by performing a DNS lookup on the reversed IP address followed by '.bl.spamcop.net'.  You could use the following code:

<?php
// Returns true if SpamCop's DNS blacklist has an entry for this IP.
function spamIP($ip) {
    // Reverse the octets: 1.2.3.4 becomes 4.3.2.1
    $reversed = preg_replace(
        '/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/', '$4.$3.$2.$1', $ip);
    return checkdnsrr("{$reversed}.bl.spamcop.net", 'ANY');
}
?>


If you get a valid DNS response, SpamCop has that IP address on its blacklist.

Custom Blacklist

While using a public blacklist is useful for catching known spammers before they get to you, it doesn’t help at all if the spammer is only targeting your site.  The best solution to this is to keep your own blacklist of spammers that you block access to.

How exactly you add entries into this blacklist would depend on your Website.  Is it easier to manually add them via SQL when needed?  Or do you have an administration website that you can use to flag IPs for the blacklist?  A sample blacklist table might look something like this in MySQL:

CREATE TABLE `blacklist` (
    `id`         INTEGER UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `addr`       INTEGER UNSIGNED NOT NULL,
    `expiration` DATETIME NOT NULL,
    -- An index on the lookup columns keeps the blacklist check fast.
    KEY `addr_expiration` (`addr`, `expiration`)
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;


You'll notice one very important addition to this otherwise simple table: the `expiration` column.  It's very important that you never block IP addresses forever.  IP addresses are transient.  Just because one person is using an IP address at this moment doesn't mean the same person will be using it a year from now (or even five minutes from now).  This is especially true with ISPs, where each time a person connects to the Internet they may be granted a different IP address from a pool.

When adding an IP address such as 1.2.3.4 to this table, you simply need to run:

INSERT INTO `blacklist` SET
     `addr` = INET_ATON('1.2.3.4'),
     `expiration` = DATE_ADD(NOW(), INTERVAL 1 WEEK);


In this case, we've set the expiration at one week in the future.  You can modify this to suit your own needs.  One common tactic is a rolling system: the first time you block an IP address, block it for just one hour; the second time, for one day; after that, for one week.  That's probably the longest you should ever block an IP address.

Checking against the blacklist you've created is a straightforward date-based lookup:

SELECT COUNT(*) FROM `blacklist`
    WHERE `addr` = INET_ATON('1.2.3.4') AND `expiration` > NOW();
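
Wiring that query into PHP is straightforward.  Here is a brief sketch using PDO (the connection credentials and the isBlacklisted() helper are assumptions to adapt to your own setup):

<?php
// Hypothetical connection; substitute your own credentials.
$db = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'password');

function isBlacklisted(PDO $db, $ip) {
    $stmt = $db->prepare(
        'SELECT COUNT(*) FROM `blacklist`
          WHERE `addr` = INET_ATON(:ip) AND `expiration` > NOW()');
    $stmt->execute(array(':ip' => $ip));
    return $stmt->fetchColumn() > 0;
}

if (isBlacklisted($db, $_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}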


Scan the Content

At this point, all of our techniques have focused on stopping the spammer, regardless of the content.  If those techniques have failed, or aren't relevant, you need to start scanning the content itself to see whether it appears to be spam.  This can be tricky, because now you really start to run the risk of false positives: rejecting otherwise valid posts because a service you were using happened to think the content looked fishy.

There are numerous services out there that will help you scan for spam.  Some are specific to a certain platform (Antispam Bee for WordPress); others provide many options in a single package, such as Mollom, which not only handles comment spam but also provides CAPTCHA solutions.  You can even repurpose mail-scanning solutions such as SpamAssassin by reformatting the submission to look like an email and passing it in.  These are just a few of the multitude of solutions that exist, and a few Internet searches will turn up others to explore.  In the meantime, let's explore a common one in more detail: Akismet.

Akismet

Akismet is a product, created by Automattic, that started as a spam solution for WordPress.  A number of years ago it was opened up with a robust API for general use.  More importantly, it's free for personal use and very inexpensive for commercial use.

There are numerous libraries that people have provided for interfacing with Akismet; however, its API is well documented and uses simple REST-based access.  After you sign up for a free API key on the Website, the following code is an example of how you would check for spam:

<?php

$APIkey = 'MY_API_KEY';
$commentData = array(
    'blog' => 'http://myapp.example.com/',
    // Use the proxy-forwarded address when present, else the direct one.
    'user_ip' => $_SERVER['HTTP_X_FORWARDED_FOR'] ?:
        $_SERVER['REMOTE_ADDR'],
    'user_agent' => $_SERVER['HTTP_USER_AGENT'],
    'referrer' => $_SERVER['HTTP_REFERER'],
    'permalink' => 'http://myapp.example.com/post/id/5767',
    'comment_type' => 'comment',
    'comment_author' => 'Martha Jones',
    'comment_author_email' => 'mj@example.com',
    'comment_author_url' => 'http://martha.example.com/',
    'comment_content' => 'Loved this article.  Thanks for writing it!',
);

$options = array(
    'http' => array(
        'method' => 'POST',
        'user_agent' => 'TestApp/0.9 | Akismet/1.11',
        'header' => "Content-Type: application/x-www-form-urlencoded",
        'content' => http_build_query($commentData)
    )
);
$ctxt = stream_context_create($options);
// Akismet responds with the literal string "true" when the comment is spam.
$result = file_get_contents(
    "http://{$APIkey}.rest.akismet.com/1.1/comment-check", false, $ctxt);
$isSpam = ($result === 'true');

if ($isSpam) {
    // Treat the comment as spam
} else {
    // Accept the comment as valid
}

?>


You may see PHP "notices" if the $_SERVER array values are not set in your environment; in production code you would test for this.  If you are inside a firewall, set $options['http']['proxy'] to your Internet proxy.

One of the great benefits of using a system like Akismet is that it's always learning and improving, not only from the masses who use it but also from trends specific to your own Website.  The API provides ways for you to help train it to your needs: the 'submit-spam' and 'submit-ham' endpoints accept the exact same information that comment-check requires, but mark that data as spam or as valid content, respectively.  In this way, you can inform it of the false positives it made, as well as of the spam it missed.
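
As a quick sketch, reporting a missed piece of spam reuses the same request data as the comment-check example above; only the endpoint changes:

<?php
// Reuses $APIkey and $options (which already carry $commentData)
// from the comment-check example.
$ctxt = stream_context_create($options);
file_get_contents(
    "http://{$APIkey}.rest.akismet.com/1.1/submit-spam", false, $ctxt);

// To report a false positive instead, post the same data to the
// submit-ham endpoint: http://{$APIkey}.rest.akismet.com/1.1/submit-ham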

Make Spam Less Useful

So far we’ve attempted to catch spammers before they even submit to us, and we’ve explored how to scan the data afterwards to see if it’s spam.  One additional step that you can take is to make any spam that does get through your system less useful. If spamming your system doesn’t actually help the spammer in any way, eventually they will stop. (Or at least one hopes so!)

rel="nofollow"

One of the common tactics for reducing the benefits of spam is to add the rel="nofollow" attribute to any link that is posted to your Website.  This is a concept that was introduced by Google many years ago and that most search engines now adhere to.  Simply put, when that attribute appears on an <a href> tag, search engines will not give any extra precedence to the Website thus linked.  Traditionally, the number of Websites linking to a site has been a highly prized statistic for judging its relevance, and spammers attempt to capitalize on this by injecting links into other Websites.
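
As a rough sketch, you could tag every anchor in a comment before displaying it.  (A real implementation should use an HTML parser such as DOMDocument, and should handle anchors that already carry a rel attribute; the regular expression here is deliberately naive.)

<?php
// Naively insert rel="nofollow" into every opening anchor tag.
function addNofollow($html) {
    return preg_replace('/<a\s/i', '<a rel="nofollow" ', $html);
}

echo addNofollow('Visit <a href="http://example.com/">my site</a>!');
// Outputs: Visit <a rel="nofollow" href="http://example.com/">my site</a>!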

There is some controversy about using this technique.  Applying it typically affects all links, which means that useful, relevant links gain no extra traction either.  So while punishing the spammers, you also punish legitimate use.

Disallow Links

The next step is to simply disallow all links inside submissions.  This definitely keeps spammers at bay, since they can't redirect people to their Website.  However, this obviously affects your legitimate users as well, and it still won't stop all spam.  Spammers may enter plain-text links and hope that someone will cut and paste them into a browser.  You will also still get "branding" spam whose whole purpose is to promote a certain product name, so those spammers aren't worried about having a link at all.
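
A simple link filter might reject anything that looks like markup or a plain-text link.  (The patterns below are illustrative, not exhaustive.)

<?php
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';

// Reject anchor tags, full URLs, and bare "www." references alike.
if (preg_match('#<a\s|https?://|www\.#i', $comment)) {
    // Ask the user to remove the links, or flag the post for review.
}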

Conclusion

Overall, you now hopefully have a very good set of tools at your disposal for combating spam on your Website.  Again, you should use this set sparingly, picking what works best for your own situation and what will impact your legitimate users the least.

To that end, there are some thoughts I'd like to leave you with.  There are many variations on how you might apply these techniques.  For example, if you detect something as spam, should you blatantly deny it or handle it silently?  In the former case, you make a clear statement that spam isn't accepted here, but you may also be issuing someone a challenge to find a way around your filters.  In the latter case, you might make the submission appear to work, but only for that user; no one else can see it.  The spammer is then left to spin their wheels fruitlessly.  Of course, it doesn't get rid of the spammer.

Similarly, should you run all your filters on the content instantly, meaning that your users must wait for these (sometimes slow) checks to complete?  Or do you let everything through and then batch-process for spam afterwards?  The latter makes your Website more responsive, but can you afford to have spam appear for even five minutes?

You might also want to consider tracking a user's karma over time.  Letting users prove themselves allows you to use rather harsh filters but relax them for known good users.  For example, you might initially require comment moderation, but once you approve a single post by a user, stop moderating their future posts.  Or you might trust any account that is over a year old.  CAPTCHA is a good place to apply karma: perhaps stop presenting a CAPTCHA once a user, or user session, has successfully completed three in a row.  After all, whether they are a script or a human, if they can solve three in a row, they are going to continue to solve them.

My last word of warning is one I learned the hard way: beware of malleable content.  If you allow users to edit their submissions after they've been published, they will happily submit valid comments and later change them into spam.  Therefore, you need to run every revision of a comment through the same filters.  Similarly, be extra wary of any URLs in content: a URL that points to a valid and relevant Website today can be redirected somewhere shady tomorrow.



Eli White
is a longtime PHP user and the author of the book PHP 5 in Practice.  He has worked on many large-scale PHP projects, including Digg, TripAdvisor, and the Hubble Space Telescope program.  He frequently speaks at PHP conferences to share his knowledge.  More about Eli can be found at eliw.com.