By Eli White
Learn techniques to help prevent the gaming of your MySQL-PHP Website.
Published April 2011
A new term has begun to appear in the Web application lexicon: "gaming". Gaming in this context doesn’t refer to people playing games online. Rather, it refers to people treating your Website like it's a game, and attempting to influence it.
The most obvious example of gaming can occur in online voting, when someone attempts to vote numerous times by making new accounts or through other gaming methods, thereby pushing their candidate over the top. However, the concept can apply to any situation where a user might want to manipulate your Website for their own purposes.
Why would someone do this? Mass, rapid commenting on an article could be a variation in which the perpetrator wants to fill the page so full of random comments that the real ones can’t be read. Perhaps they might log in/out regularly to increase their number of logins, or keep reloading their own pages to make them appear more popular. Worst case, someone might just have a grudge against your Website and want to make its data invalid. In any case, this is not a vulnerability that you want to expose.
In this article, I will present some techniques to help prevent the gaming of your Website. Some of them rely on shutting down scripts; others try to find the humans who are manually executing this behavior.
In my article “Combating Comment Spam to Create Healthier Websites,” I describe techniques to ensure user comments are valid. These techniques can also be applied as anti-gaming measures for the same purpose. Because the alternatives (such as creating and logging into multiple accounts) is tedious, most attempts to game a system will be automated ones. Putting hurdles in place that make scripts ineffective, or at least require much more effort to write, will dissuade people from continuing to write them.
In this case you simply need to keep track of when the actions have taken place, then manually test how fast a human can make informed decisions and click the links in question. Set a threshold so that if votes come in faster than a human can be doing them, you know that a script is generating them. You can then discount the validity of all those votes.
Let’s take as an example a Website where you list a number of items on one page, say movies, and ask a user to rate each one on a scale of one to five. The original schema for this ‘Movie Rating’ table might look something like this in MySQL:
CREATE TABLE `ratings` ( `id` INTEGER UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, `user_id` INTEGER UNSIGNED NOT NULL, `movie_id` INTEGER UNSIGNED NOT NULL, `rating` TINYINT UNSIGNED NOT NULL ) ENGINE=innodb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci; CREATE UNIQUE INDEX `user_movie` ON `ratings`(`user_id`, `movie_id`);
Someone might write a script that would attempt to rate every item as a ‘1’. To help us apply rate limiting to movie ratings, we should modify the table to store a few extra pieces of data:
ALTER TABLE `ratings` ADD COLUMN `invalid` TINYINT UNSIGNED NOT NULL DEFAULT 0, ADD COLUMN `created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
Now we can track when the link was clicked, as well as having a way to mark ratings as invalid. With that complete, let’s assume that we run some manual tests and find the quickest that someone can click is five links in five seconds. Give the benefit of the doubt that someone might click a little faster than you and assume five links in four seconds. Now before adding a new movie rating we can run the following SQL to detect if the click rate limit has been exceeded:
SELECT `id` FROM `ratings` WHERE `user_id` = 42 AND `created` > DATE_SUB(NOW(), interval 4 second);
This code returns any entries that were made in the last four seconds. If there are four or more (since this is the fifth one you are about to add) then you’d want to insert your current movie rating row with the invalid field set. You’d also want to go back and set all the original ratings to invalid as well.
Another piece of information that you can begin to use in your investigation is the source of the action. Look at all information available to track who performed the action.
The simplest version of this is obviously to look at the specific user. This is easier if, like in our example above, we already are tracking a logged in user. If you don’t require logins to be used then you could track the actions by storing the session ID in each DB row instead. This would allow you to track a single person moving across your Website and performing different actions.
One of the biggest indicators that you can use when tracking information on a user-basis is looking at total activity. For example, if there are lots of users with very little history in your system, either newly created, or that have only rated one movie . . . that can look suspicious. Perhaps one person created a large number of accounts to try and promote their own movie. A good solution to this is to count votes by new users as worth less in your overall scheme.
A simple version of this for our Movie Rating system would be to exclude all ratings from the overall average rating for a movie if the user has made less than five ratings altogether. Following is a SQL statement that would do this (warning, this would not be performant on large databases and would need rewriting):
SELECT AVG(`rating`) FROM `ratings` WHERE `movie_id` = 1 AND `user_id` IN (SELECT `user_id` FROM `ratings` GROUP BY `user_id` HAVING count(`id`) >= 5)
Of course, tracking which user performed an action can only get you so far, since people gaming your site can be fairly tricky and are expecting it. To that end, it can be quite useful to start tracking both the IP address of any action as well as the HTTP Referrer of the action. Each of these provides a different insight into the actions. Let’s explore the IP address first.
When looking at the IP address you are primarily looking for an unusual amount of activity from a single address, usually across multiple user accounts. You are trying to find out if one person is using numerous accounts, logging in/out of each and attempting to influence the voting from them. You do have to be careful with this. Employees at a company, or students at a school, may all appear to be the same IP address. Of course, that itself may be an interesting statistic as well. If a large group of people from a single IP address (or IP address range) are all voting for something specific, you may have a case of collusion on your hands. (Perhaps in the movie rating situation it is the movie studio telling all their employees to rate their movie as five stars. This should not be considered valid data.)
A similar case exists for tracking the HTTP referrer of an action. If you know what original HTTP address brought them to your Website, you may also be able to find sources of collusion. Perhaps it was as simple as a public Web page that states: “Everyone go vote for this”. While perhaps the intent is pure, traditionally such situations lead to blind voting and not real, honest input. It should be noted that to do this properly you need to track not just the ‘last’ HTTP referrer because that will typically be your own Website (or even blank in some AJAX situations). Instead you need to have captured the referrer as part of the session when the user very first came to your Website for this visit. Then in turn log that with any action.
We might consider adding this data to our Movie Rating database as the following:
ALTER TABLE `ratings` ADD COLUMN `ip_addr` INTEGER UNSIGNED NULL, ADD COLUMN `refer` VARCHAR(1048) NULL;
Now there’s a tricky point that you’ll need to consider here. You’ll notice that I’ve made the ‘refer’ field be a 1,048-character VARCHAR. There are some major trade offs that you need to make at this point for your application. URLs can sometimes be upward of 2,000 characters long. If you have a shorter VARCHAR as we have listed here, you will lose data that might be important to you. On the other side though, keeping all of that information with every single rating can start to become a very large database size pain.
One solution to this would be to keep just an MD5 or SHA1 hash of the HTTP referrer instead. Doing this will still let you programmatically detect if collusion is happening, while limiting your pain to 32 (or 40) characters. By doing this you lose the ability to actually see the original URL and perhaps visit it to satisfy your own curiosity. It would be possible keep a separate lookup table, if you still wished that information, pushing the full URL with its associated hash (or just use an autoincrement key in this case) into the lookup table. But you still end up adding additional database overhead and maintenance to do this. Such are the trade offs you always have when trying to track as much information as possible to defeat Website gaming.
At this point we’ve exhausted the simple ‘source’ based and ‘does it look like a script’ techniques for finding people gaming the system. Now if you want to catch more esoteric situations, you need to go deeper into the route of pattern matching.
In this case, we are going to look for suspicious patterns in the database. The simpler patterns to look for are the timing and order of actions. For example, in our Movie Rating system: Can you see that the same 50 accounts voted in the exact same order with the same ratings on multiple movies? If so, that’s a pattern that wouldn’t exist in normal Website usage and therefore you can assume that either a script performed those actions, or a human was following a spreadsheet of accounts they had setup and was logging into each one in turn.
Similarly you can look at the timing of actions. We talked about rate limiting before, but you can now look at simply suspicious timing. What if one account left a rating exactly every 10 minutes on the nose? Or what if there was a set of accounts from which one at random left a comment every minute? Interesting timing patterns like this can open your eyes onto suspicious activity.
None of these analysis tasks are ones that are going to be happening live on your Website. They can be rather expensive queries and need to be handled offline, either as manual processes by an administrator who is searching for this activity, or perhaps an automated process that runs nightly and flags anything suspicious it found from the previous day.
One of the biggest sources for collusion, whether on purpose or not, is friendship relations. However, if your Website has a social aspect to it (as many do now), you have a great source of data to identify this. Most of these Websites support the concept of "friending" across accounts. In that case, you can build friendship trees that show who is related to whom. You can start to examine the depth of the trees to see if there are patterns emerging. Perhaps one circle of friends all acted in the exact same manner giving identical ratings to a certain movie.
On the surface, some of this will obviously happen. True friends will share similar interests, and therefore like similar things. But there is a threshold level for this. At some point you move beyond "typical" similar tastes and enter a realm of collusion. If a circle of friends is always performing the exact same actions, then you don’t have valid data coming from them.
You can extend this pattern even farther. Many Websites now allow you to link your Twitter, Facebook, or other social media Website account in. Perhaps as an authentication method for logging in, or perhaps just as a way to "Find your friends." If you do this you have the ability to do some in-depth analysis that might otherwise not be expected. (And I should warn that you start to cross the lines of white to black hat here.)
For example, you can keep track of which of your Website’s accounts happen to be linked to Twitter accounts. You can call Twitter’s API to get friend lists of each of your users. At that point, you can build relationship maps between accounts on your own Website that may not be linked as friends but that are actually related according to Twitter. This can allow you to put together connections that were otherwise hidden to you either because of user apathy or because the user was attempting to hide their attempts at collusion from you and thought that not linking accounts as friends on your Website would leave them hidden.
This becomes a never ending source of data mining where you can delve deeper and deeper into relationships and actions, trying to find patterns that were otherwise hidden from you.
After the deep topic of patterns and the discussion of how to detect gaming after it has happened, let’s embark down one last path, perhaps a little lighter and definitely effective.
The first I will call a "login cookie." The idea is to set a cookie when a browser visits your Website. You leave this cookie in place regardless of how many times the user may log in or out. (This differs from session IDs that you should be regenerating upon each visit.) Then you can save the account name and login cookie value in a database lookup table . This directly lets you track when one person (one browser) logs in using multiple accounts. This can be combined with the IP address tracking previously mentioned to let you peek deeper behind dynamic IP addresses and gateways.
If that user either right away, or at any point in the future, attempts to create additional accounts by signing out and then clicking on "add account," you can track that it was really the same user.
In this way, you can link together groups of accounts that are really a single user, and act on that information accordingly. Again, this is something you have to be careful with. Having just a few accounts linked might be a case of a family each having an account. But when that number rises you can be more suspicious of those accounts.
Hopefully gaming is a problem you never have to deal with. Unfortunately the world isn’t always so nice. Detecting gaming is never an easy situation. You always struggle against trying to find the blatant cases of it while hoping that you aren’t incorrectly finding false-positives. The toolset of techniques given in this article is just a starting point. Take these concepts, go apply them to your own application, and see what you can find out about your users.