Here on SQLblog, we take spam seriously. Actually, I should rephrase that: I take spam really, really, really seriously, some other people take spam somewhat seriously, and some don't really seem to care. It all balances out in the end. We've done a pretty good job keeping the site clean and mostly spam-free. We have a few different automated anti-spam tools, but it's not enough. I do manual sweeps though most of the site once or twice a week and pick up anything that managed to slip through the cracks. I'm constantly updating and refining our rules to try to keep the manual work as quick and painless as possible. I take it as a point of pride that we have one of the lowest spam rates I've ever seen on a large blog site with anonymous comments enabled.
And all of that work, and my pride, is still not enough. A couple of days ago I discovered, amidst a mild rush of panic, that spammers were getting their links on the site by creating new user accounts and putting the links in their profiles. Many of these accounts had obvious names that immediately signaled that something was up. A user account called "69 Sexy" on a site like SQLblog is a dead giveaway. Some of them were obvious but in stranger ways; we had over 80 accounts created by spammers advertising various fake Christmas tree sites. Christmas tree spammers? Seriously?
Being the kind of guy I am, and given the amount of work I put in fighting spam on the site, there was no way I was going to let these accounts live. So I hit the database and started hacking around trying to figure out how to stop them. "Hacking around" is not generally a good thing when you're trying to do database work, but the design that underlies Community Server is, shall we say, not the greatest I've ever seen. After several minutes of prodding and false starts, I found the table where the profile information is stored and thought about how to do the cleanup.
I looked at a few of the spam accounts and found that they shared some common qualities: They all had "interesting" keywords in their profiles, but none of them had associated posts in blog comments or the forums--these, if they ever did exist, would have been taken care of by one of our other anti-spam measures.
I did a few queries and discovered that there are five or six core tables in which most user data is stored, and about 30 satellite tables. My fear was that I might catch a real, valid account using one of the spam keywords (some valid word could have an embedded fragment of a spam keyword), and I decided to let referential integrity protect me. I wrote a script that started a transaction, did the deletes from the core tables in a TRY block, and if any exception occurred a rollback would fire in the CATCH block. I tested this logic using my own account (safely inside of a nested transaction to make absolutely certain I didn't delete myself!), and was pleased to see that the transaction was rolled back as expected.
And from that point on, it was kind of fun. I thought of all sorts of keywords and ran my script against them, removing scores of user accounts from the site. Every few keywords I would check the users list and find a few more. Goodbye, "forex traders". Goodbye, "payday loans". And see you later, "wire frame Christmas trees". (What the hell is a wire frame Christmas tree? I may never know.)
Of course, I wasn't using these keywords in their full, natural form. If I had a user with "forex trader" in his profile, I might delete all users where I found the string "forex", and maybe all users where the string "trade". The work was done using code along the lines of: Profile LIKE '%' + @string + '%'. I figured that it didn't matter if I caught some non-spam users, thanks to my diligent work with TRY-CATCH and the database's referential integrity constraints.
The next day, I received an e-mail from one of the site's bloggers. He was locked out of his account, unable to access his blog. And so I jumped back into the database and made a startling discovery: The database, as it turns out, has a few referential integrity constraints. And it also lacks a few referential integrity constraints in places where a normal person might expect to see them. Oops.
Turns out that I deleted the user accounts for about a quarter of our bloggers, and
about 50 users who had left perfectly valid comments in blog posts. Luckily, we're almost as serious about backups as we are about spam, and a quick database restore and an hour and a half of scripting
later all of the deleted user data was restored.
The morals of this story, for me: First and foremost, assumptions aren't worth much. This is something we all know, yet it's a trap I fall into over and over and over. Some lessons are more difficult to learn than others. Second, and more technical, don't trust a database design until you fully understand its ins and outs. Otherwise, surprises can and will pop up. This is another lesson I've learned countless times and seem to keep hitting. Something in me really wants to believe that most database designers do the right thing. But the reality is, they don't.
And the final lesson of the day is not to get too caught up in pride or determination. I wanted those spammers out, and I wanted them out that moment. I wasn't willing to wait a bit longer or think a bit more carefully through my solution. And I paid the price for that rash decision making. Those spam links weren't actually hurting anything, they were merely causing me irritation. In retrospect, I could have handled the situation in a much more thoughtful manner.
In the end, no real harm was done. A valuable lesson or two was learned. And I got a topic for a blog post out of the deal. Not bad. And one day those spammers will feel my wrath... In a cool, level-headed way.