Mark Cuban got me started thinking about this spam blog thing. It’s something I’ve been chewing on for a while without writing anything. I think it’s the sort of thing that’s gone through most folks’ heads when they try to search for information in blogs or try to parse their referer logs (assuming they have already filtered out anything involving the word “poker”), but Mark did a good job of making the idea more concrete (and made me aware of the terms “splog,” “zombie,” and “blam”). Now it’s the sort of stuff that occupies my idle thoughts and keeps me awake at night. I’d recommend reading his commentary, and then reading all of the comments if the topic interests you.
In short, the problem statement is this: we have content being produced by individual actors on the web (the good guys), and we have (often illegal) facsimile syndication of that content (the bad guys), interleaved with link stuffing and advertisements in order to increase page traffic or referral revenue or whatever the nefarious activity du jour is. I’m not saying syndicated advertising referral content is bad, but fake content that exists only to redirect readers to revenue sources is bad.
Of course there’s the easy approach: we tax every bit of content. Problem solved. Too late for that, though; back to reality.
Now, email has seen far more penetration from direct advertising, yet it has managed to solve this problem. First, there are the rule-based and machine-learning solutions to spam. These methods work with some level of efficacy, but they are far from perfect. Further, they require constant retraining and/or modification to keep up. Great, we’ve learned to drop a message with the word viagra in it on sight, but as soon as people start writing M4K3 Y0UR P3N1S B1GG3R!!!! the email gets through. This is a criminal oversimplification, but it serves as a reminder that no machine learning mechanism will ever be 100% effective so long as the world changes.
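To make the oversimplification concrete, here is a minimal sketch of a keyword rule and the way a trivial character substitution slips past it. The blocklist and the `is_spam` function are made up for illustration and are not taken from any real filter.

```python
# Hypothetical blocklist; a real filter would be far larger and smarter.
BLOCKED_WORDS = {"viagra"}


def is_spam(message: str) -> bool:
    """Flag a message if any word, lowercased and stripped of
    surrounding punctuation, appears in the blocklist."""
    for word in message.lower().split():
        if word.strip(".,!?") in BLOCKED_WORDS:
            return True
    return False


# The literal word is caught...
caught = is_spam("Buy viagra now!")
# ...but swapping letters for digits defeats the exact-match rule.
missed = is_spam("Buy V14GR4 now!")
```

Adding `v14gr4` to the blocklist fixes this one case and nothing else, which is the retraining treadmill described above.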
Next, there are the honeypot and blacklist methods which are somewhat effective as well, but again, only as a heuristic for known attacks. All it takes is a little cleverness and ingenuity (of which there is plenty) to get around these sorts of services. These mechanisms do not solve the problem, but they reduce it to a dull roar by identifying the known violators. The problem is that for practical purposes there is an infinite source of unknown violators.
But for email we have one advantage that we do not have on the web: email is communication from many to one, not many to many or one to many. Because there is a single endpoint, we can enforce constraints that cannot be applied globally. Given this ability, email spam is a solved problem: the first time somebody tries to contact a particular email address, they have to go through a challenge-response mechanism to prove they are human. This does not prevent an individual from gaming the process and personally bothering you, but it does prevent mass spam and automated mechanisms, and it ensures that there are easier targets out there.
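The first-contact challenge described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the random token stands in for whatever proof-of-humanity step (a CAPTCHA, a reply link) a real system would put in front of the sender.

```python
import secrets


class ChallengeResponseInbox:
    """Hypothetical single-endpoint inbox that holds mail from unknown senders."""

    def __init__(self):
        self.verified = set()   # senders who have passed a challenge
        self.pending = {}       # token -> (sender, held message)
        self.delivered = []     # messages that reached the owner

    def receive(self, sender, message):
        """Deliver mail from verified senders; otherwise hold it and
        return a challenge token for the sender to answer."""
        if sender in self.verified:
            self.delivered.append((sender, message))
            return None
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return token

    def answer(self, token):
        """Release the held message if the token matches a pending challenge."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return False
        sender, message = entry
        self.verified.add(sender)
        self.delivered.append((sender, message))
        return True
```

The whole scheme leans on the inbox owning the one and only delivery point, which is exactly the property the web at large lacks.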
This is pretty cool, and the same mechanism also solves the blog comment spam problem. The unfortunate bit is that the model breaks down when we try to extend it to syndicated content. Sure, we can add all sorts of checks on the various commercial blogging services to verify individual authors, but the great thing about the Internet is that anybody can go out and start their own web site and make a mess of things, and there’s no evil overlord to police their content and verify its originality. In short, the lack of a single endpoint or centralized service prevents us from applying any direct control to solve the problem. In theory, I don’t think this is a bad thing, as I don’t want to force everybody to be under the same umbrella. Obviously that isn’t what I want, or I wouldn’t be running my site the way I am.
So the obvious solutions that we fall back on are:
- Known originality indexes, where we specifically nominate and verify and validate that a given blog is original content, and rate its originality (not its quality). This suffers from the fatal flaw that it requires action for new syndicated content to be recognized, and there is a bias towards the more popular and older content, compared to the new. To a certain extent, icerocket is doing this now.
- Known offender indexes, where we enumerate fake content as it is found, but we are always fighting a losing battle. A good example of this is Splog Reporter (now defunct).
The tricky bit is we need to somehow establish a trust ranking. My goal is not a quality ranking; I don’t care if somebody has nothing interesting to say, or if they are the single most interesting blogger on earth. Rather, I want to find a way to say “I trust with some certainty that this is a human originating content” or “I do not trust this source to be original.” Even if we fix this problem on all of the major blogging systems, we’ve only fixed a localized problem, and the splogs will still pop up; I am trying to dream up a way to solve this in a platform-agnostic manner.
So, the problem statement is quite simple (as most problems with difficult solutions are): we want to limit the universe of blogs to unique and original content. How do we do it? I know a few things:
- The solution will fail unless it has a fundamental basis in human input
- The solution must be automated, to make human input unobtrusive and universal
But that’s where I draw a blank. Is this a problem to be solved with a website or with an extension? How much automation is too much, and how much is too little? How do we avoid the system being gamed and ruining the metric? How do we encourage people to provide trust rankings (for this is where the value comes in), and how do we encourage people to use them? The problem of usage becomes less severe once enough rankings are produced, as introducing such data into bigger systems (google, icerocket, bloglines) should be straightforward enough, but that still does not make it a trivial process.