Tuesday, October 2, 2007

The price of spam

Like anyone, I get lots of spam. Having kept the same e-mail address since 1995 and having that e-mail address posted all over the web may contribute to the absolutely inordinate amount of spam I receive. I get over 1000 pieces of spam mail a day.

To make matters worse, I run my own mail server on a Soekris 4801 at home. I use postfix running on FreeBSD 6.2 with SpamAssassin to identify spam. My spam situation is desperate, so anything that looks remotely like spam gets immediately sent to the bit bucket. Still, over 50 spam messages a day make it to my inbox.

My Soekris box is my mail server, file server, firewall, and PPPoE tunnel end-point (for my DSL connection). I also run a low-traffic web site off the box. That said, between pppd, the postfix and spamd processes, receiving and processing spam consumes almost 100% of the CPU all day, every day. My load average rarely dips below 1.0 and, at the heaviest times, the inbound mail queue grows to a few hundred messages.

Now, this isn't the fastest machine in the world. But when I started my first ISP back in 1995, we ran dozens of web sites (admittedly, mostly static content) off a single 100MHz Pentium server with 128MB of RAM. Our entire 27000+ newsgroup Usenet feed was hosted on another 100MHz Pentium server, also with 128MB of RAM. But in 2007, I'm here to tell you that it takes a 266MHz Pentium-class machine (with the same 128MB of RAM) running 24/7 just to deliver mail for two e-mail accounts.

So I can tell you personally that the price of spam in 2007 is roughly one Soekris 4801 plus disk space. It may not seem like much in comparison to today's top-of-the-line computers, but it is enough to make me sick just thinking about it.


Kumar McMillan said...

This is exactly why I use gmail. I guess that makes me lazy and cheap! My email address is all over the web but due to their filtering (I imagine it learns from all the many gmail users) I only get about 4 spam messages a week, on average.

Anonymous said...

You say you get over 1000 pieces of SPAM per day. Let's say you get 2000, and that 50 percent of your CPU cycles go to processing it. There are 86,400 seconds in a day; 50 percent of that is 43,200, and 43,200 divided by 2000 is over 20 CPU seconds per email. You really need a faster machine.
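
For anyone who wants to redo that estimate with their own numbers, here is a minimal sketch of the arithmetic. The function name and parameters are made up for illustration; the 2000 messages and 50 percent figures are the assumptions stated in the comment above, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def cpu_seconds_per_message(messages_per_day, cpu_fraction):
    """Average CPU seconds spent on each message over one day.

    messages_per_day: how many messages arrive per day (assumed figure).
    cpu_fraction: share of the day's CPU time spent on mail, from 0.0 to 1.0.
    """
    return SECONDS_PER_DAY * cpu_fraction / messages_per_day


# The figures assumed in the comment: 2000 messages/day at 50% CPU.
print(cpu_seconds_per_message(2000, 0.5))
```

Plugging in the post's own figures (about 1000 messages a day at close to 100 percent CPU) gives an even larger number per message, which points the same way.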

Anonymous said...

You should maybe give spambayes a try. It's written in Python and doesn't do the SpamAssassin online checks, which might not be that good.

Kelly Yancey said...

To the anonymous poster suggesting spambayes: I haven't tried that particular filter, but I did try dspam a while back. It advertised that it was written all in C so it was faster than SpamAssassin (which is written in Perl).

It was fast all right: it didn't do much of anything. I realize that I have to train it, but that is a pain to do unless you use mutt as your mail reader (and configure keys for flagging messages as false negatives and positives). I really didn't enjoy reading through 1000+ messages a day to train my filter. One thing SpamAssassin has going for it is that it uses both a Bayesian filter and static analysis, so I get decent results immediately.

That said, perhaps disabling some of the SpamAssassin online checks may be in order. It appears that SpamAssassin is CPU bound rather than network I/O bound, though, so I doubt it will help much.

Kelly Yancey said...

Kumar: Thanks for the comment. Your blog is really interesting. I've been thinking about switching over to gmail or Yahoo! mail, but have resisted the idea because changing my e-mail address feels too much like capitulation.

Of course, forwarding all of my mail from my home server to gmail (or wherever) would work, but spam would consume twice as much bandwidth as it currently does (once to download it, and again to forward it to the hosted mail service). :(

Anonymous said...

Kelly: That little mail server you've got is pretty neat! I'm assuming that back in the day your P100 wasn't doing much in the way of spam prevention, except maybe blacklisting. Sorting mail is much less resource-intensive :-)

I was the email admin for a Fortune 500 company about a year ago. Roughly 80% of the inbound email was spam, something like 45,000 messages/day. We actually had two layers of spam protection, one right behind the other (not my design, but it worked). Maybe you need a couple more Soekris boxes in series to cut the volume down further ;-)

As for me, I've been using Yahoo for something like 9 years. I get maybe 2-3 pieces of spam a week.

- A

Anonymous said...

I'm using a Soekris box running OpenBSD as a firewall. When I enabled greylisting in spamd, the number of spam mails plummeted by a factor of 100 -- and it's stayed that low ever since.

Anonymous said...

I've got the same setup - FreeBSD/SpamAssassin, and I averaged over 2,300 mails flagged as spam per day for the past month. I maybe get ~25/day that aren't caught, and about one false positive per month.

All the mail that gets flagged as spam is forwarded to a Google account. Virtually all of it gets flagged as spam by Google as well, and I review what doesn't every few days. I figure if my spam filters and Google's spam filters both agree that it's spam, then I can safely ignore it.

The server I use doesn't really have any trouble processing that volume of mail. You are running spamd so you don't have to fire up a Perl process for each mail, right? If you're still having difficulty, try greylisting. I've heard it's a very inexpensive way of cutting out a lot of spam.

bsergean said...

Yes you need to train spambayes with ham and spam. I keep my spams to do that when I want to play with another setup.

$ sb_filter.py -d $HOME/.hammie.db -n
Created new database in /home/bsergean/.hammie.db

$ sb_mboxtrain.py -d $HOME/.hammie.db -g ~/Mail/sent-mail/cur/ -s ~/Mail/Spam/cur/
Training ham (/home/bsergean/Mail/sent-mail/cur/):
Reading as MH mailbox
Trained 8020 out of 8020 messages
Training spam (/home/bsergean/Mail/Spam/cur/):
Reading as MH mailbox
Trained 1847 out of 1847 messages

I am using KMail right now. Here are the two rules I use when something bad happens.

(re)train as a bad (spam) message
sb_filter.py -d $HOME/.hammie.db -s

(re)train as a good (ham) message
sb_filter.py -d $HOME/.hammie.db -g

There are two options for the database format, pickle and Berkeley DB. Each one has its own advantages (I forgot which, it's in the man page :).

One last word: if it ain't broke, don't fix it. It's always a pain to reconfigure something already working ... And the greylisting seems like a good thing too, according to another comment.


Anonymous said...

Try Google Apps, it's gmail with your own domain name. I average about 10 spams a week.

You do need to check your spam folder regularly though. Some of my client's emails ended up inside.

Anonymous said...

The important thing about greylisting is that it's insanely cheap. SpamAssassin takes a lot of memory and CPU cycles to do its work. Greylisting just takes a database lookup and a write, plus a delayed delete for more spam.
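
To make the "lookup, write, delayed delete" idea concrete, here is a minimal in-memory sketch of greylisting. It is an illustration under stated assumptions, not how spamd or any particular mail server implements it: the `Greylist` class, its method names, and the 300-second and one-day defaults are all hypothetical, and a real setup would keep the table in a persistent database and usually whitelist senders that have passed once.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300, max_age=86400, clock=time.time):
        self.min_delay = min_delay  # seconds a new sender must wait (assumed value)
        self.max_age = max_age      # seconds before an entry is stale (assumed value)
        self.clock = clock          # injectable clock, handy for testing
        self.first_seen = {}        # stands in for the database table

    def check(self, client_ip, sender, recipient):
        """Return True to accept, False to answer with a temporary failure."""
        now = self.clock()
        key = (client_ip, sender, recipient)
        seen = self.first_seen.get(key)       # the database lookup
        if seen is None or now - seen > self.max_age:
            self.first_seen[key] = now        # the write: remember first contact
            return False                      # tell the sender to try again later
        # Accept only if the sender came back after the minimum delay.
        return now - seen >= self.min_delay

    def expire(self):
        """The delayed delete: drop entries older than max_age, return the count."""
        now = self.clock()
        stale = [k for k, t in self.first_seen.items() if now - t > self.max_age]
        for key in stale:
            del self.first_seen[key]
        return len(stale)
```

A legitimate mail server retries after a temporary failure and gets through on the second attempt; much spam software never retries, so its entries simply sit in the table until `expire()` removes them.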

It's not at all surprising that if you have an expensive mechanism like SpamAssassin without protecting it by cheaper mechanisms, that your Soekris can barely keep up.

I will say that your problem is totally solvable. I've had the same e-mail address since 1994 and it's all over the net, and I only get 1 or 2 spams through to my main mailbox per day. Legitimate messages misclassified as spam number maybe 2 to 3 per week.

I use SPF, greylisting, rejecting mail from the SpamCop "top 200" site list, SpamAssassin, ClamAV, auto-whitelisting of people I mail to, and a custom set of black and white lists along with a long custom set of header and body rules.
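
The general shape of that kind of layering, with cheap checks in front of expensive ones, can be sketched roughly as follows. This is only an illustration: the check functions, the blocked address, and the keyword test are hypothetical stand-ins, not the actual SPF, SpamCop, or SpamAssassin logic.

```python
# Hypothetical data for the sketch; 203.0.113.7 is a documentation-range address.
BLOCKED_HOSTS = {"203.0.113.7"}

# Counts how often the expensive stage actually runs.
calls = {"content": 0}


def not_blocklisted(message):
    """Cheap check: a single set lookup on the connecting IP."""
    return message["client_ip"] not in BLOCKED_HOSTS


def content_looks_ok(message):
    """Stand-in for an expensive content filter such as SpamAssassin."""
    calls["content"] += 1
    return "cheap pills" not in message["body"].lower()


# Ordered from cheapest to most expensive.
CHECKS = [("blocklist", not_blocklisted), ("content", content_looks_ok)]


def classify(message, checks=CHECKS):
    """Return the name of the first check that rejects, or None if all pass.

    Because the loop stops at the first rejection, mail dropped by a cheap
    check never reaches the expensive one.
    """
    for name, check in checks:
        if not check(message):
            return name
    return None
```

With this ordering, a message from a blocked host is rejected by the set lookup alone and the content stage is never invoked for it, which is where the CPU savings come from.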

It's a lot of work, but it's amazingly effective.