MyRBL / Smart Filter -- New Spam Handling Process --

 

 

 

How it works

This upgrade substantially changes how spam handling works while keeping the same basic structure.

Smart Rule based filtering

aspam_mfilter.txt is replaced with sf_mfilter.txt.

The rule file sf_mfilter.txt now produces a list of significant features for each email message. The features are then analyzed using the rules in feature_gen.dat to come up with a 'score', so the scores are no longer 'hard coded' into sf_mfilter.txt.

The file feature_gen.dat is created by analyzing sample messages from your own server. Suppose a feature 'blob' correlates 98% with spam on your server and 20% with spam on my server. In other words, an email with the feature 'blob' is spam 98 times out of a hundred on your server, while on my server it is not spam 80 times out of a hundred. On your server the score in the x-spamdetect header will then be something like "plus 10" for an email with 'blob', and on my server it will be something like "minus 4".

The feature 'blob' might relate to something like the length of the 'To' header, or whether or not the SPF tests passed, and so on.
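
As a rough illustration of the idea above, the sketch below turns a feature's per-server spam rate into a signed score. The log-odds formula, the scale factor, and the function name are assumptions made for this example; the actual calculation behind feature_gen.dat is not described here.

```python
import math

def feature_score(spam_rate: float, scale: float = 2.5) -> float:
    """Map a feature's observed spam rate (0..1) to a signed score.

    Assumption: scaled log-odds, chosen only for illustration.
    """
    # Clamp so rates of exactly 0 or 1 do not produce infinities.
    p = min(max(spam_rate, 0.001), 0.999)
    return scale * math.log(p / (1.0 - p))

your_server = feature_score(0.98)  # 'blob' is spam 98% of the time
my_server = feature_score(0.20)    # 'blob' is spam 20% of the time
print(round(your_server, 1), round(my_server, 1))
```

The same feature yields a positive score on one server and a negative score on the other, which is the point of generating the rules from local samples.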

In addition to simple rules, the automatic process generates combinational rules based on your sample messages. For example, it might notice that a message which is from yahoo and has a long "To" header is always spam. These 'combined' rules are used to further increase accuracy.
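
One way to picture combined rules is to count how often each pair of features appears together in spam samples. The sketch below is an assumption about how such rules could be derived, not a description of SurgeMail's generator; the feature names and the min_count parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def combined_rule_rates(samples, min_count=2):
    """samples: list of (set_of_features, is_spam) pairs.

    Returns {feature_pair: spam_rate} for pairs seen at least min_count times.
    """
    seen = defaultdict(int)
    spam = defaultdict(int)
    for features, is_spam in samples:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(features), 2):
            seen[pair] += 1
            if is_spam:
                spam[pair] += 1
    return {pair: spam[pair] / n for pair, n in seen.items() if n >= min_count}

samples = [
    ({"from_yahoo", "long_to"}, True),
    ({"from_yahoo", "long_to"}, True),
    ({"from_yahoo"}, False),
    ({"long_to"}, False),
]
rates = combined_rule_rates(samples)
print(rates)
```

In this toy data neither feature alone indicates spam, but the pair always does, which is the kind of pattern a combined rule captures.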

You must enable this automatic generation of feature_gen.dat yourself. By default SurgeMail will use a rule file that we have generated. The setting to enable it is g_sf_generate "true"

Built-in RBL / Reputation system

SurgeMail now includes its own RBL system (Realtime Blocking List) and Reputation system. This is a two level database: a local database on each server, plus a reporting system and a DNS based query system that merge data between all SurgeMail servers in the world.

This system classifies all IP addresses into one of six colors:

Blue = new
Brown = new, looks bad
Orange = not new, but not decided.
Yellow = some good some bad.
Black = mostly spam
White = rarely spam
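
To make the color scheme concrete, here is a minimal sketch that assigns a color from counts of good and bad messages seen from an address. Every threshold, and the exact boundary between Orange and Yellow, is an assumption for illustration; the document does not state how SurgeMail decides.

```python
def classify_ip(good: int, bad: int, min_history: int = 10) -> str:
    """Assign a color from counts of non-spam (good) and spam (bad) messages.

    All thresholds here are hypothetical.
    """
    total = good + bad
    if total < min_history:
        # Too little history: the address is still 'new'.
        return "brown" if bad > good else "blue"
    spam_ratio = bad / total
    if spam_ratio >= 0.9:
        return "black"   # mostly spam
    if spam_ratio <= 0.1:
        return "white"   # rarely spam
    if 0.3 <= spam_ratio <= 0.7:
        return "yellow"  # some good, some bad
    return "orange"      # not new, but not decided

print(classify_ip(0, 0), classify_ip(100, 1), classify_ip(1, 100))
```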

As most 'real' email comes from servers you talk to all the time, this system quickly identifies the trusted mail servers that never send spam, so that messages from those servers are very unlikely to be accidentally classified as spam.

 

The advantages this has over traditional RBL services (which should also be used, of course):

It is free to SurgeMail customers

It makes use of the many users clicking 'spam/not spam' on messages as they read them to help identify spam more accurately.

It provides both positive and negative indications for the spam filter. This is much more valuable than the purely negative responses given by many RBL services: the bulk of spam comes from 'unknown' transient IP addresses, so the significant information is the list of known mail servers that regularly send 'non' spam.

Probing suspect URLs

One common form of spam is email containing URLs that point to spam sites. As spammers create new URLs every day, even SURBL is often unable to provide updates fast enough to block emails of this sort. SurgeMail includes a probe feature where an unknown URL is probed to see if the page it points to contains known 'spam' keywords.
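
The keyword check at the heart of such a probe might look like the sketch below. The keyword list, the threshold, and the function name are hypothetical, and fetching the page is left out so the example stays self-contained.

```python
import re

# Hypothetical keyword list; a real list would be much larger.
SPAM_KEYWORDS = {"viagra", "casino", "lottery"}

def page_looks_spammy(page_text: str, threshold: int = 2) -> bool:
    """Return True if the page text contains enough distinct spam keywords."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return len(words & SPAM_KEYWORDS) >= threshold

print(page_looks_spammy("Win the LOTTERY at our casino!"))
print(page_looks_spammy("Quarterly report attached"))
```

Requiring more than one distinct keyword is a design choice in this sketch to reduce the chance that a single innocent word flags a page.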

Content filtering

SurgeMail has several modules that try to identify the content of a message as fundamentally spammy. One module, 'spamc', analyzes the words used, drawing on the messages in the training folders to find combinations of words that imply 'spam'. It tends to be about 80% accurate. In addition, a couple of modules look for very specific words and phrases. Content filtering only works when the content of the message is the spam itself (as opposed to random words used to fool filters like this); that is why this method is only about 80% accurate and needs to be used in combination with the features above.
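
The idea of learning word combinations from training folders can be sketched with simple word-pair counts. This is not how 'spamc' is implemented (that is not documented here); the functions and sample messages below are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def word_pairs(text):
    """Return the set of unordered word pairs that occur in a message."""
    words = sorted(set(text.lower().split()))
    return set(combinations(words, 2))

def train(spam_msgs, ham_msgs):
    """Count how many spam and non-spam training messages contain each pair."""
    spam_counts = Counter(p for m in spam_msgs for p in word_pairs(m))
    ham_counts = Counter(p for m in ham_msgs for p in word_pairs(m))
    return spam_counts, ham_counts

def spam_fraction(text, spam_counts, ham_counts):
    """Fraction of known word pairs in text that occur more often in spam."""
    known = [p for p in word_pairs(text) if spam_counts[p] or ham_counts[p]]
    if not known:
        return 0.5  # no evidence either way
    spammy = sum(1 for p in known if spam_counts[p] > ham_counts[p])
    return spammy / len(known)

spam_counts, ham_counts = train(
    ["cheap pills online", "cheap pills today"],
    ["meeting notes today", "lunch meeting tomorrow"],
)
print(spam_fraction("cheap pills", spam_counts, ham_counts))
```

A message padded with random words contains few pairs seen in training, so the result drifts toward "no evidence", which matches the limitation described above.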

URL based unblock bounce

Many systems block potential spammers with a 'hard' bounce. Even if they are right 99% of the time, the 1% failure rate creates enormous headaches for the computer manager and the users. SurgeMail bounces with a URL that any sender can use to bypass the spam system, which greatly reduces the severity of 'false positives'. The web page includes a 'human' test so that an automated robot cannot bypass it (unlike the email based system in earlier versions of SurgeMail). You can enable the old email based system if you wish with the settings mentioned below.
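
A bounce URL of this kind needs some way to tie the link to the blocked message. The sketch below uses an HMAC token for that purpose; this is an assumption about one possible design, not SurgeMail's actual URL format. The secret, the host name, and the parameter names are hypothetical, and the 'human' test is not shown.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"change-me"  # hypothetical per-server secret

def unblock_url(base_url: str, message_id: str) -> str:
    """Build an unblock link carrying a token derived from the message id."""
    token = hmac.new(SECRET, message_id.encode(), hashlib.sha256).hexdigest()
    return base_url + "?" + urlencode({"id": message_id, "token": token})

def token_is_valid(message_id: str, token: str) -> bool:
    """Check that a token presented on the web page matches the message id."""
    expected = hmac.new(SECRET, message_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)

url = unblock_url("http://mail.example.com/unblock", "abc123")
print(url)
```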

How to enable the new system

 

To disable the new Smart Filter mechanisms and return to the old behaviour:

g_myrbl_disable "true"
g_sf_disable "true"
g_friends_byemail "true"
g_spf_byemail "true"

New Commands:

tellmail sf_train - Rebuild feature_gen.dat from sf_mfilter.txt using local data in 'train' subdirectories

tellmail sf_compare - Test feature_gen.dat on train sub directories.

tellmail friends_url - Show a sample URL for unblocking a message, use to test your web access/ports are set correctly.

New Optional settings:

g_myrbl_share "true" - Share IP reputation information with netwinsite.com
g_sf_generate "true" - Generate feature_gen.dat locally rather than using a standard generic one from NetWin.

 

Please note that both of these have privacy considerations. We don't recommend these settings unless you are very sure all your users will be happy with them. They are probably only suitable for small home based servers!

g_report_spam "true" -- Automatically send some spam samples to netwinsite for analysis
g_report_notspam -- Automatically send some misclassified not spam samples to netwinsite


Process the message...

    1. Get the IP color from RBL/MyRBL/SURBL etc.
    2. If the message is from a friend, accept it.
    3. Run sf_mfilter.txt to find the 'features' of the message.
    4. Score the message using feature_gen.dat, then either bounce it with a URL or give it to the user.
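
The four steps above can be sketched as a single function. The callables, the threshold, and the choice to feed the color in as one more feature are all assumptions for illustration; none of these names exist in SurgeMail.

```python
def handle_message(msg, get_color, is_friend, extract_features, score_features,
                   threshold=5.0):
    """Return 'accept', 'deliver', or 'bounce_with_url' for a message."""
    color = get_color(msg["ip"])          # step 1: reputation lookup
    if is_friend(msg["sender"]):          # step 2: friends bypass scoring
        return "accept"
    features = extract_features(msg)      # step 3: rule file finds features
    features.add("color_" + color)        # assumption: color is one more feature
    score = score_features(features)      # step 4: score, then decide
    return "bounce_with_url" if score >= threshold else "deliver"

# Hypothetical weights standing in for feature_gen.dat.
weights = {"color_black": 6.0, "color_white": -6.0, "long_to": 2.0}

def demo(ip, sender):
    msg = {"ip": ip, "sender": sender, "to": "a" * 300}
    return handle_message(
        msg,
        get_color=lambda addr: "black" if addr == "203.0.113.9" else "white",
        is_friend=lambda s: s == "friend@example.com",
        extract_features=lambda m: {"long_to"} if len(m["to"]) > 200 else set(),
        score_features=lambda fs: sum(weights.get(f, 0.0) for f in fs),
    )

print(demo("203.0.113.9", "x@example.net"))
```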

Features to stop cracking local accounts and sending out spam

  1. g_breakin_whitelist - used to stop a spammer sending from multiple ip addresses
  2. (the setting that limits sends per user per day)
  3. New setting to scan outgoing email for 'spam' content, unusual headers, or unusual URLs.
  4. Login web page for smtp/pop... (document)

Tips for users to avoid spam:

Never put your email address on a web page, instead use a service like this one: http://www.emailmeform.com/