MyRBL / Smart Filter -- New Spam Handling Process

How it works

With this upgrade we've made major enhancements to how spam handling works while keeping the same basic structure.

Smart Rule based filtering

aspam_mfilter.txt is replaced with sf_mfilter.txt.

The rule file sf_mfilter.txt now produces a list of significant features for any email message. The features are then analyzed using the rules in feature_gen.dat to come up with a 'score', so the scores are no longer 'hard coded' into sf_mfilter.txt.

The file feature_gen.dat is created by analyzing sample messages from your own server. Say we have a feature "blob" which correlates 98% with spam on your server and 20% with spam on my server. In other words, an email with the feature 'blob' is spam 98 times out of a hundred on your server, while on my server it is not spam 80 times out of a hundred. On your server the score in the X-SpamDetect header will then be something like "plus 10" for an email with 'blob', and on my server it will be "minus 4".

The feature 'blob' might relate to something like the length of the 'To' header, or whether or not the SPF tests passed, etc.
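The exact formula SurgeMail uses to turn a feature's spam probability into a score is not given here. As a rough illustration only, a log-odds mapping behaves the way the 'blob' example describes: 50% gives zero, higher probabilities give positive scores and lower ones negative. The scale factor and the clamping bounds below are assumptions chosen so the output lands near the "plus 10" and "minus 4" figures quoted above.

```python
import math

# Assumptions for illustration only: SCALE and the clamping bounds are not
# taken from SurgeMail; they are picked to roughly match the quoted figures.
SCALE = 2.6
MIN_P, MAX_P = 0.001, 0.999


def feature_score(spam_probability: float) -> float:
    """Map P(spam | feature) to a signed score; a probability of 0.5 maps to 0."""
    # Clamp so that probabilities of exactly 0.0 or 1.0 do not break the log.
    p = min(max(spam_probability, MIN_P), MAX_P)
    return SCALE * math.log(p / (1.0 - p))


# The 'blob' feature on two different servers, plus a neutral feature.
for p in (0.98, 0.20, 0.50):
    print(f"P(spam)={p:.2f} -> score {feature_score(p):+.1f}")
```

With these assumed constants the same feature scores about +10.1 on the first server and about -3.6 on the second, which is the point of generating feature_gen.dat from local samples.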

In addition to simple rules, the automatic process generates combinational rules based on your sample messages. For example, it might notice that a message which is from yahoo and has a long "To" header is always spam. These 'combined' rules are used to further increase accuracy.

By default surgemail uses a rule file that we have generated, so you must explicitly enable automatic generation of feature_gen.dat. The setting to enable it is g_sf_generate "true".
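One way to picture how combined rules could be found is to count, over the sample messages, how often each pair of features appears together and how often those messages are spam. This is a minimal sketch and not SurgeMail's actual algorithm: the feature names "from_yahoo" and "long_to" are hypothetical, and negated features (the ! form seen in the X-SpamDetect header) are left out for brevity.

```python
from collections import defaultdict
from itertools import combinations


def pair_spam_rates(samples, min_count=2):
    """Return {(feature_a, feature_b): (spam_fraction, count)} for pairs of
    features that occur together in at least min_count sample messages.

    samples is a list of (set_of_feature_names, is_spam) tuples.
    """
    seen = defaultdict(int)   # how many messages contained the pair
    spam = defaultdict(int)   # how many of those messages were spam
    for features, is_spam in samples:
        for pair in combinations(sorted(features), 2):
            seen[pair] += 1
            if is_spam:
                spam[pair] += 1
    return {pair: (spam[pair] / n, n) for pair, n in seen.items() if n >= min_count}


# Hypothetical sample: each feature alone is harmless, the pair is always spam.
samples = [
    ({"from_yahoo", "long_to"}, True),
    ({"from_yahoo", "long_to"}, True),
    ({"from_yahoo"}, False),
    ({"long_to"}, False),
]
print(pair_spam_rates(samples))
```

The min_count parameter is an assumption standing in for the idea that a pair seen only once or twice says very little.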

Built-in RBL / Reputation system

SurgeMail now includes its own RBL (Realtime Blocking List) and reputation system. This is a two-level database: a local database on each server, plus a reporting system and DNS-based query system that merge data between all SurgeMail servers in the world.

This system classifies each IP address as unknown or as one of the following colors:

Unknown = 98% spam

Blue = less than 10 days old, nothing significant known (typically 70% spam)
Brown = 95% spam
Orange = 40% spam
Yellow = 20% spam
Black = 99% spam
White = Less than 4% spam
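For scripts that post-process logs or headers, the list above can be written as a simple lookup table. The figures are copied from the list; treating an unrecognised name as 'unknown' is an assumption made for this sketch.

```python
# Typical spam fraction per MyRBL classification, copied from the list above.
# White is documented as "less than 4%", so 0.04 is an upper bound.
TYPICAL_SPAM_RATE = {
    "unknown": 0.98,
    "blue": 0.70,
    "brown": 0.95,
    "orange": 0.40,
    "yellow": 0.20,
    "black": 0.99,
    "white": 0.04,
}


def typical_spam_rate(color: str) -> float:
    """Return the typical spam fraction for a classification name.

    Assumption: names not in the table are treated like 'unknown'.
    """
    return TYPICAL_SPAM_RATE.get(color.strip().lower(), TYPICAL_SPAM_RATE["unknown"])


# Classifications ordered from most trusted to least trusted.
print(sorted(TYPICAL_SPAM_RATE, key=TYPICAL_SPAM_RATE.get))
```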

As most 'real' email comes from servers you talk to all the time, this system quickly identifies the trusted mail servers that never send spam, so messages from those servers are very unlikely to be accidentally classified as spam.

 

The advantages this has over traditional RBL services (which should also be used, of course):

It is free to SurgeMail customers

It makes use of the many users clicking 'spam/not spam' on messages as they read them to help identify spam more accurately.

It provides both positive and negative indications for the spam filter. This is much more valuable than the purely negative responses given by many RBL services, because the bulk of spam comes from 'unknown' transient IP addresses, so the significant information is the list of known mail servers that regularly send non-spam.

It is also a long-term reputation RBL: instead of automatically forgetting everything every 2 days like many RBL systems, we try to store a long-term record of stats for each IP address.
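To make the idea of a long-term, per-address record concrete, here is a minimal sketch of a local store that counts spam and non-spam reports for each IP address and never expires them. It is not SurgeMail's actual storage format or API; the class and method names are hypothetical, and the addresses come from the documentation range 192.0.2.0/24.

```python
from collections import defaultdict


class IpReputation:
    """Hypothetical in-memory record of spam / not-spam reports per IP address."""

    def __init__(self):
        # ip -> [spam_count, not_spam_count]; entries are never expired.
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, ip: str, is_spam: bool) -> None:
        """Record one report, e.g. a user clicking 'spam' or 'not spam'."""
        self._counts[ip][0 if is_spam else 1] += 1

    def spam_fraction(self, ip: str):
        """Return the fraction of reports that were spam, or None if the IP is unknown."""
        if ip not in self._counts:
            return None
        spam, not_spam = self._counts[ip]
        return spam / (spam + not_spam)


rep = IpReputation()
for verdict in (True, True, True, False):
    rep.record("192.0.2.10", verdict)
print(rep.spam_fraction("192.0.2.10"), rep.spam_fraction("192.0.2.99"))
```

Returning None for an address with no history keeps 'unknown' distinct from 'known and clean', which matters because the two are scored very differently.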

Probing suspect URLs

One common form of spam is email containing URLs that point to spam sites. As spammers create new URLs every day, even SURBL is often unable to provide updates fast enough to block emails of this sort. SurgeMail includes a probe feature where an unknown URL is probed to see if the page it points to contains known 'spam' keywords.
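The probing idea can be sketched in a few lines: fetch the page behind an unknown URL and look for known keywords in its text. This is an illustration under stated assumptions rather than SurgeMail's implementation. The keyword list is hypothetical (the real list is not given here), and treating a failed fetch as 'nothing known' is a choice made for this sketch.

```python
import re
import urllib.request

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = ("viagra", "casino", "replica watches")


def page_looks_spammy(html: str, keywords=SPAM_KEYWORDS) -> bool:
    """Return True if any keyword appears in the page text (case-insensitive)."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude removal of HTML tags
    return any(keyword in text for keyword in keywords)


def probe_url(url: str, timeout: float = 5.0) -> bool:
    """Fetch at most 64 KB of the page and check it for spam keywords.

    Assumption: a page that cannot be fetched is reported as not spammy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html = response.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return False
    return page_looks_spammy(html)


print(page_looks_spammy("<html><body>Best CASINO bonus today</body></html>"))
```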

Content filtering

SurgeMail has several modules that try to identify the content of a message as fundamentally spammy. One module, 'spamc', analyzes the words used, drawing on the messages in the training folders to find combinations of words that imply 'spam'. It tends to be about 80% accurate. In addition, there are a couple of modules with very specific words/phrases that surgemail will try to identify. Content filtering only works when the content of the message is the spam (as opposed to random words used to fool filters like this). That is why this method is only 80% accurate and needs to be used in combination with the above features.

URL-based unblock bounce

Many systems block potential spammers with a 'hard' bounce. Even if they are right 99% of the time, the 1% failure rate creates enormous headaches for the computer manager and the users. SurgeMail bounces with a URL that any sender can use to bypass the spam system, which greatly reduces the severity of 'false positives'. This web page includes a 'human' test to ensure an automated robot cannot bypass it (unlike the email-based system in earlier versions of surgemail). You can enable the old email-based system if you wish with the settings mentioned below.

How to enable the new system

To disable the new Smart Filter mechanisms and return to the old behaviour, use the settings below.

Only use these settings if you really must :-)

g_myrbl_disable "true"
g_sf_disable "true"
g_friends_byemail "true"
g_spf_byemail "true"

How to convert your local.rul file to sf_mfilter_local.txt

You can still tailor your own rules with this new system. However, we suggest you try the built-in rules first and see how they perform. When adding rules (e.g. converting an existing local.rul file) you will need to change the actions, choosing from the following possibilities:

  1. Add a manual score - call feature_manual(0.8, "Manual addition")
  2. Add a self tuning score based on your spam sample - call feature_add(1.4,"featurename")

In the second case the score '1.4' is ignored, so the recommended method is to convert all call spamdetect(x,y) statements to call feature_add(x,y).

In the first case the value 0.8 is NOT the value added to the spam score; it is the probability that such a message is spam, so a value of 0.99 might add 12 to the spam score. A value above 0.5 adds a positive value to the spam score and a value below 0.5 decreases it. Examples:
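The recommended conversion is mechanical, so it can be scripted. The sketch below rewrites every call spamdetect( to call feature_add( and leaves everything else untouched. It assumes the calls in your local.rul are written in the plain call spamdetect(x,y) form shown above, and the file names are the ones used in this section; review the output by hand before using it.

```python
import re

# Matches "call spamdetect(" with flexible spacing.
_SPAMDETECT_CALL = re.compile(r"\bcall\s+spamdetect\s*\(")


def convert_rules(text: str) -> str:
    """Replace every 'call spamdetect(' with 'call feature_add('."""
    return _SPAMDETECT_CALL.sub("call feature_add(", text)


def convert_file(src: str = "local.rul", dst: str = "sf_mfilter_local.txt") -> None:
    """Read src, convert the calls, and write the result to dst."""
    with open(src, encoding="utf-8", errors="replace") as infile:
        converted = convert_rules(infile.read())
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(converted)


print(convert_rules('call spamdetect(1.4, "blob")'))
```

Since the numeric argument is ignored by feature_add, the script keeps it as it was rather than trying to translate it.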

call feature_manual(0.95, "Probably spam")

call feature_manual(0.7, "may be spam")

call feature_manual(0.0, "almost never spam")

call feature_manual(0.1, "Probably not spam")

 

You should only use 'manual' rules when the feature is so rare that your sample data does not give useful figures on it. In that case the rule is probably of little or no value, so we suggest you don't use it at all :-) There are exceptions, though, where the sample spam messages will tend to give the wrong result (as the sample is not entirely random), so a manual rule might make sense.

 

Then run the following commands:

tellmail sf_train

tellmail sf_compare

The first will generate a feature_gen.dat rule file and the second will use it to compare results with the sample spam folders.

If you examine feature_gen.dat after the sf_train command you will be able to see what surgemail thought the feature was and how significant it was (sig = the number of messages with the feature). A feature with a probability near 0.5, or one that occurs fewer than 20 times in the sample, is probably of little use. Near 0.0 means the feature implies the message is not spam; near 1.0 means the feature correlates with spam.
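The rule of thumb above can be written as a small check. The threshold of 20 messages comes from the text; how close to 0.5 counts as 'near' is not stated, so the neutral_band value below is an assumption. The sketch takes the probability and sig values as plain numbers and does not attempt to parse feature_gen.dat, whose file format is not described here.

```python
def is_useful_feature(probability: float, sig: int,
                      min_sig: int = 20, neutral_band: float = 0.1) -> bool:
    """Return True if a feature looks worth keeping.

    probability: fraction of sample messages with the feature that are spam.
    sig: number of sample messages that have the feature.
    neutral_band: assumed half-width of the 'near 0.5' zone.
    """
    if sig < min_sig:
        return False  # too rare in the sample to trust the figure
    return abs(probability - 0.5) > neutral_band


print(is_useful_feature(0.98, 150))  # strong spam indicator, well sampled
print(is_useful_feature(0.52, 500))  # well sampled but close to 0.5
print(is_useful_feature(0.99, 5))    # looks strong but seen only 5 times
```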

We are always interested in new features you make up that prove useful. Be cautious that some features can give misleading results due to the nature of the sample messages.

 

Convert into automatic rules instructions (recommended)

Convert into manual rules instructions:

New Commands:

tellmail sf_train - Rebuild feature_gen.dat from sf_mfilter.txt using local data in 'train' subdirectories

tellmail sf_compare - Test feature_gen.dat on the 'train' subdirectories.

tellmail friends_url - Show a sample URL for unblocking a message; use this to test that your web access/ports are set correctly.

New Optional settings:

g_myrbl_share "true" - Share IP reputation information with netwinsite.com (strongly recommended, this setting really helps contribute to the wide area rbl which all customers benefit from)
g_sf_generate "true" - Generate feature_gen.dat locally rather than using a standard generic one from NetWin. This is worth setting once you have a reasonable sample collected (surgemail automatically collects sample messages within a few days)

g_friends_lang_auto "true" - Guess the user's language(s) by observing messages from each user's friends, then add a tag if the user receives a message which is primarily in a language that the user does not have listed. The user's language settings are prefixed with the word 'Auto,' when this setting is used, so users who have manually set their language(s) will not get adjusted.

Colors (MyRBL)

Blue = new
Brown = new, looks suspect so far
Orange = not new, but no evidence good or bad...
Yellow = some good some bad (usually public email services like yahoo/gmail).
Black = mostly spam
White = rarely spam

Message processing steps (in brief)

    1. Get color from RBL/Myrbl/Surbl etc...
    2. Run sf_mfilter.txt to find 'features' of message
    3. Score message using feature_gen.dat and then bounce with url or give to user.
    4. Run mfilter.rul file
    5. If from friend accept
    6. If exceeds friends setting then bounce message
    7. Deliver to inbox

Features to stop spammers cracking local accounts and sending out spam

  1. g_breakin_enable "true" - used to stop a spammer sending from multiple (3+) ip addresses. (g_breakin_white can be used in rare problem cases, e.g. g_breakin_white "user1@domain.com,user2@domain2.com,*@domain3.com")
  2. g_user_send_warning - alert manager when user sends too many messages.
  3. g_user_send_max max="500" - limit users to a modest daily total
  4. g_safe_smtp "true" - stops a user logging into surgemail to send email without first logging into imap or pop. This will stop 'most' spammers (but not all) in their tracks even after they hack into an account. It won't usually cause people problems, but it might on rare occasions.

Explanation of the X-SpamDetect header

Here is an example header:

X-SpamDetect: *******: 7.8 sd=7.8 [194]99%13.1(!9,46) [126]10%-7.2(!33,108) [38]87%5.4(X-myrbl:unknown)

This shows a score of 7.8, then a list of the rules that were applied, separated by spaces. There are two sorts of rules: simple rules and combination rules.

A combination rule looks like this: [rulenumber]percent%score([!]a,[!]b)

rulenumber = this rule number as listed in feature_gen.dat

percent = The percentage of messages matching this rule that are spam

score = The score generated from the percent. Anything over 50% generates a positive score; anything below 50% generates a negative score.

(a,b) = The two rules which were true that made this combined rule true; ! signs are used to indicate 'not'.

A simple rule looks like this: [38]87%5.4(X-myrbl:unknown)

rulenumber = this rule number as listed in feature_gen.dat

percent = The percentage of messages matching this rule that are spam

score = The score generated from the percent. Anything over 50% generates a positive score; anything below 50% generates a negative score.

(a:b) = The header and value that were 'matched' that made this rule true. If no header is specified then it's a feature as defined in sf_mfilter.txt
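If you want to inspect these headers in bulk, the two rule formats described above are regular enough to pick apart with a regular expression. The sketch below is based only on the example header in this section, so treat the pattern as an assumption about the format rather than a specification; the Rule type and function name are made up for the example.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    number: int    # rule number as listed in feature_gen.dat
    percent: int   # percentage of matching messages that are spam
    score: float   # score derived from the percent
    kind: str      # "combination" or "simple"
    detail: str    # text inside the parentheses


# [rulenumber]percent%score(detail)
_RULE = re.compile(r"\[(\d+)\](\d+)%(-?\d+(?:\.\d+)?)\(([^)]*)\)")
# A combination rule's detail is two rule numbers, each optionally negated with '!'.
_COMBINATION = re.compile(r"!?\d+,!?\d+")


def parse_rules(header_value: str) -> list[Rule]:
    """Extract every rule entry from an X-SpamDetect header value."""
    rules = []
    for number, percent, score, detail in _RULE.findall(header_value):
        kind = "combination" if _COMBINATION.fullmatch(detail) else "simple"
        rules.append(Rule(int(number), int(percent), float(score), kind, detail))
    return rules


EXAMPLE = "7.8 sd=7.8 [194]99%13.1(!9,46) [126]10%-7.2(!33,108) [38]87%5.4(X-myrbl:unknown)"
for rule in parse_rules(EXAMPLE):
    print(rule)
```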

The total sd=7.8 is not a simple sum of the rule scores, but rather an 'average' of the rules that matched, offset upward by 4, i.e. sum(scores)/n + 4.
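Applying that formula to the three rule scores in the example header (13.1, -7.2 and 5.4) reproduces the reported total. The function name below is made up for the illustration.

```python
def spamdetect_total(scores, offset=4):
    """Average the matched rule scores and add the fixed offset."""
    return sum(scores) / len(scores) + offset


# Scores of the three rules in the example header.
example_scores = [13.1, -7.2, 5.4]
print(round(spamdetect_total(example_scores), 1))  # prints 7.8
```

That is (13.1 - 7.2 + 5.4) / 3 + 4 = 11.3 / 3 + 4, or about 7.77, which the header shows rounded to 7.8.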

 

Tips for users to avoid spam:

Never put your email address on a web page, instead use a service like this one: http://www.emailmeform.com/

What if it doesn't work at all?

If the scoring is completely blank or if you see this text in the headers:

X-SpamDetect: : 0.0 sd=0 feature_gen.net (or .dat) is blank or missing, update from netwinsite failed see netwinsite.com/surgemail/help/myrbl.htm for help

It might mean you are running a new build with the new spam handling mechanism and, most likely, it has failed to pick up its main rule file, so it's not applying any rules at all.

It might fail if you don't have updates, or if you have a firewall blocking outgoing port 80 connections from your server. Once you fix the problem you can trigger an automatic update by deleting aspam_update.done and restarting surgemail.

The two files you need are:

sf_mfilter.txt
feature_gen.net

They should be fetched automatically from netwinsite, but that can fail if your firewall is blocking port 80 connections, in which case you can download them manually and then restart surgemail:

http://netwinsite.com/surgemail/sf_mfilter.txt
http://netwinsite.com/surgemail/feature_gen.net
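If you prefer to script the manual download, something along these lines would fetch both files. The two URLs are the ones listed above. The destination directory is left as a parameter because this page does not say where SurgeMail expects the files, so that part is an assumption you will need to fill in for your installation.

```python
import os
import urllib.request

RULE_FILE_URLS = (
    "http://netwinsite.com/surgemail/sf_mfilter.txt",
    "http://netwinsite.com/surgemail/feature_gen.net",
)


def local_name(url: str) -> str:
    """Return the file name part of a URL, e.g. 'sf_mfilter.txt'."""
    return url.rsplit("/", 1)[-1]


def download_rule_files(dest_dir: str, timeout: float = 30.0) -> list:
    """Download both rule files into dest_dir and return the saved paths.

    Each file is read completely before it is written, so a failed
    download does not leave a truncated file behind.
    """
    saved = []
    for url in RULE_FILE_URLS:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
        target = os.path.join(dest_dir, local_name(url))
        with open(target, "wb") as outfile:
            outfile.write(data)
        saved.append(target)
    return saved
```

Run it from a machine that can reach port 80, copy the files across if needed, and restart surgemail afterwards as described above.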

Or you can disable the new system with this setting: g_sf_disable "true"