[personal profile] fanf
Counting all offered messages (rejected or not), we saw 1 447 252 different HELO names in the last month. If I count the number of dots in each name, the resulting histogram is as follows. The small end (0-2 dots) is inflated by incompetence and forgery. The big end (>10 dots) is 99.99% abuse.

 25765
450511 .
218188 ..
432343 ...
197647 ....
 33647 ..... 5
 28485 ......
 19790 .......
  4582 ........
  2040 .........
  3069 .......... 10
  7005 ...........
  9483 ............
  7722 .............
  4390 ..............
  1840 ............... 15
   568 ................
   150 .................
    23 ..................
     3 ...................
     1 .................... 20
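The tally above can be reproduced along these lines. This is a minimal sketch, not the script actually used: the function names and the sample list are made up for illustration, and it assumes the HELO names have already been extracted from the logs into a Python list.

```python
from collections import Counter

def dot_histogram(helo_names):
    """Map number of dots -> count of distinct HELO names with that many dots."""
    # Deduplicate first: the figures count *different* names, not messages.
    distinct = set(helo_names)
    return Counter(name.count(".") for name in distinct)

def format_histogram(hist):
    """Render one row per dot count: the tally, then one '.' per dot."""
    width = len(str(max(hist.values())))
    rows = []
    for dots in range(max(hist) + 1):
        rows.append(f"{hist.get(dots, 0):>{width}} {'.' * dots}".rstrip())
    return "\n".join(rows)

# Hypothetical sample input; real input would come from the mail logs.
sample = ["localhost", "mail.example.com", "mail.example.com",
          "example.org", "a.b.c.example.net"]
hist = dot_histogram(sample)
print(format_histogram(hist))
```

Names are compared exactly as offered here; lowercasing them before deduplication would be a reasonable variation, since host names are case-insensitive.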


Of the messages we accept, 274 902 different HELO names were used (19% of the total). If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84906 ..
75131 ...
26182 ....
 4723 ..... 5
 4436 ......
 2686 .......
  279 ........
  123 .........
  123 .......... 10
  317 ...........
  447 ............
  320 .............
  211 ..............
   87 ............... 15
   21 ................
    4 .................
    1 ..................


A lot of these are clearly bogus, for example 80 characters of random words concatenated with an IP address, like

Antigone.meter.ernet.ne.jpsouthparkmail.comnetlane.comlouiskoo.comjpopmail.comtw60.186.213.104

or a random collection of concatenated domain names, like

cave.ngs.ouse.hello.nlsammail.compcmail.com.twsouthparkmail.com

(These should obviously be added to my HELO heuristics!) After removing them, there are 272 890 HELO names. If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84905 ..
75130 ...
26176 ....
 4688 ..... 5
 4334 ......
 2521 .......
  179 ........
   47 .........
    0 .......... 10
    2 ...........
    3 ............
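One way such a filter might look is sketched below. The two rules are assumptions chosen to catch the two examples quoted above, not the heuristics actually used to produce this histogram, and the name `looks_concatenated` is hypothetical. Rule 1 flags letters fused directly onto a trailing dotted quad; rule 2 flags a com, net or org label in the middle of a name, unless it sits just before a final two-letter country code.

```python
import re

# Assumption: a deliberately tiny list of generic TLDs, for illustration only.
GENERIC_TLDS = {"com", "net", "org"}

# A letter immediately followed by a dotted quad at the end of the name.
DOTTED_QUAD_TAIL = re.compile(r"[a-z]\d{1,3}(?:\.\d{1,3}){3}$")

def looks_concatenated(helo):
    """Return True if the HELO name looks like several names glued together."""
    name = helo.lower().rstrip(".")
    # Rule 1: letters fused directly onto a trailing IPv4 address.
    if DOTTED_QUAD_TAIL.search(name):
        return True
    labels = name.split(".")
    # Rule 2: com/net/org in a non-final position, unless it is the
    # second-to-last label before a two-letter country code (e.g. com.tw).
    for i, label in enumerate(labels[:-1]):
        if label in GENERIC_TLDS:
            is_cc_suffix = (i == len(labels) - 2
                            and len(labels[-1]) == 2
                            and labels[-1].isalpha())
            if not is_cc_suffix:
                return True
    return False

print(looks_concatenated("mail.example.com"))
print(looks_concatenated(
    "cave.ngs.ouse.hello.nlsammail.compcmail.com.twsouthparkmail.com"))
```

Both rules are crude and would need checking against real accepted mail before use, since an odd but legitimate name could trip either one.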


This still includes various stupidities. Of the 37272 single-dot names ending in com|net|org, 26631 have no name servers and so are invalid. In the unfiltered list, 208323 of the 288884 com|net|org names are invalid.

Edit: Actually, if you use less-strict DNS validity checking those numbers are 22015 (instead of 26631) and 206556 (instead of 208323).
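The single-dot tally could be computed roughly as follows. This is only a sketch: `has_name_servers` is a hypothetical callable standing in for whatever DNS check is used (strict or less strict), and it is stubbed here with a fixed set so the example runs without network access.

```python
def count_invalid(helo_names, has_name_servers):
    """Return (invalid, total) for distinct single-dot com|net|org names."""
    candidates = [
        name for name in set(helo_names)
        if name.count(".") == 1
        and name.rsplit(".", 1)[1].lower() in {"com", "net", "org"}
    ]
    # A name counts as invalid when the supplied check finds no name servers.
    invalid = [name for name in candidates if not has_name_servers(name)]
    return len(invalid), len(candidates)

# Hypothetical data: pretend only example.com is delegated.
names = ["example.com", "nosuch.net", "mail.example.com", "foo.org", "localhost"]
registered = {"example.com"}
invalid, total = count_invalid(names, lambda n: n in registered)
print(f"{invalid} of {total} single-dot names are invalid")
```

Swapping in a stricter or looser `has_name_servers` is what would move the counts between the two sets of figures quoted above.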

Date: 2004-12-02 19:20 (UTC)
From: [identity profile] kaet.livejournal.com
Looking at these in gnuplot, my initial hunch would be that there are four naming policies at work here (this is my hypothesis):

policy  mean-len  sd-len  prop-of-names
  A         1       0.5       0.346
  B         3       0.7       0.468
  C         4       2.4       0.162
  D        12       1.4       0.023


I'd guess that policy A is incompetent naming; B is "showy" names (mail.foo.com, etc.); C is "topographical" names (mail.border.london.router.dodgy.net, etc.); and D is spammy long naming.

Looking at the accepted names, my best fit for these classes is (probability of an accepted name being in each class):

A 0.33
B 0.53
C 0.14
D 0


Applying Bayes, that means the following probability of goodness (acceptance) given that a name is in each class:

A 0.181
B 0.215
C 0.164
D 0
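The Bayes step can be checked with a few lines of Python. This sketch uses P(good | class) = P(class | good) * P(good) / P(class), taking P(class) from the "proportion of names" column of the first table, P(class | good) from the accepted-name fit, and P(good) = 0.19 from the original post. The variable and function names are made up for illustration.

```python
P_GOOD = 0.19  # accepted names as a fraction of all names

# P(class): proportion of all names in each class.
p_class = {"A": 0.346, "B": 0.468, "C": 0.162, "D": 0.023}

# P(class | good): proportion of accepted names in each class.
p_class_given_good = {"A": 0.33, "B": 0.53, "C": 0.14, "D": 0.0}

def posterior_good(cls):
    """P(good | class) by Bayes' rule."""
    return p_class_given_good[cls] * P_GOOD / p_class[cls]

for cls in "ABCD":
    print(cls, round(posterior_good(cls), 3))
```

Rounded to three places, this gives the same figures as the list above.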


Dividing by the overall probability of goodness (0.19) gives the factor by which this model predicts you should multiply your estimate of goodness, given that a name was allocated to a particular class:

A 0.95
B 1.13
C 0.864
D 0


If my intuition about the nature of the classes is correct, then it should be possible to detect each class (approximately). The most valuable, I think, would be to distinguish between B and C mail. A test for C might be, for example, "A-record contains numbers". But it's not something I can really hypothesise on, not having the full data, :).

As SpamAssassin has linear scoring, presumably intended to be a sum-of-likelihood-logs model, I'd give the following penalties (modulo an arbitrary scaling multiplier):

A +0.05
B -0.12
C +0.15
D +infinity
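These penalties appear to be the negative logarithm of each class's multiplier. The sketch below assumes the natural log, which reproduces the figures above to two decimal places; any other base only changes the arbitrary scaling. The names `multiplier` and `penalty` are hypothetical.

```python
import math

P_GOOD = 0.19  # overall probability of goodness

# P(good | class), as listed earlier in this comment.
p_good_given_class = {"A": 0.181, "B": 0.215, "C": 0.164, "D": 0.0}

def multiplier(cls):
    """How much class membership scales the estimate of goodness."""
    return p_good_given_class[cls] / P_GOOD

def penalty(cls):
    """Negative log of the multiplier; infinite when the class is never good."""
    m = multiplier(cls)
    return math.inf if m == 0 else -math.log(m)

for cls in "ABCD":
    print(cls, round(multiplier(cls), 3), round(penalty(cls), 2))
```

A positive penalty pushes a message towards spam and a negative one away from it, so only class B earns a bonus under this model.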


Just a quick hour or so modelling: nothing too serious.
