From de13cdbcd51c8934e9362cab9ed266474ae41339 Mon Sep 17 00:00:00 2001 From: Carlo Strub Date: Wed, 28 Feb 2018 23:43:19 +0100 Subject: [PATCH] add slides --- docs/bayes.png | Bin 0 -> 8466 bytes docs/meetup.slide | 74 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 74 insertions(+) create mode 100644 docs/bayes.png create mode 100644 docs/meetup.slide diff --git a/docs/bayes.png b/docs/bayes.png new file mode 100644 index 0000000000000000000000000000000000000000..61cc4c8722105ede926e91dd155a23e73732f3f8 GIT binary patch literal 8466 zcmb7qcOcY%{6B{rDrYMhmlfx1k$KL@I23ipIpnB}lB~05W=6Iol#${%*?XLova_cN-ac$t)U7p{9tBboXA62ZO?B+x$lV-g2PSY3;OUvEXr3Ow8wRjo&`k{Zn3=U&*VkeE#_1|oko#ZC57v)l1b z2Y3FarqRW6A4Vr9v339Ue?{@$jS(SWYj3-G1Rb4hHL^L?{~exN?*ACocy{a%uY3v| z>t>}5nTlW%OkO(v)2qTweDY_0(qkGM^RML<-9r_&5%7{SbG5!$h zXLC_vn(MQ5_^tYbWSrmmzk`qJ7}4HL70Y)D9?v~9tVMiOPPbdL>-6scTaYvoe>7R` z^2D;%`(X3s?tp)5WA0evZ$YjP7pfVG?4d{qK+&dl8VOLn);`WXiE@4wbN7t?chD#B zM#m6{@(bR_JrW;GhZLxk5P|M14=$NbdmKCMA20et+FGkRdeC zTbt57GOo!@$9L;d2q+{jt&~V49*D#GOFqgQfv4EZ?tR&qYyC&QlL$k+xMA{K{q^iLDeOY>AR1bFF;w!? 
z6j7}Fc3h+1zvGwI@gtpU3xWbd;nZnWu5>aBzfC$~xh)$6fBv0|rg;u^+O752n%b4| zT0i!g5OAT;;?nDJ(3)$#fu*f?)%BY_2K;q1^qYZo;vsL zF?(TnKwpHZguCA6RF1?evOxAA{;Wi@<|^^*Zs+#V{S`e@Hi2@lw^jgZ%?8Q~NrJzc z_B$zBb{nu3UOeTzeG37$BRfIpddgXY+P_KKl(KGHCC_vWg*mfLr9z%C23PbfFY0|a z0z-Orf9tDBT@Eb9SA&wfdv-&wG@;dr${xm7scE2B{EaeQeo0Pq2_ z+wv-HoO7l(`bwe))G#Wc=n|pP4CvFt?RuRIgsBoC42evK~zA#`c%z8Wrp4 z>c0$%)^7f+JiIq$`*Vk6sz-h^hyjz#fLg&ezAGwA%~_CxMC7`gJ$;l=^Ky%^1{-uu zp>$rRDp>~0B56V61&1@2*NWV3SP@46xcx4~a86$L^VtXmI-iWY_1_1n&x_$Vz%(3+ z;WT4R?B$fNI|h|H;6fxdT>|J!68<{ZfV%3`t(&J2>dz z_V>w}mN0dIqxUNB4dw{|QS$qZt-$xzJ%1~Vk(79(W^qY}c$9YwRAU~m;9b)b-P0?c zNA~8SH2TlW=ekvWHOl#kufBB$zhOL|tsbrkxl9PII?{QUz`o}eFR_{-mwTL}!8s{@ zl1&D+Sxii=bBtWc*0xj$?-1uq2|q_X#K5eNC*;_J>z<2{f2ed>O_K7f{QU#Fv3Bx- zBvbl<32v)USpFfaxpanHHqj4TG^ss){Sy|218e-l1~q-DZ++5p5wBcyNdEW)-Ask=+S52jXCZl6&snP8+t zNpE7LBI^Nbrr8xObTUO&k90*AzDtdN7!0`dwck{x?T??3DA;}~ZRGSNNes2WOT5kD zJ)C-v4Kx+SQc&*0qm9vmr%C{eD3&i(By#-V`kHgnVkQbny#@oXOY}81SYiGhON0=@ zu0V$I$a<2Su-$!Ae!!2?B&J9#xSFukZGq%4dT#fIj^|M#ob1bO{ye#LIeYupr(D*C ziztw^b0V&?U8x}T4g>sBcWq3yg0c>+hS(WJbGC$et^2d-6N%;nEXST@)Y|nO26@D& zi&nM_f@Cax)B)45(DlxV~*&lsC2R1 z6WnO_Ib{yU(BU8>9QEre*4mzWLBvbA6b{3T3H?ufZT>S(;MHFf6B8rbveKwa!}PH~ zZ*grynclO2@BD%&)n&l&2sIHmne=fE|5hEzJgYGikot)C3ZsL2xfre|+OvcxUl7#( z3rba)JYiJTC~Dc@Go$>~)W|tA+WzM7(Pu_x^>-E%MHH`9Y z(x?e>nU2Kgm1s6fh2)l~^HJdCQXc+S(=WyUw_rP&ZZNBH>OG{3)oXr4j+T|BqrT4d zQ%D(E*BH(34fBZlULFw~=#g7I0YFiY?7)&)DVkpz93PLo@6@Zj{=e&dZ!QHTg@&a| zAIWkpre1!j8}Tve_4vV@inW9P1Q6bvJZZ*6wB&-j!eedn zxrK8rp@D^>(XbdJvw-D|cThjy-i%}itS}VOK6=Cu#KMz{^=-KqP-06!INcfYV#F6i{`qfE}8y=CYapypowh6`WaF<7#M)M=;j>QTTWa}QD#M#y6P^phF zblFui`4Y&866{Shk{Br{Bg>|mSGjN;==-Q_SB>$$UQvF8 zPMbg8_i~z?E=?Rk{-)aNH;W3!0eF^phlPTYC0wD86qLN>`qtDpJ+8-&cS6?l*Prh88E;4Kd`4#W>m+_kusLhIE9|z~B6~u;GQCjw16LoC2{Mar zO%rxhU8OHU-tw0C${J4hzbKQcJB&4ZRTGeLOEULv>M-?hzhXq0Nuc*_=?j)yJX zAgk*gHjGFcbqo?AvJOQWuz%I0HOI&a1iiz>{|p>>Me%Giebh|<%jFBXTBFVb#A;b_ zHlOy|nbXhJw1|GA$~mY~d@MLKvB6`(O&fa!b3e=lB1QDv;g%7+>(kKl0`u+fM+yTY zLV_T(OKFNEi?{3rUM}10m(e$w?Qk~<#x 
z{&bdjq~w35SfxS)QTpD^;;VRU22*t`q_T+Z-D}^OYm4&zP2bkO8WR3sMb|h3ib|H* zlH%OHeE+&aS2w+Ja+yy*;neq>-6}(WV-cy=2qFK^)gx3%}&uuGR&HH6yELrMQO=&N zqer09bav(vOW&Gl1OntxmiFsX0B){IpJrPG2tf$vr zxibU;bt8?>aRiD&-l^*C&ura0&MF~R6lSw|s}#w%i#N56lQ4G<$|rp#n55fnilaR; zwriC4+y**!8oYP^d`p6<`j#OZa@(F<@uC6?94dHhif{<%oz?*j%hdFGu!MtQbRXx1-{AG&-nb@?dD*X z-f8(jQ>^xieU;INBAASe$Ll~YUHy`!8~uduX65h9J{Scg-p)nJHH}0^D3%{4?&Vip zqq77%Z0ZUf1MNWFgW>8cBlD|2zXV%e=F|dnFAFR2g~wXhcC-h6fAf%$XWYKe%YtSt zwBY_uHS4$@+9k>b=1L1m9e+0ORS-&Zxj8y}nS=ReP6bC(UYurBQEYZcoG`-b-!iS? z_h(?*8>=b%Nfl>(ypNW>8~>hCeYuG!6DR#CK}ViRFvxBRD`8xB+JqL1SeMEqVk#{D| zgW8tm$U8Tl!nC@z6Va*x%S+(Z{z5~=WL9*M!Kl|%i&Qd^Te%rTic$&xA)OyQj)Y&A zuSWfF9m*10yt%Kc7)nM=5-|$uH(a$}MrCO2tD!kK&j(+|%?N`bO4^m1j9*b3_F~!* zuM5Kht;pxvR7uBN_5X%Dqx)Zp3MZQRLs@*1`Eot~{T=z*W^6|ul!abLl`Kih>RN^u zU@;a?sqWmEsf8sMqmA>6=pb0CDmM=kw(*4#`<}~6PQ=}3Dykq-GLxpR5dEf}yDwF6 z-s`>gZx)GirQmVDj{W>HvK_{q9MoltHznCO^$%4x^mT=VHOJM^ZbtJnLaJ*Z^@-*e z^<|En>`0XavnZ)S7ykEezW8M5K==^iV-X+dWCp$QqN$?%PD~p^J~JyxmqNmg-+S7V zgVv1nPs3)g36P(JwnskPt!!uC<~nmfvN2-JD`M13V9^|Wp#7tMaST9 z-0zB3Hnp>Q*+$9DB=|@Oj#%G(m*^5gH0wrxa(Jdg$CAc%E$9dDdD%RRyQBqRqs6dv-AY+{)oTBt>w?3+14;4kNj(wsKSeA-dsPQHN9U`A-K}= ztblk$$g0rfr)g!7ahAOf`JCR7k#|6IvL`*L(@* zL|f}gS>qI6+v6}i~9O_Xe5=_s9F7`50RRdKhkUmN$9D#_hh>K>{>-+NqLSe<`)$t-_ zFMmDB^hLXh0Git$1cGD+OybI--~8o(T%HH&2r`80a+e!W^d5ZlRw)V)JHyjze3yJg!cGfxw@v9a-~7buP$PpL^}-;bJQhzedjI)>&~_f6mn zApkol3C{qLVD8C=&tzC(2y3aBb{_5LGmo!dAIVVgxC)L-rKt=d+u8NcK^?!hg)PW2 z^EmsQY*a?ktDMF~M@JWZ{dc@pY^3t}b3b7AQ*N1;N+#rz!#Q~{R%^`X^3~BSbqCMz zjdXW-?tidaFsZ4liJ^;NNvT4QV>QVKdh;{=9QO^E}B>}DgY^;Af=CticvBt^WLIBN7=yj^jme})8_qSTj5bY&3Po4RlVQN>3 z2k%JtS%YkJ!4R?YyE*}clra|g?ZbEwUyP;Sk?qb8+36A>$h%djoE_p;?>>|<=Qw0a?Y>taQyT#j`%7O|zCYUGN9dJd}EJjZ7K@sl`qpdo#ga z^Rv+X^KU5~+PEjZ?>DH1$GybV@MvgRANy;dG*xgnRQ@&B)|e9>MwH6dPYb`CsRxGI zD4GL4{;!bXeAd^m>*f>5o%Y3^lC6;qfbvw``T@XJ*VR7Mgpc=6A-BXZ4>@k6qw7cE z>YKMABKZ;w*PtP@D8p>!DUEpM>O^;;b(P)gBXHBlsItTF3h;4xFo ztx?$af+_bkobGmi`u(&FbBjL|xV&@LRtO;u#RaL?I{yyk;0`!%U9$%>q~C&7bf&$g 
zwq6}B!kjREenfpahFkvX9oDI@<>hWO6EjZGC4O{3=ier0F>Hi^wr;^FS@6-zbCyVO>La7`HD`6x{J?(QeDz*~JwI5N z4F8G#xE?>M>Y5nS+eiNC^*K)thlw)Zr7LtftW5+tVuN8w9B~=drYz$w+w&t`Hhbf97dg?e%G_Xk`DK;V@dmfKU#hV@&IO6o?`RX&#(xPS&d8 zTDR~;T-@y}H6~5SKLC5r47HJd3_}E1*yj6rF0}~LKonEmP?x#vLr_;pNIN!xxf-i- zE#mgd)+rI!aS2Yq;&vH%7Nx--qS(YP`8t)ud^QQZ16*HOdF0PEbgB>U5OqM5!?!$d z%?9>PirHL!En%vwhVUp{GZlP*%Bb6!V0S=0eKEM7#M!sR+&hyP@+!nt!hsy*de5%= zannNvo_(mn8g&{I&>_8Owx)W9!~Jf`@!p!MUvQXrJN|pfsX*E;+;OK0UHs83!MAl) z-K$H&?jT=!UOxV8&3WZuwsv(vGDst*SN`K4Nb^k+6}3>?IzV>N1U=OHf{p_g7>Boy z(n~&3i~_&s&)#JAd7_kZHK-;`*?4J62tKfpYVER(&%MoKf84heNMJXN9HWC|3D%?k z=oJQ$^_w&_A6zvu*xXpsQ38kJr}LPNw|lND)(Sg@z^X$s2^So$mmh_(m?~;QACr zpp}bX)9Rb+3e-_i%3PK62{!=d2MT+Ya*n&@AXZ43T6TY}&FuP2&q}Eeco=IWPo5Ti z9{#NJgjil)&dgO!6dGqC6@L1^RUfTYTK!sIT+I7*?+WY2-pU~LKmtw|E^rn{^XvHt zoRi1tw~Dv`>g?%q?ekra^k$ZGYUgAQxz8S#W?k6BbqoOtEu`?W#aZlV=uFRO#5gM+ z1|)iXBFvbuUTsG5`skmM`J#2SVfJ*MGIqSEt7t=$@6Ufak_4(VAjx6%-}+ubLrf$f z-EZbTu)7%U1N>mZH67*5cR27cC+XQSAIpV)zRfJ0hr@@HgIIZT|KRX*Q1lF*YB^MR z#{JQx9y8DXo}>eu#LmlG(2?rQ0Pxu4xkBjcVsf&!u+5$xV8|&10=Og_{_QyxNbh}I zO1lY ?F~FI7y?(`iSWA+&`4Q)M9AxrC;MSuDWB)wMVm+|g7dkGIwXcA9%JWh|sL zYV!{uQidf)W(dE0HQ9B~I83Zi=SWhGxlz#pZeM-&lU|50E(K}se>&RL_g9)B7 zU8@(OwW*cO5Fb$L|`GxPF z8P{gGil+w|yTBzw$N=?e4O$t7)V8(z4;9|GX7XLo|ITjo{FHhEW-&o6eDYz=iB>Kj z@TF-5G%D#4QUsu);uf1Iqr2Fz6ntao%@Q1wo%V4l7Kp6#b1ch+=A!=frHW!}%6y`q z?OJ%Mgir_R*1wg5@YUAVcHzqGW*s&pZ+@Wxl$|34`IY^pX?#Y3M(Bws%CoPSXR|?l@s}&aB;{{b?lc?T~duDq!Otm3WxzQgk&^J z%gl~URQaClPoiHc9trE zM*@CaJPRN>e%w5?mcw^(KS_|ho$(i;ZU%K2k^8XgTjl>9X6)?l->rD4t$4}dDuscz z-NoH(`1T0wIoeJ_JP$?WAlzR{)5E^g0d@ypW5ABFTIW#3?Q;EnIy^jlOS$vue?n2I zhnP3|`K}W7f~sP19Wn4p&4aD!%o+uoR~Ook)5*GJPWPOB&n2m}D&7+q;r;(SIB-t! 
Y_HDJjtg2=KurV@i4SjV2(kkeG084G{D*ylh literal 0 HcmV?d00001 diff --git a/docs/meetup.slide b/docs/meetup.slide new file mode 100644 index 0000000..dbc57e6 --- /dev/null +++ b/docs/meetup.slide @@ -0,0 +1,74 @@ + +Sisyphus +How to store 50 000 mails in 10 MB to fight spammers +1 Mar 2018 +Tags: sisyphus, spam, junk, mail + +Carlo Strub +economist, gopher, rustacean, FreeBSD developer +cs@carlostrub.ch +cs@FreeBSD.org +https://carlostrub.ch +https://github.com/carlostrub + + +* Junk Mail + +What is it? + +- Mail we do not want to have in our mailbox. +- The same sender may send both wanted and unwanted mail. + +How to fight it? + +- Block lists, e.g. Spamhaus +- Sophisticated filters, e.g. SpamAssassin +- Greylisting, tarpit, and other exotic punishments + +* Sisyphus + +- requires zero configuration on either the server or the client +- works with any MTA and any client +- learns about your preferences based on all messages in your inbox and your junk folder +- can handle multiple mail accounts with independent junk mail preferences +- requires minimal resources, e.g. learning over 50 000 mails and keeping track of roughly 90 000 words requires only 10 MB of storage +- BSD licensed + +* How it works: Bayes' Rule + +.image bayes.png + +* It's all about counters + +- All needed probabilities can be calculated using counters +- But counters are costly in general (storage complexity proportional to the number of elements) +- What if we learn a mail twice?
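To illustrate the claim that counters are enough, here is a minimal Go sketch of Bayes' rule for a single word. It is not taken from the Sisyphus source: the `Counts` type, its field names, and the fallback value of 0.5 for unseen words are assumptions made for this example, and a real classifier would combine the evidence of many words rather than one.

```go
package main

import "fmt"

// Counts is a hypothetical container for the only state Bayes' rule needs:
// the number of mails learned per class, and the number of mails in each
// class that contained a given word.
type Counts struct {
	JunkMails, GoodMails   int
	WordInJunk, WordInGood map[string]int
}

// JunkProbability returns P(junk | word) computed from counters alone.
// It falls back to 0.5 (no information) when a class is empty or the
// word was never seen; that fallback is an assumption of this sketch.
func (c Counts) JunkProbability(word string) float64 {
	if c.JunkMails == 0 || c.GoodMails == 0 {
		return 0.5
	}
	total := float64(c.JunkMails + c.GoodMails)

	// Priors: share of each class among all learned mails.
	pJunk := float64(c.JunkMails) / total
	pGood := float64(c.GoodMails) / total

	// Likelihoods: share of mails in each class containing the word.
	pWordGivenJunk := float64(c.WordInJunk[word]) / float64(c.JunkMails)
	pWordGivenGood := float64(c.WordInGood[word]) / float64(c.GoodMails)

	// Bayes' rule: P(junk|word) = P(word|junk)P(junk) / P(word).
	evidence := pWordGivenJunk*pJunk + pWordGivenGood*pGood
	if evidence == 0 {
		return 0.5
	}
	return pWordGivenJunk * pJunk / evidence
}

func main() {
	c := Counts{
		JunkMails:  40,
		GoodMails:  60,
		WordInJunk: map[string]int{"lottery": 30},
		WordInGood: map[string]int{"lottery": 2},
	}
	fmt.Printf("P(junk | lottery) = %.4f\n", c.JunkProbability("lottery"))
	fmt.Printf("P(junk | unseen)  = %.4f\n", c.JunkProbability("unseen"))
}
```

Exact counters like the maps above grow with the number of distinct words, and learning the same mail twice would inflate every count. Those are the two costs the counters slide points at.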
+ +* HyperLogLog Algorithm + +- Hashes of a data stream have interesting properties regarding cardinality: + 1) the number of leading zeroes yields an estimate of a lower bound (bit-pattern observables) + 2) the smallest values yield an estimate of the cardinality (order statistics observables) +- Two consequences for Sisyphus: + 1) we can count all words in all mails in very little space + 2) we do not have to check whether we have already learned a mail + + +* Implementation +- Pure Go +- Database: bolt (stores sisyphus.db in Maildir) +- Learns all mails in Maildir +- Classifies new mail, triggered by FSNotify +- Dependencies: + github.com/boltdb/bolt + github.com/carlostrub/maildir + github.com/fsnotify/fsnotify + github.com/gonum/stat + github.com/kennygrant/sanitize + github.com/retailnext/hllpp + github.com/sirupsen/logrus + github.com/urfave/cli +- Principles: 12factor App, semantic versioning + +* API +.link https://godoc.org/github.com/carlostrub/sisyphus