Fighting Image Spam With FuzzyOCR And SpamAssassin On Ubuntu 9.10

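This tutorial explains how to scan emails for image spam with FuzzyOCR on an Ubuntu 9.10 server. FuzzyOCR is a plugin for SpamAssassin that targets unsolicited bulk mail which uses images as its main content carrier. Using several methods, it analyzes the content and properties of images to distinguish normal mail (ham) from spam. FuzzyOCR tries to keep the system load low by scanning only mails that SpamAssassin has not already classified as spam, which avoids unnecessary work.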

I do not issue any guarantee that this will work for you!

1 Preliminary Note

In this article I use Ubuntu 9.10 as the base system.

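I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory is different (for example, with ISPConfig 2 installed the directory is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), this is no problem. I will point out where you have to change what.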

Please make sure that your SpamAssassin version works with FuzzyOCR. For example, the FuzzyOCR version that I install here (fuzzyocr-3.5.1) requires SpamAssassin 3.1.4 or newer.
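The version requirement can be checked from the command line. The following is a minimal sketch, not part of the original setup: it assumes GNU `sort` with the `-V` (version sort) option and assumes that the first line of `spamassassin --version` has the form "SpamAssassin version X.Y.Z". The function name `version_ge` is made up for this illustration.

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum SpamAssassin version named above for fuzzyocr-3.5.1.
required="3.1.4"

# Only run the check if spamassassin is in the PATH.
if command -v spamassassin >/dev/null 2>&1; then
    # Assumption: the first output line reads "SpamAssassin version X.Y.Z",
    # so the version number is the third field.
    installed=$(spamassassin --version | awk 'NR == 1 { print $3 }')
    if version_ge "$installed" "$required"; then
        echo "SpamAssassin $installed is new enough"
    else
        echo "SpamAssassin $installed is older than $required"
    fi
fi
```

If spamassassin is not in your PATH, replace the command name with its full path.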

2 Installing FuzzyOCR

FuzzyOCR can be installed as follows:

aptitude install fuzzyocr netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr

This places the FuzzyOCR configuration files in the /etc/mail/spamassassin/ directory.

If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, copy the FuzzyOCR configuration files to that directory like this:

cp /etc/mail/spamassassin/FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/

FuzzyOCR is now installed. Next, we have to configure it.
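Before configuring, it can help to confirm that the helper programs pulled in by the packages above are really on the PATH. This is only a sketch: the function name `missing_helpers` is invented here, and the list of program names is an assumed selection of binaries shipped by those packages (for example `convert` from imagemagick and `pnmnorm` from netpbm); adjust it to your system.

```shell
#!/bin/sh
# missing_helpers NAME...: print each NAME that cannot be found in PATH.
missing_helpers() {
    for helper in "$@"; do
        command -v "$helper" >/dev/null 2>&1 || echo "$helper"
    done
}

# Assumed list of helper binaries; not taken from the FuzzyOCR documentation.
missing=$(missing_helpers gifsicle gocr ocrad tesseract convert pnmnorm pnminvert ppmtopgm)

if [ -n "$missing" ]; then
    echo "Not found in PATH:"
    echo "$missing"
else
    echo "All helper programs found"
fi
```

A missing program usually means the corresponding package did not install cleanly, so rerunning the aptitude command is a reasonable first step.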

3 Configuring FuzzyOCR

FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. Almost everything in that file is commented out. We open the file now and make some modifications:

vi /etc/mail/spamassassin/FuzzyOcr.cf

Put the following line into it to define the location of FuzzyOCR's spam word list:

[...]
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
[...]

/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that comes with FuzzyOCR. You can adjust it to your needs if you like.

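Next, change the helper section. The first listing below shows the section as shipped; the second shows it after the change, with the original lines commented out and a single combined focr_bin_helper line added: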

[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert,  ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
focr_bin_helper tesseract
[...]

[...]
# Include additional scanner/preprocessor commands here:
#
#focr_bin_helper pnmnorm, pnminvert,  ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
#focr_bin_helper tesseract
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract
[...]

Finally, add or enable the following lines:

[...]
# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
[...]

The last four lines enable image hashing. This is what the FuzzyOCR developers say about image hashing:

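"The Image Hashing database feature allows the plugin to store a vector of image features to a database, so it knows this image when it arrives a second time (and therefore does not need to scan it again). The special thing about this function is that it also recognizes the image again if it was changed slightly (which is done by spammers)."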

If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that one. Also make sure that every path inside the configuration file uses the correct directory (i.e. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin).
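Wrong paths in FuzzyOcr.cf are easy to overlook, especially with a non-default directory. The sketch below lists files that the configuration refers to but that do not exist. It rests on a few assumptions: the function name `missing_focr_files` is invented for this example, only the word list, preprocessor and scanset settings are checked (the hash database files are left out on the assumption that they are created on first use), and each setting is expected in the form "name path" on one line.

```shell
#!/bin/sh
# missing_focr_files CF: print every file named by the wordlist, preprocessor
# and scanset settings in CF that does not exist on disk.
# Commented-out lines are ignored because their first field starts with "#".
missing_focr_files() {
    awk '$1 ~ /^focr_(global_wordlist|preprocessor_file|scanset_file)$/ { print $2 }' "$1" |
    while read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}

# Adjust this path if your SpamAssassin directory is different.
cf="/etc/mail/spamassassin/FuzzyOcr.cf"

if [ -r "$cf" ]; then
    missing_focr_files "$cf"
    # --lint makes SpamAssassin check its configuration for syntax errors.
    if command -v spamassassin >/dev/null 2>&1; then
        spamassassin --lint || echo "spamassassin --lint reported problems"
    fi
fi
```

No output from the function means all three files were found.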

That's it for the FuzzyOCR configuration. Now let's see whether it works as expected.

4 Testing FuzzyOCR

FuzzyOCR comes with sample image spam mails (in the /usr/share/doc/fuzzyocr/examples/ directory):

ls -l /usr/share/doc/fuzzyocr/examples/

The output should look like this:

total 156
-rw-r--r-- 1 root root 13633 2008-09-25 22:47 ocr-animated.eml
-rw-r--r-- 1 root root 16108 2008-09-25 22:47 ocr-gif.eml
-rw-r--r-- 1 root root 27506 2008-09-25 22:47 ocr-jpg.eml
-rw-r--r-- 1 root root 27842 2008-09-25 22:47 ocr-multi.eml
-rw-r--r-- 1 root root 24657 2008-09-25 22:47 ocr-obfuscated.eml
-rw-r--r-- 1 root root 18236 2008-09-25 22:47 ocr-png.eml
-rw-r--r-- 1 root root 16113 2008-09-25 22:47 ocr-wrongext.eml
-rw-r--r-- 1 root root  3576 2008-09-25 22:47 README

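We can now feed these emails to SpamAssassin to see whether FuzzyOCR is correctly hooked into SpamAssassin. First find out where your spamassassin executable is. Usually it is in your PATH, which you can check by running

which spamassassin

If this prints a result, spamassassin is in your PATH and you do not need to specify its full path to run it.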

If you do not know where spamassassin is, you can find it by running

updatedb
locate spamassassin

If you use ISPConfig 2, spamassassin is located at /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin

Now that you know where spamassassin is, you can feed one of the sample spam mails to it as follows:

/path/to/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

for example

/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

or, if spamassassin is in your PATH:

spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

You should now see a lot of output, and the end of it should look like this:

[...]
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.

As you can see, /usr/share/doc/fuzzyocr/examples/ocr-gif.eml has been classified as spam with a score of 15, so FuzzyOCR is working.
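To try all sample mails instead of just one, the same command can be wrapped in a loop that pulls the verdict line out of the debug output. This is a sketch under stated assumptions: the function name `focr_score` and the `SPAMASSASSIN` environment variable are invented for this example, and the pattern only matches the "Message is spam, score = ..." line shown in the log above. A mail whose image is already stored in the hash database may be reported with a different log line and would then show up without a score here.

```shell
#!/bin/sh
# focr_score: read debug output on stdin and print the score from the last
# "FuzzyOcr: Message is spam, score = N" line, or nothing if there is none.
focr_score() {
    sed -n 's/.*FuzzyOcr: Message is spam, score = \([0-9.]*\).*/\1/p' | tail -n 1
}

# Hypothetical override: set SPAMASSASSIN to the full path if it is not in PATH.
sa=${SPAMASSASSIN:-spamassassin}
dir=/usr/share/doc/fuzzyocr/examples

if command -v "$sa" >/dev/null 2>&1; then
    for mail in "$dir"/*.eml; do
        [ -e "$mail" ] || continue
        # Debug messages go to stderr: send stderr into the pipe and
        # discard the rewritten message on stdout.
        score=$("$sa" --debug FuzzyOcr < "$mail" 2>&1 >/dev/null | focr_score)
        echo "$(basename "$mail"): ${score:-no FuzzyOcr spam verdict}"
    done
fi
```

With ISPConfig 2 you would run it with SPAMASSASSIN set to /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin.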

Your SpamAssassin is now able to recognize image spam, thanks to FuzzyOCR.