From: spc (Sean 'Captain Napalm' Conner)
Date: 19:15 on 20 Mar 2006
Subject: The sorry state of i18n

This stems from spam, so right off the bat it's hateful. But other than the spam issue, I'm not sure where to place the rest of my hate since it crosses multiple programs across multiple platforms.

Now, you may be asking yourself, ``Self, what does i18n have to do with spam and software hate?'' Glad you asked (even if you didn't).

I work at a small web hosting company and even though we're small, we get an insane amount of spam through our network (then again, who doesn't?). We have a dedicated platform (read: commercial, proprietary and expensive) that does nothing but filter for spam---it is, in effect, a Spam Firewall. You point the MX record to this device and it'll scrub the incoming email---blocking mail from known spammers, letting through the rest but marking on the subject line emails that *may* be spam (and so far, it's never been wrong when it marks an email as spam).

So, a message that comes into the Spam Firewall as:

    Subject: Play longer! Increase your mortgate by 3 inches!

if not outright blocked, will be slightly modified to read:

    Subject: [SPAM] Play longer! Increase your mortgate by 3 inches!

I'm the system administrator for said small web hosting company, and as such, I have root's mail from each of our servers headed to my account. Which means I get a ton of email---log summaries, mail bounces, problem notifications, what have you. In order to keep from being inundated I've set up procmail to filter and file all my incoming email.

So, it was easy enough to set up the following rule in procmail:

    :0:
    * ^Subject: .*SPAM.*
    in-SPAM

Never mind the obscure syntax and the difficulty in actually scanning for a literal '['---this works well enough to send all spam-marked emails to the bit bucket.

But I noticed that not all marked spam was being caught. There I am, in mutt, and what do I see in my inbox?

    Subject: [SPAM] Play longer! Increase your mortgate by 3 inches!
That shouldn't be there.

Let me test something---I sent from my personal account an email to my work account with "[SPAM]" in the subject line, and lo, it ended up in 'in-SPAM' just like I told procmail to do. Yet I still get

    Subject: [SPAM] Play longer! Increase your mortgate by 3 inches!

What's going on?

Suspecting that somehow procmail wasn't seeing the actual subject line, I checked the incoming mail spool file directly and what do I see?

    Subject: =?ISO-8859-1?B?W1NQQU1dIA==?= =?ISO-8859-1?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?=

Aha! [1]

MIME crap! [2]

I18n crap! [3]

With varying degrees of support (or non-support in the case of procmail).

Okay, so where's the hate?

Let's see ... the Spam Firewall? Okay, it's nice that it can decode encoded header lines, but *why* oh *why* does it encode "[SPAM]" if the subject line is encoded? Obviously you can have portions of a header encoded and not all of it. I'm guessing the Spam Firewall vendor can't (or probably won't) fix this because the actual bit that does the rewriting of the subject line is probably some third party i18n library that the Spam Firewall uses and it's not cost effective to "fix" this particular problem, since for most people it's not a "problem" at all. Stupid.

Procmail? For not supporting i18n at all? Are there any regex engines out there that can deal with i18n? Does procmail need to be updated to support MIME? Hate.

Mutt? Well ... it supports MIME and i18n, but it masked this particular problem for a few days. It's tempting to rip out MIME support from mutt (since I can't stand MIME but that's an issue I have to deal with) but that would make it difficult to deal with the occasional attachment. Perhaps a toggle to flip MIME support on and off ... Aggravation.

Spam? Well, that's pure hate incarnate.

So I dutifully add:

    :0:
    * ^Subject: =\?ISO-8859-1\?B\?W1NQQU1dIA==\?=.*
    in-SPAM

to .procmailrc and get on with my life, until I start seeing

    Subject: [SPAM] Play longer! Increase your mortgate by 3 inches!

in the inbox yet again. What now?

    Subject: =?UTF-8?B?W1NQQU1dIA==?= =?UTF-8?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?=

Sigh.

-spc (Actually, I think it was originally encoded in WINDOWS-1251 which is a whole other form of hate ... )

[1] It's actually encoded in UTF-8 in this example---I don't have a full example in ISO-8859-1 but it's close enough to serve for an example.

[2] Mostly hateful, but I can see a use for it.

[3] Not crap at all, but I'm ranting here.
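
The two raw subjects quoted above differ only in their charset label, so a filter that decodes RFC 2047 encoded-words before matching sees the same text in both. A minimal sketch of that idea in Python, using the standard library's `email.header` module; the helper names `decoded_subject` and `is_marked_spam` are made up for illustration, and nothing here is something procmail itself does:

```python
from email.header import decode_header, make_header

def decoded_subject(raw_value):
    """Decode any RFC 2047 encoded-words in a header value to plain text."""
    return str(make_header(decode_header(raw_value)))

def is_marked_spam(raw_value):
    """True if the decoded subject carries the firewall's [SPAM] tag."""
    return "[SPAM]" in decoded_subject(raw_value)

# The two raw Subject values quoted in the message above.
ISO_SUBJECT = (
    "=?ISO-8859-1?B?W1NQQU1dIA==?= "
    "=?ISO-8859-1?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?="
)
UTF8_SUBJECT = (
    "=?UTF-8?B?W1NQQU1dIA==?= "
    "=?UTF-8?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?="
)

if __name__ == "__main__":
    for raw in (ISO_SUBJECT, UTF8_SUBJECT, "[SPAM] plain ASCII subject"):
        print(is_marked_spam(raw), repr(decoded_subject(raw)))
```

Matching on the decoded text means one rule covers the ISO-8859-1, UTF-8 and unencoded variants, instead of one rule per charset label.
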
From: A. Pagaltzis
Date: 06:44 on 21 Mar 2006
Subject: Re: The sorry state of i18n

* Sean Conner <spc@xxxxxx.xxx> [2006-03-20 20:45]:
>Procmail? For not supporting i18n at all? Are there any regex
>engines out there that can deal with i18n? Does procmail need
>to be updated to support MIME?

That's where all of the blame rests. Neither mutt's nor the spamwall's behaviour would be a problem if procmail behaved itself, would it?

Precedent: long RFC(2)822 headers get folded across several lines (lines should stay within 78 columns). Procmail sensibly undoes this wrapping before it matches the headers against the patterns you defined. It would be stupid if procmail made you account for the fact that the subject line could be broken at any point, now wouldn't it?

But it makes you account for the fact that headers may be encoded per RFC2047.

Stupid. Hateful.

>I'm guessing the Spam Firewall vendor can't (or probably won't)
>fix this because the actual bit that does the rewriting of the
>subject line is probably some third party i18n library that the
>Spam Firewall uses and it's not cost effective to "fix" this
>particular problem, since for most people it's not a "problem"
>at all.

Indeed. If, say, you used Thunderbird and set up a filter rule based on the subject, it would work regardless of whether the subject was RFC2047-encoded or not. This is just a matter of procmail being prehistoric.

I have forever meant to write my own mail filter script and retire that nasty paleolithic excuse for a hack...

Regards,
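
The preprocessing argued for above (unfold first, then RFC 2047-decode, and only then apply the pattern) can be sketched with Python's standard `email` package. Parsing with `email.policy.default` performs both steps when a header is read. The function name `choose_mailbox`, the sample message and the mailbox names are hypothetical; this illustrates the idea and is not a feature of procmail.

```python
import re
from email import message_from_string
from email.policy import default

SPAM_TAG = re.compile(r"\[SPAM\]")

def choose_mailbox(raw_message):
    """Return 'in-SPAM' if the unfolded, decoded Subject carries the tag."""
    msg = message_from_string(raw_message, policy=default)
    # With the default policy, the header comes back unfolded and with
    # RFC 2047 encoded-words already decoded to text.
    subject = str(msg["Subject"] or "")
    return "in-SPAM" if SPAM_TAG.search(subject) else "inbox"

# Hypothetical message: the Subject is both encoded and folded over two lines.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: =?UTF-8?B?W1NQQU1dIA==?=\n"
    " =?UTF-8?B?UGxheSBsb25nZXIh?=\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(choose_mailbox(SAMPLE))
```

The pattern is written against the text a human would see in a mail reader, which is the behaviour the Thunderbird comparison above takes for granted.
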
From: Peter da Silva
Date: 12:51 on 21 Mar 2006
Subject: Re: The sorry state of i18n

> I have forever meant to write my own mail filter script and
> retire that nasty paleolithic excuse for a hack...

Heh.

I wrote one, then replaced it with a script to turn my simpler and more understandable syntax into a procmailrc, after running into one too many variant mail tentacles.

But I'll support you to the fullest when you get a Tuit or two.
From: A. Pagaltzis
Date: 11:49 on 03 Apr 2006
Subject: Re: The sorry state of i18n

* Peter da Silva <peter@xxxxxxx.xxx> [2006-03-21 13:55]:
>> I have forever meant to write my own mail filter script and
>> retire that nasty paleolithic excuse for a hack...
>
> Heh.
>
> I wrote one, then replaced it with a script to turn my simpler
> and more understandable syntax into a procmailrc, after running
> into one too many variant mail tentacles.
>
> But I'll support you to the fullest when you get a Tuit or two.

Even after I tell you that I planned to have the filter be written in Perl? :-)

I'm just very tired of little languages. Dealing with the vagaries of mail should be feasible enough considering that the Email::* set of modules does a pretty good job at both being easy to employ (unlike the Mail::* legacy) and having comprehensive support for email specs.

So my thinking was to take a bunch of Email::* modules plus Mail::POP3Client and tie them together into some sort of DSL so that I could write filter scripts in Perl with the absolute minimal friction. Since the script would still be Perl, however, playing fancy tricks would be possible without pointlessly wasting effort: first to figure out the limitations of yet another arbitrarily limited crappy mini-language, then to find ways around them.

I mean, it was fun when I was an eager green nerd, but I've been around that block enough times at this point that it's lost its novelty. Just give me a real language and let me do what I want, thanks.

* Peter da Silva <peter@xxxxxxx.xxx> [2006-03-21 16:10]:
> I think a mail filter syntax really needs to be relational,
> with an escape to procedural code. That's where procmail is
> going in the right direction.

Yeah, my thoughts exactly.

> What I have is:
>
> [mailbox-name]
> Header: glob
> Header: glob
> Header: /pattern/
> # the following requires both headers
> Header: glob
> +Header: glob
>
> <body>glob
>
> ...
>
> Any match in a [mailbox] section goes to that mailbox.

Mmm, I was thinking about something along similar lines, only with a Real Syntax by virtue of it being a DSL.

Regards,
From: Phil Pennock
Date: 13:13 on 21 Mar 2006
Subject: Re: The sorry state of i18n

On 2006-03-21 at 07:44 +0100, A. Pagaltzis wrote:
> I have forever meant to write my own mail filter script and
> retire that nasty paleolithic excuse for a hack...

I'm actually not hating Sieve. It seems to work well, reasonably clean design if you remember that it's designed to be machine-editable too, so get used to stuff like:

    require ["fileinto", "envelope"];

    if envelope :is "from" [
            "owner-all1@xxxxxxx.xxx",
            "owner-all2@xxxxxxx.xxx",
            "allhands-sender-for-this-week@xxxxxxxxx.xxx.xxx"
    ] {
        fileinto "INBOX.site";
        stop;
    }

    if header :matches "Subject" [ "Foo *", "Bar *" ] {
        fileinto "INBOX.fred";
        stop;
    }

Nice enough RFC, plus RFCs and drafts for various "require" extensions. Supported by Cyrus IMAP (now there, I have some hate) and Exim.

Snippet from RFC 3028 follows.

Regards,
-Phil

    2.7.2. Comparisons Across Character Sets

    All Sieve scripts are represented in UTF-8, but messages may involve a number of character sets. In order for comparisons to work across character sets, implementations SHOULD implement the following behavior:

    Implementations decode header charsets to UTF-8. Two strings are considered equal if their UTF-8 representations are identical. Implementations should decode charsets represented in the forms specified by [MIME] for both message headers and bodies. Implementations must be capable of decoding US-ASCII, ISO-8859-1, the ASCII subset of ISO-8859-* character sets, and UTF-8.

    If implementations fail to support the above behavior, they MUST conform to the following:

    No two strings can be considered equal if one contains octets greater than 127.
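
The comparison rule in the quoted section 2.7.2 can be sketched in a few lines of Python. This is one reading of the RFC text and is not taken from any Sieve implementation: the function name `sieve_equal` is hypothetical, it works on already-extracted raw bytes plus a charset label, and it ignores Sieve comparators (such as case-insensitive matching) entirely.

```python
def sieve_equal(a, a_charset, b, b_charset):
    """Compare two raw byte strings the way section 2.7.2 describes."""
    try:
        # Preferred behaviour: bring both strings to UTF-8 and compare
        # the resulting octets.
        a_utf8 = a.decode(a_charset).encode("utf-8")
        b_utf8 = b.decode(b_charset).encode("utf-8")
        return a_utf8 == b_utf8
    except (LookupError, UnicodeDecodeError):
        # Fallback when a charset cannot be decoded: no two strings are
        # equal if one contains octets greater than 127.
        if any(octet > 127 for octet in a + b):
            return False
        return a == b

if __name__ == "__main__":
    # Same text in two charsets compares equal after conversion.
    print(sieve_equal(b"caf\xe9", "iso-8859-1", b"caf\xc3\xa9", "utf-8"))
    # Unknown charset with high octets: never equal under the fallback.
    print(sieve_equal(b"caf\xe9", "x-no-such-charset",
                      b"caf\xe9", "x-no-such-charset"))
```

The point of the fallback is that a filter which cannot decode a charset must refuse to guess, where procmail silently compares raw octets.
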
From: peter (Peter da Silva)
Date: 15:06 on 21 Mar 2006
Subject: Re: The sorry state of i18n

> I'm actually not hating Sieve.

The problem is that it's procedural.

I think a mail filter syntax really needs to be relational, with an escape to procedural code. That's where procmail is going in the right direction.

What I have is:

    [mailbox-name]
    Header: glob
    Header: glob
    Header: /pattern/
    # the following requires both headers
    Header: glob
    +Header: glob

    <body>glob

    ...

Any match in a [mailbox] section goes to that mailbox.
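
A rule file in the shape shown above could be interpreted roughly as follows. This is a guess at the semantics from the example alone. It assumes that each plain line is an alternative, that a `+` line must match together with the line before it, that globs must match the whole value, and that `/pattern/` is a regex searched anywhere in it. The function names, the sample rules and the `inbox` default are all hypothetical, and this Python sketch does not represent the script described in the message.

```python
import fnmatch
import re

def parse_rules(text):
    """Parse rule text into {mailbox: [group, ...]}.

    Each group is a list of (field, predicate) pairs that must all match.
    """
    mailboxes = {}
    groups = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            groups = mailboxes.setdefault(line[1:-1], [])
            continue
        if groups is None:
            raise ValueError("rule before any [mailbox] section: " + line)
        joined = line.startswith("+")
        if joined:
            line = line[1:]
        if line.startswith("<body>"):
            field, pattern = "<body>", line[len("<body>"):]
        else:
            field, _, pattern = line.partition(":")
        field, pattern = field.strip().lower(), pattern.strip()
        if len(pattern) > 1 and pattern[0] == "/" and pattern[-1] == "/":
            predicate = re.compile(pattern[1:-1], re.IGNORECASE).search
        else:
            glob = fnmatch.translate(pattern)  # anchored at the end
            predicate = re.compile(glob, re.IGNORECASE).match
        if joined:
            if not groups:
                raise ValueError("'+' line with nothing to join: " + line)
            groups[-1].append((field, predicate))
        else:
            groups.append([(field, predicate)])
    return mailboxes

def route(mailboxes, headers, body="", default="inbox"):
    """Return the first mailbox with a fully matching group."""
    values = {name.lower(): value for name, value in headers.items()}
    values["<body>"] = body
    for mailbox, groups in mailboxes.items():
        for group in groups:
            if all(f in values and test(values[f]) for f, test in group):
                return mailbox
    return default

RULES = """\
[in-SPAM]
Subject: *SPAM*

[lists]
List-Id: /hates-software/
# the following requires both headers
From: *@example.org
+Subject: *report*
<body>*unsubscribe*
"""

if __name__ == "__main__":
    rules = parse_rules(RULES)
    print(route(rules, {"Subject": "[SPAM] Play longer!"}))
    print(route(rules, {"From": "a@example.org", "Subject": "weekly report"}))
```

The relational part is the table of (mailbox, conditions) pairs. An escape to procedural code would slot in where `route` picks the predicate, which this sketch leaves out.
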
Generated at 10:25 on 16 Apr 2008 by mariachi