Wednesday, May 09, 2012

SQL Challenge - Part 1 - How to Find Contact Email Addresses in Platypus That May Cause an SMTP 550 Error.

Occasionally, if I am very lucky, someone will confront me with what I call the "SQL Challenge".  Way back in the day of the early 2000's, one of my coworkers and I would try to come up with ways to stretch our SQL knowledge.  Thanks to the "SQL Challenge", I am now capable of some fun acrobatics using just a little SQL.

The rules are fairly simple, once you have the actual challenge.  The SQL must be encapsulated into a single SQL statment - be that select, delete, insert or update.  It can have all the joins, subclauses, derived tables, and whatever else SQL provides, as long as it is included in a single statement. If inserting, updating or deleting, you are allowed a separate SQL statement for each table, but that is the only exception to the one statement rule. So, no cursors or loops. You get a bonuses for using ANSI SQL syntax and for how quickly you finish.

Today, I got one of those challenges.  Let's start with the givens.

1.  We have one or more email addresses stored in customer.email.
2.  These emails can be comma delmited or semicolon delimited.
3.  There may be spaces embedded before or after commas/semicolons.
4.  The emails may be hosted internally or by a 3rd party provider, or a combination of the two.
5.  A list of domains is stored in domain_item.domain.  These domains are hosted internally.
6.  A list of email addresses is stored in email_data.emailaddr.  These emails are hosted internally.
7.  When sending email messages to the email addresses stored on customer.email, if one of the email addresses is on a hosted domain but the email address is not in the list of hosted emails, the SMTP server will return a 550 (mailbox not found) error.

Now, the challenge.  Given all of the above, we want to find a list of customers with email addresses on customer.email that will generate a 550.  This means we want to find a list of email addresses attached to a hosted domain but are not a hosted email address.  After a little less than an hour, I made this...


select
/* The unique customer id */
    customer.id,
/* The delimited list of contact email addresses */
    customer.email,
/* A hosted domain matching one of the email addresses */
    domain_item.domain,
/* The number of email addresses in customer.email */
    len(customer.email) - len(replace(customer.email, '@', '')) as email_count,
/* The number of emails that match the hosted domain */
(
    len(replace(';'+replace(customer.email,' ','')+';',',', ';'))
    - len(replace(replace(';'+replace(customer.email,' ','')+';',',', ';'), '@'+domain_item.domain+';',''))
) / len('@' + domain_item.domain + ';') as domain_count,
/* The number of hosted email addresses that match the hosted domain  */
(
    select count(*)
      from email_data
     where customer.id = email_data.d_custid
       and ';'+replace(replace(customer.email,' ',''),',',';')+';' like '%;'+email_data.emailaddr+';%'
       and email_data.emailaddr + ';' like '%@' + domain_item.domain + ';'
) as match_count
  from customer
 inner join domain_item
    on replace(';'+replace(customer.email,' ','')+';',',',';') like '%@'+domain_item.domain+';%'
/* Strip the hosted domain from the delimited list */
/* Comparing the difference divided by the length of the hosted domain name */
/* This will let us know how many emails _should_ be hosted for this domain */
 where
(
    len(replace(';'+replace(customer.email,' ','')+';',',',';'))
    - len(replace(replace(';'+replace(customer.email,' ','')+';',',',';'), '@'+domain_item.domain+';', ''))
) / len('@'+domain_item.domain+';')
/* Compare the _should_ total against the _actual_ total */
/* If they don't match exactly, include the customer in the result */
<> (
/* Find the number of hosted email addresses for this customer */
/* That match one of the email addresses on the delimited list */
/* This will let us know how many emails are _actually_ hosted for this domain */
    select count(*)
      from email_data
     where customer.id = email_data.d_custid
       and replace(';'+replace(customer.email,' ','')+';',',',';') like '%;'+email_data.emailaddr+';%'
       and email_data.emailaddr+';' like '%@'+domain_item.domain+';'
)
go