I spent roughly 6 hours Sunday trying to fix the wrong bug. Long story long:
Due to a problem with the timezone on one of our database servers, the last two time changes caused several issues. Among them, a message from our mail software claiming we were trying to schedule a mailing for a time in the past. Even though once a mail is scheduled, the mail system simply queries the database and sends all mailings that have a scheduled time in the past. Evidently the initial creator of our mail system wrote this check into the scheduler to ensure he wouldn’t encounter any issues if he ever got around to inventing a time machine. On Sunday, the day we set our clocks ahead, when the error popped up again, it was obvious that the fix we made 6 months ago didn’t work.
5:30 – phone call from C trying to send a mailing reporting the error
5:30-6:00 – spent time verifying that every single machine I could think of that possible touched this process had the proper time set (they did)
6:00-6:30 – pulled in a systems engineer to verify the same thing I just did (and he did)
6:30 – process restarted since all the times had been verified with the ‘notify_delay’ time set to 61 minutes to account for the hour difference
6:32 – C notices there is a configuration variable (mail_on_reprocess) that must be set to allow a re-processing to send out the mailing. We decide to wait for the current process to finish, then rerun with the proper configuration.
7:15 – re-processing finishes, C notices there is a mailing scheduled despite the bad config – oh well, we’ll take it. C resets the configuration to defaults to avoid future problems.
7:35 – mailing completes, C notices the mail has sent to 0 recipients instead of the 4200 it should have
8:15 – after scanning the code, I notice that a mailing is created even in the event the mail_on_reprocess variable isn’t true, it just sends it to 0 recipients. Fuck.
8:30 – C sets the re-processing variable and re-runs
9:15 – re-run complete, we see the timing error again. WTF?
9:15-9:45 – Rethink my strategy, find that the code generating the schedule date gets the time from the database server, not the local machine unlike our other mailing code that was working fine all day. Obviously, I was right in assuming the database was off even though it reports back ok. Must be a timezone translation thing that happens in the Perl DBI module.
9:45 – re-run, set notify_delay to 61 minutes, mail_on_reprocess to true
9:45 – 10:30 – drink, shower
10:30 – failure again, bad config, unrelated issue
10:30 – 11:57 – re-run twice more. C has been awake for 21 straight hours at this point.
11:57 – 11:59 – check mailing is scheduled and sent to proper number of recipients
Turns out the timing problem had nothing to do with the time change, and the fact that it generates the same error we see whenever the clocks change is a mere coincidence. It had to do with the fact the schedule date is calculated BEFORE the list of recipients is generated, which took 40 minutes. With the delay set to 30 minutes, we’ve moved past the schedule date and only Doc Brown can get us back.
Coincidences:
- this process results in the same error we see whenever the clocks change
- the clocks changed that day
- unlike all of the other code that sends mail, which was working, this method got the time from the database server and NOT the machine’s local time, fueling the idea that the database server’s time was the culprit
- this was the first time since the code’s release 6 months ago that the recipient creation time ever took longer than the notify delay