This page is updated manually with the status of current and recent (30ish days) events.
(Times are US/Arizona, UTC-7.)
Current status is: Green: I am completely operational, and all my circuits are functioning perfectly normally.
20170326 @6:17PM – It looks like the Updraft2 plugin on a client site freaked out and started looping backups, consuming all available disk write I/O and nearly filling the disk by the time the system went non-responsive. Our best guess is that the underlying ZFS file system started throttling disk writes, which made the system appear to hang. Please DO NOT run backups during high-traffic times. If you must run them, do so after 11PM (US) Eastern / 8PM Pacific. ~M
20170326 @4:28PM – A Phoenix webserver went completely unresponsive and dropped off the network. We had to hard reboot it. Not sure why. (A client using an overly aggressive backup plugin might have contributed.) (Total slow/down time was 16 minutes.)
20170320 @11:15PM – Trying to add a caching drive to speed up disk access caused a Phoenix server to crash. Downtime was 12 minutes. (Figured out what happened – the OS confused itself and renamed the drive, then freaked out. We hard-coded the name, so this should not happen again.)
20170319 @11:37PM – Sneaking another couple of servers in for a quick reboot. Downtime was 3 minutes.
20170319 @11:15PM – We did a quick reboot of two web hosting servers to install some security updates. Downtime was 3 minutes. (Love that!)
20170317 @6:40AM – One of the Phoenix webservers got confused and rebooted. We had hoped to reboot it during our maintenance window last week, but ran out of time. It didn’t want to wait until the 23rd. Downtime was about 2 minutes.
20170316 @ 11PM – The new phone server is online.
20170314 @ 2PM – Our VoIP (phone) software was incompatible with the new servers, so our phones are temporarily offline. We get notified on failed calls, so we will be watching for alerts between now and when we get the new VoIP server up. We’re going to streamline the phone tree to have just one TS voice extension/mailbox and then individual employee extensions.
20170311 @ 12:29AM – One of the Phoenix web servers crashed during backups. Trying to figure out why. (Was probably a network driver bug.)
20170310 @12:48PM – WHPHX13 was just hit by a pretty intense Denial of Service attack from a small botnet. They were trying to brute-force their way in with a variety of WP exploits that were, thankfully, already patched. (This is why updates are so important.)
20170310 @2:30AM – All services except for our billing portal are online – it needs a license reset, which can be done from the office. Calling this a success.
20170310 @1:15AM – Mail Server is online after the move, but not all services are working. We’re ahead of schedule… Hope our luck holds.
20170309 @9:17PM – Quebec1 is back up. It decided that it was jealous of the attention to the Phoenix servers. It just hung hard. It took a while (30 min) to do its disk checks, but all sites are up and happy. On the plus side, we won’t need to reboot it during the next reboot cycle – middle of March.
20170305 @ 11:17PM – Backups on one server are slowing it down. Given it’s Sunday night, we’re watching it, but letting it run its course. (Backup is done now – 11:31PM)
Green: I am completely operational, and all my circuits are functioning perfectly normally.
AMBER: External network issues.
RED: Zombie Apocalypse
Magenta: A service is down, but it’s not really an emergency.