Status

This page is updated manually with the status of current and recent (roughly the last 30 days) events.

(All times are US/Arizona, UTC-7.)

Current status is: Green: I am completely operational, and all my circuits are functioning perfectly normally.

20180125 @9:46AM – Got an explanation for the Sunday networking incident: a new client of theirs was using a fiber optic transceiver that was mostly compatible but generated subtle errors that used up all of the switch’s CPU power. They swapped it for a known-compatible one, and the interface came back up.

20180116 @1:50PM – PNAP’s network came back up. We’re awaiting details from them on the incident.

20180116 @1:45PM – One of our ISPs (PNAP) is having internal network issues. All sites except those using Cloudflare have IP addresses with both of our ISPs, so they are not significantly affected. Clients using Cloudflare can have only one IP address associated with their site (this is set in the Cloudflare control panel), so if a site is set to use a PNAP IP address, that site is down.
For clients whose Cloudflare credentials we have, we will begin logging in to their Cloudflare accounts and changing the IP address to one from our other provider.
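
For anyone who would rather script that IP change than click through the control panel, here is a minimal sketch in Python. It is not what we run: it assumes Cloudflare's v4 "update DNS record" endpoint and bearer-token authentication as we understand them, so check the current Cloudflare API documentation before relying on it. The function name, the zone ID, the record ID, the hostname, and the IP address are all hypothetical placeholders. The sketch only builds the request and does not send it.

```python
import ipaddress
import json
import urllib.request

# Assumed base URL of the Cloudflare v4 API; verify against current docs.
API_BASE = "https://api.cloudflare.com/client/v4"


def build_failover_request(zone_id, record_id, hostname, new_ip, api_token):
    """Build (but do not send) a request that repoints one A record.

    All arguments are placeholders supplied by the caller; nothing here
    is looked up automatically.
    """
    # Raises ValueError if new_ip is not a well-formed IPv4 address.
    ipaddress.IPv4Address(new_ip)

    body = json.dumps(
        {"type": "A", "name": hostname, "content": new_ip}
    ).encode("utf-8")

    return urllib.request.Request(
        url=f"{API_BASE}/zones/{zone_id}/dns_records/{record_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical values; 198.51.100.10 is a documentation-only address.
    req = build_failover_request(
        "ZONE_ID", "RECORD_ID", "www.example.com", "198.51.100.10", "API_TOKEN"
    )
    print(req.get_method(), req.full_url)
    # Sending is deliberately left out: urllib.request.urlopen(req)
    # would apply the change to the live record.
```

Keeping the request builder separate from the send step makes it easy to review exactly what would change before touching a client's live DNS.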

20180106 @11PM – The email server (our sole legacy server) did not reboot correctly (it hadn’t been rebooted in 262 days) and had to be restored from backups. We took advantage of the downtime to do a full reinstall of the server on new hardware, a couple of weeks earlier than planned. It now matches our other servers, and is both faster and more resilient for future reboots.

20180106 @3PM – We’ll be rolling out some reboot-needed updates to deal with the Intel “Meltdown” CPU bug. Downtime should be <15 minutes on average with the mail server likely taking longer. The mail server hasn’t been rebooted since March, and will likely be down for 45 minutes to an hour as it does its disk checks.
We expect to need at least one more reboot cycle for this over the next few weeks.

20171216 @12:48PM – Moved the disks from the failed server to our spare server, and after going through the normal disk checks, all sites came up fine. (This is why we keep a spare server on hand.)

20171216 @12:40PM – One of our core servers had a total hardware failure (blown motherboard). We’re moving disks to a replacement. (ETA 10 minutes)

20171210 – One Phoenix webserver (whphx5) is on a server that just suffered a disk failure. We’re doing an emergency out-of-band backup and will then replace the disk. (Doing it in this order lessens the risk of losing changes made since the last backup if additional disks fail.)


Green: I am completely operational, and all my circuits are functioning perfectly normally.

Amber: External network issues.

Red: Zombie Apocalypse.

Magenta: A service is down, but it’s not really an emergency.