DETAILS At 6:40 PM MDT, users began seeing 504 errors when trying to access courses and pages within Canvas. The cause was a job server that had become unresponsive to requests. Job servers process server requests, including course migration files, imports and exports, and gradebook downloads. These requests began to queue when the job server became unresponsive, which resulted in the errors users were seeing. Our automated monitoring alerted the DevOps team to the issue, and they quickly identified the unresponsive job server as the culprit. They restarted the job server at 6:53 PM MDT. After the restart finished, we verified that the timeout errors were no longer occurring. Processing of the queued job requests resumed after the restart, which caused some additional slow page loads until the queued jobs had been processed. The incident was resolved and Canvas was functioning normally at 7:03 PM MDT.