EU region only
Today our search platform suffered an outage while we were carrying out a routine maintenance task.
We had a complete restart of our search cluster scheduled for this morning, which is normally straightforward and does not cause any issues. This time, because we had changed our backup location since the last full restart, the system did not come back up: it still expected the old backup volume (which is no longer in use) to exist on the system. We have run this routine successfully many times in our test environments, but backup is one of the few areas where production differs from test. We will look into changing this so that our tests reflect production more accurately in the future. After an amazing effort from our engineers, we got the cluster back up and working for most customers by 10:36 (UTC), but manually going through all nodes in our clusters to make sure the backup configuration is updated and working as expected is inevitably taking some time.
This does not mean that backups stopped working in the meantime; it only means that our system did not tolerate a missing volume still being referenced in the cluster configuration.
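As an illustration only, a pre-restart check that would catch this kind of stale reference could look like the sketch below. It assumes the configured backup locations are available as a plain list of directory paths; the function names are hypothetical and do not describe our actual tooling or any particular search engine's API.

```python
import os
import tempfile


def find_missing_backup_paths(configured_paths):
    """Return the configured backup paths that do not exist as directories."""
    return [path for path in configured_paths if not os.path.isdir(path)]


def preflight_ok(configured_paths):
    """Report every missing path and return True only if none are missing."""
    missing = find_missing_backup_paths(configured_paths)
    for path in missing:
        print(f"backup path missing: {path}")
    return not missing


if __name__ == "__main__":
    # Hypothetical setup: one volume that exists and one that was retired.
    with tempfile.TemporaryDirectory() as current_volume:
        old_volume = os.path.join(current_volume, "old-backup-volume")  # never created
        print(preflight_ok([current_volume]))
        print(preflight_ok([current_volume, old_volume]))
```

Running a check like this on every node before a full restart turns a failed startup into an early, readable warning.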
We sincerely apologize for any issues this may have caused you or your users.
10:20 - 10:28 (UTC)
10:31 - 10:36 (UTC)
10:54 - 11:20 (UTC)
Silas Hansen - CTO