4 years ago
About availability issues on 2020-09
What happened
The service interruptions of the last week, which lasted up to 10 minutes, have two causes:
- Due to the exponential growth of the application over the last few months, the engine that stores one of Integrates’s databases started to run out of resources. This degraded the response times and overall availability of the application.
- Integrates currently runs in our Kubernetes cluster. While developing an API for it, we created a very memory-intensive function. Calls that we made to this function during testing caused the cluster nodes to run out of memory and stop working.
What we’ve done
To fix (1), we:
- Deployed new infrastructure sized for the application’s present and future needs, thus guaranteeing optimal performance and availability.
To fix (2), we:
- Created Integrates replicas on all the cluster nodes, so we can guarantee high redundancy if a node becomes unavailable.
- Established resource consumption limits for Integrates, so we can ensure the stability of the infrastructure that runs it.
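
As an illustration only, the two measures above can be sketched in Kubernetes terms. The sketch assumes a DaemonSet (one pod per node) as the way to place a replica on every node, and container-level resource requests and limits as the consumption limits; the actual objects used in the cluster may differ. The image name and the memory and CPU values are hypothetical placeholders.

```python
import json


def build_manifest(image: str, memory: str, cpu: str) -> dict:
    """Build a DaemonSet manifest: one pod per node, with resource limits.

    All argument values are supplied by the caller; nothing here reflects
    the real Integrates configuration.
    """
    labels = {"app": "integrates"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",  # schedules one pod on every eligible node
        "metadata": {"name": "integrates", "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "integrates",
                            "image": image,
                            "resources": {
                                # requests: what the scheduler reserves on the node
                                "requests": {"memory": memory, "cpu": cpu},
                                # limits: a container exceeding its memory limit
                                # is killed instead of starving the node
                                "limits": {"memory": memory, "cpu": cpu},
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical image and limit values, for illustration only.
manifest = build_manifest("example.invalid/integrates:latest", "2Gi", "1")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML manifests. With a memory limit in place, a runaway function only gets its own container terminated, and the replicas on the other nodes keep serving requests.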
What’s the impact
Integrates was not accessible during the service interruptions, preventing anyone who uses the platform from doing any work on it.
What we are doing to help
We will keep closely monitoring all our infrastructure in order to guarantee optimal uptime and recovery times.