"What SRE wouldn’t want to use Zebrium? It finds the root cause of our problems automatically."
Personal.ai brings the creative potential of AI to individuals, connecting the dots across a user’s various communication channels and interests to build a personalized module that each user can use to create, collaborate on, and monetize their work. Personal.ai runs a highly customized, modern, cloud-native software stack that supports multiple interfaces, including a web interface and a desktop app. The architecture is microservices-based, with more than 45 microservices, and the team iterates rapidly, deploying updates anywhere from 5 to 30 times each day.
Challenge
After launch, Personal.ai saw a surge in signups, and its small engineering team struggled to keep up with scaling and bug discovery. The existing solution (a self-hosted Elastic Stack for Observability, with dashboards tracking various metrics, traces, and logs, plus PagerDuty for incident management) made it very painful to troubleshoot issues in the highly personalized pipelines, which could produce user-specific bugs. The team had to drill down into approximately 40 dashboards, then look for errors in logs, and manually correlate all the pieces.
Multiple P1 issues arose daily, and troubleshooting delays meant rolling back an entire deployment rather than deploying a targeted fix. Some issues took hours or days to debug — significantly impacting the pace of new development.
“After trying this, I can say only one thing – what SRE wouldn’t want to use Zebrium? It finds the root cause of our problems automatically, and the integration with Kibana is beautiful.”
Solution
Personal.ai’s engineering team leader, Bala Sista, signed up for a free trial of Root Cause as a Service (RCaaS) and immediately saw its promise. Zebrium’s root cause reports do all the discovery and correlation automatically. And its integration with the Elastic Stack means Personal.ai can see the details of the root cause found by Zebrium right in the context of Kibana dashboards, making it significantly easier to line up the root cause with other symptoms and information to get a full picture of the problem.
Impact
The number of P1 issues has dropped, and the time to root cause has been reduced by 60%, freeing up many hours of engineering time. Even better, Zebrium’s targeted root cause details enable fast, precise fixes that can be rolled out quickly instead of rolling back entire deployments.
The net result of using Zebrium RCaaS is an improved customer experience, less wasted engineering time, and faster software cycles.
What’s Next
The Personal.ai engineering team plans to expand into using GPU resources and to use Zebrium to help identify and root-cause any issues that arise there. They also intend to enable the Zebrium integration with PagerDuty to automate the entire root cause workflow.