The Technical Building Blocks of a Modern AIOps Platform

The promise of AIOps for service providers and enterprises is to enable the collection, processing, and storage of data at scale, and to derive actionable insights that help drive automation.

Narayan Partangel, Vice President, Engineering

Businesses face a serious data challenge today: collecting, managing, and making sense of the huge volume and variety of data flowing into their IT universe from heterogeneous and disparate sources. I believe AIOps is the solution, with the ScienceLogic SL1 platform enabling that journey.

The paradigm has shifted from reactive to predictive and actionable, and problem-solving has taken the form of self-healing. This shift has put additional demands on all aspects of our platform, and these new challenges have led us to reimagine the way we think about product architecture. Not only do we need to collect, process, and store data at scale, we also need to deploy seamlessly on cloud platforms as well as on premises.

Building the Foundation

Our journey started with the establishment of a microservices-based platform, which involved creating a container orchestration layer, an automated deployment framework, and an artifact versioning and promotion process. Most modern software companies go through a cycle of evolving their platform in response to changing market needs, increasing scale requirements, and the need for ease and velocity in new and continuous development.

As software platforms continue to evolve at a rapid pace, there are certain core principles that remain constant. It is important to ensure that the foundation upon which newer and customer-facing features are built is durable over the longer term. This was one of our guiding tenets as we set out to address this challenge.

The Four Steps of Creating a Microservice

For ScienceLogic, our vision of AIOps provided the impetus to embark on this transformative journey. The creation of a microservice (whether new or refactored from an existing feature) involves a four-step process: code isolation, containerization, refactor for service code, and horizontal scalability.

Step 1: Code isolation involves:

  • Designing each feature as a service
  • Moving code to its own repository
  • Publishing it as an artifact to Artifactory
  • Executing its own set of unit and functional tests
  • Defining the version and dependencies.
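The last two bullets can be illustrated by declaring a service's version and dependencies as data and checking them before publishing. This is only a sketch under stated assumptions: the `ServiceManifest` class, the service names, and the version numbers are hypothetical and are not taken from SL1.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '1.4.2' into (1, 4, 2) for comparison."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class ServiceManifest:
    """Hypothetical per-repository manifest: name, version, minimum dependency versions."""
    name: str
    version: str
    requires: dict[str, str] = field(default_factory=dict)

    def unmet(self, available: dict[str, str]) -> list[str]:
        """Return the dependencies that are missing or older than required."""
        problems = []
        for dep, minimum in self.requires.items():
            found = available.get(dep)
            if found is None or parse_version(found) < parse_version(minimum):
                problems.append(dep)
        return sorted(problems)


# Illustrative values only.
manifest = ServiceManifest(
    name="event-enrichment",
    version="1.2.0",
    requires={"message-bus-client": "2.0.0", "metrics-lib": "0.9.1"},
)
published = {"message-bus-client": "2.1.3", "metrics-lib": "0.9.0"}
print(manifest.unmet(published))
```

Comparing versions as integer tuples rather than strings avoids the classic mistake of ranking "1.10.0" below "1.9.9".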

Step 2: Containerization requires each service to be migrated to a container, with all of its unit and functional tests executed inside the Docker container.

Step 3: Refactor for service code involves:

  • Following design patterns created for microservices
  • Ensuring services “own” their data
  • Replacing any direct database calls with data access objects (DAOs)
  • Adding metrics for measuring service performance.
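A minimal sketch of what Step 3 can look like in practice: the service reaches its data only through a DAO, and the DAO records simple performance metrics for each call. The `DeviceDAO` and `CallMetrics` names, the table layout, and the use of an in-memory SQLite database are all assumptions made for illustration. The article does not describe SL1's actual data layer at this level.

```python
import sqlite3
import time
from collections import defaultdict


class CallMetrics:
    """Collects call counts and cumulative latency per operation name."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def record(self, operation, elapsed):
        self.calls[operation] += 1
        self.seconds[operation] += elapsed


class DeviceDAO:
    """Hypothetical DAO: the only place in the service that issues SQL."""

    def __init__(self, connection, metrics):
        self._conn = connection
        self._metrics = metrics
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS devices "
            "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def _timed(self, operation, sql, params=()):
        # Time every statement so the service can report its own performance.
        start = time.perf_counter()
        try:
            return self._conn.execute(sql, params)
        finally:
            self._metrics.record(operation, time.perf_counter() - start)

    def add(self, name):
        cursor = self._timed("add", "INSERT INTO devices (name) VALUES (?)", (name,))
        self._conn.commit()
        return cursor.lastrowid

    def get_name(self, device_id):
        row = self._timed(
            "get_name", "SELECT name FROM devices WHERE id = ?", (device_id,)
        ).fetchone()
        return row[0] if row else None


metrics = CallMetrics()
dao = DeviceDAO(sqlite3.connect(":memory:"), metrics)
device_id = dao.add("core-router-01")
print(dao.get_name(device_id), dict(metrics.calls))
```

Because callers never see SQL, the storage behind the DAO can change without touching the rest of the service, which is what lets a service "own" its data.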

Step 4: Horizontal scalability ensures that all services are reliable, can scale on-demand, and produce the desired performance levels.
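The article does not say how scale-out decisions are made. Since the platform runs on Kubernetes, one plausible reference point is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule with a simple clamp. The function name, the replica limits, and the sample numbers are illustrative assumptions, and the real autoscaler also applies a tolerance band and stabilization that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Illustrative numbers: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```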

For SL1, the proliferation of these services has been supported by the introduction of a highly scalable message bus infrastructure (Kafka) and a NoSQL data store (Scylla) for storing high-volume performance data.

The philosophy for this conversion promoted a two-pronged approach:

  • All new features will follow the service pattern described above.
  • For the existing code base, we carefully evaluated which features would benefit the most from this conversion.

The Container Orchestration Process

Containers provide a remarkable amount of flexibility: they package an application into a lightweight unit that can be ported across multiple environments, leverages the underlying OS, and scales on demand. However, an orchestration layer for deploying, managing, and scaling containers is needed. This is where the power of Kubernetes comes in. Having chosen Kubernetes (k8s) as our container orchestration solution, the team evaluated a number of tools to simplify the deployment of the k8s cluster. Starting with a basic Kubernetes deployment, we migrated to RKE (Rancher Kubernetes Engine) because it has minimal requirements to run (SSH and Docker) and can create clusters with a single command.

As SL1 evolves, some services need to be stateful, which requires incorporating persistent storage (Persistent Volumes) into the deployment. The next step in this evolution is the adoption of K9s, which provides a CLI-based administration option that is easier to incorporate into an automation framework. We have also introduced lightweight Kubernetes tools like Minikube and k3s into our workflows for rapid prototyping and demos.
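To make the persistent storage point concrete, a stateful service typically requests storage through a PersistentVolumeClaim. The sketch below builds such a manifest as a plain dictionary and prints it as JSON, a format `kubectl apply -f` accepts alongside YAML. The claim name, the size, and the helper function are hypothetical, while the manifest fields themselves are standard Kubernetes ones.

```python
import json


def persistent_volume_claim(name, size, storage_class=None):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    spec = {
        # ReadWriteOnce: the volume is mounted read-write by a single node.
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


# Illustrative name and size only.
claim = persistent_volume_claim("metrics-store-data", "20Gi")
print(json.dumps(claim, indent=2))
```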

We have fully embraced Kubernetes as an organization and continue to evolve and adopt best practices that are in line with the k8s community. The adoption and progress of the microservices pattern require a very iterative approach and an open and collaborative team culture that is not afraid to experiment with new ways of software development and is willing to get trained on new technologies.

Next Step: Deployment

With the foundational pieces in place, our focus moved to creating a seamless deployment mechanism. We started by creating an end-to-end deployment automation workflow using Ansible, AWX, RKE (Rancher Kubernetes Engine), and Helm. We leveraged AWS managed services like EKS, Aurora, RDS, and CloudFormation to drastically reduce operational overhead. The entire deployment can now be accomplished in a matter of minutes. The DevOps team was then able to use the same framework to deploy to other environments, such as VMs, for on-premises clients and internal customers.
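One way to picture such a workflow is as an ordered list of commands that an automation tool runs in sequence. The sketch below only builds and prints that list; it does not execute anything. The file names, release name, chart path, and namespace are hypothetical, and the ordering is an assumption rather than ScienceLogic's actual pipeline. AWX, which would schedule and trigger the playbook, is not modeled.

```python
import shlex


def deployment_plan(inventory, playbook, cluster_config, release, chart, namespace):
    """Return the ordered commands for one deployment, without running them."""
    return [
        # 1. Prepare the target hosts with an Ansible playbook.
        ["ansible-playbook", "-i", inventory, playbook],
        # 2. Create or reconcile the Kubernetes cluster with RKE.
        ["rke", "up", "--config", cluster_config],
        # 3. Install or upgrade the application chart with Helm.
        ["helm", "upgrade", "--install", release, chart,
         "--namespace", namespace, "--create-namespace"],
    ]


# Illustrative arguments only.
plan = deployment_plan(
    inventory="inventory.ini",
    playbook="prepare-hosts.yml",
    cluster_config="cluster.yml",
    release="my-service",
    chart="./charts/my-service",
    namespace="platform",
)
for command in plan:
    print(shlex.join(command))
```

Keeping the plan as data makes it easy to reuse the same steps for a different target, for example by swapping the inventory and cluster configuration for an on-premises environment.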

The Artifact Journey

The next step in this process was the creation of an artifact repository for versioning and promotion. The binary repository holds the full history of published artifacts, including their publishing metadata. At ScienceLogic, there are two Artifactory repositories to which an artifact can be promoted: staging and distribution. The staging repository contains artifacts that are in a pre-production state. The distribution repository contains artifacts used in production environments. Customers can only access and download artifacts in the distribution repository. Consequently, artifacts first have to "travel" through the different repositories before they can be consumed by customers. Appropriate quality control goes hand in hand with the promotion process.
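The promotion flow described above can be modeled in a few lines. This is a simplified sketch, not Artifactory's API: the `Artifact` class, the `checks_passed` flag standing in for quality control, and the artifact names are all hypothetical.

```python
from dataclasses import dataclass, field

STAGING = "staging"
DISTRIBUTION = "distribution"


@dataclass
class Artifact:
    name: str
    version: str
    repository: str = STAGING
    checks_passed: bool = False  # stand-in for the quality-control gate
    history: list = field(default_factory=list)


def promote(artifact):
    """Move an artifact from staging to distribution if quality checks passed."""
    if artifact.repository != STAGING:
        raise ValueError(f"{artifact.name} is not in staging")
    if not artifact.checks_passed:
        raise PermissionError(
            f"{artifact.name} {artifact.version} has not passed quality checks"
        )
    artifact.history.append((STAGING, DISTRIBUTION))
    artifact.repository = DISTRIBUTION
    return artifact


def customer_visible(artifacts):
    """Customers only see what is in the distribution repository."""
    return [a.name for a in artifacts if a.repository == DISTRIBUTION]


# Illustrative artifacts only.
collector = Artifact("collector", "3.1.0", checks_passed=True)
publisher = Artifact("publisher", "0.4.0")  # still pre-production
promote(collector)
print(customer_visible([collector, publisher]))
```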

Lessons Learned

The transformative journey that we embarked on has been both challenging and rewarding. There are three main takeaways from this endeavor: embracing the power of iteration, choosing the right technology for the job, and making a cultural shift.

Any technological innovation requires a rigorous process of iteration and experimentation, an exhaustive and detailed analysis to pick the right tool for the problem, and a team culture that constantly looks for new ways to solve problems. This transformation has set ScienceLogic up to implement the next generation of services around Behavioral Correlation, which leverages AI/ML-driven anomaly detection, and a Publisher that is designed to send large amounts of data to third-party platforms.


 


 
