Setting Your Data Scientist Free With Clean Data

Data scientists waste a lot of time preparing and cleaning data. But when you use ScienceLogic as your AIOps platform, cleaning data is already handled, so your data scientists can focus on analysis.

By Richard Chart, Co-Founder & Chief Scientist at ScienceLogic

The explosion of available data, coupled with the market pressures to rapidly understand and respond to the needs of your customers, puts data science into the spotlight as a way to gain competitive advantage through deep analytics.

And, when data science is in the spotlight, this puts data scientists in high demand. However, your data scientists have a dirty little secret: much of their time is not spent doing data science at all. Instead, they are performing the mundane tasks of gathering, organizing, and formatting the source data, all of which must be completed before they can get to the more interesting work of actual analysis. According to Gartner, data scientists spend 79% of their time collecting, cleaning, and organizing data.[1]

And ITOps data illustrates this challenge perfectly. Operational data is exposed by your IT systems through a myriad of methods. Unless your organization was recently “born-in-the-cloud”, you will have some technology that relies on traditional SNMP, some that uses proprietary APIs, and some that uses emerging standards such as OpenMetrics.

The ScienceLogic SL1 platform ingests that diverse data with its Swiss Army knife of collection methods, then applies structure and enforces consistent processing, irrespective of the source collection method or technology. With this approach, you can perform operations such as analyzing jitter across a wide range of device manufacturers without any further data alignment by you or your busy data scientists.
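To make that concrete, here is a minimal sketch of what that consistency buys an analyst. It is illustrative only: the flat rows, column names, and vendor entries below stand in for normalized platform output and are assumptions for the example, not an actual SL1 export format.

# Hypothetical illustration: once every collector feeds the same normalized
# schema, a cross-vendor jitter comparison is a one-liner, not an ETL project.
import pandas as pd

# Pretend these uniform rows came out of the platform's normalization tier,
# regardless of whether the source spoke SNMP, a proprietary API, or OpenMetrics.
rows = [
    {"vendor": "Cisco",   "device": "nexus-01",  "metric": "jitter_ms", "value": 2.1},
    {"vendor": "Juniper", "device": "mx-edge-3", "metric": "jitter_ms", "value": 3.4},
    {"vendor": "Arista",  "device": "spine-07",  "metric": "jitter_ms", "value": 1.8},
]
df = pd.DataFrame(rows)

# No per-vendor parsing or field alignment is needed before the analysis step.
print(df[df["metric"] == "jitter_ms"].groupby("vendor")["value"].describe())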

One thing that sets SL1 apart in the world of ITOps is the rich variety of techniques it offers for gathering the data you need. While other vendors offer a similar range of ingestion methods and target technologies, all too often their data unification is only skin deep. When products come together through acquisition (CA, Micro Focus), it can be a major challenge to unify underlying message structures, storage schemas, and APIs. As a result, the data lives on islands of storage, each with its own set of access methods and data structures.

For your data scientists tasked with gleaning correlated insights from across the islands, this is a major impediment. Before they can demonstrate their command of deep data analysis, they must grind through the mundane exercise of scripting extraction from each data source, with ETL (extract, transform, load) logic customized for each unique source schema. This is the grunt work that kills your data scientists’ productivity and consumes 40% or more of their time.[2]

Common Data Model

ScienceLogic works from a single code base, a rare advantage that lets SL1 unify your data for you in ways that many other platforms cannot. One way we take advantage of this is by applying data models after ingestion to unify data from different sources that relate to the same fundamental element.

Let’s take network utilization as a simple example. This is one of the most widely reported metrics in IT: whether a Solaris Oracle database server, a Cisco Nexus switch, or a Windows SharePoint server, all will report this information in one form or another, but both the form of the data and the method used to retrieve it, whether SSH, SNMP, or PowerShell, can vary greatly. By applying a common data model, SL1 can label all of it as inbound or outbound interface data, and any further processing or analysis from that point is consistent.
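To show the idea rather than the implementation (this sketch is not SL1’s internal code), a common data model amounts to mapping each source’s native counter names onto one canonical shape. The counter names below are typical of what SNMP, CLI, and PowerShell collection return; the mapping table and helper function are illustrative assumptions:

# Hypothetical raw samples, roughly as each collection method might return them.
snmp_sample = {"ifHCInOctets": 913442018, "ifHCOutOctets": 554210777}   # SNMP (IF-MIB counters)
ssh_sample  = {"rx_bytes": 120443811, "tx_bytes": 98220456}             # parsed from SSH/CLI output
pwsh_sample = {"BytesReceivedPersec": 80112, "BytesSentPersec": 61300}  # Windows PowerShell/WMI

# One canonical model: every source becomes inbound/outbound interface data.
FIELD_MAP = {
    "ifHCInOctets": "inbound",        "ifHCOutOctets": "outbound",
    "rx_bytes": "inbound",            "tx_bytes": "outbound",
    "BytesReceivedPersec": "inbound", "BytesSentPersec": "outbound",
}

def normalize(sample):
    """Relabel source-specific fields into the common model."""
    return {FIELD_MAP[k]: v for k, v in sample.items() if k in FIELD_MAP}

for raw in (snmp_sample, ssh_sample, pwsh_sample):
    print(normalize(raw))  # downstream code sees only {'inbound': ..., 'outbound': ...}

Everything downstream of normalize() can then treat a Nexus switch and a SharePoint server identically, which is the whole point of the model.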

Unified Access

In practice, the large variance in the shape of source data for AIOps means that even a highly unified platform such as SL1 needs to reflect some of that variance in how data is stored, in order to support efficient search and scalability. Your voluminous, semi-structured custom application logs have very different characteristics from time-series metric data, even though both may be describing network bandwidth. And both are very different again from change tracking in your system configurations. Some of those differences are necessarily reflected in how the data is stored and indexed.

Given these differences in underlying data forms, how do your data scientists get a unified view of your data lake? At ScienceLogic, we have adopted GraphQL as the most flexible technique to provide a common access tier across all underlying data. You can access the entirety of the SL1 data lake using GraphQL, along with all the methods you might need to transform or make policy changes to it.

Like other well-formed APIs, SL1’s GraphQL layer isolates your consumer applications from any changes made to the ScienceLogic internals. It also provides a single point of authentication and authorization, minimizing time spent on access management. GraphQL offers some unique benefits over REST, such as allowing you to specify exactly the data fields you want in response to a particular request, but this blog won’t delve further into the full benefits of GraphQL (start here if GraphQL is new to you and you want to learn more).
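As a quick, hedged sketch of that field-selection benefit, the snippet below posts a GraphQL query from Python and asks for exactly two fields per device. The endpoint URL, credentials, and query fields are placeholders chosen for illustration, not a definitive rendering of SL1’s schema:

# Placeholder illustration of querying a GraphQL endpoint from Python.
import requests

# Ask for exactly the fields you need; a REST endpoint would typically
# return its full fixed payload whether you wanted it all or not.
query = """
{
  devices(first: 5) {
    edges {
      node { id name }
    }
  }
}
"""

resp = requests.post(
    "https://sl1.example.com/gql",  # placeholder URL, not a documented endpoint
    json={"query": query},
    auth=("analyst", "secret"),     # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"])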

In Conclusion

There is a wide array of business-impacting insights waiting to be discovered in your operational data. When that data is fragmented into silos, each with its own schemas, access methods, and authentication controls, your analysts and data scientists will waste a large portion of their valuable time bringing it all together. ScienceLogic has invested heavily in providing that unification within the SL1 platform so you and your team can concentrate on using all that clean data to benefit your business.

Want to learn more about how SL1 can benefit your business? Read Forrester’s “The Total Economic Impact™ of ScienceLogic SL1 for NetDesign.”

[1] Gartner, “Deliver Cross-Domain Analysis and Visibility With AIOps and Digital Experience Monitoring,” Charley Rich and Padraig Byrne, 5 July 2018, ID G00352799.

[2] The 2018 Kaggle ML & DS Survey, with 23,859 respondents, indicated that data scientists spend 40% of their time gathering and cleaning data; other surveys have put this at 60% or higher. Big Data Borat put it best in his famous (at least among practitioners) tweet that data science is 99% preparation, 1% misinterpretation.

 
