Like shale production, data science is challenged by extracting, refining, and controlling the input that makes it productive.
If data is the new oil, it’s a lot more like shale than fresh crude.
Here’s why: when someone compares data to oil — a conference-keynote favorite — it brings to mind the image of crude gushing out of a derrick. The reality of data science looks a lot more like the production of shale oil, which sits between layers of shale rock and impermeable mudstone and is obtainable only by fracking — fracturing the rock with pressurized liquid.
For all the transformative potential of industrial AI, most big data projects fail — just like ventures in the early days of fracking.
Estimates from Gartner place the failure rate of AI projects at 60 percent, with some sources like Pactera and Dimensional Research putting failure as high as 80 to 85 percent. Those estimates amount to $22 billion to $30 billion pumped into failed AI projects in 2019.
It shouldn’t come as a surprise that leaning into AI for digital transformation is neither easy nor inexpensive. What is striking is that the challenge often isn’t getting the analytics right; it’s the availability, quality, and management of the data itself. Like shale production, data science is challenged by extracting, refining, and controlling the input that makes it productive.
There are three main problems data science and shale production share.
1. Extraction Challenges
For decades, shale reserves in the United States proved so difficult to tap that many were left unrecovered. Heavy research funding from the Department of Energy (DoE) and Gas Technology Institute (formerly the Gas Research Institute) into fracking and horizontal drilling resulted in the development of commercial-grade technology, equipment, and machines. Only then, after decades of funding and testing, were operators able to tap into shale plays — locations known to have large shale reserves. It took these developments in drilling technology and infrastructure (first at known plays and then to others by way of geological survey) to grow extraction from these shale plays into nearly two-thirds of oil production and three-fourths of gas production in America.
Similarly, industrial information sits in hard-to-reach systems with wildly different conditions for extraction. Like geologic surveyors, data managers and product managers must identify powerful sources of data, whether they’re in current fields (like spreadsheets), yet-undiscovered fields, or previously known fields where technology now enables access (like the Internet of Things). Additionally, these sources of information maintain differently labeled and often partially overlapping records. Within single sources, time-series data have inconsistent timestamps, and values are frequently missing or filled with errors.
Variation in metadata and quality within and across sources makes extraction for value difficult to standardize. The result is that a typical data scientist spends 80 percent of her time data-wrangling: cleaning, structuring, and enriching raw data until it’s AI-ready. At many large corporations, data collection and storage are so mismanaged that data science sometimes isn’t even possible. Gartner estimates that lost productivity due to poor-quality data costs businesses 30 percent of their revenue on average.
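To make the wrangling problem concrete, here is a minimal sketch in Python of reconciling two sources that label the same fields differently, use different timestamp formats, overlap in part, and contain missing readings. Everything in it is an assumption made for illustration: the sample rows, the field names (tag, Asset ID, psi, Pressure), the two timestamp formats, and the choice to treat all times as UTC. Real industrial sources would need their own mappings and time zone handling.

```python
from datetime import datetime, timezone

# Hypothetical rows from two sources that describe the same assets differently.
HISTORIAN_ROWS = [
    {"tag": "PUMP-01", "ts": "2019-03-01 08:00:00", "psi": "101.5"},
    {"tag": "PUMP-01", "ts": "2019-03-01 09:00:00", "psi": ""},  # missing reading
    {"tag": "PUMP-02", "ts": "2019-03-01 08:00:00", "psi": "98.2"},
]
SPREADSHEET_ROWS = [
    # Overlaps with the first historian row (same asset, same hour).
    {"Asset ID": "pump-01", "Time": "03/01/2019 08:00", "Pressure": "101.5"},
    {"Asset ID": "pump-02", "Time": "03/01/2019 09:00", "Pressure": "n/a"},
]

# Assumed timestamp formats; extend this tuple as new sources appear.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M")


def parse_timestamp(text):
    """Try each known format and return a UTC-tagged datetime (UTC is assumed)."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def parse_reading(text):
    """Return a float, or None when the cell is blank or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def normalize(row, asset_key, time_key, value_key):
    """Map one source-specific row onto a shared schema."""
    return {
        "asset": row[asset_key].strip().upper(),
        "time": parse_timestamp(row[time_key]),
        "pressure_psi": parse_reading(row[value_key]),
    }


def merge_sources(*normalized_lists):
    """Deduplicate on (asset, time); a later source only fills a missing value."""
    merged = {}
    for rows in normalized_lists:
        for row in rows:
            key = (row["asset"], row["time"])
            existing = merged.get(key)
            if existing is None or existing["pressure_psi"] is None:
                merged[key] = row
    return [merged[key] for key in sorted(merged)]


historian = [normalize(r, "tag", "ts", "psi") for r in HISTORIAN_ROWS]
spreadsheet = [normalize(r, "Asset ID", "Time", "Pressure") for r in SPREADSHEET_ROWS]
clean = merge_sources(historian, spreadsheet)
missing = [r for r in clean if r["pressure_psi"] is None]
```

Even in this toy case, most of the code is schema mapping and error handling rather than analysis, which is the imbalance the 80 percent figure points to. The rows left in missing still need a decision (drop, impute, or go back to the source) before any model sees them.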
2. Difficulty of Ingesting & Refining Inputs for Value
Extracted oil is valuable as a commodity, but it is worth far more once refined into fuel, plastic, or fertilizer. Because of the lighter density and higher sulphur content of American shale oil, domestic refineries have had to adapt infrastructures traditionally suited to handling heavy crude. The global oil industry, too, has had to change, adopting go-to-market practices that process heavy oil to fully capture the value of available crude outside the United States. Both the existing refinery processes in the US and changing go-to-market practices worldwide have made refining shale more challenging: on average, it is three to four times more expensive than refining crude for commercial use.
Likewise, raw data is far less useful than the insights derived from it. An enterprise must refine raw data, creating valuable intellectual property through a proprietary process, subject matter expertise, analytics, software, and the ingestion of unlike datasets. Although the soaring volume, velocity, and variety of industrial data have been a boon to AI, industrial leaders admit they’re struggling to implement predictive analytics solutions that generate sustainable value. Senior executives often lack the analytical expertise to manage strategic capabilities offered by data insights.
AI initiatives lag when different areas of the company have varying access to the data needed to make strategic decisions. Once developed, AI presents a similar problem of refinement: turning a commodity into a saleable product for which there is not yet a mature market. This refinement for value must also be completed with an understanding of the ever-changing legal, regulatory, and contractual restrictions that bind the use of data.
3. Negative Externalities
Extracting shale has caused concern because it releases methane gas, can leach into underground reservoirs used for drinking water, and can trigger earthquakes.
Data science raises its own issues, such as protecting data privacy, maintaining ethical algorithms, and ensuring cybersecurity. (And those are just the concerns if data science builds models that work as intended.) A survey from the open-source data science platform Anaconda found that three-quarters of data scientists use open-source platforms, a third of whom say they don’t take deliberate measures to protect their work. When you’re relying on AI to help make decisions around infrastructure, a single malfunction, data leak, or poor decision can lead to massive ill effects.
And, at bare minimum, mismanaged data quality leads to waste: about $3.1 trillion per year in the United States according to IBM.
The Promise of Data Insights
None of these challenges are insurmountable. The emergence of shale production as a viable source of energy, notwithstanding its environmental and human health effects, holds promise for a data-driven economy with true grounding in insights from AI. Too often, though, these challenges remain blind spots for industrial leaders who fail to recognize that data integrity is holding back their AI initiatives.
Ingestion proves the greatest challenge in translating data into insight: turning readily available raw material into functional inputs and, finally, into commercially viable products. Only upon a foundation of clean, standardized, high-quality data will information empower leaders to survey the complete range of facts, ask the right questions, make decisions with total clarity, and consider the consequences of their decisions. And only then can AI fulfill its promise to the industrial world and its customers.
