Becoming a data-driven enterprise: From Data Warehouse to the Data Lake (Part 2 of 3)

Photo by Quang Nguyen Vinh

This is the second article of three that discusses what a data strategy is, where companies have gone wrong building data platforms, and the solution to becoming a data-driven company. In the last article, I concluded that:

The purpose of developing a data strategy for an enterprise is to make data-driven business decisions. Your data strategy should clarify the purpose of your company’s data, provide a business focus, and enable the organization to compete by driving innovation and growth in your market.

The data strategy should contain rules for confidentiality, integrity, and availability (CIA), and specify which data should be collected, analyzed, and used in the end. A mature strategy should contain processes for collecting, analyzing, and turning data into business action.

In this second article of the three-part series, we move on to the implementation of modern data platforms, and why they don't deliver on their promise to take the business to the next level: being data-driven.

In a data age a long, long time ago…

In 1997, I was a research scientist and recent graduate from Penn State starting a new job. The company that hired me was Bellcore, a research and development arm of the regional phone companies ('Baby Bells') that had been separated out from the legendary Bell Labs in 1984. I found myself working alongside researchers and inventors who had laid the groundwork for the Internet as we know it in 2023.

My first project was to add a new database for private payphone operators to a system we called CMDS, or Centralized Message Distribution System. CMDS was a clearinghouse for carrier access billing for specific services. Back in the 1990s most customers didn't own a mobile phone. Instead, people had phone cards that let you use someone else's phone, or a payphone, and bill the charges back to your own number. Convenient when calling home from an airport out-of-state.

The next task was to commercialize the CMDS platform. Our Baby Bell owners wanted the local exchange carriers (LECs) to pay for the system on a per-call basis. As the project lead, I started reading the well-documented system to understand the guts of the service.

The challenge was that a single phone call would generate dozens of messages between local operators and long-distance phone companies. Even a call that isn’t completed generates multiple message records. So how do you go about creating a billable service?

The good part was that we owned and controlled the raw data in the system. The challenge was the complexity of the records, understanding the metadata, and ultimately, gaining complete knowledge of the mechanics of a long-distance call so that we could bill each call (and there were millions per day) correctly and to the satisfaction of the carriers that were our stakeholders.

The result was a function that would pull the designated records, rate the call, assign cost, aggregate and bill the LEC whose customer had made the call. We then charged the LECs a per-transaction fee for handling their calls. The anatomy of the project was such that:

a. The team was empowered to build any service we needed to accomplish a given business objective.

b. The team had full control of the domain and owned the data in the CMDS platform.

c. Data and service rating were running on the same infrastructure, with real-time compute instructions embedded in the platform.

We created a service whose output let you know who called whom, between which locations, and at what time. Bellcore controlled the data and could dedupe records to ensure that the service could be billed without any further processing or cleaning. The team only used this function to bill the carriers involved, but anyone at Bellcore could have used the service to analyze traffic patterns, usage, and pricing factors as they pleased.
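The billing flow described above can be sketched in a few lines. This is a hypothetical reconstruction, not the actual CMDS code: the record fields, carrier names, and the flat per-minute rate are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical sketch of the billing flow: pull call records, drop duplicate
# messages and incomplete calls, rate each call, and aggregate charges per LEC.
# Field names and the flat per-minute rate are invented for the example.

RATE_CENTS_PER_MINUTE = 5  # assumed flat rate, in cents

def rate_call(record):
    """Assign a cost (in cents) to a single completed call record."""
    return record["duration_min"] * RATE_CENTS_PER_MINUTE

def bill_carriers(records):
    """Dedupe by call id, rate completed calls, and total charges per LEC."""
    seen = set()
    totals = defaultdict(int)
    for rec in records:
        if rec["call_id"] in seen or not rec["completed"]:
            continue  # skip duplicate messages and calls that never completed
        seen.add(rec["call_id"])
        totals[rec["lec"]] += rate_call(rec)
    return dict(totals)

records = [
    {"call_id": "A1", "lec": "Bell Atlantic", "duration_min": 10, "completed": True},
    {"call_id": "A1", "lec": "Bell Atlantic", "duration_min": 10, "completed": True},  # duplicate message
    {"call_id": "B2", "lec": "NYNEX", "duration_min": 0, "completed": False},          # never completed
    {"call_id": "C3", "lec": "NYNEX", "duration_min": 6, "completed": True},
]
print(bill_carriers(records))  # {'Bell Atlantic': 50, 'NYNEX': 30}
```

The deduplication step matters because, as noted above, a single call generates dozens of messages; only one of them should ever be rated.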

We were working with a large data set building a product for billing, but we weren't familiar with notions like 'business intelligence.' We were working with tools to ingest, store, manage, process and analyze data, but we didn't have concepts like 'data platform,' though the term had been around for years.

We did the jobs of data engineers and data scientists, but those job titles didn't exist yet. Our work was unstructured and intuitive, but others in the industry had long been working on systems for analyzing data and making decisions based on that data.

Business Intelligence (BI): from data warehouse to data lakes

The history of BI systems goes back to the 1950s. In 1958, IBM researcher Hans Peter Luhn wrote an article called 'A Business Intelligence System' in which he described a system for "selective dissemination" of documents to "action points" based on the "interest profiles" of the individual action points.

IBM dominated the market for enterprise computing in the 1960s and the company was a leading driver in the development of analytics tools. The main concepts were built around the idea that what the enterprise needed was a single, centralized repository for large amounts of data. From this common repository you could present a consistent view of the company's data.

Today’s data warehouse architecture is influenced by the thinking of that era that entailed moving data from operational to analytics systems:

“Data is:

  • Extracted from many operational databases and sources
  • Transformed into a universal schema—represented in a multidimensional and time-variant tabular format
  • Loaded into the warehouse tables
  • Accessed through SQL-like queries
  • Mainly serving data analysts for reporting and analytical visualization use cases”

Source: Dehghani, Zhamak. Data Mesh (p. 221). O’Reilly Media. Kindle Edition.
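The quoted extract-transform-load pattern can be made concrete with a toy example. This is a minimal sketch, using SQLite as a stand-in warehouse; the source rows, table name, and schema are invented for illustration.

```python
import sqlite3

# Minimal ETL sketch of the warehouse pattern quoted above.
# Source rows, schema, and the fact table name are invented for the example.

# Extract: rows pulled from an operational source (here, an in-memory list).
operational_rows = [
    {"order_id": 1, "amount": "19.99", "ts": "2023-03-01"},
    {"order_id": 2, "amount": "5.00",  "ts": "2023-03-02"},
]

# Transform: cast into a universal, tabular, time-variant schema.
transformed = [(r["order_id"], float(r["amount"]), r["ts"]) for r in operational_rows]

# Load: insert into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

# Access: analysts query the warehouse through SQL.
total, = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()
print(round(total, 2))  # 24.99
```

Even this toy version shows where the real cost hides: the transform step must reconcile every source's types and conventions into one universal schema, which is exactly the work that, at enterprise scale, lands on a central data team.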

This architecture can be visualized as follows:

There are many challenges with traditional data warehouse architecture. One common mistake is to underestimate the ETL (extract, transform, load) process. There are tools available, but with thousands of integrations, the larger enterprise will find that data ingestion is left to a central data team.

This team owns the data but is not the user of the data. The domain knowledge of the data sits in the business (the end user), but they don’t have the expertise to collect or consume the data available to them in the data marts prepared by the central data team.

The result is that business analysts typically get static reports, generated at a given interval, that fulfil a specific purpose. What is missing is the dynamic view of the complete business and the ability to discover things in your data that wouldn't surface in your standard reports. This changed in 2010 when Pentaho (now part of Hitachi) presented what they called a new 'big data architecture.'

The Data lake

The new concept was described as a lake with limitless opportunities:

‘If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.’


The concept of a data lake is that you have a repository of unchanged data available for analysts and machine learning tools. The lake contains large amounts of structured, unstructured and raw data in their original formats.
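The "unchanged data in original formats" idea is often called schema-on-read: structure is applied by the consumer, not at ingestion. A minimal sketch, with invented paths and event fields:

```python
import json
import pathlib
import tempfile

# Illustrative sketch of the lake idea: raw events are landed unchanged in
# their original format, and consumers apply structure only when reading
# ("schema on read"). Paths and event fields are invented for the example.

lake = pathlib.Path(tempfile.mkdtemp()) / "raw" / "events"
lake.mkdir(parents=True)

# Ingest: append source records as-is, with no upfront cleansing or schema.
events = [
    {"type": "click", "page": "/home"},
    {"type": "error", "code": 500},
]
(lake / "2023-03-01.jsonl").write_text("\n".join(json.dumps(e) for e in events))

# Consume: an analyst shapes and filters the raw records at read time.
rows = [json.loads(line) for line in (lake / "2023-03-01.jsonl").read_text().splitlines()]
errors = [r for r in rows if r["type"] == "error"]
print(len(errors))  # 1
```

Note that the two events don't even share the same fields; the lake accepts them anyway. That flexibility is the appeal, and, as the next section argues, also the source of the governance problems.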

As the concept of data lakes gained popularity in the 2010s, 'big data' became the catchphrase (nearly 7 billion results on Google as of March 2023). But as we started to implement the concepts, it became clear that data lakes had a set of challenges:

a. Data governance to ensure accurate and consistent data is rigorous and expensive.

b. Data security is complex and still not completely resolved, making lakes a big target for hackers.

c. Access control and data masking are proving difficult, which means that much of the data is never used.

d. Costs are very high, as cloud storage and tool licenses are expensive and data tends to be retained for too long.
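To make challenge (c) concrete, here is a minimal sketch of one masking control: replacing a sensitive identifier with a stable pseudonym before analysts can query the data. The record layout and salt are invented; real platforms use policy engines and proper key management rather than anything this simple.

```python
import hashlib

# Minimal data-masking sketch: pseudonymize a sensitive column with a salted
# hash so joins still work, and redact a column analysts don't need at all.
# Record layout and salt are invented; never hardcode secrets in practice.

SALT = b"example-salt"  # illustrative only

def mask(value: str) -> str:
    """Replace an identifier with a stable 12-char pseudonym (salted SHA-256)."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

record = {"customer_id": "C-1001", "phone": "555-0100", "spend": 42.5}
masked = {**record, "customer_id": mask(record["customer_id"]), "phone": "***"}
print(masked["phone"], masked["spend"])  # *** 42.5
```

Because the pseudonym is stable, analysts can still join and count by customer without ever seeing the real identifier; the hard part at lake scale is deciding, per dataset and per user, which columns need this treatment.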

What comes next

Tracing back to the 1990s, how did we go from creating BI tools that sped up data analysis to data lakes that have become unmanageable?

The motivation now is the same as it was a generation ago. The promise of the potential to gain new insights and improve decision-making is too tempting not to pursue. The amount of data we generate has exploded and both machine learning and artificial intelligence tools have evolved to take advantage.

The history of modern computing is a pendulum swinging between centralized and decentralized approaches. In the third and final article I discuss the inevitable reaction to big data and data lakes. Can the concept of microservices be adapted for data analytics?



Dehghani, Zhamak. Data Mesh (p. 221). O’Reilly Media. Kindle Edition.

Dixon, James. "Pentaho, Hadoop, and Data Lakes." James Dixon's Blog, 2010.