This article is part of a series of articles co-authored by Emmanuel Manceau (Quantmetry), Olivier Denti (Quantmetry) and Magali Barreau-Combeau (OVH) on the subject “Architecture – From BI to Big data”
When we talk about Big Data, the data lake is a key concept.
Often the process of setting up a data lake stops after simply copying the raw data into it. The common assumption is that everything is solved once the data is in the data lake. However, beyond the question of how the data lake is organized, having only the raw data is not enough, in our opinion. This post explains why. First, we will detail the different types of data structures found in the data lake and why they need reprocessing; then we will discuss how this data is used.
The different data models
The data lake is a reflection of the company's information system, and therefore of its history. A company established for many years will not have the same information system as a start-up. An old-fashioned V-cycle application will not store its data the same way as an application built with agile methods, and an application built in the 90s will not use the same type of storage as a recent one. Depending on the history and architecture of the company, the data lake will therefore contain data modeled in different ways, with different levels of maturity and consistency.
We will not talk here about unstructured data (images, voice or free text). This discussion therefore concerns structured data, whatever its storage medium.
Modeling in the traditional world
Structured data can be modeled in different ways depending on their source.
A transactional application requires a normalized model. The model is then optimized for massive operations:
- insert data
- delete data
- update data
Ideally, it corresponds to the third normal form (3NF). The data is normalized and there is no redundancy in the database. As a result, there are many tables and many joins between them. Reading the data is complex and time-consuming. Not everyone keeps the database's data model up to date, so the joins must be found manually. If naming rules that simplify this task have been applied systematically, it is possible, but still time-consuming.
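To illustrate why reading a normalized database takes work, here is a minimal sketch using a hypothetical 3NF schema (the table and column names are ours, not from any real application): every fact lives in exactly one table, so even a simple business question requires chaining several joins.

```python
import sqlite3

# Hypothetical 3NF schema: customers, products, orders and order lines
# each live in their own table, with no redundancy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, label TEXT, unit_price REAL);
CREATE TABLE "order"  (order_id    INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_line (order_id   INTEGER REFERENCES "order"(order_id),
                         product_id INTEGER REFERENCES product(product_id),
                         quantity   INTEGER);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO product VALUES (10, 'Widget', 2.5)")
conn.execute('INSERT INTO "order" VALUES (100, 1)')
conn.execute("INSERT INTO order_line VALUES (100, 10, 4)")

# "How much has each customer spent?" -- three joins for one number.
rows = conn.execute("""
SELECT c.name, SUM(ol.quantity * p.unit_price)
FROM customer c
JOIN "order" o     ON o.customer_id = c.customer_id
JOIN order_line ol ON ol.order_id   = o.order_id
JOIN product p     ON p.product_id  = ol.product_id
GROUP BY c.name
""").fetchall()
print(rows)  # [('Acme', 10.0)]
```

Without an up-to-date data model or systematic naming rules, each of those join keys has to be rediscovered by hand.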
There are always undocumented subtleties, and no database is perfect (the "ah, but that didn't work, so we applied a patch"). Depending on the time allocated to the teams, these patches persist and accumulate over time, making it more complex to extract value from the data.
A software package ("progiciel" in French) is a configurable or customizable application, found among others in the CRM or ERP world. The principle of a software package is that it can be customized according to needs.
Large packages often have a very large number of tables: the database contains all of them, not only those actually used. Only a small fraction of the tables will contain meaningful data. Siebel CRM, for example, contains no fewer than 5,000 tables, of which only a few dozen are used in a given implementation, depending on the features deployed.
Moreover, the level of abstraction can be very high and does not correspond to a "Merise" approach. This level of abstraction is necessary to allow the package to be configured: sometimes new business data can be added without changing the initial model.
Often, direct reading of the database is not supported by the vendor, for licensing as much as for performance reasons.
The business rules are complex and not all of them are formalized by a database field. Many rules are applied in the presentation layer, i.e. in real time on the user's screen.
It is impossible to guarantee a correct reading of the data without mastering the package.
If the database is a data warehouse or a reporting database, it can be denormalized.
In this case it follows neither the third normal form nor a star or snowflake model. In this type of model, redundancy is allowed, which limits the number of joins. The database is therefore easier to read.
Star schemas and snowflake schemas
Star or snowflake modeling is an evolution of denormalized modeling. This type of modeling consists of representing the equivalent of a cube in a relational database.
It is a model that meets the needs of decision-makers and has been designed to answer business questions, so it natively makes it easier to derive value from business data.
It consists of dimensions (the axes of analysis) and facts (the data to be analyzed).
On the other hand, depending on the scale of the project and its objectives, it will contain a greater or lesser proportion of the business data:
- if the goal was to allow ad-hoc reporting, its scope will be quite wide, since the objective was to allow any type of query
- if the model was built to serve reporting or dashboarding only, its scope will be narrower.
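As a concrete sketch of dimensions and facts, here is a hypothetical minimal star schema (the table names `fact_sales`, `dim_date` and `dim_store` are illustrative): a central fact table surrounded by denormalized dimension tables, each reachable in a single join.

```python
import sqlite3

# Minimal star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date  (date_id  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE fact_sales (date_id INTEGER, store_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                 [(1, 2023, 1), (2, 2023, 2)])
conn.executemany("INSERT INTO dim_store VALUES (?,?,?)",
                 [(1, 'Paris', 'IDF'), (2, 'Lyon', 'ARA')])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                 [(1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0)])

# A business question maps directly onto a dimension: sales per region.
rows = conn.execute("""
SELECT s.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_store s ON s.store_id = f.store_id
GROUP BY s.region
ORDER BY s.region
""").fetchall()
print(rows)  # [('ARA', 50.0), ('IDF', 175.0)]
```

Compare this to the normalized case: one join per axis of analysis, instead of a chain of joins to reconstruct the business view.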
We will come back to this type of modeling in more detail in the next article.
For those who want to go deeper, Ralph Kimball's The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling is a classic.
Not all meaningful information is stored
Regardless of the type of modeling, it is important to distinguish between meaningful information and stored information.
Meaningful information is not necessarily stored:
- because it can be recomputed in real time, and storing it did not seem worthwhile when the database was designed
- because the business rule is computed in the presentation layer
- because of the persistence of a manual treatment (“it’s a machine that knows …”)
- due to the persistence of shadow repositories (Excel files and personal Access databases) and side analyses.
The modeling stage is essential, but long and expensive. It makes it possible to identify business added value and, ideally, to improve and industrialize the business processes that produce the data.
Modeling in the world of big data
Can the datalake do without modeling?
The limitless storage capabilities of big data have made it possible to collect data from anywhere. With the storage media offered by Hadoop (HDFS), based on distributed file systems, the prior work of structuring the information has become unnecessary. These storage media (HDFS or NoSQL) are called schemaless.
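"Schemaless" in practice means schema-on-read: nothing is enforced at write time, and each consumer imposes structure when reading. A minimal sketch, with made-up records:

```python
import json

# Raw records land in the lake as-is: no schema is enforced at write
# time, so fields can be missing or extra from one record to the next.
raw_lines = [
    '{"id": 1, "name": "Alice", "city": "Paris"}',
    '{"id": 2, "name": "Bob"}',                                  # missing field
    '{"id": 3, "name": "Carol", "city": "Lyon", "vip": true}',   # extra field
]

def read_with_schema(lines, fields, default=None):
    """Schema-on-read: project each raw record onto the fields the
    consumer cares about, filling the inevitable gaps."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f, default) for f in fields)

rows = list(read_with_schema(raw_lines, ("id", "city")))
print(rows)  # [(1, 'Paris'), (2, None), (3, 'Lyon')]
```

The flexibility is real, but so is the cost: the schema has not disappeared, it has merely moved from the database into every consumer's reading code.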
Companies rushed to build countless data lakes and populate them with countless datasets. Young data scientists then looked at architects and database experts the way the new world looks at the old: no need for modeling, throw away your physical and conceptual data models (MPD, MCD), your Kimball models, your UML and your SQL! We are dealing with raw data.
Yet very quickly, our data lakes became data swamps, and data scientists began to spend a lot of time preparing, cleaning and aggregating data. This low-value work represents more than 50% of the total workload of a data science project. The part devoted to machine learning is so small that data scientists are frustrated at doing so little actual data science.
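The preparation work in question is rarely glamorous. A hedged sketch, with made-up records, of what absorbs so much of a data scientist's time on data-swamp extracts: deduplicating, coercing types, normalizing values.

```python
# Typical raw extract: duplicated rows, numbers stored as strings,
# placeholder values, inconsistent casing.
raw = [
    {"id": "1", "amount": "100.0", "country": "FR"},
    {"id": "1", "amount": "100.0", "country": "FR"},   # duplicate
    {"id": "2", "amount": "N/A",   "country": "fr"},   # bad value, bad case
]

def clean(records):
    """Deduplicate on id, coerce amounts to float, normalize country case."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        try:
            amount = float(r["amount"])
        except ValueError:
            amount = 0.0  # arbitrary policy: unparsable amounts become zero
        out.append({"id": int(r["id"]),
                    "amount": amount,
                    "country": r["country"].upper()})
    return out

print(clean(raw))
# [{'id': 1, 'amount': 100.0, 'country': 'FR'},
#  {'id': 2, 'amount': 0.0, 'country': 'FR'}]
```

None of this is machine learning, and every team consuming the same raw data rewrites some version of it.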
Modeling helps data understanding
Indeed, while it is no longer necessary to model in order to store the information, it is still necessary to model in order to understand the information being manipulated. Moreover, the proportion of useful information contained in the data is becoming ever more infinitesimal. To understand and locate useful information among terabytes of data, there is no alternative to modeling it. To draw a literary analogy: the information is present in the bitstream persisted on an e-reader, but without a metamodel it is impossible to grasp its meaning.
With this in mind, data architects have set up layered information structures and rediscovered the concepts of views, object models and star models. Data governance has become a priority again, and maintaining a catalog of well-documented data a prerogative of the data lab. The most advanced companies establish metamodels and ontologies to understand, interpret and navigate their data. Graph theory has come out of the closet and now underpins social-network data analysis.
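A data catalog need not be sophisticated to be useful. A deliberately minimal sketch (the dataset names and fields are invented for illustration): without even this much metadata, locating the useful fields among terabytes is guesswork.

```python
# Minimal data-catalog sketch: each dataset carries a description,
# an owner, and documented fields.
catalog = {
    "sales.fact_sales": {
        "description": "One row per sales transaction line",
        "owner": "finance-team",
        "fields": {"amount": "Net amount in EUR, VAT excluded"},
    },
    "crm.customer": {
        "description": "Active customers, deduplicated nightly",
        "owner": "crm-team",
        "fields": {"customer_id": "Stable surrogate key"},
    },
}

def search(term):
    """Return the dataset names whose description mentions `term`."""
    return [name for name, meta in catalog.items()
            if term.lower() in meta["description"].lower()]

print(search("customer"))  # ['crm.customer']
```

Real catalog tools add lineage, access control and automated harvesting, but the core value is the same: documented, searchable metadata.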
Finally, and this will be the subject of the next article, architects and data engineers have set up layered data lakes, capable of replacing decision-support warehouses and providing the same level of readability as the old systems.
Modeling at the service of efficiency
The data lake will be used by data scientists. But the work of a data scientist breaks down as follows:
- 80%: data preparation
- 20%: machine learning
What data do they use? Largely the same data as that used by Data Analytics teams.
But how does the work break down in Data Analytics?
- 80%: data preparation
- 20%: dashboarding and analysis
Modeling a single database that allows a simple reading of the company's data is therefore essential:
- so that Data Science and Data Analytics teams share the same basis for analysis
- to factorize work and improve efficiency.
It is illusory to think that raw data can be exploited without reworking the model. Modeling is performed a posteriori to interpret, analyze and produce aggregate views. Data remodeling depends on the use case; there is no single method. We must strive for a common model, but from a technological point of view, this same common model will be exploited in different ways.