This article is part of a series co-authored by Emmanuel Manceau (Quantmetry), Olivier Denti (Quantmetry) and Magali Barreau-Combeau (OVH) on the subject "Architecture – From BI to Big Data".
Who uses data from the data lake?
There are four large populations of users with data needs (leaving aside the IT actors who process the data to make it available to those users).
The Data Analyst evaluates data through analytical and logical reasoning: data mining, data crunching and data visualization (DataViz), either to extract a data set (e.g. customers to target in a commercial campaign based on certain criteria) or to explain a fact (e.g. why are June sales weak?).
These profiles are trained to use relational and decision-support databases, and to build data-mining and reporting queries. They can rely on data-manipulation tools and create analysis dimensions to facilitate data exploration.
The Data Scientist is a new profession that appeared this decade with the emergence of Big Data. In addition to the Data Analyst's own activities, the Data Scientist generates new data from Machine Learning models: producing a customer risk score, predicting sales for the next three months, evaluating customer satisfaction from their emails, etc.
The Data Scientist knows all the information-persistence techniques (relational databases, HDFS, NoSQL) and masters the techniques for querying and extracting information. SQL has made a comeback since the Hadoop ecosystem adopted SQL-like engines such as Hive or SparkSQL. The Data Scientist also knows programming languages (Python with Pandas, Java or Scala for manipulating Spark dataframes) used to transform data (usually in the form of matrices).
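As an illustration, here is a minimal Pandas sketch of the kind of transformation a Data Scientist performs before feeding a Machine Learning model; the dataset, column names and aggregations are hypothetical:

```python
import pandas as pd

# Hypothetical raw purchase records; values are illustrative only.
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [120.0, 80.0, 45.0, 300.0, 60.0, 90.0],
})

# Turn raw rows into one feature row per customer (total spend and
# number of purchases) -- the kind of matrix later fed to a model.
features = (
    sales.groupby("customer_id")
         .agg(total_spend=("amount", "sum"),
              n_purchases=("amount", "count"))
         .reset_index()
)
print(features)
```

The same `groupby`/`agg` pattern exists almost verbatim on Spark dataframes in Scala or PySpark, which is what makes these skills transferable across the stack.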
The Business Analyst is an operational member of a business line who takes over part of the Data Analyst's work for the day-to-day exploitation of data and to satisfy the recurring demands of their business.
These are users with an advanced (both technical and functional) use of the data. They can adjust pre-prepared queries or even build their own SQL queries (ad-hoc querying), build a dashboard or report from already-prepared data, manipulate data in a spreadsheet, create processing steps (macros or DataViz tools) and make all of this available to their own business line.
End users are "consumers" of the data made available to them; their interactions with the data are limited to DataViz applications or to the already-prepared data sets provided to them.
Given the characteristics and skills expected of each of these populations, we can evaluate, for each role, how it uses the data-access functions.
How to access data?
Not all users, therefore, query the Big Data store directly: it may be subject to operational constraints (processing of very large data volumes that must finish within a given time), technical constraints (a single query can take a relatively long time), or contain data whose unit information value is very low.
The information is thus organized in successive stages that consolidate and simplify access to it, from the rawest to the most consolidated, as described in the following diagram.
Construction of the data access stages
The lowest stage is the Big Data layer, which can accommodate any type of data (structured data, images, text, sounds); this information is then synthesized, simplified and enriched as you move up the stages. Its business value increases as the level of detail decreases. At the lowest level, one can imagine finding all the till receipts of a supermarket chain; at the highest level, the annual turnover.
Access to raw data
Access to the raw data is mainly reserved for the Data Scientist, who sets up the data-acquisition chain (real-time events, files, etc.) and its quality controls (file format, number of columns/lines, etc.).
Note that image-processing or natural-language-processing techniques are used to extract structured information from the semi-structured or unstructured sources (images, text, sound) of the data lake. This can include identifying words, recognizing shapes in images, or even detecting particular sounds. The objective here is to extract useful information from the raw material. For example, in order to interpret a level of customer satisfaction, comments are classified into categories (satisfied / not satisfied), which are then integrated as structured information.
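To make the idea concrete, here is a deliberately simplistic, rule-based sketch of turning free-text comments into a structured category; a real project would use an NLP library or a trained model, and the keyword list and labels below are invented for illustration:

```python
# Hypothetical negative-signal keywords; a real system would use a
# trained sentiment model rather than keyword matching.
NEGATIVE_WORDS = {"broken", "late", "refund", "disappointed", "bad"}

def satisfaction_category(comment: str) -> str:
    """Map a free-text customer comment to a structured category."""
    words = set(comment.lower().split())
    return "not satisfied" if words & NEGATIVE_WORDS else "satisfied"

print(satisfaction_category("Delivery was late and the box was broken"))
print(satisfaction_category("Great product, fast shipping"))
```

The output of such a step (a category per comment) is what then gets stored as structured information alongside the raw text.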
Access to transformed data
The transformed data are derived from the raw data, to which a series of specific processing steps (a pipeline) is applied to clean, enrich or even aggregate them. To do this, the Data Engineer works with a business referent (who knows the data produced within their business scope) and with the other actors to collect their needs.
In the mining industry, four tons of earth must be processed to extract four grams of gold at the most productive mines. The Data Engineer is likewise a miner who must extract high-value information from a large volume of data: we go from a dataset with billions of rows to a dataset of a few thousand.
By way of example, consider raw data consisting of a physical sensor measurement produced every millisecond by a fleet of one thousand sensors. The transformed data can be:
data enriched via repositories (sensor location, sensor type, etc.); the level of detail is then at its maximum
data aggregated along one or more axes (e.g. the average value per "city"); the level of detail is more "macro" but still compatible with what users request.
The Data Engineer knows large-scale data-processing techniques and the characteristics of parallel and distributed architectures; the other actors do not necessarily have the computing knowledge specific to these environments.
At this level, the Data Analyst is able to better understand the data and their relationships. They will therefore use visualization tools to navigate this layer, looking in particular for weak signals, hidden links and correlations.
Access to exposed data
Compared with the transformed data, the exposed data correspond to a particular problem (a use case) or are sufficiently prepared to be directly available to a wide range of actors.
Most often, the exposed data are consolidated in more traditional databases (RDBMS) and can be queried by analysts without deep technical skills. Access is therefore made with the analysts' usual tools (SAS, PowerBI, Qlik, Tableau, etc.).
To continue the example above, the exposed data would be cross-referenced with equipment-maintenance data (installation, repairs, monitoring). In this way, we would be able to identify potential causes of anomalies in the fleet and explain why some measurements are out of range.
These data result from cross-references, have business calculations (KPIs) applied to them, and are most often exposed to users and Business Analysts in operational monitoring tools.
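A minimal sketch of such a KPI query against the exposed layer, using an in-memory SQLite database as a stand-in for the RDBMS mentioned above; the schema and the anomaly-rate KPI are invented for illustration:

```python
import sqlite3

# Stand-in for the exposed layer: measurements already cross-referenced
# with maintenance data and flagged as in/out of range.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measures (city TEXT, value REAL, out_of_range INTEGER);
    INSERT INTO measures VALUES
        ('Paris', 16.5, 0), ('Paris', 41.0, 1),
        ('Lyon', 32.0, 0), ('Lyon', 33.0, 0);
""")

# The kind of ad-hoc SQL an analyst's tool would issue against this
# layer: share of out-of-range measurements per city.
kpi = conn.execute("""
    SELECT city, AVG(out_of_range) AS anomaly_rate
    FROM measures GROUP BY city ORDER BY city
""").fetchall()
print(kpi)  # [('Lyon', 0.0), ('Paris', 0.5)]
```

This is the level where tools like PowerBI or Tableau operate: simple aggregate queries over already-consolidated tables, with no Big Data machinery in sight.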
Getting a unified, shared view of data
The challenge of the data architecture is to ensure that access to each of these data families relies on the same data and on a common technical platform.
Too often, data lakes are built alongside BI warehouses, and the same information is duplicated in multiple locations, creating inconsistencies and a loss of data traceability. The challenge OVH has met is precisely to centralize all the information and to ensure that the views described above rely on the same data and the same software architecture.
Business Analysts and Data Scientists work on the same data, and the data lake has supplanted the BI warehouses in serving all needs.