This article is the introduction to a series of articles co-authored by Emmanuel Manceau (Quantmetry), Olivier Denti (Quantmetry) and Magali Barreau-Combeau (OVH) on the subject "Architecture – From BI to Big Data".
These articles grew out of a breakfast meeting at Dataiku, where I presented the results of our thinking and feedback on the industrialization of data projects, and Magali shared OVH's experience of how data is organized at OVH.
I have long argued in favor of structuring the data in the datalake and modeling it. Until now, this position was not well received by the data scientists I worked with. The argument I heard most often was: "Why bother modeling the data? We store the raw information in the datalake and that's enough."
I had the experience of several projects behind me where we had started from raw information stored in a datalake and had enormous difficulty giving it meaning. The reality of data projects is that we use only a tiny fraction of the data in a datalake (not always the same fraction from one project to another, which is why we store everything preventively), and it is absolutely necessary to restore meaning to the data before using it as model input. No data scientist feeds raw data directly into a machine learning model.
So when I heard Magali explain how OVH had structured the information in its datalake, organizing several layers of processing and consolidation to move from raw data to meaningful data, and how they had built views to serve business needs, I said to myself: they have understood everything!
I say this with all the fewer scruples because this work was carried out by OVH without the help of any consulting firm or integrator (and therefore not by Quantmetry).
From this discussion came the desire to speak more broadly about the organization of datalakes: to ensure that they serve the interests of data projects, facilitate industrialization and scale-up, and contribute efficiently to the progressive replacement of legacy decision-support systems, thereby helping to control costs and simplify the IS.
- To understand each other, we must first clarify the terms of the debate and the concepts used. Let us shun portmanteau words and agree together on shared definitions. The first article, "Data Terminology", therefore deals with the definition of concepts.
- The second article, "Users and their needs", specifies the populations that access datalake information and their expectations in terms of how the information is structured.
- The third article, "Datawarehouse vs Big Data", describes how to move from a decision-support architecture to a datalake architecture while avoiding some pitfalls.
- The fourth article asks: why must we model the data?
- The fifth article deals with exposing datalake data.
As consultants, we were often confronted with datalakes whose data was little or badly exploited, and the need to model the data was still poorly understood. Storing is not enough to interpret. While the datalake lets us store raw information without modeling it (schemaless), which simplifies many things, we must still model that information in order to understand and interpret it (schema on read).
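The schemaless-write / schema-on-read idea can be sketched in a few lines of Python. This is a toy illustration only, not any specific datalake technology: the records, field names and `read_with_schema` helper are invented for the example.

```python
import json

# Schemaless write: raw event records land in the datalake as-is;
# no structure is imposed at write time (hypothetical sample data).
raw_records = [
    '{"ts": "2019-03-01T10:00:00", "customer": "42", "amount": "19.9"}',
    '{"ts": "2019-03-01T10:05:00", "customer": "17"}',  # incomplete record
]

# Schema on read: the structure is only declared and applied when the
# data is consumed, by listing the expected fields and their types.
schema = {"ts": str, "customer": int, "amount": float}

def read_with_schema(line, schema):
    """Parse one raw record and coerce it to the declared schema.

    Missing fields become None instead of breaking the pipeline.
    """
    raw = json.loads(line)
    return {field: (cast(raw[field]) if field in raw else None)
            for field, cast in schema.items()}

cleaned = [read_with_schema(r, schema) for r in raw_records]
print(cleaned[0]["amount"])   # 19.9 (string parsed to float at read time)
print(cleaned[1]["amount"])   # None (field absent in the raw record)
```

The point of the sketch is that interpretation (types, expected fields, handling of gaps) happens at read time, which is exactly the modeling effort the series argues cannot be skipped just because storage is schemaless.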
That is why we decided to write this series of articles about datalakes.