How to expose data lake data?

This article is part of a series of articles co-authored by Emmanuel Manceau (Quantmetry), Olivier Denti (Quantmetry) and Magali Barreau-Combeau (OVH) on the subject “Architecture – From BI to Big data”

Within the Big Data ecosystem, once the raw data has been copied into the data lake, the following questions arise:

  • How can I improve my knowledge of the data?
  • How is the data structured?
  • How can it be visualized and manipulated?
  • How can it be made available?

Several types of solutions exist to address these questions.


Tools dedicated to a user population

Search Based Application
Used mainly by Advanced Users, End Users or Data Analysts, it allows searching for information in large volumes of data (indexed from a denormalized model) without necessarily having to pre-aggregate it. Analysis across all dimensions is possible through the user interface, with excellent response times.
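As an illustration, here is a minimal sketch of such a search, assuming a denormalized “sales” index served by Elasticsearch and queried with the elasticsearch-py client (8.x); the index name and fields are hypothetical:

```python
# A minimal sketch, assuming a denormalized "sales" index in Elasticsearch
# (index name and field names are hypothetical illustrations).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search combined with a faceted aggregation: no pre-aggregated
# table is needed, the index answers both at query time.
response = es.search(
    index="sales",
    query={"match": {"product_label": "fiber subscription"}},
    aggs={"by_region": {"terms": {"field": "region.keyword"}}},
)

for hit in response["hits"]["hits"]:
    print(hit["_source"])
print(response["aggregations"]["by_region"]["buckets"])
```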
SQL Relational Database
Used primarily by Advanced Users or Data Analysts, it allows them to run ad-hoc queries and join data sets, as they previously did with a data warehouse. The data comes from a subset of the Lakeshore.
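For example, a minimal sketch of an ad-hoc query joining two data sets, here with Python's built-in sqlite3 module standing in for the relational database (table and column names are hypothetical):

```python
# A minimal sketch of an ad-hoc query over a relational copy of two
# Lakeshore tables (names and columns are hypothetical illustrations).
import sqlite3

conn = sqlite3.connect("lakeshore_subset.db")

# Join two data sets on the fly, exactly as one would against a data warehouse.
rows = conn.execute(
    """
    SELECT c.customer_id, c.segment, SUM(i.amount) AS total_billed
    FROM customers AS c
    JOIN invoices  AS i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
    ORDER BY total_billed DESC
    LIMIT 10
    """
).fetchall()

for row in rows:
    print(row)
```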
DataLab
Used by Data Scientists or Data Analysts, it is the place of choice for experimentation and proofs of concept (POCs). Users are free to perform any type of processing, including Machine Learning.
DataViz / Data Analytics Tools
Used primarily by Advanced Users, End Users or Data Analysts to visualize and manipulate data. Several tools can be made available depending on the requirements of the use case (real-time data visualization, response time on selection, depth of history, etc.).

Tools dedicated to the Information System (IS)

The enterprise IS is a “customer” of the data lake's data like any other. Two patterns can be distinguished:

Data export, typically used in “batch” mode, feeds downstream systems in bulk (once a day, for example, after the nightly processing).
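A minimal sketch of such a nightly export, assuming a Parquet data set in the lake and a CSV drop zone for the downstream system (paths and columns are hypothetical):

```python
# A minimal sketch of a nightly batch export (all paths and column names
# are hypothetical illustrations).
import pandas as pd

# Read yesterday's partition from the data lake.
df = pd.read_parquet("/datalake/lakeshore/invoices/date=2018-06-01/")

# Keep only the columns the consuming system expects.
export = df[["customer_id", "invoice_id", "amount", "currency"]]

# Drop the file where the downstream batch job picks it up.
export.to_csv("/exports/billing/invoices_2018-06-01.csv", index=False)
```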

API: systems retrieve unit data through an interface (of the web service type). This technique is used, for example, in a call center, where the front end displays a customer “score” retrieved in real time while the agent consults the customer file.
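A minimal sketch of such a service, here with Flask and an in-memory dictionary standing in for the real-time serving store (the endpoint and data are hypothetical):

```python
# A minimal sketch of a score-serving API (Flask; the endpoint, identifiers
# and scores are hypothetical illustrations).
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stand-in for a low-latency serving store fed by the data lake.
SCORES = {"C-1024": 0.87, "C-2048": 0.42}

@app.route("/customers/<customer_id>/score")
def get_score(customer_id):
    # The call-center front end calls this endpoint while the customer
    # file is open, and displays the returned score in real time.
    if customer_id not in SCORES:
        abort(404)
    return jsonify({"customer_id": customer_id, "score": SCORES[customer_id]})

if __name__ == "__main__":
    app.run(port=5000)
```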



Common tools for all actors

Whatever the type of actor (users, IS), it is imperative to know what data is present and to understand how it was produced. Two tools are then needed:

Data Dictionary

The data dictionary allows a user to:

      • find the information they need, such as a business object
      • obtain a clear and precise definition of that object
      • identify the location (table, column, file…) of that information

This search is usually performed through free-text search and/or faceted navigation, in order to converge quickly on the expected result.

To be relevant, the dictionary must also take into account the company's business object definitions and their attachment to functional domains. By business object, I mean, for example, the “payer”, the “holder”, the “contract” or the “billing account”.
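A minimal sketch of what a dictionary entry and a faceted search could look like (the structure and the sample entry are hypothetical):

```python
# A minimal sketch of a data dictionary entry and a faceted lookup
# (the dataclass and the sample entry are hypothetical illustrations).
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    business_object: str      # e.g. "billing account"
    definition: str           # clear, shared business definition
    functional_domain: str    # facet used for navigation
    locations: list = field(default_factory=list)  # table/column/file paths

DICTIONARY = [
    DictionaryEntry(
        business_object="billing account",
        definition="Account grouping the invoices of a payer.",
        functional_domain="billing",
        locations=["lakeshore.billing.accounts.account_id"],
    ),
]

# Free-text search narrowed by a facet, converging quickly on the result.
def search(text, domain=None):
    return [
        e for e in DICTIONARY
        if text.lower() in (e.business_object + " " + e.definition).lower()
        and (domain is None or e.functional_domain == domain)
    ]

print(search("billing", domain="billing"))
```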

Repository of transformations

The repository of transformations, also called Data Lineage, makes it possible to understand how data is constituted (from what, and how).

For example, turnover may be computed as an aggregate of all cash receipts, excluding purchases made during private sales. Ideally, the tool should trace the data paths and transformations from the reception of the raw data (cash receipts), through its enrichment (in what context the sale was made) and filtering (excluding private sales), to its exposure (summing the amounts).
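A minimal sketch of this turnover pipeline, with each lineage step annotated (the data and column names are hypothetical):

```python
# A minimal sketch of the turnover pipeline described above, with each
# lineage step annotated (data and column names are hypothetical).
import pandas as pd

# 1. Raw data: cash receipts as received in the data lake.
receipts = pd.DataFrame({
    "receipt_id": [1, 2, 3, 4],
    "amount":     [100.0, 250.0, 80.0, 40.0],
    "sale_id":    ["S1", "S2", "S3", "S4"],
})

# 2. Enrichment: attach the context in which each sale was made.
sales_context = pd.DataFrame({
    "sale_id": ["S1", "S2", "S3", "S4"],
    "channel": ["store", "web", "private_sale", "store"],
})
enriched = receipts.merge(sales_context, on="sale_id")

# 3. Filtering: exclude purchases made during private sales.
filtered = enriched[enriched["channel"] != "private_sale"]

# 4. Exposure: sum the amounts to obtain the turnover.
turnover = filtered["amount"].sum()
print(turnover)  # 390.0
```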

This type of tool (in high demand!) does not really exist yet. Software suites, provided they are used end to end across the chain, can offer a form of “lineage”, but the information returned is very technical and cannot currently be made available to end users. The alternative is to document each transformation stage manually, but this represents a significant workload and must be maintained over time.
