Data Terminology

This article is the first in a series co-authored by Emmanuel Manceau (Quantmetry), Olivier Denti (Quantmetry) and Magali Marion (OVH) on the subject “Architecture – From BI to Big Data”.


Existing Concepts before Big Data

Databases and business applications

Relational Databases
A relational database is a type of database organized in tables and relationships.
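
As a minimal sketch of what "tables and relationships" means in practice, the following uses Python's built-in sqlite3 module with an in-memory database. The customer and invoice tables, their columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoice ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [(10, 1, 120.0), (11, 1, 80.0)],
)

# The relationship invoice.customer_id -> customer.id lets us join the two tables.
rows = conn.execute(
    "SELECT c.name, i.amount FROM customer c "
    "JOIN invoice i ON i.customer_id = c.id ORDER BY i.id"
).fetchall()
print(rows)
```

Each invoice row points to exactly one customer row through customer_id; that link is the "relationship" of the definition.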
Software package
According to the French Wikipedia, a software package (“progiciel” in French) is “a general purpose application software with multiple functions, composed of a set of configurable programs and intended to be used simultaneously by several people.” Software packages are applications dedicated to a given subject and are distinguished by being configurable: adding data to a screen, creating new screens, and so on. CRM (Customer Relationship Management) and ERP (Enterprise Resource Planning) projects are mainly based on software packages.
Raw Data
This is the lowest level of information. It contains a very small amount of information about the objects and processes of the company.

Relevant Data

Relevant data is data that allows a user to make an informed decision in a particular context. It therefore corresponds to a view of the data.

Business Object

A business object corresponds to an essential business concept that is stable over time. Borrowed from enterprise architecture, the term designates a high-level concept (e.g. Customer, Vehicle, Invoice, Order) that interacts with business processes. In practice, a business object is scattered and fragmented across a large number of systems and has to be reconstructed in the repositories in order to obtain an overall view.
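
To illustrate the reconstruction step, here is a small sketch that rebuilds a Customer business object from two source systems. The Customer class, the build_customers helper and the CRM and billing extracts are all hypothetical; a real reconstruction would depend on the company's own systems and rules.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical business object assembled from several systems."""
    customer_id: int
    name: str
    total_billed: float


# Illustrative extracts: the CRM knows names, billing knows amounts.
crm_records = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
billing_records = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
]


def build_customers(crm, billing):
    # Sum billed amounts per customer, then attach them to the CRM identity.
    totals = {}
    for line in billing:
        cid = line["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + line["amount"]
    return [
        Customer(cid, rec["name"], totals.get(cid, 0.0))
        for cid, rec in sorted(crm.items())
    ]


customers = build_customers(crm_records, billing_records)
print(customers)
```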

 


Business intelligence Terminology

ODS – Operational data store
Popularized by the American researcher Bill Inmon, the ODS (Operational Data Store) gathers the company’s data in a single database. The ODS is designed to perform data quality processing. It stores only the lowest level of information: for billing data, it will store each invoice but not the weekly, monthly or other aggregates. It is not optimized for massive selection queries.
Bill Inmon proposed it as an intermediate step toward building an analytical database (or Datamart). The ODS is not used to keep a history of the data, but only to provide a real-time view of the current data. Ralph Kimball, the other major American researcher in decision support, instead recommended building a single Datawarehouse directly, without first creating an ODS.
Datawarehouse
A datawarehouse is a database for use by decision-makers or data analysts and it is optimized for massive selection queries.

This is traditionally a read-only database, which does not feed the operational applications that provide the data.

The datawarehouse usually contains data at the finest level, as well as aggregates. In the case of an ODS upstream, it can sometimes simply store the aggregates.
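
The difference between data at the finest level and aggregates can be sketched as follows. The invoice records and their fields are hypothetical; the detailed list stands for what an ODS would hold, and the monthly totals stand for an aggregate a datawarehouse might add.

```python
from collections import defaultdict

# Finest level: one record per invoice (illustrative data).
invoices = [
    {"id": 1, "date": "2018-01-05", "amount": 100.0},
    {"id": 2, "date": "2018-01-20", "amount": 50.0},
    {"id": 3, "date": "2018-02-03", "amount": 75.0},
]

# Aggregate: total billed per month, derived from the detailed records.
totals = defaultdict(float)
for inv in invoices:
    month = inv["date"][:7]  # "YYYY-MM" prefix of the ISO date
    totals[month] += inv["amount"]

monthly_totals = dict(totals)
print(monthly_totals)
```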

It can be stored on a relational database (Oracle, Db2, SQL Server) or on a “massively parallel” database (Teradata, Netezza).

Datamart
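A Datamart is a database for use by decision-makers or data analysts, typically restricted to a single subject area or business function (for example sales or finance). Like the datawarehouse, it is optimized for massive selection queries.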

This is traditionally a read-only database, which does not feed the operational applications that provide the data.

Like the datawarehouse, it can contain data at the finest level as well as aggregates. In the case of an ODS upstream, it can sometimes simply store the aggregates. It can be stored on a relational database (Oracle, Db2, SQL Server) or on a “massively parallel” database (Teradata, Netezza).


Big Data terminology

Structured data
Structured data are data that are represented in a predefined format, usually in a conventional table / column format. In this category you can find data from databases, delimited text files, flat files, etc.
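
As a short sketch of why a predefined format matters, the snippet below reads a delimited text file with Python's standard csv module. The column names and the semicolon delimiter are assumptions made for the example.

```python
import csv
import io

# A delimited text file held in memory; columns and delimiter are illustrative.
raw = "customer_id;name;city\n1;Alice;Paris\n2;Bob;Lille\n"

# Because the format is predefined, every row maps onto the same columns.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
records = list(reader)

for record in records:
    print(record["customer_id"], record["name"], record["city"])
```

Note that csv returns every value as a string; converting customer_id to an integer would be a separate typing step.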
Unstructured data
Unstructured data are data that cannot be stored in a conventional table / column format. Examples include:

  • Long written text (PDF files, textbooks, etc.)
  • Audio files
  • Video files
Datalake
The datalake is a storage location for receiving all of the company’s data, whether structured or unstructured.

The architecture of the datalake can consist of 2 different layers:

    • A layer for receiving raw data from source systems. No transformation is carried out on the data.
      Users generally do not have access to this data, which may also be retained only temporarily. Moreover, some data come from software packages and can only be exploited with knowledge of the business rules those packages apply; it is therefore not possible to expose this layer directly to users.
    • A data exposure layer that presents the different data in a standardized format, both at the content level and at the container level.
      This is the level at which data useful to the business, such as sales results, is historized. The same data can be used by Data Scientists to predict future sales. It is also the place to host snapshots (“photos”) of data, for example of the customer base. The historization of the data must comply with the regulatory constraints, such as GDPR, applicable to the company.
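
The two layers described above can be sketched as follows. This is only an illustration under stated assumptions: the source names ("crm", "shop"), the payload field names and the standardize helper are hypothetical, and a real datalake would store files or tables rather than Python lists.

```python
from datetime import date, datetime

# Raw layer: records kept exactly as received, with no transformation.
raw_layer = [
    {"source": "crm", "payload": {"CUST_NM": " alice ", "CRT_DT": "05/01/2018"}},
    {"source": "shop", "payload": {"customerName": "Bob", "created": "2018-02-03"}},
]


def standardize(record):
    """Map a source-specific record to the common exposure format."""
    payload = record["payload"]
    if record["source"] == "crm":
        name = payload["CUST_NM"]
        created = datetime.strptime(payload["CRT_DT"], "%d/%m/%Y").date()
    else:
        name = payload["customerName"]
        created = date.fromisoformat(payload["created"])
    # Same container (field names) and same content conventions for every source.
    return {"customer_name": name.strip().title(), "created_on": created.isoformat()}


# Exposure layer: the standardized view that users are allowed to query.
exposure_layer = [standardize(r) for r in raw_layer]
print(exposure_layer)
```

The source-specific knowledge (field names, date formats) stays inside standardize, which is why users only need the exposure layer.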

In the following articles in this series, we will specify the methodology for the definition of this datalake data presentation layer.

NoSQL database
NoSQL stands for “Not Only SQL”. According to Wikipedia, it refers to “a set of database management systems that deviate from the classical paradigm of relational databases”. The data is no longer organized in tables and relationships.
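
One common NoSQL model is the document, where related data is nested in a single record instead of being split across tables. The sketch below uses only Python's json module to show the shape of such a document; the order structure is hypothetical and no specific NoSQL product or API is implied.

```python
import json

# A single self-contained document: customer and order lines are nested,
# not spread over separate tables linked by keys (illustrative structure).
order_document = {
    "order_id": 42,
    "customer": {"name": "Alice", "city": "Paris"},
    "lines": [
        {"product": "router", "quantity": 1},
        {"product": "cable", "quantity": 3},
    ],
}

# Documents are typically exchanged and stored in a serialized form such as JSON.
serialized = json.dumps(order_document)
restored = json.loads(serialized)

total_items = sum(line["quantity"] for line in restored["lines"])
print(total_items)
```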

New Concepts

Datashore or Enterprise Data Store
The Datashore exposes all the company’s significant data in a standardized format, both content and container. It is structured in business objects defined according to the organization and the objectives of the company.

These data represent the reality of the company and must be able to be understood without knowledge of the technical models from which they are derived.

Layer
A layer corresponds to a stage in the chain of aggregation and consolidation of information, from the “raw” information of the datalake to the summarized data presented to users or used as input to a data science model.
The information in the datalake is structured in successive layers, from the raw level to the business object. Some layers are durable and permanently stored: they can be used as a basis for calculations other than those for which they were originally planned. Other layers are purely temporary or intended for a single calculation flow.
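
A rough sketch of successive layers, assuming a tiny list of raw amounts: one cleaned layer is kept for reuse, and one summary is computed for a single flow. The function names, the stored_layers dictionary and the sample values are hypothetical.

```python
# Raw level: values as received, including malformed entries (illustrative).
raw = [" 120.0", "80.0 ", "n/a", "50.0"]


def clean_layer(values):
    """First layer: keep only the values that parse as numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value.strip()))
        except ValueError:
            continue  # discard entries that are not numeric
    return cleaned


def summary_layer(amounts):
    """Later layer: summarized data built on top of the cleaned layer."""
    return {"count": len(amounts), "total": sum(amounts)}


# Durable layer: stored so that other calculations can reuse it.
stored_layers = {"clean": clean_layer(raw)}

# Temporary layer: computed for one flow and not stored.
summary = summary_layer(stored_layers["clean"])
print(summary)
```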
View
A view is a presentation of data, aggregated or detailed, that corresponds to a particular use. There may be multiple views of the same data.
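
In a relational database, this notion maps naturally onto SQL views. The sketch below, using Python's built-in sqlite3 module, defines one aggregated view and one detailed view over the same table. The sale table, the view names and the threshold of 70 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 70.0)],
)

# Aggregated view: total sales per region.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sale GROUP BY region"
)
# Detailed view: individual sales at or above a threshold.
conn.execute(
    "CREATE VIEW large_sales AS "
    "SELECT region, amount FROM sale WHERE amount >= 70"
)

by_region = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
large = conn.execute(
    "SELECT region, amount FROM large_sales ORDER BY amount"
).fetchall()
print(by_region, large)
```

Both views read the same underlying rows; only the presentation differs, which is the point of the definition.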


This article is published in French on the Quantmetry website.
