How Criteo Manages The Traceability Of Its Data

Criteo has set up a data lineage system around its Hadoop cluster. What techniques does it rely on? And what can you do with a data traceability (data lineage) system? You can, for example, automate the grouping of data quality problems or the repair of datasets impacted by incidents.

Criteo is exploring these two avenues now that it has implemented its system. The system works at different levels: tables (a category that, in practice, also includes assets not backed by Power BI tables and reports), partitions, and columns.

This traceability supports, among other use cases, impact analysis, root cause research, compliance (audit, PII tracking), and metadata enrichment. It is exposed, on the one hand, through a series of datasets and, on the other, through Datadoc, a web app that includes catalog and observability functionalities. Datadoc also exposes elements relating to the queries, tasks, and applications that result in transformations. The process is part of a more global pipeline for collecting and analyzing usage data across the Criteo data platform.

To support its data lineage approach, Criteo uses several techniques, including:

Manually fill in the source-destination relationships (owners or stewards generally do this)
Search for specific patterns in the asset metadata
Use the execution logs of the data platform services (logs-as-source)
Exploit the source code specifying the transformations (source-as-code)
Integrate traceability capabilities into specific systems

Multi-Layer Monitoring Within The Hadoop Cluster

At Criteo, most offline processing takes place on the data lake, a Hadoop cluster of 3000 machines storing 180 PB. Spark jobs and Hive queries are mainly executed there. All are orchestrated by two in-house tools (Cuttle and BigDataflow).

Most of the raw data ingested comes from Kafka, a centralized system that transmits it in batches. Once the transformations are completed, users consume the data with Presto/Trino, or it is exported to Vertica. Some clusters power a Tableau deployment. Because most of the cluster's inputs and outputs rely on centralized systems, traceability information can be exposed through their APIs. In some cases, we can review the code and integrate it with the CI for the validation step.

The most complex part is tracking the transformations taking place in the cluster. Several layers of data lineage are used for this purpose. On SQL engines like Hive and Presto/Trino, the parser makes it possible to expose the lineage information. Criteo has configured hooks that store the query execution context in a Kafka topic, from which it is transmitted to the global pipeline. We also use Kafka for the transformations orchestrated by BigDataflow. For the rest, we use Garmadon, an event collection service that tracks interactions with the underlying file system.
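To make the idea of a hook publishing a query execution context more concrete, here is a minimal sketch in Python. It is not Criteo's implementation: the `QueryLineageEvent` record, its field names, and the helper functions are all hypothetical, and the actual Kafka publishing step is left out. It only shows what such a message could carry and how table-level lineage edges could be derived from it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class QueryLineageEvent:
    """Hypothetical execution context captured by a SQL engine hook."""
    engine: str                                        # e.g. "hive" or "trino"
    query_id: str
    inputs: List[str] = field(default_factory=list)    # tables read
    outputs: List[str] = field(default_factory=list)   # tables written


def to_message(event: QueryLineageEvent) -> bytes:
    """Serialize the event as UTF-8 JSON, ready to hand to a Kafka producer."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def to_edges(event: QueryLineageEvent) -> List[Tuple[str, str]]:
    """Expand one event into (source, destination) table-level edges."""
    return [(src, dst) for src in event.inputs for dst in event.outputs]


event = QueryLineageEvent(
    engine="hive",
    query_id="q-1",
    inputs=["db.a", "db.b"],
    outputs=["db.c"],
)
message = to_message(event)
edges = to_edges(event)
```

Keeping the event as a flat record of inputs and outputs means the downstream pipeline does not need to know which engine produced it.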

Enrichment And Deduplication

The data Garmadon produces only shows relationships between applications and raw paths. It therefore requires semantic enrichment. To do this, we perform two tasks:

Merge the applications involved in the same logical transformation. This step is essentially based on pattern detection techniques. Garmadon also allows you to inject tags into applications, intended to link executions to logical units declared in the orchestrators.
Associate raw paths with semantics already available in the Hive metadata store.
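The second task, associating a raw path with a table and partition, can be sketched as follows. This is an illustration under stated assumptions rather than the actual enrichment job: the `TABLE_LOCATIONS` dictionary is a hypothetical stand-in for locations read from the Hive metadata store, the paths and table names are invented, and partitions are assumed to follow the usual `key=value` directory layout.

```python
from typing import Dict, Optional, Tuple

# Hypothetical extract of the metadata store: table location -> table name.
TABLE_LOCATIONS: Dict[str, str] = {
    "/warehouse/sales.db/orders": "sales.orders",
    "/warehouse/sales.db/orders_archive": "sales.orders_archive",
}


def resolve_path(
    raw_path: str, locations: Dict[str, str] = TABLE_LOCATIONS
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Map a raw file system path to (table name, partition values)."""
    path = raw_path.rstrip("/")
    # Try the longest locations first so that nested locations win.
    for location in sorted(locations, key=len, reverse=True):
        if path == location or path.startswith(location + "/"):
            remainder = path[len(location):].strip("/")
            # Keep only key=value directory segments; file names are ignored.
            partition = dict(
                segment.split("=", 1)
                for segment in remainder.split("/")
                if "=" in segment
            )
            return locations[location], partition
    return None  # the path is not under any known table


resolved = resolve_path(
    "/warehouse/sales.db/orders/day=2024-01-01/hour=03/part-0000.parquet"
)
```

Matching on `location + "/"` rather than on the bare prefix avoids attributing a path under `orders_archive` to the `orders` table.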

Once the traceability sources are assembled, deduplication takes place: when several sources describe the same relationship, we keep the highest-quality one, for example, data coming from the hooks rather than from Garmadon. From there, we can carry out other transformations to expose the data in forms more suited to specific use cases.
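The deduplication rule can be illustrated with a short sketch. The only ordering taken from the text is that hook data is preferred over Garmadon data; the `Edge` type, the origin labels, and the treatment of unknown origins as least trusted are assumptions made for the example.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Edge(NamedTuple):
    source: str
    destination: str
    origin: str  # which lineage layer reported the relationship


# Lower rank = more trusted. Hooks before Garmadon, as described in the text.
ORIGIN_RANK: Dict[str, int] = {"hook": 0, "garmadon": 1}


def _rank(edge: Edge) -> int:
    # Assumption: origins missing from the table are trusted the least.
    return ORIGIN_RANK.get(edge.origin, len(ORIGIN_RANK))


def deduplicate(edges: Iterable[Edge]) -> List[Edge]:
    """Keep one edge per (source, destination), from the best-ranked origin."""
    best: Dict[Tuple[str, str], Edge] = {}
    for edge in edges:
        key = (edge.source, edge.destination)
        current = best.get(key)
        if current is None or _rank(edge) < _rank(current):
            best[key] = edge
    return list(best.values())


deduplicated = deduplicate([
    Edge("a", "b", "garmadon"),
    Edge("a", "b", "hook"),
    Edge("b", "c", "garmadon"),
])
```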

Data lineage is also used in an internal search engine to influence the ranking of results: the more transitive dependencies an asset has, the more important it probably is.
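As a rough illustration of this ranking signal, the sketch below counts, for each asset, how many other assets depend on it directly or indirectly. It reads "transitive dependencies" as downstream dependents, which is an interpretation, and the asset names are invented. How the search engine actually combines this count with other signals is not described in the source.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def transitive_dependents(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the assets reachable downstream of each asset in the lineage graph."""
    downstream: Dict[str, Set[str]] = defaultdict(set)
    assets: Set[str] = set()
    for source, destination in edges:
        downstream[source].add(destination)
        assets.update((source, destination))

    counts: Dict[str, int] = {}
    for asset in assets:
        seen: Set[str] = set()
        queue = deque(downstream.get(asset, ()))
        while queue:  # breadth-first walk over downstream assets
            node = queue.popleft()
            if node in seen or node == asset:
                continue  # already counted, or a cycle back to the start
            seen.add(node)
            queue.extend(downstream.get(node, ()))
        counts[asset] = len(seen)
    return counts


scores = transitive_dependents([
    ("raw", "clean"),
    ("clean", "report"),
    ("clean", "export"),
])
```

Here `raw` gets the highest count because every other asset is derived from it, so it would be boosted the most.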

Datadoc can also report performance alerts based on SLO information extracted from dataset definitions, and provide users with information on the root cause.
