Ensuring a high degree of quality in the enterprise's information is key to the organisation's overall success and to stakeholder trust.

Even relatively small data quality issues and inaccuracies can have a lasting impact on the confidence of data consumers and decision makers.

/images/data-mapping.png

Data Quality rules are defined separately from the Data Model, leading to fragmentation of business logic and causing confusion about data correctness at the consumer level.

/images/team.png

As the scale and complexity of the data infrastructure increase, what an analyst could previously verify in a reasonable amount of time is no longer possible.

/images/growth.png

Data quality problems impose considerable costs on the company. Accurately gauging the extent of these costs, and tracking how they change over time, remains challenging.

The standard responses to data quality problems typically involve manual script creation or the use of an external vendor solution.

Manual script creation involves teams developing scripts to assess data quality at a specific point in time.

This method becomes cumbersome over time, demanding extensive maintenance as scripts require manual review and updates with every model evolution.

Employing external vendor solutions for data quality necessitates exporting data to another platform.

The risks of vendor lock-in and of ongoing data migrations solely for analysis create unsustainable operational overhead, compounding the challenges of managing data quality.

/images/data-transformation-etl.png

Move the code to the data, not the data to the code.

The TeraHelix Solution

The Abstract Syntax Tree generated by the Data Explorer solution is used as the basis for generating data quality scripts targeting the current hosting platform. The data in the store is orders of magnitude larger than the scripts themselves; therefore, move the code to the data, not the data to the code.
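As a minimal sketch of the idea, the snippet below walks a toy model tree to collect the field paths against which quality scripts are generated and shipped to the data store. The node classes and field names are illustrative assumptions, not the actual Data Explorer types.

```python
# Minimal sketch of deriving check targets from a model AST. The Node class and
# the trade/counterparty fields are illustrative stand-ins, not Data Explorer types.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

def leaf_paths(node: Node, prefix: str = "") -> list[str]:
    """Walk the AST and return fully qualified leaf field paths; these are the
    targets for the (small) quality scripts that run inside the data store."""
    path = f"{prefix}.{node.name}" if prefix else node.name
    if not node.children:
        return [path]
    return [p for child in node.children for p in leaf_paths(child, path)]

trade = Node("trade", [Node("trade_id"), Node("counterparty", [Node("id"), Node("name")])])
print(leaf_paths(trade))  # ['trade.trade_id', 'trade.counterparty.id', 'trade.counterparty.name']
```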

Data quality rules based on Data Explorer: As the data ontology changes, so do the data quality rules. A compiler ensures that the data ontology and data quality rules remain consistent.
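A minimal sketch of that consistency guarantee, assuming hypothetical rule and model structures rather than the actual TeraHelix compiler: when the ontology drops or renames a field, validation fails immediately instead of silently running a stale check.

```python
# Hypothetical sketch of keeping ontology and rules consistent. The Rule class,
# field names and validate_rules helper are illustrative, not TeraHelix internals.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    fields: list[str]  # fields the rule reads

def validate_rules(model_fields: set[str], rules: list[Rule]) -> None:
    """Fail fast if a rule references a field the ontology no longer defines."""
    for rule in rules:
        missing = [f for f in rule.fields if f not in model_fields]
        if missing:
            raise ValueError(f"Rule '{rule.name}' references unknown fields: {missing}")

trade_fields = {"trade_id", "notional", "currency"}
rules = [Rule("notional_positive", ["notional"]),
         Rule("counterparty_known", ["counterparty_id"])]  # stale after a model change

validate_rules(trade_fields, rules)  # raises, flagging the inconsistent rule
```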

Auto-generation of standard rules: Based on the Data Explorer, scripts for standard data quality measures (completeness, entropy, uniqueness, etc.) are system generated, resulting in significant savings of analyst time.
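The sketch below shows the kind of standard checks that can be derived mechanically from a model description; the model contents and SQL templates are assumptions for illustration, not the output of the actual generator.

```python
# Illustrative generator of standard data quality scripts from a model
# description. The MODEL contents and SQL templates are hypothetical.

MODEL = {"trade": ["trade_id", "counterparty_id", "notional"]}

def standard_checks(model: dict[str, list[str]]) -> list[str]:
    """Emit completeness and uniqueness aggregates for every modelled column."""
    scripts = []
    for table, columns in model.items():
        for col in columns:
            scripts.append(
                f"SELECT 1.0 - SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) "
                f"AS completeness FROM {table}"
            )
            scripts.append(
                f"SELECT COUNT(DISTINCT {col}) * 1.0 / COUNT(*) AS uniqueness FROM {table}"
            )
    return scripts

for sql in standard_checks(MODEL):
    print(sql)
```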

Custom business rules as part of the golden source: Any additional custom data quality rules are specified as part of the data model, avoiding the fragmentation of business logic that would result from introducing an external vendor solution.
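As a sketch of the idea (the rule syntax, field names and helper below are hypothetical, not TeraHelix's actual modelling language), a custom rule can be declared next to the field it constrains, so model and business logic travel together in one golden source.

```python
# Hypothetical example of custom rules living alongside the model definition
# rather than in a separate vendor tool. Names and syntax are illustrative.
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str
    sql_type: str
    rules: list[str] = field(default_factory=list)  # SQL predicates kept with the field

trade_model = [
    Field("trade_id", "VARCHAR", rules=["trade_id IS NOT NULL"]),
    Field("notional", "DECIMAL", rules=["notional > 0"]),
    Field("currency", "CHAR(3)", rules=["currency IN ('USD', 'EUR', 'GBP')"]),
]

def rule_checks(table: str, fields: list[Field]) -> list[str]:
    """Turn the co-located rules into counts of violating rows."""
    return [
        f"SELECT COUNT(*) AS violations FROM {table} WHERE NOT ({predicate})"
        for f in fields for predicate in f.rules
    ]

for sql in rule_checks("trade", trade_model):
    print(sql)
```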

Generate Different SQL Dialects for the Same Model: Different data stores have different query interfaces and implementation philosophies. The TeraHelix solution allows for maximum flexibility in implementation choice. In this iteration, SQL dialect generation is supported for Oracle, PostgreSQL, MySQL, the Apache big data stack (Spark, Phoenix etc.) and the Cloudera Platform.
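The sketch below illustrates dialect-specific rendering of the same check (sampling rows that fail a rule). The row-limiting syntax differences are real, but the renderer itself is a hypothetical stand-in for the TeraHelix generator.

```python
# Illustrative rendering of the same check into different SQL dialects.
# failing_rows_sample is a hypothetical stand-in for the TeraHelix generator.

def failing_rows_sample(dialect: str, table: str, predicate: str, n: int = 10) -> str:
    """Sample rows violating a rule, using each platform's row-limiting syntax."""
    base = f"SELECT * FROM {table} WHERE NOT ({predicate})"
    if dialect in ("postgresql", "mysql", "spark"):
        return f"{base} LIMIT {n}"
    if dialect == "oracle":
        return f"{base} FETCH FIRST {n} ROWS ONLY"
    raise ValueError(f"Unsupported dialect: {dialect}")

for dialect in ("postgresql", "oracle", "spark"):
    print(dialect, "->", failing_rows_sample(dialect, "trade", "notional > 0"))
```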

Data Quality Index: Creation of a data quality score to measure relative quality across data sources and over time. This in turn can drive operational monitoring and increase the confidence of stakeholders and data consumers.
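As a sketch of one possible scoring scheme (the weights, measure names and example scores are assumptions, not the TeraHelix definition), the index can be a weighted average of the individual measure scores, computed per source and per snapshot so it can be compared across both.

```python
# Hypothetical data quality index: a weighted average of per-measure scores,
# computed per data source so it can be compared across sources and over time.
# The weights and example scores below are illustrative assumptions.

WEIGHTS = {"completeness": 0.5, "uniqueness": 0.3, "validity": 0.2}

def quality_index(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine measure scores (each in [0, 1]) into a single index in [0, 1]."""
    total_weight = sum(weights[m] for m in scores if m in weights)
    return sum(scores[m] * weights[m] for m in scores if m in weights) / total_weight

source_a = {"completeness": 0.98, "uniqueness": 0.95, "validity": 0.90}
source_b = {"completeness": 0.85, "uniqueness": 0.99, "validity": 0.70}

print(f"source A index: {quality_index(source_a):.3f}")  # 0.955
print(f"source B index: {quality_index(source_b):.3f}")  # 0.862
```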