Even relatively small data quality issues and inaccuracies can have a lasting impact on the confidence of data consumers and decision makers.
Move the code to the data, not the data to the code.
The Abstract Syntax Tree (AST) generated by the Data Explorer solution is the basis from which data quality scripts are generated for the current hosting platform. The data in the store is orders of magnitude larger than the scripts, so it is far cheaper to ship the generated scripts to the data than to pull the data out to the code.
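As a rough illustration of this principle, the Python sketch below uses hypothetical Entity and Field classes as a stand-in for the Data Explorer AST and emits a completeness check as SQL. The check executes inside the data store, so only the aggregated counts travel over the network; the table and column names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    name: str
    nullable: bool


@dataclass
class Entity:
    table: str
    fields: List[Field]


def completeness_script(entity: Entity) -> str:
    """Emit SQL that runs inside the data store; only the aggregated
    counts travel back over the network, never the underlying rows."""
    checks = ",\n  ".join(
        f"SUM(CASE WHEN {f.name} IS NULL THEN 1 ELSE 0 END) AS {f.name}_nulls"
        for f in entity.fields
        if not f.nullable
    )
    return f"SELECT\n  COUNT(*) AS row_count,\n  {checks}\nFROM {entity.table};"


# Hypothetical entity; the real AST is derived from the data ontology.
trade = Entity("trades", [Field("trade_id", False),
                          Field("notional", False),
                          Field("comment", True)])
print(completeness_script(trade))
```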
Data quality rules based on the Data Explorer: As the data ontology changes, so do the data quality rules. A compiler ensures that the data ontology and data quality rules remain consistent.
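A minimal sketch of that consistency check, assuming a simplified dictionary representation of the ontology and rules (the real compiler works against the Data Explorer model itself): a rule that references a field removed or renamed in the ontology fails at build time rather than at run time.

```python
# Hypothetical ontology and rule definitions, for illustration only.
ONTOLOGY = {"trades": {"trade_id", "notional", "trade_date", "counterparty"}}

RULES = [
    {"entity": "trades", "name": "notional_positive", "fields": ["notional"]},
    {"entity": "trades", "name": "dates_not_future", "fields": ["trade_date"]},
]


def compile_rules(ontology, rules):
    """Reject any rule that references a field missing from the ontology."""
    errors = []
    for rule in rules:
        known = ontology.get(rule["entity"], set())
        missing = [f for f in rule["fields"] if f not in known]
        if missing:
            errors.append(f"rule '{rule['name']}' references unknown fields {missing}")
    if errors:
        raise ValueError("; ".join(errors))
    return rules


compile_rules(ONTOLOGY, RULES)  # raises if the ontology and rules have drifted apart
```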
Auto-generation of standard rules: Based on the Data Explorer, scripts for standard data quality measures (completeness, entropy, uniqueness, etc.) are generated by the system, resulting in significant savings of analyst time.
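The sketch below shows what two of these standard measures might look like once generated: a uniqueness ratio emitted as SQL so it runs in the store, and a Shannon entropy calculation over the value counts that a generated GROUP BY query would return. Table and column names are hypothetical.

```python
import math


def uniqueness_sql(table: str, column: str) -> str:
    """Fraction of distinct values in a column, computed inside the store."""
    return (f"SELECT COUNT(DISTINCT {column}) * 1.0 / NULLIF(COUNT(*), 0) "
            f"AS {column}_uniqueness FROM {table};")


def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a column, from its per-value counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


print(uniqueness_sql("trades", "counterparty"))
print(entropy_from_counts([50, 30, 20]))  # ~1.49 bits
```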
Custom business rules as part of golden source: Any additional custom data quality rules are specified as part of the data model, which avoids the fragmentation of business logic that would inevitably result if an external vendor solution were introduced.
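A hypothetical sketch of what keeping such a rule in the golden source might look like: the custom rule is declared next to the entity it governs and compiled to SQL alongside the auto-generated standard checks, so model and business logic cannot diverge. The entity, fields and predicate shown are illustrative.

```python
MODEL = {
    "entity": "trades",
    "fields": ["trade_id", "notional", "trade_date", "settlement_date"],
    "custom_rules": [
        # Business rule expressed against model fields, stored with the model.
        {"name": "settles_after_trade", "predicate": "settlement_date >= trade_date"},
    ],
}


def custom_rule_sql(model) -> list:
    """Compile each custom rule into a violation-count query."""
    return [
        f"SELECT COUNT(*) AS {rule['name']}_violations "
        f"FROM {model['entity']} WHERE NOT ({rule['predicate']});"
        for rule in model["custom_rules"]
    ]


for statement in custom_rule_sql(MODEL):
    print(statement)
```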
Generate Different SQL Dialects for the Same Model: Different data stores have different query interfaces and implementation philosophies. The TeraHelix solution allows for maximum flexibility in terms of implementation choice. In this iteration, generation of SQL dialects for Oracle, PostgreSQL, MySQL, the Apache big data stack (Spark, Phoenix, etc.) and the Cloudera Platform is supported.
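A minimal sketch of dialect-aware generation: the same logical query is rendered for several targets, differing here only in identifier quoting and row-limiting syntax. The differences shown are illustrative; a real generator covers far more of each platform's surface area.

```python
DIALECTS = {
    "oracle":     {"quote": '"', "limit": "FETCH FIRST {n} ROWS ONLY"},
    "postgresql": {"quote": '"', "limit": "LIMIT {n}"},
    "mysql":      {"quote": "`", "limit": "LIMIT {n}"},
    "spark":      {"quote": "`", "limit": "LIMIT {n}"},
}


def sample_rows_sql(dialect: str, table: str, n: int = 100) -> str:
    """Render the same 'sample n rows' query for the chosen dialect."""
    d = DIALECTS[dialect]
    q = d["quote"]
    return f"SELECT * FROM {q}{table}{q} {d['limit'].format(n=n)};"


for name in DIALECTS:
    print(f"{name:<11} {sample_rows_sql(name, 'trades')}")
```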
Data Quality Index: A data quality score is created to measure relative quality between data sources and over time. This in turn can drive operational monitoring and increase the confidence of stakeholders and data consumers.
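One simple way such an index could be composed, shown as a sketch with illustrative weights: individual measures, each normalised to the range 0 to 1, are combined into a single weighted score per source so that sources can be compared with each other and tracked over time.

```python
# Illustrative weights; a real index would be calibrated per data domain.
WEIGHTS = {"completeness": 0.5, "uniqueness": 0.3, "custom_rules": 0.2}


def quality_index(measures: dict) -> float:
    """Weighted combination of normalised (0..1) data quality measures."""
    return round(sum(WEIGHTS[k] * measures[k] for k in WEIGHTS), 3)


print(quality_index({"completeness": 0.98, "uniqueness": 0.91, "custom_rules": 1.0}))  # 0.963
print(quality_index({"completeness": 0.74, "uniqueness": 0.88, "custom_rules": 0.60}))  # 0.754
```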