Polyglot Data Modeling
Hackolade was originally created to perform schema design, i.e. physical data models only. But as support for more and more targets grew, customers started asking, first to be able to save a model for one target into a model for another target. Then they also started to ask that Hackolade could ensure the consistency of schemas across different technologies in data pipelines.
With RDBMS vendors, the differences between the physical implementations are fairly minor, and so are the differences between the SQL dialects. Not so with all the targets supported by Hackolade for data-at-rest and data-in-motion. The feature set, the storage model, the terminology, the data types, the indexing capabilities, the schema syntax, etc.: there is very little in common between a MongoDB and a Cassandra data model, or between a Parquet and a DynamoDB data model for example.
Nevertheless, we had to find a solution. Hackolade having been built on top of JSON Schema, it seemed natural to use it as a bridge to convert from one physical target to another. But there a couple of major limitations with JSON Schema:
- the data types are very limited and don't provide the granularity found in many targets,
- only a single entity can be represented in a schema file
- there is no way to represent Foreign Key relationships
Other schema standards presented similar issues (Avro or XSD) or other equally challenging issues (DDLs in SQL don't include nested objects or are able to represent polymorphism, Snowflake, Hive, ...).
A common physical schema from which target schemas are derived
In traditional data modeling approaches for relational databases using the conceptual, logical, and physical types, the purpose for a logical model has been defined as a database-agnostic view of structures and business rules.
In our definition, a Polyglot Data Model sits over the previous boundary between logical and physical. It is sort of a logical model in the sense that it is technology-agnostic, but it is really a common physical schema with the following features:
- allows denormalization, if desired, given access patterns;
- allows complex data types;
- generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies.
In RDBMS the different dialects of SQL will lead to fairly similar DDLs, whereas schema syntax for Avro, Parquet, OpenAPI, HiveQL, Neo4j Cypher, MongoDB, etc... are vastly different.
The generation of physical names for entities (could be called tables, collections, nodes, vertices, files, etc.. in target technology) and attributes (could be called columns, fields, etc.. in target technology) should be capable of following different transformation rules, as Cassandra for example does not allow UPPERCASE while MongoDB will prefer camelCase, etc.
Polyglot Data Modeling
With version 5.2.0 of Hackolade Studio, users can now define structures once in a technology-agnostic polyglot data model, complete with with denormalization and complex data types, then represent these structures in a variety of physical data models respecting the specific aspects of each target technology.
The polyglot functionality is there to help cope with the challenges of polyglot persistence. Polyglot persistence can be applied to different use cases, for example a large operational application:
or through a data pipeline:
For example, you want to make sure to keep in sync the structures of an operational database in MongoDB, a landing zone with Avro files on S3 object storage, a data lake on Databricks with bronze-silver-gold states, streaming with Kafka and a Confluent Schema Registry, and a data warehouse in Snowflake. It is hard enough to build this data pipeline, imagine what it takes to support to dynamic evolution of these schemas as new attributes are constantly added during the application lifecycle.
If you add/delete/modify entities or attributes in the polyglot model from which several target models are derived, all these changes can trickle down in the target models.
In a basic use case, you could have just one polyglot model, and several target models derived entirely from the common polyglot models. Each target model would just reflect the physical implementation for its respective technology. Entities can be deleted from the target model, others can be added, and the model can be enriched with entity options, indexes, etc. to generate the proper artifacts (script, DDL, ...) But this use case is a bit theoretical, and reality will quickly dictate more flexible approaches.
In a slightly more realistic use case, you could have target models that are each a different subset of a same polyglot model. Each target model may be supplemented by its own additional entities.
Elaborating further, you could derive a target model from multiple polyglot models, or subsets thereof:
More information about the philosophy behind our Polyglot Data Model concept can found on this page.
This other page details how to use polyglot data models.