Polyglot Data Modeling
Hackolade was originally created to perform schema design, i.e. physical data models only. But as support grew for more and more targets, customers started asking to convert a model for one target into a model for another target. They then also asked that Hackolade be able to ensure the consistency of schemas across the different technologies in their data pipelines.
With RDBMS vendors, the differences between physical implementations are fairly minor, and so are the differences between SQL dialects. Not so across all the targets supported by Hackolade for data-at-rest and data-in-motion. The feature set, the storage model, the terminology, the data types, the indexing capabilities, the schema syntax, etc.: there is very little in common between a MongoDB and a Cassandra data model, or between a Parquet and a DynamoDB data model, for example.
Nevertheless, we had to find a solution. Since Hackolade was built on top of JSON Schema, it seemed natural to use JSON Schema as a bridge to convert from one physical target to another. But there are a couple of major limitations with JSON Schema:
- the data types are very limited and don't provide the granularity found in many targets;
- only a single entity can be represented in a schema file;
- there is no way to represent Foreign Key relationships.
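For instance, here is a minimal JSON Schema for a hypothetical Customer entity that illustrates these limits: the file can describe only this one entity, "string" and "number" offer none of the granularity of types like VARCHAR(50) or DECIMAL(10,2), and the link to another entity can only be hinted at through a plain attribute, not declared as an enforceable Foreign Key:

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Customer",
  "type": "object",
  "properties": {
    "customerId": { "type": "string" },
    "balance": { "type": "number" },
    "createdAt": { "type": "string", "format": "date-time" },
    "lastOrderId": { "type": "string" }
  },
  "required": ["customerId"]
}
```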
Other schema standards presented similar issues (Avro, XSD) or other equally challenging ones: SQL DDLs (Snowflake, Hive, ...), for example, don't include nested objects, nor are they able to represent polymorphism.
A common physical schema from which target schemas are derived
In traditional data modeling approaches for relational databases, with their conceptual, logical, and physical types of models, the purpose of a logical model has been defined as providing a database-agnostic view of normalized structures and business rules.
In our definition, a Polyglot Data Model sits astride the traditional boundary between logical and physical. It is a sort of logical model in the sense that it is technology-agnostic, but it is really a common physical model with the following features:
- allows denormalization, if desired, given access patterns;
- allows complex data types;
- generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies.
In RDBMS, the different dialects of SQL lead to fairly similar DDLs, whereas the schema syntaxes for Avro, Parquet, OpenAPI, HiveQL, Neo4j Cypher, MongoDB, etc. are vastly different.
The generation of physical names for entities (which may be called tables, collections, nodes, vertices, files, etc. in the target technology) and attributes (which may be called columns, fields, etc.) should be capable of following different transformation rules: Cassandra, for example, does not allow UPPERCASE, while MongoDB will prefer camelCase, and so on.
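As a hypothetical illustration (the names and target list below are invented for the example), the same polyglot entity could surface under a different physical name in each derived target:

```json
{
  "polyglotEntity": "customerOrder",
  "derivedNames": {
    "mongodb": "customerOrder",
    "cassandra": "customer_order",
    "snowflake": "CUSTOMER_ORDER"
  }
}
```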
Polyglot Data Modeling
Starting with version 5.2.0 of Hackolade Studio, users can define structures once in a technology-agnostic polyglot data model, complete with denormalization and complex data types, then represent these structures in a variety of physical data models, respecting the specific aspects of each target technology.
The polyglot functionality is there to help cope with the challenges of polyglot persistence. Polyglot persistence can be applied to different use cases, for example a large operational application:
or through a data pipeline:
For example, you may want to keep in sync the structures of an operational database in MongoDB, a landing zone with Avro files on S3 object storage, a data lake on Databricks with bronze-silver-gold layers, streaming with Kafka and a Confluent Schema Registry, and a data warehouse in Snowflake. It is hard enough to build this data pipeline; imagine what it takes to support the dynamic evolution of these schemas as new attributes are constantly added during the application lifecycle.
If you add, delete, or modify entities or attributes in the polyglot model from which several target models are derived, all these changes can trickle down into the target models.
Flexibility
In a basic use case, you could have just one polyglot model, and several target models derived entirely from that common polyglot model. Each target model would simply reflect the physical implementation for its respective technology. Entities can be deleted from the target model, others can be added, and the model can be enriched with entity options, indexes, etc. to generate the proper artifacts (scripts, DDLs, ...). You may also deactivate attributes in the target model if they're not necessary, or modify some of their properties.
But the above use case is a bit theoretical, and reality will quickly dictate more flexible approaches. In a slightly more realistic use case, you could have target models that are each a different subset of the same polyglot model. Each target model may be supplemented by its own additional entities.
Elaborating further, you could derive a target model from multiple polyglot models, or subsets thereof:
Conceptual modeling
It is generally said that there are three categories of data modeling: conceptual, logical, and physical. In summary, conceptual modeling captures the common business language and defines the scope, whereas the logical model captures the business rules in a technology-agnostic manner, and the physical model represents the specific implementation of the model.
With version 6.6.0 of Hackolade Studio, we facilitate the process of conceptual data modeling by providing a Graph Diagram view to capture the entities and their relationships.
Logical modeling
Important: for users coming from traditional data modeling tools for RDBMS and the traditional conceptual-logical-physical process, it may be tempting to use Hackolade's Polyglot models purely in the same way as before, ignoring the nuances and purpose of our Polyglot approach. Please don't, as you might be disappointed by the result. It would provide results similar to using MongoDB with a normalized data model, i.e. not the optimal use of a NoSQL document database...
Note: for decades, it has been said that logical models were "technology-agnostic". That works well enough using the famous rules of database normalization, as long as your target technologies are relational only. But as soon as we have to deal with NoSQL, with APIs, or with modern storage formats, it turns out that logical models which respect third normal form, with junction tables for many-to-many relationships, aren't "technology-agnostic" after all. In a NoSQL document database, you would embed the information and not use any junction table. Similarly, supertypes/subtypes can be handled in NoSQL via polymorphism instead of inheritance tables, etc.
While our Polyglot model is truly technology-agnostic, it is also a common physical model with the same features listed above: denormalization given access patterns, complex data types, and schema generation with automatic mapping to the data types of each target technology.
Typically, in logical models you would have normalized entities, and you might denormalize in the physical model for performance. With a Polyglot model, it is advised to do things the other way around: use query-driven access patterns to define Polyglot structures with complex data types, including denormalization and polymorphism. When you derive to an RDBMS target, you will want to let our process normalize the structure on-the-fly, while you will want to keep the hierarchical structure for target technologies that allow it.
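As a hypothetical sketch (entity and attribute names are invented for the example), an order entity in the Polyglot model could embed its line items as a complex array; a derivation to MongoDB would keep the nested array as-is, while a derivation to an RDBMS would normalize it on-the-fly into separate order and order line tables linked by a Foreign Key:

```json
{
  "title": "order",
  "type": "object",
  "properties": {
    "orderId": { "type": "string" },
    "orderDate": { "type": "string", "format": "date-time" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "productId": { "type": "string" },
          "quantity": { "type": "integer" },
          "unitPrice": { "type": "number" }
        }
      }
    }
  }
}
```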
Supertypes and subtypes
Hackolade Studio will let you create supertypes and subtypes in a Polyglot model, following either the traditional method:
or a denormalized method, with a complex sub-object as well as a oneOf choice representing the different possible subtypes for inheritance.
Either way, the application will understand, depending on the target technology and its capabilities, whether to normalize with inheritance tables, or to denormalize for NoSQL and leverage polymorphism.
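Expressed in plain JSON Schema terms (a simplified, hypothetical sketch; the notation in Hackolade Studio is richer), the denormalized method amounts to a oneOf choice under the supertype, here a party that is either a person or an organization:

```json
{
  "title": "party",
  "type": "object",
  "properties": {
    "partyId": { "type": "string" },
    "partyDetails": {
      "oneOf": [
        {
          "type": "object",
          "properties": {
            "firstName": { "type": "string" },
            "lastName": { "type": "string" }
          }
        },
        {
          "type": "object",
          "properties": {
            "companyName": { "type": "string" },
            "registrationNumber": { "type": "string" }
          }
        }
      ]
    }
  }
}
```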
Many-to-many relationships
Since logical models have to be normalized, they are only "technology-agnostic" if their use is limited to relational databases, as explained above. With NoSQL databases allowing denormalization with embedded objects and arrays, no junction tables are necessary to represent many-to-many relationships.
Therefore, our Polyglot data models have a more intuitive way of documenting a many-to-many business rule:
And our automatic normalization function, when deriving the Polyglot model to a relational database, will generate the expected junction table:
Conversely, the derivation to a NoSQL target will leave the entities as declared in the Polyglot model and allow for further denormalization according to the needs of the access patterns. Ideally, the denormalization takes place directly in the Polyglot model.
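As a hypothetical illustration of such denormalization, a many-to-many relationship between students and courses could be captured in the Polyglot model as an array of course sub-objects embedded in student; the derivation to a relational target would produce the expected junction table, while a document database target would keep the embedded array:

```json
{
  "title": "student",
  "type": "object",
  "properties": {
    "studentId": { "type": "string" },
    "name": { "type": "string" },
    "courses": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "courseId": { "type": "string" },
          "courseName": { "type": "string" }
        }
      }
    }
  }
}
```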
More information about the philosophy behind our Polyglot Data Model concept can be found on this page.
This other page details how to use polyglot data models.