Polyglot Data Modeling

Hackolade was originally created to perform schema design, i.e. physical data models only.  But as support grew for more and more targets, customers started asking to be able to save a model for one target as a model for another target.  They then also asked that Hackolade ensure the consistency of schemas across the different technologies in their data pipelines.


With RDBMS vendors, the differences between the physical implementations are fairly minor, and so are the differences between the SQL dialects.  Not so with all the targets supported by Hackolade for data-at-rest and data-in-motion.  The feature set, the storage model, the terminology, the data types, the indexing capabilities, the schema syntax, etc.: there is very little in common between a MongoDB and a Cassandra data model, or between a Parquet and a DynamoDB data model for example.


Nevertheless, we had to find a solution.  Since Hackolade was built on top of JSON Schema, it seemed natural to use it as a bridge to convert from one physical target to another.  But there are a couple of major limitations with JSON Schema:

  • the data types are very limited and don't provide the granularity found in many targets;
  • only a single entity can be represented in a schema file;
  • there is no way to represent Foreign Key relationships.
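To make the first limitation concrete, here is a minimal Python sketch (the type names and mapping are illustrative, not an exhaustive or official table) of how many distinct target data types collapse into JSON Schema's handful of core types:

```python
# JSON Schema defines only seven core types.
JSON_SCHEMA_TYPES = {"string", "number", "integer", "boolean", "object", "array", "null"}

# A Cassandra-like target distinguishes many more types, which all collapse
# into a few JSON Schema types (mapping illustrative, not exhaustive).
TARGET_TO_JSON_SCHEMA = {
    "text": "string", "ascii": "string", "uuid": "string", "timestamp": "string",
    "smallint": "integer", "int": "integer", "bigint": "integer",
    "float": "number", "double": "number", "decimal": "number",
}

def to_json_schema_type(target_type: str) -> str:
    # Converting back from JSON Schema cannot recover the original
    # granularity: "integer" alone no longer says bigint vs smallint.
    return TARGET_TO_JSON_SCHEMA[target_type]
```

Round-tripping through plain JSON Schema is therefore lossy, which is why a richer intermediate representation is needed.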


Other schema standards presented similar issues (Avro, XSD) or other equally challenging ones: SQL DDLs, for example, cannot represent nested objects or polymorphism (Snowflake, Hive, ...).


A common physical schema from which target schemas are derived

In traditional data modeling approaches for relational databases, with their conceptual, logical, and physical types of models, the purpose of a logical model has been defined as providing a database-agnostic view of normalized structures and business rules.


In our definition, a Polyglot Data Model sits across the traditional boundary between logical and physical.  It is a sort of logical model in the sense that it is technology-agnostic, but it is really a common physical schema with the following features:

  • allows denormalization, if desired, given access patterns;
  • allows complex data types;
  • generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies.
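To picture the third feature, here is a hedged Python sketch of what automatic type mapping could look like (the mappings shown are simplified illustrations, not Hackolade's actual internal tables):

```python
# Illustrative mapping of technology-agnostic polyglot types to the
# specific data types of a few derived targets (simplified examples only).
POLYGLOT_TYPE_MAP = {
    "timestamp": {
        "MongoDB": "date",
        "Cassandra": "timestamp",
        "Avro": "long (logicalType: timestamp-millis)",
    },
    "decimal": {
        "MongoDB": "decimal128",
        "Cassandra": "decimal",
        "Avro": "bytes (logicalType: decimal)",
    },
}

def derive_type(polyglot_type: str, target: str) -> str:
    """Resolve a polyglot type to its physical type for a given target."""
    return POLYGLOT_TYPE_MAP[polyglot_type][target]
```

A single attribute defined once in the polyglot model thus resolves to the native type of each derived physical model.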


In RDBMS, the different dialects of SQL lead to fairly similar DDLs, whereas the schema syntaxes for Avro, Parquet, OpenAPI, HiveQL, Neo4j Cypher, MongoDB, etc. are vastly different.

The generation of physical names for entities (called tables, collections, nodes, vertices, files, etc. depending on the target technology) and attributes (called columns, fields, etc.) should be capable of following different transformation rules, as Cassandra for example does not allow UPPERCASE, while MongoDB prefers camelCase.
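Such name transformation rules can be pictured with a small Python sketch (the function names are hypothetical, not Hackolade's API):

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase with underscores, e.g. for a Cassandra-style target."""
    s = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)  # split CamelCase words
    return re.sub(r"[\s-]+", "_", s).lower()           # spaces/dashes -> underscores

def to_camel_case(name: str) -> str:
    """camelCase, e.g. for a MongoDB-style target."""
    parts = re.split(r"[\s_-]+", name.strip())
    return parts[0].lower() + "".join(p.capitalize() for p in parts[1:])
```

The same logical name then yields a valid physical name per target: "Order Line Item" becomes order_line_item for one target and orderLineItem for another.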


Polyglot Data Modeling

Starting with version 5.2.0 of Hackolade Studio, users can now define structures once in a technology-agnostic polyglot data model, complete with denormalization and complex data types, then represent these structures in a variety of physical data models respecting the specific aspects of each target technology.


The polyglot functionality is there to help cope with the challenges of polyglot persistence.  Polyglot persistence can be applied to different use cases, for example a large operational application:

Polyglot persistence - e-commerce


or through a data pipeline:

Polyglot persistence - data pipeline 


For example, you want to keep in sync the structures of an operational database in MongoDB, a landing zone with Avro files on S3 object storage, a data lake on Databricks with bronze-silver-gold states, streaming with Kafka and a Confluent Schema Registry, and a data warehouse in Snowflake.  It is hard enough to build such a data pipeline; imagine what it takes to support the dynamic evolution of these schemas as new attributes are constantly added during the application lifecycle.


If you add/delete/modify entities or attributes in the polyglot model from which several target models are derived, all these changes can trickle down in the target models.



In a basic use case, you could have just one polyglot model, and several target models derived entirely from that common polyglot model.  Each target model would simply reflect the physical implementation for its respective technology.  Entities can be deleted from the target model, others can be added, and the model can be enriched with entity options, indexes, etc. to generate the proper artifacts (script, DDL, ...).  You may also deactivate attributes in the target model if they're not necessary, or you may modify some of their properties.


But overall, the above use case is a bit theoretical, and reality will quickly dictate more flexible approaches.  In a slightly more realistic use case, you could have target models that are each a different subset of the same polyglot model.  Each target model may be supplemented by its own additional entities.


Elaborating further, you could derive a target model from multiple polyglot models, or subsets thereof:

Polyglot multiple subsets




Conceptual modeling

It is generally said that there are 3 categories of data modeling: conceptual, logical, and physical.  In summary, conceptual modeling captures the common business language and defines the scope, whereas the logical model captures the business rules, and the physical model represents the specific implementation of the model.


With version 6.6.0 of Hackolade Studio, we facilitate the process of conceptual data modeling by providing a Graph Diagram view to capture the entities and their relationships.

Polyglot Conceptual Graph View


Logical modeling

Important: for users coming from traditional data modeling tools for RDBMS and the traditional process of conceptual, logical, physical, it may be tempting to use Hackolade's Polyglot models purely in the same way as before, and ignore the nuances and purpose of our Polyglot approach.  Please don't, as you might be disappointed by the result.  It would be the same as wanting to use MongoDB with a normalized data model.


Again, while our Polyglot model is technology-agnostic like logical models, it is really a common physical schema with the following features:

  • allows denormalization, if desired, given access patterns;
  • allows complex data types;
  • generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies.


Typically in logical models you will have normalized entities, and you may denormalize in the physical model for performance.  With a Polyglot model, it is advised to do things the other way around.  You should use query-driven access patterns to define Polyglot structures with complex data types, including denormalization and polymorphism.  When you derive an RDBMS target, you will want to let our process normalize the structure, while keeping the hierarchical structure for technology targets that support it.


Let's take this example of a Polyglot entity with a complex sub-object as well as a choice representing different possible sub-types for inheritance:

Polyglot - complex structure and super-type


If you derive this Polyglot model to JSON, Avro, or MongoDB/CosmosDB/MarkLogic/etc., the target model will be exactly as above in the Polyglot model, since these technologies handle complex data types and polymorphism well.  But if you derive this model to an RDBMS, we automatically normalize it into the structure below, where the complex attribute is moved into a separate table with a foreign key, and the choice is converted to inheritance tables:

Polyglot - normalized complex and sub-types
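The kind of normalization described above can be sketched in a few lines of Python (entity and column names are hypothetical; the real derivation handles much more, such as inheritance tables for choices):

```python
def normalize(entity: str, attributes: dict) -> dict:
    """Split nested sub-objects into child tables with a foreign key to the parent."""
    tables = {entity: {}}
    for attr, attr_type in attributes.items():
        if isinstance(attr_type, dict):
            # complex sub-object -> separate table + FK back to the parent
            tables[f"{entity}_{attr}"] = {f"{entity}_id": "foreign key", **attr_type}
        else:
            tables[entity][attr] = attr_type
    return tables
```

For example, a customer entity with a nested address sub-object yields a customer table plus a customer_address table carrying a customer_id foreign key.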




More information about the philosophy behind our Polyglot Data Model concept can be found on this page.


This other page details how to use polyglot data models.