Polyglot Data Modeling
polyglot | \ ˈpä-lē-ˌglät \ : speaking or writing several languages : multilingual
Hackolade Studio was originally designed and positioned as a physical-only data modeling tool. The reasoning was that plenty of excellent, mature ER tools already provide conceptual and logical modeling capabilities, and that Hackolade's contribution lies in solving challenges with no previously existing solution rather than in developing "me too" functionality.
As coverage grew to include more and more databases and communication protocols, however, users started asking for the ability to define structures once and represent them in a variety of schema syntaxes. Hackolade already converts DDLs into all kinds of different schema syntaxes. It can also export data models for any target technology into JSON Schema, or generate a Swagger/OpenAPI specification for any Hackolade data model.
Why not go a step further, and allow the creation of a technology-agnostic data model?
Logical or Polyglot Data Model?
Many people may say: "Technology-agnostic model? Isn't that the definition of a logical data model?" Maybe, but we think that, with 21st-century technologies and Agile development, the strict definition of logical data modeling is a bit too constraining.
Today's big data not only allows, but promotes denormalization and the use of complex data types, which are not exactly compatible with the strict definition of logical modeling. Plus, physical schema designs are application-specific and query-driven, based on access patterns.
Let's consider a traditional description of the different levels of data modeling:
|Column data types||X|
While it may be fairly straightforward to go from Logical to Physical in the relational world, it is not the case with NoSQL or analytical big data.
A Polyglot Data Model for your Polyglot Data
There is a need for a data model which allows complex data types and denormalization, yet can be easily translated into vastly different syntaxes on the physical side. We call it a "Polyglot Data Model", a term inspired by the brilliant Polyglot Persistence approach promoted by Pramod Sadalage and Martin Fowler in their 2013 book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence.
In our experience with customers, we observe two different types of polyglot persistence. The first type is the one originally described by Martin Fowler: best-of-breed persistence technology applied to the different use cases within a single application.
A second type of polyglot persistence is even more pervasive: data pipelines from operational data stores, through object storage and multi-stage data lakes, streamed or served via APIs to self-service analytics data warehouses, ML and AI.
In either case, customers expect a modern data modeling tool to help design and manage schemas across the entire data landscape.
A Polyglot Data Model sits over the previous boundary between logical and physical. It is called by some a logical model in the sense that it is technology-agnostic, but it is really a common physical schema with the following features:
- allows denormalization, if desired, given access patterns;
- allows complex data types;
- generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies.
In RDBMS, the different dialects of SQL lead to fairly similar DDLs, whereas the schema syntaxes for Avro, Parquet, OpenAPI, HiveQL, Neo4j Cypher, MongoDB, etc. are vastly different.
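To make the idea of automatic type mapping concrete, here is a minimal sketch in Python. It is not how Hackolade implements schema generation: the `TYPE_MAP` table, the `CUSTOMER` entity, and the `to_sql_ddl` and `to_avro_schema` helpers are all hypothetical names invented for illustration. The sketch only covers a few scalar types and shows one technology-agnostic definition rendered as a generic SQL DDL and as an Avro record schema.

```python
import json

# Hypothetical mapping from technology-agnostic types to target-specific types.
# Only a handful of scalar types are covered; complex types are left out.
TYPE_MAP = {
    "sql": {"string": "text", "integer": "integer", "boolean": "boolean"},
    "avro": {"string": "string", "integer": "int", "boolean": "boolean"},
}

# A technology-agnostic entity definition (illustrative only).
CUSTOMER = {
    "name": "customer",
    "attributes": [
        {"name": "id", "type": "integer"},
        {"name": "email", "type": "string"},
        {"name": "active", "type": "boolean"},
    ],
}


def to_sql_ddl(entity):
    """Render the entity as a generic CREATE TABLE statement."""
    columns = ",\n".join(
        f"  {attr['name']} {TYPE_MAP['sql'][attr['type']]}"
        for attr in entity["attributes"]
    )
    return f"CREATE TABLE {entity['name']} (\n{columns}\n);"


def to_avro_schema(entity):
    """Render the same entity as an Avro record schema (a Python dict)."""
    return {
        "type": "record",
        "name": entity["name"],
        "fields": [
            {"name": attr["name"], "type": TYPE_MAP["avro"][attr["type"]]}
            for attr in entity["attributes"]
        ],
    }


if __name__ == "__main__":
    print(to_sql_ddl(CUSTOMER))
    print(json.dumps(to_avro_schema(CUSTOMER), indent=2))
```

The single `CUSTOMER` definition drives both outputs, so a change to an attribute is made once and propagates to every target syntax.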
The generation of physical names for entities (which may be called tables, collections, nodes, vertices, files, etc. in the target technology) and attributes (which may be called columns, fields, etc. in the target technology) should be capable of following different transformation rules. Cassandra, for example, folds unquoted identifiers to lowercase, so UPPERCASE names are not preserved, while MongoDB users tend to prefer camelCase.
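The naming rules described above can be sketched as a per-target lookup of transformation functions. This is an illustrative assumption rather than Hackolade's actual rule engine: `NAMING_RULES`, `physical_name`, and the choice of snake_case for Cassandra are hypothetical and only show the shape of the idea.

```python
import re


def split_words(name):
    """Split a business name on spaces, underscores, hyphens and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def to_snake_case(name):
    """All lowercase, words joined by underscores."""
    return "_".join(split_words(name))


def to_camel_case(name):
    """First word lowercase, following words capitalized."""
    words = split_words(name)
    if not words:
        return ""
    return words[0] + "".join(word.capitalize() for word in words[1:])


# Hypothetical per-target naming conventions.
NAMING_RULES = {
    "cassandra": to_snake_case,  # assumption: lowercase names with underscores
    "mongodb": to_camel_case,    # assumption: camelCase field names
}


def physical_name(business_name, target):
    """Derive the physical name of an entity or attribute for a target technology."""
    return NAMING_RULES[target](business_name)


if __name__ == "__main__":
    for target in NAMING_RULES:
        print(target, physical_name("Customer Order ID", target))
```

Keeping the rules in a lookup table means a new target technology only needs a new entry, while the business names in the model stay untouched.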
Data Model vs Schema Design
A data model is an abstraction describing and documenting the information system of an enterprise. Data models provide value in understanding, communication, collaboration, governance, and more. They help document the context and meaning of the data.
But the value of data models at a technical level is in the artifacts they help create: schemas. A schema is a “consumable” collection of objects describing the layout or structure of a file, a transaction, or a database. A schema is a scope contract between producers and consumers of data, and an authoritative source of structure and meaning of the context.
The authors of the Agile Manifesto wanted to restore a balance, and said: "We embrace modeling, but not in order to file some diagram in a dusty corporate repository." At Hackolade, we think that (data) modeling is indispensable, so schemas can be generated, consumed, and managed.
The Polyglot Data Model concept was built so you could create a library of canonical objects for your domains, and use them consistently across physical data models for different target technologies.