Delta Lake on Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. It was initially developed by Databricks in 2016 and open-sourced to the Linux Foundation in 2019. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is a storage layer on top of cloud object stores (Amazon S3, Azure Data Lake Storage, Google Cloud Storage and etc) or distributed filesystems like HDFS, which describes and defines the formats of data object and transaction logs, and a set of access protocols enabling DBMS-like features. It is fully compatible with Apache Spark APIs. Its schema enforcement automatically handles schema variations to prevent insertion of bad records during ingestion. The Databricks Lakehouse platform provides data engineering, SQL analytics, data science and Machine Learning on Azure Databricks, Databricks on AWS and google Cloud.
Apache Spark SQL in Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs. So every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata, and Hackolade has been supporting Hive for several years.
To perform data modeling for Delta Lake on Databricks with Hackolade, you must first download the DeltaLake plugin.
Hackolade was specially adapted to support the data modeling of Delta Lake, including the Databricks storage structure of clusters, databases, tables, views, and indexes. It leverages Hive primitive and complex data types, plus user-defined types. And combines it all with the usual capabilities of forward-engineering of HiveQL scripts, reverse-engineering, documentation generation, model comparison, command-line interface integration with CI/CD pipelines, etc...The application closely follows the Delta Lake terminology.
The data model in the picture below results from the modeling of an application described in this tutorial.
A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.
A Databricks database is a collection of tables.
A Databricks table is a collection of structured data. Tables be can queried with Spark APIs and Spark SQL. The table is logically made up of the data being stored in cloud object stores (Amazon S3, Azure Data Lake Storage, Google Cloud Storage and etc) or distributed filesystems like HDFS. The table metadata describes the layout of the data in the table and defines the formats of data object and transaction logs, and a set of access protocols enabling DBMS-like features. It is fully compatible with Apache Spark APIs. Its schema enforcement automatically handles schema variations to prevent insertion of bad records during ingestion.
Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, users have the option to use an existing external Hive metastore instance or the AWS Glue Catalog.
Delta Lake does not support multi-table transactions, primary or foreign keys. The only constraints supported by Delta Lake are NOT NULL and CHECK.
Delta Lake supports different data types to be used in table columns. The Hive data types supported by Delta Lake can be broadly classified in Primitive and Complex data types.
Hackolade was specially adapted to support the data types and attributes behavior of Delta Lake, including arrays, maps and structs.
A Databricks view is a searchable object in a database, which can be defined by a query. Data cannot be stored in a view, as it is a sort of virtual table. By using joins, it is possible to combine data from one or more tables. It may also hold a subset of information. More info can be found here.
Hackolade supports Delta Lake views, via a SELECT of columns of the underlying base tables. We also support the [column_list] syntax, but note that if columns are present in the SELECT statement, they take precedence over the column_list. f there is no column list, the view inherits the column names from the underlying query. All columns in a view must be uniquely named.
Hackolade dynamically generates the HiveQL script to create databases, tables, columns and their data types, as well as views for the structure created with the application.
By pressing the button "Apply to instance" the system will automatically creates the database, tables, views, columns, including constraints and indexes.
Note: As foreign keys and primary keys are not supported in Delta Lake, they can be used in Hackolade for informational purposes, but are not generated in the HQL script.
As many people store JSON within text columns, Hackolade allows for the schema design of those documents. That JSON structure is not forward-engineered, but is useful for developers, analysts and designers. It is often integrated into a CI/CD pipeline, using the Command-Line Interface.
The connection is established using a connection string including the Cloud provider and a personal access token.
- for reverse-engineering: READ_METADATA and SELECT on the tables and views
- for forward-engineering (apply to instance): rights to CREATE/UPDATE schema, table, and view: CREATE on the CATALOG, and either OWN or both USAGE and CREATE on the schema, tables and views.
The Hackolade process for reverse-engineering of Delta Lake databases includes the execution of HQL SHOW statements to discover databases, tables, columns and their data types. If JSON is detected in text columns, Hackolade performs statistical sampling of records followed by probabilistic inference of the JSON document schema.
Databricks Unity Catalog
The Unity Catalog is a unified governance solution for data and AI assets on the Databricks Lakehouse. It provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. It was introduced with Runtime 11.3
All data in Unity Catalog is referenced using a three-level namespace: catalog.schema.table.
Catalog: Catalogs are the highest level in the data hierarchy (catalog > schema > table/view/volume) managed by the Unity Catalog metastore. They are intended as the primary unit of data isolation in a typical Databricks data governance model. Catalogs represent a logical grouping of schemas, usually bounded by data access requirements. Catalogs often mirror organizational units or software development lifecycle scopes. You may choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.
Schema (Database): Schemas, also known as databases, are logical groupings of tabular data (tables and views), non-tabular data (volumes), functions, and machine learning models. They give you a way to organize and control access to data that is more granular than catalogs. Typically they represent a single use case, project, or team sandbox.
Tables: Tables reside in the third layer of Unity Catalog’s three-level namespace. They contains rows of data. Unity Catalog lets you create managed tables and external tables.
Views: A view is a read-only object derived from one or more tables and views in a metastore.
Volumes: Volumes reside in the third layer of Unity Catalog’s three-level namespace. They manage non-tabular data. You can use volumes to store, organize, and access files in any format, including structured, semi-structured, and unstructured data. Files in volumes cannot be registered as tables.
For more information on the Unity Catalog, consult the documentation.