Documentation

No results

Delta Lake on Databricks

Delta Lake is an open-source storage layer that brings reliability to data lakes. It was initially developed by Databricks in 2016 and open-sourced to the Linux Foundation in 2019. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is a storage layer on top of cloud object stores (Amazon S3, Azure Data Lake Storage, Google Cloud Storage and etc) or distributed filesystems like HDFS, which describes and defines the formats of data object and transaction logs, and a set of access protocols enabling DBMS-like features. It is fully compatible with Apache Spark APIs. Its schema enforcement automatically handles schema variations to prevent insertion of bad records during ingestion. The Databricks Lakehouse platform provides data engineering, SQL analytics, data science and Machine Learning on Azure Databricks, Databricks on AWS and google Cloud.

Apache Spark SQL in Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs. So every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata, and Hackolade has been supporting Hive for several years.

To perform data modeling for Delta Lake on Databricks with Hackolade, you must first download the DeltaLake plugin.

Hackolade was specially adapted to support the data modeling of Delta Lake, including the Databricks storage structure of clusters, databases, tables, views, and indexes. It leverages Hive primitive and complex data types, plus user-defined types. And combines it all with the usual capabilities of forward-engineering of HiveQL scripts, reverse-engineering, documentation generation, model comparison, command-line interface integration with CI/CD pipelines, etc...The application closely follows the Delta Lake terminology.

The data model in the picture below results from the modeling of an application described in this tutorial.

Delta Lake on Databricks workspace

Clusters

A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Databases

A Databricks database is a collection of tables.

Tables

A Databricks table is a collection of structured data. Tables be can queried with Spark APIs and Spark SQL. The table is logically made up of the data being stored in cloud object stores (Amazon S3, Azure Data Lake Storage, Google Cloud Storage and etc) or distributed filesystems like HDFS. The table metadata describes the layout of the data in the table and defines the formats of data object and transaction logs, and a set of access protocols enabling DBMS-like features. It is fully compatible with Apache Spark APIs. Its schema enforcement automatically handles schema variations to prevent insertion of bad records during ingestion.

Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, users have the option to use an existing external Hive metastore instance or the AWS Glue Catalog.

Delta Lake does not enforce multi-table transactions, primary or foreign keys, but they still can be declared . The only constraints supported by Delta Lake are NOT NULL and CHECK.

Databricks table properties

Data types

Delta Lake supports different data types to be used in table columns. The Hive data types supported by Delta Lake can be broadly classified in Primitive and Complex data types.

Hackolade was specially adapted to support the data types and attributes behavior of Delta Lake, including arrays, maps and structs.

In addition to the Hive data types, Databricks also supports the semi-structured data types: VARIANT, OBJECT and ARRAY to represent arbitrary data structures which can be used to import and operate on semi-structured data (JSON, Avro, ORC, Parquet, or XML.) Variant provides an order of magnitude performance improvements compared with storing these data as JSON strings, while maintaining the flexibility for supporting highly nested and evolving schema.

Hive data types

Views

A Databricks view is a searchable object in a database, which can be defined by a query. Data cannot be stored in a view, as it is a sort of virtual table. By using joins, it is possible to combine data from one or more tables. It may also hold a subset of information. More info can be found here.

Hackolade supports Delta Lake views, via a SELECT of columns of the underlying base tables. We also support the [column_list] syntax, but note that if columns are present in the SELECT statement, they take precedence over the column_list. f there is no column list, the view inherits the column names from the underlying query. All columns in a view must be uniquely named.

Databricks Unity Catalog

The Unity Catalog is a unified governance solution for data and AI assets on the Databricks Lakehouse. It provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. It was introduced with Runtime 11.3

All data in Unity Catalog is referenced using a three-level namespace: catalog.schema.table.

Unity Catalog object model diagram

Catalog: Catalogs are the highest level in the data hierarchy (catalog > schema > table/view/volume) managed by the Unity Catalog metastore. They are intended as the primary unit of data isolation in a typical Databricks data governance model. Catalogs represent a logical grouping of schemas, usually bounded by data access requirements. Catalogs often mirror organizational units or software development lifecycle scopes. You may choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

Schema (Database): Schemas, also known as databases, are logical groupings of tabular data (tables and views), non-tabular data (volumes), functions, and machine learning models. They give you a way to organize and control access to data that is more granular than catalogs. Typically they represent a single use case, project, or team sandbox.

Tables: Tables reside in the third layer of Unity Catalog’s three-level namespace. They contains rows of data. Unity Catalog lets you create managed tables and external tables.

Views: A view is a read-only object derived from one or more tables and views in a metastore.

Volumes: Volumes reside in the third layer of Unity Catalog’s three-level namespace. They manage non-tabular data. You can use volumes to store, organize, and access files in any format, including structured, semi-structured, and unstructured data. Files in volumes cannot be registered as tables.

For more information on the Unity Catalog, consult the documentation.

Tags in Unity Catalog

This feature is only available for Runtime 13 and up.

Tags are attributes containing keys and optional values that can be applied to different objects in Unity Catalog. Tagging is useful for organizing and categorizing objects within a metastore. Using tags also simplifies search and discovery of your data assets.

In Hackolade Studio, tags are groups of key-value pairs that appear at different levels: catalog, schema, table, and attribute.

They translate into the corresponding DDL statements during forward-engineering:

ALTER <LEVEL> <levelObject_name>
SET TAGS ('key1' = 'value1', 'key2' = 'value2')

where SET TAGS ( { tag_name = tag_value } [, …] )

where tag_name is a literal STRING. The tag_name must be unique within the object, and tag_value is a literal STRING.

Tags are also picked up during reverse-engineering.

There are some constraints:

- to add tags to Unity Catalog securable objects when applying the DDL script to intance, users must have the APPLY TAG privilege on the object, as well as the USE SCHEMA privilege on the object’s parent schema and the USE CATALOG privilege on the object’s parent catalog.

- you can assign a maximum of 20 tags to a single securable object

- the maximum length of a tag is 255 characters (use regex validation property)

- special characters '.', ',', '-', '=', '/', ':', ' ' (blank space) cannot be used in tag names (use regex validation property)

Note that the tag value is optional, i.e. there can be a tag without value.

Forward-Engineering

Hackolade dynamically generates the HiveQL script to create databases, tables, columns and their data types, as well as views for the structure created with the application.

The script can also be exported to the file system via the menu Tools > Forward-Engineering, or via the Command-Line Interface.

Databricks forward-engineering

By pressing the button "Apply to instance" the system will automatically creates the database, tables, views, columns, including constraints and indexes.

Note: As foreign keys and primary keys are not supported in Delta Lake, they can be used in Hackolade for informational purposes, but are not generated in the HQL script.

As many people store JSON within text columns, Hackolade allows for the schema design of those documents. That JSON structure is not forward-engineered, but is useful for developers, analysts and designers. It is often integrated into a CI/CD pipeline, using the Command-Line Interface.

Reverse-Engineering

The connection is established using a connection string including the Cloud provider and a personal access token.

In the premium version of Databricks, it is possible to control table access. If that's the case, the user must be granted the following data object privileges:

for reverse-engineering: READ_METADATA and SELECT on the tables and views
for forward-engineering (apply to instance): rights to CREATE/UPDATE schema, table, and view: CREATE on the CATALOG, and either OWN or both USAGE and CREATE on the schema, tables and views.

The Hackolade process for reverse-engineering of Delta Lake databases includes the execution of HQL SHOW statements to discover databases, tables, columns and their data types. If JSON is detected in text columns, Hackolade performs statistical sampling of records followed by probabilistic inference of the JSON document schema.

For more information on Delta Lake in general, please consult the website. Here is the links to the Databricks website.

On this page