Apache Cassandra is designed to handle large amounts of data across many commodity servers. It provides high availability through robust support for clusters spanning multiple datacenters, with asynchronous masterless replication and low-latency operations. It is essentially a hybrid between a key-value and a column-oriented database. Rows are organized into tables. The first component of a primary key is the partition key, and within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.
DataStax is a software vendor whose employees are key contributors to the Apache Cassandra project.
To perform data modeling for Cassandra with Hackolade, you must first download the Cassandra plugin.
Hackolade was specially adapted to support the data modeling of Cassandra, including User-Defined Types and the concepts of Partitioning and Clustering keys. It lets users define, document, and display Chebotko physical diagrams. The application closely follows the Cassandra terminology, data types, and Chebotko notation.
The data model in the picture below results from the data modeling of an application described in Chapter 5 of the book "Cassandra: the Definitive Guide" from O'Reilly.
A keyspace is a Cassandra namespace that defines how data is replicated across nodes. Typically, a cluster has one keyspace per application. A keyspace is a logical grouping of tables, analogous to a database in relational database systems.
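As a minimal sketch of how a keyspace ties a namespace to a replication policy, the CQL statement below creates a keyspace replicated across two datacenters. The keyspace name `hotel` and the datacenter names `dc1` and `dc2` are assumptions chosen for illustration; in a real cluster the datacenter names must match those reported by the cluster itself.

```cql
-- Hypothetical keyspace; 'dc1' and 'dc2' are placeholder datacenter names.
CREATE KEYSPACE IF NOT EXISTS hotel
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- three replicas of each partition in dc1
    'dc2': 2    -- two replicas of each partition in dc2
  }
  AND durable_writes = true;
```

Every table created inside this keyspace inherits its replication settings.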
Tables in Cassandra contain rows of columns, and a primary key identifies the location and order of stored data. Tables can also be used to store JSON. Tables are declared up front at schema definition time.
Primary, Partition, and Clustering Keys
In Cassandra, primary keys can be simple or compound, with one or more partition key columns and, optionally, one or more clustering key columns. The partition key determines which node stores the data, and is therefore responsible for distributing data across the nodes. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition.
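The roles of the key columns can be sketched in CQL as follows. This example assumes a keyspace named `hotel` already exists; the table and column names are illustrative and are not taken from the Hackolade model.

```cql
-- hotel_id is the partition key: it decides which nodes store the rows.
-- stay_date and room_number are clustering columns: they sort rows
-- inside each hotel_id partition.
CREATE TABLE IF NOT EXISTS hotel.available_rooms_by_hotel_date (
  hotel_id     text,
  stay_date    date,
  room_number  smallint,
  is_available boolean,
  PRIMARY KEY ((hotel_id), stay_date, room_number)
) WITH CLUSTERING ORDER BY (stay_date ASC, room_number ASC);

-- A query that reads a single partition and uses the clustering order
-- to select a contiguous range of rows.
SELECT stay_date, room_number, is_available
  FROM hotel.available_rooms_by_hotel_date
 WHERE hotel_id = 'AZ123'
   AND stay_date >= '2024-01-05'
   AND stay_date <= '2024-01-12';
```

Because the query restricts the partition key with an equality and the first clustering column with a range, it can be served from one partition without scanning the cluster.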
Attribute data types
Cassandra supports a variety of scalar and complex data types, including lists, maps, and sets.
Hackolade was specially adapted to support the data types and attributes behavior of Cassandra.
Some scalar types can be configured for different modes.
Hackolade also supports Cassandra User-Defined Types via its re-usable object definitions.
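For illustration, the CQL below shows a User-Defined Type together with the three collection types mentioned above. It assumes a keyspace named `hotel` already exists; the type, table, and column names are hypothetical.

```cql
-- A hypothetical User-Defined Type grouping related scalar fields.
CREATE TYPE IF NOT EXISTS hotel.address (
  street      text,
  city        text,
  postal_code text,
  country     text
);

CREATE TABLE IF NOT EXISTS hotel.hotels (
  id         text PRIMARY KEY,
  name       text,
  address    frozen<address>,     -- UDT stored as a single frozen value
  pois       set<text>,           -- unordered unique values
  amenities  list<text>,          -- ordered values, duplicates allowed
  room_rates map<text, decimal>   -- key/value pairs
);
```

A frozen UDT is written and read as a whole value, so changing one field means rewriting the entire address.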
An index provides a means to access data in DataStax Enterprise using attributes other than the partition key. The benefit of an index is fast, efficient lookup of data that matches a given condition. Built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead on average is required to query and maintain the index. It is useful to consult this section on when not to use an index.
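A minimal sketch of a built-in secondary index in CQL follows, assuming a keyspace named `hotel` already exists. The table, column, and index names are hypothetical. The indexed column `city` is chosen because many rows are expected to share each value, which matches the guidance above.

```cql
CREATE TABLE IF NOT EXISTS hotel.hotel_directory (
  id   text PRIMARY KEY,
  name text,
  city text
);

-- Secondary index on a non-key column.
CREATE INDEX IF NOT EXISTS hotel_directory_city_idx
  ON hotel.hotel_directory (city);

-- With the index in place, the table can be queried by city
-- without specifying the partition key.
SELECT id, name FROM hotel.hotel_directory WHERE city = 'Phoenix';
```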
With version 3.0, Cassandra introduced materialized views to handle automated server-side denormalization. In theory, this removes the need for client-side handling and ensures consistency between base and view data. Materialized views work particularly well with immutable insert-only data, but should not be used with low-cardinality data. Materialized views are designed to alleviate the pain for developers, but are essentially a trade-off of performance for connectedness. See more info in this article.
Hackolade supports Cassandra materialized views, via a SELECT of columns of the underlying base table, to present the data of the base table with a different primary key for different access patterns.
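As a sketch of what such a view looks like in CQL, the example below re-keys a base table by a different column. It assumes a keyspace named `hotel` already exists; the table, view, and column names are hypothetical and are not the script Hackolade generates.

```cql
-- Base table: reservations looked up by hotel and start date.
CREATE TABLE IF NOT EXISTS hotel.reservations_by_hotel_date (
  hotel_id       text,
  start_date     date,
  room_number    smallint,
  confirm_number text,
  guest_id       uuid,
  PRIMARY KEY ((hotel_id, start_date), room_number)
);

-- View: the same data, partitioned by confirmation number.
-- The view's primary key must contain every primary key column of the
-- base table, and each of those columns must be restricted by IS NOT NULL.
CREATE MATERIALIZED VIEW IF NOT EXISTS hotel.reservations_by_confirmation AS
  SELECT confirm_number, hotel_id, start_date, room_number, guest_id
    FROM hotel.reservations_by_hotel_date
   WHERE confirm_number IS NOT NULL
     AND hotel_id IS NOT NULL
     AND start_date IS NOT NULL
     AND room_number IS NOT NULL
  PRIMARY KEY (confirm_number, hotel_id, start_date, room_number);
```

Writes go only to the base table; Cassandra maintains the view on the server side.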
Hackolade dynamically generates the CQL script to create keyspaces, tables, columns and their types, and indexes for the structure created with the application.
The script can also be exported to the file system via the menu Tools > Forward-Engineering, or via the Command-Line Interface.
As many people store JSON within text or blob columns, Hackolade allows for the schema design of those documents. That JSON structure is not forward-engineered in the CQL script, but is useful for developers, analysts and designers.
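To illustrate the pattern, the CQL below stores a JSON document in a text column. It assumes a keyspace named `hotel` already exists; the table name, column names, and the document's fields are all hypothetical. From Cassandra's point of view the column is an opaque string, which is why the document structure has to be described separately.

```cql
CREATE TABLE IF NOT EXISTS hotel.guest_profiles (
  guest_id uuid PRIMARY KEY,
  profile  text   -- JSON document stored as a plain string
);

INSERT INTO hotel.guest_profiles (guest_id, profile)
VALUES (
  uuid(),
  '{"name": "A. Guest", "preferences": {"floor": "high", "smoking": false}}'
);
```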
The connection is established using a connection string that includes the (IP) address and port (typically 9042), with username/password authentication if applicable. Details on how to connect Hackolade to a Cassandra instance can be found on this page.
The Hackolade process for reverse-engineering of Cassandra databases includes the execution of cqlsh DESCRIBE statements to discover keyspaces, tables, columns and their types, and indexes. If JSON is detected in string columns, Hackolade performs statistical sampling of records followed by probabilistic inference of the JSON document schema.
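For reference, the statements below are examples of cqlsh DESCRIBE commands that expose this kind of schema information. They are a sketch of the general approach, not the exact sequence Hackolade runs; the keyspace `hotel` and table `hotel.hotels` are hypothetical names.

```cql
DESCRIBE KEYSPACES;            -- list all keyspaces in the cluster
DESCRIBE KEYSPACE hotel;       -- CQL needed to recreate the keyspace and its objects
DESCRIBE TABLES;               -- list tables, grouped by keyspace
DESCRIBE TABLE hotel.hotels;   -- columns, types, keys, and indexes of one table
DESCRIBE TYPES;                -- list User-Defined Types
```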