Apache Cassandra is designed to handle large amounts of data across many commodity servers.  It provides high availability and low-latency operations through asynchronous masterless replication and robust support for clusters spanning multiple datacenters.  It is essentially a hybrid between a key-value and a column-oriented database.  Rows are organized into tables.  The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key.  Other columns may be indexed separately from the primary key.


DataStax is a software vendor whose employees are key contributors to the Apache Cassandra project.  


To perform data modeling for Cassandra with Hackolade, you must first download the Cassandra plugin.  


Hackolade was specially adapted to support the data modeling of Cassandra, including User-Defined Types and the concepts of Partitioning and Clustering keys.  It lets users define, document, and display Chebotko physical diagrams.  The application closely follows Cassandra terminology, data types, and Chebotko notation.


The data model in the picture below results from the modeling of an application described in Chapter 5 of the book "Cassandra: The Definitive Guide" from O'Reilly.


Keyspace

A keyspace is a Cassandra namespace that defines how data is replicated across nodes.  A cluster typically contains one keyspace per application.  A keyspace is a logical grouping of tables, analogous to a database in relational database systems.
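
As an illustration of how a keyspace carries its replication settings, the following minimal sketch builds a CREATE KEYSPACE statement as a string.  The function name, the keyspace name "hotel", and the choice of SimpleStrategy with a replication factor of 3 are assumptions made for the example, not values taken from this page.

```python
def create_keyspace_cql(name, replication_factor=3):
    """Build a CREATE KEYSPACE statement using SimpleStrategy (single datacenter)."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


# "hotel" is a hypothetical keyspace name used only for illustration.
print(create_keyspace_cql("hotel"))
```

For clusters spanning several datacenters, NetworkTopologyStrategy with a per-datacenter replication factor would normally replace SimpleStrategy.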


Table

Tables in Cassandra contain rows of columns, and a primary key identifies the location and order of stored data.  Tables can also be used to store JSON.  Tables are declared up front at schema definition time.




Primary, Partition, and Clustering Keys

In Cassandra, primary keys can be simple or compound, with one or more partition key columns and, optionally, one or more clustering columns.  The partition key determines which node stores the data, and is therefore responsible for data distribution across the nodes.  The additional columns determine per-partition clustering.  Clustering is a storage engine process that sorts data within the partition.
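
The two roles of the primary key can be sketched in a few lines of Python.  This is a simplified model under stated assumptions, not Cassandra's actual algorithm: the node list, the column names, and the use of an MD5 hash with a modulo are all hypothetical stand-ins (Cassandra assigns token ranges to nodes and uses a Murmur3-based partitioner by default).

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Pick a node from a hash of the partition key (stand-in for the real partitioner)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def store(rows, partition_col, clustering_col):
    """Group rows by partition key, then sort each partition by the clustering column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_col]].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: r[clustering_col])
    return dict(partitions)


# Hypothetical rows: hotel_id is the partition key, room_number the clustering column.
rows = [
    {"hotel_id": "AZ123", "room_number": 12},
    {"hotel_id": "NY229", "room_number": 3},
    {"hotel_id": "AZ123", "room_number": 4},
]
for key, partition_rows in store(rows, "hotel_id", "room_number").items():
    print(key, node_for(key), [r["room_number"] for r in partition_rows])
```

All rows sharing a partition key land on the same node, and within that partition they are kept in clustering order, which is why range queries on the clustering column are efficient.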


Attribute data types

Cassandra supports a variety of scalar and complex data types, including lists, maps, and sets.


Hackolade was specially adapted to support the data types and attribute behavior of Cassandra.

Some scalar types can be configured for different modes.


Hackolade also supports Cassandra User-Defined Types via its re-usable object definitions.
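
To make the idea of a User-Defined Type concrete, here is a minimal sketch that renders a CQL CREATE TYPE statement from a field mapping.  The function name, the "hotel" keyspace, and the "address" type with its fields are hypothetical; this is not how Hackolade stores or generates its definitions.

```python
def create_type_cql(keyspace, type_name, fields):
    """Render a CREATE TYPE statement from a mapping of field name to CQL type."""
    body = ",\n".join(f"    {name} {cql_type}" for name, cql_type in fields.items())
    return f"CREATE TYPE {keyspace}.{type_name} (\n{body}\n);"


# Hypothetical UDT used only for illustration.
address = create_type_cql(
    "hotel", "address", {"street": "text", "city": "text", "postal_code": "text"}
)
print(address)
```

A table column could then reference the type, for example as frozen<address>.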


Forward-Engineering

Hackolade dynamically generates the CQL script to create keyspaces, tables, columns and their types, and indexes for the structure created with the application.
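
The general shape of such a script can be sketched as follows.  This is an illustrative approximation and not Hackolade's generator: the function name, its parameters, and the sample table are assumptions.

```python
def create_table_cql(keyspace, table, columns, partition_key, clustering_key=()):
    """Render a CREATE TABLE statement from a small in-memory table description."""
    column_lines = [f"    {name} {cql_type}," for name, cql_type in columns.items()]
    partition = ", ".join(partition_key)
    if len(partition_key) > 1:
        partition = f"({partition})"  # a composite partition key needs its own parentheses
    key_parts = [partition, *clustering_key]
    primary_key = f"    PRIMARY KEY ({', '.join(key_parts)})"
    return "\n".join(
        [f"CREATE TABLE {keyspace}.{table} (", *column_lines, primary_key, ");"]
    )


# Hypothetical table: one partition key column and one clustering column.
columns = {"hotel_id": "text", "room_number": "smallint", "amenities": "set<text>"}
cql = create_table_cql("hotel", "rooms_by_hotel", columns, ["hotel_id"], ["room_number"])
print(cql)
```

Passing two or more partition key columns produces the double-parenthesis form, such as PRIMARY KEY ((a, b), c).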


As many people store JSON within text or blob columns, Hackolade allows for the schema design of those documents.  That JSON structure is not forward-engineered, but is useful for developers, analysts and designers.


Reverse-Engineering

The connection is established using a connection string that includes the host (IP) address and port (typically 9042), with username/password authentication if applicable.


The Hackolade process for reverse-engineering of Cassandra databases includes the execution of cqlsh DESCRIBE statements to discover keyspaces, tables, columns and their types, and indexes.  If JSON is detected in string columns, Hackolade performs statistical sampling of records followed by probabilistic inference of the JSON document schema.
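
The sampling-and-inference idea can be illustrated with a short sketch.  It is an assumption-laden simplification, not Hackolade's actual algorithm: the function name, the sample size, and the reporting of types by Python type name are all choices made for this example.

```python
import json
import random
from collections import Counter, defaultdict


def infer_json_schema(values, sample_size=100, seed=0):
    """Sample string values, parse those holding JSON objects, and tally field types."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    type_counts = defaultdict(Counter)
    parsed = 0
    for text in sample:
        try:
            doc = json.loads(text)
        except (TypeError, ValueError):
            continue  # not JSON: skip this value
        if not isinstance(doc, dict):
            continue  # only JSON objects contribute fields
        parsed += 1
        for field, value in doc.items():
            type_counts[field][type(value).__name__] += 1
    # frequency = share of parsed documents in which the field appears
    return {
        field: {"types": sorted(counts), "frequency": sum(counts.values()) / parsed}
        for field, counts in type_counts.items()
    }


# Hypothetical contents of a text column.
values = ['{"name": "Ann", "age": 31}', '{"name": "Bo"}', "plain text"]
schema = infer_json_schema(values)
print(schema)
```

Fields with a frequency below 1.0 would be candidates for optional attributes in the inferred document schema.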


For more information on Cassandra in general, please consult the Apache Cassandra website, the DataStax website, and the book "Cassandra: The Definitive Guide".