Reverse-engineering

Reverse engineering is a useful feature that facilitates the modeling and documentation of existing instances.  After reverse engineering has been performed, the user can enrich the resulting model by filling in descriptions, recording constraints that were not obvious from the stored data, documenting relationships, etc.

Hackolade supports the reverse engineering of different types of data sources:

  • a JSON document: conversion of a basic JSON document into a Hackolade collection schema
  • a JSON Schema: conversion of a JSON Schema file into a Hackolade collection schema
  • a Data Definition Language (DDL) file from Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Hadoop Hive, Snowflake, Teradata
  • an XSD schema file from another ER tool, such as erwin, ER/Studio, PowerDesigner, or others
  • an Excel file template: first export a Hackolade model (even an empty one) to generate an Excel file for the target of your choice.  Then you may bulk edit your model or create a new one before ingesting it back into the application.
  • target-specific
    • an Apache Avro data file or schema: Hackolade reads .avro and .avsc files from the file system or cloud storage to convert the Avro schema into a Hackolade model.
    • a Cassandra or DataStax (DSE) instance: Hackolade accesses a Cassandra instance through the REST API, reads the table and column setup, then samples columns with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
    • a Cosmos DB instance: Hackolade can reverse-engineer a local instance of Cosmos DB or hosted at Azure, using the SQL API (previously known as DocumentDB API), the MongoDB API, or the Gremlin API.  The process performs a sampling of the selected collections and document types, and probabilistically infers the Hackolade schema from the sampled data.
    • a Couchbase instance: Hackolade accesses a Couchbase instance through a connection, samples the selected bucket(s), and probabilistically infers the Hackolade schema from the sampled data.
    • a DynamoDB instance: Hackolade accesses DynamoDB either local or hosted at AWS.  It performs a sampling of the selected tables, and probabilistically infers the Hackolade schema from the sampled data.
    • a MongoDB database: Hackolade accesses a MongoDB instance through a connection, samples the selected collection(s), and derives a Hackolade schema from the sampled data.
    • a Mongoose schema: conversion of a JavaScript file containing a Mongoose schema into a Hackolade collection schema
    • a Schema Registry of AWS EventBridge, in OpenAPI 3.0.x format
    • an Elasticsearch instance: Hackolade accesses an Elasticsearch instance through the REST API, samples the selected indices and types, and probabilistically infers the Hackolade schema from the sampled data plus the retrieved mappings.
    • an AWS Glue Data Catalog instance: Hackolade accesses AWS via the CLI to read the catalog database, table, and column definitions.
    • an HBase instance: Hackolade accesses an HBase instance through the REST API, reads the table and column family setup, then samples column qualifiers with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
    • an Apache Hive instance: Hackolade accesses Hive via the Thrift service, reads the table and column setup, then samples columns with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
    • a MarkLogic instance: Hackolade accesses a MarkLogic instance and samples collections (or directories) to infer the schema of sampled documents.
    • a Neo4j graph instance: Hackolade accesses a Neo4j instance through the REST API, and reads the Cypher setup for node labels and relationship types.
    • a Parquet data file: Hackolade reads .parquet files from the file system or cloud storage to convert the Parquet schema into a Hackolade model.
    • a ScyllaDB instance: Hackolade accesses a ScyllaDB instance through the REST API, reads the table and column setup, then samples columns with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
    • SQL Server and Azure SQL: Hackolade accesses an on-prem or cloud-based instance to fetch the DDL and derive the schema.  If a column with data type varchar(MAX) or varchar(4000) is detected to contain JSON documents, records are sampled and the schema is inferred (see the sketch after this list).
    • Snowflake: Hackolade accesses an instance to fetch the DDL and derive the schema.  If a column with data type VARIANT is detected to contain JSON documents, records are sampled and the schema is inferred.  
    • Synapse: Hackolade accesses an instance to fetch the DDL and derive the schema.  If a column with data type varchar(MAX) or varchar(4000) is detected to contain JSON documents, records are sampled and the schema is inferred.  
    • Swagger and OpenAPI documentation files, in either JSON or YAML format
    • a TinkerPop instance: Hackolade accesses a TinkerPop instance through the REST API, and reads the Gremlin setup for node labels and edge types.
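
For the JSON detection mentioned for SQL Server, Snowflake, and Synapse above, the principle can be conveyed with a minimal Python sketch; the function name and threshold are hypothetical, not Hackolade internals:

    import json

    def column_contains_json(sampled_values, threshold=0.8):
        # Treat a varchar(MAX)/varchar(4000)/VARIANT column as embedded JSON
        # if most sampled values parse as JSON objects or arrays.
        hits = 0
        for value in sampled_values:
            try:
                if isinstance(json.loads(value), (dict, list)):
                    hits += 1
            except (TypeError, ValueError):
                pass
        return bool(sampled_values) and hits / len(sampled_values) >= threshold

Columns passing such a check are then sampled so the embedded schema can be inferred, as in the process described below.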


This feature is also available via the Command-Line Interface.

Reverse-engineering process

This process includes a series of steps outlined below.  After going through the steps, the user is free to enrich the model for documentation purposes, and to generate forward-engineering scripts.

Choose the model target

Create a new model and select the target of your choice.  The reverse-engineering process is fairly similar for all targets, with local nuances where needed, though the terminology and connection protocols can differ significantly.

Choose the reverse-engineering function

This screen will vary depending on the selected target.

Tools > Reverse-Engineer > MongoDB Collection

Establishing a connection

The options here will vary depending on the selected target.  Refer to the pages below to find the specific details.  The images below refer to MongoDB.

Reverse-Engineering Connection Settings


TIP: if you have instances with a very large number of databases, you may wish to declare a particular database directly in the connection string by appending the address with: /<name of the database you wish to access directly>, for example: localhost/NBA
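
For instance, assuming a local MongoDB instance on the default port 27017, both of the following point directly at the NBA database:

    localhost/NBA
    mongodb://localhost:27017/NBA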


Reverse-Engineering Connection Selection


For more information on the various authentication and encryption protocols, please consult these pages.


Selecting one or more collections/entities

Reverse-Engineer MongoDB Collection selection


Sampling of one or more collections/entities

Sampling is currently limited for performance reasons to at most 10,000 documents, adjustable via a user parameter in Tools > Options > Reverse-Engineering, or up to 100k documents with the Command-Line Interface.


Since MongoDB 3.2, collections are sampled with the $sample operator via the aggregation pipeline.  This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.
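
As an illustration, the equivalent sampling can be reproduced with a few lines of pymongo; the database and collection names are hypothetical, and Hackolade performs this internally:

    from pymongo import MongoClient

    # Draw 1,000 random documents without replacement using the $sample stage
    client = MongoClient("mongodb://localhost:27017")
    coll = client["NBA"]["players"]
    sample = list(coll.aggregate([{"$sample": {"size": 1000}}]))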


In MongoDB 3.0 and 2.6, collections are sampled via a backwards-compatible algorithm that takes place in three stages:

- Hackolade opens a cursor on the desired collection, sorted in descending order of the _id field.

- sampleSize documents are randomly selected from the stream, using reservoir sampling.

- Hackolade performs a query to select the chosen documents directly via _id.
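
A minimal Python sketch of these three stages, for illustration only (Hackolade's actual implementation differs):

    import random

    def legacy_sample(coll, sample_size):
        # Stage 1: open a cursor on the collection, sorted descending on _id
        cursor = coll.find({}, {"_id": 1}).sort("_id", -1)
        # Stage 2: keep a uniform random subset via reservoir sampling
        reservoir = []
        for i, doc in enumerate(cursor):
            if i < sample_size:
                reservoir.append(doc["_id"])
            else:
                j = random.randint(0, i)
                if j < sample_size:
                    reservoir[j] = doc["_id"]
        # Stage 3: fetch the chosen documents directly by _id
        return list(coll.find({"_id": {"$in": reservoir}}))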


Note that this sampling mechanism was originally provided as open source by MongoDB, and adapted to the specific needs of Hackolade.


Probabilistic inference of schema

From the sample set gathered above, measurable metrics can be applied to the schema, including schema depth and width, and class interactions, as described below:

RevEng schema inference
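
As a toy illustration of the idea (Hackolade's actual inference is considerably richer), one can tally, for each field, how often it occurs and with which types, yielding an occurrence probability per field:

    from collections import Counter, defaultdict

    def infer_schema(docs):
        # Tally, per top-level field, the observed types and occurrences.
        # Real inference also recurses into sub-documents and arrays.
        field_types = defaultdict(Counter)
        for doc in docs:
            for field, value in doc.items():
                field_types[field][type(value).__name__] += 1
        return {f: {"probability": sum(t.values()) / len(docs), "types": dict(t)}
                for f, t in field_types.items()}

For example, infer_schema([{"a": 1}, {"a": "x", "b": True}]) reports field a with probability 1.0 (seen as int and str) and field b with probability 0.5 (seen as bool).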


Note that this inference mechanism was originally provided as open source by MongoDB, and adapted to the specific needs of Hackolade.


In this step, when the instance being reverse-engineered is v3.2 or above, the stored validator setup is leveraged as well, using the db.getCollectionInfos() method.
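
In pymongo terms, db.getCollectionInfos() maps to the listCollections command, and the stored validator appears under options.validator; for instance (collection name hypothetical):

    from pymongo import MongoClient

    # Read the validator stored for a given collection
    client = MongoClient("mongodb://localhost:27017")
    info = next(client["NBA"].list_collections(filter={"name": "players"}))
    validator = info.get("options", {}).get("validator")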


Pattern field detection

As of v2.5.0 of Hackolade, we've also introduced the ability to detect pattern fields during the reverse-engineering process.  For any sub-object at a given level, we can infer the presence of a pattern field if we detect the recurrence of similar regex patterns in field names.

Reverse-Engineering pattern field detection 


This pattern field recognition was further enhanced in v4.2.3 with the following logic:

a) the attribute must be a complex data type (object or array)

b) in a list of successive complex attributes in a branch of the schema hierarchy, the structure of each of these complex attributes must be identical

c) the names of these successive complex attributes with identical structure must be similar as per either:

- an ID-like structure of identical length (at least 12 characters) that may only include hexadecimal characters and dashes;

- or by using the Levenshtein distance.  The distance parameter may be modified in Tools > Options > Reverse-Engineering (default = 1).  An online Levenshtein distance calculator can be found here.
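
An illustrative Python sketch of criterion c), not Hackolade's actual implementation:

    import re

    HEX_ID = re.compile(r"^[0-9a-fA-F-]{12,}$")

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def similar_names(names, max_distance=1):
        # ID-like: hex-and-dash strings of identical length (>= 12 characters)
        if all(HEX_ID.match(n) for n in names) and len({len(n) for n in names}) == 1:
            return True
        # Otherwise: successive names within the configured Levenshtein distance
        return all(levenshtein(a, b) <= max_distance for a, b in zip(names, names[1:]))

For example, similar_names(["item1", "item2", "item3"]) returns True with the default distance of 1.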


Inference of Foreign Key relationships

With v4.2.13, the option was introduced to attempt foreign key relationship inference.  This feature is only available for the MongoDB target, and is for documentation purposes only, as foreign key relationships are not enforced by the database.  For each field in a collection with the data type ObjectId, the application determines whether the sampled value can be linked to a document in the same collection (for a recursive FK) or in another collection of the database.  The cardinality is set by default to "0...n" on the child side, but can be adjusted manually by the user.
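
A simplified pymongo sketch of this lookup, for illustration only (names and sampling limit are hypothetical):

    def infer_fk_targets(db, source_coll, field, sample_limit=100):
        # For sampled ObjectId values of a field, find which collection(s)
        # contain a document with a matching _id; a hit on source_coll
        # itself would indicate a recursive FK.
        targets = set()
        query = {field: {"$type": "objectId"}}
        for doc in db[source_coll].find(query, {field: 1}).limit(sample_limit):
            for coll in db.list_collection_names():
                if db[coll].count_documents({"_id": doc[field]}, limit=1):
                    targets.add(coll)
        return targets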


Transformation into a Hackolade collection schema

In this final step, the derived schema is converted and persisted into the Hackolade notation, so it can be visualized, adapted, and enriched in the application.


Conflict resolution during merge into an existing model

When reverse-engineering into an existing model, it is possible that source objects match existing objects in the model, whether by container names and/or entity names.  When the process detects such a case, the user is asked to decide on the logic to be used during processing:


Conflict resolution dialog


Keep both: leaves the existing objects in the model untouched, and adds the new objects alongside them

Replace: overwrites the existing objects with the new ones

Merge: combines the new objects with the existing objects in the model.  Newly detected attributes are added, while existing properties of attributes are maintained.

Cancel: aborts the process


By checking the option "Apply to all conflicts", the user tells the process to apply the same choice to all subsequently detected conflicts, so the question is not asked again.


The feature is also available as an argument in revEng Command-Line Interface commands, where the chosen resolution can only "apply to all conflicts".