Reverse-engineer an existing instance
Reverse engineering is a useful feature that facilitates the modeling and documentation of existing instances. After reverse engineering has been performed, the user can enrich the resulting model by filling in descriptions, recording constraints that were not obvious from the stored data, documenting relationships, etc.
Hackolade supports the reverse engineering of different types of data sources:
- a JSON document: conversion of a basic JSON document into a Hackolade collection schema
- a JSON Schema: conversion of a JSON Schema file into a Hackolade collection schema
- a Data Definition Language file (DDL): from Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Hadoop Hive
- an XSD schema file from another ER tool, such as erwin, ER/Studio, PowerDesigner, or other
- an Excel file template: first export a Hackolade model (even an empty one) to generate an Excel file for the target of your choice. Then you may bulk-edit your model, or create a new one, before ingesting it back into the application.
- a MongoDB database: Hackolade accesses a MongoDB instance through a connection, samples the selected collection(s), and derives a Hackolade schema from the sampled data
- a DynamoDB instance: Hackolade accesses DynamoDB, whether local or hosted on AWS. It samples the selected tables, and probabilistically infers the Hackolade schema from the sampled data.
- a Couchbase instance: Hackolade accesses a Couchbase instance through a connection, samples the selected bucket(s), and probabilistically infers the Hackolade schema from the sampled data.
- a Cosmos DB instance: Hackolade can reverse-engineer an instance of Cosmos DB, whether local or hosted on Azure, using either the SQL API (previously known as the DocumentDB API) or the MongoDB API. The process samples the selected collections and document types, and probabilistically infers the Hackolade schema from the sampled data.
- an Elasticsearch instance: Hackolade accesses an Elasticsearch instance through the REST API, samples the selected indices and types, and probabilistically infers the Hackolade schema from the sampled data plus the retrieved mappings.
- an HBase instance: Hackolade accesses an HBase instance through the REST API, reads the table and column family setup, then samples column qualifiers with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
- a Neo4j graph instance: Hackolade accesses a Neo4j instance through the REST API, and reads the Cypher setup for node labels and relationship types.
- a Cassandra or DataStax (DSE) instance: Hackolade accesses a Cassandra instance through the REST API, reads the table and column setup, then samples columns with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
- an Apache Hive instance: Hackolade accesses Hive via the Thrift service, reads the table and column setup, then samples columns with JSON types to probabilistically infer the Hackolade schema from the sampled JSON data.
- an Apache Avro data file or schema: Hackolade reads .avro and .avsc files from the file system to convert the Avro schema into a Hackolade model.
This feature is also available via the Command-Line Interface.
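As an illustration of the simplest case above, converting a basic JSON document into a schema, here is a minimal sketch in Python. This is a simplified illustration only; Hackolade's own conversion handles many more types and options:

```python
def json_to_schema(value):
    """Map a JSON value to a minimal JSON Schema-like fragment."""
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: json_to_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Infer item type from the first element, if any.
        return {"type": "array",
                "items": json_to_schema(value[0]) if value else {}}
    if isinstance(value, bool):   # must precede the int check (bool is a subclass of int)
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}

schema = json_to_schema({"title": "Jaws", "year": 1975, "tags": ["shark"]})
```

The same recursive walk generalizes to nested documents, which is why a single sample document is enough to seed a first-draft collection schema.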
This process includes a series of steps outlined below. After going through the steps, the user is free to enrich the model for documentation purposes, and to generate forward-engineering scripts.
1. Choose the model target
Create a new model and select the target of your choice. The reverse-engineering process is fairly similar for all targets, with local nuances where needed, but the terminology and connection protocols can differ significantly.
2. Choose the reverse-engineering function
This screen will vary depending on the selected target.
3. Establishing a connection
The options here will vary depending on the selected target. Refer to the pages below to find the specific details. The images below refer to MongoDB.
TIP: if your instance has a very large number of databases, you may wish to target a particular database directly in the connection string by appending the address with /<name of the database you wish to access directly>, for example: localhost/NBA
For more information on the various authentication and encryption protocols, please consult these pages.
4. Selecting one or more collections
5. Sampling of one or more collections
Sampling is currently limited, for performance reasons, to at most 10,000 documents. This limit is adjustable via a user parameter in Tools > Options > Reverse-Engineering, or up to 100,000 documents with the Command-Line Interface.
Since MongoDB 3.2, collections are sampled with the $sample operator of the aggregation pipeline. This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.
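For illustration, a sampling pipeline of this shape can be built as follows. The query and sample size are hypothetical, and Hackolade's actual pipeline may differ:

```python
def build_sample_pipeline(sample_size, query=None):
    """Build a MongoDB aggregation pipeline that randomly samples documents.

    An optional $match stage restricts the sampling to the subset of
    documents specified by a query, as described above.
    """
    pipeline = []
    if query:
        pipeline.append({"$match": query})
    pipeline.append({"$sample": {"size": sample_size}})
    return pipeline

# With a driver such as pymongo, this would be run as:
#   db.movies.aggregate(build_sample_pipeline(1000, {"year": {"$gte": 2000}}))
pipeline = build_sample_pipeline(1000, {"year": {"$gte": 2000}})
```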
In MongoDB 3.0 and 2.6, collections are sampled via a backwards-compatible algorithm that takes place in three stages:
- Hackolade opens a cursor on the desired collection, sorted in descending order of the _id field.
- sampleSize documents are randomly selected from the stream, using reservoir sampling.
- Hackolade performs a query to select the chosen documents directly via _id.
Note that this sampling mechanism was originally provided as open source by MongoDB, and adapted to the specific needs of Hackolade.
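The reservoir-sampling stage of the three-step algorithm above can be sketched like this. This is a simplified version (Algorithm R); the adapted implementation may differ:

```python
import random

def reservoir_sample(stream, sample_size, rng=random):
    """Keep a uniform random sample of sample_size items from a stream
    of unknown length, in a single pass (reservoir sampling)."""
    reservoir = []
    for i, doc in enumerate(stream):
        if i < sample_size:
            # Fill the reservoir with the first sample_size documents.
            reservoir.append(doc)
        else:
            # Replace an existing entry with decreasing probability,
            # so every document seen so far is equally likely to survive.
            j = rng.randint(0, i)  # inclusive bounds
            if j < sample_size:
                reservoir[j] = doc
    return reservoir

sampled_ids = reservoir_sample(range(100000), 10)
```

In the actual algorithm, the stream is the cursor of _id values from stage one, and the surviving _id values are then fetched directly in stage three.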
6. Probabilistic inference of schema
From the sample set gathered above, measurable metrics can be applied to the schema, including schema depth and width, and class interactions, as described here:
Note that this inference mechanism was originally provided as open source by MongoDB, and adapted to the specific needs of Hackolade.
In this step, when the instance being reverse-engineered is v3.2 or above, the stored validator setup is leveraged as well, using the db.getCollectionInfos() method.
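As a sketch of what leveraging the validator setup involves, the following extracts validator definitions from the structure returned by db.getCollectionInfos() (or pymongo's db.list_collections()). The response shape shown is abridged and illustrative:

```python
def extract_validators(collection_infos):
    """Collect validator definitions per collection from the
    getCollectionInfos()-style list of collection descriptors."""
    return {
        info["name"]: info["options"]["validator"]
        for info in collection_infos
        if "validator" in info.get("options", {})
    }

# Abridged, illustrative response shape from db.getCollectionInfos():
infos = [
    {"name": "movies",
     "options": {"validator": {"title": {"$type": "string"}}}},
    {"name": "logs", "options": {}},
]
validators = extract_validators(infos)
```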
As of v2.5.0 of Hackolade, we've also introduced the ability to detect pattern fields during the reverse-engineering process. For any sub-object at a given level, we can infer the presence of a pattern field if we detect the recurrence of similar regex patterns in field names.
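A hypothetical sketch of such pattern-field detection follows; the candidate patterns and the threshold are illustrative, not Hackolade's actual rules:

```python
import re

# Hypothetical candidate patterns: a sub-object whose keys mostly match
# one of these is flagged as using a pattern field rather than fixed names.
CANDIDATE_PATTERNS = {
    r"^\d{4}-\d{2}-\d{2}$": "date-like keys",
    r"^[0-9a-f]{24}$": "ObjectId-like keys",
}

def detect_pattern_field(keys, threshold=0.8):
    """Return the (regex, label) whose matches exceed the threshold, or None."""
    for pattern, label in CANDIDATE_PATTERNS.items():
        matches = sum(1 for k in keys if re.match(pattern, k))
        if keys and matches / len(keys) >= threshold:
            return pattern, label
    return None

result = detect_pattern_field(["2023-01-01", "2023-01-02", "2023-01-03"])
```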
7. Transformation into a Hackolade collection schema
In this final step, the derived schema needs to be converted and persisted into the Hackolade notation, so it can be visualized, adapted and enriched in the application.