Reverse-engineering
The reverse-engineering function is used to import information into pre-existing or empty data models. Reverse engineering is a useful feature to facilitate the modeling and documentation of existing instances. After the reverse engineering has been performed, the user can enrich the model created, by filling descriptions, recording constraints that were not obvious from the stored data, documenting relationships, etc…
Hackolade supports the reverse engineer of different types of data sources:
- a JSON or YAML document: schema inference of JSON or YAML documents into a Hackolade Studio data model for any target
- a JSON Schema or YAML Schema: conversion of JSON Schema files into a Hackolade Studio data model for any target
- a Data Definition Language file (DDL): from Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Hadoop Hive, Snowflake, Teradata
- an XSD schema file from another ER tool, such as erwin, ER/Studio, PowerDesigner, or other
- an Excel file template: first export a Hackolade model (even an empty one) to generate an Excel file for the target of your choice. Then you may bulk edit your model or create a new one before ingesting it back into the application.
- target-specific instances
This feature is also available via the Command-Line Interface.
Reverse-engineering process
This process includes a series of steps outlined below. After going through the steps, the user is free to enrich the model for documentation purposes, and to generate forward-engineering scripts.
Choose the model target
Create a new model and select the target of your choice. The reverse-engineering process is fairly similar for all targets, with local nuances when needed. But the terminology and connection protocols can be very different.
Choose the reverse-engineering function
This screen will vary depending on the selected target.
Establishing a connection
The options here will vary depending on the selected target. Refer to the pages below to find the specific details. The images below refer to MongoDB.
TIP: note, if you have instances with a very large number of databases, you may wish to declare a particular database directly in the connection string by appending the address with: /<name of the database you wish to access directly>, for example: localhost/NBA
For more information on the various authentication and encryption protocols, please consult these pages.
Selecting one or more collections/entities
Sampling of one or more collections/entities
Sampling is currently limited for performance reasons to at most 10.000 documents, adjustable via a user parameter in Tools > Options > Reverse-Engineering, or up to 100k documents with the Command-Line Interface.
Since MongoDB 3.2, collections are sampled with the $sample operator via the /core/aggregation-pipeline. This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.
In MongoDB 3.0 and 2.6, collections are sampled via a backwards-compatible algorithm. It takes place in three stages:
- Hackolade opens a cursor on the desired collection, sorted in descending order of the _id field.
- sampleSize documents are randomly selected from the stream, using reservoir sampling.
- Hackolade performs a query to select the chosen documents directly via _id.
Note that this sampling mechanism is originally provided in Open Source by MongoDB, and adapted to the specific needs of Hackolade.
Probabilistic inference of schema
From the sample set gathered above, measurable metrics can be applied to the schema, including schema depth and width, and class interactions, as described here:
Note that this inference mechanism is originally provided in Open Source by MongoDB, and adapted to the specific needs of Hackolade.
In this step, when the instance being reverse-engineered is v3.2 and above, stored validator setup is leveraged as well, using the db.getCollectionInfos() method
Special cases
Pattern field detection
As of v2.5.0 of Hackolade, we've also introduced the ability to detect pattern fields during the reverse-engineering process. For any sub-object at a given level, we can infer the presence of a pattern field if we detect the recurrence of similar regex patterns in field names.
This pattern field recognition was further enhanced in v4.2.3 with the following logic:
a) attribute must be a complex data type (object or array)
b) in a list of successive complex attributes in a branch of the schema hierarchy, the structure of each of these complex attributes must be exactly identical
c) the names of these successive complex attributes with identical structure must be similar as per either:
- an ID-like structure of the same length of at least 12 characters that can only include hexadecimal numbers and dashes;
- or by by using Levenshtein distance. The distance parameter may be modified in Tools > Options > Reverse-Engineering (default = 1) An online Levenshtein distance calculator can be found here.
Inference of Foreign Key relationships
With v4.2.13, the option was introduced to attempt foreign key relationship inference. This features is only available for the MongoDB target, and is for documentation purposes only, as Foreign Key relationships are not enforced by the database. For each field in a collection with the data type ObjectID, the application finds if the sampled value can be linked to a document in the same collection (for a recursive FK) or another collection in the database. The cardinality is set by default to "0...n" on the child side, but can be adjusted manually by the user.
Transformation into a Hackolade collection schema
In this final step, the derived schema needs to be converted and persisted into the Hackolade notation, so it can be visualized, adapted and enriched in the application.
Conflict resolution during merge into an existing model
When reverse-engineering into an existing model, it is possible that source objects already match existing objects in the model. This could be container names and/or entity names. When the process detects such case, the user is asked to decide on the logic to be used during processing:
Keep both: leaves the objects in the mode untouched, and adds new objects in the model
Replace: overwrites the existing objects with the new ones
Merge: combines the new objects with the existing objects in the model. New detected attributes will be added. Existing properties of attributes are maintained.
Cancel: abort process
By checking the option "Apply to all conflicts", the user tells the process to apply the same choice to all subsequently detected conflicts, so the question is not asked again.
The feature is also available as an argument in revEng Command-Line Interface commands, but can only "apply to all conflicts".