MongoDB is a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc.
Hackolade was specially built to support the data modeling of MongoDB collections, supporting multiple databases as well. The application closely follows the terminology of the database.
The data model in the picture below results from the reverse-engineering of the Yelp Challenge Dataset.
In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field. ObjectIds are small, likely unique, fast to generate, and ordered. An ObjectId value consists of 12 bytes, where the first four bytes are a timestamp that reflects the ObjectId’s creation, specifically:
- a 4-byte value representing the seconds since the Unix epoch,
- a 3-byte machine identifier,
- a 2-byte process id, and
- a 3-byte counter, starting with a random value.
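The 12-byte layout listed above can be illustrated with a minimal sketch that uses only the Python standard library. The function names `make_object_id` and `parse_object_id` are hypothetical and are not part of any MongoDB driver. Note also that more recent drivers replace the machine identifier and process id with a single 5-byte random value, so this sketch follows the layout as described here rather than what a current driver emits.

```python
import os
import struct
import time


def make_object_id(timestamp, machine_id, process_id, counter):
    """Pack the four fields described above into 12 bytes (big-endian)."""
    if len(machine_id) != 3:
        raise ValueError("machine_id must be exactly 3 bytes")
    return (
        struct.pack(">I", timestamp)                 # 4 bytes: seconds since the Unix epoch
        + machine_id                                 # 3 bytes: machine identifier
        + struct.pack(">H", process_id & 0xFFFF)     # 2 bytes: process id (truncated to 16 bits)
        + (counter & 0xFFFFFF).to_bytes(3, "big")    # 3 bytes: counter
    )


def parse_object_id(oid):
    """Split a 12-byte value back into its four fields."""
    if len(oid) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    return {
        "timestamp": struct.unpack(">I", oid[0:4])[0],
        "machine_id": oid[4:7],
        "process_id": struct.unpack(">H", oid[7:9])[0],
        "counter": int.from_bytes(oid[9:12], "big"),
    }


# Illustrative usage: random machine id, current process id, counter starting at 1.
oid = make_object_id(int(time.time()), os.urandom(3), os.getpid(), 1)
print(oid.hex())  # 24 hexadecimal characters
```

Because the timestamp occupies the leading bytes, values generated later sort after earlier ones, which is why ObjectIds are described as ordered.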
MongoDB represents JSON documents in a binary-encoded format called BSON behind the scenes. BSON extends the JSON model to provide additional data types and ordered fields, and to be efficient to encode and decode in different languages. The MongoDB BSON implementation is lightweight, fast and highly traversable. Like JSON, MongoDB's BSON implementation supports embedding objects and arrays within other objects and arrays. MongoDB can even 'reach inside' BSON objects to build indexes and match objects against query expressions on both top-level and nested BSON keys.
BSON is a binary serialization format used to store documents and make remote procedure calls in MongoDB. The BSON specification is published at bsonspec.org.
Hackolade was specially built to support the data types and attributes behavior of MongoDB, including the BSON types.
MongoDB provides a number of different index types to support specific types of data and queries.
- default _id index: MongoDB creates a unique index on the _id field during the creation of a collection. The _id index prevents clients from inserting two documents with the same value for the _id field. You cannot drop this index on the _id field.
- single field: MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document.
- compound index: MongoDB also supports user-defined indexes on multiple fields, i.e. compound indexes.
- multikey index: MongoDB uses multikey indexes to index the content stored in arrays. If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array. These multikey indexes allow queries to select documents that contain arrays by matching on element or elements of the arrays. MongoDB automatically determines whether to create a multikey index if the indexed field contains an array value; you do not need to explicitly specify the multikey type.
- geospatial index: to support efficient queries of geospatial coordinate data, MongoDB provides two special indexes: 2d indexes that use planar geometry when returning results and 2dsphere indexes that use spherical geometry to return results.
- text index: MongoDB provides a text index type that supports searching for string content in a collection. These text indexes do not store language-specific stop words (e.g. “the”, “a”, “or”) and stem the words in a collection to only store root words.
- hashed indexes: to support hash based sharding, MongoDB provides a hashed index type, which indexes the hash of the value of a field. These indexes have a more random distribution of values along their range, but only support equality matches and cannot support range-based queries.
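As a sketch of how the index types above differ in their definitions, the following builds the document for MongoDB's `createIndexes` database command as plain Python dictionaries. The collection and field names (`business`, `stars`, `categories`, and so on) are hypothetical, and the helper functions are illustrative rather than part of any driver.

```python
def index_spec(name, key):
    """One entry of the `indexes` array: a name plus the key pattern."""
    return {"name": name, "key": key}


def create_indexes_command(collection, specs):
    """Assemble the createIndexes command document for one collection."""
    return {"createIndexes": collection, "indexes": list(specs)}


specs = [
    index_spec("stars_desc", {"stars": -1}),                    # single field, descending
    index_spec("city_stars", {"city": 1, "stars": -1}),         # compound: field order matters
    index_spec("categories_asc", {"categories": 1}),            # becomes multikey if the field holds arrays
    index_spec("location_2dsphere", {"location": "2dsphere"}),  # geospatial, spherical geometry
    index_spec("review_text", {"text": "text"}),                # text index on a field named `text`
    index_spec("user_hashed", {"user_id": "hashed"}),           # hashed, e.g. for hashed sharding
]

command = create_indexes_command("business", specs)
print(command)
```

Note that the multikey index has no dedicated keyword: its definition looks like a single-field index, and MongoDB decides based on the data.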
MongoDB indexes can have the following properties:
- unique indexes: the unique property for an index causes MongoDB to reject duplicate values for the indexed field. Other than the unique constraint, unique indexes are functionally interchangeable with other MongoDB indexes.
- partial indexes: they only index the documents in a collection that meet a specified filter expression. By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance. Partial indexes offer a superset of the functionality of sparse indexes and should be preferred over sparse indexes.
- sparse indexes: the sparse property of an index ensures that the index only contains entries for documents that have the indexed field. The index skips documents that do not have the indexed field. The sparse index option can be combined with the unique index option to reject documents that have duplicate values for a field but ignore documents that do not have the indexed key.
- TTL indexes: time-to-live indexes are special indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time. This is ideal for certain types of information like machine generated event data, logs, and session information that only need to persist in a database for a finite amount of time.
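The combination of the sparse and unique properties described above can be modeled with a small in-memory sketch. This is a toy simulation for illustration only, with a hypothetical class name; it does not reflect how MongoDB stores or maintains indexes internally.

```python
class SparseUniqueIndex:
    """Toy model of an index created with both the unique and sparse options."""

    def __init__(self, field):
        self.field = field
        self.entries = {}  # indexed value -> _id of the owning document

    def insert(self, doc):
        """Return True if an index entry was created, False if the document was skipped."""
        if self.field not in doc:
            return False  # sparse: documents without the field get no entry
        value = doc[self.field]
        if value in self.entries:
            # unique: a second document with the same value is rejected
            raise ValueError(f"duplicate value for {self.field!r}: {value!r}")
        self.entries[value] = doc["_id"]
        return True


index = SparseUniqueIndex("nickname")
index.insert({"_id": 1, "nickname": "ada"})
index.insert({"_id": 2})  # skipped: no nickname field
index.insert({"_id": 3})  # also skipped, so no conflict with _id 2
print(index.entries)
```

Without the sparse option, the two documents lacking a nickname would both be indexed under a missing value and the second would violate the unique constraint.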
Read-only views in MongoDB were introduced with version 3.4. DBAs can define non-materialized views that expose only a subset of data from an underlying collection, i.e. a view that filters out entire documents or specific fields, such as Personally Identifiable Information (PII) from sales data or health records. As a result, the risk of data exposure is reduced. DBAs can also define a view of a collection that is generated from an aggregation over one or more other collections or views.
MongoDB views are handled through a pipeline projection of the fields of a collection (with possible $lookup to additional collections). In Hackolade, views are represented in the Entity Relationship diagram alongside collections, with a visual icon to distinguish them.
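As an illustration of a view defined through a projection pipeline, the sketch below builds the document for MongoDB's `create` command with its `viewOn` and `pipeline` fields. The helper name, the view and collection names, and the hidden fields are all hypothetical.

```python
def create_view_command(view_name, source, hidden_fields):
    """Build a create command for a view that hides the given fields of `source`."""
    projection = {field: 0 for field in hidden_fields}  # 0 excludes a field from the output
    return {
        "create": view_name,
        "viewOn": source,
        "pipeline": [{"$project": projection}],
    }


command = create_view_command("patients_public", "patients", ["ssn", "date_of_birth"])
print(command)
```

Since the view is non-materialized, the pipeline runs against the source collection each time the view is queried.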
Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations. Database systems with large data sets or high throughput applications can challenge the capacity of a single server. There are two methods for addressing system growth: vertical scaling (bigger, more powerful server) and horizontal scaling (more servers with divided datasets). MongoDB supports horizontal scaling through sharding.
To distribute the documents in a collection, MongoDB partitions the collection using the shard key. The shard key consists of an immutable field or fields that exist in every document in the target collection.
You choose the shard key when sharding a collection. The choice of a shard key cannot be changed after sharding. A sharded collection can have only one shard key.
MongoDB supports three sharding strategies for distributing data across sharded clusters: hashed sharding, ranged sharding, and zone sharding (previously known as tag-aware sharding):
- hashed sharding: it involves computing a hash of the shard key field’s value. Each chunk is then assigned a range based on the hashed shard key values. MongoDB automatically computes the hashes when resolving queries using hashed indexes. Applications do not need to compute hashes.
- ranged sharding: it involves dividing data into ranges based on the shard key values. Each chunk is then assigned a range based on the shard key values.
- zone sharding (previously known as tag-aware): in sharded clusters, you can create zones that represent a group of shards and associate one or more ranges of shard key values to that zone. MongoDB routes reads and writes that fall into a zone range only to those shards inside of the zone.
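The difference between ranged and hashed distribution can be sketched with a toy model. The hash function below is a stand-in chosen for the example and is an assumption, not MongoDB's actual hashing; the chunk boundaries and function names are likewise hypothetical.

```python
import bisect
import hashlib


def ranged_chunk(boundaries, key):
    """Index of the chunk whose range contains `key`; `boundaries` are sorted split points."""
    return bisect.bisect_right(boundaries, key)


def toy_hash(key):
    """Stand-in 64-bit hash for illustration only (not MongoDB's hash function)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_chunk(num_chunks, key):
    """Split the 64-bit hash space into equal ranges and locate the key's hash."""
    width = 2**64 // num_chunks
    return min(toy_hash(key) // width, num_chunks - 1)


# Ranged: consecutive keys land in the same chunk, which suits range queries.
print([ranged_chunk([100, 200], k) for k in (50, 99, 100, 150, 250)])

# Hashed: consecutive keys are scattered across chunks, which evens out write load.
print([hashed_chunk(4, k) for k in range(10)])
```

This also shows why hashed sharding cannot serve range queries efficiently: neighboring shard key values no longer live in neighboring chunks.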
If you are developing Node.js applications on top of a MongoDB database, you may want to leverage the object document mapper (ODM) Mongoose, which allows you to describe your object model and then auto-generates the boilerplate logic that goes with it. Hackolade dynamically generates the Mongoose script based on model attributes and constraints.
Since version 3.2, MongoDB provides the capability to validate documents during updates and insertions. Validation rules are specified on a per-collection basis using the validator option, which takes a document that specifies the validation rules or expressions. Hackolade dynamically generates the validator script based on model attributes and constraints.
Since version 3.6, this script uses an extended JSON Schema syntax. Hackolade generates the proper syntax without requiring any JSON Schema knowledge.
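For readers unfamiliar with the validator option, here is a minimal sketch of a `$jsonSchema` validator built from a list of attributes. The input shape (tuples of name, BSON type, and a required flag) and the function names are assumptions made for the example; this is not Hackolade's generator or its output format.

```python
def build_validator(attributes):
    """Build a $jsonSchema validator from (name, bson_type, required) tuples."""
    schema = {
        "bsonType": "object",
        "properties": {name: {"bsonType": bson_type} for name, bson_type, _ in attributes},
    }
    required = [name for name, _, is_required in attributes if is_required]
    if required:  # only emit `required` when at least one field is mandatory
        schema["required"] = required
    return {"$jsonSchema": schema}


def create_collection_command(collection, attributes):
    """Wrap the validator in a create command for a new collection."""
    return {"create": collection, "validator": build_validator(attributes)}


command = create_collection_command(
    "business",
    [("name", "string", True), ("stars", "double", True), ("is_open", "bool", False)],
)
print(command)
```

The `bsonType` keyword is one of the extensions over standard JSON Schema: it accepts BSON type names such as `double` or `objectId` that plain JSON has no equivalent for.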
A button lets the user apply the script to a selected instance in order to create databases, collections with an optional $jsonSchema validator, indexes, and sharding configuration, as well as sample data if desired.
The connection is established using a connection string that includes the host (IP) address and port (typically 27017), plus username/password authentication if applicable. SSL encryption with X.509 certificates can be specified, and SSH tunneling to a Cloud instance is supported as well. Hackolade also supports MongoDB Enterprise security features, with LDAP and Kerberos authentication.
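A connection string of the kind described above can be assembled as in the following sketch. The function name and the example host and credentials are hypothetical; the `mongodb://` scheme and the `tls=true` option follow the standard MongoDB connection string format, and credentials are percent-encoded because characters such as `@`, `:` and `/` would otherwise break the URI.

```python
from urllib.parse import quote


def build_uri(host, port=27017, username=None, password=None, tls=False):
    """Assemble a mongodb:// URI, percent-encoding any credentials."""
    credentials = ""
    if username is not None:
        credentials = quote(username, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/"
    if tls:
        uri += "?tls=true"  # request an encrypted connection
    return uri


print(build_uri("10.0.0.5"))
print(build_uri("db.example.com", username="app user", password="p@ss:w/rd", tls=True))
```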
The Hackolade process for reverse-engineering MongoDB databases differs depending on the MongoDB version. For versions prior to 3.2, collections are queried with a random function. Starting with version 3.2, Hackolade uses the $sample syntax to perform the statistical sampling, followed by the schema inference. You may define a custom sampling with a specific aggregation pipeline query and sort. You may also enable inference of implicit relationships in the data.
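The general idea of sampling followed by schema inference can be sketched as below. The pipeline uses MongoDB's `$sample` and `$match` aggregation stages; the inference function is a deliberately simple illustration over top-level fields and is an assumption for the example, not Hackolade's actual algorithm.

```python
def sample_pipeline(size, match=None):
    """Aggregation pipeline that optionally filters, then draws a random sample."""
    pipeline = []
    if match:
        pipeline.append({"$match": match})
    pipeline.append({"$sample": {"size": size}})
    return pipeline


def infer_schema(documents):
    """For each top-level field, record the value types seen and how often it occurs."""
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            entry = schema.setdefault(field, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    return schema


# Stand-in for documents returned by running the pipeline against a collection.
sampled = [
    {"_id": 1, "name": "Cafe", "stars": 4.5},
    {"_id": 2, "name": "Diner", "stars": 4},
    {"_id": 3, "name": "Bar"},
]
print(sample_pipeline(1000))
print(infer_schema(sampled))
```

A field whose count is lower than the number of sampled documents is a candidate optional field, and a field with more than one observed type is a candidate for a multi-type definition.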
For more information on MongoDB in general, please consult the website.