
    Data modeling and the AI lifecycle

    Artificial Intelligence is only as good as the data it understands. AI built on ungoverned data will yield unpredictable and unreliable outcomes. Without dedicating time and resources to ensure data reliability, you can't expect reliable AI.

     

    A strong data model keeps AI grounded, ensuring accurate outputs, real-time reasoning, and seamless system integration. Data modeling contributes to better AI by providing a structured framework for organizing, interpreting, and leveraging data effectively. It defines the relationships between data elements, ensuring that AI systems can access clean, consistent, and relevant information for training and decision-making. By establishing clear data schemas and reducing ambiguity, data modeling improves the quality and accuracy of machine learning models, reduces bias, enhances interpretability, and enables scalability. Ultimately, it lays the foundation for more intelligent, reliable, and efficient AI systems by aligning data with the goals and logic of AI algorithms.

     

    Conversely, AI can contribute to better data modeling by enabling advanced automation and intelligence throughout the modeling lifecycle. The current state of GenAI does not replace the synthesized knowledge of a subject-matter expert: crafting prompts detailed enough to capture that knowledge would take so much effort that the user might as well specify the details directly in a data model. Nevertheless, we see significant benefits in GenAI-aided data modeling, for example by leveraging GenAI to supplement descriptions and comments, or by helping to identify meaning in existing structures that lack proper descriptions.

     

    In such cases, the application can reverse-engineer Mermaid ERD code, potentially generated by GenAI outside Hackolade Studio. The application can also generate Mermaid diagrams from existing Hackolade data models (albeit with some limitations, to match Mermaid's own restrictions). GenAI can be leveraged for metadata enrichment, by generating meaningful descriptions for entities and attributes to be edited by subject-matter experts, and by recommending attributes based on industry standards. Furthermore, AI can propose dimensional models optimized from transactional schemas and suggest improvements such as better partition key choices, laying the groundwork for more efficient and standards-aligned data architectures.
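    As an illustration, Mermaid ERD code of the kind that could be exchanged with GenAI looks like the following. The entities and attributes here are invented for illustration; note that each key is a single attribute, since Mermaid does not support composite PKs/FKs:

```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    CUSTOMER {
        int customer_id PK
        string name
        string email
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
    }
```

    Text in this form can be pasted into a GenAI prompt, or conversely reverse-engineered into a data model.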

     

    Data modeling contributes to better AI

    Developing an AI solution involves several iterative steps. The process begins with understanding the business problem and clearly defining objectives. Next, data is collected and explored to assess its structure, quality, and potential value, including identifying missing values, anomalies, and biases, and uncovering patterns. The data is then prepared through cleaning and transformation, addressing issues such as incomplete data and bias. These early stages are supported by traditional data modeling practices: conceptual, logical, and physical data modeling, which provide structure and context.

     

     

    Data modeling and the AI lifecycle

     

    Diagram courtesy of Dave Wells dwells@infocentrig.org

     

    This foundational work enables the development and training of algorithms using the prepared data. Model performance is then optimized through parameter tuning and evaluated using accuracy, precision, and alignment with business goals. Finally, the AI model is deployed in a real-world environment, where its performance is continuously monitored and refined in response to new data and feedback.

     

    AI contributes to better data modeling

    AI can contribute to better data modeling by automating and enhancing some aspects of the data modeling process. Currently, AI is not yet good enough at understanding the specifics of a domain to create entire data models on its own. For now, the knowledge that subject-matter experts and data modelers have of the nuances of their organizations remains too complex for AI to handle. But it can assist in many ways to increase productivity.

     

    It can analyze large and complex datasets to identify hidden patterns, relationships, and anomalies that might be missed by human analysts. AI-driven tools can recommend optimal data structures, detect inconsistencies, and suggest schema improvements based on usage patterns and historical data. Machine learning algorithms also help in predictive modeling, enabling dynamic and adaptive models that evolve with new data. Additionally, AI can streamline tasks like data cleaning, entity recognition, and metadata generation, making the data modeling process faster, more accurate, and more scalable.

     

    At Hackolade, we could easily have pursued developing our own Natural Language Processing (NLP) or Generative Pre-trained Transformer (GPT) models. However, we're not driven by a "Not Invented Here" mentality! Our core focus is on data modeling, and there are many brilliant experts who specialize in NLP, LLMs, and GPT technologies. Moreover, as highlighted below, we place a premium on security and confidentiality, which is why we prefer that our customers have the freedom to choose and control the technologies they use in this context.

     

    The following features are currently on our roadmap:

    • Available: reverse-engineer Mermaid ERD code that could have been produced by GenAI in response to a prompt executed outside of Hackolade
    • Upcoming: generate Mermaid ERD code from a Hackolade Studio model, to be used in a GenAI prompt.  Note that Mermaid has some limitations, such as the lack of composite PKs/FKs, Not Null constraints, etc.
    • Still to be scheduled: use GenAI to create descriptions for selected entities and attributes of an existing model
    • Use GenAI to suggest attributes for given entities, according to industry-specific standards
    • Longer term: use GenAI to suggest an optimal dimensional model, given a transactional schema.  This could be done indirectly with the first two points in the list above.
    • Use GenAI to suggest more optimal modeling, choice of partition keys, etc.  But this has not yet been designed.
    • And more, based on customer feedback and suggestions.
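    To make the description-generation idea concrete, here is a minimal sketch in Python of how such a prompt could be assembled from model metadata. The function name, prompt wording, and data shapes are all invented for illustration and do not reflect an actual Hackolade Studio implementation:

```python
def build_description_prompt(entity: str, attributes: list[str]) -> str:
    """Assemble a GenAI prompt asking for draft descriptions of an
    entity and its attributes. Hypothetical helper for illustration;
    a subject-matter expert would still review and edit the output."""
    attr_list = "\n".join(f"- {name}" for name in attributes)
    return (
        f"You are a data modeling assistant. Write a one-sentence "
        f"business description for the entity '{entity}' and for each "
        f"of its attributes:\n{attr_list}\n"
        f"Return one line per item, formatted as 'name: description'."
    )

prompt = build_description_prompt("Customer", ["customer_id", "email"])
print(prompt)
```

    The generated text would be proposed as a draft only, leaving the modeler free to accept, edit, or discard each description.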

     

    Note that any direct AI interaction from Hackolade Studio will of course be entirely optional for users, and can be disabled, whether by preference or as mandated by the policy of the user's organization.

     

    Data security and AI

    An important factor to take into consideration is security.  At Hackolade, we are very much aware of the legitimate security concerns of our customers.  

     

    Data security in AI is a critical concern, as the rapid adoption of AI technologies introduces numerous risks to sensitive information. One major threat is the leakage of personal or confidential business data: users of AI systems might inadvertently expose private details through their prompts, and AI systems might expose them through their responses or models. Model inversion attacks are another significant risk, where adversaries reverse-engineer AI outputs to uncover sensitive data used in training, potentially revealing confidential information. Data poisoning also poses a threat: malicious actors intentionally inject misleading or harmful data into the training set, causing AI models to make erroneous decisions or behave unpredictably. Additionally, reliance on third-party AI services introduces the risk of opaque data access policies, where the service provider may have unrestricted access to the data being processed, leading to potential breaches or misuse. Furthermore, adversarial attacks, where small, imperceptible changes to input data deceive AI models, could compromise the integrity of decision-making processes.

     

    To mitigate these risks, it’s essential for organizations to implement robust security measures, such as data encryption, model transparency, and stringent access controls.  

     

    Running large language models (LLMs) on-premises is one of the most effective measures organizations can take to avoid data leakage. By hosting the models on their own infrastructure, companies can maintain full control over their data and ensure it doesn’t leave their secure environment. This approach helps mitigate risks associated with third-party AI services, where the data might be transmitted to external servers with less transparency regarding data handling and security practices.
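    As a sketch of what on-premises inference can look like, the snippet below builds a request for a locally hosted, OpenAI-compatible chat endpoint (such as one served by Ollama or a similar runtime). The URL, model name, and prompt are assumptions for illustration; the point is that the request targets the organization's own infrastructure, so no data leaves the secure environment:

```python
import json
from urllib import request

# Assumed local, OpenAI-compatible endpoint (e.g., Ollama, llama.cpp server).
# Nothing is transmitted to a third-party cloud service.
LOCAL_ENDPOINT = "http://localhost:11434/v1/chat/completions"

def build_payload(prompt: str, model: str = "llama3") -> dict:
    """Build the JSON body for a local chat-completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature for consistent metadata text
    }

payload = build_payload("Describe the entity 'Customer' in one sentence.")

# The actual call would only run against a listening local server:
# req = request.Request(
#     LOCAL_ENDPOINT,
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

    Because the endpoint resolves to localhost, the same internal encryption, network segmentation, and access-control policies that protect the rest of the infrastructure also govern the model traffic.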

     

    On-premises deployment also minimizes the risk of data interception during transmission and offers better control over data access policies. Organizations can enforce stricter internal security protocols, such as encryption, network segmentation, and access restrictions, to safeguard sensitive information. Additionally, having LLMs on-premises means businesses can better monitor and audit model usage, ensuring compliance with privacy regulations (e.g., GDPR, HIPAA).

     

    However, it's important to note that running LLMs on-premises also comes with its own set of challenges, like the need for specialized hardware, continuous maintenance, and expertise in managing these complex systems.

     

    In any case, part of our AI roadmap is to take all these concerns into account and always give users the choice of how AI is leveraged. Our pledge is to ensure that Hackolade Studio users remain in control of whether AI is used and, if so, how.