AI Metadata: Best Practices for AI-based Data Analysis

AI metadata is critical for artificial intelligence applications. Metadata provides essential context information that improves an AI system's understanding, reasoning, and response generation capabilities. For example, metadata can describe relationship mapping in relational databases, specify data formats, and explain business terminologies, enabling AI systems to process and interpret data correctly.

This article explains the fundamentals of AI metadata, including its types, benefits, challenges, and practical applications in AI-powered search and retrieval systems. You will learn about some of the best practices for creating AI metadata. Additionally, you will see how to use AI metadata in generative AI and business intelligence applications to derive valuable insights from data.

After reading this article, you will understand how to create AI metadata and how to use it to improve your generative AI and business intelligence applications to optimize data-driven decision-making.

Summary of key AI metadata concepts

Concept	Description
What is AI metadata?	AI metadata is information that enhances a dataset's context and quality and improves an AI system’s understanding, reasoning, retrieval, and analysis capabilities.
Benefits of AI metadata	AI metadata can enhance data context and quality and improve data tracking and documentation.
Challenges of AI metadata	Common challenges associated with AI metadata include the massive volume of metadata, lack of standardization, quality control, and privacy concerns.
Types of AI metadata	AI metadata can be data associated with structured datasets like relational databases, semi-structured data like XML and JSON files, and unstructured data such as freeform text, video, images, etc.
Applications of AI metadata	AI metadata helps AI applications with search and retrieval, data analysis, model training and interpretability, and regulatory compliance.
AI metadata creation	AI metadata is created using both automated and manual approaches. Some best practices for creating it include adding context, describing the data structure, resolving ambiguities, and clearly explaining business terminologies and jargon.
How to use AI metadata	AI metadata can be used to build generative AI and business intelligence systems that let users query data in natural language. You can develop in-house AI applications using frameworks such as LangChain, or use AI platforms such as WisdomAI, which use AI metadata to provide insights into your data.

What is AI metadata?

Before we delve into AI metadata, let’s first see what metadata is. The National Information Standards Organization (NISO) defines metadata as:

“Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.”

In simple words, metadata refers to “data about data.” This can include context information about how multiple files or tables are related in a dataset, the format of the date column in a table, timestamps indicating when data was recorded, or any other valuable information that helps retrieve, use, or manage an information resource.

Definitions of AI metadata vary, but there is broad agreement that it enhances the context and quality of the original dataset and helps AI systems improve their training, understanding, reasoning, and response generation capabilities. For example, explaining the format of a table’s timestamp column can help AI systems generate more accurate responses to user queries involving date/time questions. Other examples of AI metadata include embedding vectors for textual data and AI model version history.
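To make the timestamp example concrete, here is a minimal sketch of how a format hint stored as metadata changes the interpretation of a raw value. The metadata field names and the `parse_with_metadata` helper are assumptions made for illustration, not part of any standard.

```python
from datetime import datetime

# Hypothetical metadata record for one column; the field names are illustrative.
column_metadata = {
    "table": "SalesOrders",
    "column": "CREATEDAT",
    "description": "Date the sales order was created",
    "format": "%Y%m%d",  # strptime pattern for YYYYMMDD
}


def parse_with_metadata(raw_value, metadata):
    """Parse a raw cell value using the date format recorded in the metadata."""
    return datetime.strptime(str(raw_value), metadata["format"]).date()


raw = 20180115  # without metadata, this looks like an ordinary integer
created = parse_with_metadata(raw, column_metadata)
print(created.isoformat())  # 2018-01-15
```

Without the `format` entry, nothing in the raw value tells a program (or an LLM) whether it is a date, an identifier, or a quantity.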

Some common standards describing AI metadata include Dublin Core, Schema.org, and ML Schema.

Benefits of AI metadata

Here are some of the specific benefits of using AI metadata:

Enhanced context: One of metadata's most significant advantages is enhancing the context of data, which helps AI applications better search, retrieve, and analyze data.
Improved data quality and accuracy: Metadata improves the quality of data by providing additional information about it. Having more information about the data reduces the possibility of data redundancy and inaccuracy.
Tracking and documentation: Effective metadata management enhances the tracking and documentation of datasets, which is essential for their scalability and maintainability.

Challenges with using AI metadata

While AI metadata is a powerful tool, there are some challenges associated with using it that are worth keeping in mind:

Volume and diversity: The biggest challenges with creating AI metadata are sometimes referred to as the “three Vs of data”: volume, velocity, and variety. Due to diverse data being generated in large volumes, ensuring that the metadata enhances data quality and context is difficult.
Lack of standardization: Although a few standards exist for creating AI metadata, there is still no common standard for unstructured and heterogeneous data. Inconsistent metadata structures make it difficult for AI systems to integrate and query data using metadata information.
Data quality issues: Poor-quality metadata can have a deleterious effect on the efficiency of AI systems, leading to inaccurate analysis and deceptive responses. It is essential to have robust quality control checks in place for metadata creation.
Security and privacy concerns: Metadata often contains sensitive and private information such as user logs or file ownership details. Unauthorized access to metadata can lead to data breaches.

Types of AI metadata

AI metadata comes in several types, which primarily reflect the nature of the data they describe.

Metadata for structured data

This type of metadata describes well-organized datasets such as CSV files and relational databases. It can include information about tables, columns, data types, and relationships between entities. For example, it could describe the data type stored in a table column or provide information about the currency format in a column containing monetary values.

Metadata for semi-structured data

Unlike structured datasets, semi-structured datasets lack a fixed schema. Common examples include XML or JSON files, where data is organized in a structured manner but allows flexibility for adding objects and properties that are not predefined.

Metadata for semi-structured data involves elements such as XML tags or JSON key-value pairs, which provide essential context and organization. For instance, metadata in a social media post could include the number of likes and comments.
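As a small illustration of the point above, the sketch below reads engagement metadata from a JSON post. The post structure and the key names (`metadata`, `likes`, `comments`, `shares`) are assumptions made for this example and do not reflect any real platform's schema.

```python
import json

# Hypothetical social media post; the structure is illustrative only.
raw_post = """
{
  "id": "post-001",
  "text": "New gravel bike just arrived!",
  "metadata": {"likes": 128, "comments": 14, "language": "en"}
}
"""

post = json.loads(raw_post)
meta = post.get("metadata", {})

# Semi-structured data has no fixed schema, so read keys defensively.
summary = {
    "likes": meta.get("likes", 0),
    "comments": meta.get("comments", 0),
    "shares": meta.get("shares"),  # optional key: None when the post lacks it
}
print(summary)
```

Because keys may be absent, the code uses `dict.get` rather than direct indexing.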

Metadata for unstructured data

This category encompasses data without a predefined format, such as text documents, images, or audio or video files. Metadata for unstructured data can include text transcriptions of audio and videos, image or video captions, etc.

AI-specific metadata

While AI applications can use any type of metadata, some metadata types are specific to AI applications. For example, there may be metadata associated with machine learning model training data, model version history, and the hyperparameters used for model training. AI metadata also includes data associated with AI model evaluation, such as evaluation metrics (e.g., accuracy, precision, recall, and F1 scores).
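The following sketch shows one possible way to record such AI-specific metadata for a single training run. The `ModelRunMetadata` class, its field names, and the example values are hypothetical; the metric formulas are the standard definitions computed from confusion-matrix counts.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelRunMetadata:
    """Hypothetical record describing one training run."""
    model_name: str
    version: str
    training_data: str
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


run = ModelRunMetadata(
    model_name="churn-classifier",          # illustrative values
    version="1.2.0",
    training_data="customers_2024q4.csv",
    hyperparameters={"learning_rate": 0.01, "epochs": 20},
    metrics=classification_metrics(tp=40, fp=10, fn=10, tn=40),
)
print(asdict(run))
```

Storing the record as a plain dictionary (via `asdict`) makes it easy to serialize alongside the model artifact.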

Applications of AI metadata

AI metadata has applications across various domains, such as machine learning model training and generative AI. Let's look at some of them.

Data extraction and retrieval

Metadata enhances an AI system's data extraction and retrieval capabilities. With the additional information it provides, generative AI applications can more effectively locate, access, and interpret data, leading to more accurate responses. For example, defining “customer churn” within metadata enables generative AI systems to provide more precise answers to user inquiries related to customer attrition.

Enhanced data analysis

Metadata improves the quality of data analysis by adding context and explaining business terminologies and jargon. For example, in a sales dataset, metadata indicating that the “Revenue” column represents figures in US dollars including taxes ensures accurate financial analysis and reporting and prevents potential misinterpretations.

AI model training and interpretability

Metadata plays a vital role in improving the efficiency of AI model training. It provides additional information about the data, which can speed up model training and improve performance. For instance, adding metadata specifying patient ages alongside MRI scans helps AI models learn age-related patterns more quickly, leading to more accurate diagnoses. Documenting preprocessing steps within the metadata enables transparency and enhances the reproducibility of machine learning models.

Regulatory compliance

Metadata containing information on data history, origin, audit trails, transformation, and model-building processes can reduce the risk of noncompliance and legal issues.

AI metadata creation

A robust AI metadata creation process ensures that AI systems can effectively understand, retrieve, analyze, and process data. You can use both automatic and manual techniques to create metadata for your AI applications.

Automatic metadata extraction

Automated tools and code scripts can automatically generate metadata by analyzing data structure and statistical properties. Some examples of automatic metadata extraction include the following:

Schema extraction: Automated tools can extract details about the database schema, including table and column names, the data types of columns, and foreign key relationships. For example, an AI-based system can scan a company’s sales database and generate metadata specifying that the date in a specific column is in YYYYMMDD format and that the Products table is linked to the Orders table via the ProductID column.

Modality extraction: AI tools can derive modalities that were not explicitly available in the original data; for example, they can extract text and speech from digital assets. Business intelligence systems can then use this information for improved data retrieval and analysis.

Data profiling: You can also use automated tools for data profiling, which involves deriving information from data already present in datasets. This includes statistical data summaries, such as mean, median, and standard deviation, and frequency analysis used to determine the occurrence of categorical variables and identify outliers and missing values. For example, an ecommerce platform can employ an automatic metadata extraction tool to analyze customer transaction data, summarize purchase patterns, detect anomalies, and identify seasonal trends. Business intelligence tools can use this information to create personalized marketing strategies.
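The schema extraction and data profiling steps described above can be sketched with Python's standard library. This is a minimal example under stated assumptions: it uses an in-memory SQLite database with made-up `Products` and `Orders` tables, and the helper names `extract_schema` and `profile_column` are hypothetical. A production tool would target your actual database engine.

```python
import sqlite3
import statistics

# Toy database standing in for a real sales database (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        ProductID INTEGER REFERENCES Products(ProductID),
        OrderDate TEXT,
        Amount    REAL
    );
    INSERT INTO Products VALUES (1, 'Road bike'), (2, 'Helmet');
    INSERT INTO Orders VALUES
        (1, 1, '20240105', 900.0),
        (2, 2, '20240106', 50.0),
        (3, 1, '20240210', NULL);
""")


def extract_schema(conn):
    """Collect column types and foreign keys for every table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = {row[1]: row[2]
                   for row in conn.execute(f"PRAGMA table_info({table})")}
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        foreign_keys = [
            {"column": row[3], "references": f"{row[2]}.{row[4]}"}
            for row in conn.execute(f"PRAGMA foreign_key_list({table})")
        ]
        schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return schema


def profile_column(conn, table, column):
    """Summarize a numeric column: row count, missing values, mean, median."""
    values = [row[0] for row in conn.execute(f"SELECT {column} FROM {table}")]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present) if present else None,
        "median": statistics.median(present) if present else None,
    }


schema = extract_schema(conn)
amount_profile = profile_column(conn, "Orders", "Amount")
print(schema["Orders"]["foreign_keys"])
print(amount_profile)
```

Note that the extracted schema reports `OrderDate` only as `TEXT`; the fact that it holds dates in YYYYMMDD format is exactly the kind of detail that still has to be added as metadata.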

Evaluating the quality of automatically extracted metadata is critically important because low-quality metadata degrades the performance of AI applications. Common evaluation approaches include:

Completeness assessment to verify that all necessary metadata fields are extracted
Accuracy evaluation to check that the extracted metadata is correct
Consistency verification to ensure uniformity in metadata representation across records
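The three checks above could be approximated as follows. This is a rough sketch, not an established scoring method: the record layout, the list of required fields, the hand-verified `reference_types`, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("table", "column", "data_type", "description")

# Hypothetical output of an automatic extraction tool.
extracted = [
    {"table": "Orders", "column": "OrderDate", "data_type": "TEXT",
     "description": "Order date in YYYYMMDD format"},
    {"table": "Orders", "column": "Amount", "data_type": "REAL", "description": ""},
    {"table": "Products", "column": "Name", "data_type": "text"},
]

# Hand-verified data types for a sample of columns (the reference).
reference_types = {
    ("Orders", "OrderDate"): "DATE",
    ("Orders", "Amount"): "REAL",
    ("Products", "Name"): "TEXT",
}


def completeness(records, required=REQUIRED_FIELDS):
    """Fraction of required fields that are present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records for f in required if r.get(f))
    return filled / (len(records) * len(required))


def accuracy(records, reference):
    """Fraction of checked records whose data type matches the reference."""
    checked = [r for r in records if (r["table"], r["column"]) in reference]
    if not checked:
        return 0.0
    correct = sum(
        1 for r in checked
        if r.get("data_type", "").upper()
        == reference[(r["table"], r["column"])].upper()
    )
    return correct / len(checked)


def inconsistent_spellings(records, field="data_type"):
    """Values that appear with more than one spelling, ignoring case."""
    variants = defaultdict(set)
    for r in records:
        if r.get(field):
            variants[r[field].lower()].add(r[field])
    return {key: sorted(v) for key, v in variants.items() if len(v) > 1}


print(f"completeness: {completeness(extracted):.2f}")
print(f"accuracy: {accuracy(extracted, reference_types):.2f}")
print(f"inconsistent: {inconsistent_spellings(extracted)}")
```

Accuracy is compared case-insensitively here so that spelling differences are reported by the consistency check rather than counted as errors.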

Automatic metadata extraction techniques are beneficial for capturing dynamically updated metadata in production environments. They can also track user interaction in real-time, follow dataset changes, and analyze customer feedback.

While automatic metadata extraction techniques offer scalability and efficiency, they have some limitations. For example, they struggle to capture intricate database components such as complex views and stored procedures, implicit relations, and unstructured and heterogeneous data. If data lacks a standard structure or formatting, automatic extraction techniques can misinterpret the data, resulting in inaccurate metadata.

Manual metadata creation

Automatic metadata extraction can be cost-effective and time-efficient; however, it often requires manual refinement to ensure clarity and accuracy. Furthermore, not all metadata can be extracted automatically. This is where manual metadata creation comes into play.

Here are some important considerations and recommended practices related to manual metadata creation.

Context enhancement

Datasets often lack contextual information; you can add metadata to enhance context. This includes items such as explanations of table and column names and expected values in table columns.

For instance, specifying that the Revenue column in a table refers to gross revenue before tax ensures that business intelligence systems generate accurate responses to revenue-related queries.

Structural understanding and relational mapping

You should add metadata explaining the structure of your dataset. For example, specify column data types, constraints, and relationships, and indicate which columns may contain duplicates or nulls, must hold distinct values, or are optional.

Sometimes, relationships between data tables or CSV files are not predefined. When this happens, you must manually specify one-to-many, many-to-many, and other relationship types as metadata. This helps AI systems retrieve correct responses when user queries require analyzing multiple files.
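A manually specified relationship is only useful if it matches the data, so it can help to validate it. The sketch below declares a one-to-many relationship as metadata and checks it against two small in-memory tables. The table names, key names, and the `check_one_to_many` helper are hypothetical.

```python
# Hypothetical relationship metadata; table and key names are illustrative.
relationship = {
    "parent": {"table": "orders", "key": "order_id"},
    "child": {"table": "order_items", "key": "order_id"},
    "type": "one-to-many",
}

# Tiny stand-ins for two CSV files loaded as lists of rows.
tables = {
    "orders": [{"order_id": 1}, {"order_id": 2}],
    "order_items": [
        {"order_id": 1, "product": "Road bike"},
        {"order_id": 1, "product": "Helmet"},
        {"order_id": 2, "product": "Helmet"},
    ],
}


def check_one_to_many(tables, rel):
    """Return a list of problems found when checking the declared relationship."""
    parent_rows = tables[rel["parent"]["table"]]
    child_rows = tables[rel["child"]["table"]]
    parent_keys = [row[rel["parent"]["key"]] for row in parent_rows]
    child_keys = {row[rel["child"]["key"]] for row in child_rows}

    problems = []
    # On the "one" side, each key must identify exactly one row.
    if len(parent_keys) != len(set(parent_keys)):
        problems.append("parent key is not unique")
    # Every child row must point at an existing parent row.
    orphans = sorted(child_keys - set(parent_keys))
    if orphans:
        problems.append(f"child rows reference missing parents: {orphans}")
    return problems


problems = check_one_to_many(tables, relationship)
print(problems or "relationship metadata is consistent with the data")
```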

Ambiguity resolution

Datasets may contain columns or values that share the same name. For example, multiple tables in a dataset may have Amount columns. You should add metadata specifying what this column refers to in each table (e.g., total cost, tax amount, or discount).

Explaining business terminology and jargon

Datasets can have business terms and jargon intrinsic to a specific industry or organization. For example, “customer churn” may refer to how often customers canceled their subscriptions or how many people did not purchase an item in the last month. Metadata should capture information about such jargon and business terms to ensure correct responses from AI applications.
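One simple way to capture manual metadata for ambiguity resolution and business terminology is a plain, JSON-serializable dictionary. The layout below, the table and column names, the definitions, and the two helper functions are all assumptions made for illustration; your organization's definitions will differ.

```python
import json

# Hypothetical manually curated metadata.
manual_metadata = {
    "columns": {
        "Orders": {"Amount": "Total cost of the order"},
        "Taxes": {"Amount": "Tax amount charged on the order"},
        "Discounts": {"Amount": "Discount applied to the order"},
    },
    "glossary": {
        "customer churn": "Customers who canceled their subscription "
                          "during the month",
    },
}


def describe_column(metadata, table, column):
    """Resolve an ambiguous column name by looking it up per table."""
    return metadata["columns"].get(table, {}).get(column)


def glossary_terms_in(metadata, question):
    """Return the glossary entries whose term appears in the question."""
    lowered = question.lower()
    return {term: definition
            for term, definition in metadata["glossary"].items()
            if term in lowered}


print(describe_column(manual_metadata, "Taxes", "Amount"))
print(glossary_terms_in(manual_metadata, "How did Customer Churn change in March?"))

# The record is plain JSON, so it can be stored next to the dataset.
serialized = json.dumps(manual_metadata, indent=2)
```

The same column name resolves to a different description in each table, and only glossary terms that actually occur in a question are returned.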

Storing metadata

Once you create metadata, the next step is to store this metadata and create an association between the metadata and the original data.

Data dictionaries and knowledge graphs are two primary methods of storing metadata.

Data dictionaries

A data dictionary is a centralized repository that contains definitions, attributes, relationships, constraints, validation rules, and other metadata information about a system. Data dictionaries can be implemented as specialized tables within the database systems that contain the original datasets.

A data dictionary serves as a reference for understanding the structure and relationship between data entities. It ensures consistency among different metadata terminologies used in an organization.

It is recommended that standards such as ISO/IEC 11179 be followed to enhance the effectiveness of data dictionaries for storing metadata.
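As a rough sketch of a data dictionary implemented as a table inside a database, the example below uses an in-memory SQLite database. The `data_dictionary` table layout and the `lookup` helper are assumptions for illustration; this is not an implementation of ISO/IEC 11179.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be converted to dictionaries

# Hypothetical layout: one row of metadata per (table, column) pair.
conn.execute("""
    CREATE TABLE data_dictionary (
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        data_type   TEXT,
        description TEXT,
        constraints TEXT,
        PRIMARY KEY (table_name, column_name)
    )
""")
conn.executemany(
    "INSERT INTO data_dictionary VALUES (?, ?, ?, ?, ?)",
    [
        ("Orders", "OrderDate", "TEXT",
         "Date of the order in YYYYMMDD format", "NOT NULL"),
        ("Orders", "ProductID", "INTEGER",
         "Links each order to the Products table", "FOREIGN KEY"),
    ],
)


def lookup(conn, table_name, column_name):
    """Fetch the dictionary entry for one column, or None if it is undocumented."""
    row = conn.execute(
        "SELECT * FROM data_dictionary WHERE table_name = ? AND column_name = ?",
        (table_name, column_name),
    ).fetchone()
    return dict(row) if row else None


entry = lookup(conn, "Orders", "OrderDate")
print(entry["description"])
```

The composite primary key prevents two conflicting definitions of the same column from being stored.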

Knowledge graphs

Knowledge graphs store information using a structure of nodes and edges that connect related nodes. They can store metadata in nodes and associate the original data with metadata via the edges, allowing them to capture complex interdependencies between data elements. In addition, knowledge graphs can dynamically expand and adapt to real-time metadata updates, improving search and retrieval in production environments.
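The node-and-edge idea can be sketched with a small in-memory graph. The `MetadataGraph` class, the node identifiers, and the relation names below are hypothetical; a production system would more likely use a dedicated graph database.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Minimal directed graph: nodes carry attributes, edges carry a relation."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def reachable(self, start):
        """Breadth-first search: every node reachable from `start`."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for _, target in self.edges.get(current, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen - {start}


graph = MetadataGraph()
graph.add_node("SalesOrderItems", kind="table")
graph.add_node("SalesOrders", kind="table")
graph.add_node("SalesOrders.CREATEDAT", kind="column")
graph.add_node("format:YYYYMMDD", kind="metadata", note="date format")

graph.add_edge("SalesOrderItems", "belongs_to", "SalesOrders")
graph.add_edge("SalesOrders", "has_column", "SalesOrders.CREATEDAT")
graph.add_edge("SalesOrders.CREATEDAT", "has_format", "format:YYYYMMDD")

# Metadata relevant to SalesOrderItems, found by following edges.
related = graph.reachable("SalesOrderItems")
print(sorted(related))
```

Starting from the items table, the traversal discovers that date information lives in a related table and which format it uses, which is the kind of interdependency a flat dictionary does not express directly.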

Examples of AI metadata in search and retrieval

You can develop AI applications that extract information from data sources using AI metadata or use AI-based business intelligence platforms such as WisdomAI.

Extracting information from data using code

Let’s look at a Python-based example of how you could retrieve data from CSV files. We will use the LangChain framework for Python with the OpenAI GPT-4o LLM to retrieve information from multiple CSV files.

First, the following script installs the required libraries.

!pip install langchain
!pip install langchain-core
!pip install langchain-experimental
!pip install langchain-openai

The script below imports required libraries into your Python application.

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent

# This example runs in Google Colab, where the API key is stored as a notebook secret.
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

Next, we will extract information from the bike sales dataset from Kaggle. The following script imports three CSV files from the dataset into corresponding Pandas DataFrame objects. You can see the CSV file column names in the output.

# dataset link: https://www.kaggle.com/datasets/yasinnaal/bikes-sales-sample-data

sales_orders = pd.read_csv('/content/SalesOrders.csv')
sales_orders_items = pd.read_csv('/content/SalesOrderItems.csv')
product_texts = pd.read_csv('/content/ProductTexts.csv')

print(f"Columns in sales_orders: {sales_orders.columns}")
print(f"Columns in sales_orders_items: {sales_orders_items.columns}")
print(f"Columns in product_texts: {product_texts.columns}")

Here’s the output.

Next, we create an LLM object that calls the GPT-4o model using LangChain.

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=OPENAI_API_KEY
)

Finally, we create a Pandas DataFrame agent using the create_pandas_dataframe_agent function from LangChain and ask the agent to return the average gross amount from the Sales Order Items table.

dfs = [sales_orders, sales_orders_items, product_texts]

agent = create_pandas_dataframe_agent(llm, dfs,
                                      verbose=True,
                                      allow_dangerous_code=True)
agent.invoke("Give me the average gross amount from the sales order items dataframe")

The output shows the average gross amounts.

Let’s ask another question that requires retrieving data from multiple CSV files. Notice that we pass the metadata (i.e., the product name information) in the query.

agent.invoke("Which is the name of the most sold product in the sales order table? The product name is in short description column.")

Here’s the resulting output. Again, the agent generated a correct response.

Next, we will ask a slightly more complicated question.

agent.invoke("Give me the average monthly gross amount from the sales orders table.")

Here, the agent tried to retrieve data from the Sales Order Items table instead of the Sales Order table and generated an incorrect response. Basically, the agent hallucinated and tried to make up a response that it thought best answered the input query.

In the next query, we passed the column information to avoid this problem.

agent.invoke("Give me the average monthly gross amount from the sales orders table. The datetime data is stored in the CREATEDAT column of the sales order table.")

However, the agent could not generate a response and timed out.

This output occurred because the agent tried multiple times to retrieve the required information but could not do so before hitting its iteration limit. This happens in LangChain when an agent cannot produce a final answer after repeated reasoning steps. In this case, despite receiving metadata in the form of column names, the agent could not return a response, which illustrates how simple agents can struggle to provide data insights.
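Rather than typing metadata into each question by hand, you could keep the notes in one place and append them programmatically. The helper below is a hypothetical sketch; `with_metadata` is not part of LangChain, and, as the example above shows, adding metadata to the query does not by itself guarantee a correct answer.

```python
def with_metadata(question, notes):
    """Append metadata notes to a natural-language question."""
    if not notes:
        return question
    lines = "\n".join(f"- {note}" for note in notes)
    return f"{question}\n\nMetadata:\n{lines}"


# Notes taken from what we know about the dataset.
notes = [
    "The datetime data is stored in the CREATEDAT column of the sales orders table.",
    "CREATEDAT values use the YYYYMMDD format.",
]

query = with_metadata(
    "Give me the average monthly gross amount from the sales orders table.",
    notes,
)
print(query)

# With the agent created earlier, the augmented question would be sent as:
# agent.invoke(query)
```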

Business intelligence using an AI platform: WisdomAI

The example above illustrates certain drawbacks to developing in-house business intelligence and data analysis solutions:

You must hire developers to develop and maintain your applications.
You will need to update your system with dynamic data changes.
You will need multiple code iterations and prompt refinement to retrieve correct responses from your AI systems.
If you use data analysis tools like Power BI or Tableau, you need data analysis and visualization specialists to generate visualizations that a layperson can understand.

Another option is to use AI-based business intelligence platforms such as WisdomAI, which let users extract information from structured and semi-structured data sources using natural language queries. Then, anyone in your organization can retrieve insights from data without needing data visualization or analysis expertise. You can ask questions about your data in plain text, and WisdomAI will use its advanced AI capabilities to generate responses and create visualizations for you.

You can create a new domain or add your CSV file to an existing one. In this case, we already have a domain called Bike Sales. To this domain, we added five CSV files from our Bike Sales data set.

WisdomAI allows users to add metadata to their datasets in multiple ways. For example, you can define relationships between CSV files by clicking the `Create Relationship` button, top right.

You can also add descriptions for the tables. Descriptions are metadata information about the data tables used to generate more accurate responses to user queries. For example, in the following screenshot, we provide the information that the date is stored in the `CREATEDAT` column in `YYYYMMDD` format.

You can also add global knowledge about the dataset, which helps enhance context, resolve ambiguities, and explain business terminologies and jargon. WisdomAI uses a semantic context layer that integrates global knowledge into evolving knowledge graphs. This allows the search and retrieval strategy to adapt to real-time global knowledge updates.

For example, the screenshot below specifies that the `COMPANYNAME` column stores business partner names.

Of course, you can add more metadata information if you want. WisdomAI can infer and learn metadata automatically by analyzing different data sources. Once you have added your metadata information, click the `New chat` link in the left sidebar to ask questions about your data. First, let’s ask a simple question about the average gross sales amount from the sales order items.

The response above is similar to what we got from the LangChain agent. The related questions you may want to ask are also listed at the bottom of the response.

For example, let’s click `Show the trend of gross amount over time.`

You will see the following output. You can see that WisdomAI understands that the date information is in the `CREATEDAT` column of the Sales Orders table instead of the Sales Order Items table.

You can click `AI Workstream` to view the detailed steps and the SQL query used to generate the response.

By default, WisdomAI retrieves monthly gross amounts. It also creates the most suitable plot for your data. You can change the plot style by clicking the drop-down button at the bottom right of the plot.

Let’s ask another question: We want the names of the three business partners with the highest numbers of products in Sales Orders.

This output shows that WisdomAI distinguished between the sales orders in the Sales Order table and those in the Sales Order Items table and asked a follow-up question to clarify. Here is the response after we asked for distinct sales order items, which appears correct.

Last thoughts

AI metadata enhances AI systems' reliability, efficiency, and accuracy by improving data context and quality. It lets AI applications search, retrieve, and analyze data more effectively and generate more accurate responses to user queries.

Key challenges associated with AI metadata include a lack of standardization and privacy concerns. To maximize the benefits of AI metadata, organizations must adopt best practices such as context enrichment, structural mapping, ambiguity resolution, and precise documentation.

Two key approaches exist to harness the power of AI metadata to gain insight into data: building in-house data management solutions and leveraging AI platforms such as WisdomAI.

While custom solutions offer flexibility, they require significant investment in human resources and infrastructure. In contrast, AI platforms make it straightforward to create AI metadata and use it to gain better insights into data. If you want a ready-to-use AI platform that simplifies data visualization, retrieval, and insight extraction through natural language queries, book a demo with WisdomAI.