Guide

AI Data Preparation: Best Practices for Structured, Unstructured, and Semi-Structured Data

Chapter 1: AI Data Preparation

As organizations transition their AI initiatives from pilot phases to production, data preparation becomes increasingly important. High-quality input data leads to more accurate and relevant outputs, directly impacting the value that AI brings to businesses.

Data comes in various forms—structured, unstructured, and semi-structured—each requiring different processing methods. Structured data fits neatly into tables and databases, unstructured data includes text and multimedia content lacking a predefined format, and semi-structured data falls somewhere in between. AI systems must adapt to handle these diverse data types effectively.

This article provides an overview of data preparation approaches for AI projects, highlighting best practices to ensure that your models perform at their best. We explore the different kinds of data that organizations possess, look into specific AI use cases, and discuss the approaches to prepare each data type.

Summary of key AI data preparation concepts

  • Types of data in organizations: Data falls into three categories: structured, unstructured, and semi-structured. Each type requires specific preparation methods for AI applications. Structured data is organized in databases and tables, while unstructured data lacks a specific format. Semi-structured data combines elements of both.
  • Preparing structured data: Preparing structured data for LLM applications involves adapting the data or equipping LLMs with structured data analysis capabilities. Depending on your use case, you can convert structured data into vectors for similarity-based querying or use LLMs to generate SQL code to query the data. Code generation is apt for most use cases, but understanding user intent and generating accurate queries is challenging. High-quality metadata is crucial for accurate code generation.
  • Preparing unstructured data: Preparing unstructured data for AI involves several steps, including loading, parsing, chunking, embedding, and storing the data. Key considerations include data formats, chunking strategies, embedding models, and storage methods. Retrieval-augmented generation (RAG) and prompting techniques further enhance the accuracy of AI model responses.
  • Preparing semi-structured data: Preparing semi-structured data involves processing the structured and unstructured components and combining them strategically for inference.
  • AI data preparation best practices: Data preparation for AI/ML requires a strategic approach. Understanding business goals and use cases, developing appropriate data pipelines, and choosing the right techniques and tools are all important steps in optimizing data for AI and ML projects.

Types of data in organizations

Understanding the various types of data within an organization paves the way to identifying the appropriate data preparation steps needed for AI and ML projects. Each type has unique characteristics and requires different processing methods. The differences among data types also inform the data preparation steps outlined below.

Structured data

Structured data refers to information that is highly organized and easily searchable, typically residing in databases and tables with predefined schemas and attributes. Examples include relational databases, spreadsheets, and CSV files. This data adheres to strict formats, allowing for consistent storage, retrieval, and analysis using query languages like SQL.

The concept of structured data dates back to ancient times. Around 3200 BCE, the Sumerians developed organized record-keeping systems on clay tablets to track taxes, agricultural yields, and trade activities. In modern enterprises, structured data encompasses everything from customer records to detailed sales figures and demographic information.

Structured data is optimized for computer processing, enabling efficient analysis and manipulation. Universal standards, such as specific date formats and measurement units, ensure consistency and compatibility across different systems. The specific components of structured data vary according to industry and organizational needs. For example, a pharmaceutical company’s datasets will differ significantly from those of a retail business, each tailored to capture attributes essential to their sector and business processes.

Within organizations, structured data serves as the backbone of data-driven insights. Customer relationship management (CRM), enterprise resource planning (ERP), and point of sale (POS) systems generate structured data that supports various business functions:

  • In the financial sector, transaction records and account information enable real-time fraud detection by analyzing patterns and anomalies. Credit scoring models rely on indicators like credit history and income levels to assess risk. You can read more about how Mastercard uses generative AI to protect transactions against emerging threats.
  • Managing extensive inventories and forecasting demand depend on structured data in retail. Product information and pricing are maintained in databases for real-time updates, while POS data facilitates demand forecasting. Sales transactions help predict revenue, identify trends, and optimize pricing strategies. A real-world example is IKEA's GPT-powered AI assistant, which enables customers to explore the catalog using natural language, providing product information as well as real-time availability across stores.
  • Within manufacturing, efficiency and quality control rely on structured data. Production metrics, machine performance data, and defect rates are recorded as structured data that is analyzed to identify bottlenecks and improve processes. Structured data also supports predictive maintenance, minimizing downtime and extending machinery lifespan. For example, Siemens has enhanced its Senseye Predictive Maintenance solution by incorporating generative AI functionalities. "In the app, generative AI can scan and group cases, even in multiple languages, and seek similar past cases and their solutions to provide context for current issues. It’s also capable of processing data from different maintenance software."

These use cases highlight that structured data is essential for quantitative analysis, requiring precision and accuracy. Despite its value to an organization, structured data can be hard to access, even for data analysts. For example, a large data warehouse could contain thousands of tables with poorly documented schemas.


Semi-structured data

Semi-structured data combines elements of both structured and unstructured data. It includes tags or markers that define data hierarchies and relationships without enforcing a rigid structure. Examples are XML files, JSON documents, log files, PDF reports with tables, and email messages. Formats like Avro and Parquet are commonly used to store semi-structured data, enabling easy conversion and integration with various data processing tools.

Here are some of the sources of semi-structured data:

  • Log files generated by web servers that capture user interactions, page views, and system events
  • Mobile applications producing JSON data for tracking feature usage, user behavior, and app performance metrics
  • Email systems containing structured headers (e.g., sender, recipient, and timestamp) alongside unstructured body content
  • Product reviews that include metadata—like product IDs, store locations, timestamps, and customer information—along with unstructured review text.

A lot of data that typically gets classified as unstructured can be made semi-structured, which improves the quality of insights it can provide. Here are some industry use cases:

  • Automating contract renewal notifications based on expiration dates or specific legal clauses enhances efficiency. Extracting key terms and conditions from contracts streamlines legal workflows and reduces manual intervention.
  • Within procurement, tracking and automating the purchase-to-pay process using semi-structured data like invoices and purchase orders speeds up operations.
  • Processing sensor data from smart devices for predictive maintenance and system optimization leverages the semi-structured nature of IoT data, facilitating real-time decision-making and operational efficiency.
  • Analyzing XML or JSON files containing patient data (like vital signs) alongside clinician notes supports personalized health recommendations, enhancing patient care.

Unstructured data

Projections indicate that by 2025, the growth rate of data will be equivalent to creating a new Google every four days, with at least 80% of this data being unstructured. Unstructured data lacks a specific format or organization, making it challenging to parse and analyze. It includes content such as webpage text, PDF documents, social media posts, audio files, videos, and images. This data is inherently human-centric, generated through natural activities like writing, speaking, holding meetings, and creating multimedia content.

Despite its abundance, only 18% of businesses have a strategy to manage their unstructured data, highlighting the competitive advantage of having such a strategy. Generative AI has revolutionized the handling of unstructured data, enabling organizations to extract insights from previously untapped information sources.

Unstructured data analysis often finds applications in search, summarization, and generative tasks:

  • In the human resources field, searching for specific company policies or onboarding processes becomes more efficient with generative AI applications, enabling the retrieval of relevant information.
  • AI applications can analyze unstructured legal documents to identify the most relevant documents, improving the accuracy and efficiency of legal research. Identifying patterns and generating summaries can also aid in case law analysis and decision-making processes. 
  • Within the automotive industry, companies are leveraging terabytes of video data to develop training scenarios for autonomous driving and connected car technologies, enhancing vehicle intelligence and safety features.
  • Animation studios utilize their existing media assets to create and enhance new creative projects, streamlining content production and innovation.

Preparing structured data

Traditional machine learning models have long relied on structured data—typically stored in tables or relational databases—for tasks such as classification, regression, and prediction. This structured format allows for specific features to be extracted and fed into algorithms optimized for tabular data. 

In contrast, large language models (LLMs) are primarily trained on unstructured data and are optimized for sequential processing. Their training focuses on natural language, so they lack native support for inferring relationships within structured datasets that consist mostly of numerical data. Simply feeding structured data into LLMs without adaptation does not yield meaningful insights. These models struggle with numerical calculations and with interpreting column names, table relationships, and unique identifiers unless provided with additional context or fine-tuning.

To utilize structured data in generative AI applications, the data must be adapted for use with LLMs, or the LLMs need to be equipped with structured data analysis capabilities. Two main approaches have emerged for building AI applications on top of structured data: embedding structured data and generating query code.

Embedding structured data for vector search

The first approach involves converting structured data into a vector representation that preserves the relationships and semantics of the original data. Once in vector form, structured data can be used for similarity-based querying, clustering, and basic analysis. 

Note that this approach cannot be applied to all structured data use cases. It works well for prompts that find and filter specific records based on similarity but is less suited to obtaining statistical summaries or aggregate insights from entire structured datasets. For example, in a database containing customer information, vector embeddings can help identify customers with similar purchasing behaviors or preferences, enabling targeted recommendations and opening new possibilities for data analysis and AI applications. But it won't produce accurate results for a query like "What were the total sales across North America in 2024?" This limitation highlights the need for a complementary approach, such as generating code to query structured data directly.
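
As a minimal sketch of this pattern, the example below serializes a few hypothetical customer records into text, embeds them, and retrieves the most similar records for a natural-language query. The `embed()` function here is a toy hashed bag-of-words stand-in; in practice you would call a real embedding model.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashed bag of words)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Hypothetical structured customer records.
customers = [
    {"id": 1, "segment": "retail", "avg_order_value": 54.2, "top_category": "electronics"},
    {"id": 2, "segment": "wholesale", "avg_order_value": 830.0, "top_category": "furniture"},
    {"id": 3, "segment": "retail", "avg_order_value": 61.8, "top_category": "electronics"},
]

def record_to_text(row: dict) -> str:
    # Serialize each structured record into text so a text-embedding model can encode it.
    return ", ".join(f"{k}: {v}" for k, v in row.items())

vectors = np.vstack([embed(record_to_text(c)) for c in customers])

def most_similar(query: str, k: int = 2) -> list[dict]:
    q = embed(query)
    scores = vectors @ q  # cosine similarity, since the vectors are normalized
    return [customers[i] for i in np.argsort(-scores)[:k]]

print(most_similar("retail customers who mostly buy electronics"))
```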

Generating code to query structured data

The second approach focuses on generating code to query structured data. In this method, LLMs convert user inputs in natural language into code, such as SQL queries that can be executed against databases. For example, a user might ask: “What was the average price of electronics sold last month?” To generate the correct SQL, the model must understand which tables contain sales and product information, identify the “price” column, interpret “last month” in terms of date ranges, and apply appropriate aggregations, such as an AVG function or a GROUP BY clause.
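
As an illustration, here is a sketch of that translation. The schema is hypothetical, and the SQL shown is the kind of query a text-to-SQL model would be expected to return for this question (PostgreSQL-style date functions assumed), not the output of any particular model.

```python
# Hypothetical schema supplied to the model as context.
schema = """
CREATE TABLE products (product_id INT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sales    (sale_id INT PRIMARY KEY,
                       product_id INT REFERENCES products(product_id),
                       price NUMERIC, sold_at DATE);
"""

question = "What was the average price of electronics sold last month?"
prompt = f"Given this schema:\n{schema}\nWrite a SQL query to answer: {question}"

# The kind of query the model should produce for the question above.
expected_sql = """
SELECT AVG(s.price) AS avg_price
FROM sales AS s
JOIN products AS p ON p.product_id = s.product_id
WHERE p.category = 'Electronics'
  AND s.sold_at >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month'
  AND s.sold_at <  date_trunc('month', CURRENT_DATE);
"""
```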

Generating accurate code requires the model to have a deep understanding of the database schema, including table relationships, primary keys, and joins. Challenges include understanding user intent and context, handling ambiguity in column names, and producing syntactically correct and efficient queries. Models often struggle with vague or ambiguous queries, but context-aware LLMs and metadata tagging can help resolve these ambiguities by clarifying user intent. The importance of metadata is explored in the following section.

There are two main approaches to training LLMs for domain-specific code generation tasks:

  1. Fine-tuning the model on domain data.
  2. Using off-the-shelf LLMs supplemented with relevant context, knowledge, or semantics to improve accuracy.

The first approach involves using documentation, sample data definition language (DDL) statements, and labeled pairs of natural language prompts and corresponding SQL queries to train the model. One pitfall of this approach is that the model overfits to task-specific knowledge. Depending on your use case, periodic retraining might be required to keep the business knowledge and data up to date, which can quickly become cost-prohibitive.

The second approach avoids these limitations by using off-the-shelf text-to-SQL models and enhancing their performance with external context. Advanced models like Claude 3.5 Sonnet and GPT-4o already achieve impressive accuracy in text-to-SQL generation. A semantic layer can further improve the SQL generation accuracy of such off-the-shelf LLMs by supplementing the prompt with business context. A semantic layer unifies data from various sources into a cohesive model tailored to your business, creating a central “source of truth.” Its main tasks are translating, standardizing, resolving identities, ensuring data consistency, and establishing a foundation that standardizes key terms and concepts. For example, it enforces a unified definition of “customer” or “client” across platforms like email marketing, website analytics, and sales. This approach is useful for vague prompts where business context is required to generate an accurate query.
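
To make the idea concrete, here is a minimal sketch of how business definitions from a semantic layer could be injected into a text-to-SQL prompt. The term definitions, table names, and matching logic are all hypothetical; a real semantic layer is richer and maintained outside application code.

```python
# Hypothetical semantic-layer entries: business terms mapped to canonical definitions.
semantic_layer = {
    "customer": "A distinct person or organization in crm.customers; contacts in "
                "email_marketing and visitors in web_analytics map to it via customer_id.",
    "revenue": "SUM(sales.price * sales.quantity), net of amounts in sales.refunds.",
    "last quarter": "The most recent complete calendar quarter.",
}

def build_prompt(question: str) -> str:
    # Naive keyword match; production systems retrieve relevant definitions more carefully.
    relevant = {term: desc for term, desc in semantic_layer.items() if term in question.lower()}
    context = "\n".join(f"- {term}: {desc}" for term, desc in relevant.items())
    return ("Business definitions:\n" + context +
            "\n\nUsing these definitions, write a SQL query to answer:\n" + question)

print(build_prompt("What revenue did new customers generate last quarter?"))
```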

WisdomAI builds upon this second approach by introducing a Context Layer. The Context Layer combines the advantages of a semantic layer with additional context on when and how to use the information captured in the semantic layer.  Within the WisdomAI platform, the Context Layer is an automatically created knowledge graph that captures enterprise language, common SQL patterns, and business and user awareness. We have covered the concept of Context Layers in detail in the text-to-SQL article.

Embedding structured data for vector search
  • Approach: Converts structured data into vector representations.
  • Use cases: Finding similar records or patterns within the data; identifying customers with similar behaviors; understanding relationships and patterns in data.
  • Limitations: Doesn't perform well with total calculations or summaries; sensitive to inconsistencies and requires clean data; integrating multiple data sources can be challenging.
  • Best practices: Entity embeddings can help map related entities across data sources; information theory techniques can help quantify relationships between data fields (still maturing).

Generating code to query structured data
  • Approach: Translates natural language inputs into executable code (e.g., SQL) to query databases directly.
  • Use cases: Generating precise responses for specific queries; performing statistical analyses and generating reports; handling joins, nested subqueries, and conditional logic.
  • Limitations: May struggle without clear user intent or context; requires updates to remain accurate when changing data schemas; retraining and resource needs can be prohibitive.
  • Best practices: Semantic layers add business context to improve query accuracy; retrieval-augmented generation techniques can enhance prompts with additional information like schema; Context Layers (e.g., WisdomAI) can provide additional context on when and how to use information.

Table summarizing the two approaches to preparing structured data

Importance of metadata

High-quality metadata is essential for accurate code generation in AI applications working with structured data. Well-defined metadata ensures that AI models can map natural language inputs to the correct schema fields and tables. This is especially important when working with schemas where the same field name, such as “date,” might exist in multiple contexts. By utilizing metadata, the AI model can disambiguate the correct field based on context. Our text-to-SQL article explores the role of metadata and building a semantic layer in more detail.
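
Here is a small sketch of that idea, with hypothetical column descriptions: several tables each contain a date column, and the metadata supplied alongside the question is what lets the model pick the right one.

```python
# Hypothetical column-level metadata used to disambiguate identically named fields.
column_metadata = {
    "orders.date": "Date the order was placed by the customer.",
    "shipments.date": "Date the order left the warehouse.",
    "invoices.date": "Date the invoice was issued for the order.",
}

def schema_hints(question: str) -> str:
    """Format column descriptions as context so the model can choose the right 'date'."""
    lines = "\n".join(f"- {col}: {desc}" for col, desc in column_metadata.items())
    return f"Column descriptions:\n{lines}\n\nQuestion: {question}"

print(schema_hints("How many orders were placed last week?"))
# With these descriptions, the model should filter on orders.date rather than
# shipments.date or invoices.date.
```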

With this context in mind, achieving accurate code generation requires data preparation steps that include:

  • Standard data cleaning techniques like handling missing values, eliminating duplicates, and normalizing data formats to ensure high-quality inputs
  • Metadata management, involving well-defined metadata such as table definitions and relationships, to guide LLMs in understanding and querying structured data

Preparing semi-structured data

Semi-structured data contains both structured elements (e.g., numeric values or categorical attributes) and unstructured components (e.g., free-form text). The flexibility of semi-structured data makes it useful for various applications, but it also complicates data parsing and standardization. 

A common mistake is to process semi-structured data as if it were purely unstructured. This leads to missed insights and suboptimal results. A more effective approach is to decompose semi-structured data into structured and unstructured components, process them separately, and then recombine them strategically for downstream applications. The challenge lies in recombining these two modalities through an approach that captures the relationships and context inherent in the data. In this section, we explore some potential approaches you can take.

Concatenation with metadata

One method is to store the structured part of the data as metadata along with embeddings of the unstructured data in a vector database. When relevant embeddings are retrieved, the associated structured metadata is retrieved with them and can be concatenated or combined with the retrieved content before being passed to the model.

This approach maintains context by combining metadata (e.g., timestamp, user ID, or location) with embeddings, enhancing both semantic search and code generation. 

For example, if a user queries “critical server errors in the last week,” structured fields like timestamp and severity help filter relevant logs, while semantic embeddings of the free-text descriptions improve relevance.
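
A minimal sketch of this layout follows, using a plain Python list as a stand-in for a vector database and a toy hashed bag-of-words `embed()` in place of a real embedding model; the log entries and fields are hypothetical.

```python
import numpy as np
from datetime import datetime

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashed bag of words)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Each stored entry keeps the embedding of the free text plus its structured metadata.
raw_logs = [
    ("disk latency spiked on db-01, writes timing out",
     {"timestamp": datetime(2025, 1, 14, 3, 2), "severity": "critical", "host": "db-01"}),
    ("scheduled backup completed successfully",
     {"timestamp": datetime(2025, 1, 13, 1, 0), "severity": "info", "host": "db-02"}),
]
entries = [{"text": t, "vector": embed(t), "metadata": m} for t, m in raw_logs]

def retrieve(query: str, since: datetime, severity: str, k: int = 3) -> str:
    q = embed(query)
    # Filter on structured metadata, rank by semantic similarity of the free text.
    hits = [e for e in entries
            if e["metadata"]["timestamp"] >= since and e["metadata"]["severity"] == severity]
    hits.sort(key=lambda e: -float(e["vector"] @ q))
    # Combine each hit's metadata with its text before handing it to the model.
    return "\n".join(
        f"[{e['metadata']['timestamp']:%Y-%m-%d %H:%M} {e['metadata']['host']}] {e['text']}"
        for e in hits[:k])

print(retrieve("critical server errors", since=datetime(2025, 1, 8), severity="critical"))
```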

Hybrid search

Hybrid search integrates keyword-based search on structured data with semantic search on unstructured data. For instance, consider building a hybrid search for customer support logs where each entry has structured fields (e.g., ID, issue type, and customer ID) and an unstructured description of the issue. By combining structured and unstructured indices, this approach can handle complex queries like “show all high-priority network issues reported in the last week.”

In one hybrid search approach, the keyword search filters the relevant subset of logs, while the unstructured embeddings rank the results based on the semantic similarity of the issue descriptions. Systems like Elasticsearch, Pinecone, and Weaviate support multiple approaches to hybrid search—depending on your use case, there are many parameters available to customize how you implement it.
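
The sketch below shows one way this two-stage pattern can work, using an in-memory list of hypothetical support logs and the same toy `embed()` stand-in; a real implementation would run the keyword filter in a search engine or database and the similarity ranking against a vector index.

```python
import numpy as np
from datetime import datetime, timedelta

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashed bag of words)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

logs = [
    {"id": 1, "issue_type": "network", "priority": "high",
     "created": datetime.now() - timedelta(days=2),
     "description": "VPN tunnel drops intermittently for remote offices"},
    {"id": 2, "issue_type": "billing", "priority": "high",
     "created": datetime.now() - timedelta(days=1),
     "description": "Invoice totals do not match purchase orders"},
]

def hybrid_search(query: str, issue_type: str, priority: str, days: int, k: int = 5):
    since = datetime.now() - timedelta(days=days)
    # Stage 1: structured/keyword filter narrows the candidate set.
    candidates = [l for l in logs
                  if l["issue_type"] == issue_type
                  and l["priority"] == priority
                  and l["created"] >= since]
    # Stage 2: semantic similarity of the free-text descriptions ranks the survivors.
    q = embed(query)
    return sorted(candidates, key=lambda l: -float(embed(l["description"]) @ q))[:k]

print(hybrid_search("network issues", issue_type="network", priority="high", days=7))
```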

A high-level view of a simple hybrid search pipeline (source)

Unified embedding space

Another strategy is to create separate vector representations for structured and unstructured components and integrate them into a unified embedding space. For example, consider a customer support log with structured fields (like issue type and timestamp) and an unstructured detailed description. Here, the structured part can be encoded using techniques like one-hot encoding, while the unstructured part is embedded using transformer models. These vectors can then be concatenated or combined using techniques like attention mechanisms to be represented in the same embedding space. 
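
Here is a minimal sketch of that fusion with hypothetical fields: the issue type is one-hot encoded, a numeric field is scaled, the description is embedded (a toy text encoder stands in for a transformer model), and the pieces are concatenated into a single vector.

```python
import numpy as np

def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a transformer text encoder (hashed bag of words)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

ISSUE_TYPES = ["network", "billing", "hardware"]

def one_hot(issue_type: str) -> np.ndarray:
    v = np.zeros(len(ISSUE_TYPES))
    v[ISSUE_TYPES.index(issue_type)] = 1.0
    return v

def unified_vector(issue_type: str, hour_of_day: int, description: str) -> np.ndarray:
    structured = np.concatenate([one_hot(issue_type), [hour_of_day / 24.0]])  # encoded + scaled
    unstructured = embed_text(description)
    # Simple concatenation; attention-based fusion or a learned projection are alternatives.
    return np.concatenate([structured, unstructured])

vec = unified_vector("network", 14, "VPN tunnel drops intermittently for remote offices")
print(vec.shape)  # (3 + 1 + 64,) = (68,)
```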

Representation of a unified embedding space for semi-structured data

Building data prep pipelines for semi-structured data extraction

Regardless of the approach, the first step is to create a pipeline to extract and standardize the information. A typical semi-structured data pipeline involves the following:

  • Schema extraction: Parsing the semi-structured format to identify structured elements like tags, attributes, and relationships. For example, a JSON file representing a product catalog might have fields like product ID, category, and description.
  • Data standardization: Ensuring that all structured fields conform to a consistent schema. This may involve converting date formats, handling missing values, and deduplication. For unstructured fields, standardization might include lowercasing, punctuation removal, and lemmatization.
  • Feature engineering: Creating embeddings for text fields and numerical features for structured fields.

While these pipelines effectively extract information, they can become complex and resource-intensive as data size grows.
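
A compact sketch of these three steps on a single hypothetical product-catalog record follows; the field names and formats are invented for illustration.

```python
import json
from datetime import datetime

raw = ('{"product_id": "SKU-123", "category": "Electronics", "added": "03/15/2024", '
       '"description": "  Noise-cancelling HEADPHONES with 30h battery. "}')

# Schema extraction: parse the structured fields out of the semi-structured format.
record = json.loads(raw)

# Data standardization: consistent date format, trimmed and lowercased text, missing values handled.
record["added"] = datetime.strptime(record["added"], "%m/%d/%Y").date().isoformat()
record["description"] = record["description"].strip().lower()
record.setdefault("price", None)

# Feature engineering: keep structured fields as features/metadata and embed the free text
# with the model of your choice, e.g. text_vector = embed(record["description"]).
print(record)
```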

Preparing unstructured data

Generative AI excels at processing unstructured data. The key to leveraging unstructured data in generative AI applications lies in transforming it into a format that AI models can efficiently process and understand. This typically involves parsing documents, converting them into vector embeddings, and making these embeddings independently searchable.

The figure below provides a high-level overview.

High-level flow showing the steps to prepare unstructured data for use with AI

The first step is loading and parsing data from diverse sources. This data may come in various formats, including PDFs, Word documents, HTML pages, and multimedia files, each requiring custom handling. Challenges in this phase include dealing with multiple file formats, managing different encodings and languages, and efficiently processing large volumes of data. For example, extracting text from a PDF differs from parsing HTML content. Ensuring consistent encoding and properly handling multimodal content prevent data corruption and loss of information.
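
As a sketch of format-specific loading, the helper below dispatches on file extension, assuming the pypdf and beautifulsoup4 packages for PDF and HTML parsing; production loaders also need to handle images, scanned documents (OCR), and other formats.

```python
from pathlib import Path
from pypdf import PdfReader          # assumes the pypdf package is installed
from bs4 import BeautifulSoup        # assumes beautifulsoup4 is installed

def load_text(path: str) -> str:
    """Dispatch on file type; each format needs its own parser."""
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        reader = PdfReader(p)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if p.suffix.lower() in {".html", ".htm"}:
        return BeautifulSoup(p.read_text(encoding="utf-8"), "html.parser").get_text(" ", strip=True)
    return p.read_text(encoding="utf-8", errors="replace")  # plain-text fallback
```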

After loading, the data often needs to be divided into smaller, more manageable pieces, typically called “chunks.” This step is important for maintaining context and improving searchability. There are many strategies for chunking, and this article provides an excellent summary if you are taking a DIY approach. Alternatively, APIs like Unstructured provide ready-made ETL workflows for getting unstructured data ready for your LLM-based applications.
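
If you do roll your own, a fixed-size chunker with overlap is the simplest starting point; the sketch below is a bare-bones character-based version, and real pipelines usually split on sentence, paragraph, or section boundaries instead.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

document = "Your parsed document text goes here. " * 100
print(len(chunk_text(document)), "chunks")
```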

Embedding involves transforming these text chunks into numerical vectors that capture their semantic meaning. These vectors enable AI models to perform tasks like similarity searches and semantic understanding. 

Key considerations for embedding

Essential considerations include selecting an appropriate embedding model, balancing the dimensionality of embeddings with computational resources, and optimizing the embedding process through batch processing. The choice of embedding model significantly impacts the quality of the vectors—options range from general-purpose models like OpenAI’s text embedding models to specialized models tailored for specific domains or languages, such as jina-embeddings-v3.

Higher-dimensional embeddings can capture more detailed semantic information but require more computational resources for storage and processing. Processing text chunks in batches can accelerate the embedding process, and optimizing batch size based on hardware capabilities can lead to significant time savings. The Massive Text Embedding Benchmark (MTEB) leaderboard lets you compare embedding models and choose one based on your needs and available resources.
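
As a sketch, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (one of the smaller general-purpose options listed on the MTEB leaderboard), batch encoding looks like this:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

chunks = [
    "Employees may carry over up to five unused vacation days.",
    "Expense reports must be submitted within 30 days of purchase.",
    # ... the rest of your prepared chunks
]

# Encoding in batches amortizes model overhead; tune batch_size to your hardware.
embeddings = model.encode(chunks, batch_size=64, normalize_embeddings=True)
print(embeddings.shape)  # (number_of_chunks, 384)
```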

Finally, the original content and its corresponding embeddings must be stored and indexed for retrieval later. Vector databases like Pinecone and Weaviate are optimized for storing embeddings and performing similarity searches. Properly indexing embeddings enhances retrieval speed and accuracy while storing additional metadata—such as source information, timestamps, or categories—allows for more refined searches and filtering options.
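
The sketch below uses the open-source faiss library as an in-process stand-in for a managed vector database, with random placeholder vectors and a parallel metadata list; dedicated stores like Pinecone and Weaviate keep vectors and metadata together and handle filtering for you.

```python
import faiss
import numpy as np

dim = 384
embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder vectors for illustration
faiss.normalize_L2(embeddings)                            # so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)  # exact search; ANN index types trade accuracy for scale
index.add(embeddings)

# Metadata kept in a parallel list, keyed by vector position.
metadata = [{"source": f"doc_{i}.pdf", "chunk": i, "category": "policy"} for i in range(1000)]

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
print([metadata[i] for i in ids[0]])
```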

Retrieval-Augmented Generation

In addition to transforming unstructured data into a format suitable for AI models, data preparation plays a key role in the retrieval stage by enhancing the accuracy of the results obtained from the model. The RAG method combines retrieval-based systems with generative models to produce more accurate and contextually relevant outputs. In the RAG approach, the prepared and embedded data chunks are retrieved based on their relevance to a given prompt. These retrieved chunks are then used to augment the original prompt, providing additional context for the generative AI model.
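
Here is a minimal sketch of the retrieve-then-augment step, reusing a toy hashed bag-of-words `embed()` in place of a real embedding model and a two-chunk toy corpus; the augmented prompt would then be sent to whichever generative model you use.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashed bag of words)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Previously prepared chunks and their embeddings (toy corpus).
chunks = [
    "Employees may carry over up to five unused vacation days into the next year.",
    "Expense reports must be submitted within 30 days of the purchase date.",
]
chunk_vectors = np.vstack([embed(c) for c in chunks])

def rag_prompt(question: str, k: int = 2) -> str:
    q = embed(question)
    top = np.argsort(-(chunk_vectors @ q))[:k]          # retrieve the most relevant chunks
    context = "\n".join(f"- {chunks[i]}" for i in top)  # augment the prompt with them
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = rag_prompt("How many vacation days can I carry over?")
print(prompt)  # this augmented prompt is what gets sent to the generative model
```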

Additional considerations

Applying metadata filters during retrieval can further narrow the search space, ensuring that only the most pertinent chunks are considered. For example, you can extract metadata features to represent document hierarchy, which could be useful for context-aware chunking in RAG architecture. This is an additional step in data preparation that could improve performance metrics.

To further enhance the accuracy of the AI model’s responses, various prompting techniques can be employed alongside RAG. These techniques involve providing the model with examples or encouraging it to articulate its reasoning process, which can lead to more accurate and insightful results. However, while these methods are valuable, the foundation of AI model performance relies on good input data preparation.


Recommendations for AI data preparation 

Traditionally, a lot of time and effort goes into the data preparation stage in AI and machine learning projects. Data engineers have to implement strategic processes to optimize data for analysis. This involves defining clear use cases, identifying and categorizing data required for those use cases, and establishing data pipelines and workflows to transform the data for downstream processes.

With advancements in generative AI and the emergence of tools like WisdomAI, you can now minimize the need for such extensive data preparation. Modern AI platforms utilize the reasoning and natural language capabilities of LLMs to work with data in its existing form.

Instead of building extensive data pipelines and workflows for your generative AI applications, WisdomAI establishes semantic layers or context layers to supplement AI functionalities. This abstraction enables AI to handle different data formats with context about your business while maximizing the use of your existing data assets.

Continue to define clear goals and objectives to guide your AI efforts. Adopting a collaborative approach with business stakeholders keeps your projects aligned with business needs. Even with minimal data preparation, setting business metrics and KPIs ensures that the AI models generate relevant and actionable insights. You can also use user feedback to train your AI application so that it becomes more aligned with the business use case.

Adopt data orchestration and automation to streamline your processes. Automation simplifies the data preparation process and makes complex data analysis accessible to a broader audience.

Last thoughts on AI data preparation

Effective data preparation is the cornerstone of successful AI and ML projects. With generative AI applications, the emphasis is shifting from intensive data preparation to intelligent data handling.

Building an end-to-end AI application can be resource-intensive and tedious at the enterprise level. This is where platforms like WisdomAI help streamline the process. With a user-friendly no-code interface, WisdomAI exemplifies how advanced AI models and data engineering can transform data analytics. The platform uses customized LLMs and a context layer to translate user prompts into SQL queries, integrates metadata to resolve ambiguities, and generates tailored insights. 

For a typical business analyst, a question like “What were the top-performing products last quarter?” can be converted into a precise SQL query that joins sales, inventory, and customer feedback tables from multiple applications to generate actionable insights. This level of automation powered by agentic workflows simplifies the data analysis process, reduces manual intervention, and makes complex data analysis accessible to a broader audience. All common data storage systems can be integrated into the platform, and the models continuously learn from queries to make inferences more accurate and relevant.

The future of AI in your organization relies on adopting technologies that minimize preparation effort and maximize analytical output. By utilizing tools that handle your organizational data in its existing form, you can generate value from your AI initiatives faster.
