From RAG To Fine-Tuning, Data Management Enriches AI Results

By Timothy Prickett Morgan

COMMISSIONED: As an IT leader, you know that data management is one of the critical ingredients for cultivating successful AI initiatives. One does not simply walk into an AI implementation without properly curating and cleaning data and making it available to employees who need it.

High-quality data, of course, is critical for getting better business outcomes out of GenAI models and, really, any AI workload. Those workloads are set to grow 25 percent to 35 percent per year through 2027, Bain & Company estimates.

Organizations must navigate some learning curves around managing data as AI workloads become increasingly distributed, subject to the whims of data gravity. Data must be prepared to ensure accuracy and the utmost mobility; this requires IT to build new data management muscles.

Good data management can be a competitive differentiator, just as the lack of it can be an albatross weighing down your AI fortunes. Let's lean into the trends that will help you get your data house in order as you navigate this new AI frontier.

Few techniques for manipulating GenAI models in meaningful ways garnered more attention in 2024 than retrieval-augmented generation (RAG).

RAG is a technique that allows organizations to funnel corporate data and other external data sources through GenAI models to enrich prompt results. A company might apply RAG on the back end of a chatbot employees can query for information about corporate travel and expense policies, or about paid time off and family leave.
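The retrieval step behind such a chatbot can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the policy snippets, the `DOCUMENTS` store and the function names are all hypothetical, and simple word-overlap cosine similarity stands in for the embedding search a real deployment would likely use. The sketch stops at building the prompt; the call to a GenAI model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical policy snippets standing in for a corporate knowledge base.
DOCUMENTS = {
    "travel": "Employees must book flights through the travel portal "
              "and submit expense receipts within 30 days.",
    "pto": "Full-time employees accrue paid time off each month "
           "and may carry over unused days.",
    "family_leave": "Family leave is available to new parents "
                    "after the birth or adoption of a child.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, k=1):
    """Return the names of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        DOCUMENTS,
        key=lambda name: cosine(q, Counter(tokenize(DOCUMENTS[name]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, k=1):
    """Prepend the retrieved text to the question before it goes to a model."""
    context = "\n".join(DOCUMENTS[name] for name in retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


print(build_prompt("How do I submit expense receipts for flights?"))
```

The point of the pattern is that the model itself is untouched: only the prompt changes, so keeping the document store current is what keeps the answers current.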

Yet for all the data riches organizations have enjoyed from RAG, it isn't the answer for every AI need. As such, some companies may lean on fine-tuning to refine models. In fine-tuning, organizations continue training a pre-trained model on additional data sets, such as customer service transcripts, to increase its knowledge of domain-specific language and customer support issues.

Fine-tuning isn't perfect; too much of it can lead to overfitting, in which the model becomes too specialized to accommodate new data. Fine-tuning also requires more computing power and AI expertise than RAG, but more organizations will take the leap as they use it to extract maximum value.
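One common guard against overfitting is early stopping: hold back a validation set, track its loss after each training pass, and stop once that loss stops improving. The sketch below shows only that decision logic, under the assumption that a fine-tuning job reports one validation loss per epoch; the function name, the `patience` parameter and the loss values are illustrative, not taken from any particular training framework.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive epochs, a typical sign the model has begun to overfit.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Illustrative numbers: validation loss falls, then climbs as the model
# starts memorizing its fine-tuning data.
losses = [0.90, 0.70, 0.60, 0.62, 0.68, 0.75]
print(early_stop_epoch(losses))
```

In this made-up run the checkpoint from the third epoch (index 2) would be kept, since every later epoch does worse on data the model has not seen.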

Some organizations may even blend RAG and fine-tuning. For example, a legal firm might fine-tune an LLM on its internal legal documents to comprehend specific legal jargon and procedures, while using RAG to ensure the model accesses the latest case law and statutes.

AI techniques that get the most out of models are nice, but they're only as good as the data processed and made available to them. This is where novel data architectures enter the picture.

Data warehouses and lakes have long served organizations well, but data fabrics, data meshes and data lakehouses are on the rise. Each approach seeks to abstract the complexity of managing data storage as unstructured data such as text, images and video is soaring, thanks to GenAI.

Each approach also has its nuances.

A data fabric stitches together structured and unstructured data across various formats and systems. In data meshes, storage assets are often managed discretely and made available via self-service to engineers. Data lakehouses store data in an open format while structuring it when queried.
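The lakehouse idea of storing data in an open format and structuring it when queried is often called schema-on-read. A toy version, assuming raw records are kept as JSON lines and a schema is applied only at query time, might look like the following; the records, the `SCHEMA` mapping and the function name are hypothetical, and real lakehouses typically rely on columnar table formats and query engines rather than hand-rolled parsing.

```python
import json

# Raw records are stored as-is in an open text format (JSON lines here).
# Values arrive as strings and fields may be missing.
RAW_RECORDS = [
    '{"id": "1", "amount": "19.99", "region": "emea"}',
    '{"id": "2", "amount": "5", "note": "no region recorded"}',
]

# The schema is supplied by the reader, not enforced when data is written.
SCHEMA = {"id": int, "amount": float, "region": str}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and cast each requested field at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field_name, cast in schema.items():
            value = record.get(field_name)
            # Missing fields become None instead of failing the whole read.
            row[field_name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


for row in read_with_schema(RAW_RECORDS, SCHEMA):
    print(row)
```

Because the schema lives with the query rather than the storage, a different team could read the same raw records with a different set of fields without rewriting the data.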

While a lakehouse stores and processes vast amounts of data in one repository, fabrics and meshes operate like tapestries, comprising multiple storage systems woven together. Data lakehouses also typically leverage centralized governance for managing data, while fabrics and meshes emphasize decentralized governance.

Some organizations may take a hybrid approach in which they combine elements of two or more approaches; which tack they take will depend on their technical needs and governance requirements.

As organizations scale up proofs-of-concept to production, they are solidifying strategies that help secure the organization.

This includes education tiers - say, from 101 levels to 201 and 301 levels of instruction on how to work with GenAI models - along with processes and techniques that help organizations protect their corporate IP.

IT infrastructure chiefs are running models on-premises to better control model outputs. Data engineers meanwhile explore using diverse data sets, collaborative filtering and automated auditing tools to reduce bias and improve accuracy, as well as model monitoring to detect anomalies and unauthorized use.

Model explainability is gaining traction. For instance, organizations are embracing interpretability techniques such as LIME or SHAP to generate insights into the decision-making process of LLMs.
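The intuition behind such techniques can be shown without either library. The sketch below is not LIME or SHAP; it is a simpler leave-one-out (occlusion) attribution, which removes one input token at a time and records how much the score changes. The `toy_score` function is a hypothetical stand-in for a model, and the vocabulary it uses is invented for illustration.

```python
def toy_score(tokens):
    """Hypothetical stand-in for a model: counts words from an 'urgent' list."""
    urgent_words = {"refund", "broken", "cancel"}
    return sum(1.0 for token in tokens if token in urgent_words)


def leave_one_out(tokens, score_fn):
    """Attribute the score to each token by removing it and re-scoring.

    Returns a list of (token, contribution) pairs, where contribution is
    the drop in score caused by deleting that token.
    """
    base = score_fn(tokens)
    attributions = []
    for i, token in enumerate(tokens):
        without_token = tokens[:i] + tokens[i + 1:]
        attributions.append((token, base - score_fn(without_token)))
    return attributions


message = ["please", "refund", "my", "broken", "order"]
for token, contribution in leave_one_out(message, toy_score):
    print(f"{token:>8}: {contribution:+.1f}")
```

LIME and SHAP refine this idea with sampling and weighting schemes, but the output has the same shape: a per-feature estimate of how much each input contributed to the result.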

Governance policies are extending beyond technical approaches to human oversight, moving from the casual ("these outputs look okay") to codified human-in-the-loop processes.
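What "codified" might mean in practice is a written rule that decides when a person must sign off. As a rough sketch, assume each model output arrives with some confidence score between 0 and 1; that score, the `ReviewGate` class and the 0.8 threshold are all assumptions made for illustration, since how confidence is estimated varies by system.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewGate:
    """Routes low-confidence outputs to a human review queue."""
    threshold: float = 0.8
    review_queue: list = field(default_factory=list)

    def route(self, output, confidence):
        """Auto-approve confident outputs; queue the rest for a person."""
        if confidence >= self.threshold:
            return "auto_approved"
        self.review_queue.append(output)
        return "needs_human_review"


gate = ReviewGate(threshold=0.8)
print(gate.route("Draft reply A", confidence=0.95))
print(gate.route("Draft reply B", confidence=0.40))
print(gate.review_queue)
```

The value of writing the rule down is that it can be audited and tuned, for example by lowering the threshold for low-risk use cases and raising it where errors are costly.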

Ensuring that LLMs are aligned with the organization's goals and values is table stakes for good governance.

Your AI success depends on the quality of the corporate data you feed it.

Unfortunately, the data houses at most IT shops may be messier than they should be, as poor data quality remains one of the biggest challenges confronting organizations' generative AI strategies.

In 2024, 55 percent of organizations avoided certain GenAI use cases due to data-related issues, Deloitte research found. Such organizations must shore up their data estates to properly take advantage of GenAI initiatives.

How will you augment your data management strategy to get the most out of your AI projects?

Learn more about the Dell AI Factory.
