What fuels the AI revolution that's transforming our world? At the heart of every AI breakthrough lies an often-overlooked hero: data. Yet organizations face an unprecedented challenge in managing the lifeblood of their AI systems.
Let's explore this and uncover the strategies that can lead to success.
Understanding the Foundation: Why Data Matters
Imagine building a skyscraper without first ensuring a solid foundation. Similarly, attempting to develop AI systems without proper data management is a recipe for failure. Today's AI systems require massive amounts of high-quality data, particularly during their crucial training phase. This isn't just about quantity; it's about ensuring every piece of data contributes meaningfully to the system's performance and accuracy.
The Three Pillars of AI Training Data
- First-Party Data: Your Organization's Most Valuable Asset
First-party data is like your organization's personal diary: the data collected directly from the individuals who interact with you.
It tells the authentic story of how your customers interact with your products and services.
Unlike other data sources, it provides unfiltered insights into customer behavior. For example, when an e-commerce platform tracks how customers navigate its site, it gathers precious information about shopping patterns, preferences, and potential pain points. This data becomes invaluable when training AI systems to enhance customer experience.
Because these data types (text, numbers, images, and audio) come directly to the company, it is easier to register their sources. Mapping these sources serves two crucial purposes: it creates a clear inventory of available data types and alerts legal teams to specific compliance requirements.
This systematic documentation ensures organizations can both maximize data value and maintain legal compliance across all collection points. For example, voice data might trigger different privacy regulations than website analytics, requiring tailored compliance approaches. This structured approach to source documentation ultimately enables organizations to build responsible AI systems while managing legal risks effectively.
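To make this concrete, here is a minimal sketch, in Python, of what such a source inventory could look like. The schema, field names, and example entries are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSource:
    """One entry in a first-party data source inventory (illustrative schema)."""
    name: str                 # e.g. "web_analytics"
    data_types: List[str]     # e.g. ["text", "numbers"]
    collection_point: str     # where the data enters the company
    legal_basis: str          # e.g. "consent" (assumed label)
    compliance_notes: List[str] = field(default_factory=list)

# Hypothetical inventory entries used purely for illustration
inventory = [
    DataSource(
        name="web_analytics",
        data_types=["text", "numbers"],
        collection_point="e-commerce website",
        legal_basis="legitimate interest",
        compliance_notes=["cookie banner required"],
    ),
    DataSource(
        name="voice_support_calls",
        data_types=["audio"],
        collection_point="customer support line",
        legal_basis="consent",
        compliance_notes=["voice data may trigger stricter privacy rules"],
    ),
]

# A simple report the legal team could review: which sources carry audio data?
for source in inventory:
    if "audio" in source.data_types:
        print(f"{source.name}: review compliance notes -> {source.compliance_notes}")
```

Even a lightweight inventory like this gives data and legal teams a shared view of what is collected, where, and under which basis.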
- Public Data: Navigating the Open Seas
Public data is like a vast ocean of information, available to the wider public. From government records to academic publications and web data, this data source offers tremendous opportunities but comes with its own set of challenges.
One of the most significant challenges organizations face with public data is maintaining its lineage, because it is often unclear where the data originally came from. That gap can hide important clues about why a training dataset, and the model trained on it, behaves the way it does. Without proper tracking, organizations risk training their AI systems on biased, outdated or inappropriate data. This can lead to what's known as the "black-box problem," where we lose transparency in understanding how our AI systems make decisions.
Additionally, there is a risk of data leakage: without proper knowledge of a public data source, the company might train the AI system on personal, sensitive or proprietary data, which could then be exposed through the system's outputs.
Last but not least, there is a concern about AI security and safety: data from an unsafe public source can introduce malicious or manipulated content into the AI system, affecting its bias and accuracy.
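To illustrate lineage tracking in a minimal way, the sketch below records where a public dataset came from and which checks were run before it is approved for training. The fields, the placeholder URL, and the `provenance_log.jsonl` file name are assumptions for the example.

```python
import json
from datetime import datetime, timezone

def record_provenance(dataset_name, source_url, license_name, checks,
                      log_path="provenance_log.jsonl"):
    """Append a provenance record for a public dataset (illustrative fields)."""
    record = {
        "dataset": dataset_name,
        "source_url": source_url,
        "license": license_name,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # Checks performed before the data is approved for training
        "checks": checks,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record

# Hypothetical example: logging a public dataset before use
record_provenance(
    dataset_name="city-open-data-2023",
    source_url="https://example.org/open-data",   # placeholder URL
    license_name="CC-BY-4.0",
    checks={"contains_personal_data": False, "freshness_reviewed": True},
)
```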
- Third-Party Data: Bridging the Knowledge Gap
Third-party data is obtained or licensed by the company from an external third-party entity that collects and sells this data. When data brokers weave together information from various sources, they create a tapestry of insights that can complement your first-party data. However, the data needs careful inspection before you trust it.
Think about a retail company that wants to understand broader market trends. While their first-party data might tell them about their own customers' behaviors, third-party data can reveal how those patterns compare to the wider market. This broader perspective is invaluable, but it comes with a caveat: the data might not always be as precise or relevant as information collected directly from your customers.
Third-party data can also include open-source data, which sometimes does not follow the same collection and distribution practices as your company.
The Quality Imperative: Building Trust Through Excellence
As we mentioned before, it is highly important to know where the data comes from, how it was collected, in which context it is meant to be used, and what rights the company has to use it.
Data quality is fundamental to building trust in your AI systems. Imagine serving a gourmet meal made with subpar ingredients; no amount of cooking expertise can overcome the basic quality issue. The same principle applies to AI systems.
Accuracy, Completeness, Validity and Consistency: The Truth in Numbers
- Accuracy: accuracy in data goes beyond simple correctness. It's about ensuring that your data reflects real-world insights accurately. For instance, when collecting customer feedback, it's not enough to just gather responses; you need to verify that the feedback comes from genuine customers and accurately represents their experiences.
Apart from improving model performance, data quality is documented to support transparency, explainability, data fairness, auditability, understanding of the data phase of the lifecycle, and system performance.
How can we improve it? The following dimensions show what to check; a short sketch follows the list.
- Completeness: refers to checking for missing values, determining the usability of the data, and looking for any over- or underrepresentation in the data sample.
- Validity: ensures data is in a format compatible with its intended use. This may include valid data types, metadata, ranges and patterns.
- Consistency: refers to the relationships between data from multiple sources and includes checking whether the data shows consistent trends and values across those sources.
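As a rough illustration of these checks, the sketch below runs basic completeness, validity, and consistency tests on a small table using pandas. The column names, example values, and thresholds are assumptions made for the example, not a standard.

```python
import pandas as pd

# Hypothetical customer feedback table used purely for illustration
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, None],
    "age": [34, 29, 131, 45, 52],          # 131 is outside a plausible range
    "country": ["US", "US", "JP", "US", "JP"],
    "rating": [5, 4, 3, None, 2],
})

# Completeness: share of missing values per column
missing_share = df.isna().mean()
print("Missing value share per column:\n", missing_share)

# Completeness: over/underrepresentation in the sample
print("Country distribution:\n", df["country"].value_counts(normalize=True))

# Validity: values must fall inside an expected range (assumed bounds)
invalid_ages = df[(df["age"] < 0) | (df["age"] > 120)]
print("Rows with invalid ages:\n", invalid_ages)

# Consistency: ratings should use the same 1-5 scale across sources
out_of_scale = df[~df["rating"].between(1, 5) & df["rating"].notna()]
print("Ratings outside the 1-5 scale:\n", out_of_scale)
```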
The Appropriate Use of Data
One of the most overlooked aspects of data management is ensuring data fits its intended purpose. It's like trying to use a road map to navigate the ocean – the data might be perfectly accurate for its original purpose but completely unsuitable for your needs. In this case, it can skew your AI system outcomes.
Consider a company that collects customer data in New York trying to use that same data to understand customer behavior in Tokyo. While the data collection methods might be sound, the cultural and demographic differences could make the insights irrelevant or even misleading.
Therefore, not all data needs to be used all the time. It's important to determine whether it is even necessary to collect and use certain data in your AI model, or whether to leave it aside. Keeping unnecessary data can increase your company's risk of harm from your AI system.
Legal and Ethical Considerations: Navigating the Regulatory Landscape
To protect data, different regulations were created across various jurisdictions:
First-Party Data Collection: Understanding GDPR in Data Protection
When collecting first-party data, the European Union's General Data Protection Regulation (GDPR) has set a gold standard for data protection: a comprehensive rulebook for handling personal data, with six lawful bases for processing: consent, contractual performance, vital interests, legal obligations, public tasks, and legitimate interests pursued by the controller or a third party. You can learn more in the full text of the GDPR.
Public Data: Copyright and Licensing Issues
With public data collection, web scraping requires careful compliance with websites' terms of service and privacy policies to maintain legal data collection practices. When working with public datasets containing personal information, organizations must obtain proper consent and follow data protection regulations.
Web scrapers can inadvertently collect copyrighted material, requiring additional legal consideration before using such data for AI training.
Similarly, open-source data, while publicly available, comes with specific licensing requirements that organizations must follow. Beyond licensing compliance, organizations should conduct thorough due diligence on open-source datasets. This includes verifying the data's lawful acquisition, assessing its safety for use and evaluating potential biases that could affect AI system performance.
Establishing clear protocols for data validation helps organizations maintain compliance while maximizing the value of public and open-source data. Regular audits of data collection methods and sources ensure ongoing compliance with legal requirements and licensing agreements.
For example, when scraping e-commerce websites, organizations must respect robots.txt files, rate limits, and terms of service while ensuring they don't capture protected customer information. Similarly, when using open-source datasets, they should document license compliance and conduct bias assessments before incorporation into AI training.
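As a small illustration of the scraping hygiene described above, the sketch below uses Python's standard-library robots.txt parser to check whether given paths may be fetched, and applies a simple rate limit. The site, user-agent name, and delay are placeholders, not recommendations.

```python
import time
from urllib import robotparser

# Placeholder site and crawler name used purely for illustration
BASE_URL = "https://example.com"
USER_AGENT = "my-research-bot"
CRAWL_DELAY_SECONDS = 2          # assumed polite delay between requests

parser = robotparser.RobotFileParser()
parser.set_url(f"{BASE_URL}/robots.txt")
parser.read()  # fetches and parses the site's robots.txt

paths_to_check = ["/products", "/checkout", "/blog"]
for path in paths_to_check:
    allowed = parser.can_fetch(USER_AGENT, f"{BASE_URL}{path}")
    print(f"{path}: {'allowed' if allowed else 'disallowed'} for {USER_AGENT}")
    if allowed:
        # A real scraper would fetch the page here; this sketch only honors the delay
        time.sleep(CRAWL_DELAY_SECONDS)
```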
Third-Party Data Regulations
Organizations are advised to carry out careful legal due diligence when working with third-party data brokers. This process starts with verifying that the broker collected personal data lawfully and extends to reviewing all contractual obligations and licenses. Organizations must identify any protected intellectual property within the datasets to avoid infringement risks.
Before using licensed data, organizations need to secure proper usage rights through formal licensing agreements. These agreements establish clear data ownership and provenance tracking, preventing disputes over data usage rights. The terms of these licenses will continue to govern how the organization can use this data throughout the AI system's lifecycle.
For instance, if an organization licenses customer demographic data, they must verify the broker's collection methods, ensure compliance with privacy laws, and maintain detailed records of permitted uses. The license terms may restrict certain AI applications or require specific data handling protocols, making thorough understanding of these agreements crucial for ongoing compliance.
This structured approach to legal due diligence helps organizations maintain compliance while maximizing the value of third-party data. Regular audits of license terms and usage patterns ensure continued alignment with legal requirements throughout the AI development process.
Proper documentation of these legal reviews and licensing agreements creates a clear audit trail, protecting organizations from potential liability while ensuring efficient use of licensed data resources and data ownership.
Implementing Effective Data Management: A Strategic Approach
Data Management Planning
Effective data management is crucial when developing and deploying AI systems, particularly with the rise of generative AI that combines data from multiple sources. Organizations should develop comprehensive data management plans that track data lineage, usage rights and compliance requirements throughout the AI lifecycle.
A robust data management plan addresses key elements: data source tracking, collection methods, retention policies, disposal procedures, consent management and clear oversight responsibilities. For example, when training a language model, organizations should document the origin of each training dataset, verify usage rights, and maintain records of data processing steps.
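Purely for illustration, here is a minimal sketch of how the key elements of such a plan might be captured for a single training dataset. The structure and field names are assumptions, not a mandated format.

```python
# A minimal, illustrative data management plan entry for one training dataset.
# Field names are assumptions made for this sketch, not a standard schema.
training_data_plan = {
    "dataset": "support-chat-logs-2024",          # hypothetical dataset name
    "source": "first-party customer support tool",
    "collection_method": "exported chat transcripts with PII redaction",
    "usage_rights": "internal model training only, per customer consent",
    "retention_policy": "raw logs kept 24 months, then deleted",
    "disposal_procedure": "secure deletion with written confirmation",
    "consent_management": "opt-out requests honored within 30 days",
    "oversight": {"owner": "data-governance-team", "review_cycle": "quarterly"},
    "processing_steps": ["deduplication", "PII redaction", "language filtering"],
}

# A quick check an oversight team might run: are any required elements missing?
required = ["source", "usage_rights", "retention_policy",
            "disposal_procedure", "consent_management", "oversight"]
missing = [key for key in required if not training_data_plan.get(key)]
print("Missing plan elements:", missing or "none")
```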
While many organizations already have data management practices, AI systems require additional considerations. These practices should integrate with existing workflows while addressing AI-specific challenges like model training data documentation and bias monitoring. The plan must cover each lifecycle stage, from data collection through model deployment and monitoring.
Companies new to data management can leverage established frameworks like Harvard Biomedical Data Management guidelines. Industry standards such as ISO 8000 for data quality provide concrete benchmarks and controls. Current initiatives by NIST, ISO/IEC and other standards bodies are developing AI-specific data management guidelines.
Most importantly, the data management plan needs to adapt to evolving AI technologies and regulatory requirements. Regular reviews and updates ensure the plan remains effective as organizational needs and industry standards change.
The Role of Data Labels: Bringing Transparency to AI
Data labels are like the ingredients list on a food package: they tell you exactly what's inside and how it was prepared. Data labels provide detailed tracking of data collection methods and usage in model training. These transparency artifacts document the rationale behind data selection and explain how data influences training, design and development processes.
These labels verify data fitness for purpose by documenting demographic representation and quality standards compliance. They form an essential part of robust data management, working alongside quality assessments and impact evaluations to ensure comprehensive oversight.
Beyond documentation, data labels support ongoing assessment and review processes. For example, when evaluating model performance, labels help trace issues back to specific training data sources or identify potential bias in demographic representation.
Companies should integrate data labeling into their broader data management framework to avoid duplicate efforts. This integration helps maintain consistent documentation standards and streamlines the review process across different teams and projects.
Effective data-source maintenance through labels and inventories enables organizations to track data origins and conduct necessary legal due diligence, whether dealing with first-party or third-party data sources. This systematic approach ensures both compliance and efficient data management throughout the AI system lifecycle.
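By way of illustration, a data label can be represented as a small machine-readable record stored alongside each dataset. The fields below mirror the points in this section; their names and values are assumptions made for the sketch.

```python
import json

# Illustrative data label ("ingredients list") for one training dataset.
# The field names are assumptions for this sketch, not a formal standard.
data_label = {
    "dataset": "customer-reviews-v3",            # hypothetical dataset name
    "collection_method": "in-app review form",
    "selection_rationale": "covers all product categories sold in 2024",
    "used_in": ["sentiment-model-training", "evaluation"],
    "demographic_representation": {
        "regions": {"EU": 0.45, "US": 0.40, "APAC": 0.15},
    },
    "quality_standards": ["completeness >= 98%", "deduplicated"],
    "known_limitations": ["underrepresents customers without smartphones"],
}

# Serialize the label so it can be stored and reviewed alongside the dataset
print(json.dumps(data_label, indent=2))
```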
Dedicated Processes and Functions for Third-Party Providers
Building on the due diligence described earlier, organizations benefit from dedicated processes and clearly assigned functions for onboarding, reviewing and monitoring third-party data providers throughout the AI lifecycle.
Best Practices for Long-term Success
Documentation: The Foundation of Accountability
As you have understood by now, having proper documentation from the start is fundamental. Even though you might not see its value every day, you will be glad it exists when you need it. Organizations should maintain detailed records of:
- Where their data comes from (with clear provenance tracking)
- How it's been processed and transformed
- Who has accessed it and for what purpose
- What quality checks have been performed
This documentation is used when auditing AI systems or addressing concerns about bias or accuracy.
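As a sketch of what such records could look like in practice, the snippet below keeps a simple audit trail in a local SQLite table covering provenance, transformations, access and quality checks. The schema, file name and example events are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative audit trail kept in a local SQLite table; the schema is an assumption.
conn = sqlite3.connect("data_audit.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS data_audit (
        ts TEXT, dataset TEXT, actor TEXT, action TEXT, detail TEXT
    )
""")

def log_event(dataset, actor, action, detail):
    """Record who touched which dataset, when, and why."""
    conn.execute(
        "INSERT INTO data_audit VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), dataset, actor, action, detail),
    )
    conn.commit()

# Hypothetical events covering the four record types listed above
log_event("customer-reviews-v3", "ingest-pipeline", "provenance", "imported from in-app review form")
log_event("customer-reviews-v3", "etl-job-42", "transformation", "deduplicated and PII-redacted")
log_event("customer-reviews-v3", "analyst-jane", "access", "bias review before training")
log_event("customer-reviews-v3", "quality-bot", "quality-check", "completeness 98.7%")

for row in conn.execute("SELECT ts, actor, action, detail FROM data_audit"):
    print(row)
conn.close()
```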
Building Robust Data Governance
Effective data governance is key to getting sustained value from your data assets. It requires clear policies, defined roles, and established procedures for handling data throughout its lifecycle. For example, when implementing data governance, organizations should establish:
- Clear decision-making hierarchies for data usage
- Regular audit procedures
- Quality control checkpoints
- Incident response protocols
Looking to the Future: Emerging Trends and Considerations
As AI technology continues to evolve, the landscape of data management is changing rapidly. Organizations must stay agile and forward-thinking in their approach.
The Rise of Federated Learning
One emerging trend is federated learning, where AI models are trained across decentralized devices or servers holding local data samples. This approach addresses privacy concerns while still allowing organizations to benefit from diverse data sources.
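To make the idea concrete, here is a toy federated-averaging sketch in NumPy: each simulated device updates a tiny model on its own local data, and only the model parameters, never the raw data, are averaged centrally. The data, model and hyperparameters are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated local datasets held on three separate devices (never shared)
devices = [
    (rng.normal(size=(50, 3)), rng.integers(0, 2, size=50)),
    (rng.normal(size=(80, 3)), rng.integers(0, 2, size=80)),
    (rng.normal(size=(30, 3)), rng.integers(0, 2, size=30)),
]

global_weights = np.zeros(3)  # shared linear model, kept deliberately tiny

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One device trains a logistic-regression-style model on its own data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-features @ w))      # sigmoid
        grad = features.T @ (preds - labels) / len(labels)
        w -= lr * grad
    return w

for round_number in range(10):
    # Each device returns only its updated weights, not its raw data
    local_weights = [local_update(global_weights, X, y) for X, y in devices]
    sizes = np.array([len(y) for _, y in devices], dtype=float)
    # Federated averaging: weight each update by the device's dataset size
    global_weights = np.average(local_weights, axis=0, weights=sizes)

print("Aggregated model weights after 10 rounds:", global_weights)
```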
Automated Data Quality Management
As data volumes grow, manual quality control becomes increasingly impractical. Forward-thinking organizations are implementing automated systems that can (see the sketch after this list):
- Detect anomalies in real-time
- Flag potential bias in training data
- Monitor data quality metrics continuously
- Alert relevant stakeholders to potential issues
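As a simple illustration of automated monitoring, the sketch below flags incoming data batches whose mean drifts too far from a reference sample, using a basic z-score rule. The threshold and the simulated data are assumptions for the example; real systems would monitor many more metrics.

```python
import numpy as np

rng = np.random.default_rng(42)

# Reference statistics computed from historical training data (simulated here)
reference = rng.normal(loc=50.0, scale=5.0, size=10_000)
ref_mean, ref_std = reference.mean(), reference.std()

Z_THRESHOLD = 3.0  # assumed alerting threshold

def check_batch(batch, name):
    """Flag a batch whose mean drifts more than Z_THRESHOLD standard errors."""
    standard_error = ref_std / np.sqrt(len(batch))
    z_score = abs(batch.mean() - ref_mean) / standard_error
    if z_score > Z_THRESHOLD:
        # In production this could page a data steward or open a ticket
        print(f"ALERT: {name} drifted (z={z_score:.1f})")
    else:
        print(f"OK: {name} within expected range (z={z_score:.1f})")

check_batch(rng.normal(loc=50.2, scale=5.0, size=500), "daily_batch_ok")
check_batch(rng.normal(loc=58.0, scale=5.0, size=500), "daily_batch_drifted")
```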
Starting Your Data Management Journey
The path to effective AI data management might seem daunting, but it's a journey that every organization must undertake to remain competitive in the AI-driven future. Start by:
- Assessing your current data management practices
- Identifying gaps in your data governance framework
- Developing a comprehensive data strategy aligned with your AI goals
- Building a culture of data responsibility throughout your organization
Remember, excellence in AI data management isn't achieved overnight – it's built through consistent, thoughtful effort and continuous improvement.
Looking for more Data Protection information? Or maybe Transparency and Explainability?
Need a personalized AI Governance assessment? Just contact us.
Source: IAPP, AI Governance in Practice Report 2024.