It Starts With Data

How AI can bring forth a world of responsible opportunities

By Daniela Braga, PhD

Artificial intelligence (AI) may be the trend of the moment, but it is far from new.

AI has been evolving for decades, and its recent surge is now transforming industries and reshaping the way we live. From revolutionizing healthcare with earlier detection of diseases to enhancing productivity and reducing the burden of repetitive tasks, AI offers opportunities to improve quality of life across the globe.

Yet, as AI develops at light speed, the risks it poses cannot be ignored. Governments around the world are racing to draft national AI strategies, almost all of which emphasize the need for ethical and responsible AI. But what does responsible AI truly mean, and where should this responsibility begin?

The answer lies in the foundation of AI itself: data. Data is the lifeblood of every AI model, and the choices made about which data to include, how it is sourced, and how it is used have far-reaching implications. At present, the largest AI models are trained predominantly on “publicly available data.” While this might sound like a fair and neutral approach, it is fraught with challenges that demand urgent attention.

First, publicly available data on the internet is often unfiltered and not always fact-checked, and it reflects the biases and demographics of its contributors. These contributors are disproportionately Western, leading to models that do not fully capture the richness of diverse cultural perspectives. The result is models that perpetuate existing inequalities and fail to understand or adequately serve other contexts.

Second, in a world affected by disinformation and misinformation, it is crucial to have complete traceability of the data used to train AI models. Without transparency, it becomes nearly impossible to verify the authenticity and credibility of AI-generated content. Traceability is not just a technical necessity; it is a safeguard against the misuse of AI in spreading false narratives and deepfakes. If we do not know where the data comes from, we cannot fully rely on the models built upon it. Furthermore, data traceability is essential for regulatory compliance, as governments increasingly demand accountability from AI developers to mitigate the spread of harmful content.

Third, the industry’s practice of treating public data as “fair use” is coming under increasing scrutiny. As AI matures, legislation and litigation are catching up and challenging this practice. Brands that built their early AI models on such data are now facing lawsuits and reputational damage. In response, we are seeing a shift toward closed-source data, where companies negotiate data licenses with publishers, aggregators, and creators. This approach ensures higher-quality data and fair compensation, signaling a more sustainable and ethical future for AI development.

Localization is another critical aspect of responsible AI. Take Arabic, for example, one of the most spoken languages in the world, with over 400 million speakers spread across the globe. Despite its prominence, Arabic is often treated as a “tail language” in the AI landscape, with dialects and linguistic nuances poorly represented in major models. This is not simply a technical oversight but a reflection of deeper systemic issues in AI development. By overlooking the complexity and richness of languages and cultures, AI models fail to serve a significant portion of the global population effectively, leaving them misrepresented.

The Arabic language is composed of a rich mosaic of dialects, which are not accurately represented in the available models for Modern Standard Arabic. This lack of representation is not just a technical gap – it poses a challenge to preserving cultural heritage. These dialects carry the histories, traditions, and identities of their speakers, and without intentional effort, these linguistic treasures risk being lost in the digital age.

To address these gaps, local companies and governments must prioritize AI strategies that reflect their unique linguistic and cultural contexts. This means creating and investing in localized datasets that capture the diversity of the dialects and contexts. It also requires collaboration with local linguists, technologists, and communities to ensure accuracy and cultural sensitivity. Supporting local creators and fostering partnerships that respect the rights and identities of diverse communities are crucial steps toward building AI systems that are truly inclusive.

Ownership of cultural representation and language in AI is crucial. Ensuring that AI systems understand and serve Arabic-speaking communities effectively is also a pathway to unlocking economic and social opportunities. For instance, accurate natural language processing in Arabic could revolutionize access to education and healthcare for millions. As we advance AI technologies, we must ensure they reflect the full spectrum of human diversity, respecting and preserving the unique identities that define us.

As we chart the future of AI, we must remember that its power lies not just in what it can do, but in how it is built. Responsible AI begins with responsible data practices – practices that honor the diversity of our world, uphold fairness, and prioritize collaboration over exploitation. Only then can AI truly fulfill its promise of creating a better, more inclusive future for all.

About The Author

Daniela Braga, PhD, is the founder and CEO of Defined.ai, the leading provider of ethical AI data, offering the world’s biggest ethical AI data marketplace alongside subscriptions for flexible data access and custom services. defined.ai