MIT, Cohere for AI, others launch platform to track and filter audited AI datasets
Researchers from MIT, Cohere for AI and 11 other institutions launched the Data Provenance Platform today to “tackle the data transparency crisis in the AI space.”
They audited and traced nearly 2,000 of the most widely used fine-tuning datasets, which collectively have been downloaded tens of millions of times and are the “backbone of many published NLP breakthroughs,” according to a message from authors Shayne Longpre, a Ph.D. candidate at MIT Media Lab, and Sara Hooker, head of Cohere for AI.
“The result of this multidisciplinary initiative is the single largest audit to date of AI datasets,” they said. “For the first time, these datasets include tags to the original data sources, numerous re-licensings, creators, and other data properties.”
To make this information practical and accessible, the team built an interactive platform, the Data Provenance Explorer, which allows developers to track and filter thousands of datasets for legal and ethical considerations, and enables scholars and journalists to explore the composition and data lineage of popular AI datasets.
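The Explorer itself is a web interface, but the underlying idea — filtering dataset metadata on license and provenance fields — can be illustrated with a small, purely hypothetical sketch. The column names, dataset names, and license values below are placeholders for illustration, not the platform’s actual schema or data:

```python
import pandas as pd

# Hypothetical metadata table standing in for an audited dataset catalog.
# Real audits would record the original source, license, creator, and lineage.
metadata = pd.DataFrame(
    {
        "dataset": ["dataset_a", "dataset_b", "dataset_c"],
        "license": ["Apache-2.0", "CC-BY-SA-3.0", "Unspecified"],
        "source": ["human-annotated", "human-annotated", "model-generated"],
    }
)

# Keep only datasets whose recorded license is in a permissive allow-list,
# the kind of legal filter a developer might apply before fine-tuning.
permissive = {"Apache-2.0", "MIT", "CC-BY-4.0"}
usable = metadata[metadata["license"].isin(permissive)]

print(usable[["dataset", "license"]])
```

The same approach extends to filtering on creators, original sources, or whether the data was model-generated, which is the kind of lineage information the audit tags onto each dataset.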
Dataset collections do not acknowledge lineage
The group released a paper, The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, which says:
“Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it), and the increased copyright scrutiny. Together, these factors have seen fewer Datasheets, non-disclosure of training sources and ultimately a decline in understanding training data.

“This lack of understanding can lead to data leakages between training and test data; expose personally identifiable information (PII); present unintended biases or behaviours; and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use. As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied.”
Training datasets have been under scrutiny in 2023
VentureBeat has covered issues of data provenance and training-data transparency in depth. Back in March, Lightning AI CEO William Falcon slammed OpenAI’s GPT-4 paper as “masquerading as research.”
Many said the report was notable mostly for what it did not include. In a section called “Scope and Limitations of this Technical Report,” the paper says: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
And in September, we published a deep dive into the copyright issues looming in generative AI training data.
The explosion of generative AI over the past year has become an “oh, shit!” moment when it comes to dealing with the data that trained large language and diffusion models, including massive amounts of copyrighted content gathered without consent, Dr. Alex Hanna, director of research at the Distributed AI Research Institute (DAIR), told VentureBeat.