Ensuring that LLMs can be augmented with relevant contextual or business data safely and securely is critical to capitalizing on the diverse data created by individuals and organizations and to sharing the benefits of LLMs equitably.
This body of work covers both technical and legal/policy solutions, addressing challenges ranging from winner-take-all model dynamics to cryptographically secure private retrieval augmented generation.
Secure Technologies
Private Retrieval Augmented Generation (PRAG): A Technology for Securely Using Private Data

This paper introduces Private Retrieval Augmented Generation (PRAG), a novel method that uses multi-party computation (MPC) to securely fetch information from distributed databases for use in large language models without compromising privacy. It also presents a new MPC-friendly protocol for inverted file search, enabling fast and private document retrieval in a decentralized manner.
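For intuition about the kind of technique involved, the toy Python sketch below shows additive secret sharing applied to similarity scoring: a query embedding is split into random shares so that no single server sees it, yet summing the servers' partial scores reconstructs the true result. This is only an illustrative simplification under assumed names and a two-server setup; it is not the paper's actual MPC protocol or its inverted file search construction.

```python
# Illustrative sketch only: toy additive secret sharing for private similarity
# scoring. Names and the two-server setup are assumptions for illustration,
# not the PRAG protocol itself.
import numpy as np

rng = np.random.default_rng(0)

# Document embeddings held by two non-colluding servers.
docs = rng.normal(size=(5, 4))      # 5 documents, 4-dim embeddings

# The client's private query embedding.
query = rng.normal(size=4)

# Split the query into two additive shares; each share alone is random noise.
share_a = rng.normal(size=4)
share_b = query - share_a

# Each server scores the documents against only its share of the query.
scores_a = docs @ share_a           # computed by server A
scores_b = docs @ share_b           # computed by server B

# The client adds the partial results to recover the true similarity scores,
# without either server ever seeing the full query.
scores = scores_a + scores_b
assert np.allclose(scores, docs @ query)
print("top document:", int(np.argmax(scores)))
```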
Guy Zyskind*, Tobin South*, Alex 'Sandy' Pentland
Secure Community Transformers: Private Pooled Data for LLMs
This paper discusses how large language models (LLMs) can effectively utilize small-scale data for insights across various sectors like consulting, education, healthcare, and law. It presents a framework for navigating trade-offs related to privacy, performance, and transparency, and argues that retrieval augmented generation (RAG) provides a more flexible, efficient, and auditable approach than traditional fine-tuning methods.
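For readers unfamiliar with retrieval augmented generation, the sketch below illustrates the basic pattern the paper compares against fine-tuning: relevant pooled documents are retrieved by embedding similarity and inserted into the prompt at query time, so the underlying model never has to be retrained on the private data. The embedding function and corpus here are placeholders, not the project's implementation.

```python
# Minimal, generic RAG sketch (not the Secure Community Transformers code):
# retrieve the most relevant pooled documents and prepend them to the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice a sentence-embedding model is used."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

corpus = [
    "Community guideline on data sharing.",
    "Notes from the member onboarding session.",
    "Policy on retention of health records.",
]
corpus_vecs = np.stack([embed(d) for d in corpus])

def build_prompt(question: str, k: int = 2) -> str:
    q = embed(question)
    # Cosine similarity between the question and each pooled document.
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    context = "\n".join(corpus[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How long are health records kept?"))
```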
Tobin South, Guy Zyskind, Robert Mahari, Alex 'Sandy' Pentland
Data and Law
Data Provenance Initiative

The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models. As a first step, we've traced 1800+ popular text-to-text fine-tuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata for researchers to explore using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.
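As a concrete illustration of what tracing a dataset "from origin to creation" can look like, the sketch below defines a hypothetical provenance record. The field names and values are assumptions for illustration only, not the initiative's actual schema.

```python
# Hypothetical example of the kind of provenance metadata catalogued per dataset.
# Field names and values are illustrative assumptions, not the initiative's schema.
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    name: str
    source_urls: list[str]
    license: str
    creators: list[str]
    languages: list[str] = field(default_factory=list)
    task_categories: list[str] = field(default_factory=list)

record = DatasetProvenance(
    name="example-instruction-dataset",
    source_urls=["https://example.org/dataset"],
    license="CC-BY-4.0",
    creators=["Example Lab"],
    languages=["en"],
    task_categories=["instruction-following"],
)
print(record)
```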
Shayne Longpre, Robert Mahari, et al.
Legal Perspective on Auditability and Updatability

A key design element of the Secure Community Transformers project is that models are auditable and updatable, core requirements in the wake of the GDPR and CCPA. This contributes to a framework of transparency by design, a topic explored from a legal perspective in the piece published in the Network Law Review below.
Robert Mahari*, Tobin South*, Alex 'Sandy' Pentland
Policy Commentary
Competition Between AI Foundation Models: Dynamics and Policy Recommendations

Generative AI is set to become a critical technology for our modern economies. While we are currently experiencing strong, dynamic competition between the underlying foundation models, legal institutions have an important role to play in ensuring that the spring of foundation models does not turn into a winter, with an ecosystem frozen by a handful of players.
Thibault Schrepel, Alex 'Sandy' Pentland
Unlocking the Power of Digital Commons: Data Cooperatives as a Pathway for Data Sovereign, Innovative and Equitable Digital Communities

Management of Digital Ecosystems
This paper argues that data cooperatives can democratize digital resources and promote entrepreneurship, particularly for SMEs in small communities. It presents case studies to illustrate the transformative potential of such cooperatives and proposes a policy framework to support their practical implementation globally.
Bühler et al.