Ensuring that LLMs can be augmented with relevant contextual or business data safely and securely is critical to capitalizing on the diverse data created by individuals and organizations and to sharing the benefits of LLMs equitably.
This body of work covers both technical and legal/policy solutions, addressing challenges ranging from winner-take-all model dynamics to cryptographically secure private retrieval augmented generation.
Secure Technologies
Private Retrieval Augmented Generation (PRAG): A Technology for Securely Using Private Data

This paper introduces Private Retrieval Augmented Generation (PRAG), a novel method that uses multi-party computation (MPC) to securely fetch information from distributed databases for use in large language models without compromising privacy. It also presents a new MPC-friendly protocol for inverted file search, enabling fast and private document retrieval in a decentralized manner.
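For intuition about the kind of technique involved, the toy Python sketch below shows additive secret sharing applied to similarity scoring: a query embedding is split into random shares so that no single server sees it, yet summing the servers' partial scores reconstructs the true result. This is only an illustrative simplification under assumed names and a two-server setup; it is not the paper's actual MPC protocol or its inverted file search construction.

```python
# Illustrative sketch only: toy additive secret sharing for private similarity
# scoring. Names and the two-server setup are assumptions for illustration,
# not the PRAG protocol itself.
import numpy as np

rng = np.random.default_rng(0)

# Document embeddings held by two non-colluding servers.
docs = rng.normal(size=(5, 4))      # 5 documents, 4-dim embeddings

# The client's private query embedding.
query = rng.normal(size=4)

# Split the query into two additive shares; each share alone is random noise.
share_a = rng.normal(size=4)
share_b = query - share_a

# Each server scores the documents against only its share of the query.
scores_a = docs @ share_a           # computed by server A
scores_b = docs @ share_b           # computed by server B

# The client adds the partial results to recover the true similarity scores,
# without either server ever seeing the full query.
scores = scores_a + scores_b
assert np.allclose(scores, docs @ query)
print("top document:", int(np.argmax(scores)))
```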
Guy Zyskind*, Tobin South*, Alex 'Sandy' Pentland
Secure Community Transformers: Private Pooled Data for LLMs
This paper discusses how large language models (LLMs) can effectively utilize small-scale data for insights across various sectors like consulting, education, healthcare, and law. It presents a framework for navigating trade-offs related to privacy, performance, and transparency, and argues that retrieval augmented generation (RAG) provides a more flexible, efficient, and auditable approach than traditional fine-tuning methods.
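For readers unfamiliar with retrieval augmented generation, the sketch below illustrates the basic pattern the paper compares against fine-tuning: relevant pooled documents are retrieved by embedding similarity and inserted into the prompt at query time, so the underlying model never has to be retrained on the private data. The embedding function and corpus here are placeholders, not the project's implementation.

```python
# Minimal, generic RAG sketch (not the Secure Community Transformers code):
# retrieve the most relevant pooled documents and prepend them to the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice a sentence-embedding model is used."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

corpus = [
    "Community guideline on data sharing.",
    "Notes from the member onboarding session.",
    "Policy on retention of health records.",
]
corpus_vecs = np.stack([embed(d) for d in corpus])

def build_prompt(question: str, k: int = 2) -> str:
    q = embed(question)
    # Cosine similarity between the question and each pooled document.
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    context = "\n".join(corpus[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How long are health records kept?"))
```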
Tobin South, Guy Zyskind, Robert Mahari, Alex 'Sandy' Pentland
Data and Law
Data Provenance Initiative

The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models. As a first step, we've traced 1800+ popular text-to-text fine-tuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata for researchers to explore using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.
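As a concrete illustration of what tracing a dataset "from origin to creation" can look like, the sketch below defines a hypothetical provenance record. The field names and values are assumptions for illustration only, not the initiative's actual schema.

```python
# Hypothetical example of the kind of provenance metadata catalogued per dataset.
# Field names and values are illustrative assumptions, not the initiative's schema.
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    name: str
    source_urls: list[str]
    license: str
    creators: list[str]
    languages: list[str] = field(default_factory=list)
    task_categories: list[str] = field(default_factory=list)

record = DatasetProvenance(
    name="example-instruction-dataset",
    source_urls=["https://example.org/dataset"],
    license="CC-BY-4.0",
    creators=["Example Lab"],
    languages=["en"],
    task_categories=["instruction-following"],
)
print(record)
```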
Shayne Longpre, Robert Mahari, et al.
Legal Perspective on Auditability and Updatability

A key design element of the Secure Community Transformers project is that models are auditable and updatable, core requirements in the wake of the GDPR and CCPA. This contributes to a framework of transparency by design, a topic explored from a legal perspective in the piece published in the Network Law Review below.
Robert Mahari*, Tobin South*, Alex 'Sandy' Pentland
Policy Commentary
Competition Between AI Foundation Models: Dynamics and Policy Recommendations

Generative AI is set to become a critical technology for our modern economies. While we are currently experiencing strong, dynamic competition between the underlying foundation models, legal institutions have an important role to play in ensuring that the spring of foundation models does not turn into a winter, with an ecosystem frozen by a handful of players.
Thibault Schrepel, Alex 'Sandy' Pentland
Unlocking the Power of Digital Commons: Data Cooperatives as a Pathway for Data Sovereign, Innovative and Equitable Digital Communities

Management of Digital Ecosystems
This paper argues that data cooperatives can democratize digital resources and promote entrepreneurship, particularly for SMEs in small communities. It presents case studies to illustrate the transformative potential of such cooperatives and proposes a policy framework to support their practical implementation globally.
Bühler et al.