Ensuring LLMs can be augmented with relevant contextual or business data safely and securely is critical to capitalizing on the diverse data created by individuals and organizations, and to ensuring that the benefits of LLMs are shared equitably.

This body of work focuses on both technical and legal/policy solutions, addressing problems ranging from winner-take-all model dynamics to cryptographically secure private retrieval augmented generation.


Secure Technologies

Private Retrieval Augmented Generation (PRAG): A Technology for Securely Using Private Data

Article in ICLR Submission

This paper introduces Private Retrieval Augmented Generation (PRAG), a novel method that uses multi-party computation (MPC) to securely fetch information from distributed databases for use in large language models without compromising privacy. It also presents a new MPC-friendly protocol for inverted file search, enabling fast and private document retrieval in a decentralized manner.
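As a loose illustration of the core idea (not PRAG's actual protocol), additive secret sharing lets several parties jointly compute a relevance score for a query without any single party ever seeing the query itself:

```python
import random

PRIME = 2**61 - 1  # field modulus for additive secret sharing

def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME

# Toy retrieval step: score a secret-shared query against a public document
# vector. Each party computes a partial dot product from its shares alone;
# only the combined result reveals the score, never the query.
query = [3, 1, 4]  # private query vector (kept secret from every party)
doc = [2, 0, 5]    # public document vector

n = 3
query_shares = [share(q, n) for q in query]  # one share list per coordinate

# Party i computes its partial score using only its own shares.
partials = [
    sum(query_shares[j][i] * doc[j] for j in range(len(doc))) % PRIME
    for i in range(n)
]

score = reconstruct(partials)
assert score == sum(q * d for q, d in zip(query, doc))  # 3*2 + 1*0 + 4*5 = 26
```

Linearity is what makes this work: the dot product distributes over the shares, so partial results combine to the true score without reconstructing the query.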


Guy Zyskind*, Tobin South*, Alex 'Sandy' Pentland

Secure Community Transformers: Private Pooled Data for LLMs

Read the Whitepaper

This paper discusses how large language models (LLMs) can effectively utilize small-scale data for insights across various sectors like consulting, education, healthcare, and law. It presents a framework for navigating trade-offs related to privacy, performance, and transparency, and argues that retrieval augmented generation (RAG) provides a more flexible, efficient, and auditable approach than traditional fine-tuning methods.
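The retrieval-augmented approach the paper favors can be sketched in miniature. The bag-of-words "embedding" below is a stand-in for a real embedding model, and the documents are invented:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "patient records must remain within the hospital network",
    "contract clauses governing data sharing between firms",
    "lecture notes on retrieval augmented generation",
]

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Retrieved context is prepended to the prompt rather than baked into
# model weights, which is what keeps the pipeline updatable and auditable.
question = "how should a hospital handle patient records"
context = retrieve(question, documents)
prompt = f"Context: {context[0]}\n\nQuestion: {question}"
```

Because the pooled data lives in a retrievable store rather than in fine-tuned weights, individual records can be inspected, updated, or deleted without retraining.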


Tobin South, Guy Zyskind, Robert Mahari, Alex 'Sandy' Pentland


Data and Law

Data Provenance Initiative

Explore the Data

The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models. As a first step, we've traced more than 1,800 popular text-to-text fine-tuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata for researchers to explore using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.
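A toy illustration of the kind of metadata filtering such a catalog enables; the records, field names, and license list here are hypothetical, not the initiative's actual schema:

```python
# Hypothetical, simplified metadata records; the real catalog tracks many
# more fields (creators, data sources, languages, and so on).
datasets = [
    {"name": "dataset_a", "license": "Apache-2.0", "source": "web crawl"},
    {"name": "dataset_b", "license": "CC-BY-NC-4.0", "source": "forum posts"},
    {"name": "dataset_c", "license": "MIT", "source": "code repositories"},
]

# Illustrative shortlist of licenses that permit commercial use.
PERMISSIVE = {"Apache-2.0", "MIT", "CC-BY-4.0"}

def commercially_usable(records):
    """Return names of datasets whose license is on the permissive list."""
    return [r["name"] for r in records if r["license"] in PERMISSIVE]

print(commercially_usable(datasets))  # ['dataset_a', 'dataset_c']
```

Structured provenance metadata turns license compliance from a manual audit into a simple query like this one.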


Shayne Longpre, Robert Mahari, et al.

Legal Perspective on Auditability and Updateability

Article in Network Law Review

A key design element of the Secure Community Transformers project is that models remain auditable and updatable, key requirements in the wake of GDPR and CCPA. This builds toward a framework of transparency by design, a topic explored from a legal perspective in the piece published in Network Law Review.


Robert Mahari*, Tobin South*, Alex 'Sandy' Pentland


Policy Commentary

Competition Between AI Foundation Models: Dynamics and Policy Recommendations

Working Paper on SSRN

Generative AI is set to become a critical technology for our modern economies. While we are currently experiencing strong, dynamic competition between the underlying foundation models, legal institutions have an important role to play in ensuring that the spring of foundation models does not turn into a winter, with an ecosystem frozen by a handful of players.


Thibault Schrepel, Alex 'Sandy' Pentland

Unlocking the Power of Digital Commons: Data Cooperatives as a Pathway for Data Sovereign, Innovative and Equitable Digital Communities

Management of Digital Ecosystems

This paper argues that data cooperatives can democratize digital resources and promote entrepreneurship, particularly for SMEs in small communities. It presents case studies to illustrate the transformative potential of such cooperatives and proposes a policy framework to support their practical implementation globally.


Bühler et al.

Want to see how this all fits together?

Prof. Pentland gave a Special Lecture on Engineering and Society at the National Academy of Engineering (NAE) Annual Meeting.


Read and Watch: How AI Can Help Predict Human Behavior and Accelerate Solutions to Societal Challenges