Abstract

Large Language Models (LLMs) such as ChatGPT and LLaMA have demonstrated immense potential across a wide range of applications. However, these models primarily reflect the broader internet and do not inherently account for the nuances of specific communities or private data. We introduce the "Secure Community Transformers: Private Pooled Data for LLMs" project, a novel approach to augmenting LLMs with private community and personal data in a secure, privacy-preserving manner. By combining traditional privacy transformations, LLM-enabled privacy transformations, trusted execution environments, custodial control of data, and consent-based privacy choices, we enable the continuous updating of community data within a privately hosted LLM, resulting in a tailored Q&A tool that reflects community values and individual circumstances.


Our solution addresses the limitations of LLMs that stem from their reliance on historical public data and their lack of secure contextualization. The Community Transformers project empowers communities and organizations to securely and privately pool local data, enabling LLMs to provide contextually relevant answers tailored to the specific needs of the community. This approach not only enhances the utility of LLMs but also ensures the protection of sensitive community and personal information.

Tobin South, Guy Zyskind, Robert Mahari, Thomas Hardjono, Alex 'Sandy' Pentland

Technical Architecture

[Figure: technical architecture diagram]

To keep data from individuals and community stakeholders private, the entire question-answering pipeline must be secured. Self-custodial hosting, in which the community retains control of its own infrastructure, plays an essential role in keeping data local, encrypted, and secure.
In addition to securing data at rest, we draw on a long history of secure computation over community data to extract local insights, using tools such as secure multi-party computation (MPC), private set intersection, and trusted ML inference. Some of these design elements pose computational or latency challenges, and we welcome public feedback on the model's architecture.
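
To convey the flavor of the private set intersection step, the Python sketch below compares HMAC-blinded membership lists so that two communities can estimate their overlap without exchanging raw identifiers. This is a toy illustration under simplifying assumptions: the shared key, example identifiers, and `blind` helper are hypothetical, and a production deployment would use an OPRF-based PSI protocol so that neither party can brute-force the other's set.

```python
import hashlib
import hmac
import secrets

def blind(items: set[str], key: bytes) -> set[str]:
    """Blind each identifier with a keyed HMAC so raw values never leave
    their owner; only digests are exchanged and compared."""
    return {hmac.new(key, item.encode(), hashlib.sha256).hexdigest()
            for item in items}

# Toy setup: both parties share a secret key (in practice it would come
# from a key exchange, or be replaced entirely by an OPRF evaluation).
shared_key = secrets.token_bytes(32)

community_a = {"alice@example.org", "bob@example.org", "carol@example.org"}
community_b = {"bob@example.org", "dave@example.org"}

# Each party blinds its own set locally before anything is shared.
overlap = blind(community_a, shared_key) & blind(community_b, shared_key)
print(f"Shared members: {len(overlap)}")  # -> Shared members: 1
```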

Safety & Privacy

When integrating community data with LLMs, it is crucial to address safety challenges head-on, ensuring privacy and data security while still delivering tailored benefits.

  • Preventing Data Leakage:

    Stop sensitive community or organizational data from leaking to the public internet. Community data should be community-owned.

  • Local Hosting:

    Ensure private or sensitive data is never uploaded to commercial APIs, where it could be retained or used for model training; see the local-hosting sketch after this list.

  • Within-Community Privacy:

    Access sensitive data only in privacy-preserving forms, such as aggregates, differentially private statistics, or private set intersections; see the differential-privacy sketch after this list.

  • Private Queries:

    Create methods that enable individuals to perform private queries without compromising personal or community privacy.
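
As a minimal sketch of the local-hosting requirement, the snippet below runs a question-answering prompt through a locally stored open model via the Hugging Face `transformers` pipeline, so neither the query nor the community context leaves the machine. The model path and prompt are illustrative assumptions, not part of the project's published stack.

```python
from transformers import pipeline

# Load an open model from local disk (the path is a hypothetical
# checkpoint); no data is sent to a commercial API at any point.
qa = pipeline(
    "text-generation",
    model="./models/llama-2-7b-chat",  # hypothetical local checkpoint
    device_map="auto",
)

prompt = (
    "Context: <retrieved community documents>\n"
    "Question: What support resources does our community offer?\n"
    "Answer:"
)
print(qa(prompt, max_new_tokens=128)[0]["generated_text"])
```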
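
And as one concrete form of within-community privacy, the sketch below releases a community statistic through the Laplace mechanism, which satisfies epsilon-differential privacy for counting queries (sensitivity 1). The records, predicate, and epsilon budget here are hypothetical placeholders, not project data.

```python
import numpy as np

def dp_count(records, predicate, epsilon: float = 1.0) -> float:
    """Release a count under epsilon-differential privacy. A counting
    query has sensitivity 1, so Laplace(0, 1/epsilon) noise suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical pooled community records.
records = [
    {"used_support_service": True},
    {"used_support_service": False},
    {"used_support_service": True},
]

noisy = dp_count(records, lambda r: r["used_support_service"], epsilon=0.5)
print(f"Noisy count of service users: {noisy:.1f}")
```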

By addressing these challenges, we can create a robust framework for the secure and responsible use of community-enhanced LLMs, one that prioritizes privacy and data security.

Next Steps

The proposed system holds great promise for addressing key challenges faced by communities, and we are eager to test its effectiveness in real-world settings. Potential applications include support tools for communities with distinctive mental-health needs, such as the LGBTQ+ community, where shared experiences and challenges can greatly improve a support chatbot's usefulness, and local physical communities, where insights could be extracted by pooling local data. We believe the system can address key challenges for many types of communities, and we would love to partner with communities interested in participating. If you know of a community that could benefit from this project, please contact us.

Bibtex Citation

@misc{South23CommunityTransformers,
	title={Secure Community Transformers: Private Pooled Data for LLMs},
	author={South, Tobin and Zyskind, Guy and Mahari, Robert and Hardjono, Thomas and Pentland, Alex 'Sandy'},
	year={2023},
	url={transformers.mit.edu}
}