We present a novel privacy-preserving synthetic data generation algorithm that enables automatic topic-wise distribution matching, making it accessible even for resource-constrained AI applications.
Generating large-scale differentially private (DP) synthetic data is challenging due to the fundamental privacy–computation–utility trade-off: strong privacy guarantees can either hurt the quality of the synthetic data or require large amounts of computation. A popular approach is to privately fine-tune a billion-parameter large language model (LLM) on the “private data” (the dataset on which one plans to provide privacy guarantees) and then sample from the fine-tuned model to generate synthetic data. However, its high computational cost makes this approach impractical for resource-constrained applications. To address this, recently proposed algorithms such as Aug-PE and Pre-Text generate synthetic data using only LLM API access.
However, these methods typically rely heavily on manual prompts to generate the initial dataset and cannot effectively utilize private information in their iterative data selection process. In “Synthesizing Privacy-Preserving Text Data via Fine-Tuning Without Fine-Tuning Billion-Scale LLMs”, presented at ICML 2025, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without fine-tuning billion-scale LLMs or domain-specific prompt engineering. CTCL uses a lightweight 140M-parameter model, making it practical for resource-constrained applications. By conditioning on topic information, the generated synthetic data can match the topic distribution of the private domain. Finally, unlike the Aug-PE algorithm, CTCL allows generating an unlimited number of synthetic data samples without paying additional privacy costs. Across a variety of datasets, CTCL consistently outperforms baselines, especially under strong privacy guarantees.
Experiments also demonstrate CTCL’s improved scalability compared to the Aug-PE algorithm, and ablation studies confirm the importance of its pre-training and keyword-based conditioning.
Data synthesis framework
The CTCL framework creates high-quality synthetic data from private datasets while maintaining privacy, dividing the procedure into three main steps. Before getting into the specifics, it helps to understand the two essential components that make the framework work: CTCL-Topic, a universal topic model that captures the high-level themes of a dataset, and CTCL-Generator, a conditional language model that creates documents based on specific keywords. Both components are developed on large public corpora and form the foundation for learning various private domains and producing synthetic data from them.
Step 1: Developing CTCL-Topic and CTCL-Generator
Both components are built from large-scale public corpora and can later be adapted to various private domains. CTCL-Topic is a topic model derived from the diverse Wikipedia corpus of approximately 6 million documents: we embed each document with BERTopic, group the documents into roughly 1,000 clusters (topics), and represent each cluster with 10 keywords. CTCL-Generator is a 140M-parameter conditional language model that, given a free-form document description (e.g., document type, keywords), generates documents satisfying those conditions. To construct its pre-training data, we instruct Gemma-2-2B to “Describe the document in multiple aspects” for each SlimPajama document, resulting in a dataset of 430 million description–document pairs. We then continually pre-train BART-base (a 140M-parameter language model) on this dataset, yielding the CTCL-Generator.
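To make Step 1 concrete, below is a minimal sketch of how a CTCL-Topic-style model could be built with the open-source BERTopic library. The variable names, cluster count, and keyword count follow the description above, but the exact pipeline and settings used in the paper may differ.

```python
# Minimal sketch of building a CTCL-Topic-style model with BERTopic.
# Assumption: `wiki_docs` is a list of Wikipedia documents already loaded in
# memory; the paper's exact embedding model and clustering settings may differ.
from bertopic import BERTopic

def build_topic_model(wiki_docs, n_topics=1000, n_keywords=10):
    # Cluster the documents and reduce to roughly `n_topics` topics.
    topic_model = BERTopic(nr_topics=n_topics, top_n_words=n_keywords)
    topic_ids, _ = topic_model.fit_transform(wiki_docs)

    # Represent each topic by its top keywords.
    topic_keywords = {
        topic_id: [word for word, _ in topic_model.get_topic(topic_id)]
        for topic_id in set(topic_ids)
        if topic_id != -1  # -1 is BERTopic's outlier topic
    }
    return topic_model, topic_keywords
```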
Step 2: Learning the private domain
CTCL-Topic is then used to gather high-level distributional information from the entire private corpus. Specifically, we collect a DP topic histogram, i.e., the percentage of each topic in the private data, which captures the topic-wise distribution of the corpus. This topic histogram is used later in Step 3 for sampling.
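As an illustration of this step, the sketch below computes a noisy topic histogram by adding Gaussian noise to per-topic counts. The specific mechanism, noise scale, and normalization are assumptions for exposition rather than the exact DP procedure used in the paper.

```python
# Minimal sketch of a DP topic histogram over the private corpus.
# Assumption: each private document is assigned to exactly one topic
# (sensitivity 1 per document) and Gaussian noise is used; the paper's
# exact mechanism and privacy accounting may differ.
import numpy as np

def dp_topic_histogram(topic_assignments, n_topics, noise_multiplier):
    # Count how many private documents fall into each topic.
    counts = np.bincount(topic_assignments, minlength=n_topics).astype(float)

    # Add Gaussian noise scaled by the per-document sensitivity (1 here).
    noisy_counts = counts + np.random.normal(0.0, noise_multiplier, size=n_topics)

    # Clip negatives and normalize into a probability distribution.
    noisy_counts = np.clip(noisy_counts, 0.0, None)
    return noisy_counts / noisy_counts.sum()
```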
While collecting the topic histogram, every document in the private dataset is linked to a topic. We then transform the private dataset into a dataset of keyword–document pairs, where the 10 keywords for each document come from its corresponding topic in CTCL-Topic. Finally, we fine-tune the CTCL-Generator on this dataset with DP.
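A minimal sketch of this step is shown below: private documents are paired with the keywords of their assigned topic, and the generator is then fine-tuned with DP-SGD, here illustrated with the Opacus library. The helper names, hyperparameters, and the use of Opacus are assumptions for illustration; the paper’s actual DP training setup may differ.

```python
# Minimal sketch of Step 2: build keyword-document pairs, then DP fine-tune.
# Assumptions: `topic_model` / `topic_keywords` come from Step 1, `model` is the
# pre-trained CTCL-Generator (a BART-style seq2seq model), and Opacus provides
# DP-SGD; the paper's exact setup and privacy accounting may differ.
import torch
from opacus import PrivacyEngine

def build_keyword_document_pairs(private_docs, topic_model, topic_keywords):
    topic_ids, _ = topic_model.transform(private_docs)  # nearest topic per document
    return [
        {"condition": ", ".join(topic_keywords[tid]), "target": doc}
        for tid, doc in zip(topic_ids, private_docs)
    ]

def dp_finetune(model, train_loader, epochs=3, target_epsilon=1.0, target_delta=1e-6):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    privacy_engine = PrivacyEngine()
    # Wrap the model, optimizer, and data loader for per-example clipping + noise.
    model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        epochs=epochs,
        target_epsilon=target_epsilon,
        target_delta=target_delta,
        max_grad_norm=1.0,  # per-example gradient clipping bound
    )
    model.train()
    for _ in range(epochs):
        for batch in train_loader:  # tokenized condition -> document pairs
            optimizer.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
    return model
```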
Step 3: Generating synthetic data
The fine-tuned CTCL-Generator is then sampled proportionally for each topic according to the DP topic histogram. In particular, given the desired size of the synthetic dataset (say, N) and the DP topic histogram (say, x% for Topic 1, y% for Topic 2, etc.), we know the target number of samples for each topic (i.e., x%*N for Topic 1, y%*N for Topic 2, etc.). For each topic, we feed its 10 keywords to the DP fine-tuned CTCL-Generator to generate data. By the post-processing property of DP, the CTCL-Generator can generate any amount of synthetic data without incurring additional privacy costs.
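For concreteness, a minimal sketch of this sampling step is given below; the decoding parameters and helper names are illustrative assumptions, not the paper’s exact generation settings.

```python
# Minimal sketch of Step 3: proportionally sample synthetic documents per topic.
# Assumptions: `generator` / `tokenizer` are the DP fine-tuned CTCL-Generator
# (a BART-style seq2seq model) and its tokenizer; decoding settings are illustrative.
def generate_synthetic_dataset(generator, tokenizer, dp_histogram, topic_keywords, total_n):
    synthetic_docs = []
    for topic_id, proportion in enumerate(dp_histogram):
        n_samples = round(proportion * total_n)  # target count for this topic
        condition = ", ".join(topic_keywords[topic_id])
        inputs = tokenizer(condition, return_tensors="pt")
        for _ in range(n_samples):
            output_ids = generator.generate(
                **inputs, do_sample=True, top_p=0.95, max_new_tokens=512
            )
            synthetic_docs.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    # Sampling is post-processing of a DP model, so it adds no privacy cost.
    return synthetic_docs
```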
Experiments
We conducted experiments on four datasets: three correspond to downstream generative tasks and one to a classification task. Generative tasks are typically more difficult than classification tasks, because they are evaluated by next-token prediction accuracy, which requires the synthetic data to preserve fine-grained textual information from the private data. In contrast, the classification task only requires preserving the co-occurrence patterns between labels and words in the private data. The three generative tasks are chosen to cover a diverse set of practical scenarios: PubMed (medical paper abstracts), Chatbot Arena (human-to-machine interactions), and Multi-Session Chat (human-to-human daily dialogues). Following the Aug-PE setup, we assess the quality of the generated synthetic data by training a small downstream language model on it and computing next-token prediction accuracy on the real test data. The classification task is performed on the OpenReview (academic paper reviews) dataset; here we train a downstream classifier on the synthetic data and compute classification accuracy on the real test data.
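As a sketch of the generative-task metric, the snippet below computes next-token prediction accuracy for a small downstream language model on real test data; the model interface and batching details are assumptions for illustration.

```python
# Minimal sketch of the next-token prediction accuracy metric for the
# generative tasks. Assumption: `downstream_lm` is a small causal LM trained
# on the synthetic data, and test documents are already tokenized into batches.
import torch

@torch.no_grad()
def next_token_accuracy(downstream_lm, test_batches):
    correct, total = 0, 0
    for batch in test_batches:  # each batch: dict with "input_ids" of shape (B, T)
        logits = downstream_lm(input_ids=batch["input_ids"]).logits
        preds = logits[:, :-1].argmax(dim=-1)   # predict token t+1 from the prefix
        targets = batch["input_ids"][:, 1:]     # ground-truth next tokens
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / total
```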
To mitigate concerns regarding data contamination, we carefully analyzed our selected datasets and found no overlap between our downstream datasets and our pre-training data.
Results
In the strong privacy guarantee regime, CTCL consistently outperforms the baselines. The plot below compares CTCL against three baselines: Downstream DPFT (directly DP fine-tuning the downstream model on the private data without using synthetic data), Aug-PE (an augmented version of the Private Evolution algorithm), and DP fine-tuning an LLM of similar size to CTCL to generate synthetic data with post-generation resampling. CTCL performs best particularly in the more challenging settings with stronger privacy guarantees (i.e., smaller ε values), demonstrating its ability to effectively capture useful information from the private data while maintaining privacy.
Additionally, CTCL is more scalable than Aug-PE in terms of both privacy budget and synthetic data size. As shown in the left plot below, CTCL improves with an increased privacy budget while Aug-PE does not; this limitation may stem from Aug-PE’s constrained capacity (i.e., using only nearest neighbors) to capture information in the private data. The right plot shows that downstream accuracy keeps increasing as more CTCL-generated samples are provided, while Aug-PE performance saturates around 10K examples. These results align with the intuition that fine-tuning–based methods (e.g., CTCL) can better capture fine-grained statistics than prompting-based methods (e.g., Aug-PE).
Finally, ablation studies validate the importance of two key components of our framework: 1) pre-training the CTCL-Generator on a public corpus, and 2) incorporating keyword-based conditioning during DP fine-tuning. Specifically, starting from standard DP fine-tuning, we introduce these components one at a time and measure the test loss of the downstream model. For a fixed privacy budget, adding pre-training and then keyword-based conditioning each reduce the test loss by roughly 50%, demonstrating that both components are crucial to our framework design.
Future work
Our experiments with Data Synthesis with ConTrollability and CLustering (CTCL) use a generator of only 140M parameters. However, the key idea of CTCL, i.e., using clustering information or LLM-extracted metadata as input instructions, can easily be extended to larger models. We are actively investigating this direction to help improve real-world applications.
Acknowledgements
This work was primarily done by Bowen Tan during his internship at Google Research, under the guidance of Shanshan Wu and Zheng Xu. We are grateful to Daniel Ramage and Brendan McMahan for their leadership support, as well as to external academic partners Eric Xing and Zhiting Hu for their helpful feedback on the ICML paper, Zachary Garrett and Michael Riley for reviewing an early draft, Taylor Montgomery for reviewing the use of datasets, and Mark Simborg and Kimberly Schwede for assistance editing the blogpost and graphics. We are grateful to the ICML reviewers for their valuable time and insightful comments on our paper.