Alongside the rapidly rising impact of LLMs, we have seen the growing importance of “synthetic data”: generated instructional text used to train LLMs for specific tasks without the need to mine real human conversations.
Synthetic data has been successfully used many times to improve the performance of neural networks. Some examples include self-driving cars, which are trained on synthetic data alongside real-world data, and object detection models, which improve performance by training on synthetic datasets generated by GANs (generative adversarial networks).
The names for these techniques continue to evolve, but all of them are forms of “data augmentation.” Data augmentation is not a new concept; it has been applied and battle-tested thoroughly in every field of machine learning, and it can be leveraged for any modality in various forms. The same is true for LLMs. So, why the sudden buzz around synthetic datasets?
Large language models are typically trained on web-scale datasets. Given that these models scale well with the size of the dataset, there is no reason to believe that we have achieved their peak potential performance.
The general perception is that LLMs have essentially consumed all of the data on the internet. That is far from saying they have reached their peak performance, however. We must also consider how quality data can be acquired: it has been shown that quality data leads to better models with far fewer parameters.
One way to overcome this blocker, namely the finite amount of data available on the web, is to leverage synthetic datasets for further training. Synthetic datasets have been used to train smaller models that perform on par with much bigger models at certain tasks; the Phi series is a good example.
It is consistently true that the higher the quality of a dataset, the better the performance of the model trained on it. The same applies to synthetic data: we see benefits from a synthetic dataset only if it is meaningful and of good quality. The Phi-3 models are a testament to this.
Research papers today commonly mention leveraging synthetic datasets to improve model performance, yet they rarely describe the components of the pipelines used to generate those datasets.
Generating a reliable synthetic dataset is a complex process, and one needs to be aware of its pitfalls. Using a noisy dataset to generate synthetic data can lead to several silent failures.
Let us take an example to understand this point a bit better. Say we want to build a closed-domain question-answering system. We have a dataset of (context, question -> answer) samples that we can use to fine-tune a language model. Each context consists of multiple paragraphs and is noisy (missing figures, unnecessary Unicode characters, and other formatting issues), and the dataset contains exactly one question-answer pair per context. A synthetic dataset here means an additional generated question-answer pair for each context.
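To make this concrete, here is a minimal sketch of what one sample might look like; the field names and placeholder values are hypothetical and only illustrate the structure described above.

# One sample from the original dataset: a noisy, multi-paragraph context
# with exactly one human-written question-answer pair.
sample = {
    "context": "<multiple paragraphs of raw text, possibly with stray Unicode>",
    "question": "<the single human-written question for this context>",
    "answer": "<the corresponding answer>",
}

# A synthetic sample reuses the same context with a newly generated pair.
synthetic_sample = {
    "context": sample["context"],
    "question": "<generated question>",
    "answer": "<generated answer>",
}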
LLMs today are incredibly powerful, and leveraging them to generate synthetic data is a natural choice. We can use GPT-4 or Llama2-70B for the task defined above: pass a context as the input prompt and ask the model to generate a question-answer pair relevant to that context. So, what are the possible pitfalls in this process? Let us take a look at a few of them:
The image shown below depicts a non-trivial type of hallucination. Although the generated text is from the input context, it contains a lot of garbage content. For example, the generated answer consists of a chunk of gibberish present in the input context.
After a certain point, easy (low-complexity) generated examples provide diminishing returns in terms of performance, so we want to incorporate more complex examples for better gains. The definition of a complex or hard example is use-case dependent; one example of a complex question-answer pair is one whose answer runs longer than a few words. Look at the figure below and consider choosing between the two generated samples based on the complexity of the generated text: though both are good, the second is more complex than the first.
If we are using a hosted model (like GPT-4) via an API, then running into issues like unstructured outputs or outputs with missing or disorderly fields is not uncommon. Look at the figure below and notice the missing generated question. Debugging this is not straightforward, as factors like bad inputs, broken tokenization, lack of constraints, hallucination, etc., could all be contributing.
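To see how easily this happens, here is a minimal sketch of a naive generation call using the OpenAI Python client (v1+); the prompt format, the context variable, and the string parsing are illustrative assumptions, and the brittle parsing is exactly where a missing field slips through unnoticed.

from openai import OpenAI

client = OpenAI()

# context is assumed to hold one noisy, multi-paragraph block from the dataset.
prompt = (
    "Generate one question and its answer based only on the context below.\n"
    "Format:\nQuestion: ...\nAnswer: ...\n\nContext:\n" + context
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
raw = response.choices[0].message.content

# Brittle parsing: if the model omits or reorders a field, question/answer
# silently stay empty and the bad sample can enter the dataset unnoticed.
question, answer = "", ""
if "Question:" in raw and "Answer:" in raw:
    question = raw.split("Question:", 1)[1].split("Answer:", 1)[0].strip()
    answer = raw.split("Answer:", 1)[1].strip()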
There are many other nuances to this process that are use-case dependent. That is why generating synthetic datasets in an automated fashion is not a solved problem: an automated synthetic dataset generation pipeline is helpful only if it is known to be reliable.
There are several ways to address these problems. You can use tools like Instructor, DSPy, LangChain, etc., or, depending on the use case, you can write a few checks as part of your data generation pipeline. Which should one choose? We will talk about this in detail in the FAQ section later. First, let us see how to address the issues listed above for the current use case.
Before jumping to the solution, let us consider what kinds of fixes could address these issues, such as cleaning the noisy contexts before generation, prompting the model for more complex samples, constraining the output format, and adding verification steps to the pipeline.
These are good experiments for figuring out an optimized data generation pipeline, and a combination of such considerations is powerful enough to generate synthetic datasets reliably.
Semi-supervised dataset annotation is common across computer vision tasks where the collected dataset is first (partially) annotated by a computer vision model, then human annotators refine those annotations in the final pass. It is a kind of external intervention in the data annotation pipeline.
Inspired by this, we can build feedback loops into the data generation mechanism that help produce fine-grained, reliable synthetic datasets. We can either keep a human in the loop or use a bigger LLM (bigger than the one generating the dataset) as a verifier. In many cases, a human in the loop is necessary for verification and should not be replaced by an automated system; the task at hand, however, is simple enough that an LLM can provide the feedback instead. This is similar in principle to Chain-of-Verification and related techniques.
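In plain Python, the shape of such a feedback loop is roughly the following; generate_pair and verify are hypothetical placeholders for whatever generator and verifier (LLM or human) you plug in.

def generate_with_feedback(context, max_retries=3):
    """Generate a QA pair, ask a verifier for feedback, and retry on failure."""
    feedback = None
    for _ in range(max_retries):
        question, answer = generate_pair(context, feedback)  # generator LLM
        ok, feedback = verify(context, question, answer)     # verifier LLM or human
        if ok:
            return question, answer
    return None  # discard this context rather than keep a bad sample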
So, what kind of feedback mechanisms can we implement for the question-answer generation task? Ensuring correctness here is analogous to passing unit tests. A few examples: the generated question and answer should not be empty, the question should be answerable from the given context, the generated text should be well structured, and the pair should stay faithful to the context without hallucinated facts.
These are a few checks we can implement in the feedback loop. We could implement the loop as a standalone Python file and integrate it into the dataset generation pipeline, but for the sake of demonstration, we will use DSPy here.
import dspy


class GenerateQA(dspy.Signature):
    """Generate question-answer pair from a given context, and stay faithful
    to the context.
    """

    context = dspy.InputField(desc="context")
    question = dspy.OutputField(desc="relevant question")
    answer = dspy.OutputField(desc="relevant answer of one or more lines")
Here we are inheriting from the Signature class to tell the LM what it needs to do. In our case, the LM is tasked with generating a question-answer pair from a given context. Hence, context is defined as the input field, while question and answer are defined as the output fields.
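As a quick illustration of how this signature is used on its own (assuming an LM has already been configured via dspy.settings.configure, and with a placeholder context string), we can wrap it in a DSPy predictor and call it directly:

# Assumes dspy.settings.configure(lm=...) has been called beforehand.
context = "<a paragraph block from the source dataset>"
generate_qa = dspy.ChainOfThought(GenerateQA)
pred = generate_qa(context=context)
print(pred.question)
print(pred.answer)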
class Assess(dspy.Signature):
    """Assess the quality of generated question-answer pair along the specified dimension."""

    context = dspy.InputField(desc='ignore if N/A')
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")
Why do we need assessments? Remember the explicit feedback loop we talked about in the previous section? The Assess class provides a common interface for running assessments on the fly. Each assessment is posed as a question to a verifier LLM, which responds with a yes-or-no answer.
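For instance, checking answerability on the fly looks roughly like this, reusing context and pred from the earlier GenerateQA snippet; the exact wording of the assessment question is up to you.

assess = dspy.Predict(Assess)
verdict = assess(
    context=context,               # the same context used for generation
    assessed_text=pred.question,   # the generated question from the snippet above
    assessment_question="Is the question answerable from the given context? Say no if it's not.",
)
print(verdict.assessment_answer)   # expected to be "Yes" or "No"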
We have defined our main question-answer generator and the evaluator (assessment checks). Now, we need to integrate these pieces into a single module. Let's do that. Here is a simple way to achieve this in DSPy:
class GenerateQAWithAssertions(dspy.Module):
    def __init__(self, data):
        super().__init__()
        self.data = data
        self.generate_qa = dspy.ChainOfThought(GenerateQA)
        self.assessment = dspy.Predict(Assess)

    def forward(self, idx):
        # 1. Pick the context at position idx from the data
        context = sample_context_from_dataset(self.data, idx)

        # 2. Generate the corresponding QA pair with the generator LLM
        with dspy.context(lm=config["llm_for_generation"]):
            pred = self.generate_qa(context=context)

        generated_question = pred.question
        generated_answer = pred.answer
        generated_qa = generated_question + "\n" + generated_answer

        pred = dspy.Prediction(
            context=context,
            generated_question=generated_question,
            generated_answer=generated_answer
        )

        # 3. Null check for the generated question
        dspy.Suggest(
            not is_null(generated_question),
            "The generated question is an empty string. Please revise accordingly.",
            target_module=GenerateQA
        )

        # 4. Null check for the generated answer
        dspy.Suggest(
            not is_null(generated_answer),
            "The generated answer is an empty string. Please revise accordingly.",
            target_module=GenerateQA
        )

        # 5. Other assessments, run with the verifier LLM
        with dspy.context(lm=config["llm_for_verification"]):

            # 5.1 Can the question be answered from the given context?
            is_answerable = "Is the question answerable from the given context? Say no if it's not."
            is_answerable_assessment = self.assessment(context=context, assessed_text=generated_question, assessment_question=is_answerable)
            dspy.Suggest(
                is_assessment_yes(is_answerable_assessment.assessment_answer),
                "The generated question is not answerable from the given context. Please revise accordingly.",
                target_module=GenerateQA
            )

            # 5.2 Is the content well-structured?
            is_structured = "Is the assessed text well structured? Say no if the quality is bad."
            is_structured_assessment = self.assessment(context='N/A', assessed_text=generated_question, assessment_question=is_structured)
            dspy.Suggest(
                is_assessment_yes(is_structured_assessment.assessment_answer),
                "The generated question is not well structured. Please revise to make it more structured.",
                target_module=GenerateQA
            )

            # 5.3 Does the generated question-answer pair contain hallucinated content?
            is_faithful = "Is the assessed text grounded in the context? Say no if it includes significant facts not in the context."
            is_faithful_assessment = self.assessment(context=context, assessed_text=generated_qa, assessment_question=is_faithful)
            dspy.Suggest(
                is_assessment_yes(is_faithful_assessment.assessment_answer),
                "The text contains unfaithful elements or significant facts not in the context. Please revise for accuracy.",
                target_module=GenerateQA
            )

        # 6. Return predictions
        return pred
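Before breaking the forward pass down, note that the module above relies on three small helpers that are not part of DSPy: sample_context_from_dataset, is_null, and is_assessment_yes. Their implementations are use-case specific; here is one minimal, illustrative version of each.

def sample_context_from_dataset(data, idx):
    """Return the context at position idx from the source dataset (a list of dicts)."""
    return data[idx]["context"]


def is_null(text):
    """True if a generated field is missing or effectively empty."""
    return text is None or len(text.strip()) == 0


def is_assessment_yes(assessment_answer):
    """True if the verifier LLM answered the assessment question with 'yes'."""
    return assessment_answer.strip().lower().startswith("yes")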
Let us break down what is happening in the forward pass of the GenerateQAWithAssertions module:

- A context is picked from the dataset and passed to the GenerateQA module to generate the question-answer pair. We use GPT-3.5 as the generator LLM.
- Each check is expressed via dspy.Suggest for evaluation. If an assessment fails, DSPy retries the generation until the assessment passes or a maximum number of retries has been attempted.
- The final predictions are returned as a dspy.Prediction object.

This is how we generate synthetic data reliably. There may be contexts for which no valid pair is produced, but we can expect to almost double the size of our dataset with this approach. We can then fine-tune any model suited for this task on the combined dataset (original plus synthetic). Let us take a look at some examples and compare generations produced with and without assertions.
With the method described above, we can extend the size of our dataset and try boosting the model’s performance before resorting to a randomized grid search for automatic hyperparameter tuning, which is a far more expensive option.
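To tie everything together, here is a rough sketch of how these pieces could be wired up and run end to end. The model names, the config dictionary, and the activate_assertions() call reflect the DSPy 2.x assertions API and are assumptions about one possible setup, not the only way to do it.

import dspy

# Hypothetical setup; the config keys mirror those used inside
# GenerateQAWithAssertions.
config = {
    "llm_for_generation": dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=512),
    "llm_for_verification": dspy.OpenAI(model="gpt-4", max_tokens=256),
}
dspy.settings.configure(lm=config["llm_for_generation"])

# data: a list of {"context": ..., "question": ..., "answer": ...} dicts.
generator = GenerateQAWithAssertions(data).activate_assertions()

synthetic = []
for idx in range(len(data)):
    pred = generator(idx=idx)
    if pred.generated_question and pred.generated_answer:
        synthetic.append({
            "context": pred.context,
            "question": pred.generated_question,
            "answer": pred.generated_answer,
        })

# The fine-tuning set is then the original data plus the synthetic pairs.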