Session Outline
This session at the NDSML Summit 2023 describes how Gavagai improved sentiment analysis for a specific domain by using generative models (GPT-3.5/4) to produce synthetic examples that were then used to fine-tune an existing transformer-based model. In initial experiments, the method they devised allowed them to scale up 450 domain-specific, severely skewed texts to a corpus of 500,000+ balanced and labeled texts, circumventing privacy issues in the original data and improving the predictive power of the final model. The final fine-tuned model showed an improvement of 8 to 10 F1 points when evaluated on a held-out, non-synthetic dataset. The talk also addresses the two major challenges in generating labeled synthetic training data: label noise, and ensuring that the generated data is “similar enough” to the original data to be useful as training data.
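As a rough illustration of the generation step, the sketch below prompts a generative model for new examples conditioned on a target sentiment label, so that a small, skewed seed set can be grown into a larger, balanced labeled corpus. The model name, prompt wording, label set, domain, and counts are all illustrative assumptions, not Gavagai's actual pipeline.

```python
# Hypothetical sketch: grow a balanced, labeled corpus by prompting a
# generative LLM for new examples of each target sentiment label.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]

PROMPT_TEMPLATE = (
    "You are generating training data for a domain-specific sentiment "
    "classifier. Write one short customer comment about telecom support "
    "with a clearly {label} sentiment. Return only the comment text."
)

def generate_examples(label: str, n: int) -> list[dict]:
    """Generate n synthetic texts carrying the requested sentiment label."""
    examples = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(label=label)}],
            temperature=1.0,  # high temperature for lexical diversity
        )
        examples.append({"text": response.choices[0].message.content,
                         "label": label})
    return examples

# Request an equal number of examples per label to balance the corpus.
corpus = [ex for label in LABELS for ex in generate_examples(label, 10)]
```

The resulting `corpus` of text/label pairs could then be fed to a standard fine-tuning loop for a transformer classifier.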
Key Takeaways
- Generative LLMs can be used to produce synthetic, labeled training data for NLP tasks.
- Two key challenges with synthetic data: Can we trust the labels that the LLM outputs? Can we control the distribution of the generated data to ensure it is similar enough to the original data? (One possible screening approach is sketched below.)
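As a minimal sketch of how such a check might look in practice, the snippet below filters synthetic items by embedding similarity to the original seed texts, a crude proxy for "similar enough"; label noise could be screened analogously, e.g. by relabeling each synthetic text with a second model pass and dropping disagreements. The encoder choice and similarity threshold are assumptions for illustration, not the method from the talk.

```python
# Hypothetical distribution check: keep only synthetic texts whose
# embedding lies close enough to at least one original seed text.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def distribution_filter(seed_texts: list[str],
                        synthetic: list[dict],
                        min_sim: float = 0.5) -> list[dict]:
    """Keep synthetic items whose max cosine similarity to any seed
    text is at least min_sim (embeddings are L2-normalized, so the
    dot product equals cosine similarity)."""
    seed_emb = encoder.encode(seed_texts, normalize_embeddings=True)
    kept = []
    for item in synthetic:
        emb = encoder.encode([item["text"]], normalize_embeddings=True)
        if float(np.max(seed_emb @ emb.T)) >= min_sim:
            kept.append(item)
    return kept
```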