Generate Better Synthetic Image Datasets with Prompt Engineering and Evaluation (cleanlab.ai)
1 point by cmauck10 on Oct 5, 2023 | 1 comment



When generating synthetic data with LLMs (GPT-4, Claude, …) or diffusion models (DALL-E 3, Stable Diffusion, Midjourney, …), how do you evaluate how good it is?

Introducing: Quality scores to systematically evaluate a synthetic dataset with just one line of code! Use Cleanlab’s synthetic dataset scores to rigorously guide your prompt engineering (a much better signal than manually inspecting samples). These scores also help you tune the settings of any synthetic data generator (e.g., GAN or probabilistic model hyperparameters) and compare different synthetic data providers.

Cleanlab scores comprehensively evaluate a synthetic dataset for different shortcomings including: unrealistic examples, low diversity, overfitting/memorization of real data, and underrepresentation of certain real scenarios. These scores are universally applicable to image, text, and structured/tabular data!
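
To give a rough feel for what these four shortcomings mean numerically, here is a minimal sketch using nearest-neighbor distances between feature embeddings. This is not Cleanlab's implementation or API; the function names, dictionary keys, and the `dup_tol` threshold are all hypothetical, and it assumes the real and synthetic examples have already been embedded as NumPy arrays with one row per example.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def synthetic_quality_proxies(real, synth, dup_tol=1e-6):
    """Crude proxies for the four shortcomings (hypothetical, for illustration only).

    real, synth: 2D arrays of embeddings, one row per example.
    dup_tol: distance at or below which two examples count as near-duplicates.
    Higher values mean a worse synthetic dataset for every returned entry.
    """
    d_sr = pairwise_dist(synth, real)    # rows: synthetic, columns: real
    d_ss = pairwise_dist(synth, synth)
    np.fill_diagonal(d_ss, np.inf)       # ignore each point's distance to itself

    nearest_real = d_sr.min(axis=1)      # per synthetic example
    nearest_synth = d_sr.min(axis=0)     # per real example
    return {
        # synthetic examples that sit far from any real example
        "unrealistic": float(nearest_real.mean()),
        # real examples that no synthetic example comes close to
        "underrepresented": float(nearest_synth.mean()),
        # fraction of synthetic examples that near-duplicate another synthetic one
        "low_diversity": float((d_ss.min(axis=1) <= dup_tol).mean()),
        # fraction of synthetic examples that near-copy a real example
        "memorized": float((nearest_real <= dup_tol).mean()),
    }

real = np.array([[0.0, 0.0], [1.0, 0.0]])
synth = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]])
scores = synthetic_quality_proxies(real, synth)
```

In this toy example two of the three synthetic points are exact copies of a real point and of each other, and the third is far from all real data, so all four proxies come out nonzero. Comparing such numbers across prompts or generator settings is the kind of workflow the scores are meant to support; the actual scores are described in the tutorial.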

Check out the blog for more details or the tutorial notebook: https://help.cleanlab.ai/tutorials/synthetic_data/



