Daniel van Strien PRO

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Organizations

Posts 14

view post
Post
621
In my ongoing quest to learn more about building synthetic datasets, I've created an "Awesome Synthetic Datasets" list.

The aim is to lightly curate a collection of resources, tutorials, and tools for generating synthetic datasets using large language models.

I plan to add some "key techniques" to the repo, but for now, it focuses on important datasets, papers, and tools.

🔗 https://github.com/davanstrien/awesome-synthetic-datasets
view post
Post
2300
Introducing CosmoChat, a multiturn chat dataset based on Cosmopedia that I'm working on in the open on the Hub.

🎯 Goals:
💬 Create multi-turn chats seeded from Cosmopedia
🎓 Customize questions for different audience levels
🔍 Evaluate the model's ability to elaborate and clarify
🤓 (I want to learn more about creating valuable synthetic datasets, and I learn best by doing stuff rather than reading stuff).

Cosmochat is created using the excellent distilabel library.

🔗 Explore the current version of the dataset: davanstrien/cosmochat
📝 Read more: https://huggingface.co/blog/davanstrien/cosmochat