Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

AI & ML interests

Machine Learning Librarian

Articles

Synthetic dataset generation techniques: Self-Instruct

1 day ago

• 3

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

10 days ago

• 6

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 17

Extracting Insights from Model Cards Using Open Large Language Models

Nov 27, 2023

Creating open machine learning datasets? Share them on the Hugging Face Hub!

Oct 30, 2023

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 9

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

Posts 14

Post

621

In my ongoing quest to learn more about building synthetic datasets, I've created an "Awesome Synthetic Datasets" list.

The aim is to lightly curate a collection of resources, tutorials, and tools for generating synthetic datasets using large language models.

I plan to add some "key techniques" to the repo, but for now, it focuses on important datasets, papers, and tools.

🔗 https://github.com/davanstrien/awesome-synthetic-datasets

Post

2300

Introducing CosmoChat, a multi-turn chat dataset based on Cosmopedia that I'm working on in the open on the Hub.

🎯 Goals:
💬 Create multi-turn chats seeded from Cosmopedia
🎓 Customize questions for different audience levels
🔍 Evaluate the model's ability to elaborate and clarify
🤓 (I want to learn more about creating valuable synthetic datasets, and I learn best by doing stuff rather than reading stuff).

CosmoChat is created using the excellent distilabel library.

🔗 Explore the current version of the dataset: davanstrien/cosmochat
📝 Read more: https://huggingface.co/blog/davanstrien/cosmochat
