All HF Hub posts

Severian posted an update 11 minutes ago

New model and dataset! Llama-3-IMPACTS-2x8B-64k-MLX (with a GGUF version coming soon) is a cutting-edge large language model trained on the I.M.P.A.C.T.S dataset, which encompasses scenarios from biomimicry, climate change, and theoretical astrobiology.
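
For readers who want to try the MLX build, here is a minimal sketch using the mlx-lm package; the repo id and prompt are assumptions based on the post title, not confirmed details:

```python
# Minimal sketch: load and run an MLX-format model with mlx-lm
# (pip install mlx-lm). Repo id assumed from the post title.
from mlx_lm import load, generate

model, tokenizer = load("Severian/Llama-3-IMPACTS-2x8B-64k-MLX")  # assumed repo id
text = generate(
    model,
    tokenizer,
    prompt="How might biomimicry inform climate adaptation?",  # illustrative prompt
    max_tokens=256,
)
print(text)
```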

SivilTaram posted an update about 2 hours ago

✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻Code: https://github.com/sail-sg/sailcraft
🤗Model: sail/sailor-language-models-65e19a749f978976f1959825
📜Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of 4 stages🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

Special attention was given to data cleaning for South-East Asian (SEA) languages🌍 A generic sketch of the two deduplication stages follows.
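
For illustration, here is one common way to realize stages 2 and 3 (near and exact deduplication) using the datasketch library; this is not the sailcraft implementation, just a sketch of the technique:

```python
# Generic sketch of near + exact deduplication (not the sailcraft code).
import hashlib
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate(docs: list[str]) -> list[str]:
    # Stage 2 -- near deduplication: drop a document when its MinHash
    # collides with an already-kept document (approx. Jaccard >= 0.8).
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    near_unique = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if not lsh.query(m):
            lsh.insert(str(i), m)
            near_unique.append(doc)

    # Stage 3 -- exact deduplication: drop byte-identical documents
    # by comparing hash digests.
    seen, unique = set(), []
    for doc in near_unique:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```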

# Use Case ✨

With this codebase, you can clean your own dataset and:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (sketched after this list; we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay)
✅ Investigate what data was removed at each processing stage
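
For illustration, a hypothetical shape such per-language rules could take; the actual sailcraft configuration format may differ:

```python
# Hypothetical per-language cleaning rules (not the sailcraft format).
CLEANING_RULES = {
    "en": {"min_chars": 200, "max_symbol_word_ratio": 0.1},
    "th": {"min_chars": 50, "max_symbol_word_ratio": None},  # Thai has no space-delimited words
    "vi": {"min_chars": 150, "max_symbol_word_ratio": 0.1},
}

def passes_initial_cleaning(text: str, lang: str) -> bool:
    rules = CLEANING_RULES.get(lang, CLEANING_RULES["en"])
    if len(text) < rules["min_chars"]:
        return False
    ratio = rules["max_symbol_word_ratio"]
    if ratio is not None:
        symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
        words = max(len(text.split()), 1)
        if symbols / words > ratio:
            return False
    return True
```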

# Acknowledgement 🙏

The main credit goes to @dreamerdeo , the first author of our Sailor paper ❤️! He put tremendous effort into the data processing pipeline, enabling the model's strong performance. We believe this mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 And this repo would not have been possible without the contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao , and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄

akhaliq posted an update about 2 hours ago

A Careful Examination of Large Language Model Performance on Grade School Arithmetic (2405.00332)

Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, rather than true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2 = 0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
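
For intuition, here is a small sketch of that correlation analysis; this is not the paper's code, and the per-model numbers below are made up purely for illustration:

```python
# Sketch: Spearman rank correlation between a model's likelihood of
# regenerating GSM8k examples and its GSM8k-vs-GSM1k accuracy gap.
# All values below are fabricated for illustration only.
from scipy.stats import spearmanr

log_prob_gsm8k = [-1.2, -0.8, -2.1, -0.5, -1.7]  # hypothetical per-model values
accuracy_gap = [0.04, 0.09, 0.01, 0.13, 0.02]    # GSM8k acc minus GSM1k acc

rho, p_value = spearmanr(log_prob_gsm8k, accuracy_gap)
print(f"Spearman r^2 = {rho**2:.2f} (p = {p_value:.3f})")
```
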
Wauplin posted an update about 4 hours ago

🚀 Just released version 0.23.0 of the huggingface_hub Python library!

Exciting updates include:
📁 Seamless download to a local dir (sketched below)!
💡 Grammar and Tools in InferenceClient!
🌐 Documentation fully translated into Korean!
👥 User API: get likes, upvotes, number of repos, etc.!
🧩 Better model cards and encoding for ModelHubMixin!
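
A minimal sketch of two of these features; the repo and filename are illustrative, and exact signatures may differ slightly across huggingface_hub versions:

```python
from huggingface_hub import hf_hub_download, list_liked_repos

# Seamless download straight into a local directory
# instead of the symlinked cache layout.
path = hf_hub_download(
    repo_id="gpt2",            # illustrative repo
    filename="config.json",
    local_dir="./gpt2-local",
)

# User API: inspect a user's likes.
likes = list_liked_repos("Wauplin")
print(f"{likes.user} has liked {likes.total} repos")
```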

Check out the full release notes for more details:
Wauplin/huggingface_hub#6
👀

wxDai posted an update about 6 hours ago

🔥Motion Latent Consistency Model🔥

Introducing MotionLCM💃, controlling and generating a motion in milliseconds!

Huggingface Space:
wxDai/MotionLCM
Huggingface Paper:
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model (2404.19759)

Project page: https://dai-wenxun.github.io/MotionLCM-page/
Paper: https://arxiv.org/pdf/2404.19759.pdf
Code: https://github.com/Dai-Wenxun/MotionLCM
Video: https://www.youtube.com/watch?v=BhrGmJYaRE4

MotionLCM supports inference pipelines of 1-4 steps, with almost no difference in effectiveness between 1 and 4 steps. Generating approximately 200 frames of motion takes only about 30 ms, which works out to roughly 6,000 frames per second.

Our MotionLCM can achieve high-quality text-to-motion and precise motion control results (both sparse and dense conditions) in ∼30 ms.

We integrated a control module, named Motion ControlNet, into the latent-space diffusion process to achieve controllable motion generation. Our control algorithm is approximately 1,000 times faster than the best-performing baseline, with comparable quality.
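
For intuition, here is a generic sketch of few-step latent consistency sampling; this is not the MotionLCM API, just the underlying idea: each step jumps from a noisy latent straight to a clean estimate, optionally re-noising for the next step.

```python
# Generic sketch of few-step consistency sampling (not the MotionLCM code).
import torch

def lcm_sample(consistency_fn, shape, timesteps=(999, 499, 249, 0)):
    """consistency_fn(x_t, t) -> predicted clean latent x_0."""
    x = torch.randn(shape)                  # start from pure noise
    for i, t in enumerate(timesteps):
        x0 = consistency_fn(x, t)           # one-jump prediction of x_0
        if i < len(timesteps) - 1:          # re-noise toward the next timestep
            sigma = timesteps[i + 1] / timesteps[0]
            x = x0 + sigma * torch.randn_like(x0)
    return x0

# 1-step vs 4-step inference differs only in the timestep schedule.
latent = lcm_sample(lambda x, t: x * 0.0, (1, 196, 256))  # dummy model for demo
```
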
sayakpaul posted an update about 8 hours ago

Custom pipelines and components in Diffusers 🎸

Wanted to use customized pipelines and other components (schedulers, unets, text encoders, etc.) in Diffusers?

Found it inflexible?

Since the first dawn on earth, we have supported loading custom pipelines via a custom_pipeline argument 🌄

These pipelines are inference-only, i.e., the assumption is that we're leveraging an existing checkpoint (e.g., runwayml/stable-diffusion-v1-5) and ONLY modifying the pipeline implementation.
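
For example, loading a community pipeline looks like this (lpw_stable_diffusion is one real community pipeline; the dtype and device choices are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Load an existing checkpoint but swap in a community pipeline implementation.
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",  # long-prompt-weighting community pipeline
    torch_dtype=torch.float16,
).to("cuda")
```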

We have many cool pipelines implemented that way. They all share the same benefits available to a DiffusionPipeline; no compromise there 🤗

Check them here:
https://github.com/huggingface/diffusers/tree/main/examples/community

You might also want everything customized, i.e., custom components along with a custom pipeline. Sure, that's all possible.

All you have to do is keep the implementations of those custom components on the Hub repository you're loading your pipeline checkpoint from.

SDXL Japanese was implemented like this 🔥
stabilityai/japanese-stable-diffusion-xl
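
Loading such a repo executes code shipped on the Hub, so it requires trust_remote_code; a sketch (dtype and device are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Both the pipeline and its custom components are defined in the Hub repo itself.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/japanese-stable-diffusion-xl",
    trust_remote_code=True,  # required: runs code from the repository
    torch_dtype=torch.float16,
).to("cuda")
```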

Full guide is available here ⬇️
https://huggingface.co/docs/diffusers/main/en/using-diffusers/custom_pipeline_overview

And, of course, these share all the benefits that come with DiffusionPipeline.

fantaxy posted an update about 9 hours ago

Introducing the New Language Model "Fantasy 1r39B" and the Multilingual Erotic Story Generator Demo
The development of artificial intelligence language models is progressing rapidly, particularly in the creation of content that accounts for diverse languages and cultural backgrounds. As part of this progress, and inspired by open LLMs such as Mixtral 8x7B and Llama 3, we are introducing a new language model, "Fantasy 1r39B." The model is designed specifically for multilingual story generation and incorporates sophisticated fine-tuning, quantization, and Retrieval-Augmented Generation (RAG) technologies.

A New Horizon in Erotic Story Generation
The first application of "Fantasy 1r39B" is a multilingual 'Erotic Story Generator'. This demo is built on a deep understanding of language, culture, emotion, and custom, and can generate a wide variety of stories to increase user engagement and interest. The model has content filtering unlocked and applies incremental learning, allowing unrestricted exploration of generated content. As a result, output may occasionally include crude or violent language, so users should be mindful of this.

Service Opening and Accessibility
The 'Erotic Story Generator' is currently available as a web test version, accessible to anyone at https://fantaxy-erotica.hf.space. The test version supports a context of up to 128K tokens, with generation capped at 1K new tokens per request, so users can issue successive prompts to continue a story.

We plan to publish this model on arXiv and GitHub soon, and to open both API access and the model itself to the public, making it easier for developers and researchers to access and use it.