Friday, September 20, 2024

Announcement: Introducing Enhanced Dataset Search Features for Easier Discovery

Introduction

The Hugging Face Dataset Hub hosts over 180,000 public datasets for the AI and ML community. These datasets are used for a range of tasks, including training LLMs and chatbots and evaluating automatic speech recognition and computer vision systems. However, discoverability and visualization remain key challenges when finding, exploring, and transforming datasets to fit a specific use case.

Search by Modality

The modality of a dataset refers to the type of data it contains, such as text, image, audio, or tabular data. We have released a set of filters that let users narrow datasets down by modality. The following modalities are available:

  • Text
  • Image
  • Audio
  • Tabular
  • Time-Series
  • 3D
  • Video
  • Geospatial

For instance, users can search for datasets that contain both text and image data:

[Figure: search by modality example]
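The same kind of query can also be sketched in code. The snippet below is a minimal illustration rather than the Hub's documented method: it assumes the `huggingface_hub` Python package is installed and that the Hub labels modalities with tags of the form `modality:<name>` (the tag spelling is an assumption, not something stated above). The helper names `modality_tags` and `find_datasets` are hypothetical.

```python
# Modalities listed in this article, lowercased for use in tags.
KNOWN_MODALITIES = {
    "text", "image", "audio", "tabular",
    "time-series", "3d", "video", "geospatial",
}


def modality_tags(*modalities):
    """Turn modality names into tag strings such as 'modality:text'.

    The 'modality:<name>' format is an assumption about how the Hub
    tags datasets; adjust it if the tags on the Hub are spelled differently.
    """
    tags = []
    for name in modalities:
        key = name.strip().lower()
        if key not in KNOWN_MODALITIES:
            raise ValueError(f"Unknown modality: {name!r}")
        tags.append(f"modality:{key}")
    return tags


def find_datasets(tags, limit=5):
    """Return the ids of up to `limit` datasets matching the tags (needs network access)."""
    # Imported here so the tag helper works without the package installed.
    from huggingface_hub import HfApi

    return [d.id for d in HfApi().list_datasets(filter=tags, limit=limit)]
```

Under these assumptions, `find_datasets(modality_tags("Text", "Image"))` would request datasets tagged with both modalities, mirroring the text-and-image search described above.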

Search by Size

We have recently added a feature to show the number of rows in each dataset. Users can now search for datasets by a specific number of rows, making it easier to find datasets of a certain size.

[Figure: number of rows of each dataset]

For example, users can search for datasets with more than 10 billion rows:

[Figure: biggest datasets]

Search by Format

The same dataset can be stored in multiple formats, such as Parquet, JSON Lines, or text files. Each format has its pros and cons, and users can search for datasets in specific formats.

For example, users can search for datasets in the WebDataset format:

[Figure: WebDataset-format datasets]

Search by Library

There are many libraries and tools available for loading and preparing datasets, such as Pandas, Dask, or the 🤗 Datasets library. Users can search for datasets compatible with their favorite libraries.

For example, users can search for datasets compatible with Pandas:

[Figure: Pandas-compatible datasets]

Combine Filters

The new filters can be used together with other existing filters, such as Language, Tasks, and Licenses. Users can combine these filters with the text search bar to find specific datasets.

[Figure: search for a WebDataset of images of PDFs]
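Combining filters with a text search can be sketched the same way. This is an illustration under stated assumptions, not an official recipe: it assumes the `huggingface_hub` package is installed and that each filter maps to a tag of the form `<facet>:<value>` (for example `format:webdataset`), which is a guess about the Hub's tag naming. The helpers `build_query` and `run_query` are hypothetical names.

```python
def build_query(search=None, **facets):
    """Collect facet filters into tag strings plus an optional text search.

    Each keyword such as format="webdataset" becomes the tag
    'format:webdataset'. This '<facet>:<value>' scheme is an assumption
    about the Hub's tags, not a documented contract.
    """
    # Sort by facet name so the resulting tag list is deterministic.
    tags = [f"{facet}:{value}" for facet, value in sorted(facets.items())]
    query = {"filter": tags}
    if search:
        query["search"] = search
    return query


def run_query(query, limit=5):
    """Send the query to the Hub and return matching dataset ids (needs network access)."""
    # Imported here so build_query works without the package installed.
    from huggingface_hub import HfApi

    return [d.id for d in HfApi().list_datasets(limit=limit, **query)]
```

For the search pictured above, `build_query(search="pdf", format="webdataset", modality="image")` yields a filter list plus the search term `pdf`, which `run_query` would then pass to the Hub.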

Conclusion

The new filters in the Dataset Search tool let users search the Hugging Face Dataset Hub by modality, size, format, and library compatibility, making it easier to find and explore the right dataset for a specific use case.

Frequently Asked Questions

Q1: What is the Hugging Face Dataset Hub?

The Hugging Face Dataset Hub is a platform that provides access to over 180,000 public datasets for the AI and ML community. These datasets are used for a range of tasks, including training LLMs and chatbots and evaluating automatic speech recognition and computer vision systems.

Q2: What are the new filters in the Dataset Search tool?

The new filters in the Dataset Search tool allow users to search for datasets based on their modality, size, format, and library compatibility. These filters make it easier for users to find and explore datasets on the Hugging Face Dataset Hub.

Q3: How do I use the modality filter?

Select one or more modalities in the filter to find datasets that contain specific types of data, such as text, image, audio, or tabular data. For example, selecting both Text and Image returns datasets that contain both text and image data.

Q4: Can I combine the new filters with other existing filters?

Yes, the new filters can be used together with other existing filters, such as Language, Tasks, and Licenses. Users can combine these filters with the text search bar to find specific datasets.

Q5: How do I get started with the Hugging Face Dataset Hub?

Visit the Hugging Face Dataset Hub website and explore the available datasets. You can also use the Dataset Search tool to find specific datasets based on their modality, size, format, and library compatibility.
