Introducing Goodbooks-10k: A Comprehensive Dataset for Personalized Book Recommendations

Introduction

Building recommender models depends on having a comprehensive, reliable public dataset. There have been a few recommendation datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now. Goodbooks-10k provides ratings and metadata for books, collected from a popular book-sharing platform. This article describes what the dataset contains, how its files are laid out, and how to download and cite it.

The dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:

books marked to read by the users
book metadata (author, year, etc.)
tags/shelves/genres

As to the source, let’s say that these ratings come from a site similar to goodreads.com, but with more permissive terms of use.

There are a few types of data here:

explicit ratings
implicit feedback indicators (books marked to read)
tabular data (book info)
tags

For a quick exploratory analysis of the data, see the notebook.

Data Files

ratings.csv contains ratings sorted by time. It is 69MB and looks like this:

user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3

Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.
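Because both ID ranges are contiguous, the IDs can double as matrix indices after subtracting one. The following is a minimal sketch of that idea using only the Python standard library. It reads the five sample rows shown above from an in-memory string rather than the real file, and the helper names load_ratings and to_zero_based are made up for this illustration.

```python
import csv
import io

# Inline sample copied from the ratings.csv excerpt above; in practice
# you would open the real file instead of this StringIO stand-in.
SAMPLE = """user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
"""

N_USERS = 53424  # user IDs run 1..53424
N_BOOKS = 10000  # book IDs run 1..10000


def load_ratings(handle):
    """Read (user_id, book_id, rating) rows as integer triples."""
    reader = csv.DictReader(handle)
    return [
        (int(row["user_id"]), int(row["book_id"]), int(row["rating"]))
        for row in reader
    ]


def to_zero_based(triples):
    """Shift contiguous 1-based IDs to 0-based indices, checking ranges."""
    indexed = []
    for user_id, book_id, rating in triples:
        if not (1 <= user_id <= N_USERS and 1 <= book_id <= N_BOOKS):
            raise ValueError(f"ID out of range: {user_id}, {book_id}")
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        indexed.append((user_id - 1, book_id - 1, rating))
    return indexed


ratings = load_ratings(io.StringIO(SAMPLE))
indexed = to_zero_based(ratings)
print(indexed[0])  # (0, 257, 5)
```

The range checks are optional, but they catch a mismatched or truncated file early, before the indices are used to fill a user-by-book matrix.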

to_read.csv provides IDs of the books marked “to read” by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.
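One way to use these pairs as implicit feedback is to group them into a set of book IDs per user. The sketch below assumes the file has a user_id,book_id header line; the three sample rows are invented for illustration and are not taken from the dataset, and load_to_read is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows, invented for illustration; the real to_read.csv
# holds close to a million user_id,book_id pairs.
TO_READ_SAMPLE = """user_id,book_id
1,112
1,235
2,533
"""


def load_to_read(handle):
    """Group to-read marks into {user_id: set of book_ids}."""
    wishlist = defaultdict(set)
    for row in csv.DictReader(handle):
        wishlist[int(row["user_id"])].add(int(row["book_id"]))
    return dict(wishlist)


to_read = load_to_read(io.StringIO(TO_READ_SAMPLE))
print(sorted(to_read[1]))  # [112, 235]
```

A set per user makes it cheap to test whether a recommended book is already on someone's to-read list.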

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.
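As a rough illustration of combining the two files, the sketch below averages ratings per book and attaches a title from the metadata. The ratings rows come from the ratings.csv excerpt shown earlier. The metadata rows are placeholders, and the column names book_id and title are assumptions about the layout of books.csv, so check them against the real header before relying on this.

```python
import csv
import io
from collections import defaultdict

# Ratings rows taken from the ratings.csv excerpt shown earlier.
RATINGS = """user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
"""

# Hypothetical metadata rows: the titles are placeholders, and the
# column names book_id and title are assumptions about books.csv.
BOOKS = """book_id,title
258,Placeholder Title A
260,Placeholder Title B
"""


def mean_rating_by_book(handle):
    """Return {book_id: mean rating} computed from rating rows."""
    totals = defaultdict(lambda: [0, 0])  # book_id -> [sum, count]
    for row in csv.DictReader(handle):
        entry = totals[int(row["book_id"])]
        entry[0] += int(row["rating"])
        entry[1] += 1
    return {book_id: total / count for book_id, (total, count) in totals.items()}


def load_titles(handle):
    """Return {book_id: title} from metadata rows."""
    return {int(row["book_id"]): row["title"] for row in csv.DictReader(handle)}


means = mean_rating_by_book(io.StringIO(RATINGS))
titles = load_titles(io.StringIO(BOOKS))
for book_id in sorted(means):
    # Books missing from the tiny metadata sample get a fallback label.
    print(book_id, titles.get(book_id, "(no metadata in sample)"), means[book_id])
```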

Goodreads Book and Work IDs

Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

You can use the goodreads book and work IDs to create URLs to see the difference:

Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id. This means that ratings for different editions of the same book are aggregated.
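To compare an edition page with a work page, you can build both URLs from the IDs in books.csv. The sketch below is an illustration only: the two URL patterns are assumptions about how goodreads.com addresses editions and works, the ID values are examples to be replaced with values from books.csv, and the function names are hypothetical.

```python
GOODREADS = "https://www.goodreads.com"


def edition_url(goodreads_book_id: int) -> str:
    """URL for a single edition (assumed pattern: /book/show/<id>)."""
    return f"{GOODREADS}/book/show/{goodreads_book_id}"


def work_url(work_id: int) -> str:
    """URL listing a work's editions (assumed pattern: /work/editions/<id>)."""
    return f"{GOODREADS}/work/editions/{work_id}"


# Example ID values; substitute goodreads_book_id and work_id from books.csv.
print(edition_url(2767052))
print(work_url(2792775))
```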

Download

All files are available on GitHub. Some of them are quite large, so GitHub won’t show their contents online. See samples for smaller CSV snippets. You can download individual zipped files from releases.

Thanks to Maciej Kula, the dataset is accessible from Spotlight, recommender software based on PyTorch.

Citing

This is the preferred citation style:

@article{goodbooks2017,
    author = {Zajac, Zygmunt},
    title = {Goodbooks-10k: a new dataset for book recommendations},
    year = {2017},
    publisher = {FastML},
    journal = {FastML},
    howpublished = {url{

Conclusion

In conclusion, the Goodbooks-10k dataset gives researchers and developers working on book recommendations a resource comparable to those long available for movies and music. It combines six million explicit ratings with to-read marks, book metadata, and tags for ten thousand books, so it supports explicit-feedback, implicit-feedback, and content-based approaches alike.

Frequently Asked Questions

Question 1: What is the Goodbooks-10k dataset?

The Goodbooks-10k dataset is a collection of six million ratings for the ten thousand most popular books, along with to-read marks, book metadata (author, year, etc.), and tags.

Question 2: Where does the data come from?

The data comes from a site similar to goodreads.com, but with more permissive terms of use.

Question 3: What types of data are included in the dataset?

The dataset includes explicit ratings, implicit feedback indicators, tabular data, and tags.

Question 4: How can I access the dataset?

The dataset is available on GitHub, where you can download individual zipped files or access smaller CSV snippets.

Question 5: How can I cite the dataset?

You can cite the dataset using the following citation style: @article{goodbooks2017,…}.

