“Fine-Tuning Multimodal Models to Unlock Humor: Cartoon-Joke Pairing with CLIP”

Introduction

Multimodal models like CLIP have opened up new AI use cases by connecting complex objects like images to text descriptions that are easy to understand, generate, and parse. However, off-the-shelf models like CLIP are trained on broad, general-purpose data that may not represent the data typically seen in a specific domain, in which case fine-tuning may be needed to adapt the model to that domain.
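
To make that image-to-text connection concrete, here is a minimal zero-shot matching sketch. The library (open_clip), the checkpoint name, the image path, and the candidate descriptions are all illustrative assumptions; the post does not specify which CLIP implementation it uses.

import torch
import open_clip
from PIL import Image

# Assumed model/checkpoint, purely for illustration
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("cartoon.jpg")).unsqueeze(0)  # hypothetical image file
texts = tokenizer(["a dog hosting a dinner party", "a cat driving a taxi"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Probability that each candidate description matches the image
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)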

Fine-Tuning CLIP for Cartoon Caption Prediction

This post shows how to fine-tune the CLIP model on cartoon images from The New Yorker Magazine and joke captions for those cartoons. It is based on [1], a dataset for various tasks associated with The New Yorker’s cartoon contest. One of the tasks is to take a cartoon image and predict the appropriate caption from a list of possible captions. Let’s see how we can fine-tune CLIP for this task.

Data Preparation

The data is hosted and publicly available at gs://datachain-demo/newyorker_caption_contest, and it has two parts:

  • images: A folder of JPEG files, each representing a cartoon image.
  • new_yorker_meta.parquet: A parquet file with metadata about the images, including multiple choices of captions for the image and the correct caption choice.
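
Before building anything, you can peek at the metadata directly. Here is a small sketch assuming pandas with gcsfs installed for anonymous access to the public bucket; the column names shown ("filename", "caption_choices", "label") are the ones referenced later in this post.

import pandas as pd

# Read the public parquet file anonymously (requires the gcsfs package)
meta = pd.read_parquet(
    "gs://datachain-demo/newyorker_caption_contest/new_yorker_meta.parquet",
    storage_options={"token": "anon"},
)
print(meta[["filename", "caption_choices", "label"]].head())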

To work with this data, we will use the open-source library [2], which helps wrangle unstructured data like this into a more structured format (disclaimer: I helped develop [2]). All of the code used in this post is available in [3], or you can run it in [4].

Data Processing

The data processing code, sketched below, first creates a dataset img_dc from the images in a directory, storing the essential information about each file, which we will use later to read the images. Then, it creates a dataset meta_dc from the parquet file of metadata. Finally, it merges these two based on the image filename: img_dc contains a column file.path with the full path to each file, and img_dc.mutate(filename=path.name(C("file.path"))) extracts only the last part of that path, which matches the contents of the filename column in meta_dc. The merged dc dataset has both the file info and metadata for each image.
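
A minimal sketch of this step is below. It assumes [2] is the datachain library (suggested by the gs://datachain-demo bucket) and that constructors along the lines of from_storage, from_parquet, and merge are available; exact names and import paths may differ between versions.

from datachain import C, DataChain
from datachain.sql.functions import path  # import path is an assumption; may differ by version

# File info for every cartoon image (used later to read the images)
img_dc = DataChain.from_storage(
    "gs://datachain-demo/newyorker_caption_contest/images", type="image"
)

# Metadata with caption choices and the correct label for each cartoon
meta_dc = DataChain.from_parquet(
    "gs://datachain-demo/newyorker_caption_contest/new_yorker_meta.parquet"
)

# Extract the bare filename from file.path and join the two datasets on it
dc = img_dc.mutate(filename=path.name(C("file.path"))).merge(meta_dc, on="filename")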

Training

We can view a sample by filtering and collecting the data like this:

sample = dc.filter(C("file.path").endswith("/371.jpeg")).limit(1)
sample_results = list(sample.collect("file", "caption_choices", "label"))

This limits the data to the image ending in /371.jpeg and collects only the columns "file", "caption_choices", and "label". The resulting output includes an ImageFile object, a list of possible captions, and a label for the letter choice of the correct caption. You may end up with slightly different results, since there are multiple rows per image with different caption choices.

Training Loop

The training code selects the columns needed for training ("file", "caption_choices", "label_ind") and then calls to_pytorch() with the CLIP preprocessor and tokenizer, which returns a PyTorch IterableDataset of preprocessed image tensors, tokenized text, and label indices. Next, it creates a PyTorch DataLoader and optimizer and passes them to train() to start training; a rough sketch of these steps is shown below.
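
The sketch below is hypothetical: the model and tokenizer setup (open_clip, ViT-B-32), the to_pytorch() argument names, the batch layout, and the loss formulation (cross-entropy over per-caption similarity scores) are all assumptions, since the post does not show them. train_dc is the merged dataset prepared earlier.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
import open_clip  # assumed CLIP implementation

# Assumed model and checkpoint
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Argument names for to_pytorch() are assumptions
ds = train_dc.select("file", "caption_choices", "label_ind").to_pytorch(
    transform=preprocess, tokenizer=tokenizer
)
loader = DataLoader(ds, batch_size=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train(loader, model, optimizer, epochs=5):
    # Hypothetical loop: score each caption choice against its image and
    # minimize cross-entropy against the index of the correct caption.
    model.train()
    for epoch in range(epochs):
        total, n_batches = 0.0, 0
        for images, texts, labels in loader:
            image_features = F.normalize(model.encode_image(images), dim=-1)    # (B, D)
            b, n, t = texts.shape                                               # (B, choices, tokens)
            text_features = F.normalize(
                model.encode_text(texts.view(b * n, t)).view(b, n, -1), dim=-1
            )
            scores = torch.einsum("bd,bnd->bn", image_features, text_features)  # (B, choices)
            loss = F.cross_entropy(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
            n_batches += 1
        print(f"loss for epoch {epoch}: {total / max(n_batches, 1)}")

train(loader, model, optimizer)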

Model Evaluation

Since we are using a tiny dataset, we can quickly see the model fit to the sample, with the loss decreasing dramatically:

loss for epoch 0: 5.243085099384018
loss for epoch 1: 6.937912189641793e-05
loss for epoch 2: 0.0006402461804100312
loss for epoch 3: 0.0009484810252615716
loss for epoch 4: 0.00019728825191123178

This should set off alarm bells about overfitting, but for this exercise, it’s useful to see that train() is doing what we expect: learning the correct captions from the training dataset. We can confirm by calculating the predicted probability of the correct caption for each image in the training data using the fine-tuned model:

train_dc = train_dc.map(label_prob, params=["scores_fine_tune", "label_ind"], output={"label_prob_fine_tune": float})
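
The label_prob function itself isn't shown in the post. Here is a minimal sketch, assuming scores_fine_tune holds one similarity score per caption choice for the image:

import torch

def label_prob(scores_fine_tune, label_ind) -> float:
    # Softmax over the per-caption scores, then take the probability
    # assigned to the correct caption's index.
    probs = torch.softmax(torch.tensor(scores_fine_tune), dim=0)
    return float(probs[label_ind])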

Running train_dc.avg("label_prob_fine_tune") outputs an average predicted probability >0.99, so it looks like the fine-tuning worked as expected.

Conclusion

This is an artificial example, but it hopefully gives you an idea of how to fine-tune CLIP. To solve the task of predicting the correct caption in a more robust way, you would want to use a much larger sample and evaluate against a held-out set of images and captions that weren't seen during training. When trying that, you may find that CLIP does not generalize so well to the caption prediction problem, which should not be too surprising: CLIP was built to understand the contents of images rather than to understand jokes.

Frequently Asked Questions

Question 1: What is CLIP?

CLIP (Contrastive Language-Image Pre-training) is a multimodal model that embeds images and text in a shared space, connecting complex objects like images to text descriptions that are easy to understand, generate, and parse.

Question 2: What is the dataset used in this example?

The dataset used in this example is [1], a dataset for various tasks associated with The New Yorker’s cartoon contest.

Question 3: How do you fine-tune CLIP?

To fine-tune CLIP, you need to prepare the data, process the data, and then train the model using a PyTorch DataLoader and optimizer.

Question 4: What is the goal of fine-tuning CLIP?

The goal of fine-tuning CLIP is to adapt the model to a specific domain or task, such as predicting the correct caption for a cartoon image.

Question 5: How do you evaluate the performance of the fine-tuned model?

You can evaluate the performance of the fine-tuned model by calculating the predicted probability of the correct caption for each image in the training data and comparing it to the actual label.
