
Boosting Hugging Face Model Training Speed: Leveraging Flash Attention 2 for Enhanced Efficiency


Introduction

Packing instruction tuning examples (without padding) is now compatible with Flash Attention 2 in Hugging Face, thanks to a recent PR and the new DataCollatorWithFlattening. This new feature can provide up to a 2x improvement in training throughput while maintaining convergence quality. By leveraging Flash Attention 2, users can train their models faster and more efficiently without sacrificing model quality.

Packing without Padding

Padding input sequences in mini-batches is a common method to collate inputs during training. However, this introduces inefficiencies because of the irrelevant padding tokens. Packing examples without padding, and using the token position information, is a more efficient alternative. However, previous implementations of packing did not consider example boundaries when using Flash Attention 2, resulting in undesired cross-example attention that reduces quality and convergence.

Before the introduction of the new feature, packing without padding required manual modifications to the training code and relied on the model’s ability to provide position information. This made it less accessible to users who wanted to try this approach.

The new DataCollatorWithFlattening feature in Hugging Face eliminates the need for manual modifications and provides an easy-to-use solution for packing instruction tuning examples without padding.

Up to 2x Throughput Increase

We see a significant improvement in training throughput using the new DataCollatorWithFlattening. The figure below shows the throughput measured in tokens/second during training. In this example, the throughput is the per-GPU average over eight A100-80GB GPUs, measured over one epoch on a randomly selected 20K-example subset of two different instruction tuning datasets, FLAN and OrcaMath.

FLAN has short sequences on average but a large variance in sequence length, so example lengths in each batch may vary widely. This means that padded FLAN batches may incur a significant overhead in unused padding tokens. Training on the FLAN dataset shows a significant benefit using the new DataCollatorWithFlattening in terms of increased throughput. We see a 2x throughput increase on the models shown here: llama2-7B, mistral-7B, and granite-8B-code.

OrcaMath has longer examples and a lower variance in example length. As such, the improvement from packing is lower. Our experiments show a 1.4x increase in throughput when training using this form of packing on the OrcaMath dataset across these three models.

Memory usage also improves through packing with the new DataCollatorWithFlattening. The following figure shows the peak memory usage of the same three models training on the same two datasets. Peak memory is reduced by 20% on the FLAN dataset, which benefits considerably from packing.

[Figure: peak memory usage of the three models on FLAN and OrcaMath]

Peak memory reduction is 6% on the OrcaMath dataset with its more homogeneous example lengths.

Packing examples, when it reduces the number of optimization steps, may harm training convergence. The new DataCollatorWithFlattening, however, retains the minibatches and hence the same number of optimization steps as would be used with padded examples. Thus, there is no impact on training convergence, as we see in the next figure, which shows identical validation loss for the same three models training on the same two datasets, whether the models are trained with packing using the new DataCollatorWithFlattening or with padding.

[Figure: validation loss of the three models on FLAN and OrcaMath, with packing via DataCollatorWithFlattening and with padding]

How it Works

Consider a batch of data with a batch size of 4, where the four sequences are as follows:

[Figure: the four input sequences in the batch]

After concatenating the examples, the padding-free collator returns the input_ids, labels, and position_ids of each example. Hence, the collator provides, for this batch of data,

[Figure: the flattened input_ids, labels, and position_ids that the collator returns for this batch]
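For illustration, here is a minimal sketch of calling the collator on a toy batch of four tokenized examples. The token ids are made-up placeholders, and the exact tensor layout and label handling may differ slightly across transformers versions.

```python
from transformers import DataCollatorWithFlattening

# Toy batch of four tokenized examples; the token ids below are placeholders.
features = [
    {"input_ids": [1, 100, 101, 102]},
    {"input_ids": [1, 200, 201]},
    {"input_ids": [1, 300]},
    {"input_ids": [1, 400, 401, 402, 403]},
]

collator = DataCollatorWithFlattening()
batch = collator(features)

# All examples are concatenated into a single row with no padding:
#   input_ids:    [[1, 100, 101, 102, 1, 200, 201, 1, 300, 1, 400, 401, 402, 403]]
#   position_ids: [[0, 1,   2,   3,   0, 1,   2,   0, 1,   0, 1,   2,   3,   4 ]]
# The first label of each example is masked with -100, so the loss never
# crosses an example boundary.
print(batch["input_ids"])
print(batch["position_ids"])
print(batch["labels"])
```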

The modifications required are lightweight and are limited to providing the position_ids to Flash Attention 2.
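To make this concrete, here is a small sketch (not the exact transformers internals) of how the example boundaries needed by Flash Attention 2's variable-length kernels can be recovered from the flattened position_ids alone, using the fact that each example's positions restart at 0.

```python
import torch

def cu_seqlens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    """Derive cumulative sequence lengths for variable-length attention.

    `position_ids` is the flattened 1-D tensor produced by the padding-free
    collator; every 0 marks the start of a new example, so the example
    boundaries can be recovered without padding or an attention mask.
    """
    starts = torch.nonzero(position_ids == 0, as_tuple=True)[0]
    total_len = torch.tensor([position_ids.numel()], device=position_ids.device)
    return torch.cat([starts, total_len]).to(torch.int32)

# Two packed examples of lengths 4 and 3:
pos = torch.tensor([0, 1, 2, 3, 0, 1, 2])
print(cu_seqlens_from_position_ids(pos))  # tensor([0, 4, 7], dtype=torch.int32)
```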

This relies, however, on the model exposing position_ids. At the time of writing, 14 models expose them and are supported by the solution: Llama 2 and 3, Mistral, Mixtral, Granite, DBRX, Falcon, Gemma, OLMo, Phi 1, 2, and 3, Qwen 2 and Qwen 2 MoE, StableLM, and StarCoder 2.

Getting Started

Reaping the benefits of packing with position_ids is easy; only two steps are required:

  1. Instantiate the model with Flash Attention 2
  2. Use the new DataCollatorWithFlattening

How to Use it

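Putting the two steps together, here is a minimal, illustrative sketch using the Hugging Face Trainer. The model name, dataset file, and hyperparameters are placeholders for your own setup; the essential pieces are attn_implementation="flash_attention_2" and DataCollatorWithFlattening.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # any supported model that exposes position_ids

# Step 1: instantiate the model with Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize an instruction tuning dataset; the file and text column are placeholders.
dataset = load_dataset("json", data_files="instruct_data.jsonl", split="train")
dataset = dataset.map(
    lambda example: tokenizer(example["text"]),
    remove_columns=dataset.column_names,
)

# Step 2: use the new padding-free collator
collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./out",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```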

Conclusions

Packing instruction tuning examples, instead of padding, is now fully compatible with Flash Attention 2, thanks to a recent PR and the new DataCollatorWithFlattening. The method is compatible with models that use position_ids. Benefits can be seen in throughput and peak memory usage during training, with no degradation in training convergence. Actual throughput and memory improvements depend on the model and the distribution of example lengths in the training data. Training data with a wide variation in example lengths will see the greatest benefit, relative to padding, from using DataCollatorWithFlattening.

For a more detailed analysis, have a look at the paper at


Frequently Asked Questions

Q1: What is DataCollatorWithFlattening?

DataCollatorWithFlattening is a new feature in Hugging Face that allows users to pack instruction tuning examples without padding. This feature eliminates the need for manual modifications to the training code and provides an easy-to-use solution for packing examples without padding.

Q2: What are the benefits of using DataCollatorWithFlattening?

Using DataCollatorWithFlattening provides several benefits, including improved throughput, reduced peak memory usage, and no degradation in training convergence. It works with any supported model that exposes position_ids.

Q3: Are there any limitations to using DataCollatorWithFlattening?

Yes, there are some limitations to using DataCollatorWithFlattening. The feature only supports models that expose position_ids and relies on Flash Attention 2. Additionally, the actual throughput and memory improvements depend on the model and the distribution of example lengths in the training data.

Q4: How do I get started with DataCollatorWithFlattening?

Getting started with DataCollatorWithFlattening is easy. First, instantiate the model with Flash Attention 2. Then, use the new DataCollatorWithFlattening feature. For more detailed instructions, refer to the How to Use it section above.

Q5: What is Flash Attention 2?

Flash Attention 2 is a faster, more memory-efficient implementation of the attention mechanism that speeds up training and inference. The new DataCollatorWithFlattening is designed to be used with models instantiated with Flash Attention 2.
