T5 Text generation colab notebook:
@gwern @l4rz @theshawwn
It is more or less the same as the official notebook for fine-tuning on QA datasets, but I have modified it to generate free-form text. Here are a few key differences:
- Changed the datasets to news article generation and poem generation
- Changed the preprocessors to ones that work best for text generation
Key things to note here:
1. T5 is a sequence-to-sequence model, so it has both an encoder and a decoder, unlike GPT-2 which has only a decoder.
2. The max input sequence length is 512, and the max output sequence length is also 512.
3. It comes in many sizes, but the 11B version can only be trained on a TPUv3.
4. Like most TF models, training on a TPU requires storing your data in Google Cloud Storage, so I have written scripts to load the data and convert it into a trainable format.
5. TF does not have many options for loading a dataset from files as a tf.data dataset. The only supported options I found were TFRecord and TextLineDataset. TFRecord is a mess, so I chose TextLineDataset, which works but has some caveats.
6. In order to use TextLineDataset you have to convert the newlines in your text into some placeholder. I tried \n, <NL>, \\n and many more, but only |NL| works. Plus, you cannot feed text containing raw newline characters to T5: T5's tokenizer converts them to weird unicode tokens.
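The round trip can be sketched in a few lines (the `|NL|` token is from the thread; the helper names and exact spacing are mine): newlines are replaced before the text ever reaches the tokenizer, and only restored after decoding the generated output.

```python
NL_TOKEN = " |NL| "  # per the thread, the only placeholder that survived T5's tokenizer

def encode_newlines(text: str) -> str:
    """Replace raw newlines before feeding text to T5."""
    return text.replace("\n", NL_TOKEN)

def decode_newlines(text: str) -> str:
    """Restore newlines in T5's generated output (with or without surrounding spaces)."""
    return text.replace(NL_TOKEN, "\n").replace(NL_TOKEN.strip(), "\n")
```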
7. T5 lets you build custom dataloader-like functionality, which is helpful because you can produce all the variations of input/output from a single example without storing every variation on disk.
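In plain Python, the idea looks roughly like this (the function and the prefix-based split are illustrative, not T5's actual implementation): one stored line can yield several input/target pairs on the fly.

```python
def example_variations(words, min_prefix=1):
    """Yield (inputs, targets) pairs: every prefix of the text paired with its continuation."""
    for i in range(min_prefix, len(words)):
        yield " ".join(words[:i]), " ".join(words[i:])

# One stored example produces many training pairs without duplicating data on disk:
pairs = list(example_variations("the cat sat down".split()))
# -> [("the", "cat sat down"), ("the cat", "sat down"), ("the cat sat", "down")]
```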
8. T5 uses something called a task registry: a fancy way of describing the dataset, preprocessors, postprocessors, train/test split and much more. A single task is the entire pipeline from the cloud storage bucket to what gets fed to the model.
9. I have created two task registries: one for the BBC news articles dataset and one for the poems dataset. You can create as many tasks as you like. Since T5 is a one-model-to-rule-them-all design built with transfer learning in mind, it will train on all the tasks, but that can slow training down.
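The thread doesn't include the registration code, so here is a hedged sketch of what one such task plus a mixture might look like with the t5 library's TaskRegistry/MixtureRegistry (the task names, bucket path, and exact keyword arguments are assumptions and vary across t5 versions; this needs the `t5` package and a real bucket to run):

```python
import t5
import tensorflow as tf

def bbc_dataset_fn(split, shuffle_files=False):
    # One flattened article per line (newlines already replaced with |NL|);
    # the bucket path is a placeholder, and a real version would pick a file per split.
    ds = tf.data.TextLineDataset("gs://my-bucket/data/bbc_articles.txt")
    # prefix_lm later splits 'targets' into an input prefix and a target continuation.
    return ds.map(lambda line: {"targets": line})

t5.data.TaskRegistry.add(
    "bbc_news_gen",
    dataset_fn=bbc_dataset_fn,
    splits=["train"],
    text_preprocessor=[],
    token_preprocessor=t5.data.preprocessors.prefix_lm,
    metric_fns=[],
)

# A second task ("poems_gen") would be registered the same way; a mixture
# then trains one model on both tasks at once (at the cost of slower training).
t5.data.MixtureRegistry.add(
    "text_gen_all", ["bbc_news_gen"], default_rate=1.0
)
```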
10. In the dataset loading step you can perform all sorts of things. One thing I tried was replacing the newline token back with a real newline, but the model went crazy because the tokenizer converts special characters to unicode tokens (the < character was also converted). That's why I used |NL|.
11. T5 has many preprocessors built in. I tried several that seemed promising, like GPT-2-style concatenation of texts with <|eot|>, splitting and offsetting text, and pure GPT-2-style LM, but nothing worked except prefix_lm.
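Conceptually, prefix_lm splits each example at a random point into an input prefix and a target continuation. A plain-Python sketch of that idea (the real preprocessor in the t5 library operates on token IDs, not strings):

```python
import random

def prefix_lm_split(tokens, rng=random):
    """Split a token sequence at a random pivot into (inputs, targets)."""
    pivot = rng.randrange(1, len(tokens))  # at least one token on each side
    return tokens[:pivot], tokens[pivot:]

inputs, targets = prefix_lm_split(["T5", "is", "text", "to", "text"])
# inputs + targets always reconstructs the original sequence
```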
12. On the bright side, it just works: you don't waste time hunting TF bugs. As long as you have built the data pipeline correctly, the entire training step is handled internally by Mesh TensorFlow, which just WORKS.
13. On the dark side, you can't customize beyond what the authors have provided. Not that bad, considering TF would take hours of your time if you were working with it directly.
14. Inference is a mess due to Mesh TensorFlow. It does not work on GPU.
15. Generated samples are not bad at all, but you have to wait a few minutes because the Mesh TensorFlow inference pipeline is very slow.
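For reference, inference in the official-colab style looks roughly like this (the model_dir, TPU settings, and file paths are placeholders; this needs the t5 library, a TPU, and a trained checkpoint, so treat it as a sketch rather than runnable code):

```python
import t5

# Placeholders: point these at your own bucket and TPU runtime.
model = t5.models.MtfModel(
    model_dir="gs://my-bucket/models/t5-large",
    tpu=None,  # or the TPU address from the colab runtime
    sequence_length={"inputs": 512, "targets": 512},
    batch_size=8,
)

# predict() reads one |NL|-flattened prompt per line from input_file
# and writes one generated sample per line to output_file.
model.predict(
    input_file="gs://my-bucket/data/prompts.txt",
    output_file="gs://my-bucket/data/samples.txt",
    checkpoint_steps=-1,  # use the latest checkpoint
    temperature=1.0,
)
```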
16. It is possible to export the model to a SavedModel, but that is WIP, so stay tuned!
You can follow @NaxAlpha.