The following code allowed me to successfully download the IMDB dataset with fastai to a Modal volume:
import os

# Point fastai's cache at the volume before fastai reads its config
os.environ["FASTAI_HOME"] = "/data/fastai"

import modal
from fastai.text.all import *

app = modal.App("imdb-dataset-train")
vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)

@app.function(
    gpu="any",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def download():
    path = untar_data(URLs.IMDB)
    print(f"Data downloaded to: {path}")
    return path
Run it with:

    modal run train.py::download
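As a sanity check that the files actually landed on the volume, the Modal CLI can list its contents. The volume-relative path here is my assumption, based on FASTAI_HOME=/data/fastai putting the extracted data under fastai/data/imdb:

    modal volume ls modal-llm-data fastai/data/imdb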
Next, I tried to train the language model for one epoch:
@app.function( gpu="h100", image=modal.Image.debian_slim().pip_install("fastai"), volumes={"/data": vol}, timeout=20 * 60,)def train(): path = untar_data(URLs.IMDB) print(f"Training with data from: {path}") get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
dls_lm = DataBlock( blocks=TextBlock.from_folder(path, is_lm=True), get_items=get_imdb, splitter=RandomSplitter(0.1), ).dataloaders(path, path=path, bs=128, seq_len=80)
print("Sample from datablock:") print(dls_lm.show_batch(max_n=2))
learn = language_model_learner( dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()] ).to_fp16()
learn.fit_one_cycle(1, 2e-2) learn.save("1epoch")
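I launched it the same way as the download function:

    modal run train.py::train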
I waited around for a long time and never saw “Sample from datablock:” print.
Looking into the volume with the Modal UI, I noticed that the /fastai/data/imdb_tok/unsup folder had been modified recently.
It seemed like the tokenization of the dataset was taking a long time.
I was able to do this tokenization quite quickly on my local machine, so I am going to chalk this up to the Modal volume not being as performant as a local file system for workloads that touch many small files.
While I’m not 100% sure, I think the need to train with so many little files may undermine my ability to train this model on Modal.
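If I revisit this, one workaround I might try is to keep all of the small-file I/O on the container's local disk and only move single large archives to and from the volume. This is an untested sketch: the /tmp scratch location, the archive path on the volume, and the expectation that fastai caches the tokenized corpus in a sibling imdb_tok folder (which matches what I saw in the volume UI) are all my assumptions:

    import os

    # Assumption: point fastai at container-local disk instead of the volume,
    # so extraction and tokenization write their many small files locally.
    os.environ["FASTAI_HOME"] = "/tmp/fastai"

    import tarfile
    from pathlib import Path

    import modal
    from fastai.text.all import *

    app = modal.App("imdb-dataset-train-localdisk")
    vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)

    # Hypothetical location for a single large archive on the volume
    TOK_ARCHIVE = Path("/data/imdb_tok.tar")

    @app.function(
        gpu="h100",
        image=modal.Image.debian_slim().pip_install("fastai"),
        volumes={"/data": vol},
        timeout=60 * 60,
    )
    def train():
        # One sequential download plus local extraction; this was fast for me
        # even though the extracted tree contains many small files.
        path = untar_data(URLs.IMDB)

        # Restore a previously tokenized corpus from the volume, if present,
        # as one large read instead of thousands of small ones.
        if TOK_ARCHIVE.exists():
            with tarfile.open(TOK_ARCHIVE) as tf:
                tf.extractall(path.parent)

        get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
        dls_lm = DataBlock(
            blocks=TextBlock.from_folder(path, is_lm=True),
            get_items=get_imdb,
            splitter=RandomSplitter(0.1),
        ).dataloaders(path, path=path, bs=128, seq_len=80)

        learn = language_model_learner(
            dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
        ).to_fp16()
        learn.fit_one_cycle(1, 2e-2)
        learn.save("1epoch")  # note: this also lands on local disk, not the volume

        # Persist the tokenized corpus back to the volume as a single tarball,
        # relying on Modal committing volume changes on exit, which evidently
        # worked for the download step.
        with tarfile.open(TOK_ARCHIVE, "w") as tf:
            tf.add(path.parent / "imdb_tok", arcname="imdb_tok")

This keeps the volume's role to a few large sequential reads and writes, which is the access pattern network file systems handle well, while the tokenizer's small-file churn stays on local disk.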