Practical Deep Learning, Lesson 4, Language Model Blog Post Imitator

In this notebook/post, we're going to use the markdown content from my blog to train a language model. From there, we'll prompt the model to generate a post on a topic I might write about.

Let’s import fastai and disable warnings since these pollute the notebook a lot when I’m trying to convert these notebooks into posts (I am writing this as a notebook and converting it to a markdown file with this script).

from fastai.text.all import *
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

Loading the data

The written content in my blog is markdown (.md) files. You can see the raw contents of any of these posts by appending /index.md to the end of the URL on any post on this site. They look something like

---
<frontmatter key values>
---
<the rest of the post with code, links, images, shortcodes, etc.>

To get started, I modeled my approach after the one used in chapter 10, which looks something like

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

There were a few modifications I needed to make to this approach. For starters, we're loading .md files rather than .txt files, so initially I tried to do this with

path = Path("./data/content")
files = get_files(path, extensions='.md', recurse=True)
for f in files[:3]:
    print(f)
data/content/posts/2013/2013-07-05-qc.md
data/content/posts/2024/models-writing-about-coding-with-models.md
data/content/posts/2024/vlms-hallucinate.md

However, the .md files seemed to cause opaque and confusing issues with the DataBlock or DataLoaders, which manifested something like this

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[209], line 10
3 # First, create a tokenizer
4 text_processor = TextBlock.from_folder(path, is_lm=True)
6 dls_lm = DataBlock(
7 blocks=text_processor,
8 get_items=get,
9 splitter=RandomSplitter(0.1)
---> 10 ).dataloaders(
11 path,
12 path=path,
13 bs=128,
14 seq_len=80,
15 )
File ~/dev/lab/fastbook_projects/blog_post_generator/.venv/lib/python3.12/site-packages/fastai/data/block.py:157, in DataBlock.dataloaders(self, source, path, verbose, **kwargs)
151 def dataloaders(self,
152 source, # The data source
153 path:str='.', # Data source and default `Learner` path
154 verbose:bool=False, # Show verbose messages
155 **kwargs
156 ) -> DataLoaders:
--> 157 dsets = self.datasets(source, verbose=verbose)
158 kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
...
387 self.types.append(type(x))
--> 388 types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()
389 self.pretty_types = '\n'.join([f' - {t}' for t in types])
TypeError: 'NoneType' object is not iterable

To work around this, I changed all the file extensions to .txt, which allowed fastai to load and tokenize the dataset.
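
For illustration, the extension change on its own could look something like the sketch below. In practice, I fold it into the copy_and_rename_files function shown later, which copies cleaned files into a separate directory rather than renaming anything in place.

# Sketch only: rename .md files to .txt in place.
# My actual approach (copy_and_rename_files below) writes cleaned copies
# to ./data/cleaned_content and leaves the originals untouched.
for md_file in Path('./data/content').rglob('*.md'):
    md_file.rename(md_file.with_suffix('.txt'))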

Next, I had an issue with encoding

return f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I solved this in the copy_and_rename_files function by trying a few encodings with

src_file.read_text(encoding=encoding)

There is also a function called clean_content which I will address later.

import re

def clean_content(text):
    # Handle code blocks with language specifiers
    text = re.sub(r'```(\w+)?\n(.*?)```', lambda m: f'<CODE>{m.group(2)}</CODE>', text, flags=re.DOTALL)
    # Replace single backticks
    text = re.sub(r'`([^`]+)`', r'<INLINE_CODE>\1</INLINE_CODE>', text)
    return text

def copy_and_rename_files():
    src_dir = Path("./data/content")
    dst_dir = Path("./data/cleaned_content")
    if not dst_dir.exists():
        dst_dir.mkdir(parents=True)
    for src_file in src_dir.rglob("*.md"):
        try:
            content = None
            for encoding in ['utf-8', 'latin-1', 'cp1252']:
                try:
                    content = src_file.read_text(encoding=encoding)
                    if content.startswith('---'):
                        # Remove markdown frontmatter
                        parts = content.split('---', 2)
                        if len(parts) >= 3:
                            content = parts[2]
                    break
                except UnicodeDecodeError:
                    continue
            if content is None:
                print(f"Skipping {src_file}: Unable to decode with supported encodings")
                continue
            content = clean_content(content)
            rel_path = src_file.relative_to(src_dir)
            dst_file = dst_dir / rel_path.with_suffix('.txt')
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            dst_file.write_text(content, encoding='utf-8')
        except Exception as e:
            print(f"Error processing {src_file}: {str(e)}")

copy_and_rename_files()
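
To sanity check the cleaning step, we can run clean_content on a small made-up markdown snippet (the sample here is hypothetical, not from an actual post) and confirm that fenced and inline code become single tokens.

# Hypothetical sample input, just to see the substitutions in action
sample = "Some text with `inline code` and a fence:\n```python\nprint('hi')\n```\n"
print(clean_content(sample))
# Some text with <INLINE_CODE>inline code</INLINE_CODE> and a fence:
# <CODE>print('hi')
# </CODE>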

Training the model

With the needed adjustments to the dataset made, and the model-ready content now in the data/cleaned_content folder, we can load and tokenize that data with fastai.

We validate that we can read the file paths with our code

path = Path("./data/cleaned_content")
files = get_text_files(path)
for f in files[:3]:
    print(f)
data/cleaned_content/posts/2013/2013-07-05-qc.txt
data/cleaned_content/posts/2024/language-model-based-aggregators.txt
data/cleaned_content/posts/2024/making-your-vision-real.txt

Then we create a DataBlock and view the loaded, tokenized content

get = partial(get_text_files, folders=['posts', 'til', 'logs', 'projects'])
text_processor = TextBlock.from_folder(path, is_lm=True)
dls_lm = DataBlock(
    blocks=text_processor,
    get_items=get,
    splitter=RandomSplitter(0.1)
).dataloaders(
    path,
    path=path,
    bs=128,
    seq_len=512,
)
dls_lm.show_batch(max_n=2)
text text_
0 xxbos xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \n▁ { \n▁ " role " : " system " , \n▁ " content " : [ \n▁ { \n▁ " type " : " text " , \n▁ " text " : " you should respond with the understanding they are an experienced software xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \n▁ { \n▁ " role " : " system " , \n▁ " content " : [ \n▁ { \n▁ " type " : " text " , \n▁ " text " : " you should respond with the understanding they are an experienced software engineer
1 xxunk / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ’s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ’s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use to

That looks good (the xxbos, xxmaj, and xxunk markers are fastai's special tokens for beginning-of-stream, capitalization, and out-of-vocabulary words), so we create a learner and run the same approach as Chapter 10, checkpointing as we go. We don't have to do it this way, since we could just call fit_one_cycle as many times as we want, but it was helpful for me to validate the process end-to-end once more.

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy perplexity time
0 4.923746 4.835233 0.222222 125.867935 00:12
learn.save('1epoch')
Path('data/cleaned_content/models/1epoch.pth')
learn = learn.load('1epoch')
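
As an aside, I reused the learning rates from chapter 10 here. If you wanted to choose one empirically instead, fastai's learning rate finder is one option; here's a sketch (I didn't run this for this post):

# Plot loss against a range of learning rates and print fastai's suggested value
print(learn.lr_find())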

Now, we do the bulk of the fine-tuning.

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch train_loss valid_loss accuracy perplexity time
0 3.721324 4.359152 0.278559 78.190788 00:12
1 3.582399 3.972712 0.338368 53.128395 00:13
2 3.452389 3.608671 0.379557 36.916973 00:13
3 3.307291 3.372669 0.413889 29.156248 00:13
4 3.221604 3.291163 0.422917 26.874105 00:13
5 3.131132 3.213343 0.436892 24.862070 00:13
6 3.047300 3.147552 0.447179 23.278996 00:13
7 2.978492 3.120242 0.455946 22.651857 00:14
8 2.921093 3.109544 0.458030 22.410818 00:14
9 2.876113 3.107162 0.458464 22.357492 00:12
learn.save_encoder('finetuned')
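
Saving the encoder matters if we want to reuse it later, for example as the starting point for a text classifier the way chapter 10 does. Here's a minimal sketch of that pattern, where dls_clas is a hypothetical classifier DataLoaders I haven't built here:

# Sketch: load the fine-tuned encoder into a downstream text classifier.
# dls_clas is a placeholder; building it is out of scope for this post.
learn_clas = text_classifier_learner(
    dls_clas, AWD_LSTM, drop_mult=0.5,
    metrics=accuracy
).to_fp16()
learn_clas = learn_clas.load_encoder('finetuned')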

Experimenting with the Result

With the fine-tuned model, we can run inference! We define the beginning of the content we want the model to continue in TEXT, then call learn.predict with the number of words we want the model to generate and a temperature that controls the randomness/creativity of the output.

TEXT = "I am interacting with a language model as a thought partner to"
N_WORDS = 256
TEMP = 0.75
pred = learn.predict(TEXT, N_WORDS, temperature=TEMP)
print(pred)
i am interacting with a language model as a thought partner to train .
I think it should be more difficult to use Sonnet but it is often difficult to get to because i am a very familiar language model .
In this way , i wanted to use the phrase Sonnet > rather than just see the phrase as a word .
He 's also using the word " prompt " , which is a word word that is used in art , then as a title to describe a language model that extract structured data from an individual 's experience .
I 've used this idea to describe how a language model has a structure with a single structure and i can try and solve this with the following following Sonnet code : i n't have such a thead for a language model .
Being like this is an easy way to design an [ initial Sonnet ] .
In another example , where you would have written a single word to explicitly describe a language model , i had to use the following word usage : < inline_code >
< / INLINE_CODE > : The word i used to read the script in a language model and then describe the word as an JSON object .
This working was quite different from my JSON code and Python CODE >
the most recent model to use this PYTHON code

The output is a little wild but it kinda sorta makes sense and isn’t too bad for a model I trained in a couple minutes on my MacBook.
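
To get a feel for how much the temperature matters, one easy experiment (a sketch using the same learn.predict call; I'm not showing its output here) is to generate a few shorter completions at different temperatures and compare them.

# Lower temperatures stick closer to the most likely next words;
# higher temperatures produce more varied (and usually less coherent) text.
for temp in [0.25, 0.75, 1.5]:
    print(f'--- temperature={temp} ---')
    print(learn.predict(TEXT, 50, temperature=temp))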

I tweaked several parts of the approach in an effort to improve the model's output quality. Jumping back to the clean_content function, I found that removing the markdown frontmatter and replacing backtick code fences with single tokens seemed to make the output a bit more coherent. When this fine-tuned model tries to generate code, though, it makes little sense and does strange things like emitting repeated words such as importimportimport. I suspect this deficiency is because the base model wasn't trained on much source code.

So there we have it. A simple language model fine-tuned on my blog posts. This was a helpful experience for getting a feel for some feature engineering.

If you liked this post, be sure to check out some of the other notebooks I've built while working through the FastAI Course, linked below.