Intro
The scale of language models these days goes well beyond a hobbyist's capacity to reproduce (GPT-4 is rumored to be ~1.7T parameters), requiring thousands of GPUs to train. This model won't come anywhere close to that - it's purely an experiment for fun! To satisfy ambition, I went for 1 billion parameters. From interacting with various GPT-2 model sizes, 1B seemed like a sweet spot for fluency and capability while still running locally without difficulty.
Phase 2 - Design an Architecture
PyTorch is a very familiar framework to me at this point. Countless hours have been spent building everything from chess engines to RL snake-playing agents. The transformer architecture was new to me, though. I remember distinctly one morning on vacation, sitting in the Aqua Aloha Surf hotel in Honolulu, casually parsing through Attention Is All You Need - as one does. Thus began my deep dive into transformer-based language models.
Understanding the transformer mechanism, the field's advancements since (RoPE, data/compute/training optimizations, etc.), and exactly what I could do with it took more time. Given my hardware limitations (an RTX 4060 Ti with 16GB of VRAM), I restricted myself to ≤1B parameter models. Months of toying around and optimizing compute, memory, and data requirements led to the model that is training as we speak.
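To give a sense of why ≤1B was the ceiling: even before counting activations, a vanilla AdamW setup for a 1B-parameter model eats most of a 16GB card. Here's the ballpark math (assuming bf16 weights/gradients with fp32 optimizer state - not necessarily the exact recipe I ended up with):

# Ballpark training-memory budget for a 1B-parameter model (assumptions:
# bf16 weights/gradients + fp32 AdamW moments; activations not included)
n_params = 1e9
weights_gb = n_params * 2 / 1e9      # bf16: 2 bytes per parameter
grads_gb   = n_params * 2 / 1e9      # bf16 gradients
adam_gb    = n_params * 4 * 2 / 1e9  # fp32 first + second moments
print(f"{weights_gb + grads_gb + adam_gb:.0f} GB before activations")  # ≈ 12 GB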
Here's some example code from the model.py file where I implemented the architecture. I use PyTorch for the model, which makes for super neat code - I take pride in my work!
# Imports needed for the snippets below
import math
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import sdpa_kernel, SDPBackend

# Implementation of multi-head attention
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, embed_dim, num_heads, n_positions,
                 device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
                 dropout=.2):
        super(MultiHeadAttention, self).__init__()
        # Initialize parameters
        self.n_positions = n_positions
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads
        self.device = device
        # Linear layers for transforming inputs (fused QKV projection + output projection)
        self.layer_1 = torch.nn.Linear(embed_dim, embed_dim*3, bias=True)
        self.W_o = torch.nn.Linear(embed_dim, embed_dim, device=device, bias=True)
        self.scale = 1 / math.sqrt(self.d_k)
        self.dropout = dropout

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None):
        B, N, C = x.size()
        # Apply linear transformations and split heads -> (B, num_heads, N, d_k)
        Q, K, V = self.layer_1(x).split(self.embed_dim, dim=2)
        Q: torch.Tensor = Q.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        K: torch.Tensor = K.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        V: torch.Tensor = V.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        # Apply RoPE to queries and keys
        Q = apply_rope(Q, N, self.device)
        K = apply_rope(K, N, self.device)
        if attn_mask is not None:
            # Broadcast the (B, N) padding mask (True/1 = real token) over heads and
            # query positions, then fold in causality, since SDPA forbids passing an
            # explicit attn_mask together with is_causal=True
            attn_mask = attn_mask.unsqueeze(1).unsqueeze(2).to(torch.bool)
            causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
            attn_mask = attn_mask & causal
        with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION, SDPBackend.FLASH_ATTENTION,
                          SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION]):
            attn_out = scaled_dot_product_attention(
                Q, K, V,
                attn_mask=attn_mask,
                dropout_p=self.dropout if self.training else 0,  # nn.Module tracks train/eval
                is_causal=attn_mask is None,
                scale=self.scale)
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, N, C)
        return self.W_o(attn_out)
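One note on the snippet above: it calls apply_rope, which lives elsewhere in model.py. For completeness, here's a minimal sketch of what a rotary-embedding helper with that signature could look like - the real helper may cache the cos/sin tables instead of recomputing them each call:

# Simplified rotary position embedding helper matching the apply_rope(x, N, device)
# calls above. x has shape (B, num_heads, N, d_k); adjacent channel pairs are rotated
# by position-dependent angles. (Sketch only, not necessarily the exact implementation.)
def apply_rope(x: torch.Tensor, seq_len: int, device: torch.device, base: float = 10000.0) -> torch.Tensor:
    d_k = x.size(-1)
    # Frequencies for each channel pair: theta_i = base^(-2i/d_k)
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2, device=device).float() / d_k))
    positions = torch.arange(seq_len, device=device).float()
    angles = torch.outer(positions, inv_freq)   # (N, d_k/2)
    cos, sin = angles.cos(), angles.sin()       # broadcast over (B, num_heads)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # even / odd channels of each pair
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2).type_as(x)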
# One block of the decoder-only transformer (pre-LayerNorm)
class DecoderLayer(torch.nn.Module):
    def __init__(self, n_embed, n_head, n_positions, n_ff, dropout=.1):
        super(DecoderLayer, self).__init__()
        # Self-attention sub-layer
        self.mh_attn = MultiHeadAttention(n_embed, n_head, n_positions)
        self.mha_dropout = torch.nn.Dropout(p=dropout)
        self.mha_layer_norm = torch.nn.LayerNorm(n_embed)
        # Feed-forward sub-layer
        self.ff_layers = torch.nn.Sequential(
            torch.nn.Linear(n_embed, n_ff),
            torch.nn.GELU(),
            torch.nn.Linear(n_ff, n_embed))
        self.ff_dropout = torch.nn.Dropout(p=dropout)
        self.ff_layer_norm = torch.nn.LayerNorm(n_embed)

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        # Apply layer norm, MHA, and residual connection
        attn_output = self.mh_attn(self.mha_layer_norm(x), attn_mask=attn_mask)
        attn_output = self.mha_dropout(attn_output)
        x = x + attn_output
        # Apply layer norm, feed-forward, and residual connection
        ff_output = self.ff_layers(self.ff_layer_norm(x))
        ff_output = self.ff_dropout(ff_output)
        x = x + ff_output
        return x
With this model we're off to the races! Sharing the LM head weights with the token embeddings keeps the parameter count just above the 1B target. RoPE is nice to have (no learned position embeddings needed), and the chosen head dimension makes for quick but effective training. Not big, but definitely capable of something!
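To show how those pieces stack up, here's a skeleton of the full decoder-only model with the LM head tied to the embedding table. The hyperparameters are just named arguments here - the actual configuration isn't spelled out in this post:

# Skeleton of the full decoder-only model with weight tying (sketch, not the exact file)
class GPTModel(torch.nn.Module):
    def __init__(self, vocab_size, n_embed, n_head, n_layer, n_positions, n_ff):
        super().__init__()
        self.tok_embed = torch.nn.Embedding(vocab_size, n_embed)
        self.layers = torch.nn.ModuleList(
            [DecoderLayer(n_embed, n_head, n_positions, n_ff) for _ in range(n_layer)])
        self.final_norm = torch.nn.LayerNorm(n_embed)
        self.lm_head = torch.nn.Linear(n_embed, vocab_size, bias=False)
        self.lm_head.weight = self.tok_embed.weight  # weight sharing: one matrix for both

    def forward(self, idx: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        x = self.tok_embed(idx)                      # RoPE handles positions inside attention
        for layer in self.layers:
            x = layer(x, attn_mask=attn_mask)
        return self.lm_head(self.final_norm(x))      # (B, N, vocab_size) logits

Tying the head to the embeddings saves roughly vocab_size × n_embed parameters, which is a meaningful chunk at this scale.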
If you're curious about the full code, you can check it out on my GitHub!
Architecture Stats
Training!
Training took by far the most time. At first I attempted training locally, at about 5k tokens/sec throughput. Quick maths puts that at roughly 60 days per epoch of training. I was not going to be waiting on that... I soon gave up on my 4060 Ti and used online compute from LambdaAI.com (not sponsored, highly recommend). Over 10 days, the model crunched through 50 billion tokens. And that brings us to finetuning - the real hard part.
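For the curious, the back-of-envelope math behind those figures (everything here is derived from the numbers above):

# Back-of-envelope math behind the training-time figures (all inputs from the post)
SECONDS_PER_DAY = 86_400

local_tps = 5_000                                   # local RTX 4060 Ti throughput, tokens/sec
epoch_tokens = 60 * SECONDS_PER_DAY * local_tps     # 60 days/epoch implies ~26B tokens/epoch
print(f"implied epoch size: {epoch_tokens:.1e} tokens")

cloud_tokens, cloud_days = 50e9, 10                 # the 10-day cloud run
cloud_tps = cloud_tokens / (cloud_days * SECONDS_PER_DAY)
print(f"cloud throughput: {cloud_tps:,.0f} tokens/sec")  # ≈ 57,870, roughly 11-12x local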
You see, pretraining is easy! Point and go. Finetuning is the real art of the deal - finding which combination of training steps, techniques, and data will actually improve the model. Just finding quality datasets was hard enough, so I opted for a crowd-sourced approach: iterative refinement of the model based on accepted/rejected prompt pairs. Please take a minute to contribute - and see the model in action while you're at it!
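The collection pipeline isn't worth a full code dump, but conceptually every contribution boils down to a record like this - field names are illustrative, not the actual schema:

# Illustrative shape of one crowd-sourced preference record (hypothetical field names)
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str      # what the contributor asked the model
    chosen: str      # the response the contributor accepted
    rejected: str    # the response the contributor rejected

example = PreferencePair(
    prompt="Explain RoPE in one sentence.",
    chosen="RoPE rotates query/key channel pairs by position-dependent angles.",
    rejected="RoPE is a kind of rope.")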
Check Back in Tomorrow for More!
Updates daily on the project! I code as much as I can after work (well into the night, it's unhealthy...) to get updates to you. Thanks for reading so far!