Steinshark's Projects
SLM Project

Welcome to the SLM Project

Part 2 - Build a Bigger Model

Intro

The scale of language models these days goes well beyond a hobbyist's capacity to reproduce (GPT-4 is rumored to sit around 1.7T parameters), requiring thousands of GPUs to train. This model won't come anywhere close to that - it's purely an experiment for fun! To satisfy ambition, I went for 1 billion parameters. From interacting with various GPT-2 model sizes, 1B seemed like a sweet spot: fluent, capable, and still able to run locally without difficulty.

Phase 2 - Design an Architecture

PyTorch is a very familiar framework to me at this point. Countless hours have gone into building everything from chess engines to RL snake-playing agents. The transformer architecture was new to me, though. I distinctly remember one morning on vacation, sitting in the Aqua Aloha Surf hotel in Honolulu, casually reading through Attention Is All You Need - as one does. Thus began my deep dive into transformer-based language models.

Understanding the transformer mechanism, the field's advancements (RoPE, data/compute/training optimizations, etc.), and exactly what I could do with it all took more time. Given my hardware limitations (an RTX 4060 Ti 16GB), I settled on models of ≤1B parameters. Months of toying around and optimizing compute, memory, and data requirements led me to the following model, which is training as we speak:

1B Parameter Model Architecture
Embedding Layer (vocab_size x embed_dim)
Stacked Transformer Decoder (N identical layers)
    Decoder Layer
        LayerNorm
        Multi-Head Attention
        Dropout
        + Residual
        LayerNorm
        Feedforward
        Dropout
        + Residual
LM Head

Here's some example code from the model.py file where I implemented the architecture. I use PyTorch for the model, which makes for super neat code - I take pride in my work!

Multi-Head Attention Code Example
import math

import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import sdpa_kernel, SDPBackend

#apply_rope is a rotary-embedding helper defined outside this snippet (see the sketch below)

#Implementation of multi-head attention
class MultiHeadAttention(torch.nn.Module):

    def __init__(self, embed_dim, num_heads, n_positions,
                 device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
                 dropout=.2):
        super(MultiHeadAttention, self).__init__()

        #Initialize parameters
        self.n_positions    = n_positions
        self.embed_dim      = embed_dim
        self.num_heads      = num_heads
        self.d_k            = embed_dim // num_heads
        self.device         = device

        #Linear layers for transforming inputs
        self.layer_1        = torch.nn.Linear(embed_dim, embed_dim*3, bias=True)
        self.W_o            = torch.nn.Linear(embed_dim, embed_dim, device=device, bias=True)
        self.scale          = 1 / math.sqrt(self.d_k)
        self.dropout        = dropout
        self.is_training    = True

    def forward(self, x:torch.Tensor, attn_mask:torch.Tensor=None):
        B, N, C = x.size()

        #Apply linear transformations and split heads
        Q, K, V = self.layer_1(x).split(self.embed_dim, dim=2)
        Q       = Q.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        K       = K.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        V       = V.view(B, N, self.num_heads, self.d_k).transpose(1, 2)

        #Apply RoPE to queries and keys
        Q = apply_rope(Q, N, self.device)
        K = apply_rope(K, N, self.device)

        with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION, SDPBackend.FLASH_ATTENTION,
                          SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION]):
            if attn_mask is not None:
                #Merge the (B, N) padding mask with a causal mask - SDPA disallows
                #passing an explicit attn_mask together with is_causal=True
                causal    = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
                attn_mask = attn_mask.unsqueeze(1).unsqueeze(2).bool() & causal
            attn_out = scaled_dot_product_attention(Q, K, V,
                                                    dropout_p=self.dropout if self.is_training else 0,
                                                    is_causal=attn_mask is None,
                                                    scale=self.scale,
                                                    attn_mask=attn_mask)

        attn_out = attn_out.transpose(1, 2).contiguous().view(B, N, C)
        return self.W_o(attn_out)
Decoder Block Code Example
#One stack of a decoder-transformer layer
class DecoderLayer(torch.nn.Module):

    def __init__(self, n_embed, n_head, n_positions, n_ff, dropout=.1):
        super(DecoderLayer, self).__init__()

        #Self attention layer
        self.mh_attn        = MultiHeadAttention(n_embed, n_head, n_positions)
        self.mha_dropout    = torch.nn.Dropout(p=dropout)
        self.mha_layer_norm = torch.nn.LayerNorm(n_embed)

        #Feed Forward layer
        self.ff_layers      = torch.nn.Sequential(
                                  torch.nn.Linear(n_embed, n_ff),
                                  torch.nn.GELU(),
                                  torch.nn.Linear(n_ff, n_embed))
        self.ff_dropout     = torch.nn.Dropout(p=dropout)
        self.ff_layer_norm  = torch.nn.LayerNorm(n_embed)

    def forward(self, x:torch.Tensor, attn_mask:torch.Tensor=None) -> torch.Tensor:

        #Apply layer_norm, MHA, and residual connection
        attn_output = self.mh_attn(self.mha_layer_norm(x), attn_mask=attn_mask)
        attn_output = self.mha_dropout(attn_output)
        x           = x + attn_output

        #Apply layer_norm, ff_layer, and residual connection
        ff_output = self.ff_layers(self.ff_layer_norm(x))
        ff_output = self.ff_dropout(ff_output)
        x         = x + ff_output

        return x
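
The attention code above calls apply_rope, which I didn't paste here. For reference, here's a minimal sketch of what a standard interleaved RoPE rotation with that signature can look like - the base frequency and even/odd pairing are my illustrative choices, not necessarily line-for-line what's in the repo:

RoPE Rotation Code Sketch
import torch

def apply_rope(x: torch.Tensor, seq_len: int, device: torch.device, base: float = 10_000.0) -> torch.Tensor:
    #Rotate a (B, num_heads, seq_len, d_k) query or key tensor with rotary position embeddings
    d_k      = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2, device=device).float() / d_k))
    angles   = torch.outer(torch.arange(seq_len, device=device).float(), inv_freq)
    cos      = angles.cos()[None, None, :, :]      #broadcast over batch and heads
    sin      = angles.sin()[None, None, :, :]

    #Rotate each even/odd channel pair by its position-dependent angle
    x1, x2          = x[..., 0::2], x[..., 1::2]
    out             = torch.empty_like(x)
    out[..., 0::2]  = x1 * cos - x2 * sin
    out[..., 1::2]  = x1 * sin + x2 * cos
    return out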

With this model we're off to the races! Weight-sharing between the LM head and the embeddings keeps us just above the 1B-parameter mark. RoPE is nice to have (no learned position embeddings needed), and the head dimension I landed on makes for quick but effective training. Not big, but definitely capable of something!
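
To show how those pieces stack into the full model (including the LM-head weight sharing), here's a rough top-level sketch that reuses the DecoderLayer class above; the class name and argument names are illustrative rather than copied from model.py:

Top-Level Model Code Sketch
import torch

class SLModel(torch.nn.Module):
    #Illustrative top-level module: embeddings -> stacked decoder layers -> LM head

    def __init__(self, vocab_size, n_embed, n_head, n_positions, n_ff, n_layers, dropout=.1):
        super(SLModel, self).__init__()
        self.embeddings = torch.nn.Embedding(vocab_size, n_embed)
        self.layers     = torch.nn.ModuleList([DecoderLayer(n_embed, n_head, n_positions, n_ff, dropout)
                                               for _ in range(n_layers)])
        self.lm_head    = torch.nn.Linear(n_embed, vocab_size, bias=False)

        #Weight sharing: the LM head reuses the embedding matrix
        self.lm_head.weight = self.embeddings.weight

    def forward(self, token_ids: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        #No learned position embeddings - RoPE handles positions inside attention
        x = self.embeddings(token_ids)
        for layer in self.layers:
            x = layer(x, attn_mask=attn_mask)
        #Project back to vocabulary logits
        return self.lm_head(x)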

If you're curious about seeing the code you can check it out on my Github!

Architecture Stats -

Vocab Size:
Number heads:
Embed Dims:
Num Layers:
FF Size:
Context Len:

Training!

Training took by far the most time. At first I attempted training locally - at a throughput of 5k tokens/sec. Quick maths puts that at about 60 days per epoch of training. I was not going to wait on that... I soon gave up on my 4060 Ti and rented online compute from LambdaAI.com (not sponsored, highly recommend). Over 10 days, the model crunched through 50 billion tokens. And that brings us to finetuning - the real hard part.
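
For the curious, here's the back-of-the-envelope math behind those numbers - the per-epoch token count is just back-solved from the 60-day figure, so treat it as approximate:

Training Time Math Sketch
#Back-of-the-envelope training-time math from the numbers above
SECONDS_PER_DAY = 24 * 60 * 60

def days_to_process(num_tokens: float, tokens_per_sec: float) -> float:
    #Days needed to push num_tokens through at a given throughput
    return num_tokens / tokens_per_sec / SECONDS_PER_DAY

local_rate   = 5_000                                #tokens/sec on the 4060 Ti
epoch_tokens = 60 * SECONDS_PER_DAY * local_rate    #60 days/epoch implies roughly 26B tokens/epoch
cloud_rate   = 50e9 / (10 * SECONDS_PER_DAY)        #50B tokens in 10 days -> roughly 58k tokens/sec

print(f"Implied epoch size:       {epoch_tokens/1e9:.0f}B tokens")
print(f"50B tokens locally:       {days_to_process(50e9, local_rate):.0f} days")
print(f"Implied cloud throughput: {cloud_rate/1e3:.0f}k tokens/sec")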

You see, pretraining is easy! Point and go. Finetuning is the real art of the deal - finding what combination of training steps, techniques, and data will improve the model. Just finding quality datasets was hard enough, so I opted for a crowd-sourced approach: iterative refinement of the model based on accepted / rejected prompt pairs. Please take a minute to contribute - and see the model in action while you're at it!
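
To give a concrete picture of what that crowd-sourced data looks like, here's a hypothetical example of a single preference record - the field names and wording are illustrative, not the exact schema behind the site:

Preference Pair Example
#Hypothetical shape of one crowd-sourced preference record (illustrative only)
preference_record = {
    "prompt":   "Explain what a transformer decoder layer does.",
    "accepted": "A decoder layer normalizes the input, applies causal multi-head "
                "attention, then a feedforward network, with residual connections.",
    "rejected": "decoder layer layer layer does the decoding of layers...",
}

#A batch of these records becomes the finetuning signal: reinforce what raters
#accept, penalize (or drop) what they reject.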

Check Back in Tomorrow for More!

Updates daily on the project! I code as much as I can after work (well into the night, it's unhealthy...) to get updates to you. Thanks for reading so far!