Intro
The scale of language models these days goes well beyond a hobbyist's capacity to reproduce (GPT-4 is rumored to be ~1.7T parameters), requiring thousands of GPUs to train. This model won't come anywhere close to that - it's purely an experiment for fun! To satisfy ambition, I went for 1 billion parameters. From interacting with various GPT-2 model sizes, 1B seemed like a sweet spot for fluency and capability while still running locally without difficulty.
Phase 2 - Design an Architecture
PyTorch is a very familiar framework to me at this point. Countless hours have been spent building everything from chess engines to RL snake-playing agents. The transformer architecture was new to me, though. I remember distinctly one morning on vacation, sitting in the Aqua Aloha Surf hotel in Honolulu, casually parsing through Attention Is All You Need - as one does. Thus began my deep dive into transformer-based language models.
Understanding the transformer mechanism, the field's advancements since (RoPE, data/compute/training optimizations, etc.), and exactly what I could do with it took more time. Given my hardware limitations (an RTX 4060 Ti with 16GB of VRAM), I restricted myself to ≤1B parameter models. Months of toying around and optimizing compute, memory, and data requirements led to the model that is training as we speak.
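To give a sense of why ≤1B was the ceiling: even before counting activations, a vanilla AdamW setup for a 1B-parameter model eats most of a 16GB card. Here's the ballpark math (assuming bf16 weights/gradients with fp32 optimizer state - not necessarily the exact recipe I ended up with):

# Ballpark training-memory budget for a 1B-parameter model (assumptions:
# bf16 weights/gradients + fp32 AdamW moments; activations not included)
n_params = 1e9
weights_gb = n_params * 2 / 1e9      # bf16: 2 bytes per parameter
grads_gb   = n_params * 2 / 1e9      # bf16 gradients
adam_gb    = n_params * 4 * 2 / 1e9  # fp32 first + second moments
print(f"{weights_gb + grads_gb + adam_gb:.0f} GB before activations")  # ≈ 12 GB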
Here's some example code from the model.py file where I implemented the architecture. I use PyTorch for the model, which makes for super neat code - I take pride in my work!
# Imports needed for the snippets below
import math
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import sdpa_kernel, SDPBackend

# Implementation of multi-head attention
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, embed_dim, num_heads, n_positions,
                 device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
                 dropout=.2):
        super(MultiHeadAttention, self).__init__()
        # Initialize parameters
        self.n_positions = n_positions
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads
        self.device = device
        # Linear layers for transforming inputs (fused QKV projection + output projection)
        self.layer_1 = torch.nn.Linear(embed_dim, embed_dim*3, bias=True)
        self.W_o = torch.nn.Linear(embed_dim, embed_dim, device=device, bias=True)
        self.scale = 1 / math.sqrt(self.d_k)
        self.dropout = dropout

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None):
        B, N, C = x.size()
        # Apply linear transformations and split heads -> (B, num_heads, N, d_k)
        Q, K, V = self.layer_1(x).split(self.embed_dim, dim=2)
        Q: torch.Tensor = Q.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        K: torch.Tensor = K.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        V: torch.Tensor = V.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        # Apply RoPE to queries and keys
        Q = apply_rope(Q, N, self.device)
        K = apply_rope(K, N, self.device)
        if attn_mask is not None:
            # Broadcast the (B, N) padding mask (True/1 = real token) over heads and
            # query positions, then fold in causality, since SDPA forbids passing an
            # explicit attn_mask together with is_causal=True
            attn_mask = attn_mask.unsqueeze(1).unsqueeze(2).to(torch.bool)
            causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
            attn_mask = attn_mask & causal
        with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION, SDPBackend.FLASH_ATTENTION,
                          SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION]):
            attn_out = scaled_dot_product_attention(
                Q, K, V,
                attn_mask=attn_mask,
                dropout_p=self.dropout if self.training else 0,  # nn.Module tracks train/eval
                is_causal=attn_mask is None,
                scale=self.scale)
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, N, C)
        return self.W_o(attn_out)
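One note on the snippet above: it calls apply_rope, which lives elsewhere in model.py. For completeness, here's a minimal sketch of what a rotary-embedding helper with that signature could look like - the real helper may cache the cos/sin tables instead of recomputing them each call:

# Simplified rotary position embedding helper matching the apply_rope(x, N, device)
# calls above. x has shape (B, num_heads, N, d_k); adjacent channel pairs are rotated
# by position-dependent angles. (Sketch only, not necessarily the exact implementation.)
def apply_rope(x: torch.Tensor, seq_len: int, device: torch.device, base: float = 10000.0) -> torch.Tensor:
    d_k = x.size(-1)
    # Frequencies for each channel pair: theta_i = base^(-2i/d_k)
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2, device=device).float() / d_k))
    positions = torch.arange(seq_len, device=device).float()
    angles = torch.outer(positions, inv_freq)   # (N, d_k/2)
    cos, sin = angles.cos(), angles.sin()       # broadcast over (B, num_heads)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # even / odd channels of each pair
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2).type_as(x)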
# One block of the decoder-only transformer (pre-LayerNorm)
class DecoderLayer(torch.nn.Module):
    def __init__(self, n_embed, n_head, n_positions, n_ff, dropout=.1):
        super(DecoderLayer, self).__init__()
        # Self-attention sub-layer
        self.mh_attn = MultiHeadAttention(n_embed, n_head, n_positions)
        self.mha_dropout = torch.nn.Dropout(p=dropout)
        self.mha_layer_norm = torch.nn.LayerNorm(n_embed)
        # Feed-forward sub-layer
        self.ff_layers = torch.nn.Sequential(
            torch.nn.Linear(n_embed, n_ff),
            torch.nn.GELU(),
            torch.nn.Linear(n_ff, n_embed))
        self.ff_dropout = torch.nn.Dropout(p=dropout)
        self.ff_layer_norm = torch.nn.LayerNorm(n_embed)

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        # Apply layer norm, MHA, and residual connection
        attn_output = self.mh_attn(self.mha_layer_norm(x), attn_mask=attn_mask)
        attn_output = self.mha_dropout(attn_output)
        x = x + attn_output
        # Apply layer norm, feed-forward, and residual connection
        ff_output = self.ff_layers(self.ff_layer_norm(x))
        ff_output = self.ff_dropout(ff_output)
        x = x + ff_output
        return x
With this model we're off to the races! Sharing the LM head weights with the token embeddings keeps the parameter count just above the 1B target. RoPE is nice to have (no learned position embeddings needed), and the chosen head dimension makes for quick but effective training. Not big, but definitely capable of something!
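To show how those pieces stack up, here's a skeleton of the full decoder-only model with the LM head tied to the embedding table. The hyperparameters are just named arguments here - the actual configuration isn't spelled out in this post:

# Skeleton of the full decoder-only model with weight tying (sketch, not the exact file)
class GPTModel(torch.nn.Module):
    def __init__(self, vocab_size, n_embed, n_head, n_layer, n_positions, n_ff):
        super().__init__()
        self.tok_embed = torch.nn.Embedding(vocab_size, n_embed)
        self.layers = torch.nn.ModuleList(
            [DecoderLayer(n_embed, n_head, n_positions, n_ff) for _ in range(n_layer)])
        self.final_norm = torch.nn.LayerNorm(n_embed)
        self.lm_head = torch.nn.Linear(n_embed, vocab_size, bias=False)
        self.lm_head.weight = self.tok_embed.weight  # weight sharing: one matrix for both

    def forward(self, idx: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        x = self.tok_embed(idx)                      # RoPE handles positions inside attention
        for layer in self.layers:
            x = layer(x, attn_mask=attn_mask)
        return self.lm_head(self.final_norm(x))      # (B, N, vocab_size) logits

Tying the head to the embeddings saves roughly vocab_size × n_embed parameters, which is a meaningful chunk at this scale.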
If you're curious about the full code, you can check it out on my GitHub!
Architecture Stats
Training!
Training took by far the most time. At first I attempted training locally, at about 5k tokens/sec throughput. Quick maths puts that at roughly 60 days per epoch of training. I was not going to be waiting on that... I soon gave up on my 4060 Ti and used online compute from LambdaAI.com (not sponsored, highly recommend). Over 10 days, the model crunched through 50 billion tokens. And that brings us to finetuning - the real hard part.
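For the curious, the back-of-envelope math behind those figures (everything here is derived from the numbers above):

# Back-of-envelope math behind the training-time figures (all inputs from the post)
SECONDS_PER_DAY = 86_400

local_tps = 5_000                                   # local RTX 4060 Ti throughput, tokens/sec
epoch_tokens = 60 * SECONDS_PER_DAY * local_tps     # 60 days/epoch implies ~26B tokens/epoch
print(f"implied epoch size: {epoch_tokens:.1e} tokens")

cloud_tokens, cloud_days = 50e9, 10                 # the 10-day cloud run
cloud_tps = cloud_tokens / (cloud_days * SECONDS_PER_DAY)
print(f"cloud throughput: {cloud_tps:,.0f} tokens/sec")  # ≈ 57,870, roughly 11-12x local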
You see, pretraining is easy! Point and go. Finetuning is the real art of the deal - finding which combination of training steps, techniques, and data will actually improve the model. Just finding quality datasets was hard enough, so I opted for a crowd-sourced approach: iterative refinement of the model based on accepted/rejected prompt pairs. Please take a minute to contribute - and see the model in action while you're at it!
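The collection pipeline isn't worth a full code dump, but conceptually every contribution boils down to a record like this - field names are illustrative, not the actual schema:

# Illustrative shape of one crowd-sourced preference record (hypothetical field names)
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str      # what the contributor asked the model
    chosen: str      # the response the contributor accepted
    rejected: str    # the response the contributor rejected

example = PreferencePair(
    prompt="Explain RoPE in one sentence.",
    chosen="RoPE rotates query/key channel pairs by position-dependent angles.",
    rejected="RoPE is a kind of rope.")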
Check Back in Tomorrow for More!
Updates daily on the project! I code as much as I can after work (well into the night, it's unhealthy...) to get updates to you. Thanks for reading so far!