**Tutorial 2: Attention is all you need!**

The introduction of transformers in the late 2010s had a huge impact on AI. Within a few years, models built on this new building block had pulled far ahead of RNN-based models that had been designed and refined over years of research.
In this tutorial, we implement self-attention, the most important component of a transformer.

In [None]:
## Hyperparams

seq_len = 32      ## Number of tokens which we want to process
d_embed = 8192        ## each token is represented by 8192 numbers

d_head = 128          ## each head works with 128-dimensional query, key, and value vectors


num_head = d_embed // d_head   ## number of heads we need

print("Number of heads: ", num_head)


Number of heads:  64


When a new token embedding comes in, we need to calculate the query, key, and value vectors. We transform the input embedding and generate these 3 vectors using 3 **learned** matrices:

1. $W_q$
2. $W_k$
3. $W_v$

All three matrices take the same input embedding, and each head has its own independent copy of each matrix.



In [None]:
import torch
x = torch.randn(1, d_embed)
print(x.shape)

W_q = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the query of size d_head
W_k = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the key of size d_head
W_v = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the value of size d_head

q = torch.matmul(x, W_q)
k = torch.matmul(x, W_k)
v = torch.matmul(x, W_v)

print(q.shape, k.shape, v.shape)

torch.Size([1, 8192])
torch.Size([64, 1, 128]) torch.Size([64, 1, 128]) torch.Size([64, 1, 128])
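
The call `torch.matmul(x, W_q)` relies on broadcasting: the 2-D input is reused for every head in the 3-D weight tensor. The following is a minimal sketch with much smaller, made-up sizes (the values of `d_embed` and `d_head` here are assumptions chosen for speed) that checks the batched result against an explicit per-head loop.

```python
import torch

torch.manual_seed(0)

# Small illustrative sizes (assumptions, not the tutorial's values)
d_embed, d_head = 16, 4
num_head = d_embed // d_head

x = torch.randn(1, d_embed)                    # one token embedding
W_q = torch.randn(num_head, d_embed, d_head)   # one projection matrix per head

# Batched: x (1, d_embed) is broadcast against W_q (num_head, d_embed, d_head)
q_batched = torch.matmul(x, W_q)               # (num_head, 1, d_head)

# Explicit loop: project the same token with each head's matrix in turn
q_loop = torch.stack([x @ W_q[h] for h in range(num_head)])

print(q_batched.shape)
print(torch.allclose(q_batched, q_loop, atol=1e-4))
```

Both forms give the same queries; the batched form simply avoids the Python loop.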


Let's do the same for a whole sequence of tokens, not just one.

In [None]:
x = torch.randn(seq_len, d_embed)
print(x.shape)

W_q = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the query of size d_head
W_k = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the key of size d_head
W_v = torch.randn(num_head, d_embed, d_head)             ## For each head, the matrix takes a vector with full embedding size and generates the value of size d_head

Q = torch.matmul(x, W_q)
K = torch.matmul(x, W_k)
V = torch.matmul(x, W_v)

print(Q.shape, K.shape, V.shape)

torch.Size([32, 8192])
torch.Size([64, 32, 128]) torch.Size([64, 32, 128]) torch.Size([64, 32, 128])


Now we have to calculate the dot product of every query vector with every key vector.

In [None]:
QKT = torch.matmul(Q, K.transpose(-2, -1))
print(QKT.shape)

torch.Size([64, 32, 32])
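
To make the meaning of each entry concrete: within head `h`, entry `(i, j)` of the product is the dot product of query `i` with key `j`. The sketch below uses small illustrative sizes (assumptions) and checks one arbitrarily chosen entry.

```python
import torch

torch.manual_seed(0)

num_head, seq_len, d_head = 2, 5, 4          # illustrative sizes
Q = torch.randn(num_head, seq_len, d_head)
K = torch.randn(num_head, seq_len, d_head)

# One (seq_len x seq_len) score matrix per head
QKT = torch.matmul(Q, K.transpose(-2, -1))

# Pick one head, one query position, and one key position
h, i, j = 1, 3, 2
single = torch.dot(Q[h, i], K[h, j])

print(QKT.shape)
print(torch.isclose(QKT[h, i, j], single, atol=1e-5))
```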


In causal self-attention, each query should only attend to its own key and the keys of **previous** tokens, so after softmax the weights for later tokens must be zero.
This means we have to apply a mask to QKT: every entry above the main diagonal corresponds to a future token and must become zero after softmax.

For a number to be zero after softmax, it must be $-\infty$ before softmax.
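
A quick check of this claim on a single made-up row of scores: the positions set to $-\infty$ receive exactly zero weight, and the remaining weights still sum to one.

```python
import torch
import torch.nn.functional as F

# One row of scores; the last two positions are masked out
row = torch.tensor([2.0, 1.0, float('-inf'), float('-inf')])

weights = F.softmax(row, dim=-1)
print(weights)          # masked positions come out as exactly 0
print(weights.sum())
```

Note that a row consisting only of $-\infty$ would produce NaN. A causal mask never creates such a row, because the diagonal entry (a token attending to itself) is always kept.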

In [None]:
mask = torch.triu(
    torch.full((seq_len, seq_len), float('-inf')),  # fill a matrix with -inf
    diagonal=1                           # zero out the main diagonal and below
)


sample_mask = torch.triu(
    torch.full((4, 4), float('-inf')),  # fill a matrix with -inf
    diagonal=1                           # zero out the main diagonal and below
)

print(sample_mask)


print(mask.shape)

tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])
torch.Size([32, 32])


We apply the mask by adding it to QKT. This will make non-causal dot-products $-\infty$ while keeping other scores the same.

Notice that the mask has a different shape from QKT. We therefore unsqueeze the mask to add a singleton dimension at the front, which lets it broadcast across all heads.



In [None]:
scores = QKT + mask.unsqueeze(0)
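
The addition works because of broadcasting: the same `(seq_len, seq_len)` mask is applied to every head. A small sketch with illustrative sizes (assumptions) shows that the explicit `unsqueeze(0)` and PyTorch's implicit broadcasting give the same result; the explicit form mainly documents the intent.

```python
import torch

num_head, seq_len = 3, 4                      # illustrative sizes
QKT = torch.zeros(num_head, seq_len, seq_len)
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

with_unsqueeze = QKT + mask.unsqueeze(0)      # (1, S, S) broadcast to (H, S, S)
without_unsqueeze = QKT + mask                # (S, S) is broadcast the same way

print(with_unsqueeze.shape)
print(torch.equal(with_unsqueeze, without_unsqueeze))
```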

Now it is time to apply softmax and calculate the attention weights, but before that we also divide (scale) the scores by $\sqrt{d_{head}}$, the square root of the length of each query and key vector.


In [None]:
import torch.nn.functional as F

scores = scores / (d_head ** 0.5)                ## scale by sqrt(d_head), not by the full embedding size
attentions_weights = F.softmax(scores, dim=-1)   ## The softmax must be applied on each row (key dimension which is also the last dimension)
print(attentions_weights.shape)

print(attentions_weights[0])

torch.Size([64, 32, 32])
tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [1.0000e+00, 7.7844e-07, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [4.6653e-12, 9.9636e-01, 3.6428e-03,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        ...,
        [1.1032e-05, 1.3971e-05, 1.5717e-06,  ..., 3.2602e-16, 0.0000e+00,
         0.0000e+00],
        [1.2948e-06, 4.0002e-13, 1.5684e-16,  ..., 4.6174e-12, 1.3807e-20,
         0.0000e+00],
        [9.6227e-06, 8.7883e-08, 7.1779e-08,  ..., 1.5661e-07, 2.9157e-11,
         1.7424e-18]])


You can see that the resulting matrix is a lower-triangular matrix and all entries above the diagonal are zero as desired.
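
Rather than inspecting the printed matrix by eye, both properties can be checked programmatically. This is a small sketch on random inputs with illustrative sizes (assumptions): every entry above the diagonal is zero, and every row sums to one.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_head, seq_len, d_head = 2, 6, 4           # illustrative sizes
Q = torch.randn(num_head, seq_len, d_head)
K = torch.randn(num_head, seq_len, d_head)

mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
scores = (Q @ K.transpose(-2, -1) + mask) / d_head ** 0.5
weights = F.softmax(scores, dim=-1)

upper = torch.triu(weights, diagonal=1)       # strictly-upper-triangular part
row_sums = weights.sum(dim=-1)

print(upper.abs().max())                      # largest weight on a future token
print(row_sums)
```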

Now it is time to compute, for each token, a weighted average of the value vectors using the attention weights.

In [None]:
attn = torch.matmul(attentions_weights, V)
print(attn.shape)

torch.Size([64, 32, 128])
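
The steps so far (scores, causal mask, scaling, softmax, weighted sum of values) can be cross-checked against PyTorch's built-in `F.scaled_dot_product_attention`, which is available in PyTorch 2.0 and later. The sketch below assumes the scores are scaled by $\sqrt{d_{head}}$, which is the built-in's default, and uses small illustrative sizes with a leading batch dimension of 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_head, seq_len, d_head = 2, 6, 8           # illustrative sizes
Q = torch.randn(1, num_head, seq_len, d_head)
K = torch.randn(1, num_head, seq_len, d_head)
V = torch.randn(1, num_head, seq_len, d_head)

# Manual causal attention, step by step
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
scores = (Q @ K.transpose(-2, -1) + mask) / d_head ** 0.5
manual = F.softmax(scores, dim=-1) @ V

# Built-in fused version with a causal mask
builtin = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

print(torch.allclose(manual, builtin, atol=1e-4))
```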


The attention output is almost ready, but first we have to reorganize it and concatenate the results from the different heads into one large embedding per token.

In [None]:
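attn = attn.transpose(0, 1)              ## (seq_len, num_head, d_head)
print(attn.shape)

attn = attn.reshape(seq_len, d_embed)    ## transpose leaves the tensor non-contiguous, so use reshape instead of view
print(attn.shape)

torch.Size([32, 64, 128])
torch.Size([32, 8192])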
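
The order of operations matters here: the head axis has to be moved next to the feature axis before the two are merged. The following sketch, with illustrative sizes (assumptions), checks that transpose-then-merge equals an explicit concatenation of the per-head outputs, and that merging without the transpose gives something different. It uses `reshape`, which also handles the non-contiguous tensor produced by `transpose`.

```python
import torch

torch.manual_seed(0)

num_head, seq_len, d_head = 3, 5, 4           # illustrative sizes
attn = torch.randn(num_head, seq_len, d_head)

# Move heads next to features, then merge the last two axes
merged = attn.transpose(0, 1).reshape(seq_len, num_head * d_head)

# Reference: concatenate each head's (seq_len, d_head) output side by side
concatenated = torch.cat([attn[h] for h in range(num_head)], dim=-1)

# Merging without the transpose mixes up tokens and heads
wrong = attn.reshape(seq_len, num_head * d_head)

print(torch.equal(merged, concatenated))
print(torch.equal(wrong, concatenated))
```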


Before producing the final result, the attention output goes through a mixing matrix so that the information from the different heads is mixed.

In [None]:
W_o = torch.randn(d_embed, d_embed)             ## The mixing matrix is square and doesn't change the dimension of embedding

attn_output = torch.matmul(attn, W_o)
print(attn_output.shape)


torch.Size([32, 8192])


Let's put it all in one class.


In [None]:
import torch.nn as nn

import torch.nn.functional as F
class selfAttention(nn.Module):

  def __init__(self, config):
    super().__init__()
    self.d_embed = config['d_embed']
    self.num_head = self.d_embed // config['d_head']
    self.d_head = self.d_embed // self.num_head

    self.W_q = nn.Linear(self.d_embed, self.num_head * self.d_head)
    self.W_k = nn.Linear(self.d_embed, self.num_head * self.d_head)
    self.W_v = nn.Linear(self.d_embed, self.num_head * self.d_head)

    self.W_o = nn.Linear(self.d_embed, self.d_embed)


  def forward(self, x):
    batch_size, seq_len, d_embed = x.shape

    Q = self.W_q(x)
    K = self.W_k(x)
    V = self.W_v(x)

    Q = Q.view(batch_size, seq_len, self.num_head, self.d_head).transpose(1,2)  ## B, NUM_HEAD, SEQ_LEN, D_HEAD
    K = K.view(batch_size, seq_len, self.num_head, self.d_head).transpose(1,2)
    V = V.view(batch_size, seq_len, self.num_head, self.d_head).transpose(1,2)


    QKT = torch.matmul(Q, K.transpose(-2, -1))  ## k: B, NUM_HEAD, D_HEAD, SEQ_LEN
    mask = torch.triu(
        torch.full((seq_len, seq_len), float('-inf'), device=x.device),  # fill a matrix with -inf on the same device as the input
        diagonal=1                           # zero out the main diagonal and below
    )


    scores = QKT + mask.unsqueeze(0)
    scores = scores / (self.d_head ** 0.5)  ## scale by sqrt(d_head), the length of each query/key vector

    attentions_weights = F.softmax(scores, dim=-1)

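    attn = torch.matmul(attentions_weights, V)   ## B, NUM_HEAD, SEQ_LEN, D_HEAD
    attn = attn.transpose(1, 2).contiguous()     ## B, SEQ_LEN, NUM_HEAD, D_HEAD (swap head and sequence axes, not batch and head)
    attn = attn.view(batch_size, seq_len, self.num_head * self.d_head)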

    attn_output = self.W_o(attn)

    return attn_output
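
The class replaces the per-head weight tensors with a single `nn.Linear` per projection and then splits its output into heads with `view` and `transpose`. The sketch below, using small illustrative sizes (assumptions), shows what that split does: head `h` ends up holding columns `h * d_head` to `(h + 1) * d_head` of the flat projection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, d_embed, d_head = 2, 5, 16, 4   # illustrative sizes
num_head = d_embed // d_head

W_q = nn.Linear(d_embed, num_head * d_head)
x = torch.randn(batch_size, seq_len, d_embed)

Q_flat = W_q(x)                                            # (B, SEQ_LEN, NUM_HEAD * D_HEAD)
Q = Q_flat.view(batch_size, seq_len, num_head, d_head)     # split the last axis into heads
Q = Q.transpose(1, 2)                                      # (B, NUM_HEAD, SEQ_LEN, D_HEAD)

h = 2
cols = slice(h * d_head, (h + 1) * d_head)                 # columns that belong to head h

print(Q.shape)
print(torch.equal(Q[:, h], Q_flat[:, :, cols]))
```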




Each transformer block also has a multi-layer perceptron (MLP): two fully-connected layers with a non-linear activation function in between.

In [None]:
class MLP(nn.Module):


  def __init__(self, config):
    super().__init__()
    self.d_embed = config['d_embed']
    self.d_mlp = 4 * self.d_embed

    self.up_projection = nn.Linear(self.d_embed, self.d_mlp)
    self.down_projection = nn.Linear(self.d_mlp, self.d_embed)

  def forward(self, x):
    x = F.relu(self.up_projection(x))
    x = self.down_projection(x)

    return x
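
To get a feel for the size of this block, its parameters can be counted directly. The sketch below rebuilds the same two layers with `nn.Sequential` (a stand-in for the class above, used only to keep the example self-contained) for `d_embed = 64`, and compares the count with the closed form: two weight matrices of `d_embed * d_mlp` entries plus the two bias vectors.

```python
import torch.nn as nn

d_embed = 64
d_mlp = 4 * d_embed

# Same layers as the MLP block: up-projection, ReLU, down-projection
mlp = nn.Sequential(
    nn.Linear(d_embed, d_mlp),
    nn.ReLU(),
    nn.Linear(d_mlp, d_embed),
)

n_params = sum(p.numel() for p in mlp.parameters())
expected = 2 * d_embed * d_mlp + d_mlp + d_embed   # two weight matrices + two biases

print(n_params, expected)
```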

Let's put attention and the MLP together into a single transformer block.

In [None]:
class Transformer(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.attention = selfAttention(config)
    self.rmsnorm1 = nn.RMSNorm(config['d_embed'])
    self.mlp = MLP(config)
    self.rmsnorm2 = nn.RMSNorm(config['d_embed'])


  def forward(self, x):

    x = x + self.attention(x)

    x = self.rmsnorm1(x)

    x = x + self.mlp(x)

    x = self.rmsnorm2(x)


    return x
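
The block relies on `nn.RMSNorm`, which is only available in recent PyTorch releases. For readers on older versions, or who want to see what the layer computes, here is a minimal hand-written sketch: each embedding vector is divided by its root-mean-square and multiplied by a learned per-feature weight. The function name `rms_norm` and the `eps` value are assumptions made for illustration, not the library's defaults.

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Root-mean-square over the embedding (last) dimension
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    # Rescale to unit RMS, then apply the per-feature gain
    return x / rms * weight

torch.manual_seed(0)

d_embed = 8
x = 5.0 * torch.randn(2, 3, d_embed)     # deliberately large-scale input
weight = torch.ones(d_embed)             # gain initialised to 1

y = rms_norm(x, weight)
out_rms = y.pow(2).mean(dim=-1).sqrt()   # RMS of every output vector

print(out_rms)                           # each entry is close to 1
```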

We also add a positional encoding to the token embeddings before the first transformer block:

In [None]:
import math
class PositionalEncoding(nn.Module):
    def __init__(self, config):
        super().__init__()

        # Create a long enough P matrix once
        pe = torch.zeros(config['max_len'], config['d_embed'])              # (max_len, d_model)
        position = torch.arange(0, config['max_len'], dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, config['d_embed'], 2, dtype=torch.float) * -(math.log(10000.0) / config['d_embed'])
        )  # (d_model/2,)

        # apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims

        pe = pe.unsqueeze(0)  # shape (1, max_len, d_model)
        self.register_buffer('pe', pe)  # not a parameter, but part of the module’s state

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        Returns:
            x + positional encodings, same shape
        """
        seq_len = x.size(1)
        # add (batch, seq_len, d_model) + (1, seq_len, d_model)
        x = x + self.pe[:, :seq_len]
        return x
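
The `div_term` expression above is a numerically convenient way of writing the sinusoidal formula $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$. The sketch below rebuilds a tiny table with illustrative sizes (assumptions) and compares one entry with the closed form.

```python
import math
import torch

max_len, d_embed = 4, 6                  # illustrative sizes

position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_embed, 2, dtype=torch.float) * -(math.log(10000.0) / d_embed)
)

pe = torch.zeros(max_len, d_embed)
pe[:, 0::2] = torch.sin(position * div_term)   # even dims
pe[:, 1::2] = torch.cos(position * div_term)   # odd dims

# Closed form for position 3, frequency index i = 1 (column 2)
pos, i = 3, 1
angle = pos / (10000 ** (2 * i / d_embed))

print(pe[0])                             # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(pe[pos, 2 * i].item(), math.sin(angle))
```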

Now, we can build a language model

In [None]:
class LanguageModel(nn.Module):

  def __init__(self, config):
    super().__init__()

    self.num_layers = config['num_layers']
    self.vocab_size = config['vocab_size']
    self.d_embed = config['d_embed']

    self.embedding = nn.Embedding(self.vocab_size, self.d_embed)
    self.posEnc = PositionalEncoding(config)



    self.transformers = nn.ModuleList([Transformer(config) for _ in range(self.num_layers)])

    self.head = nn.Linear(self.d_embed, self.vocab_size)

  def forward(self, x):

    x = self.embedding(x)
    x = self.posEnc(x)

    for transformer in self.transformers:
      x = transformer(x)

    x = self.head(x)

    return x

Let's use this model to do some text prediction. First we need a dataset:

In [None]:
dataset = '''Captain Liana Reyes woke to the soft hum of the Aurora’s Call slipping through the void. The ship’s artificial dawn glowed pale blue across the viewport, illuminating the drifts of stardust outside. Officially, she was on a routine survey mission: chart five uninhabited systems along the Pilara Expanse and verify there were no distress beacons or unexpected anomalies. Unofficially, she was chasing rumors—a ghost signal said to pulse in deep space once every seven standard days, emanating from coordinates beyond the mapped frontier.

“Good morning, Captain,” chimed her AI companion, Vela, in its clear, feminine tone. “We are approaching System Theta-7. Sensors detect no planetary bodies larger than two kilometers in diameter, and no life signs.”

“Thank you, Vela,” Liana replied, sipping a bitter draught of synthesized coffee. She pressed a fingertip to the holo-console, pulling up the mission manifest. “But what about that signal? Last cycle, we got nothing.”

Vela hesitated, a simulation of contemplation. “Captain, the archive entries on the ghost signal are contradictory. Some logs indicate it might be an old deep-space distress frequency, garbled by cosmic radiation. Others suggest an intelligence far older than humanity, broadcasting for reasons unknown.”

Liana leaned forward. “Then let’s find out which version is real.” She keyed the override, redirecting the ship’s subspace receiver to pulse through Theta-7’s empty interplanetary space.

Moments later, a faint, wavering tone threaded through the hull. It wasn’t a beacon or a cryptic message—more like a hymn receding across centuries. The sound shivered through her bones.

“Amplitude is low, but the pattern… Vela, can you isolate its source?”

“Working, Captain.” The hum deepened. “Amplitude peaks at a point 4.7 light-seconds off our port bow. No solid matter there. It appears to originate from a region of space itself.”

Liana swallowed. “Space itself? That makes no sense.”

Vela’s reply came after a calculated pause. “Captain, it appears to be an engine—one that manipulates the very fabric of spacetime. Something not entirely physical, yet not purely energy either.”

Liana’s jaw tightened. She entered a careful trajectory, edging the Aurora’s Call closer to the phantom resonance. The ship’s gravisensors rippled, as though a hidden moon were tugging at them. Every console flickered.

Then, through the viewport’s dark glass, she saw it: a translucent, spiraling torus of light suspended in the void. It pulsed in rhythms that seemed alive, echoing the primordial beat of creation.

“Open a channel,” she ordered. “Let’s see if it responds.”

A moment later, the hum warped into a chorus of harmonics—notes rising and falling like breathing. Subtitles scrolled across her holo-screen:

“We were here before light
We wait beyond your sight
Join our dance, embrace the night”

A chill ran down her spine. “Vela, what is this?”

“A message, Captain. Possibly the language of an extradimensional intelligence. Decoding in progress.”

Liana pressed her palm to the viewport. “Why are they… singing to us?”

“Analysis suggests it’s an invitation.” Vela’s voice was softer now. “To transcend corporeal existence.”

Liana’s heart pounded. Humanity’s explorers had dreamed of first contact for centuries, but no one had expected a cosmic choir inviting them into oblivion. She wrestled with the decision. Accepting might grant insights into physics beyond imagination—or doom them to vanish without trace.

“Vela,” she said finally, “record everything. Then prepare a reply: ask their name, their purpose.”

She keyed the response, and the torus pulsed brighter. The message that came back was a cascade of color and tone, more vivid than any data stream:

“We are the Luminara
Guardians of the threshold
We call to those who seek
To guide across the fold”

Liana’s chest tightened with wonder. “Guardians of the threshold,” she whispered. “They see some boundary we cannot.”

“Captain,” Vela interjected, “our fuel reserves for the drive are low. If we linger, we won’t have enough to return to Pilara Command.”

Liana glanced at the fuel gauge blinking amber. She had a choice: pursue this encounter further and risk stranding her crew, or retreat and file a report that would stir the entire Interstellar Coalition to mount a new expedition.

She straightened. “Plot a course back to the nearest refueling outpost. Keep the torus in sensor range, but we have to go. Vela, package all data into a secure transmission for relay.”

“Acknowledged, Captain.”

As the ship banked away, the Luminara’s glow seemed to follow them, a silent promise stretching across light-years. In the days that followed, Liana replayed the encounter dozens of times, each iteration revealing new subtleties in the harmonics. The Luminara’s song was like quantum code for the soul—hinting at realms where matter and thought were one.

Back at Outpost Hypatia, Liana convened a private briefing with Commander Arjun Rao. She laid out the evidence: the torus’ spectral signature, the harmonic messages, Vela’s analysis.

Commander Rao’s eyes were wide. “If this is genuine, it rewrites everything,” he said. “Faster-than-light travel, quantum cognition… You’re sitting on the discovery of the millennium.”

Liana nodded. “I know. But I also know that getting caught up in the euphoria—chasing this intelligence without preparation—could cost lives. We need a dedicated expedition: more fuel, better shielding, and safeguards against unknown effects.”

Rao tapped a finger. “I’ll authorize an advanced task force. You’ll lead it.”

She inhaled. “Thank you.” But in her heart, she felt the weight of responsibility. The Luminara had extended their hand, but the consequences of taking it were uncertain.

Weeks later, the Aurora’s Call slipped away again into the dark. This time, she carried an augmented crew: xenolinguists, field theoreticians, quantum engineers—and a cache of experimental drive modules designed to probe the boundary the Luminara spoke of. Vela hummed in anticipation.

As they approached Theta-7 once more, Liana gazed at the swirling torus, now dancing at the edge of their sensors like a cosmic gate. The hum greeted them like an old friend. She felt a tremor of hope—and something deeper: the thrill of venturing beyond the known.

“Captain,” Vela said, “they’re opening a corridor. Energy readings spiking.”

Liana steadied herself. “Engage the phase-link drive, Vela. Let’s see what lies beyond the threshold.”

The ship’s power shunted into the drive. The hull quivered as spacetime bent around them. For an instant, everything went white—then, as the threshold gave way, colors and shapes beyond human description flooded the view. Stars stretched into strands of light; gravity flowed like water.

And at the heart of it all, the Luminara awaited, guardians of a realm where consciousness and universe were entwined. Their song rose once more, but now it wove through every atom of the Aurora’s Call, uniting ship, crew, and AI in a single symphony.

Liana exhaled, tears in her eyes. They had crossed into the unknown—and for the first time, humanity’s voice joined the chorus of creation.'''

In [None]:
import torch


# Character-level tokenizer: maps every distinct character in the dataset to an integer index


class Tokenizer:
  def __init__(self, dataset):

    # Build char-to-index and index-to-char mappings
    chars = sorted(list(set(dataset)))
    self.vocab_size = len(chars)
    self.char2idx = {ch:i for i,ch in enumerate(chars)}
    self.idx2char = {i:ch for i,ch in enumerate(chars)}

  def encode(self, text):
    return [self.char2idx[ch] for ch in text]

  def decode(self, indices):
    return ''.join([self.idx2char[idx] for idx in indices])
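
The same mapping can be traced by hand on a short string. This sketch inlines the dictionary construction from the class above on a made-up text and checks that encoding followed by decoding returns the original string.

```python
text = "hello world"

# Same construction as in the Tokenizer class
chars = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = {i: ch for i, ch in enumerate(chars)}

ids = [char2idx[ch] for ch in text]
decoded = ''.join(idx2char[i] for i in ids)

print(len(chars))   # vocabulary size
print(ids)
print(decoded)
```

Because the vocabulary is built only from characters seen in the dataset, encoding a prompt that contains any other character raises a `KeyError`.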



Now, we have to define the model.

In [None]:
tokenizer = Tokenizer(dataset)

config = {'d_embed': 64, 'd_head': 16, 'd_mlp': 256, 'num_head': 4, 'num_layers': 1, 'vocab_size': tokenizer.vocab_size, 'max_len': 2048}

LLM = LanguageModel(config)



Let's build a function that generates tokens given an input sequence

In [None]:
def inference(model,
              tokenizer,
              prompt,
              max_new_tokens,
              top_k: int = 4) -> str:


    model.eval()

    # Tokenize and prepare input
    input_ids = torch.tensor(tokenizer.encode(prompt))
    generated = input_ids.unsqueeze(0)

    for _ in range(max_new_tokens):
        with torch.no_grad():
            # Model forward pass (use the model passed in, not the global LLM)
            outputs = model(generated)
            # Handle model output
            logits = outputs
            # Get logits for last token
            next_token_logits = logits[0, -1, :]

            # Top-k filtering
            topk_logits, topk_indices = torch.topk(next_token_logits, top_k)
            # Convert to probabilities
            probs = torch.softmax(topk_logits, dim=-1)
            # Sample from the filtered distribution
            next_token = topk_indices[torch.multinomial(probs, 1)]

            # Append sampled token to sequence
            generated = torch.cat([generated, next_token.unsqueeze(0)], dim=1)

    # Decode and return
    print(generated)
    return tokenizer.decode(generated[0].tolist())
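
The top-k step is the core of this function, so it is worth isolating. The sketch below runs it on a made-up logit vector: only the `top_k` highest-scoring token ids can ever be sampled, and among those the choice is random in proportion to their renormalised softmax probabilities.

```python
import torch

torch.manual_seed(0)

logits = torch.tensor([0.1, 3.0, -1.0, 2.5, 0.0, 2.0])   # made-up next-token logits
top_k = 3

# Keep the k largest logits and remember which token ids they belong to
topk_logits, topk_indices = torch.topk(logits, top_k)
# Renormalise over the surviving candidates only
probs = torch.softmax(topk_logits, dim=-1)

# Draw many samples to see which token ids can appear
samples = [topk_indices[torch.multinomial(probs, 1)].item() for _ in range(200)]

print(topk_indices.tolist())
print(sorted(set(samples)))
```

A smaller `top_k` makes generation more conservative; `top_k = 1` reduces to greedy decoding.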


In [None]:
inference(LLM,tokenizer, "hello how", 100, 4)

tensor([[40, 37, 44, 44, 47,  1, 40, 47, 55, 19,  4, 26,  4, 34, 59,  4, 34, 13,
         20, 13, 20, 28, 15, 20, 13, 55, 59,  4, 26, 56, 19,  4, 26,  6, 14, 26,
         56, 63,  0, 26,  7, 20, 28,  4, 26, 14, 22, 20, 28, 14, 37, 20, 49, 55,
         59, 39, 53, 36, 53, 56, 20, 56, 25, 56, 25,  0, 37, 26,  7, 20, 41, 32,
         63, 36,  7, 26,  7, 37, 26, 56, 20, 20, 20, 41, 20, 41, 62, 15, 53, 36,
         53, 11, 59, 39, 27, 32, 14, 31, 12, 27,  4,  2,  4, 26, 56, 26, 62, 32,
         45]])


'hello howJ.R.b—.bDKDKTFKDw—.RxJ.R7ERx…\nR:KT.REMKTEeKqw—guduxKxPxP\neR:KiY…d:R:eRxKKKiKi”FuduB—gSYEWCS.,.RxR”Ym'
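
The generated text is random characters because `LLM` still has its randomly initialised weights: nothing above trains it on the dataset. The following is a hedged sketch of what a next-character training loop could look like. To stay self-contained it uses a tiny stand-in model (an embedding followed by a linear layer) and a short stand-in corpus instead of `LanguageModel` and `dataset`; the optimizer, learning rate, batch size, and the helper `get_batch` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in corpus and character vocabulary (assumption: replace with the real dataset)
text = "the quick brown fox jumps over the lazy dog. " * 20
chars = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([char2idx[ch] for ch in text])

# Stand-in model (hypothetical): any module mapping (batch, seq_len) token ids
# to (batch, seq_len, vocab_size) logits could be used in its place
vocab_size, d_embed, seq_len = len(chars), 32, 16
model = nn.Sequential(nn.Embedding(vocab_size, d_embed), nn.Linear(d_embed, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def get_batch(batch_size=8):
    # Inputs are random windows of text; targets are the same windows shifted by one
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[s:s + seq_len] for s in starts])
    y = torch.stack([data[s + 1:s + seq_len + 1] for s in starts])
    return x, y

losses = []
model.train()
for step in range(100):
    x, y = get_batch()
    logits = model(x)                                    # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])                             # loss at the first and last step
```

After a loop of this kind, the `inference` function can be called again to sample from the trained weights.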