GPT from scratch
Auto-generated notes for the YouTube video on building GPT from scratch
Probabilistic System and Transformer Architecture
Probabilistic System: A system (like ChatGPT) that generates multiple potential answers in response to a single prompt, offering slightly different outputs each time.
Transformer Architecture: A neural network architecture introduced in the 2017 paper "Attention is All You Need." The architecture is the underlying neural network used in ChatGPT and other modern AI applications.
Character-Level Language Model: A language model that works on predicting the sequence of characters in a text instead of words or tokens. It learns patterns in the sequence of characters.
Tiny Shakespeare: A concatenated dataset of all the works of Shakespeare. It's used as a toy dataset for training a character-level language model.
Token-by-Token Level: The generation approach used by ChatGPT, where it generates text based on tokens (sub-word pieces) instead of full words or individual characters.
NanoGPT: A simple GPT implementation consisting of just two files of roughly 300 lines each: one defines the GPT model (the Transformer), and the other trains it on a given text dataset.
Tokenize: The process of converting raw text (a string) into sequences of integers, according to a predefined vocabulary.
Character-Level Tokenizer: A tokenizer that translates individual characters into integers, producing a sequence of integers from the characters in the text.
Encode and Decode: The process of translating text into integers (tokens) and back to text using a character-level tokenizer or other schemes such as sub-word tokenization.
Tokenizer: A tool that breaks down text into smaller components (tokens). Examples include character-level tokenizers, sub-word tokenizers like SentencePiece, and word-level tokenizers.
SentencePiece: A sub-word tokenizer developed by Google that encodes text into integers using a different schema and vocabulary from character-level tokenization.
tiktoken Library: A library from OpenAI that implements the byte pair encoding (BPE) tokenizer used by GPT models to encode text into integers.
Codebook: A mapping between text and integers in a tokenizer.
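A minimal sketch of a character-level codebook with encode/decode functions, assuming the dataset has already been read into a string named `text` (that name is illustrative):

```python
# Build the vocabulary from all unique characters in the text
chars = sorted(set(text))        # e.g. roughly 65 characters for Tiny Shakespeare
vocab_size = len(chars)

# Codebook: mappings between characters and integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]              # string -> list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)   # list of integers -> string

print(encode("hii there"))
print(decode(encode("hii there")))   # round-trips back to "hii there"
```

A sub-word tokenizer such as SentencePiece or tiktoken exposes the same encode/decode interface, just with a much larger vocabulary and shorter integer sequences.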
Training and Validation Split: The process of splitting a dataset into training data (90% in this case) for training the model and validation data (10% in this case) to evaluate its performance and understand the extent of overfitting.
Block Size: The maximum length (or context length) of text (or sequences) when training a Transformer model.
Batch Dimension: Represents the number of independent sequences processed during each forward and backward pass of the Transformer model to ensure efficiency in GPU processing.
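Continuing from the encode sketch above, a sketch of the 90/10 split and of a batching helper; the `block_size` and `batch_size` values are illustrative:

```python
import torch

data = torch.tensor(encode(text), dtype=torch.long)  # the whole text as one sequence of token ids

# Training and validation split: first 90% for training, last 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

block_size = 8    # maximum context length
batch_size = 4    # number of independent sequences per forward/backward pass

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))       # random starting offsets
    x = torch.stack([d[i:i + block_size] for i in ix])           # inputs
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])   # targets, shifted by one
    return x, y

xb, yb = get_batch('train')   # both of shape (batch_size, block_size)
```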
Bigram Language Model: A simple language model that predicts the next token in a sequence based only on the previous token. It is the simplest form of language model and is covered in depth in the earlier "makemore" videos of the Neural Networks: Zero to Hero series.
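A sketch of a bigram language model along these lines, where each token's embedding directly reads off the logits for the next token:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token reads the logits for the next token straight out of a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)     # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :]                         # only the last time step matters
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)           # append the sampled token
        return idx
```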
PyTorch: An open-source machine learning library for Python, primarily used in applications such as natural language processing and computer vision, to build deep learning models.
nn.Module: The base class for all neural network modules in PyTorch.
GPU: A type of processor that enables faster calculations, often used in machine learning and data processing tasks.
CUDA: NVIDIA's parallel computing platform and programming model that allows developers to use NVIDIA GPUs for faster computations.
Device: A reference to the hardware being used for calculations (GPU or CPU).
Self-attention: A mechanism in Transformer architecture that allows tokens to communicate and aggregate data-dependent information based on keys, queries, and values.
Query: A vector representation of what a token is looking for in terms of information from other tokens.
Key: A vector representation of what information a token contains that it can share with others.
Value: A separate vector representation of what a token will communicate to other tokens when they find it interesting.
Head size: A hyperparameter defining the size of each self-attention head.
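A sketch of a single self-attention head built from these pieces; unlike the video, which relies on global hyperparameters, the sizes are passed in explicitly here:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)    # what a token contains
        self.query = nn.Linear(n_embd, head_size, bias=False)  # what a token is looking for
        self.value = nn.Linear(n_embd, head_size, bias=False)  # what a token communicates
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5     # (B, T, T)
        # causal mask: a token may not attend to future positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)
```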
Estimate loss: A helper that produces a less noisy estimate of the loss by averaging it over many batches of data.
Evaluation mode: A mode in which the neural network is set for evaluating its performance on data without updating the weights.
Training mode: A mode in which the neural network is set for updating its weights based on gradient calculations.
torch.no_grad: A context manager (and decorator) in PyTorch that tells the system gradient calculations won't be required within its block, allowing for more efficient memory use.
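A sketch of a loss-estimation helper tying these pieces together; it assumes the `get_batch` function sketched earlier and an illustrative `eval_iters` setting:

```python
import torch

eval_iters = 200   # number of batches to average over (illustrative)

@torch.no_grad()   # no gradients are needed while evaluating
def estimate_loss(model):
    out = {}
    model.eval()                      # evaluation mode
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()                     # back to training mode
    return out
```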
Decoder-only Transformer: A type of Transformer architecture used for text generation tasks, such as language modelling, where there's no need for an encoder.
Pre-training stage: The initial stage of training a Transformer model, where it learns the general language structure by training on a large text dataset.
Fine-tuning stage: The subsequent stage of training a Transformer model, where it gets refined for specific tasks based on task-specific data.
Layer Norm: A normalization technique that normalizes each token's features across the embedding dimension rather than across the batch, useful for improving training stability and performance.
Dropout: A regularization technique that randomly drops a percentage of neurons during each training iteration, preventing overfitting and encouraging generalization.
Causal Self-Attention: The attention mechanism in Transformers that accounts for the autoregressive property of the model, using a triangular mask to prevent predictions from depending on future tokens.
Cross Attention: Attention mechanism that allows Transformer models to condition the generation of decoder output on an encoded input sequence, like in machine translation.
Policy Gradient: A reinforcement learning optimization technique used to fine-tune models according to a reward function.
PPO (Proximal Policy Optimization): A type of policy gradient algorithm that helps stabilize and improve the learning process during fine-tuning.
Multi-Head Attention and Positional Encoding
Multi-Head Attention: An extension of the attention mechanism where multiple attention operations are performed in parallel, followed by a concatenation of their outputs. This improves the communication capabilities of the network.
Positional Encoding: An encoding used in Transformer models to give tokens information about their position in the sequence, since the attention mechanism by itself has no notion of space.
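A sketch of multi-head attention (parallel heads whose outputs are concatenated and projected), reusing the `Head` module sketched earlier; the commented lines show how learned positional embeddings are added to token embeddings before the Transformer blocks:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated and projected back to n_embd."""
    def __init__(self, n_embd, num_heads, head_size, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # concatenate head outputs
        return self.proj(out)

# Positional encoding (inside the language model's forward pass):
# tok_emb = token_embedding_table(idx)                  # (B, T, n_embd)
# pos_emb = position_embedding_table(torch.arange(T))   # (T, n_embd), one learned vector per position
# x = tok_emb + pos_emb                                 # positions broadcast across the batch
```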
Encoder and Decoder Blocks
Encoder Block: A block in Transformer models where all tokens can communicate with each other; there is no masking of future tokens. Used for tasks like sentiment analysis.
Decoder Block: A block in Transformer models where a triangular mask prevents tokens from attending to future positions, so a position never sees the answer it is trying to predict. Used for tasks like language modelling.
Feed Forward Networks and Skip Connections
Feed Forward Network: A network in which information flows in one direction only, from the input layer through hidden layers to the output layer; in each Transformer block, a small per-token feed-forward network (an MLP) is applied after self-attention.
Skip Connections: Also known as residual connections, these connections in a neural network allow the information from earlier layers to bypass some layers and be directly added to the output of later layers. This helps in optimizing deep neural networks.
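Putting feed-forward layers, layer norm, dropout, and skip connections together, a sketch of one Transformer block in the pre-norm arrangement used in the video; `MultiHeadAttention` is the module sketched above and the dropout rate is illustrative:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP: expand, apply a non-linearity, project back, then dropout."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),   # regularization: randomly zero some activations during training
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication (self-attention) followed by computation (MLP)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)   # normalizes over the embedding dimension
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # skip (residual) connections: each sub-layer's output is added back to its input
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```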