About CharlesGPT

CharlesGPT is a small project to understand what is behind a context-aware chatbot: how a language model works, how it is trained, and how generation is controlled. The goal is to explore some models, interactions, techniques and parametrizations in a controlled setting rather than using a black-box API.

Possible next steps include RAG (retrieval-augmented generation: grounding answers in a specific dataset), BERT-style pretraining or encoders for intent/classification alongside the generative model, and other architectures and/or fine-tuning experiments. For now, the interface runs on a single model: GPT-2 (small).

Details of the tunable parameters:

  • Temperature: rescales the logits before applying softmax — higher values flatten the distribution (more random output), lower values sharpen it (more deterministic), and zero gives greedy decoding (always pick the most likely token).
  • Top-k: only the k tokens with highest probability are considered; the distribution is renormalized over them and one token is sampled from that subset, which cuts off long-tail options.
  • Top-p (nucleus): the smallest set of tokens whose cumulative probability reaches p is selected, then one token is sampled from that set, so the “tail” of unlikely tokens is discarded.
  • Max new tokens: the maximum number of tokens the model is allowed to generate in a single reply, after which generation stops (or stops at end-of-sequence if earlier).
  • Context length: number of tokens from the conversation used to predict the next token. The backend keeps the last N tokens (N = this parameter, 1–1024). If the conversation (all your messages + GPT-2 replies so far) is shorter than N tokens, the whole conversation is used. So short chats use the full history; long chats use only the most recent part. GPT-2’s hard limit is 1024 tokens.
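In code, these controls amount to reshaping the next-token distribution before drawing a sample. A minimal NumPy sketch (illustrative only, not the actual CharlesGPT backend):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Apply temperature, top-k, and top-p to raw logits, then sample one token ID."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0.0:                # greedy decoding: always the most likely token
        return int(np.argmax(logits))

    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                  # softmax over the rescaled logits

    order = np.argsort(probs)[::-1]       # token IDs, most to least probable
    probs_sorted = probs[order]

    keep = len(probs)
    if top_k > 0:
        keep = min(keep, top_k)           # top-k: keep only the k most probable tokens
    if top_p < 1.0:
        cum = np.cumsum(probs_sorted)
        # nucleus: smallest prefix whose cumulative probability reaches top_p
        keep = min(keep, int(np.searchsorted(cum, top_p) + 1))

    subset = probs_sorted[:keep] / probs_sorted[:keep].sum()  # renormalize over the subset
    return int(rng.choice(order[:keep], p=subset))
```

With top_k=1 or a very small top_p, the subset collapses to a single token and sampling becomes deterministic, which matches the "cuts off long-tail options" behaviour described above.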

1. Overview of CharlesGPT (currently only using GPT-2)

  • Model type: Causal (decoder-only) language model, trained for next-token prediction.
  • Parameter count: about 124 M (some sources cite ~117 M depending on what is included in the count).
  • OpenAI gpt2 config: 12 layers, hidden size 768, 12 attention heads, FFN inner dim 3072, vocabulary 50,257 (BPE), maximum sequence length 1024 tokens.
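These config values are enough to re-derive the ~124 M figure. A quick tally (counting the tied token-embedding matrix once; the ~117 M figure corresponds to a different accounting):

```python
n_layer, d, d_ff, vocab, n_pos = 12, 768, 3072, 50_257, 1024

embeddings = vocab * d + n_pos * d                  # token table + position table
attn  = (d * 3 * d + 3 * d) + (d * d + d)           # fused QKV projection + output projection
mlp   = (d * d_ff + d_ff) + (d_ff * d + d)          # two linear layers around the GELU
norms = 2 * 2 * d                                   # two LayerNorms (scale + bias) per block
per_block = attn + mlp + norms

total = embeddings + n_layer * per_block + 2 * d    # + final LayerNorm
print(total)                                        # 124439808, i.e. ~124 M
```

The output head reuses the token embedding matrix (weight tying), which is why no separate vocab × hidden projection appears in the sum.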

2. Architecture

2.1 Global structure

Decoder stack: Transformer [6] decoder blocks only (no encoder). Each block has (i) a masked self-attention layer (causal: each token sees only tokens to its left); (ii) an FFN (Feed-Forward Network, sometimes called MLP — Multilayer Perceptron): two linear layers with a GELU in between; (iii) residual connections and LayerNorm (Layer Normalization) before each sub-layer (pre-norm style).

Why context is capped at 1024: GPT-2 uses learned position embeddings of shape (n_positions, hidden_size) with n_positions = 1024. The model was only trained on sequences up to 1024 tokens, so positions beyond that have no learned embedding and would be out of distribution. Hence the context length slider is limited to 1–1024 tokens.
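The hard cap falls out of the table lookup itself. A toy sketch (zeros stand in for the learned weights): the table has exactly n_positions rows, so position 1024 has no row to look up:

```python
import numpy as np

n_positions, hidden_size = 1024, 768
wpe = np.zeros((n_positions, hidden_size))   # stand-in for the learned position table

def position_embeddings(seq_len):
    """Look up one embedding row per position 0 .. seq_len - 1."""
    if seq_len > n_positions:
        raise ValueError(f"sequence length {seq_len} exceeds n_positions = {n_positions}")
    return wpe[np.arange(seq_len)]

position_embeddings(1024).shape   # (1024, 768): every position has a learned row
```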

Interactive architecture diagram (GPT-2 small) — scroll to zoom, drag to pan. Hover over a box to see formulas, parameter counts, and training details:

2.2 GPT-2 small (gpt2) parameters — model used by CharlesGPT

  Parameter                             Value
  Number of layers                      12
  Hidden size                           768
  Attention heads                       12
  Dimension per head                    64
  FFN size (feed-forward hidden dim)    3072 (4 × 768)
  Vocabulary size                       50,257 (BPE — Byte-Pair Encoding)
  Max context length                    1024 tokens

2.3 Causality

A causal mask in attention ensures that the prediction at position i depends only on positions 1, …, i−1. This allows autoregressive generation (one token at a time).
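A minimal NumPy sketch of the mask (illustrative; real attention also scales scores by √d and runs per head): future positions get −∞ before the softmax, so their weights come out exactly zero.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask a (T, T) score matrix so position i attends only to positions <= i, then softmax."""
    T = scores.shape[-1]
    mask = np.tril(np.ones((T, T), dtype=bool))      # lower-triangular: j <= i allowed
    masked = np.where(mask, scores, -np.inf)         # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)                               # exp(-inf) = 0: no weight on the future
    return w / w.sum(axis=-1, keepdims=True)
```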

2.4 Inputs, outputs, and inference

Input: a sequence of token IDs (BPE). Output: logits of size vocab_size at each position; the logits at the last position are used to predict the next token. Next-token probabilities at inference are a softmax over those last-position logits z:

  P(xt+1 = v | x1, …, xt) = exp(zv) / Σw exp(zw)

Sampling: either greedy decoding (argmax) or sampling from this distribution; temperature and top-k/top-p reshape the distribution before sampling.

2.5 Tokenizer and tokens (GPT-2)

Vocabulary size: 50,257 tokens. They are chosen as follows:

  • 256 byte tokens: base vocabulary (all byte values 0–255). Byte-level BPE ensures any UTF-8 text can be tokenized without a true “unknown” token.
  • 50,000 merge tokens: learned by the BPE algorithm on the model’s training data. The algorithm repeatedly merges the most frequent adjacent token pair and adds the new pair as a single token until 50,000 merges are obtained. Result: common subwords, words, and character combinations (including symbols like ↓, ⇓, punctuation, spaces, newlines) become single tokens or short token sequences.
  • 1 reserved special token: <|endoftext|>, added by the tokenizer for end-of-sequence (and used as BOS/UNK internally). It is the only token with a reserved role; the model is not trained to “spell” it from bytes.
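The merge loop can be illustrated on a toy character-level corpus (real GPT-2 BPE works on bytes and adds details such as a regex pre-tokenizer, so this is only a sketch of the training step):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]            # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                        # count adjacent pairs across the corpus
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)                     # the new pair becomes a single token
        new_corpus = []
        for w in corpus:                         # re-tokenize with the new merge applied
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus
```

On ["hello", "hell", "help"], the first merges learned are "he", then "hel", then "hell": frequent prefixes become single tokens, exactly the effect described above.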

What appears in the Top-k chart: the chart shows the decoded string for each of the top-k predicted token IDs. So you may see:

Token / role and its display in CharlesGPT:

  • <|endoftext|> (EOS): shown as <EOS>; generation stops when it is sampled, and the token is not appended to the message.
  • Newline \n, \r, \r\n: shown as ↵ in the chart. In the message, \r and \r\n are normalized to a single line break; the bubble uses white-space: pre-wrap so the text actually wraps and newlines create visible line breaks.
  • Tab \t
  • Space or empty token
  • Arrows (↓, ⇓, etc.), quotes, other symbols: shown as the decoded token. These are normal BPE tokens, not reserved specials.

There is no separate “start of text” token; the model can run without adding one.

3. Training

3.1 Data

Corpus: WebText (OpenAI internal dataset [1]), about 40 GB of raw text extracted from web pages linked from Reddit posts with at least 3 karma. Preprocessing included deduplication, content filtering, and removal of boilerplate (navigation, footers, etc.).

3.2 Objective and training loss

Next-token prediction (causal language modeling) [1]: given x1, …, xt−1, predict xt. The training objective is the cross-entropy over the vocabulary, averaged over the T token positions of each training sequence:

  L = −(1/T) Σt=1..T log P(xt | x1, …, xt−1)

Minimizing L corresponds to maximizing the likelihood of each next token given its preceding context.
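A toy numeric version of this loss (illustrative, not the actual training code; logits has one row per position, targets holds the observed next-token IDs):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean cross-entropy -(1/T) sum_t log P(x_t | x_<t), from one logit row per position."""
    z = np.array(logits, dtype=np.float64)             # copy so the caller's array is untouched
    z -= z.max(axis=-1, keepdims=True)                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax per position
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform logits over a 4-token vocabulary, the loss is log 4 per position: the model assigns probability 1/4 to every target, the worst "uninformed" baseline.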

3.3 Model behaviour and quirks

Because GPT-2 was trained on WebText (web pages from Reddit-outbound links), it often reproduces patterns that were very frequent in that corpus. Two consequences:

  • Repetition: the model can loop on the same phrase or structure if temperature is low or context is generic.
  • “Random” proper nouns: certain names and places (e.g. Kolkata) appear often in the training data (news, geography, lists). The model learned strong co-occurrences, so in generic or uncertain contexts it may output such tokens with high probability even when they seem unrelated to the prompt. This is a known quirk of GPT-2 and similar models trained on broad web text.

Tuning temperature, top-k, and top-p helps reduce repetition and slightly dampens these effects, but they remain inherent to the pretrained distribution.

4. Metrics & perplexity

The CharlesGPT chat page shows the top‑k probabilities for the next token under your sampling settings (temperature, top‑p, top‑k). The top‑k panel is filled as soon as the page loads, using a single forward pass on the welcome text (the same computation as during generation; the chart is not tied to the sampled token, which is discarded for that probe). Conversation perplexity uses the full transcript, with the same truncation rules as generation: it is first computed on the welcome line at load, then on the whole thread after you chat. While the assistant is generating, perplexity refreshes after each token, so the headline value and sparkline update in real time and usually stabilize as the reply grows (see below).

Perplexity of a generated sequence (full conversation)

Let the chat transcript be tokenized into GPT‑2 subword tokens x1, …, xN (after the same character and length limits as generation: very long text is truncated to the last chunk, then to at most 1024 tokens — the same context window as the slider).

A causal LM defines a conditional probability P(xt | x1,…,xt−1) for each position t = 2,…,N using the raw logits and a single softmax over the vocabulary (no temperature rescaling, no top‑k / top‑p truncation). This is the model’s standard evaluation distribution.

The mean negative log-likelihood over those N − 1 conditional predictions is:

  NLL = −(1/(N−1)) Σt=2..N log P(xt | x1, …, xt−1)

Perplexity is the exponential of that average (the geometric mean of the inverse probabilities):

  PPL = exp(NLL)

Equivalently, PPL is the geometric mean of 1/P(observed token | context) across positions. Lower is better (the model assigned more mass to the tokens that actually appear). It mixes user and assistant text in one string, so it measures how “predictable” the whole transcript is under GPT‑2 — not a human quality score.

Note: During chat generation, sampling uses temperature / top‑k / top‑p; the reported perplexity uses the unmodified softmax, so the number is not tied to the sampled path’s per-step probabilities shown in the top‑k chart.
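The perplexity formula reduces to a few lines. A sketch on toy conditional probabilities (not the real transcript pipeline, which first runs the model to obtain each P(xt | context)):

```python
import math

def perplexity(cond_probs):
    """PPL = exp of the mean negative log P(x_t | context) over positions t = 2..N."""
    nll = -sum(math.log(p) for p in cond_probs) / len(cond_probs)
    return math.exp(nll)
```

If the model assigns probability 1/4 to every observed token, PPL is exactly 4: on average the model was "choosing among 4 equally likely options" at each step, which is the usual intuition for the metric.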

5. References

[1] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. “Language Models are Unsupervised Multitask Learners.” OpenAI, 2019.
[6] Vaswani, A., et al. “Attention Is All You Need.” NeurIPS, 2017.
