About CharlesGPT
CharlesGPT is a small project to understand what is behind a context-aware chatbot: how a language model works, how it is trained, and how generation is controlled. The goal is to explore models, interaction patterns, sampling techniques, and parameter settings in a controlled environment rather than through a black-box API.
Possible next steps include RAG (retrieval-augmented generation: grounding answers in a specific dataset), BERT-style pretraining or encoders for intent/classification alongside the generative model, and other architectures and/or fine-tuning experiments. For now, the interface runs on a single model: GPT-2 (small).
Details of the tunable parameters:
- Temperature: rescales the logits before applying softmax — higher values flatten the distribution (more random output), lower values sharpen it (more deterministic), and zero gives greedy decoding (always pick the most likely token).
- Top-k: only the k tokens with highest probability are considered; the distribution is renormalized over them and one token is sampled from that subset, which cuts off long-tail options.
- Top-p (nucleus): the smallest set of tokens whose cumulative probability reaches p is selected, then one token is sampled from that set, so the “tail” of unlikely tokens is discarded.
- Max new tokens: the maximum number of tokens the model is allowed to generate in a single reply, after which generation stops (or stops at end-of-sequence if earlier).
- Context length: number of tokens from the conversation used to predict the next token. The backend keeps the last N tokens (N = this parameter, 1–1024). If the conversation (all your messages + GPT-2 replies so far) is shorter than N tokens, the whole conversation is used. So short chats use the full history; long chats use only the most recent part. GPT-2’s hard limit is 1024 tokens.
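The sampling parameters above can be sketched in a few lines of plain Python. This is an illustrative implementation, not CharlesGPT's actual backend code; the function name and the dict-based logits are assumptions for the example. It shows the order of operations: temperature rescales the logits, softmax turns them into probabilities, top-k and top-p prune the distribution, and one token is sampled from what survives.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits using temperature, top-k and top-p.

    logits: dict mapping token id -> raw logit (illustrative; a real model
    returns a vocab-sized vector).
    """
    # Temperature 0 -> greedy decoding: always pick the most likely token.
    if temperature == 0:
        return max(logits, key=logits.get)

    # Temperature: rescale logits, then softmax (max-subtraction for stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exp.values())
    probs = {t: e / z for t, e in exp.items()}

    # Top-k: keep only the k most probable tokens (0 = disabled).
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]

    # Top-p (nucleus): smallest prefix whose cumulative probability >= p.
    kept, cum = [], 0.0
    for t, p in items:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens and sample one.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]
```

With `temperature=0`, `top_k=1`, or a very small `top_p`, this collapses to greedy decoding, which is why low values make output more deterministic.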
1. Overview of CharlesGPT (currently using only GPT-2)
- Model type: Causal (decoder-only) language model, trained for next-token prediction.
- Parameter count: about 124 M (some sources cite ~117 M depending on what is included in the count).
- OpenAI gpt2 config: 12 layers, hidden size 768, 12 attention heads, FFN inner dim 3072, vocabulary 50,257 (BPE), maximum sequence length 1024 tokens.
2. Architecture
2.1 Global structure
Decoder stack: Transformer [6] decoder blocks only (no encoder). Each block has (i) a masked self-attention layer (causal: each token sees only tokens to its left); (ii) an FFN (Feed-Forward Network, sometimes called MLP — Multilayer Perceptron): two linear layers with a GELU in between; (iii) residual connections and LayerNorm (Layer Normalization) before each sub-layer (pre-norm style).
Why context is capped at 1024: GPT-2 uses learned position embeddings of shape (n_positions, hidden_size) with n_positions = 1024. The model was only trained on sequences up to 1024 tokens, so positions beyond that have no learned embedding and would be out of distribution. Hence the context length slider is limited to 1–1024 tokens.
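The truncation rule described above can be sketched as follows (an illustrative snippet, not the actual server code; the function name is made up for the example):

```python
def build_context(conversation_token_ids, n_ctx):
    """Keep only the last n_ctx tokens of the running conversation.

    GPT-2 small has no learned position embedding beyond index 1023,
    so n_ctx is clamped to the model's hard limit of 1024.
    """
    n_ctx = max(1, min(n_ctx, 1024))
    return conversation_token_ids[-n_ctx:]

# A 2000-token conversation is cut down to its most recent 1024 tokens;
# a 3-token conversation passes through unchanged.
build_context(list(range(2000)), 1024)
build_context([1, 2, 3], 1024)
```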
Interactive architecture diagram (GPT-2 small) — scroll to zoom, drag to pan. Hover over a box to see formulas, parameter counts, and training details:
2.2 GPT-2 small (gpt2) parameters — model used by CharlesGPT
| Parameter | Value |
|---|---|
| Number of layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Dimension per head | 64 |
| FFN size (Feed-Forward hidden dim) | 3072 (4 × 768) |
| Vocabulary size | 50,257 (BPE — Byte-Pair Encoding) |
| Max context length | 1024 tokens |
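As a sanity check, the ~124 M parameter count can be recomputed from the table above. This is a back-of-the-envelope sketch assuming the standard GPT-2 layout (fused Q/K/V projection, biases everywhere, two LayerNorms per block plus a final one):

```python
# Recompute GPT-2 small's parameter count from the table values.
n_layer, d, d_ffn, vocab, n_pos = 12, 768, 3072, 50257, 1024

embeddings = vocab * d + n_pos * d           # token + learned position embeddings
attn = d * 3 * d + 3 * d                     # fused Q,K,V projection (weights + biases)
attn += d * d + d                            # attention output projection
ffn = d * d_ffn + d_ffn + d_ffn * d + d      # two linear layers with biases
layernorms = 2 * 2 * d                       # two LayerNorms per block (gain + bias)
per_block = attn + ffn + layernorms

total = embeddings + n_layer * per_block + 2 * d  # + final LayerNorm
print(total)  # 124,439,808 ≈ 124 M
```

The lower ~117 M figure cited in some sources comes from a different counting convention.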
2.3 Causality
A causal mask in attention ensures that the prediction at position i depends only on positions 1, …, i−1. This allows autoregressive generation (one token at a time).
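A minimal illustration of such a mask in plain Python (not the actual model code): position i may attend to positions j ≤ i, and forbidden entries get −inf added to the attention scores before the softmax, which zeroes them out.

```python
import math

def causal_mask(n):
    """n x n additive attention mask: 0.0 where attention is allowed
    (j <= i), -inf where position i would see the future (j > i)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

mask = causal_mask(3)
# Row i: token i can attend to itself and everything to its left:
# [[0, -inf, -inf],
#  [0,    0, -inf],
#  [0,    0,    0]]
```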
2.4 Inputs, outputs, and inference
Input: sequence of token IDs (BPE). Output: logits of size vocab_size at each position; the last position corresponds to predicting the next token. Next-token probabilities at inference come from a softmax over the logits at that last position: P(next token = v | context) = exp(zv) / Σw exp(zw), where z is the logit vector.
Sampling: either greedy (argmax) or sample from this distribution; temperature and top-k/top-p reshape the distribution before sampling.
2.5 Tokenizer and tokens (GPT-2)
Vocabulary size: 50,257 tokens. They are chosen as follows:
- 256 byte tokens: base vocabulary (all byte values 0–255). Byte-level BPE ensures any UTF-8 text can be tokenized without a true “unknown” token.
- 50,000 merge tokens: learned by the BPE algorithm on the model’s training data. The algorithm repeatedly merges the most frequent adjacent token pair and adds the new pair as a single token until 50,000 merges are obtained. Result: common subwords, words, and character combinations (including symbols like ↓, ⇓, punctuation, spaces, newlines) become single tokens or short token sequences.
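The merge loop can be shown on a toy character-level example (illustrative only: GPT-2's real tokenizer works on bytes, uses 50,000 learned merges, and pretokenizes text first):

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    corpus: list of words; each word starts as a tuple of characters
    (the real GPT-2 tokenizer starts from bytes, not characters).
    """
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word with the new merged token.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# "lo" is merged first (most frequent pair), then "low" becomes one token.
merges = learn_merges(["low", "lower", "lowest"], 2)
```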
- 1 reserved special token: <\|endoftext\|>, added by the tokenizer for end-of-sequence (and used as BOS/UNK internally). It is the only token with a reserved role; the model is not trained to “spell” it from bytes.
What appears in the Top-k chart: the chart shows the decoded string for each of the top-k predicted token IDs. So you may see:
| Token / role | Display in CharlesGPT |
|---|---|
| <\|endoftext\|> (EOS) | <EOS> — generation stops when sampled; not appended to the message. |
| Newline \n, \r, \r\n | Shown as ↵ in the chart. In the message, \r and \r\n are normalized to a single line break; the bubble uses white-space: pre-wrap so the text actually wraps and newlines create visible line breaks. |
| Tab \t | → |
| Space or empty token | ␣ |
| Arrows (↓, ⇓, etc.), quotes, other symbols | Shown as the decoded token (e.g. ↓, ⇓, "). These are normal BPE tokens, not reserved specials. |
There is no separate “start of text” token; the model can run without adding one.
3. Training
3.1 Data
Corpus: WebText (OpenAI internal dataset [1]). About 40 GB of raw text, extracted from web pages linked from Reddit posts with at least 3 karma. Preprocessing included deduplication, content filtering, and removal of boilerplate (navigation, footers, etc.).
3.2 Objective and training loss
Next-token prediction (causal language modeling) [1]: given x1, …, xt−1, predict xt. The training objective is cross-entropy over the vocabulary, averaged over the T token positions in each training sequence: L = −(1/T) Σt log P(xt | x1, …, xt−1).
Minimizing L corresponds to maximizing the likelihood of the next token given the previous context.
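In code, the loss over one sequence reduces to a mean of negative log-probabilities. A minimal sketch with made-up probabilities standing in for the model's softmax outputs:

```python
import math

def cross_entropy(next_token_probs):
    """Mean negative log-likelihood of the observed next tokens.

    next_token_probs: the model's probability P(xt | x1..xt-1) for the
    token that actually occurred at each position (values here are made up).
    """
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

# Higher probabilities assigned to the observed tokens -> lower loss.
loss = cross_entropy([0.5, 0.25, 0.1])
```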
3.3 Model behaviour and quirks
Because GPT-2 was trained on WebText (web pages from Reddit-outbound links), it often reproduces patterns that were very frequent in that corpus. Two consequences:
- Repetition: the model can loop on the same phrase or structure if temperature is low or context is generic.
- “Random” proper nouns: certain names and places (e.g. Kolkata) appear often in the training data (news, geography, lists). The model learned strong co-occurrences, so in generic or uncertain contexts it may output such tokens with high probability even when they seem unrelated to the prompt. This is a known quirk of GPT-2 and similar models trained on broad web text.
Tuning temperature, top-k, and top-p helps reduce repetition and slightly dampens these effects, but they remain inherent to the pretrained distribution.
4. Metrics & perplexity
The CharlesGPT chat page shows top‑k probabilities for the next token under your sampling settings (temperature, top‑p, top‑k). The top‑k panel is filled as soon as the page loads, from a single forward pass on the welcome text, computed the same way as during generation; the token sampled for that probe is discarded, so the chart is not tied to it. Conversation perplexity uses the full transcript (same truncation rules as generation): it is computed on the welcome line at load, then on the whole thread after you chat. While the assistant is generating, perplexity refreshes after each token, so the headline value and sparkline update in real time and usually stabilize as the reply grows (see below).
Perplexity of a generated sequence (full conversation)
Let the chat transcript be tokenized into GPT‑2 subword tokens x1, …, xN (after the same character and length limits as generation: very long text is truncated to the last chunk, then to at most 1024 tokens — the same context window as the slider).
A causal LM defines a conditional probability P(xt | x1,…,xt−1) for each position t = 2,…,N using the raw logits and a single softmax over the vocabulary (no temperature rescaling, no top‑k / top‑p truncation). This is the model’s standard evaluation distribution.
The mean negative log-likelihood over those N − 1 conditional predictions is: NLL = −(1/(N−1)) Σt=2…N log P(xt | x1, …, xt−1).
Perplexity is the exponential of that average (the geometric mean of the inverse probabilities): PPL = exp(NLL).
Equivalently, PPL is the geometric mean of 1/P(observed token | context) across positions. Lower is better (the model assigned more mass to the tokens that actually appear). It mixes user and assistant text in one string, so it measures how “predictable” the whole transcript is under GPT‑2 — not a human quality score.
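A minimal sketch of the computation, with made-up probabilities standing in for the model's unmodified softmax outputs:

```python
import math

def perplexity(next_token_probs):
    """exp of the mean negative log-likelihood over positions 2..N,
    i.e. the geometric mean of 1/P(observed token | context).
    Probabilities here are illustrative, not real model output."""
    nll = -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token has
# perplexity ≈ 4: it is "as surprised" as a uniform 4-way choice.
perplexity([0.25, 0.25, 0.25])
```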
Note: During chat generation, sampling uses temperature / top‑k / top‑p; the reported perplexity uses the unmodified softmax, so the number is not tied to the sampled path’s per-step probabilities shown in the top‑k chart.
5. References
- [1] GPT-2 report — Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI. PDF.
- [2] OpenAI blog — Better Language Models and Their Implications (Feb 2019).
- [3] Wikipedia GPT-2 — History, release timeline, and context: GPT-2.
- [4] GPT-1 → GPT-2 and beyond — Timeline of OpenAI releases (GPT-1, GPT-2, GPT-3, etc.) with changelog-style context: Timeline of OpenAI releases (external).
- [5] Hugging Face — Model card and usage: gpt2, Transformers GPT-2 docs.
- [6] Original Transformer — Vaswani et al., Attention Is All You Need (2017): arXiv:1706.03762.