About CharlesGPT
CharlesGPT is a small project (available for testing here) to understand what is behind a context-aware chatbot: how a language model works, how it is trained, and how generation is controlled. The goal is to explore some models, interactions, techniques and parametrizations in a controlled setting rather than using a black-box API.
Possible next steps include RAG (retrieval-augmented generation: grounding answers in a specific dataset), BERT-style pretraining or encoders for intent/classification alongside the generative model, and other architectures and/or fine-tuning experiments. For now, the interface runs on a single model: GPT-2 (small).
Tunable generation parameters (temperature, top-k, top-p, max new tokens, context length) control how GPT-2 samples each token. Current UI defaults and allowed ranges are in §3.4; the sidebar sliders apply the same values to generation and to the top-k preview.
1. Overview (currently GPT-2 only)
- Model type: Causal (decoder-only) language model, trained for next-token prediction.
- Parameter count: about 124 M (some sources cite ~117 M depending on what is included in the count).
- OpenAI `gpt2` config: 12 layers, hidden size 768, 12 attention heads, FFN inner dim 3072, vocabulary 50,257 (BPE), maximum sequence length 1024 tokens.
2. Architecture
2.1 Global structure
Decoder stack: Transformer [6] decoder blocks only (no encoder). Each block has (i) a masked self-attention layer (causal: each token attends only to itself and the tokens to its left); (ii) a FFN (Feed-Forward Network, sometimes called MLP, for Multilayer Perceptron): two linear layers with GELU in between; (iii) residual connections and LayerNorm (Layer Normalization) before each sub-layer (pre-norm style).
GPT-2 uses learned position embeddings of shape (n_positions, hidden_size) with n_positions = 1024. The model was only trained on sequences up to 1024 tokens, so there is no learned embedding for positions beyond that. Hence the context length slider is limited to 1–1024 tokens.
2.2 GPT-2 small (gpt2) parameters — model used by CharlesGPT
| Parameter | Value |
|---|---|
| Number of layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Dimension per head | 64 |
| FFN size (Feed-Forward hidden dim) | 3072 (4 × 768) |
| Vocabulary size | 50,257 (BPE — Byte-Pair Encoding) |
| Max context length | 1024 tokens |
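The "about 124 M" figure can be checked by hand from the table. The sketch below assumes the usual GPT-2 layout: biases on every linear layer, two LayerNorms per block plus a final one, and an output projection that reuses the token embedding matrix (so it adds no parameters). The function name and argument names are illustrative only.

```python
def gpt2_param_count(n_layer=12, d_model=768, d_ff=3072, vocab=50257, n_pos=1024):
    """Count trainable parameters for a GPT-2 style decoder with a tied output layer."""
    embeddings = vocab * d_model + n_pos * d_model          # token table + position table
    attn = d_model * 3 * d_model + 3 * d_model              # fused Q/K/V projection (weights + bias)
    attn += d_model * d_model + d_model                     # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers
    layer_norms = 2 * (2 * d_model)                         # two LayerNorms per block (scale + shift)
    per_block = attn + ffn + layer_norms
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm

total = gpt2_param_count()
head_dim = 768 // 12
print(f"{total:,} parameters, {head_dim} dims per head")  # 124,439,808 parameters, 64 dims per head
```

Roughly 39 M of the total sits in the two embedding tables; the twelve blocks account for about 85 M.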
2.3 Causality
A causal mask in attention ensures that the prediction for the token at position i depends only on the tokens at positions 1, …, i−1. This allows autoregressive generation (one token at a time).
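A minimal sketch of what the mask does to attention weights, in plain Python with toy scores. Row i of the result covers the token at index i and everything before it; later positions get zero weight, and the output of that row is what predicts the following token. The function name is illustrative, and real implementations apply the mask to tensors before a batched softmax.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]            # position i and everything to its left
        m = max(visible)                        # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # equal raw scores everywhere
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```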
2.4 Inputs, outputs, and inference
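Input: sequence of token IDs (BPE). Output: logits of size vocab_size at each position; the last position corresponds to predicting the next token. Next-token probabilities at inference, with z_t the logits at the last position t and V the vocabulary:

$$P(x_{t+1} = v \mid x_1, \dots, x_t) = \frac{\exp(z_{t,v})}{\sum_{v' \in V} \exp(z_{t,v'})}$$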
Sampling: either greedy (argmax) or sample from this distribution; temperature and top-k/top-p reshape the distribution before sampling.
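The softmax-and-sample step can be sketched in plain Python on a toy three-token vocabulary. This is an illustration under simple assumptions, not the project's backend code: `next_token_probs`, `pick_token`, and the `rng` argument are hypothetical names, and top-k/top-p filtering is left out here.

```python
import math
import random

def next_token_probs(logits, temperature=1.0):
    """Softmax over last-position logits after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature=1.0, rng=random):
    """Greedy argmax when temperature is 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = next_token_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                  # toy vocabulary of 3 tokens
sharp = next_token_probs(logits, temperature=0.5)
flat = next_token_probs(logits, temperature=2.0)
print([round(p, 3) for p in sharp])       # low temperature concentrates mass on token 0
print([round(p, 3) for p in flat])        # high temperature spreads mass out
print(pick_token(logits, temperature=0))  # 0 (greedy)
```

Low temperatures sharpen the distribution toward the highest logit, which is why temperature 0 is treated as a separate greedy branch rather than a division by zero.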
2.5 Tokenizer and tokens (GPT-2)
What BPE means in practice. BPE (Byte-Pair Encoding) starts from a base vocabulary and repeatedly merges the most frequent adjacent token pair found in training text. After many merges, common chunks become single tokens. Example idea: if "t" + "h" appears very often, a merge creates "th"; then "th" + "e" can become "the". GPT-2 uses a byte-level BPE variant, so any UTF-8 text can always be represented, while still learning compact tokens for frequent words/subwords.
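The merge loop can be illustrated with a toy character-level version. It is a sketch of the idea only: GPT-2's tokenizer works on bytes and pre-splits text with a regular expression, which this example does not reproduce, and the corpus and function names are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word split into single characters.
words = {tuple("the"): 5, tuple("then"): 2, tuple("that"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('t', 'h'), ('th', 'e')]
print(words)   # {('the',): 5, ('the', 'n'): 2, ('th', 'a', 't'): 3}
```

After two merges the frequent word "the" is a single symbol, while the rarer "that" is still three.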
Vocabulary size: 50,257 tokens. They are chosen as follows:
- 256 byte tokens: base vocabulary (all byte values 0–255). Byte-level BPE ensures any UTF-8 text can be tokenized without a true “unknown” token.
- 50,000 merge tokens: learned by the BPE algorithm on the model’s training data. The algorithm repeatedly merges the most frequent adjacent token pair and adds the new pair as a single token until 50,000 merges are obtained. Result: common subwords, words, and character combinations (including symbols like ↓, ⇓, punctuation, spaces, newlines) become single tokens or short token sequences.
- 1 reserved special token: `<|endoftext|>`, added by the tokenizer for end-of-sequence (and used as BOS/UNK internally). It is the only token with a reserved role; the model is not trained to “spell” it from bytes.
What appears in the Top-k chart: the chart shows the decoded string for each of the top-k predicted token IDs. So you may see:
| Token / role | Display in CharlesGPT |
|---|---|
| `<\|endoftext\|>` (EOS) | `<EOS>`. Generation stops when it is sampled; the token is not appended to the message. |
| Newline \n, \r, \r\n | Shown as ↵ in the chart. In the message, \r and \r\n are normalized to a single line break; the bubble uses white-space: pre-wrap so the text actually wraps and newlines create visible line breaks. |
| Tab \t | → |
| Space or empty token | ␣ |
| Arrows (↓, ⇓, etc.), quotes, other symbols | Shown as the decoded token (e.g. ↓, ⇓, "). These are normal BPE tokens, not reserved specials. |
There is no separate “start of text” token; the model can run without adding one.
2.6 CharlesGPT tokenization pipeline
CharlesGPT uses the Hugging Face gpt2 tokenizer (byte-level BPE, same vocabulary as §2.5). The chat stack applies it in three places:
- Chat display (`/api/chat/tokenize`): the full transcript from `CharlesGptPrompt.buildTranscript()` is encoded with `add_special_tokens=False`; each token ID is decoded back for colored spans in the chat area. The client refreshes after each update (debounced ~120 ms), including while the assistant reply is streaming.
- Generation (`/api/chat/next-token` → `generate_one_step`): the prompt string is truncated to the last 6,000 characters (`MAX_PROMPT_CHARS`), tokenized without Hugging Face truncation, then, if longer than the context-length slider, only the last N token IDs are kept (N = `max_context_tokens`, capped at 1024). Each step returns the sampled token, top-k for the chart, `token_prob` (sampling probability of the chosen token), and `context_tokens`.
- Perplexity (`/api/chat/perplexity`): the transcript string gets the same 6,000-character tail cut, then `truncation=True` with `max_length = max_context_tokens` (capped at 1024).
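The two-stage cut used for generation (characters first, then tokens) can be sketched as follows. `MAX_PROMPT_CHARS` and `max_context_tokens` come from the description above; `truncate_prompt`, `GPT2_MAX_TOKENS`, and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in so the example runs without the real tokenizer.

```python
MAX_PROMPT_CHARS = 6000      # character tail cut named in the text
GPT2_MAX_TOKENS = 1024       # GPT-2 hard limit on sequence length

def truncate_prompt(prompt, max_context_tokens, encode):
    """Return the token IDs that would be fed to the model for one step."""
    tail = prompt[-MAX_PROMPT_CHARS:]                       # 1) keep the last 6,000 characters
    ids = encode(tail)                                      # 2) tokenize, no tokenizer-side truncation
    n = max(1, min(max_context_tokens, GPT2_MAX_TOKENS))    # 3) clamp the slider value to 1..1024
    return ids[-n:]                                         # 4) keep only the last N token IDs

def toy_encode(text):
    """Hypothetical stand-in for the GPT-2 tokenizer: one ID per character."""
    return [ord(c) for c in text]

long_prompt = "x" * 10_000
print(len(truncate_prompt(long_prompt, 300, toy_encode)))    # 300
print(len(truncate_prompt(long_prompt, 5000, toy_encode)))   # 1024 (slider value capped)
print(len(truncate_prompt("hello", 300, toy_encode)))        # 5 (short prompts are kept whole)
```

With the Hugging Face tokenizer loaded, `encode` would presumably be the tokenizer's own `encode` method; that wiring is an assumption here, not something the text specifies.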
Important: no <|endoftext|> token is prepended to chat prompts. EOS can still be sampled during generation, which stops the reply.
Context highlighting: tokens outside the last N positions (max_context_tokens) get a lighter background only (text stays at full contrast). Moving the context slider updates highlighting immediately. During generation, context_tokens from /api/chat/next-token keeps the display aligned with the truncated window fed to the model.
2.7 Static embeddings before attention (embeddings panel)
Before any attention layer runs, GPT-2 builds an input vector per token position by adding two separate learned tables:
- Token embedding `W_te[token_id]` (lexical/static information).
- Position embedding `W_pe[position]` (order information up to 1024).
How the W_te table is obtained. There is no standalone preprocessing algorithm that computes final token vectors once and for all. In GPT-2, W_te is a trainable matrix (shape vocab_size × hidden_size, i.e. 50257 × 768). Training starts from random initialization, then updates all rows with gradient descent while optimizing next-token cross-entropy on large text corpora. At each step: (1) tokens are converted to IDs, (2) rows W_te[token_id] are looked up, (3) the model predicts next-token probabilities, (4) the loss gradient is backpropagated, and (5) the optimizer updates W_te (together with all other parameters). Frequent/useful tokens receive many updates and organize in geometry that reflects predictive co-occurrence patterns.
The first hidden state is therefore x0(position p, token t) = W_te[t] + W_pe[p]. The embeddings panel (collapsible band below the chat) projects only W_te (static token space), so lexical structure is easy to inspect without mixing in positional effects. The background is a precomputed UMAP map over 5,000 tokens. Each refresh overlays three point sets: top-k candidates (blue), the prompt trajectory (orange, numbered by token order), and points that belong to both sets (green), that is, top-k candidates that already appear in the prompt. Top-k point opacity reflects each candidate’s sampling probability. This view is pre-attention: it does not show contextualized vectors after self-attention blocks.
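The sum W_te[t] + W_pe[p] can be shown with toy-sized tables. The sizes, the random values, and the function name below are illustrative assumptions; in GPT-2 the tables are 50,257 × 768 and 1,024 × 768 and their values are learned.

```python
import random

HIDDEN_SIZE = 4      # toy size; GPT-2 small uses 768
VOCAB_SIZE = 10      # toy size; GPT-2 uses 50,257
N_POSITIONS = 8      # toy size; GPT-2 uses 1,024

rng = random.Random(0)
# Random stand-ins for the two learned tables.
W_te = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(VOCAB_SIZE)]
W_pe = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(N_POSITIONS)]

def input_vectors(token_ids):
    """x0 for each position p: W_te[token_id] + W_pe[p], element by element."""
    if len(token_ids) > N_POSITIONS:
        raise ValueError("sequence longer than the position table")
    return [
        [te + pe for te, pe in zip(W_te[t], W_pe[p])]
        for p, t in enumerate(token_ids)
    ]

x0 = input_vectors([3, 7, 3])   # token 3 appears at positions 0 and 2
print(x0[0] == x0[2])           # False: same token, different position
```

The same token at two positions shares its W_te row and differs only by the position rows, which is why a projection of W_te alone shows lexical structure without positional effects.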
3. Training
3.1 Data
Corpus: WebText (OpenAI internal dataset [1]): about 40 GB of text extracted from web pages linked from Reddit posts with at least 3 karma. Preprocessing included de-duplication, content filtering, and removal of boilerplate (nav, footer, etc.).
3.2 Objective and training loss
Next-token prediction (causal language modeling) [1]: given x1, …, xt−1, predict xt. The training objective is cross-entropy over the vocabulary, averaged over token positions in each training sequence of length T:

$$L = -\frac{1}{T-1} \sum_{t=2}^{T} \log P(x_t \mid x_1, \dots, x_{t-1})$$
Minimizing L corresponds to maximizing the likelihood of the next token given the previous context.
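As a small worked example of this objective, the sketch below computes the mean cross-entropy for a toy sequence from per-position logits, scoring the logits at each position against the token that follows. Function names and the toy data are illustrative; real training does this over batches of tensors.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def causal_lm_loss(logits_per_position, token_ids):
    """Mean cross-entropy where the logits at index t are scored against token t+1."""
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)

# Toy example: vocabulary of 3 tokens, sequence of 3 tokens.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
loss = causal_lm_loss(uniform_logits, token_ids)
print(round(loss, 4))  # 1.0986, i.e. log(3) for a uniform prediction
```

A model that knows nothing scores log(vocabulary size); training pushes the loss below that by moving probability onto the tokens that actually follow.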
3.3 Model behaviour and quirks
Because GPT-2 was trained on WebText (web pages from Reddit-outbound links), it often reproduces patterns that were very frequent in that corpus. Two consequences:
- Repetition: the model can loop on the same phrase or structure if temperature is low or context is generic.
- “Random” proper nouns: certain names and places (e.g. Kolkata) appear often in the training data (news, geography, lists). The model learned strong co-occurrences, so in generic or uncertain contexts it may output such tokens with high probability even when they seem unrelated to the prompt. This is a known quirk of GPT-2 and similar models trained on broad web text.
Tuning temperature, top-k, and top-p helps reduce repetition and slightly dampens these effects, but they remain inherent to the pretrained distribution.
3.4 Current generation stack (April 2026)
The chat stack is configured to reduce degenerate outputs (symbol bursts, unstable continuations, and repetition loops):
- Role-formatted prompting: structured transcript (`User: …`, `CharlesGPT: …`); trailing cue alternates while waiting for input vs. while the assistant should reply (see §3.5).
- Repetition penalty: decoding applies a light repetition penalty (`1.12`) on tokens already present in the active context.
- Bounded sampling domain: UI and backend clamp sampling values (notably `top-p ≥ 0.05`).
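A repetition penalty of this kind is commonly applied directly to the logits. The sketch below assumes the rule introduced with CTRL and used by Hugging Face's `RepetitionPenaltyLogitsProcessor` (divide positive logits by the penalty, multiply negative ones); whether CharlesGPT uses exactly this rule is an assumption, and the function name is illustrative.

```python
def apply_repetition_penalty(logits, context_ids, penalty=1.12):
    """Lower the logits of tokens that already appear in the context."""
    adjusted = list(logits)
    for token_id in set(context_ids):
        z = adjusted[token_id]
        # With penalty > 1, dividing a positive logit or multiplying a
        # negative one both move it toward "less likely".
        adjusted[token_id] = z / penalty if z > 0 else z * penalty
    return adjusted

logits = [2.24, -1.0, 0.5]
adjusted = apply_repetition_penalty(logits, context_ids=[0, 1, 1])
print([round(z, 2) for z in adjusted])  # [2.0, -1.12, 0.5]
```

Token 2 does not appear in the context, so its logit is left unchanged; a token that appears several times is still penalized only once.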
Parameter semantics: temperature rescales logits before softmax (0 = greedy argmax). Top-k keeps the k highest-probability tokens, renormalizes, then samples. Top-p keeps the smallest nucleus whose cumulative mass reaches p. Max new tokens caps reply length. Context length keeps the last N tokens of the prompt (or the full history if shorter; GPT-2 hard limit 1024).
Default UI values: temperature 0.3, top-p 0.95, top-k 30, max new tokens 60, context length 300, generation speed 0.2 s/token.
Allowed ranges: temperature [0, 2], top-p [0.05, 1], top-k [1, 100], max new tokens [1, 500], context length [1, 1024].
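Top-k and top-p filtering, with the clamping to the allowed ranges, can be sketched on a toy distribution. The order used here (top-k first, then top-p on the renormalized survivors) is an assumption, as are the helper names; the text does not say how the backend combines the two.

```python
def clamp(value, low, high):
    """Restrict a value to the closed interval [low, high]."""
    return max(low, min(high, value))

def filter_top_k_top_p(probs, top_k=30, top_p=0.95):
    """Keep the top-k tokens, then the smallest nucleus reaching top_p, and renormalize."""
    top_k = int(clamp(top_k, 1, 100))        # allowed range from the text
    top_p = clamp(top_p, 0.05, 1.0)          # allowed range from the text
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass      # mass within the renormalized top-k set
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
print({i: round(p, 3) for i, p in filtered.items()})  # {0: 0.625, 1: 0.375}
```

Here top-k keeps tokens 0, 1, and 2; within that set the first two already cover more than 80% of the mass, so token 2 is dropped and the remaining two are renormalized.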
3.5 Prompting system (exact template)
The chat area is a 1:1 mirror of the string sent to GPT-2: every token in the model prompt appears exactly once (role labels above bubbles, bodies inside). Built by CharlesGptPrompt.buildTranscript() in charlesgpt-prompt.js, rendered by charlesgpt-transcript.js, orchestrated from charlesgpt.js with HTTP in charlesgpt-api.js.
- Fixed seed: `User: Introduce yourself` plus the primer assistant reply (`CHAT_PRIMER` in the prompt module).
- Each turn you send is appended as `User: …` or `CharlesGPT: …` plus message text.
- Trailing cue: draft input is included as `User: …` (top-k matches what you see); empty input shows label-only `User:`. While the model should reply, a label-only `CharlesGPT:` cue appears before the first token only; it is not duplicated while text streams.
Primer (prepended on every request):
CharlesGPT is a useful assistant that answers questions of the user by giving information.
CharlesGPT answers with short, clear, natural sentences, stays on topic, and avoids random symbols.
If a question is broad, CharlesGPT gives a concise definition then one concrete example.
User: What is Python?
CharlesGPT: Python is a high-level programming language used for scripting, web development, and data science. For example, you can write a short Python script to read a CSV file and compute statistics.
Example runtime flow (input pre-filled with What is Python? until you send): seed + draft → after Send your message → CharlesGPT: cue → streaming reply on one line → trailing User:. Top-k, embeddings, and perplexity all use this same string.
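The flow above can be approximated in a few lines of Python. This is a hypothetical sketch, not the project's JavaScript: the function name, its arguments, and the newline separator between turns are all assumptions made for illustration.

```python
def build_transcript(primer, turns, draft="", awaiting_reply=False):
    """Assemble the prompt string: primer, past turns, then one trailing cue."""
    lines = [primer]
    for role, text in turns:
        lines.append(f"{role}: {text}")
    # Trailing cue: a label-only assistant cue while the model should reply,
    # otherwise the user's draft (label-only when the input box is empty).
    if awaiting_reply:
        lines.append("CharlesGPT:")
    else:
        lines.append(f"User: {draft}".rstrip())
    return "\n".join(lines)

primer = "<primer text>"   # placeholder for the primer shown above
print(build_transcript(primer, [], draft="What is Python?"))
print("---")
print(build_transcript(primer, [("User", "What is Python?")], awaiting_reply=True))
```

Because the same string is used for display and for the model, anything added here (labels, cues, draft text) is also what top-k and perplexity are computed on.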
4. Metrics & perplexity
Three collapsible panels sit below the chat frame (grey bands). Each metric runs only when its panel is open—no extra forward passes on page load. Opening a panel may extend the page; content scrolls with the page rather than inside the band. During generation, top-k and (if open) perplexity refresh after each token; opening the embeddings panel loads the static map once, then overlays trajectory and top-k on each step.
Top-k shows the distribution for the next token under your sampling settings (chart capped at 10 entries). Bar width is relative share within the displayed top-k; bar opacity reflects absolute probability. When the panel is opened manually, one probe /api/chat/next-token call refreshes the chart without appending a token.
Token coloring in the chat uses the same sampling probabilities for assistant tokens the model actually generated (token_prob per step): lower confidence → stronger background (saturated chip colors when confidence falls below ~68%); higher confidence → lighter pastel. User messages, the primer, and draft input use a fixed low background strength; text color stays at full contrast throughout.
Conversation perplexity uses the same transcript string as generation with the same truncation rules. While perplexity is open during a reply, the headline value and sparkline update after each token and usually stabilize as the reply grows.
Perplexity of a generated sequence (full conversation)
Let the chat transcript be tokenized into GPT‑2 subword tokens x1, …, xN (after the same character and length limits as generation: very long text is truncated to the last chunk, then to at most 1024 tokens — the same context window as the slider).
A causal LM defines a conditional probability P(xt | x1,…,xt−1) for each position t = 2,…,N using the raw logits and a single softmax over the vocabulary (no temperature rescaling, no top‑k / top‑p truncation). This is the model’s standard evaluation distribution.
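The mean negative log-likelihood over those N − 1 conditional predictions is:

$$\mathrm{NLL} = -\frac{1}{N-1} \sum_{t=2}^{N} \log P(x_t \mid x_1, \dots, x_{t-1})$$

Perplexity is the exponential of that average (geometric mean of inverse probabilities):

$$\mathrm{PPL} = \exp(\mathrm{NLL}) = \left( \prod_{t=2}^{N} \frac{1}{P(x_t \mid x_1, \dots, x_{t-1})} \right)^{1/(N-1)}$$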
Equivalently, PPL is the geometric mean of 1/P(observed token | context) across positions. Lower is better (the model assigned more mass to the tokens that actually appear). It mixes user and assistant text in one string, so it measures how “predictable” the whole transcript is under GPT‑2 — not a human quality score.
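A short numeric check of the two equivalent definitions, using made-up per-token probabilities rather than real model output. The function names are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def geometric_mean_inverse(token_probs):
    """Geometric mean of 1/p over the same positions."""
    product = 1.0
    for p in token_probs:
        product *= 1.0 / p
    return product ** (1.0 / len(token_probs))

probs = [0.5, 0.25, 0.125]     # P(x_t | x_1..x_{t-1}) for t = 2, 3, 4
ppl = perplexity(probs)
print(round(ppl, 6))                            # 4.0
print(round(geometric_mean_inverse(probs), 6))  # 4.0
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing uniformly among four tokens at each position.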
Note: During chat generation, sampling uses temperature / top‑k / top‑p; the reported perplexity uses the unmodified softmax, so the number is not tied to the sampled path’s per-step probabilities shown in the top‑k chart.
5. References
- [1] GPT-2 report — Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI. PDF.
- [2] OpenAI blog — Better Language Models and Their Implications (Feb 2019).
- [3] Wikipedia GPT-2 — History, release timeline, and context: GPT-2.
- [4] GPT-1 → GPT-2 and beyond — Timeline of OpenAI releases (GPT-1, GPT-2, GPT-3, etc.) with changelog-style context: Timeline of OpenAI releases (external).
- [5] Hugging Face — Model card and usage: gpt2, Transformers GPT-2 docs.
- [6] Original Transformer — Vaswani et al., Attention Is All You Need (2017): arXiv:1706.03762.