It's a Reflection

LLMs cannot be trained on data that cannot be accessed.

What is NOT in the training set (not exhaustive):

  • Internal, authoritative documents
  • Non-public / closed source, high-impact software source code

(And many other things, like physical books...)


Long tail of rubbish


Good Habits

illustrative

  1. Pre-training: large corpora ("pretraining corpora") such as Common Crawl to "wire up" the LLM to "understand"
  2. Post-training: get the LLM to respond well, follow instructions, etc.

→ General availability

  1. Fine-tuning: refine the LLM with non-public data (documents, code, etc.)

→ Available for the organisation


LLMs Change — For Better or Worse!

example

2025

  • Aug, GPT-5: huge jump from 4.x to a very effective platform
  • Nov, GPT-5.1: "personalities" (?!) and coding quality drops → need to force 5.0
  • Dec, GPT-5.2: back to high coding quality

Not picking on OpenAI; it's just how I saw it.


System Prompt

illustrative

[SYSTEM PROMPT]
[SYSTEM PROMPT OF SUB-PROVIDER] → optional! Could be inserted by your IDE or your own system
[User said:  ...]
[AI said:    ...]
[User said:  ...]
[Attachment]
[AI thought: ...]
[AI said:    ...]
... and so on
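The transcript above is, mechanically, just a flat list of role-tagged messages that gets rebuilt and resent on every turn. A minimal sketch, assuming an illustrative message shape (the roles and field names here are not any specific provider's schema):

```python
# Sketch: the conversation an LLM actually receives is one flat list of
# role-tagged messages. Field names are illustrative, not a real API schema.

def build_context(system_prompt, sub_system_prompt, turns):
    """Assemble the full message list sent to the model on each request."""
    messages = [{"role": "system", "content": system_prompt}]
    if sub_system_prompt:  # optional: inserted by an IDE or wrapper system
        messages.append({"role": "system", "content": sub_system_prompt})
    messages.extend(turns)
    return messages

turns = [
    {"role": "user", "content": "User said: ..."},
    {"role": "assistant", "content": "AI said: ..."},
    {"role": "user", "content": "User said: ...", "attachment": "..."},
    {"role": "assistant", "content": "AI said: ..."},
]

context = build_context("You are a helpful assistant.",
                        "IDE: prefer concise answers.", turns)
```

The point: the "conversation" has no hidden state on the model side; whatever is in this list is all the model sees.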

Play to the Strengths

Claude or Codex better at coding?
→ Use them for coding

Gemini great at code review?
→ Use it for code review

Simple? Yes!
And still, one can easily get tangled up trying to get one model to do well where another is better.


Predictable?!?!

Computers used to be predictable.
That was their strength: same request → same result, every time

AI is not predictable.
We now have to handle different results for the same request.
And the variance itself varies!
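Why the same request gives different results: the model samples each next token from a probability distribution. A toy sketch (the vocabulary and probabilities are made up for illustration; real sampling is over subword tokens and logits):

```python
import random

# Toy illustration of LLM sampling: temperature 0 is greedy and stable,
# temperature > 0 draws from the distribution and varies run to run.
# Vocabulary and weights are invented for illustration.

def next_token(weights, temperature, rng):
    """Pick one token; temperature 0 means greedy (always the top token)."""
    if temperature == 0:
        return max(weights, key=weights.get)
    scaled = {tok: w ** (1.0 / temperature) for tok, w in weights.items()}
    r = rng.random() * sum(scaled.values())
    for tok, w in scaled.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # numerical fallback

weights = {"yes": 0.5, "maybe": 0.3, "no": 0.2}

greedy  = [next_token(weights, 0,   random.Random(i)) for i in range(5)]
sampled = [next_token(weights, 1.0, random.Random(i)) for i in range(5)]
```

With temperature 0 every pick is the top token; with temperature 1 the picks differ from seed to seed, which is the variance we now have to handle.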


Context Window

A little bit of memory.

illustrative

Context window     "The Brain" (fixed)
--------------     -------------------
[Prompt      ]     [ Pretrained         ]
[Response    ]     [ Transformers       ]
[Prompt      ] --> [ (and other things) ] --+
[Response    ]     [                    ]   |  
[Prompt      ]     [ fixed in           ]   |
[Response    ]     [ model versions     ]   |
     ^                                      |
     |                                      |
     +-- LLM output appended as response ---+

The amount of "memory" in the context window is minuscule
compared to the data encoded in the LLM itself.
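That "little bit of memory" behaves like a fixed token budget that the oldest turns fall out of. A sketch under two simplifying assumptions: the token count is naive (whitespace words, where real models count subword tokens) and the limit is tiny for illustration:

```python
# Sketch: the context window as a fixed budget that old turns fall out of.
# Token counting here is naive (whitespace words); real models use subwords.

CONTEXT_LIMIT = 20  # tiny for illustration; real windows are far larger

def fit_to_window(messages, limit=CONTEXT_LIMIT):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest first
        cost = len(msg.split())
        if used + cost > limit:
            break                       # everything older is forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    "User: please summarise chapter one of the report",
    "AI: chapter one introduces the main characters and setting",
    "User: now chapter two",
    "AI: chapter two covers the conflict",
]
window = fit_to_window(history)
```

Here the first message no longer fits and is silently dropped, exactly the kind of forgetting the diagram implies.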


Context Window Poisoning

illustrative

Context window      "The Brain" (fixed)
---------------     ----------------------
[Old stuff    ]     [ Pretrained         ]
[Old stuff    ]     [ Transformers       ]
[New objective] --> [ (and other things) ]
[New objective]     [                    ]  
[Prompt       ]     [ fixed in           ]
[Response     ]     [ model versions     ]

What to pay attention to? The old or the new??
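One common remedy is to not make the model choose at all: when the objective changes, start a fresh context rather than stacking the new goal on top of the old one. A minimal sketch, again with an illustrative message shape:

```python
# Sketch of one remedy for context poisoning: on a change of objective,
# keep only the system prompt and drop the old turns that could distract.
# Message shape is illustrative, not any specific provider's schema.

def fresh_context(old_messages, new_objective):
    """Carry over only system messages; start the task from a clean slate."""
    system = [m for m in old_messages if m["role"] == "system"]
    return system + [{"role": "user", "content": new_objective}]

old = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Old stuff: refactor module A"},
    {"role": "assistant", "content": "Done."},
]
clean = fresh_context(old, "New objective: write tests for module B")
```

The trade-off is obvious but worth naming: you lose genuinely useful old context along with the stale objective, so reset at task boundaries, not mid-task.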