Big data means big impact – on model sizes, authors’ rights, and the climate. This post is a short, rough explainer of the size of GPT training datasets, and the ethical implications of that size.

We often talk about the increasing scale of the training datasets for new generations of GPT models, essentially to give a sense of their increased performance capacity, but it’s… a little abstract. So I thought I’d try to find a way to unpack it simply.
Why does this matter?
A few reasons, both functional and ethical. A background recap: A GPT’s “training data” is a bit like its memories of life. When I talk about what I “know”, I’m really referring to the bank of things I’ve seen, heard, felt, reasoned, read, and agreed with throughout my life as I remember it. If you ask me to spell a word, I’ll use my memories to try to find or figure out the spelling. A GPT doesn’t have a “life” to reflect on, so it uses the training dataset for this. This is the stock of data fed to it by its developers to enable it to “reflect” and produce a response to a prompt.
So, some of the reasons that training data size matters:
Relative “life experience”. A GPT’s training data is a bit like how much “life” it has experienced. So if GPT-2 is a freshman, GPT-3 is an internationally-distinguished professor nearing retirement.
Ethical sourcing. A GPT’s training data is scraped from a vast bank of sources. Some of it is legitimately acquired; most of it is definitely, definitely not. And every new GPT that is developed is built off the back of previous models, meaning this unethical sourcing cascades from model to model.
Carbon emissions. The larger a GPT’s dataset, the bigger its carbon footprint. This is both in terms of the model’s sheer size, which has to be stored on massive, energy-hungry data servers, and in terms of the carbon emissions generated every time the GPT is prompted to generate something. A commonly cited estimate is that a ChatGPT query produces about five times the carbon emissions of a traditional search engine query, despite being far less reliable (given hallucinations and dataset limits).
How big are these training datasets?
I’ll limit this to OpenAI’s GPT models since 2018 to provide a basic benchmark. A common way of describing text dataset size is in “tokens”: chunks of text, roughly words or word fragments, that the model actually processes. (Here’s a neat explainer if you’re curious. That previous sentence was 10 tokens, if you can’t be bothered following the link.)
Here’s a comparison of training datasets for OpenAI’s milestone models:
2018: GPT-1 - 600 million tokens
2019: GPT-2 - 28,000 million tokens (~47 times larger than GPT-1)
2020: GPT-3 - 300,000 million tokens (~11 times larger than GPT-2)
2023: GPT-4 - 13,000,000 million tokens (~43 times larger than GPT-3)

The exact file sizes have not been made transparent by OpenAI, so we have to rely on heuristics like tokens, but it’s commonly understood that GPT-3 was trained on about 45 terabytes of data, while GPT-4 was trained on about 1 petabyte.
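To make those growth multiples concrete, here’s a quick Python sketch that recomputes them from the token counts above. (Remember these counts are public estimates, not figures confirmed by OpenAI.)

```python
# Rough token counts for OpenAI's milestone models, as quoted above.
# These are widely circulated estimates, not official figures.
token_counts = {
    "GPT-1 (2018)": 600_000_000,
    "GPT-2 (2019)": 28_000_000_000,
    "GPT-3 (2020)": 300_000_000_000,
    "GPT-4 (2023)": 13_000_000_000_000,
}

models = list(token_counts.items())
for (prev_name, prev_tokens), (name, tokens) in zip(models, models[1:]):
    # Growth multiple from one generation to the next
    print(f"{name}: ~{tokens / prev_tokens:.0f}x larger than {prev_name}")
```

Running this reproduces the multiples in the list: roughly 47x, 11x, and 43x from one generation to the next.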
If this is abstract, let’s break it down again. Remember that “bits” are the zeroes and ones that make up digital code.
1 byte (B) = 8 bits
1 kilobyte (KB) = 1024 B
1 megabyte (MB) = 1024 KB
1 gigabyte (GB) = 1024 MB
1 terabyte (TB) = 1024 GB
1 petabyte (PB) = 1024 TB

From kilobyte onward, each line is 1,024 times the size of the line before. I wanted to visualise it, but my computer froze while I was trying, so here’s just one layer. If you can try to imagine multiplying it by a thousand four more times, please do:
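If a visual won’t cooperate, code will: a quick Python check of the unit ladder, plus a comparison of the reported (and unconfirmed) 45 TB and 1 PB training-data figures from earlier.

```python
# Each step up the unit ladder multiplies by 1,024 (2**10).
units = ["B", "KB", "MB", "GB", "TB", "PB"]
for power, unit in enumerate(units):
    size_in_bytes = 1024 ** power
    print(f"1 {unit} = {size_in_bytes:,} bytes")

# The reported ~45 TB for GPT-3 vs the reported ~1 PB for GPT-4:
gpt3_bytes = 45 * 1024 ** 4
gpt4_bytes = 1 * 1024 ** 5
print(f"GPT-4's reported training data is ~{gpt4_bytes / gpt3_bytes:.0f}x GPT-3's")
```

Even in this rough comparison, one generation’s data dwarfs the last by a factor of roughly twenty.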

It’s a bit stressful, but there you go. And each time a new GPT is developed, the scale increases again. This means more data needs to be scraped. It means more “experience” for the algorithm. It also means far more data has to be sourced from somewhere. And let’s be honest – even when they ask permission, that permission is often buried deep in the fine print. And, of course, it means vastly more carbon emitted into our struggling atmosphere, while we try in vain to remember to use our keep cups and convert our home energy systems away from gas.
This is a moral plea, now. I hope it’s not too much to ask. Just, please, pause a moment before you act on that thought: “maybe ChatGPT can save me a few minutes here”.