Video Codecs: More Than You Wanted To Know

How do video codecs work? How do they compress video into such tiny filesizes? A modern video codec is pretty complex, but you can get a good idea of how it works at a high level.

What is a codec anyway?

Codec is short for COder/DECoder. Codecs aren't video files by themselves - codecs are just a part of one. Video files like MP4 are called containers. Each container can hold multiple audio and video streams. (Think different audio tracks for the same video.)

Today I'll just be talking about video codecs. The codec is what compresses the audio or video data, and decompresses it for playback. Each stream is separately encoded when the file is created, and decoded during playback.

Encode/compress and decode/decompress mean the same thing; I use them interchangeably here.

Just how good are they?

High-resolution video has a problem. It takes up so much space.

If you store the raw pixel data for HD video, you'll run out of hard drives pretty quickly. Standard color video uses 8 bits per channel and 3 channels (red, green, blue) per pixel - 3 bytes per pixel, for every frame.

Resolution   Framerate   Per frame   Per second   Per minute
1080p        30fps       6 MB        186 MB       11 GB
4k           30fps       25 MB       746 MB       44 GB
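
Here's a quick sketch of where those numbers come from - a few lines of Python, assuming 3 bytes per pixel and the standard 1920x1080 / 3840x2160 resolutions:

    # Back-of-the-envelope raw video sizes (no compression at all)
    RESOLUTIONS = {"1080p": (1920, 1080), "4k": (3840, 2160)}
    BYTES_PER_PIXEL = 3  # 8 bits each for R, G, B
    FPS = 30

    for name, (width, height) in RESOLUTIONS.items():
        per_frame = width * height * BYTES_PER_PIXEL
        per_second = per_frame * FPS
        per_minute = per_second * 60
        print(f"{name}: {per_frame / 1e6:.1f} MB/frame, "
              f"{per_second / 1e6:.1f} MB/s, {per_minute / 1e9:.1f} GB/min")

    # 1080p: 6.2 MB/frame, 186.6 MB/s, 11.2 GB/min
    # 4k: 24.9 MB/frame, 746.5 MB/s, 44.8 GB/min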

Double those numbers for 60fps. So how can we avoid filling our phones after a minute of footage?

Take this stock video for example. It is in an mp4 container, encoded with the H.264 codec: 4k, 30fps, 11 seconds long. The raw size, from the calculations above, would be 8.2 GB. The actual file is 30.6 MB, including audio. That's 0.4% of the raw data.
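
That ratio is easy to check yourself - a minimal sketch, using the raw-size math from above and the 30.6 MB file size:

    # 11 seconds of raw 4k/30fps pixels vs. the actual encoded file
    raw_bytes = 3840 * 2160 * 3 * 30 * 11   # ~8.2 GB of raw pixel data
    file_bytes = 30.6e6                     # the real mp4, audio included
    print(f"{file_bytes / raw_bytes:.1%}")  # 0.4%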

How do they do it?

How can we get that crazy level of compression? Your first thought might be to compress each frame, a la JPG. Taking this frame as an example:

[Image: a single frame from the video, saved as a JPEG]

At high quality, it weighs 1 megabyte. At 30fps, that's 30 MB per second, or 330 MB for the whole 11-second video. Still far too big - about 10x the size of the actual video file.

So how do they actually work?

Codecs get such great compression by exploiting similarities between frames. Video frames are 1/30th (or 1/60th) of a second apart - they usually have a lot of overlap.

Video codecs store frames in 3 ways:

  • I frames, or intra-coded frames, also called keyframes. These contain full pixel data for the entire image.
  • P frames, or predicted frames. These don't carry full data for every pixel. They depend on the frame before, which may be a keyframe or another P frame.
  • B frames, or bidirectional frames. These are like P frames, but depend on both the frame before and the frame after.

Video files contain framerate data, but that is more of a suggestion than a hard rule. What actually determines when each frame is shown is its timestamp. Timestamps are calculated from two pieces of data in the video file: the timebase and the PTS, short for Presentation TimeStamp.

Calculating the timestamp for a given frame is a bit complicated. Containers have a piece of data called the timebase. This is often a number like 1/90,000, for 1/90,000th of a second.

Each frame also has a PTS, which says when the frame should be Presented, measured in units of the timebase. So the first few frames in a video might look something like

Frame   PTS     Timebase    Time to display
1       0       1/90,000    0 * 1/90,000 = 0 seconds
2       3,000   1/90,000    3,000 * 1/90,000 = 0.033 seconds (1/30th!)
3       6,000   1/90,000    6,000 * 1/90,000 = 0.066 seconds

and so on.
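
Here's that same calculation as a small Python sketch (using Fraction to keep the timebase exact; the PTS values are the ones from the table):

    from fractions import Fraction

    TIMEBASE = Fraction(1, 90_000)      # each PTS tick is 1/90,000th of a second
    pts_values = [0, 3_000, 6_000]      # first few frames, as in the table above

    for frame_number, pts in enumerate(pts_values, start=1):
        seconds = pts * TIMEBASE
        print(f"frame {frame_number}: pts={pts} -> {float(seconds):.3f} s")

    # frame 1: pts=0 -> 0.000 s
    # frame 2: pts=3000 -> 0.033 s
    # frame 3: pts=6000 -> 0.067 s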

What was all that about frame predictions?

How do you capture similarities between frames? From our earlier example, there are still huge savings to be had there - around 10x over just compressing each frame by itself!

Codecs handle this by splitting each frame into regions, and comparing differences between regions in adjacent frames. Two popular modern codecs are H.264 and H.265. H.264 uses sections called macroblocks, and H.265 uses something called coding tree units. These work like macroblocks but with more sizes available.

So now that we've calculated the block differences for two consecutive frames, we can store the full first frame and only the delta for the second.

That second frame is a P frame. Once the first frame is decoded, the second frame can be decoded by applying those changes to the first frame.
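
Here's a toy sketch of that idea in Python. This is not how H.264 really encodes macroblocks - real codecs use motion vectors, transforms, and entropy coding - it just shows the "store only the blocks that changed, then apply them to the previous frame" intuition, with 16x16 blocks standing in for macroblocks:

    import numpy as np

    BLOCK = 16  # 16x16 regions, standing in for macroblocks

    def encode_delta(previous, current):
        """Keep only the blocks that differ from the previous frame."""
        changed = {}
        for y in range(0, current.shape[0], BLOCK):
            for x in range(0, current.shape[1], BLOCK):
                if not np.array_equal(previous[y:y+BLOCK, x:x+BLOCK],
                                      current[y:y+BLOCK, x:x+BLOCK]):
                    changed[(y, x)] = current[y:y+BLOCK, x:x+BLOCK].copy()
        return changed

    def decode_delta(previous, changed):
        """Rebuild a frame by applying the changed blocks to the previous frame."""
        frame = previous.copy()
        for (y, x), block in changed.items():
            frame[y:y+BLOCK, x:x+BLOCK] = block
        return frame

    # Two 64x64 grayscale frames that differ only in one small region
    frame1 = np.zeros((64, 64), dtype=np.uint8)
    frame2 = frame1.copy()
    frame2[20:30, 40:50] = 255  # something moved

    delta = encode_delta(frame1, frame2)
    print(len(delta), "of", (64 // BLOCK) ** 2, "blocks stored")  # 2 of 16 blocks
    assert np.array_equal(decode_delta(frame1, delta), frame2)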

B frames work the same way, except they use the differences against both the frame before and the frame after. Hence bidirectional.

Decoding all of this fast enough is why video playback takes real processing power. In addition to the PTS, frames also have something called a DTS - the decode timestamp. This tells the decoder when each frame should be decoded. Now your timestamps might look something like this:

Frame   Type      DTS                       PTS
1       I frame   0 (first decoded)         3,000
2       P frame   1,000 (second decoded)    6,000
3       B frame   7,000 (fourth decoded)    9,000
4       I frame   5,000 (third decoded)     12,000

The third frame, a B frame, must wait until frames 2 and 4 are decoded.
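
Here's a small sketch of how a player could use the two timestamps, with hypothetical Frame records holding the values from the table: sort by DTS to get decode order, sort by PTS to get display order.

    from collections import namedtuple

    Frame = namedtuple("Frame", ["number", "kind", "dts", "pts"])

    # The four frames from the table above
    frames = [
        Frame(1, "I", dts=0,     pts=3_000),
        Frame(2, "P", dts=1_000, pts=6_000),
        Frame(3, "B", dts=7_000, pts=9_000),
        Frame(4, "I", dts=5_000, pts=12_000),
    ]

    decode_order  = [f.number for f in sorted(frames, key=lambda f: f.dts)]
    display_order = [f.number for f in sorted(frames, key=lambda f: f.pts)]

    print("decode order: ", decode_order)   # [1, 2, 4, 3]
    print("display order:", display_order)  # [1, 2, 3, 4]

Real decoders do this reordering internally with a small buffer of decoded frames, but the bookkeeping is the same: decode in DTS order, hand frames to the display in PTS order.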

Keyframes are the fastest to decode - they don't rely on any other frames. P frames rely on the frame before them, but that's not really an issue: that frame had to be shown earlier, so it has necessarily already been decoded. B frames are another animal. They rely on later frames being decoded before they can be decoded themselves.

The window between DTS and PTS is usually short. Decoded frames take up a lot of memory, so you don't want more of them sitting around than absolutely necessary. Doubling (or more) the work needed to get a frame on screen in time makes for tight deadlines.

B frames have the highest decoding performance requirements, but they also generally give the most compression. You can always compress less and get faster decoding - but then you have much bigger filesizes.

So now we can store keyframes occasionally, when the whole screen needs an update. We can store P frames, holding only the differences from the frame before. And we can store B frames, holding the differences from the frames both before and after, for maximum space savings. And that's how you get a video file 0.4% the size of the original.