Sub Menu

How are movies compressed?

Motion Compensated Prediction

Motion Estimation

Motion Compensation

Transform Coding

Group of Pictures

The Team:
Scott Chin
Anup Misra

Dr. Wu Sheng Lu

August 2003

Video Coding Basics

How are Movies Compressed?

There are three main parts found in a standard video compression algorithm.

  • Motion Compensated Prediction (MCP)
  • Transform Coding (DWT/ DCT)
  • Group of pictures (GOP)

The most fundamental procedure in video compression is the Motion-Compensated Prediction. What does this mean? Video sequences show a high degree of correlation from frame to frame. One compression strategy is to take the difference between adjacent frames and store this value. If the two frames are similar, the difference frame will contain only a little amount of information. This technique is known as Predictive Coding and is commonly used in signal compression. In video signals there is a catch, imagine an action sequence with fast moving cars, and quick camera movements. In these sequences, adjacent frames may be wildly different. If neighboring frames are very different, the difference frame may be large and contain more information than the original frames themselves! To overcome this we use ‘Motion-Compensated’ Prediction.

Back to Top

Motion Compensated Prediction

MCP is a refinement of predictive coding. Rather than finding the difference frame directly, we can use the motion of objects in the scene to produce a better predictive coding algorithm.

How do we estimate the motion in the scene?

Motion Estimation

In general, a scene has multiple moving objects. Therefore, the motion of each object can be characterized from frame to frame. For example, if there is a movie of a car driving across the screen, then each frame shows the same car but it is shifted with respect to the previous frame. This shift can be calculated and characterized by a Motion Vector. Now we can extend this idea to all objects in the scene.

However, determining where all the objects are in a scene is extremely complex. A simple, but non-ideal, solution is to partition each frame into non-overlapping uniform square blocks. This type of motion estimation is called Block Matching.

The blocks in the first frame, called the anchor frame, are compared to the blocks in the second frame, called the target frame. Motion Vectors can then be calculated for each block to see where each block from the anchor frame ends up in the target frame.

The following is an example of two frames from a video sequence. Note that the train is moving from right to left, the ball is rolling, the gimbals are spinning, and the camera is slowly panning upwards.

The following figure shows the motion vectors generated between these two frames. Note the areas of motion. Also note that there is perceived motion in areas of constant color. This is because all the blocks in this area consitute a perfect match. Therefore there is perceived motion. However, this is not a problem when reconstructing the frame.

The following figure shows the target frame predicted from the anchor frame and the motion vectors.

Back to Top

Motion Compensation

When a target frame is reconstructed using the anchor frame and motion vectors, the reconstruction is not perfect. In order to compensate for these errors, an error frame is generated at the encoder. The error frame is the difference between the actual target frame, and the reconstructed target frame. The error frame is added to the reconstructed frame at the decoder.

So why are these error frames and motion vectors desireable? The reason is that the motion vectors produce error frames that compress extremely well.

So how are these error frames compressed? The answer is transform coding.

Back to Top

Transfom coding - The Discrete Wavelet Transform

All mainstream encoders use Discrete Cosine Transform (DCT) to perform transform coding. The DCT maps a time domain signals to a frequency domain representation. We can compress the frequency domain spectrum by truncating low intensity regions.

However, the DCT has several drawbacks. Computation of the DCT takes an extremely long time and grows exponentially with signal size. To calculate the DCT of an entire video frame takes an unacceptable amount of time. The only solution is to partition the frame into small blocks and then apply the DCT to each block. However, this leads to a degradation in picture quality.

The Discrete Wavelet Transform, DWT, offers a better solution. The DWT is another transform that maps time domain signals to frequency domain representations. But the DWT has a distinct advantage; The DWT, in essence, can be computed by performing a set of digital filters which can be done quickly. This allows us to apply the DWT on entire signals without taking a significant performance hit. By analyzing the entire signal the DWT captures more information than the DCT and can produce better results.

So what does the DWT look like when applied to an image?

The figure below shows an image of a seagull. The DWT separates the image’s high frequency components from the rest of the image, resizes the remaining parts and rearranges them to form a new ‘transformed’ image.

This image shows one step of the 2-Dimensional DWT. The image is separated into four subimages. The bottom left, bottom-right and top-right show the high-frequency detail of the image. The top left quadrant contains the low frequency or lower detail portion of the image, we can see that most of the information is in this portion. We can achieve compression by removing data in the high detail areas. As you can see, if we retain only the top left image we are dropping information that does not distort the image in a noticeable fashion.

To perform lossy compression all small components in the transformed image can be set to zero and run-length encode.

Back to Top

Group of Pictures

What is the general structure of compressed and uncompressed video sequences?

In general, there are three different types of frames. They are called I, P, and B frames.

I frames are essentially the main anchor frames. No motion estimation is performed to generate these frames. They are transform coded directly to ensure a high quality reconstruction. This is because all following frames are predicted from the I frame. And any error in the I frame will propagate through the rest of the group.

P frames are predicted using MCP from the preceeding I or P frame. The error frame generated is then transformed and compressed. Both the error frame and the set of motion vectors are stored to file.

B frames are encoded much like P frames except that the prediction is done from a combination of a previous P or I frame, and a future frame P or I frame. The results are then averaged to represent the current frame. This is called bi-directional prediction. The prediction relative to future frames is needed to capture new object that may appear in the video in the middle of the group of pictures.

Why is the ordering important?

When predicting frames from previously encoded frames any errors in the previously encoded frames will degrade the reconstruction of the current frame. This error propagation can be controlled by using a specific frame ordering known as the group of pictures. A general group of pictures structure is shown below.

Back to Top