© 2017 by Doran Bae 


Unboxing: YouTube-8M Starter Code

I am starting a series called Unboxing, where each post either replicates someone else's work (plus my takeaways) or tries out a new library, solution, or tool (plus a review). The first one in the series is on the YouTube-8M dataset and the video classification starter code provided by Google.

I first came to know about this dataset when it was first published in 2016. At that time, I was working on my Master's degree in Data Science at UC Berkeley and used this dataset in one of our class projects. My team and I focused on visualizing and analyzing the dataset, so I thought it might be a nice time to give it a fresh look. As of now, this dataset has been used in two Kaggle competitions, one in 2017 and the other in 2018.

About the Dataset

Video level vs. frame level?

The dataset is provided as frame-level and video-level features. The most accurate way to understand the difference between these two feature types is to reference the official site or the white paper released by Google AI in 2016. In summary, here's the gist of it.

  • From the beginning, YouTube videos are already labeled by what is called the YouTube video annotation system, which tags each video with its main topics ☞ Video-level features

  • Engineers decoded each video at one frame per second, up to the first 360 seconds (6 minutes)

  • They then used a deep CNN (the Inception network, pre-trained on ImageNet) to extract the hidden representation immediately prior to the classification layer

  • Finally, they compressed the frame features (by a factor of 8) ☞ Frame-level features
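The compression stores each frame feature as 8-bit integers, so before use they need to be mapped back to floats. Here is a minimal sketch of that dequantization, assuming a simple linear mapping and a clipping range of [-2, 2]; the starter code ships its own Dequantize helper whose exact bias handling differs slightly.

```python
def dequantize(quantized, min_value=-2.0, max_value=2.0):
    """Linearly map 8-bit values (0..255) back to floats in [min_value, max_value]."""
    scale = (max_value - min_value) / 255.0
    if isinstance(quantized, (list, tuple)):
        return [min_value + q * scale for q in quantized]
    return min_value + quantized * scale

# 0 maps to the bottom of the range, 255 to the top
recovered = dequantize([0, 128, 255])
```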

The total size of the video-level features is 31GB, broken into 3,844 shards which can be subsampled to reduce the dataset size. 31GB is roughly equivalent to watching Netflix for 30 hours; that is not much, and quite manageable.

In contrast, the total size of the frame-level features is 1.53TB, also broken into 3,844 shards. By the same measure, that is equal to watching Netflix for a really, really long time: 1.53TB is about 50 times 31GB, or roughly 1,500 hours.

Can I use one over the other?

It seems you can choose to use just the video-level features (for those concerned about storage costs), or both, for those who have the means to burn money on storage and compute.

Please be mindful that one of the challenges with this dataset is that we only have video-level ground-truth labels. The dataset does not provide any additional information that specifies how the labels are localized within the video, nor their relative prominence in the video.

Downloading the dataset

If you refer to this GitHub repository, you will find every detail you need to download the dataset (in part) and run the starter code.
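For reference, a partial download of the video-level training shards looks roughly like this. The download.py endpoint, partition name, and shard syntax follow the dataset's published download instructions at the time of writing and may have changed since, so treat this as a sketch rather than a guaranteed recipe.

```shell
# Fetch every 100th shard (1/100 of the data) of the video-level
# training features into a local directory.
mkdir -p ~/yt8m/video && cd ~/yt8m/video
curl data.yt8m.org/download.py | shard=1,100 partition=2/video/train mirror=us python
```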


I am using the starter code, and a model can be trained with a single command. We will get to what is inside the starter code later.
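The training invocation looks roughly like this. The flag names follow the starter repo's README, but the data pattern and model directory here are placeholder paths, and the exact flags may differ across versions of the repo.

```shell
# Train a simple logistic model on the video-level features.
python train.py \
  --feature_names="mean_rgb" \
  --feature_sizes="1024" \
  --train_data_pattern="$HOME/yt8m/video/train*.tfrecord" \
  --model=LogisticModel \
  --train_dir="$HOME/yt8m/models/logistic" \
  --start_new_model
```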


The evaluation step is just as easy to complete with a single command.
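An evaluation run points the same model directory at the validation shards; again, the flags follow the starter repo's README and the paths are placeholders.

```shell
# Evaluate the trained model once over the validation shards.
python eval.py \
  --eval_data_pattern="$HOME/yt8m/video/validate*.tfrecord" \
  --model=LogisticModel \
  --train_dir="$HOME/yt8m/models/logistic" \
  --run_once=True
```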

The result of the evaluation is here. Because we trained on a partial dataset (1/100 of the shards), the GAP is 0.323.
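For context, GAP is global average precision: the top predictions from every video are pooled into one ranked list, and average precision is computed over that list (the Kaggle competition used each video's top 20 predictions). A minimal sketch, assuming predictions arrive as (confidence, is_correct) pairs:

```python
def global_average_precision(predictions):
    """Average precision over a pooled list of (confidence, is_correct) pairs."""
    ranked = sorted(predictions, key=lambda p: -p[0])  # highest confidence first
    total_positives = sum(correct for _, correct in ranked)
    hits, precision_sum = 0, 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_positives
```

A perfect ranking (all correct predictions above all incorrect ones) scores 1.0; confident wrong guesses near the top of the pooled list pull the score down.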

Now you can download the entire dataset and train a model to classify videos!