Top YouTube Cubers
Introduction
I'm someone who likes rankings and competition. I used to play chess competitively, and I played DoTA2 on UC Berkeley's eSports team. In both games, a rating system estimates your skill level relative to other players. This is commonly referred to as Elo, though DoTA's implementation is a slight modification.
Speedcubing is a niche hobby where people solve Rubik's cubes as fast as possible. To make it even more niche, there are a handful of us who dedicate time making videos about cubing. These videos include tutorials, product reviews, and competition footage. As is the case with many niche communities on YouTube, individuals have gained notoriety by consistently uploading videos and forming a fanbase.
It's sometimes contested who the best speedcubing YouTube content creators are, but it's difficult to find a metric that represents "goodness" of a content creator. There are many factors at play, including subscriber count, engagement in videos, number of views, etc.
This project attempts to remove some of the subjectivity by using metadata from a collection of uploads for various users.
Scoring Method
The parameters I'm most interested in are viewership and engagement (i.e. how many people click like/dislike on a video). We want people who have more views to typically score higher because it means they have a wider reach to the audience. However, a penalty needs to be put in place for videos that attract negative engagement (i.e. a lot of dislikes).
Thus, we can calculate the share of "good" views as the proportion of likes to total engagements (likes plus dislikes), multiply it by the log of the number of views, and square the result.
\[ \text{score}_{\text{video}} = \left( \frac{\text{likes}}{\text{likes} + \text{dislikes}} \times \log(\text{views}) \right)^2 \]
Note that this means the minimum score is 0 (if all engagement is negative). That's great, but what if someone has created multiple videos? That should improve their channel score. We simply take the sum of the scores for the last 50 uploads, scaling each score by its recency. That is, the oldest video would have a weight of 1/50 while the newest video would have a weight of 1. It's not a perfect representation of time, but it's a good start.
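The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the natural log is assumed (the base isn't specified), videos with no engagement or no views are assumed to score 0, and the recency weight is assumed to step down by 1/50 per upload from the newest video.

```python
import math

MAX_UPLOADS = 50  # the text scores only the last 50 uploads


def video_score(likes: int, dislikes: int, views: int) -> float:
    """Return (like share * log(views))^2 using the natural log (an assumption)."""
    engagements = likes + dislikes
    if engagements == 0 or views < 1:
        return 0.0  # assumption: no engagement or no views scores 0
    like_share = likes / engagements
    return (like_share * math.log(views)) ** 2


def channel_score(scores_newest_first: list[float]) -> float:
    """Recency-weighted sum: newest upload has weight 1, each older one 1/50 less."""
    recent = scores_newest_first[:MAX_UPLOADS]
    return sum(
        score * (MAX_UPLOADS - age_rank) / MAX_UPLOADS
        for age_rank, score in enumerate(recent)
    )
```

With a full 50 uploads, the oldest video ends up with a weight of 1/50 and the newest with a weight of 1, matching the description. For channels with fewer than 50 uploads, this sketch keeps the newest video at weight 1, which is only one of several reasonable choices.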
Implementation
The biggest challenge for this project (at least to where I wanted to get with it) was figuring out how to pull the relevant data for each user. Deciding which coding stack to use was also challenging because I'm limited by how much code I can inject into this site (since it's hosted on Weebly's servers) and I was reluctant to deploy everything on Heroku and AWS.
Sooo, I started with a simple schematic.
A lot of the things were nice-to-haves, such as AWS and using Flask as a web framework. What was most important to me was getting the data first. For this project, I used YouTube's Data v3 API, which is well-documented and has a large user base. However, its quota quickly became a limiting factor and is ultimately why this project ended where it did.
YouTube's (and really, Google's) API allows for 10,000 queries/day. Given the amount of data I'm trying to retrieve, that means I can pull about 20 users' data in a 24-hour period. So if I'm trying to gather rankings for hundreds of users, I can't update them frequently.
There were additional issues such as how YouTube handles channel IDs. Legacy users got to choose their usernames before Google bought YouTube. Newer users have their Google and YouTube accounts linked, and their channel IDs are thus random strings that don't correspond with their username.
I first created a script that would get the channel IDs if the user had a legacy account. From there, I only worked with channel IDs.
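As a rough sketch of that lookup step (not the original script), the Data v3 API's `channels` endpoint accepts a `forUsername` parameter that resolves a legacy username to a channel ID. The function names and the injectable `fetch` argument below are my own additions so the parsing logic can be exercised without a network call or a real API key.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/channels"


def fetch_json(url: str) -> dict:
    """Default fetcher: perform the HTTP GET and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def channel_id_for_username(username: str, api_key: str, fetch=fetch_json):
    """Return the channel ID for a legacy username, or None if nothing matches."""
    query = urllib.parse.urlencode(
        {"part": "id", "forUsername": username, "key": api_key}
    )
    payload = fetch(f"{API_URL}?{query}")
    # The response carries matching channels under "items"; it may be absent.
    items = payload.get("items", [])
    return items[0]["id"] if items else None
```

Calling `channel_id_for_username("someLegacyName", "YOUR_API_KEY")` would issue one real request; usernames that don't map to a legacy account simply come back as `None`.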
After that, the implementation of getting the video metadata and calculating the scores was straightforward, but expensive. Each channel's data cost over 400 search queries, which is why the results at the bottom of this page are so short.
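To make the quota arithmetic concrete, here is a small back-of-the-envelope helper. The 10,000/day figure and the "over 400" per-channel cost come from the text above; the 500-per-channel and 300-channel numbers in the usage note are illustrative assumptions, and the function names are hypothetical.

```python
import math

DAILY_QUOTA = 10_000  # daily figure quoted in the text


def channels_per_day(cost_per_channel: int, daily_quota: int = DAILY_QUOTA) -> int:
    """How many channels fit in one day's quota at a given per-channel cost."""
    return daily_quota // cost_per_channel


def days_for_full_refresh(
    num_channels: int, cost_per_channel: int, daily_quota: int = DAILY_QUOTA
) -> int:
    """Days needed to refresh every channel once, rounding up."""
    per_day = channels_per_day(cost_per_channel, daily_quota)
    if per_day == 0:
        raise ValueError("one channel costs more than the daily quota")
    return math.ceil(num_channels / per_day)
```

At exactly 400 per channel the ceiling is 25 channels a day; since each channel costs more than that, a cost nearer 500 gives the roughly 20 channels a day mentioned earlier. At that rate, a hypothetical list of 300 channels would take `days_for_full_refresh(300, 500)` days, about two weeks, for a single full update.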
Results
Channel Name | Score
Future Directions
There's lots that can still be done to make this better! Going back to the schematic, right now the data in the table updates when I make a new git commit with an updated database csv. It would be nice to have the csv file hosted on AWS for scalability. Furthermore, I currently need to webscrape to find cubing channels. A front-end form would be a great way for people to enter their own channels and have the site populated that way. However, that's very difficult to do with a static page like this one. One implementation would be to include a form on this page that sends emails to an account that I can then webscrape. However, that's far too much work for very little value, so I've put that off for now.
On the back end, optimizations can be made to limit quota usage. For example, I can store when the last video was uploaded in my database and check every few days, for each user, whether anything has changed. Then I can selectively update the score for just that user.
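One way that optimization might look is sketched below. Everything here is hypothetical: the `ChannelRecord` fields, the three-day interval standing in for "every few days", and the `latest_upload_time` callable, which represents a single cheap API lookup returning a channel's newest upload time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(days=3)  # stand-in for "every few days" (an assumption)


@dataclass
class ChannelRecord:
    channel_id: str
    last_upload_at: datetime  # newest upload seen when the score was last computed
    last_checked: datetime    # when this channel was last checked for changes


def channels_to_rescore(records, latest_upload_time, now):
    """Return IDs of channels with a newer upload than the one on record."""
    stale = []
    for record in records:
        if now - record.last_checked < CHECK_INTERVAL:
            continue  # checked recently, so spend no quota on it
        record.last_checked = now
        newest = latest_upload_time(record.channel_id)
        if newest > record.last_upload_at:
            record.last_upload_at = newest
            stale.append(record.channel_id)
    return stale
```

Only the channels returned by `channels_to_rescore` would need the expensive full metadata pull; everything else keeps its stored score.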
As far as the scoring goes, the methodology isn't perfect. One improvement I'd like to make is to account for when the video was uploaded. Being uploaded 10 years ago means that it likely has more views, and that should be slightly penalized.
Overall, this was a fantastic project that brought together multiple aspects of development, especially in the back end. I used Python to pull metadata using YouTube's API and automated multiple procedures to calculate scores and store information in a database.