Last time out, I went deep into some data on the Netflix film database. This time I’ll compare what I found with some findings on Disney+ and Amazon Prime. There are some pretty clear findings that could be repurposed for other data sets and business purposes, but also some classic data analysis problems. Namely, the data is stored differently across the data sets, which poses some issues, and we also get to talk through that classic null value issue.
I’ll be examining the data for Disney+ and Prime, using the same data cleaning that I did last time, replicated for each data set.
Next time I revisit this topic, I’ll combine all three data sets to compare them directly.
Let’s start with the data cleaning that is necessary.
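data = disney_data
# Keep only the rows that describe movies, on a copy so the original is untouched
movies = data[data["type"] == "Movie"].copy()
# Strip the " min" suffix, then convert to a nullable integer
movies["duration"] = movies["duration"].str.replace(" min", "")
movies["duration"] = movies["duration"].astype("Int64")
movies.info()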
I’ve just made a copy of the dataset with only movies in it, and turned the duration from a string (pandas object dtype) into an integer. This means that I can do some actual analysis on it.
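Since the same cleaning gets repeated for each data set, it can be handy to wrap it in a function. The following is only a sketch: the name `clean_movies` and the toy data are hypothetical, and it swaps the string replace for a regex extract so that values not in the "N min" form become missing rather than raising an error.

```python
import pandas as pd

def clean_movies(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only movies and convert 'duration' from e.g. '90 min' to 90."""
    movies = raw[raw["type"] == "Movie"].copy()
    # Pull out the leading digits; rows with no digits become missing.
    minutes = movies["duration"].str.extract(r"(\d+)", expand=False)
    movies["duration"] = pd.to_numeric(minutes).astype("Int64")
    return movies

# Toy stand-in for a raw data set (not real catalogue data).
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", None],
})
cleaned = clean_movies(sample)
print(cleaned["duration"].tolist())
```

The same call could then be made on the Disney and Prime frames, assuming both use the `type` and `duration` column names.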
print(movies.duration.isna().sum())
0
Great, so no null values. Now, I’ll look at what countries Disney’s movies are from.
Pretty egregious visualisation. I’ll come back to this a little bit later. What about when the films were released?
Interesting. Compared to the Netflix data, there isn’t such a massive 2021 dip. However, the scale is very different: the Netflix films peak at around the 750 mark, roughly 10 times the volume here.
Let’s get a little more data on the duration of these films.
Something really clear to me after looking at this is how far the average film length has fallen, to under an hour in recent data at least. The second graphic also shows a lot of films with a very short duration of around 10 minutes.
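One way to put numbers on both observations is to group by release year. This is a rough sketch on toy data: the `release_year` column name and the 30 minute cut-off for a "short" film are my assumptions, not something taken from the data set.

```python
import pandas as pd

# Toy stand-in for the cleaned movies frame.
movies = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "duration": pd.array([100, 10, 90, 8, 12], dtype="Int64"),
})

# Count, mean and median duration per release year.
by_year = movies.groupby("release_year")["duration"].agg(["count", "mean", "median"])

# Share of films per year that run under 30 minutes (assumed threshold).
is_short = (movies["duration"] < 30).astype("Float64")
short_share = is_short.groupby(movies["release_year"]).mean()

print(by_year)
print(short_share)
```

Comparing the mean with the median is a quick check on whether a cluster of very short films is dragging the average down.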
But let’s go back to that country data.
Still terrible. It appears that multiple countries are listed in a single value of the ‘country’ column of the DataFrame. There is a solution, however: by looping over the values and splitting each one on the comma, it’s possible to count the number of times each country shows up. Even the graph above could lead you to guess the number one country featured in Disney’s movie database.
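A minimal sketch of that loop, using a toy list in place of the real column. The function name `count_countries` is hypothetical; on the real data the `country` column could be passed in directly, since iterating over a pandas Series yields its values.

```python
from collections import Counter

def count_countries(values):
    """Count how often each country appears across comma-separated entries."""
    counts = Counter()
    for value in values:
        if not isinstance(value, str):  # skip missing entries such as None or NaN
            continue
        for country in value.split(","):
            country = country.strip()  # drop the space left after each comma
            if country:
                counts[country] += 1
    return counts

# Toy values, not real catalogue data.
sample = ["United States", "United States, Canada", None, "United Kingdom, United States"]
counts = count_countries(sample)
print(counts.most_common(3))
```

Note that a film listing two countries is counted once for each, so the counts can add up to more than the number of films.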
Yep. The United States is an extreme outlier. This is very different from Netflix, which had a huge number of films from India, rivalling even the number of films with the United States as the country.
These numbers can be normalised to show the split a bit more clearly.
United States 87.74
United Kingdom 7.22
Canada 5.80
Australia 1.90
France 1.24
Germany 0.76
China 0.57
Japan 0.38
Spain 0.38
India 0.38
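The figures above add up to more than 100, which suggests each one is the percentage of movies that list that country rather than a share of all mentions. Under that assumption, the normalisation could look something like this (`share_of_movies` and the toy counts are hypothetical):

```python
def share_of_movies(counts, n_movies):
    """Percentage of movies that list each country, highest first."""
    shares = {country: round(100 * n / n_movies, 2) for country, n in counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))

# Toy counts for 4 movies, one of which lists two countries.
toy_counts = {"Canada": 1, "United States": 3, "United Kingdom": 1}
shares = share_of_movies(toy_counts, n_movies=4)
for country, pct in shares.items():
    print(country, pct)
```

Dividing by the total number of mentions instead would force the shares to sum to 100, at the cost of no longer reading as "percent of films".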
Let’s move on to the Amazon Prime data. After completing similar data validation tasks as mentioned at the beginning, the null value count shows something interesting.
movies.country.isna().sum()
7245
That’s a lot of null values: 7245 out of a total of 7814, or about 93%.
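A raw count is hard to judge on its own, so it can help to look at the fraction of missing values instead. A small sketch on toy data, followed by the same arithmetic for the two numbers quoted above:

```python
import pandas as pd

# Toy stand-in for the Prime movies frame.
movies = pd.DataFrame({"country": ["India", None, None, "United States", None]})

# The mean of a boolean mask is the fraction of True values.
missing_share = movies["country"].isna().mean()
print(f"{missing_share:.1%} of the toy country values are missing")

# The same ratio for the counts reported in this post.
prime_missing_share = 7245 / 7814
print(f"{prime_missing_share:.1%} of the Prime country values are missing")
```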
Yep. About what I would expect when 90%+ of the values are missing. In this case, without any way to query the source further, we simply have to accept that the country data we have for Amazon Prime films is just poor.
Moving along, let’s have a look at the number of films released over time.
Wow, the volume here is really astounding, especially for recent films. It would take a closer look at how Amazon works to really understand what is going on, but we can say for sure that they have a large volume of films, particularly new ones.
Next is the duration of the films.
Something I noticed immediately is the different scale of the average duration. Compared to the Disney data, it appears to stay healthy at around the 90 minute mark. There is still a slow decline towards shorter films, but it is not a dramatic one in this case.
The second graph is a bit alarming. The first thing to mention here is outliers. This might be a topic for another time. However, it’s quite clear that a couple of values around the 500 minute mark in the last 5 years drastically distort this graph, and it begins to lose meaning. One way to deal with this would be to remove them, but I thought they told an interesting story, so I decided to keep them in this case.
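For anyone who does want to separate those values out, one common rule of thumb is to flag anything more than 1.5 times the interquartile range above the third quartile. This is not what I did here, and both the rule and the toy numbers below are just an illustration:

```python
import pandas as pd

# Toy durations with one extreme value, standing in for the real column.
durations = pd.Series([85, 90, 92, 95, 100, 105, 110, 500], name="duration")

q1, q3 = durations.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)  # upper fence of the 1.5 * IQR rule

# Flag rather than drop, so the values can still be inspected.
is_outlier = durations > upper
print(durations[is_outlier].tolist())
print(durations.mean(), durations[~is_outlier].mean())
```

Flagging instead of deleting keeps the option of plotting with and without the extreme values side by side.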
Lastly, I decided to show the countries regardless of the massive number of null values. Strangely enough, India comes out on top, especially in terms of duration, meaning that its films are generally longer.