In this notebook, I'll take a look at my streaming viewing history data from my Hulu, Netflix, and Prime Video accounts. I will merge these datasets with an additional dataset that contains more information about the streaming titles, such as run time, genres, and IMDb score. This will enable me to gain more insight into my streaming history.
I turned in my Project Plan when I only had my Netflix data. Since then, I've been able to obtain my Hulu and Prime Video data.
My original idea was to compare my streaming data to the top streaming titles but I wasn't able to find a dataset that worked with my data.
I realized (too late for this project) that Netflix and Prime Video create a new "watch event" every time a stream is paused and played again. This means that my data shows individual titles with artificially high watch counts. I decided not to remove the duplicates automatically because doing so would also have removed genuine repeat viewings. I would rather go through the datasets myself to determine which ones to delete.
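As a rough illustration of how such pause-and-resume events could be flagged for manual review rather than deleted, here is a minimal sketch. It is not part of the project notebooks. The sample rows, the `Title` column name, the `Possible Resume` column, and the one-hour threshold are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of watch events; real exports will have different columns.
events = pd.DataFrame({
    "Title": ["Show A", "Show A", "Show A", "Movie B"],
    "Date Watched": pd.to_datetime([
        "2022-01-05 20:00:00",
        "2022-01-05 20:20:00",  # shortly after the first event: likely a resume
        "2022-01-12 21:00:00",  # a week later: likely a genuine rewatch
        "2022-01-05 22:00:00",
    ]),
})

# Assumed threshold: events for the same title less than an hour apart
# are treated as candidates for pause/resume duplicates.
GAP = pd.Timedelta(hours=1)

# Sort so that each title's events are in chronological order.
events = events.sort_values(["Title", "Date Watched"]).reset_index(drop=True)

# Time elapsed since the previous event for the same title (NaT for the first).
gap = events.groupby("Title")["Date Watched"].diff()

# Flag, but do not drop, rows that closely follow a previous event.
events["Possible Resume"] = gap.notna() & (gap <= GAP)

print(events)
```

Flagging instead of dropping keeps every row in place, so the final decision about which events to delete can still be made by hand.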
I removed more data from my streaming viewing history than I would have liked in order to combine the datasets. While Hulu had the best data because it did not count each pause as a new "watch event", the downside was that the data came split over multiple PDFs and would have required far more time to clean. Because of time constraints, I decided to stick with the basic data from each streaming service.
I struggled to find an additional dataset that worked with my combined streaming viewing history data. In the end, I stuck with the one that gave me the largest merged dataset. I know I could have used just the three streaming datasets, but I wanted the challenge of merging additional data.
- Hulu
- Netflix
- Prime Video
Kaggle - Netflix, Movies, and Popularity
Using a Virtual Environment
- Clone this repo: git clone https://github.com/istarlet/streaming_analysis.git
- Create a new folder in the cloned repo called datasets
- Download the datasets here and add the downloaded datasets to the datasets folder (see note)
- CD into the cloned project folder
- Install virtualenv if you don't already have it installed
pip install virtualenv
- Activate the virtual environment (see instructions here)
- Install the requirements.txt file
pip install -r requirements.txt
- Then run these Jupyter Notebook files: Hulu, IMDb, Netflix, Prime Video BEFORE running the main project notebook Streaming Data
(Note: If you downloaded the datasets as a .zip file, make sure to add the individual datasets to the new datasets folder and not the folder they were zipped in.)
On Mac/Linux
Open the Terminal and create a virtual environment with the command python3 -m venv virtual-env
Activate the virtual environment with the command source virtual-env/bin/activate
On Windows
Open the Command Prompt and create a virtual environment with the command python -m venv virtual-env
Activate the virtual environment with the command virtual-env\Scripts\activate.bat
To deactivate the virtual environment, type deactivate
- datetime
- matplotlib
- pandas
- seaborn
Read TWO data files (JSON, CSV, Excel, etc.).
I read in four CSV files.
- AshleyViewingActivity.csv
- DigitalPrimeVideoViewingHistory.csv
- HuluViewingHistoryUpdated.csv
- titles.csv
Clean your data and perform a pandas merge with your two data sets, then calculate some new values based on the new data set.
I cleaned each dataset in its own Jupyter Notebook.
I then concatenated the Hulu, Netflix, and Prime Video datasets together.
I then merged my new combined streaming dataset with the IMDb dataset.
I added new columns to the merged dataset by extracting the day, month, and hour from the "Date Watched" column.
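The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code. The tiny sample frames stand in for the cleaned datasets, and the `Title`, `Service`, `runtime`, and `imdb_score` column names, the inner join, and the use of day and month names are assumptions. Only the "Date Watched" column name comes from the project itself.

```python
import pandas as pd

# Hypothetical stand-ins for the three cleaned viewing-history datasets.
hulu = pd.DataFrame({
    "Title": ["Show A"],
    "Date Watched": ["2022-01-05 20:15:00"],
    "Service": ["Hulu"],
})
netflix = pd.DataFrame({
    "Title": ["Movie B", "Show A"],
    "Date Watched": ["2022-02-10 09:30:00", "2022-03-01 22:45:00"],
    "Service": ["Netflix", "Netflix"],
})
prime = pd.DataFrame({
    "Title": ["Movie C"],
    "Date Watched": ["2022-04-20 13:00:00"],
    "Service": ["Prime Video"],
})

# Hypothetical stand-in for the titles dataset (column names assumed).
titles = pd.DataFrame({
    "Title": ["Show A", "Movie B"],
    "runtime": [45, 120],
    "imdb_score": [8.1, 6.9],
})

# Stack the three viewing histories into one combined dataset.
streaming = pd.concat([hulu, netflix, prime], ignore_index=True)

# Assumed inner join: keep only watch events whose title has metadata.
merged = streaming.merge(titles, on="Title", how="inner")

# Derive new columns from the watch timestamp.
merged["Date Watched"] = pd.to_datetime(merged["Date Watched"])
merged["Day"] = merged["Date Watched"].dt.day_name()
merged["Month"] = merged["Date Watched"].dt.month_name()
merged["Hour"] = merged["Date Watched"].dt.hour

print(merged)
```

With an inner join, any title missing from the metadata dataset (here the hypothetical "Movie C") drops out of the result, which is why the choice of additional dataset affects the size of the merged data.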
Make 3 matplotlib or seaborn (or another plotting library) visualizations to display your data.
Utilize a virtual environment and include instructions in your README on how the user should set one up
I created a virtual environment with instructions in the How to Run this Project section.
Annotate your code with markdown cells in Jupyter Notebook, write clear code comments, and have a well-written README.md.
In my Jupyter Notebooks, I annotated my code with markdown cells and wrote clear code comments. I have included a README.md.