- COMP30760 Data Science in Python
- COMP30770 Programming for Big Data
- COMP30850 Network Analysis
- COMP30390 Optimisation
- COMP47490 Machine Learning
- Acknowledgements
Autumn Trimester, 2020
- Create and activate the ds-env environment.
conda env create -f environment.yml
conda activate ds-env
- Change to the python-comp30760 directory, then run:
jupyter notebook
- The project notebooks can now be run.
The objective of this assignment is to collect a dataset from one or more open web APIs, and use Python to prepare, analyse, and derive insights from the collected data.
- API chosen: Spotify Web API
- Data: python-comp30760/data/a1/ (pre-collected, since calling the API requires secret tokens that are not included in this repository).
- Notebook: a1-spotify-analysis.ipynb
- Data Identification and Collection:
- Choose one or more public web APIs.
- Collect data from your API(s) using Python.
- Save the collected dataset in JSON format for subsequent analysis.
- Data Preparation and Analysis:
- Load the stored JSON dataset, and represent it using an appropriate structure.
- Apply any pre-processing steps that might be required to clean, filter or engineer the dataset before analysis.
- Analyse, characterise, and summarise the cleaned dataset, using tables and visualisations where appropriate.
- Summarise any insights which you gained from your analysis of the dataset, and suggest ideas for further analysis.
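A minimal sketch of the collection and loading steps is below; the endpoint, token, and file names are placeholders rather than the ones used in a1-spotify-analysis.ipynb:

```python
import json

import pandas as pd
import requests

# Placeholder endpoint and token: the notebook itself uses the Spotify Web API,
# whose secret tokens are not included in this repository.
API_URL = "https://api.example.com/v1/items"
HEADERS = {"Authorization": "Bearer <access-token>"}

# Collect the raw data and store it as JSON for later analysis.
response = requests.get(API_URL, headers=HEADERS)
response.raise_for_status()
with open("data/a1/raw.json", "w") as f:
    json.dump(response.json(), f, indent=2)

# Load the stored JSON and represent it as a DataFrame for cleaning and analysis.
with open("data/a1/raw.json") as f:
    records = json.load(f)
df = pd.json_normalize(records)
print(df.head())
```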
Increasingly, large-scale mobility datasets are being made publicly available for research purposes. This type of data describes the aggregated movement of people across a region or an entire country over time. Mobility data can naturally be represented as a time series, where each day is a different observation. Recently, Google made mobility data available to help researchers understand the effects of COVID-19 and associated government policies on public behaviour. This data charts movement patterns across different location categories (e.g. work, retail, etc.). The objective of this assignment is to construct different time series representations for a number of countries based on the supplied mobility data, and to analyse and compare the resulting series.
- Data: python-comp30760/data/a2/ (three countries selected: Ireland, New Zealand, USA)
- Notebook: a2-covid-19-mobility.ipynb
- Within-country analysis (for each of the three selected countries separately)
- Construct a set of time series that represent the mobility patterns for the different location categories for the country (e.g. workplaces, residential, transit stations etc).
- Characterise and visualise each of these time series. You may choose to apply re-sampling and/or smoothing in order to provide a clearer picture of the trends in the series.
- Compare and contrast how the series for the different location categories have changed over time for the country. To what extent are these series correlated with one another?
- Suggest explanations for any differences that you have observed between the time series for the location categories.
- Between-country analysis (taking the three selected countries together)
- Construct a set of time series that represent the overall mobility patterns for the three countries.
- Characterise and visualise each of these time series. You may choose to apply re-sampling and/or smoothing in order to provide a clearer picture of the trends in the series.
- Compare and contrast how the overall time series for the three countries have changed over time. To what extent are these series correlated with one another?
- Suggest explanations for any differences that you have observed between the time series for the countries.
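As a rough illustration of the time-series handling involved (the file name and column names below are assumptions, not the notebook's actual ones):

```python
import pandas as pd

# Assumed file name and columns; the real files live under python-comp30760/data/a2/
# and follow the schema of Google's COVID-19 Community Mobility Reports.
df = pd.read_csv("data/a2/mobility_ireland.csv", parse_dates=["date"], index_col="date")
categories = ["workplaces", "residential", "transit_stations"]

# Weekly re-sampling and a centred rolling mean smooth out day-of-week effects.
weekly = df[categories].resample("W").mean()
smoothed = df[categories].rolling(window=7, center=True).mean()

# Pairwise correlation between the location-category series.
print(df[categories].corr())
```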
Spring Trimester, 2021
For a detailed report, please see proj1-report.pdf.
The scripts below can be run on any Linux shell, without any command line arguments.
Bash on WSL2 for Windows 10 has been used for development.
- Raw dataset: data/reddit_2021.csv
- Execute permissions for scripts:
$ chmod +x ./*.sh
- Perform the data cleaning operations:
$ ./00-replace-protected-commas.sh
$ ./01-drop-index-and-nsfw.sh
$ ./02-drop-empty-cols.sh
$ ./03-drop-single-val-cols.sh
$ ./04-sec-to-month.sh
$ ./05-count-posts-per-month.sh
$ ./06ab-title-lower-no-punc.sh
$ ./06c-remove-stop-words.sh
$ ./06d-reduce-to-stem.sh
$ ./06e-place-clean-titles.sh
- All other files in the data/ directory are regenerated by the scripts above.
- Clean dataset obtained: data/reddit_2021_clean.csv
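The cleaning itself is done by the shell scripts above; purely for reference, a rough Python equivalent of the title-cleaning steps (06a-06e), assuming NLTK is available and that the raw file has a title column:

```python
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-off download of the English stop-word list
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_title(title: str) -> str:
    # Lower-case and strip punctuation (06a/06b).
    title = title.lower().translate(str.maketrans("", "", string.punctuation))
    # Remove stop words (06c) and reduce the remaining words to stems (06d).
    return " ".join(stemmer.stem(w) for w in title.split() if w not in stop_words)

# The column name is an assumption about the raw dataset's layout.
df = pd.read_csv("data/reddit_2021.csv")
df["title"] = df["title"].astype(str).map(clean_title)
```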
A Docker container with support for MySQL and MongoDB is recommended.
Image used: registry.gitlab.com/roddhjav/ucd-bigdata/db
- Stop and remove preexisting containers named comp30770-db, then create a new container using the above image.
$ ./docker-create.sh
- Copy scripts for MySQL and MongoDB, and the cleaned dataset data/reddit_2021_clean.csv to the container.
$ ./docker-cp-files.sh
- Start a Bash prompt in the container's /root directory.
$ ./docker-start.sh
- Create and populate the 'reddit' database in MySQL and MongoDB (in Docker).
# ./07-mysql-create-db.sh
# ./08-mysql-populate-db.sh
# ./09-mongo-populate-db.sh
- Queries are run in the mysql and mongo prompts, as described in the report.
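The queries themselves are run in the mysql and mongo prompts inside the container. Just to illustrate the same kind of query from Python, a sketch with assumed connection details and an assumed posts table/collection (the actual schema and queries are in the report):

```python
import pymysql
from pymongo import MongoClient

# Connection details and the posts/subreddit schema are assumptions for illustration only.
sql_conn = pymysql.connect(host="127.0.0.1", user="root", password="root", database="reddit")
with sql_conn.cursor() as cur:
    cur.execute("SELECT subreddit, COUNT(*) FROM posts GROUP BY subreddit LIMIT 5")
    print(cur.fetchall())

mongo = MongoClient("mongodb://127.0.0.1:27017")
pipeline = [{"$group": {"_id": "$subreddit", "count": {"$sum": 1}}}, {"$limit": 5}]
print(list(mongo["reddit"]["posts"].aggregate(pipeline)))
```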
- CLI for Big Data
- Relational (SQL) vs. non-relational (NoSQL) database systems
- Review on 'Dynamo: Amazon's highly available key-value store'
For a detailed report, please see proj2-report.pdf.
- Execute permissions for scripts:
$ chmod +x ./*.sh
- Set up a Docker cluster for Spark:
$ ./docker-setup-spark.sh
- Clean and download data/ files, copy them to the Docker container, and start a Bash prompt. Both files in the data/ directory are regenerated by this script.
$ ./docker-start.sh
- In the container, run Spark SQL queries on the GitHub starred projects dataset (a rough PySpark illustration of this kind of query follows this list).
bash-5.0# spark-shell -i 01-github.scala
- Graph processing on the DBLP co-authorship dataset.
bash-5.0# spark-shell -i 02-dblp.scala
- Reflection: Review on 'Spark: Cluster Computing with Working Sets'.
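The actual queries are written in Scala (01-github.scala, 02-dblp.scala). For illustration only, a minimal PySpark sketch of the kind of Spark SQL query involved, with a hypothetical file name and column layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("github-starred").getOrCreate()

# Hypothetical file and schema; the real ones are defined in 01-github.scala.
projects = spark.read.option("header", True).csv("data/github_starred.csv")
projects.createOrReplaceTempView("projects")

# Example Spark SQL query: the ten languages with the most starred projects.
spark.sql("""
    SELECT language, COUNT(*) AS n
    FROM projects
    GROUP BY language
    ORDER BY n DESC
    LIMIT 10
""").show()
```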
Spring Trimester, 2021
The goal of this assignment is to construct and characterise network representations of two movie-related datasets. The networks should model the co-starring relations between actors in these two datasets, i.e. the collaboration network of actors who appear together in the same movies.
- Notebook: a1-co-stardom-network.ipynb
- Set up the conda environment and start Jupyter Notebook.
$ conda create -n a1-comp30850 python=3.8 jupyterlab networkx seaborn
$ conda activate a1-comp30850
$ cd network-analysis-comp30850/a1-co-stardom-network/
$ jupyter notebook
- GEXF and PNG files: net1.gexf, net2.gexf, net1.png, net2.png
- Gephi project: costardom.gephi
For each dataset:
- Network Construction
- Parse the JSON data and create an appropriate co-starring network using NetworkX, where nodes represent individual actors.
- Identify and remove any isolated nodes from the network.
- Network Characterisation
- Apply a range of different methods to characterise the structure and connectivity of the network.
- Apply different centrality measures to identify important nodes in the network.
- Ego-centric Analysis
- Select one of the important nodes in the network and generate an ego network for this node.
- Network Visualisation
- Export the network as a GEXF file and use Gephi to produce a useful visualisation.
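Not the notebook's exact code, but a minimal NetworkX sketch of these steps, assuming each movie record in the JSON carries a cast list:

```python
import itertools
import json

import networkx as nx

# Hypothetical input layout: each movie has a "cast" list of actor names.
with open("data/movies.json") as f:
    movies = json.load(f)

# Co-starring network: connect every pair of actors appearing in the same movie.
g = nx.Graph()
for movie in movies:
    for a, b in itertools.combinations(movie["cast"], 2):
        g.add_edge(a, b)

# Remove isolated nodes and characterise the structure of the network.
g.remove_nodes_from(list(nx.isolates(g)))
print(g.number_of_nodes(), g.number_of_edges(), nx.density(g))

# Identify an important node, build its ego network, and export for Gephi.
centrality = nx.degree_centrality(g)
top = max(centrality, key=centrality.get)
ego = nx.ego_graph(g, top)
nx.write_gexf(g, "net1.gexf")
```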
Sample visualisations (see notebook for details):
The goal of this assignment is to construct and characterise a range of network representations, created from pre-collected Twitter data for a specific Twitter List of user accounts which relate to a particular topic (e.g. technology, sports news, etc.).
- Notebook: a2-twitter-networks.ipynb
- Set up the conda environment and start Jupyter Notebook.
$ conda create -n a2-comp30850 python=3.8 jupyterlab networkx seaborn
$ conda activate a2-comp30850
$ cd network-analysis-comp30850/a2-twitter-networks/
$ jupyter notebook
- GEXF and PNG files: follow_net.gexf, follow_net.png, mention_net.gexf, mention_net.png, hashtag_co_net.gexf, hashtag_co_net.png
- Gephi project: twitter-networks.gephi
For the selected data, construct and characterise five different Twitter network representations.
- Follower network
- Reply network
- Mention network
- User-hashtag network
- Hashtag co-occurrence network
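As one example of how such a representation might be built (the field names below are assumptions about the pre-collected data), a sketch for a weighted, directed mention network:

```python
import json

import networkx as nx

# Assumed tweet layout: each tweet records its author ("user") and the
# screen names it mentions ("mentions"); the actual fields may differ.
with open("data/tweets.json") as f:
    tweets = json.load(f)

# Directed, weighted mention network: an edge u -> v each time u mentions v.
mention_net = nx.DiGraph()
for tweet in tweets:
    user = tweet["user"]
    for mentioned in tweet.get("mentions", []):
        if mention_net.has_edge(user, mentioned):
            mention_net[user][mentioned]["weight"] += 1
        else:
            mention_net.add_edge(user, mentioned, weight=1)

nx.write_gexf(mention_net, "mention_net.gexf")
```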
Sample visualisations (see notebook for details):
Autumn Trimester, 2021
cd optimisation-comp30390
conda env create -f env-comp30390.yml
conda activate comp30390
jupyter notebook
A number of classic linear programming problems are solved in Julia.
- Notebook: a1
- Notebook: a2
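The notebooks solve these problems in Julia; purely to give a flavour of a classic LP, a SciPy sketch of a small product-mix problem (not one of the assignment problems):

```python
from scipy.optimize import linprog

# Maximise 3x + 5y subject to x <= 4, 2y <= 12, 3x + 2y <= 18, x, y >= 0.
# linprog minimises by convention, so the objective is negated.
c = [-3, -5]
A_ub = [[1, 0], [0, 2], [3, 2]]
b_ub = [4, 12, 18]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)  # optimal point (2, 6) with objective value 36
```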
Autumn Trimester, 2021
cd machine-learning-comp47490
conda env create -f env-com47490-m1.yml
conda activate comp47490-m1
jupyter notebook
- Notebook: a1
- Task:
- Given a sample of the Austin Animal Shelter Outcomes dataset, the objective is to build a data analytics solution for death risk prediction, to help the shelter plan for improved animal welfare.
- The goal is to work with the sample to build and evaluate prediction models that capture the relationship between attributes and the target feature: outcome.
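A minimal sketch of the kind of model-building and evaluation involved; the file name and feature columns are assumptions, with outcome as the target named in the brief:

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file name; only the target feature "outcome" comes from the brief.
df = pd.read_csv("data/animal_shelter_sample.csv")
X = pd.get_dummies(df.drop(columns=["outcome"]))
y = df["outcome"]

# Hold out a test set, fit a baseline classifier, and report its performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```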
- Notebook: a2
- Task:
- Given a sample of the Adult Census Income dataset, the objective is to use ensemble learning to identify the extent to which classification performance can be improved by combining multiple models.
- The data contains 14 attributes, including age, race, sex, and marital status, and the goal is to predict whether an individual earns over $50K per year.
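A sketch of how single-model performance might be compared against ensembles with scikit-learn; the file and column names are assumptions, not the notebook's:

```python
import pandas as pd
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file/column names for the Adult Census Income sample;
# the binary target is whether income exceeds $50K per year.
df = pd.read_csv("data/adult_sample.csv")
X = pd.get_dummies(df.drop(columns=["income"]))
y = df["income"]

# Compare a single decision tree against two ensembles via cross-validated accuracy.
models = [
    ("single tree", DecisionTreeClassifier(random_state=42)),
    ("bagging", BaggingClassifier(random_state=42)),
    ("boosting", GradientBoostingClassifier(random_state=42)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```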