ENN543, Data Analytics and Optimisation

Supplementary Assignment
ENN543, Data Analytics and Optimisation, Semester 2, 2019
Queensland University of Technology

Problem 1. Clustering. Bike share systems are becoming increasingly common in cities across the world, but their usage is highly variable and depends on factors such as local weather.

You have been provided with two months of data from the New York Bike Share system, covering one month in summer (Q1/JC-201707-citibike-tripdata.csv) and one month in winter (Q1/JC-201801-citibike-tripdata.csv). From the size of the files alone it is evident that there are substantially fewer trips in winter than in summer; however, it is unclear whether the actual pattern of use (i.e. the typical types of trips) is different.

Using this data and the clustering method of your choice, you are to attempt to answer the question: ‘aside from the overall number of trips, do usage patterns change from summer to winter?’. In doing this you should cluster the data using the following five dimensions:

start station latitude;
start station longitude;
end station latitude;
end station longitude;
tripduration.

Note that this means that clusters will contain 5 dimensions, and visualisation of clusters in a single 2D plot will not be possible.

Your answer should demonstrate and discuss how usage patterns are similar or dissimilar (depending on what you find), and should also consider different time periods (morning, afternoon, etc) to better explore how the service is used.
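
One way to consider different time periods is to bin each trip by the hour at which it starts and then cluster each bin separately. The following is a minimal sketch only: the timestamp format, the period names and the hour boundaries are all assumptions (the brief does not define them), and `time_period` is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical period boundaries (first hour inclusive, end hour exclusive).
# The brief does not define these, so adjust them to suit the analysis.
PERIODS = [
    ("night", 0, 6),
    ("morning", 6, 12),
    ("afternoon", 12, 18),
    ("evening", 18, 24),
]


def time_period(start_time: str) -> str:
    """Map a timestamp such as '2017-07-01 08:15:30' to a period label."""
    # Assumed format; fractional seconds, if present, are dropped first.
    hour = datetime.strptime(start_time.split(".")[0], "%Y-%m-%d %H:%M:%S").hour
    for label, first_hour, end_hour in PERIODS:
        if first_hour <= hour < end_hour:
            return label
    raise ValueError(f"hour out of range: {hour}")
```

Clustering the summer and winter trips within each period separately would then allow, for example, morning patterns to be compared across the two seasons.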

Your answer should explain all decisions made when conducting the analysis, including details such as:

the clustering method selected;
any parameters that are required for the clustering;
any outlier removal that is conducted; and
any data normalisation or scaling that is performed.
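
To illustrate the last two points: tripduration takes far larger values than the coordinate dimensions, so a distance-based clustering would be dominated by it unless the dimensions are scaled, and a few very long trips can distort both the scaling and the clusters. The sketch below shows one possible treatment, an interquartile-range (IQR) rule for outlier removal followed by z-score standardisation. The function names and the 1.5 multiplier are assumptions chosen for illustration, not requirements of the brief.

```python
from statistics import mean, pstdev, quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


def standardise(rows):
    """Z-score each column so every dimension has mean 0 and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, which would otherwise divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

Each row here is assumed to be a list of the five clustering dimensions, with tripduration in the last position, so `standardise(drop_outliers(rows, column=4))` would produce the input to the clustering step. Whichever approach is used, the thresholds chosen and the number of trips removed should be reported.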

Problem 2. Classification. Software systems are complex, and errors in deployed software can be very costly and difficult to correct. In an effort to help detect faulty software, a number of metrics have been proposed that measure software complexity.

You have been provided with data (Q2/pc1.csv) which contains various code metrics for a number of software examples, as well as a flag to indicate if the software contains a fault or not. For clarity:

The first 21 columns contain predictors that measure some aspect of the software complexity, and may be used to determine if software is faulty or not;
The last column contains a value of true or false, indicating if the software has a defect or not.

Using this data, you are to train a support vector machine (SVM) to separate defective software from error-free software. You are to report on the accuracy of the developed model, and on any problems or challenges that you encounter in developing the model. In doing this you should:

Divide the data into appropriate training, validation and testing datasets;
Consider what SVM parameters (box constraint, kernel type, etc.) you should use;
Consider the class distribution of the data, and make allowances within the model as needed.
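
As an illustration of the first and third points, the sketch below splits sample indices into training, validation and testing sets while keeping the proportion of each class roughly the same in every set, and computes inverse-frequency class weights. The 60/20/20 split, the fixed seed and both function names are assumptions made for the example, not part of the brief.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists with similar class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        # Whatever remains of this class goes to the test set.
        test += indices[n_train + n_val:]
    return train, validation, test


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}
```

Weights of this kind could be supplied to an SVM implementation as per-class misclassification costs or per-sample weights, so that errors on the rarer class are penalised more heavily. With an imbalanced dataset, overall accuracy alone can be misleading, so per-class results are worth reporting as well.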

Please note that allowing MATLAB to optimise hyper-parameters in place of properly investigating parameter settings is not acceptable as a justification for hyper-parameter selection, though a grid search (which is a more systematic approach) will be accepted.
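
A grid search of the kind described above can be sketched as follows. This is a minimal outline under stated assumptions: `evaluate` is a hypothetical callable that would train an SVM on the training set with the given parameters and return a score on the validation set, and the candidate ranges are placeholders rather than recommended values.

```python
from itertools import product


def grid_search(evaluate, box_constraints, kernel_scales):
    """Score every parameter pair and return (best_params, all_results).

    `evaluate(box_constraint, kernel_scale)` is a hypothetical callable that
    trains an SVM on the training set and returns a validation score
    (higher is better).
    """
    results = {
        (c, s): evaluate(c, s) for c, s in product(box_constraints, kernel_scales)
    }
    best_params = max(results, key=results.get)
    return best_params, results


# Log-spaced candidate values; the ranges are assumptions, not recommendations.
BOX_CONSTRAINTS = [10.0 ** k for k in range(-2, 3)]
KERNEL_SCALES = [10.0 ** k for k in range(-2, 3)]
```

Returning the full table of results, rather than only the best pair, makes it possible to show how sensitive the model is to each parameter, which supports the justification of the final choice.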

Your answer should explain the choice of parameters in the final model, and discuss its performance.

Problem 3. Dimension Reduction and Classification. Recognising content in images can be a challenging problem due to the high dimensional nature of the input data. As such, dimension reduction methods can be used to reduce a problem space and make tasks more computationally feasible.

You have been provided with data (Q3/shvn test.mat) that shows images of single digits (0, 1, 2, 3, 4, 5, 6, 7, 8 and 9) of house numbers, extracted from Google street view data. Using this data you are to train classifiers (the type of classifier is up to you) to classify the observed digit in the image. Prior to classification, you are to reduce the data using:

PCA;
LDA;

i.e. you should train two classifiers: one using data reduced using PCA, one using data reduced using LDA. You are then to evaluate the two classifiers and compare their performance.

In completing this question you should:

Divide the data into appropriate training, validation and testing datasets;
Consider what type of classifier to use;
Determine an appropriate number of dimensions to retain.
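
For PCA, one common heuristic for the last point is to retain the smallest number of components that explains a chosen share of the total variance. The sketch below assumes the eigenvalues of the data covariance matrix have already been computed; the function name and the 95% default are assumptions, and the share of variance explained is only one possible criterion (validation accuracy is another).

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share >= target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)
```

For example, `components_for_variance([5.0, 3.0, 1.0, 0.5, 0.5], target=0.85)` returns 3, since the first three eigenvalues account for 90% of the total. For LDA the choice is more constrained: with ten digit classes, at most nine discriminant directions are available.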

Also note that due to memory constraints, it may not be possible to train the PCA or LDA space on all samples, and you may need to use only a subset of the data to compute the PCA and LDA transforms.

Your answer should explain the choices made in arriving at your solution (type of classifier, number of dimensions retained, any other parameters, etc.), and discuss the performance of the two methods, relating this to what the two transforms (PCA and LDA) are seeking to achieve.
