Data Mining

Group Project

This page will be updated as the project progresses throughout the semester.
The next two milestones are the presentation and final report:

Presentation: be ready to present by Tuesday, November 28th, in case you present on the first day. See the link in the Timeline below for more details.

Final report: the final report is due Monday, December 4th. See the link in the Timeline below for more details. Submission instructions are posted on Brightspace as an assignment.

Goal

The purpose of the group project is to get hands-on experience in applying machine learning algorithms to interesting or important problems.

Timeline

Milestone	Due date
Initial proposal	Friday, September 22nd
Group formation	Friday, September 29th
Formal proposal	Monday, October 16th
Progress report	Monday, November 13th
Presentation	November 28th and December 1st
Final report	Monday, December 4th

How Groups Work

The course project will be done in groups of 4-6 students. Undergraduates team up with undergraduates, and graduate students team up with graduate students. This is to ensure that graduate students select more ambitious projects. Undergraduate projects should still be moderately ambitious, and highly ambitious ones are certainly welcome if they are reasonably feasible.

Each student is expected to be involved with some machine learning component of the project. If, for instance, a project is to get movie ratings data and then apply some fancy machine learning method to it, it is not OK if one person only collects the data. Collecting data is not machine learning (unless, of course, you're using machine learning itself to collect the data!).

Some general advice about the project

You should ultimately be running machine learning methods on some real data and analyzing the results. A slight exception is a project with a significant theory component, but even then you should at least try synthetic data that is designed to provide understanding.
Graduate students are expected to propose more sophisticated projects. This has purposely been left vague. Roughly speaking, a sophisticated project should involve at least a little bit of theory, should be a bit ambitious, and should involve more creativity.
If you are totally lost about what project to propose, you might want to check out recent proceedings of NeurIPS (formerly called NIPS) or ICML. Consider implementing a recent paper and see if you can replicate its results. This may lead to interesting new questions that you can also explore.

Here are some project ideas to help get you started thinking

Predicting stock prices using news articles
Sentiment analysis: classifying text (like tweets, or reviews) as expressing positive or negative emotions
Predicting ratings (for a restaurant, product, movie, etc.) using collaborative filtering, content filtering, or a combination of the two
Exploring a prediction problem where ethics and social implications (including fairness and privacy) are a concern - this is a great talk to get started thinking about this
Design a reinforcement learning agent for some game
Compare EM-based algorithms to recent, provably efficient and consistent methods for a problem like learning mixture models
Empirically evaluate recent advanced algorithms for the k-means problem, comparing to the EM approach
Empirically evaluate some recent state-of-the-art algorithms for learning deep neural networks
An in-depth study of some regularization method (like dropout or batch normalization) for learning deep neural networks, which tries to provide an understanding of why the method works and ideally experiments with novel variations of the method
Explore various unsupervised dimension reduction techniques (e.g. PCA, ICA, sparse coding) and ideally also develop a new technique. Compare the results on some problem(s)
Predict whether a NeurIPS paper is a poster or a spotlight/oral based on the public reviews (reviews have been public for several years)
A thorough empirical evaluation of some recent ICA algorithms on some problems, or develop your own ICA algorithm and evaluate it
Investigate efficient algorithms for approximate PCA for the purpose of large scale machine learning
Investigate unsupervised dimension reduction methods vs supervised dimension reduction methods for some application problem(s)
Compare frequentist and Bayesian methods for clustering
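
For the clustering-related ideas above, a first experiment on synthetic data might look something like the sketch below. It is only an illustration, not a required starting point: it uses the Python standard library, generates points from three Gaussian blobs with known centers, and runs a basic version of Lloyd's k-means algorithm with random restarts. All function names (make_blobs, kmeans, kmeans_cost, and so on) and all parameter values are hypothetical choices made for this example.

```python
import random


def make_blobs(centers, n_per_cluster, spread, rng):
    """Sample 2-D points from isotropic Gaussians around known centers."""
    points = []
    for cx, cy in centers:
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points, k, rng, n_iters=100):
    """Lloyd's algorithm, initialized with k data points sampled without replacement."""
    centroids = rng.sample(points, k)
    for _ in range(n_iters):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    # Recompute the assignment against the centroids that are actually returned.
    assign = [nearest(p, centroids) for p in points]
    return centroids, assign


def kmeans_cost(points, centroids, assign):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the experiment is repeatable
    true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # assumed ground truth
    points = make_blobs(true_centers, 50, 1.0, rng)
    # Random restarts: keep the run with the lowest cost.
    runs = [kmeans(points, 3, rng) for _ in range(10)]
    centroids, assign = min(runs, key=lambda r: kmeans_cost(points, r[0], r[1]))
    print("recovered centroids:", centroids)
    print("cost:", kmeans_cost(points, centroids, assign))
```

Because the true centers are known, the recovered centroids can be compared against them directly, which is the kind of understanding that synthetic data is meant to provide. A single run of Lloyd's algorithm can stop at a poor local optimum, which is why the sketch keeps the best of several restarts. A real project would go well beyond this, for example by comparing against other algorithms and then moving to real data.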