Data Mining
Group Project
This page will be updated as the project progresses throughout the semester.
The next two milestones are the presentation and final report:
- Presentation: be ready to present by Tuesday, November 28th, in case your group is scheduled for the first day. See the link in the Timeline below for more details.
- Final report: the final report is due Monday, December 4th. See the link in the Timeline below for more details. The instructions for what to submit are on Brightspace, in the form of an assignment.
Goal
The purpose of the group project is to get hands-on experience in applying machine learning algorithms to interesting or important problems.
Timeline
How Groups Work
The course project will be in groups of 4-6 students.
Undergraduates team up with undergraduates, and graduate students team up with graduate students. This is to ensure that graduate students select more ambitious projects. Undergraduate projects should still be moderately ambitious, and highly ambitious ones are certainly welcome if they are reasonably feasible.
Each student is expected to be involved with some machine learning component of the project. If, for instance, a project is to get movie ratings data and then apply some fancy machine learning method to it, it is not OK if one person only collects the data. Collecting data is not machine learning (unless, of course, you're using machine learning itself to collect the data!).
Some general advice about the project
- You should ultimately be running machine learning methods on some real data and analyzing the results. A slight exception is if there is a significant theory component, but even then you should still at least try synthetic data that is designed to provide understanding.
- Graduate students are expected to propose more sophisticated projects. This has purposely been left vague. Roughly speaking, a sophisticated project should involve at least a little bit of theory, should be a bit ambitious, and should involve more creativity.
- If you are totally lost with regard to proposing a project, you might want to check out recent proceedings of NeurIPS (formerly called NIPS) or ICML. Consider implementing a recent paper and see if you can replicate its results. This may lead to interesting new questions that you can also explore.
Here are some project ideas to help you get started:
- Predicting stock prices using news articles
- Sentiment analysis: classifying text (like tweets, or reviews) as expressing positive or negative emotions
- Predicting ratings (for a restaurant, product, movie, etc.) using collaborative filtering, content filtering, or a combination of the two
- Exploring a prediction problem where ethics and social implications (including fairness and privacy) are a concern - this is a great talk to get started thinking about this
- Design a reinforcement learning agent for some game
- Compare EM-based algorithms to recent, provably efficient and consistent methods for a problem like learning mixture models
- Empirically evaluate recent advanced algorithms for the k-means problem, comparing to the EM approach
- Empirically evaluate some recent state-of-the-art algorithms for learning deep neural networks
- An in-depth study of some regularization method (like dropout or batch normalization) for learning deep neural networks which tries to provide an understanding for why the method works and ideally experiments with novel variations of the method
- Explore various unsupervised dimension reduction techniques (e.g. PCA, ICA, sparse coding) and ideally also develop a new technique. Compare the results on some problem(s)
- Predict whether a NeurIPS paper is a poster or a spotlight/oral based on the public reviews (reviews have been public for several years)
- A thorough empirical evaluation of some recent ICA algorithms on some problems, or develop your own ICA algorithm and evaluate it
- Investigate efficient algorithms for approximate PCA for the purpose of large scale machine learning
- Investigate unsupervised dimension reduction methods vs supervised dimension reduction methods for some application problem(s)
- Compare frequentist and Bayesian methods for clustering
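For the empirical-evaluation ideas above, one way to start is with a simple baseline on synthetic data where the true structure is known. The following is a minimal sketch of such a baseline, assuming plain Lloyd's k-means with random restarts on 2-D Gaussian blobs. The function names, blob centers, and parameter values are illustrative assumptions and are not part of the course materials; a real project would compare more advanced methods against a baseline like this on real data.

```python
import random


def make_blobs(centers, n_per_center, spread, rng):
    """Draw n_per_center 2-D Gaussian points around each (hypothetical) center."""
    points = []
    for cx, cy in centers:
        for _ in range(n_per_center):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, n_iters, rng):
    """Lloyd's algorithm, initialized with k distinct data points chosen at random."""
    centroids = rng.sample(points, k)
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep the old centroid if its cluster became empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


def kmeans_cost(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def best_of_restarts(points, k, n_restarts, n_iters, seed):
    """Run k-means n_restarts times and keep the lowest-cost solution."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centroids = kmeans(points, k, n_iters, rng)
        cost = kmeans_cost(points, centroids)
        if best is None or cost < best[0]:
            best = (cost, centroids)
    return best


if __name__ == "__main__":
    data_rng = random.Random(0)
    data = make_blobs([(0, 0), (10, 0), (0, 10)], 50, 0.5, data_rng)
    cost, centroids = best_of_restarts(data, k=3, n_restarts=10, n_iters=100, seed=1)
    print("k-means cost:", cost)
    print("centroids:", centroids)
```

Because the synthetic centers are known, the recovered centroids and the k-means cost can be checked against the ground truth before moving on to real data and to the more sophisticated algorithms being evaluated.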