Big data assignment question

FINAL PROJECT – PHASE I

The final project is an essential part of this class. It will allow you to demonstrate your Big Data skills and create something that you are proud of. It can also be a valuable addition to your project portfolio that you can show to prospective employers.

In the first phase of the project you will perform the following:

  • Choose a topic that is related to Big Data and involves substantial design, analysis, programming, and validation.
  • Make sure that data is available. State the source of the data and its key features, such as its size.
  • Choose your team members.
  • Do preliminary research on your chosen topic, write an analysis document, and list the future steps.
  • Note that the projects are only briefly described below. It is up to you to define the specific design and to limit or expand the scope. Remember, you will be graded on your effort and on the complexity of the project.

Project Ideas

Below are some suggested project ideas. You must choose one of the projects listed below.

  1. Real-time news sentiment analysis using live data from sources, such as:
  • Google News, NYT API
  • Social media sources → Twitter API, StockTwits

Ideas:

  • Track the sentiment toward a stock over time and correlate it with the stock price.
  • Find the sentiment toward a company that suffered a cyber security breach/attack. Create a vocabulary of terms associated with cyber breaches/attacks and correlate it with the stock price.
  2. Build a recommender system on DBLP’s conference/publications dataset:

http://dblp.uni-trier.de/xml/

Ideas:

  • DBLP stores details of authors' publications in journals and conferences.
  • You will build a recommender system on author-conference, author-journal, author-title keyword, and author-author relationships. That is, recommend how likely an author is to publish at a conference, etc.
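As a starting point for the author-conference case, here is a minimal single-machine sketch of one possible approach: user-based collaborative filtering with cosine similarity. The records, author names, and function names are hypothetical; real records would be parsed from the DBLP XML dump, and a full project would move this logic to a Big Data framework such as Spark.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (author, conference) publication records; real ones would be
# parsed from the DBLP XML dump.
records = [
    ("alice", "KDD"), ("alice", "KDD"), ("alice", "ICDM"),
    ("bob", "KDD"), ("bob", "ICDM"), ("bob", "VLDB"),
    ("carol", "SIGMOD"), ("carol", "VLDB"),
]

def build_counts(records):
    """Map each author to {conference: number of papers}."""
    counts = defaultdict(lambda: defaultdict(int))
    for author, conf in records:
        counts[author][conf] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(author, counts, top_n=3):
    """Score conferences the author has not published at yet, weighting
    other authors' publication counts by their similarity to `author`."""
    scores = defaultdict(float)
    for other, vec in counts.items():
        if other == author:
            continue
        sim = cosine(counts[author], vec)
        for conf, n in vec.items():
            if conf not in counts[author]:
                scores[conf] += sim * n
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

counts = build_counts(records)
print(recommend("alice", counts))
```

In this toy data, "bob" shares two venues with "alice", so his remaining venue ranks first for her. The same scoring scheme could be applied to author-journal or author-author pairs by changing what the records contain.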

  3. Find clusters in the DBLP conference/publications dataset:

http://dblp.uni-trier.de/xml/

Clusters can represent authors with similar interests or authors that publish at the same conference/journal. You can use a text mining strategy for that.
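One simple text mining strategy, shown here only as a sketch, is to describe each author by the keywords in their paper titles and to link authors whose keyword sets overlap enough. The sample titles, the stopword list, and the 0.2 threshold are all assumptions for illustration; the grouping step is single-link clustering via union-find, not k-means or another model-based method.

```python
import re
from itertools import combinations

# Hypothetical author -> paper titles; real data would come from the DBLP XML dump.
titles_by_author = {
    "alice": ["Scalable graph mining on Spark", "Graph partitioning for MapReduce"],
    "bob": ["Distributed graph mining", "Spark graph processing at scale"],
    "carol": ["Deep learning for protein folding"],
    "dave": ["Protein structure prediction with deep learning"],
}
STOPWORDS = {"on", "for", "at", "with", "the", "of", "a", "and"}

def keywords(titles):
    """Lowercased set of non-stopword tokens across all of an author's titles."""
    words = re.findall(r"[a-z]+", " ".join(titles).lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(titles_by_author, threshold=0.2):
    """Single-link clustering: authors whose keyword sets have Jaccard
    similarity >= threshold end up in the same cluster (union-find)."""
    profile = {a: keywords(t) for a, t in titles_by_author.items()}
    parent = {a: a for a in profile}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(profile, 2):
        if jaccard(profile[a], profile[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for a in profile:
        groups.setdefault(find(a), set()).add(a)
    return sorted(sorted(g) for g in groups.values())

print(cluster(titles_by_author))
```

Comparing every pair of authors is quadratic, so on the full DBLP dataset you would need a scalable alternative; this sketch only illustrates the idea on a handful of authors.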

  4. [Bioinformatics] Create a Big Data approach for next-generation sequence comparison and analysis.

References: http://bioinformatics.oxfordjournals.org/content/early/2013/10/01/bioinformatics.btt528.full

http://www.osti.gov/scitech/servlets/purl/1050659

  5. Take part in one of the active Kaggle competitions that involves significant use of Big Data technologies:

https://www.kaggle.com/competitions

* You can propose which competition you would like to take part in, but it will need to be approved by the instructor *

  6. Take part in one of the KDD Cup challenges:

http://www.kdd.org/kdd-cup

* You can propose which cup you would like to take part in, but it will need to be approved by the instructor *

  7. Take part in one of the DrivenData competitions: https://www.drivendata.org/competitions/

* You can propose which competition you would like to take part in, but it will need to be approved by the instructor *

  8. [Hadoop cluster creation and performance analysis] Create your own high-performance Hadoop cluster using community hardware (such as old laptops, desktops, etc.). After creating the cluster, you should compare its performance against the UTD cluster, Amazon Web Services, Microsoft Azure, and Databricks clusters.

This should involve significant effort in setting up the cluster and performance evaluation.

  9. [PageRank and TF-IDF implementation for webpages from the CS department at UTD] In this project, you will first create a web crawler that downloads all the webpages from the CS department at UTD. After that, you will run the PageRank algorithm on them and find the top-k webpages in terms of PageRank. Secondly, you will create a TF-IDF inverted index (not just TF, but entire TF-IDF values) for nontrivial words. This inverted index will be used for running complex queries involving logical operators, such as AND, OR, and NOT.
  10. Build a recommendation or clustering system using either one of the following:
  • IMDB movie dataset http://www.imdb.com/interfaces
  • Large Movie Review Dataset

http://ai.stanford.edu/~amaas/data/sentiment/
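For the PageRank and TF-IDF project above, the two core computations can be sketched on a single machine as follows. The three-page `pages` dictionary is a hypothetical stand-in for crawler output, the damping factor of 0.85 and the `idf = log(N / df)` weighting are common choices rather than requirements, and stopword filtering for "nontrivial words" is left out to keep the sketch short.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical crawl result: page -> (outgoing links, page text).
pages = {
    "home":     (["research", "courses"], "computer science department home"),
    "research": (["home"],                "big data research in computer science"),
    "courses":  (["home", "research"],    "big data courses and database courses"),
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; assumes every page has at least one outgoing link."""
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in links}
        for page, outs in links.items():
            share = damping * rank[page] / len(outs)
            for target in outs:
                new[target] += share
        rank = new
    return rank

def build_index(docs):
    """Inverted index: term -> {page: tf * idf}, with idf = log(N / df)."""
    tf = {p: Counter(re.findall(r"[a-z]+", text.lower())) for p, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = defaultdict(dict)
    for page, counts in tf.items():
        for term, count in counts.items():
            index[term][page] = count * math.log(len(docs) / df[term])
    return index

links = {p: outs for p, (outs, _) in pages.items()}
docs = {p: text for p, (_, text) in pages.items()}
ranks = pagerank(links)
index = build_index(docs)

def postings(term):
    """Set of pages that contain the term."""
    return set(index.get(term, {}))

# Boolean queries map onto set operations over the posting lists.
print(sorted(ranks, key=ranks.get, reverse=True))       # pages by PageRank
print(postings("big") & postings("data"))               # big AND data
print(postings("database") | postings("department"))    # database OR department
print(postings("computer") - postings("research"))      # computer AND NOT research
```

A real crawl will contain pages with no outgoing links, which this sketch does not handle; you would need to decide how to redistribute their rank. The stored TF-IDF values can then be used to order the pages that a boolean query returns.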

Phase I requirements

For this first phase, you are to do the following:

  • Choose a topic.

The project should involve solving a medium- or large-sized problem using Big Data technologies, such as Spark, MapReduce, Pig, Hive, etc.

  • Find data that you will use.

Find the data and describe its characteristics, such as the number of instances, the attributes, and the distribution of the data.

Evaluate the size of the data and the pre-processing requirements.

Submit a snapshot of the data.
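A quick way to gather these characteristics is to profile a sample of the data before committing to it. The sketch below assumes a CSV sample with a header row; the column names and values are made up, and the whole sample is loaded into memory, so it is meant for a snapshot rather than the full dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; in practice, open a file holding the first few
# thousand rows of your dataset instead of this in-memory string.
sample = io.StringIO(
    "id,source,label\n"
    "1,twitter,positive\n"
    "2,twitter,negative\n"
    "3,nyt,positive\n"
    "4,stocktwits,\n"
)

def profile(handle, target):
    """Return instance count, attribute names, distribution of `target`,
    and per-attribute counts of missing (empty) values."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    missing = Counter(k for row in rows for k, v in row.items() if v == "")
    return {
        "instances": len(rows),
        "attributes": reader.fieldnames,
        "distribution": Counter(row[target] for row in rows if row[target]),
        "missing": dict(missing),
    }

summary = profile(sample, target="label")
print(summary)
```

The missing-value counts give a first indication of how much pre-processing the data will need.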

  • Propose a preliminary solution and workflow.

This would involve sketching out some ideas on how you will solve the problem. You could also indicate the workflow for the entire project.

  • Indicate your hypothesis and what you wish to accomplish.

For example, you could say "We wish to prove that a stock's price is strongly correlated with its sentiment on Twitter" or "We wish to solve this competition using Big Data technologies and the xyz algorithm and come up with a more accurate solution."
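A hypothesis like the first one should be stated in a form you can measure. One option, sketched here under assumptions, is the Pearson correlation coefficient between a daily sentiment score and the daily stock return. The two series below are invented for illustration; real values would come from your sentiment model and a price feed.

```python
from math import sqrt

# Hypothetical daily values; real ones would come from a sentiment model
# applied to tweets and from a price feed.
daily_sentiment = [0.10, 0.35, -0.20, 0.50, -0.40]
daily_return = [0.004, 0.012, -0.006, 0.015, -0.011]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(daily_sentiment, daily_return)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. Your report should state which threshold you would treat as "strongly correlated".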

Include all of the above in your report for phase I.