ECE656 assignment 1

  1. If you have your own computer, install a copy of MySQL on it. If you do not have your own computer, let me know and I will provide access to necessary equipment.
  2. Acquire the Yelp Challenge Dataset: https://www.yelp.com/dataset_challenge/dataset
  3. The Yelp dataset consists of 5 entities: business, review, user, check-in, and tip. The data is in JSON format.
    1. Define a relational schema for these entities, including table names, attributes and appropriate types, define primary and foreign keys, and indexes as necessary.
      1. You may wish to use more than five tables for your scheme; g., the user entity has an attribute “friends” which appears to be other Yelp users.
      2. You will need to justify your relational schema.
    2. Create a database, and all associated tables necessary per your relational schema.
    3. Write a script to convert the data from JSON format to a format suitable for loading into database (g., convert it to CSV).
    4. Load the data, timing your load operations, and verifying that it loads without errors or warnings. Note how many rows are loaded in each of the tables you define.
    5. Confirm the size of each table with various “SELECT count()” queries.
  4. Basic data analysis: write and execute SQL queries that will determine:
    1. The user(s) who has written the greatest number of reviews.
    2. The business(s) that has received the greatest number of reviews.
    3. The average number of reviews written by users.
      1. Note that the user entity maintains an “average stars” rating. However, that value should also be able to be computed from the actual reviews performed by each user. Determine if the recorded “average stars” rating is consistent with the computed value.
    4. The average number of reviews received by businesses.
    5. What fraction of businesses have received more than 10 reviews?
    6. What fraction of users have written more than 10 reviews?
    7. What is the average length of their reviews?
  5. Submit a report that describes your relational schema, all commands you used to create it, load the data, and query it to get the results for part (4). Provide all results for part 4.