COGS108 Assignment 3 Data Privacy
A3_DataPrivacy
1 COGS 108 - Assignment 3: Data Privacy
1.1 Important Reminders
• Do not change / update / delete any existing cells with ‘assert’ in them. These are the tests used to check your work.
– Changing these will be flagged for attempted cheating.
• Do not rename this file.
• This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted file.
– This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
1.2 Overview
We briefly discussed in lecture the importance and the mechanics of protecting individuals’ privacy when they are included in datasets.
One method to do so is the Safe Harbor Method. The Safe Harbor Method specifies how to protect individuals’ identities by telling us which information to remove from a dataset in order to avoid accidentally disclosing personal information.
In this assignment, we will explore web scraping, which can often include personally identifiable information; see how identity can be decoded from badly anonymized datasets; and use Safe Harbor to anonymize datasets properly.
The topics covered in this assignment are mainly covered in the ‘DataGathering’ and ‘DataPrivacy&Anonymization’ Tutorial notebooks.
1.3 Part 1: Web Scraping
1.3.1 Scraping Rules
1) If you are using another organization’s website for scraping, make sure to check the website’s terms & conditions.
2) Do not request data from the website too aggressively (quickly) with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e., acts like a human). One request for one webpage per second is good practice.
3) The layout of a website may change from time to time. Because of this, if you’re scraping a website, make sure to revisit the site and rewrite your code as needed.
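Rule 2 above can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: fetch_politely and fake_fetch are hypothetical names, and fake_fetch stands in for a real downloader (such as requests.get) so that the sketch runs without touching the network. The delay is shortened here; one second is the value suggested above.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results

# Hypothetical stand-in for a real downloader; it makes no network call.
def fake_fetch(url):
    return "<html>" + url + "</html>"

# example.com URLs are placeholders, and the delay is shortened for the demo.
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.01,
)
print(len(pages))
```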
1.3.2 1a) Web Scrape
We will first retrieve the contents on a page and examine them a bit.
Make a variable called wiki, that stores the following URL (as a string):
https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population
Now, to open the URL, use requests.get() and provide wiki as its input. Store this in a variable called page.
After that, make a variable called soup to parse the HTML using BeautifulSoup. Note that BeautifulSoup needs the content of the page, so you will need to get the content from page (for example, via page.content) and pass it in.
1.3.3 1b) Checking Scrape Contents
Extract the title from the page and save it in a variable called title_page.
Make sure you extract it as a string.
To do so, you have to use the soup object created in the above cell. Hint: from your soup variable, you can access this with .title.string.
Make sure you print out and check the contents of title_page.
Note that it should not have any tags (such as <title>) included in it.
List of states and territories of the United States by population - Wikipedia
1.3.4 1c) Extracting Tables
In order to extract the data we want, we’ll start with extracting a data table of interest.
Note that you can see this table by going to look at the link we scraped.
Use the soup object and call a method called find, which will find and extract the first matching table in the scraped webpage. Store this in the variable right_table.
Note: you need to search for the name table, and set the class_ argument as wikitable.
Now, extract the data from the table into lists.
Note: This code is provided for you. Do read through it and try to see how it works.
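The parsing steps in 1a to 1c can be tried on a small static HTML string, which avoids any network request. The sketch below assumes the bs4 package is installed; the HTML, the table contents, and the names and populations lists are made up for illustration and are not the assignment's provided extraction code.

```python
from bs4 import BeautifulSoup

# Toy HTML (hypothetical), with one table that should be skipped.
html = """
<html><head><title>Toy page - Example</title></head>
<body>
<table class="plain"><tr><td>ignore me</td></tr></table>
<table class="wikitable">
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1,000</td></tr>
  <tr><td>Beta</td><td>2,000</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .title.string gives the text inside <title>, without the tags.
title_page = str(soup.title.string)

# find returns the first <table> whose class is "wikitable".
right_table = soup.find("table", class_="wikitable")

names, populations = [], []
for row in right_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 2:  # the header row has <th> cells only, so it is skipped
        names.append(cells[0].get_text(strip=True))
        populations.append(cells[1].get_text(strip=True))

print(title_page, names, populations)
```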
1.3.5 1d) Collecting into a dataframe
Create a dataframe my_df and add the data from the lists above to it.
- lst_a is the state or territory name. Set the column name as State, and make this the index.
- lst_b is the population estimate. Add it to the dataframe, and set the column name as Population Estimate.
- lst_c is the census population. Add it to the dataframe, and set the column name as Census Population.
my_df = pd.DataFrame({'State': lst_a, 'Population Estimate': lst_b, 'Census …
[9]:
State Population Estimate Census Population
California 39,557,045\n 37,254,523\n
Texas 28,701,845\n 25,145,561\n
Florida 21,299,325\n 18,801,310\n
New York 19,542,209\n 19,378,102\n
Pennsylvania 12,807,060\n 12,702,379\n
Illinois 12,741,080\n 12,830,632\n
Ohio 11,689,442\n 11,536,504\n
Georgia 10,519,475\n 9,687,653\n
North Carolina 10,383,620\n 9,535,483\n
Michigan 9,995,915\n 9,883,640\n
New Jersey 8,908,520\n 8,791,894\n
Virginia 8,517,685\n 8,001,024\n
Washington 7,535,591\n 6,724,540\n
Arizona 7,171,646\n 6,392,017\n
Massachusetts 6,902,149\n 6,547,629\n
Tennessee 6,770,010\n 6,346,105\n
Indiana 6,691,878\n 6,483,802\n
Missouri 6,126,452\n 5,988,927\n
Maryland 6,042,718\n 5,773,552\n
Wisconsin 5,813,568\n 5,686,986\n
Colorado 5,695,564\n 5,029,196\n
Minnesota 5,611,179\n 5,303,925\n
South Carolina 5,084,127\n 4,625,364\n
Alabama 4,887,871\n 4,779,736\n
Louisiana 4,659,978\n 4,533,372\n
Kentucky 4,468,402\n 4,339,367\n
Oregon 4,190,713\n 3,831,074\n
Oklahoma 3,943,079\n 3,751,351\n
Connecticut 3,572,665\n 3,574,097\n
Puerto Rico 3,195,153\n 3,725,789\n
Utah 3,161,105\n 2,763,885\n
Iowa 3,156,145\n 3,046,355\n
Nevada 3,034,392\n 2,700,551\n
Arkansas 3,013,825\n 2,915,918\n
Mississippi 2,986,530\n 2,967,297\n
Kansas 2,911,505\n 2,853,118\n
New Mexico 2,095,428\n 2,059,179\n
Nebraska 1,929,268\n 1,826,341\n
West Virginia 1,805,832\n 1,852,994\n
Idaho 1,754,208\n 1,567,582\n
Hawaii 1,420,491\n 1,360,301\n
New Hampshire 1,356,458\n 1,316,470\n
Maine 1,338,404\n 1,328,361\n
Montana 1,062,305\n 989,415\n
Rhode Island 1,057,315\n 1,052,567\n
Delaware 967,171\n 897,934\n
South Dakota 882,235\n 814,180\n
North Dakota 760,077\n 672,591\n
Alaska 737,438\n 710,231\n
District of Columbia 702,455\n 601,723\n
Vermont 626,299\n 625,741\n
Wyoming 577,737\n 563,626\n
Guam 165,718\n 159,358
U.S. Virgin Islands 104,914\n 106,405
American Samoa 55,641\n 55,519
Northern Mariana Islands 55,194\n 53,883
Contiguous United States 325,009,505\n 306,675,006\n
[10]: assert isinstance(my_df, pd.DataFrame)
assert my_df.index.name == 'State'
assert list(my_df.columns) == ['Population Estimate', 'Census Population']
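One possible way to get a dataframe with this shape is to build it from a dictionary of lists and then move one column into the index. The sketch below uses toy lists with hypothetical values in place of lst_a, lst_b and lst_c.

```python
import pandas as pd

# Toy lists (hypothetical values), standing in for lst_a, lst_b and lst_c.
names = ["Alpha", "Beta"]
estimates = ["1,000\n", "2,000\n"]
census = ["900\n", "1,800\n"]

toy_df = pd.DataFrame({
    "State": names,
    "Population Estimate": estimates,
    "Census Population": census,
}).set_index("State")  # State becomes the index rather than a column

print(toy_df)
```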
1.3.6 1e) Using the data
What is the Population Estimate of Texas? Save this answer to a variable called texas_pop.
Notes:
- Extract this value programmatically from your dataframe (as in, don’t set it explicitly, as cf = 123).
- You can use .loc to extract a particular value from a dataframe.
- The data in your dataframe will be strings. That’s fine; leave them as strings (don’t typecast).
[11]: '28,701,845\n'
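As a minimal illustration of the .loc lookup mentioned in the notes, the sketch below selects one cell by row label and column name. The dataframe and its values are toy data, not the scraped table.

```python
import pandas as pd

# Toy dataframe (hypothetical values) indexed by state name.
toy_df = pd.DataFrame(
    {"Population Estimate": ["1,000\n", "2,000\n"]},
    index=pd.Index(["Alpha", "Beta"], name="State"),
)

# .loc[row_label, column_name] returns the single value in that cell.
beta_pop = toy_df.loc["Beta", "Population Estimate"]
print(repr(beta_pop))
```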
1.4 Part 2: Identifying Data
Data Files: - anon_user_dat.json - employee_info.json
You will first be working with a file called ‘anon_user_dat.json’. This file contains information about some (fake) Tinder users. When creating an account, each Tinder user was asked to provide their first name, last name, work email (to verify the disclosed workplace), age, gender, phone # and zip code. Before releasing this data, a data scientist cleaned the data to protect the privacy of Tinder’s users by removing the obvious personal identifiers: phone #, zip code, and IP address. However, the data scientist chose to keep each user’s email address because, when they visually skimmed a couple of the email addresses, none of them seemed to contain any of the users’ actual names. This is where the data scientist made a huge mistake!
We will take advantage of having the work email addresses by finding the employee information of different companies and matching that employee information with the information we have, in order to identify the names of the secret Tinder users!
1.4.1 2a) Load in the ‘cleaned’ data
Load the anon_user_dat.json json file into a pandas dataframe. Call it df_personal.
[13]: age email gender
0 60 gshoreson0@seattletimes.com Male
1 47 eweaben1@salon.com Female
2 27 akillerby2@gravatar.com Male
3 46 gsainz3@zdnet.com Male
4 72 bdanilewicz4@4shared.com Male
.. … … …
995 3 pstroulgerrn@time.com Female
996 49 kbasnettro@seattletimes.com Female
997 75 pmortlockrp@liveinternet.ru Male
998 81 sphetterq@toplist.cz Male
999 70 jtyresrr@slashdot.org Male
[1000 rows x 3 columns]
1.4.2 2b) Check the first 10 emails
Save the first 10 emails to a Series, and call it sample_emails. You should then print out this Series.
The purpose of this is to get a sense of how these work emails are structured and how we could possibly extract where each anonymous user seems to work.
[15]: 0 gshoreson0@seattletimes.com
4 bdanilewicz4@4shared.com
5 sdeerness5@wikispaces.com
6 jstillwell6@ustream.tv
8 nerickssen8@hatena.ne.jp
9 hparsell9@xing.com
Name: email, dtype: object
1.4.3 2c) Extract the Company Name From the Email
Create a function with the following specifications:
- Function Name: extract_company
- Purpose: to extract the company of the email (i.e., everything after the @ sign but before the first . that follows it)
- Parameter(s): email (string)
- Returns: The extracted part of the email (string)
- Hint: This should take 1 line of code. Look into the find('') method.
You can start with this outline:
def extract_company(email):
    return
Example Usage:
- extract_company("larhe@uber.com") should return "uber"
- extract_company("ds@cogs.edu") should return "cogs"
[18]: assert extract_company("gshoreson0@seattletimes.com") == "seattletimes"
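A minimal sketch of such a function, built on str.find, is shown below. It assumes a well-formed address containing an @ followed by at least one dot, and it is split over two lines for readability rather than written as the one-liner the hint suggests.

```python
def extract_company(email):
    """Return the text between the '@' and the first '.' that follows it."""
    at = email.find("@")
    # Searching for '.' from position `at` ignores any dots before the '@'.
    return email[at + 1:email.find(".", at)]

print(extract_company("larhe@uber.com"))
print(extract_company("nerickssen8@hatena.ne.jp"))
```

Starting the dot search at the @ position matters for addresses whose local part contains a dot, and for multi-part domains it keeps only the first label.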
With a little bit of basic sleuthing (aka googling) and web-scraping (aka selectively reading in html code) it turns out that you’ve been able to collect information about all the present employees/interns of the companies you are interested in. Specifically, on each company website, you have found the name, gender, and age of its employees. You have saved that info in employee_info.json and plan to see if, using this new information, you can match the Tinder accounts to actual names.
1.4.4 2d) Load in employee data
Load the json file into a pandas dataframe. Call it df_employee.
[19]: company first_name last_name gender age
0 123-reg Inglebert Falconer Male 42
1 163 Rafael Bedenham Male 14
2 163 Lemuel Lind Male 31
3 163 Penny Pennone Female 45
4 163 Elva Crighton Female 52
.. … … … … …
995 zdnet Guido Comfort Male 46
996 zdnet Biron Malkinson Male 48
997 zimbio Becka Waryk Female 27
998 zimbio Andreana Ladewig Female 34
999 zimbio Jobyna Busek Female 75
[1000 rows x 5 columns]
1.4.5 2e) Match the employee name with company, age, gender
Create a function with the following specifications:
- Function name: employee_matcher
- Purpose: to match the employee name with the provided company, age, and gender
- Parameter(s): company (string), age (int), gender (string)
- Returns: The employee first_name and last_name like this: return first_name, last_name
- Note: If there are multiple employees that fit the same description, first_name and last_name should each be a list of all possible first names and last names, i.e., [‘Desmund’, ‘Kelby’], [‘Shepley’, ‘Tichner’]. Note that the names of the individuals that would produce this output are ‘Desmund Shepley’ and ‘Kelby Tichner’.
Hint: There are many different ways to code this. An inelegant solution is to loop through df_employee and for each data item see if the company, age, and gender match, i.e.,
for i in range(0, len(df_employee)):
    if (company == df_employee.loc[i, 'company']):
However! The solution above is very inefficient and long, so you should try to look into this: Google the df.loc method. It extracts the pieces of the dataframe that fulfill a certain condition,
i.e., df_employee.loc[df_employee['company'] == company]
If you need to convert your pandas data series into a list, you can do list(result), where result is a pandas “series”.
You can start with this outline:
[21]: (['Maxwell'], ['Jorio'])
[22]: assert employee_matcher("google", 41, "Male") == (['Maxwell'], ['Jorio'])
assert employee_matcher("salon", 47, "Female") == (['Elenore'], ['Gravett'])
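The boolean-mask approach hinted at above could look roughly like the sketch below. The toy_employees dataframe is hypothetical data, and the extra employees parameter is an assumption added so that the example is self-contained; the assignment's function takes only company, age and gender and reads df_employee.

```python
import pandas as pd

# Toy employee table (hypothetical rows) with the same columns as df_employee.
toy_employees = pd.DataFrame({
    "company": ["acme", "acme", "acme", "globex"],
    "first_name": ["Desmund", "Kelby", "Ann", "Bo"],
    "last_name": ["Shepley", "Tichner", "Lee", "Kim"],
    "gender": ["Male", "Male", "Female", "Male"],
    "age": [30, 30, 30, 41],
})

def employee_matcher(company, age, gender, employees=toy_employees):
    """Return lists of first and last names of all rows matching the description."""
    mask = (
        (employees["company"] == company)
        & (employees["age"] == age)
        & (employees["gender"] == gender)
    )
    matches = employees.loc[mask]
    return list(matches["first_name"]), list(matches["last_name"])

print(employee_matcher("acme", 30, "Male"))
```

Combining the three conditions with & filters the whole table in one step, so no explicit loop over rows is needed.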
1.4.6 2f) Extract all the private data
• Create 2 empty lists called first_names and last_names
• Loop through all the people we are trying to identify in df_personal
• Call the extract_company function (i.e., extract_company(df_personal.loc[i, 'email']) )
• Call the employee_matcher function
• Append the results of employee_matcher to the appropriate lists (first_names and last_names)
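The steps above can be sketched end to end on toy data. Everything in this example is an assumption made for illustration: the two small dataframes stand in for df_personal and df_employee, and the two helper functions are one possible implementation of the functions described in 2c and 2e.

```python
import pandas as pd

# Toy stand-ins for df_personal and df_employee (all values hypothetical).
people = pd.DataFrame({
    "age": [30, 41],
    "email": ["x1@acme.com", "y2@globex.org"],
    "gender": ["Female", "Male"],
})
employees = pd.DataFrame({
    "company": ["acme", "acme", "globex"],
    "first_name": ["Ann", "Kelby", "Bo"],
    "last_name": ["Lee", "Tichner", "Kim"],
    "gender": ["Female", "Male", "Male"],
    "age": [30, 30, 41],
})

def extract_company(email):
    at = email.find("@")
    return email[at + 1:email.find(".", at)]

def employee_matcher(company, age, gender):
    mask = (
        (employees["company"] == company)
        & (employees["age"] == age)
        & (employees["gender"] == gender)
    )
    matches = employees.loc[mask]
    return list(matches["first_name"]), list(matches["last_name"])

first_names, last_names = [], []
for i in range(len(people)):
    company = extract_company(people.loc[i, "email"])
    first, last = employee_matcher(company, people.loc[i, "age"], people.loc[i, "gender"])
    first_names.append(first)
    last_names.append(last)

print(first_names, last_names)
```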
1.4.7 2g) Add the names to the original ‘secure’ dataset!
We have done this last step for you below, all you need to do is run this cell.
For your own personal enjoyment, you should also print out the new df_personal with the identified people.
age email gender first_name last_name
0 60 gshoreson0@seattletimes.com Male [Gordon] [DelaField]
1 47 eweaben1@salon.com Female [Elenore] [Gravett]
2 27 akillerby2@gravatar.com Male [Abbe] [Stockdale]
3 46 gsainz3@zdnet.com Male [Guido] [Comfort]
4 72 bdanilewicz4@4shared.com Male [Brody] [Pinckard]
.. … … … … …
995 3 pstroulgerrn@time.com Female [Penelopa] [Roman]
996 49 kbasnettro@seattletimes.com Female [Anthiathia, Kandy] [Baldwin, Cossam]
997 75 pmortlockrp@liveinternet.ru Male [Paco] [Weatherburn]
998 81 sphetterq@toplist.cz Male [Sammy] [Dymick]
999 70 jtyresrr@slashdot.org Male [Josiah] [Ayshford]
[1000 rows x 5 columns]
We have now just discovered the ‘anonymous’ identities of all the registered Tinder users…awkward.
1.5 Part 3: Anonymize Data
You are hopefully now convinced that with some seemingly harmless data a hacker can pretty easily discover the identities of certain users. Thus, we will now clean the original Tinder
data ourselves according to the Safe Harbor Method in order to make sure that it has been properly cleaned…
1.5.1 3a) Load in personal data
Load the user_dat.csv file into a pandas dataframe. Call it df_users.
[27]: age email first_name gender last_name \
0 34 clilleymanlm@irs.gov Carly Female Duckels
1 87 parnecke9a@furl.net Prisca NaN Le Friec
2 60 ldankersley7j@mysql.com Lauree Female Meineking
3 47 kcattrollma@msn.com Karoly NaN Hoyles
4 85 rchestney60@dailymotion.com Rona Female St. Quentin
ip_address phone zip
0 229.46.197.198 (445)515-0719 70397
1 60.255.20.98 (962)747-5149 71965
2 65.148.56.18 (221)690-1264 47946
3 207.40.101.214 (203)282-1167 29063
4 177.12.128.156 (703)482-9159 68872
1.5.2 3b) Drop personal attributes
Following the Safe Harbor method, remove any columns from df_users that contain personal information.
Note that details on the Safe Harbor method are covered in the Tutorials.
0 34 Female 70397
1 87 NaN 71965
2 60 Female 47946
3 47 NaN 29063
4 85 Female 68872
.. … … …
945 57 Male 22812
946 23 Male 31522
947 33 Female 34219
948 47 Male 75153
949 57 Male 95666
[950 rows x 3 columns]
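Dropping columns can be done with DataFrame.drop. The sketch below uses a toy dataframe with the same column names as df_users but made-up values, and removes the columns that do not appear in the output above (age, gender and zip remain).

```python
import pandas as pd

# Toy stand-in for df_users (hypothetical values, same column names).
toy_users = pd.DataFrame({
    "age": [34, 95],
    "email": ["a@example.com", "b@example.com"],
    "first_name": ["Ann", "Bo"],
    "gender": ["Female", "Male"],
    "last_name": ["Lee", "Kim"],
    "ip_address": ["192.0.2.1", "192.0.2.2"],
    "phone": ["(555)010-0000", "(555)010-0001"],
    "zip": [70397, 1001],
})

# Columns treated as direct identifiers in this sketch.
identifying = ["email", "first_name", "last_name", "ip_address", "phone"]
toy_users = toy_users.drop(columns=identifying)

print(list(toy_users.columns))
```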
1.5.3 3c) Drop ages that are above 90
Safe Harbor rule C: drop all rows of df_users in which age is greater than 90.
0 34 Female 70397
1 87 NaN 71965
2 60 Female 47946
3 47 NaN 29063
4 85 Female 68872
.. … … …
945 57 Male 22812
946 23 Male 31522
947 33 Female 34219
948 47 Male 75153
949 57 Male 95666
[943 rows x 3 columns]
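Row filtering of this kind is usually written with a boolean condition on the column. A small sketch on toy data (hypothetical values) follows; note that an age of exactly 90 is kept.

```python
import pandas as pd

# Toy data (hypothetical values) with two ages above 90.
toy_users = pd.DataFrame({
    "age": [34, 90, 91, 102],
    "gender": ["Female", "Male", "Male", "Female"],
})

# Keep only the rows whose age is 90 or below.
toy_users = toy_users[toy_users["age"] <= 90]

print(toy_users)
```

The surviving rows keep their original index labels, which is why the printed output above still ends at index 949 while reporting 943 rows.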
1.5.4 3d) Load in zip code data
Load the zip_pop.csv file into a (different) pandas dataframe. Call it df_zip.
Note that the zip data should be read in as strings, not ints, as would be the default.
In read_csv, use the parameter dtype to specify to read zip as str, and population as int.
[33]: zip population
0 01001 16769
1 01002 29049
2 01003 10372
3 01005 5079
4 01007 14649
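To see why the dtype parameter matters, the sketch below reads a small in-memory CSV instead of zip_pop.csv. The CSV text reuses the first two rows shown above; io.StringIO is used only so that the example does not depend on a file.

```python
import io
import pandas as pd

# In-memory stand-in for zip_pop.csv, using the first two rows shown above.
csv_text = "zip,population\n01001,16769\n01002,29049\n"

# Reading zip as str preserves the leading zero; as an int it would become 1001.
toy_zip = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str, "population": int})

print(toy_zip)
```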
1.5.5 3e) Sort zipcodes into “Geographic Subdivision”
The Safe Harbor Method applies to “Geographic Subdivisions” as opposed to each zip code itself.
Geographic Subdivision: All areas which share the first 3 digits of a zip code
Count the total population for each geographic subdivision.
Warning: you have to be savvy with a dictionary here.
To understand how a dictionary works, check the section materials, use Google, and go to discussion sections!
Instructions:
- Create an empty dictionary: zip_dict = {}
- Loop through all the zip_codes in df_zip
- Create a dictionary key for the first 3 digits of a zip_code in zip_dict
- Continually add population counts to the key that contains the same first 3 digits of the zip code
To extract the population you will find this code useful:
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
To extract the first 3 digits of a zip_code you will find this code useful:
int(str(zip_code)[:3])
Note: this code may take some time (many seconds, up to a minute or two) to run.
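The accumulation pattern described in the instructions can be illustrated without pandas. In the sketch below, rows is a plain list of (zip, population) pairs: the first five are the rows shown in the df_zip output above, and the last one is a made-up entry added so that a second subdivision appears. Iterating over pairs directly is a simplification of the lookup suggested in the instructions, not a replacement for it.

```python
# (zip, population) pairs; the last pair is hypothetical.
rows = [
    ("01001", 16769),
    ("01002", 29049),
    ("01003", 10372),
    ("01005", 5079),
    ("01007", 14649),
    ("99901", 500),
]

zip_dict = {}
for zip_code, population in rows:
    key = int(str(zip_code)[:3])  # first 3 digits; "010" becomes the integer 10
    # .get(key, 0) starts a new subdivision at 0 the first time it is seen.
    zip_dict[key] = zip_dict.get(key, 0) + population

print(zip_dict)
```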
1.5.6 3f) Explain this code excerpt
In the cell below, explain in words what the following line of code is doing:
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
Note: you do not have to use this line of code at this point in the assignment.
It is one of the lines provided to you in 3e. Here, just write a quick comment on what it does. This question will not be graded, but it’s important to be able to read other people’s code.
It selects the rows of df_zip whose zip equals zip_code and returns their population values as a list.
1.5.7 3g) Masking the Zip Codes
In this part, you should write a for loop, updating the df_users dataframe.
Go through each user, and update their zip code to Safe Harbor specifications:
• If the user is from a zip code for which the population of the “Geographic Subdivision” is less than or equal to 20000, change the zip code to 0
• Otherwise, change the zip code to be only the first 3 numbers of the full zip code
• Do all this by rewriting the zip column of the df_users DataFrame
Hints: This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in zip_dict, and setting the zip_code accordingly.
0 34 Female 703
1 87 NaN 719
2 60 Female 479
3 47 NaN 290
4 85 Female 688
.. … … …
945 57 Male 228
946 23 Male 315
947 33 Female 342
948 47 Male 751
949 57 Male 956
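The per-user rule can be isolated in a small helper, which the loop over df_users would then call for each row. The sketch below is one possible version under stated assumptions: mask_zip is a hypothetical name, the populations in toy_zip_dict are made up, integer zip codes are padded back to 5 digits before slicing (since a zip such as 01001 read as an int becomes 1001), and a subdivision missing from the dictionary is treated as having population 0.

```python
def mask_zip(zip_code, zip_dict, threshold=20000):
    """Return 0 for small subdivisions, else the first 3 digits as an int."""
    # zfill(5) restores a leading zero lost when the zip was stored as an int.
    prefix = int(str(zip_code).zfill(5)[:3])
    # Assumption: a prefix absent from zip_dict counts as population 0.
    if zip_dict.get(prefix, 0) <= threshold:
        return 0
    return prefix

# Hypothetical subdivision populations keyed by 3-digit prefix.
toy_zip_dict = {703: 250000, 10: 75918, 999: 500}

masked = [mask_zip(z, toy_zip_dict) for z in [70397, 1001, "99901", 12345]]
print(masked)
```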
1.5.8 3h) Save out the properly anonymized data to json file
Save out df_users as a json file, called real_anon_user_dat.json
Congrats, you’re done! The users’ identities are much more protected now.
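For reference, writing a dataframe to JSON and reading it back can be sketched as below. The data is a toy dataframe with made-up values, and the file is written to a temporary directory under a hypothetical name so that the example does not overwrite anything in the assignment folder.

```python
import os
import tempfile
import pandas as pd

# Toy anonymized data (hypothetical values).
toy_users = pd.DataFrame({
    "age": [34, 60],
    "gender": ["Female", "Male"],
    "zip": [703, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_anon_user_dat.json")  # hypothetical file name
    toy_users.to_json(path)        # write the dataframe as JSON
    reloaded = pd.read_json(path)  # read it back to check the round trip

print(reloaded)
```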
1.6 Re-start & run all cells to be sure that everything passes, validate, and submit on DataHub!