Assignment 6

In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table Description
crime Crime reports dating back to 2010.
mobile_food_locations List of all locations where mobile food vendors sell.
mobile_food_permits List of all mobile food vendor permits. More details here.
mobile_food_schedule Schedules for mobile food vendors.
noise Noise complaints dating back to August 2015.
parking List of all parking lots.
parks List of all parks.
schools List of all schools.
zillow Zillow rent and housing statistics dating back to 1996. More details here.

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.

Shapefiles for San Francisco Neighborhoods are available here.

In [83]:
import sqlite3 as sql
import sqlalchemy as sqla
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (12, 12)
db = sqla.create_engine('sqlite:///sf_data.sqlite')

cursor = db.execute("SELECT * FROM sqlite_master")
rows = cursor.fetchall()

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

In [2]:
pd.read_sql("SELECT * FROM sqlite_master", db)
Out[2]:
type name tbl_name rootpage sql
0 table crime crime 2 CREATE TABLE "crime" (\n"IncidntNum" INTEGER,\...
1 table noise noise 35775 CREATE TABLE "noise" (\n"CaseID" INTEGER,\n "...
2 table parking parking 35921 CREATE TABLE "parking" (\n"Owner" TEXT,\n "Ad...
3 table schools schools 35944 CREATE TABLE "schools" (\n"Name" TEXT,\n "Ent...
4 table parks parks 35961 CREATE TABLE "parks" (\n"Name" TEXT,\n "Type"...
5 table zillow zillow 35967 CREATE TABLE "zillow" (\n"RegionName" INTEGER,...
6 table mobile_food_permits mobile_food_permits 36050 CREATE TABLE "mobile_food_permits" (\n"permit"...
7 table mobile_food_locations mobile_food_locations 36060 CREATE TABLE "mobile_food_locations" (\n"locat...
8 table mobile_food_schedule mobile_food_schedule 36079 CREATE TABLE "mobile_food_schedule" (\n"locati...
9 table location location 36105 CREATE TABLE location(\n locationid INT,\n p...
10 table ls ls 36522 CREATE TABLE ls(\n locationid INT,\n permit ...
11 table permitSchedule permitSchedule 36210 CREATE TABLE permitSchedule(\n permit TEXT,\n...
In [3]:
pd.read_sql("SELECT * FROM mobile_food_schedule LIMIT 1 ",db)
#nothing for 0040, but 0070 is off the grid. requesting permit still
Out[3]:
locationid permit DayOfWeek EndHour StartHour
0 305727 11MFF-0040 Mo 15 10
In [4]:
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 1",db)
Out[4]:
locationid LocationDescription Address Latitude Longitude
0 762182 TOWNSEND ST: 05TH ST to 06TH ST (400 - 499) 444 TOWNSEND ST 37.774871 -122.398532
In [5]:
approved=pd.read_sql("SELECT * FROM mobile_food_permits WHERE Status = 'APPROVED' LIMIT 2",db)
approved
Out[5]:
permit Status Applicant FacilityType FoodItems PriorPermit Approved Expiration
0 16MFF-0027 APPROVED F & C Catering Truck Cold Truck: Hot/Cold Sandwiches: Water: Soda: ... 1 2016-03-09 12:00:00 2017-03-15 12:00:00
1 16MFF-0069 APPROVED Munch A Bunch Truck Cold Truck: packaged sandwiches: pitas: breakf... 1 2016-03-15 12:00:00 2017-03-15 12:00:00
In [79]:
db.execute("CREATE TABLE ls AS SELECT * FROM mobile_food_schedule INNER JOIN mobile_food_locations ON mobile_food_locations.locationid = mobile_food_schedule.locationid ")

#db.execute("CREATE TABLE NewTable AS SELECT * FROM Parts WHERE Weight > 14")
#INNER JOIN SupplierParts ON Parts.PartID = SupplierParts.PartID", db
In [7]:
pd.read_sql("SELECT * FROM ls LIMIT 1",db)
Out[7]:
locationid permit DayOfWeek EndHour StartHour locationid:1 LocationDescription Address Latitude Longitude
0 305727 11MFF-0040 Mo 15 10 305727 MISSION ST: SHAW ALY to ANTHONY ST (543 - 586) 561 MISSION ST 0.0 0.0
In [80]:
db.execute("CREATE TABLE permitSchedule AS SELECT * FROM mobile_food_permits INNER JOIN ls ON ls.permit = mobile_food_permits.permit")
In [9]:
PS=pd.read_sql("SELECT LocationDescription,Applicant FROM permitSchedule WHERE LocationDescription IS NOT NULL ",db)
PS=PS.drop_duplicates(keep='last')
PS['Applicant'].value_counts()
Out[9]:
May Catering                                        52
Anas Goodies Catering                               34
Natan's Catering                                    25
Liang Bai Ping                                      24
Singh Brothers Ice Cream                            21
Sun Rise Catering                                   18
Mang Hang Catering                                  18
John's Catering #5                                  18
Park's Catering                                     18
Two G's Catering                                    16
D & T Catering                                      16
Mini Mobile Food Catering                           15
Steve's Mobile Deli                                 15
Quan Catering                                       14
M M Catering                                        13
Mike's Catering                                     11
F & C Catering                                      11
Bach Catering                                       10
Munch A Bunch                                       10
Roadside Rotisserie Corporation / Country Grill      7
BH & MT LLC                                          7
Linda's Catering                                     6
Off the Grid Services, LLC                           6
Eva's Catering                                       6
Cheese Gone Wild                                     5
Casey's Pizza, LLC                                   5
Covered Dish Productions, LLC. dba. The Rib Whip     5
San Francisco Street Foods, Inc.                     5
Golden Catering                                      5
Lobsta Truck                                         5
                                                    ..
Bob Johnson                                          1
Eli's Hot Dogs                                       1
Breaking Bread Inc.                                  1
La Jefa                                              1
Mr. Nice, LLC                                        1
Your Community Foods                                 1
Peruchi Food Truck,LLC                               1
El Gallo Jiro                                        1
Raspados El Yocateco                                 1
The Creme Brulee Cart                                1
Stanley Roth                                         1
Curbside Coffee                                      1
Wexler, LLC                                          1
Geary Cocina                                         1
Faith Sandwich                                       1
SF Cart Project                                      1
Got Snacks                                           1
Golden Gate Halal Food                               1
Rita's Catering                                      1
Munch India                                          1
Leo's Hot Dogs                                       1
Mob Dog                                              1
Bonito Poke                                          1
Kabob Trolley, LLC                                   1
Sausage Slinger                                      1
Curry Up Now                                         1
Kiss Point                                           1
Missing Link SF                                      1
Mario's Colombian and Mexican Food                   1
Alfaro's Truck                                       1
Name: Applicant, dtype: int64

Which mobile food vendor(s) sells at the most locations?

May Catering is the food vendor that sells at the most locations of 52. The second largest food vendor would be Anas Goodies with 34 locations. In third is Natan's Catering with 25 different locations.

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

  • Which parts of the city are the most and least expensive?
  • Which parts of the city are the most dangerous (and at what times)?
  • Are noise complaints and mobile food vendors related?
  • What are the best times and places to find food trucks?
  • Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

In [139]:
pd.read_sql("SELECT * FROM zillow LIMIT 2",db)
Out[139]:
RegionName Date ZriPerSqft_AllHomes MedianSoldPricePerSqft_AllHomes PriceToRentRatio_AllHomes Turnover_AllHomes
0 94109 2010-11-01 00:00:00 3.156 675.1913 19.14 6.0771
1 94110 2010-11-01 00:00:00 2.566 599.6785 18.10 5.4490

Which parts of the city are the most and least expensive?

In [104]:
pd.read_sql("SELECT RegionName, AVG(MedianSoldPricePerSqft_AllHomes) AS Avg_Sqft_Home FROM zillow GROUP BY RegionName LIMIT 5", db)
Out[104]:
RegionName Avg_Sqft_Home
0 94080 375.629802
1 94102 626.417276
2 94103 604.009929
3 94104 1336.128581
4 94105 798.257824
In [77]:
avg_rent=pd.read_sql("SELECT RegionName, AVG(MedianSoldPricePerSqft_AllHomes) AS Avg_Sqft_Home FROM zillow GROUP BY RegionName", db)
avg_rent.max()
Out[77]:
RegionName       94158.000000
Avg_Sqft_Home     1336.128581
dtype: float64
In [78]:
avg_rent.min()
Out[78]:
RegionName       94080.000000
Avg_Sqft_Home      297.008046
dtype: float64

The highest average rent was 1,336 sq ft and the zip code is 94158. And the lowest average rent per square ft was 297 at the zip code of 94080. The barplot below is a distribution of all 26 zip codes and their corresponding average median rent from Zillow.

In [142]:
x = list(avg_rent['RegionName'])
y=list(avg_rent['Avg_Sqft_Home'])
plt.bar(x, y)
plt.ylabel("Avg Sq Ft")
plt.xlabel("Zip Code (941__)")
plt.title("Average Rent by Sq Ft by Zip Code")
plt.show()

What parts of the city are the most dangerous? (And what times?)

In [136]:
pd.read_sql("SELECT * FROM crime LIMIT 2 ",db)
Out[136]:
IncidntNum Category Descript DayOfWeek Datetime PdDistrict Resolution Address Lon Lat
0 150060275 NON-CRIMINAL LOST PROPERTY Monday 2015-01-19 14:00:00 MISSION NONE 18TH ST / VALENCIA ST -122.421582 37.761701
1 150098210 ROBBERY ROBBERY, BODILY FORCE Sunday 2015-02-01 15:45:00 TENDERLOIN NONE 300 Block of LEAVENWORTH ST -122.414406 37.784191
In [18]:
crime_count=pd.read_sql("SELECT Category, COUNT(*) AS Count FROM crime GROUP BY Category", db)
In [137]:
crime_count=crime_count.sort_values(by=["Count"], ascending=[False])
crime_count.head()
Out[137]:
Category Count
16 LARCENY/THEFT 243144
21 OTHER OFFENSES 141431
20 NON-CRIMINAL 125702
1 ASSAULT 90029
35 VANDALISM 54467
In [138]:
crime1=crime_count.drop([20,21])
crime1['Count']
In [123]:
counts=list(crime1.head()['Count'])
counts
labels=list(crime1.head()['Category'])
plt.rcParams["figure.figsize"] = (6,6)
plt.pie(counts,labels=labels,autopct='%1.1F%%',startangle=140)
plt.show()

The pie chart above displays the 5 most common criminal charges in the San Francisco area. The two categories of reports that were removed were 'Other Offenses' and 'Non-Criminal' because they did not provide enough specific information regarding the offense. We see that some form Theft is the most frequent crime among the top 5, at nearly 60% of reports.

What regions have the most amount of noise complaints?

In [96]:
noise_complaint=pd.read_sql("SELECT * FROM noise",db)
In [88]:
from mpl_toolkits.basemap import Basemap
In [103]:
fig,ax=plt.subplots(figsize=(10,20))
noisemap = Basemap(llcrnrlon = -122.553864, llcrnrlat = 37.675669, urcrnrlon=-122.334137,urcrnrlat = 37.816836,resolution = 'h', projection = 'merc')
noisemap.drawmapboundary(fill_color = 'white')
noisemap.drawcoastlines() #maybe dun need
noisemap.readshapefile("geo_export_9c1e2bde-8fa4-45cf-aa3a-7959911db8f7", 'districts')

for lon,lat in zip(noise_complaint.Lon,noise_complaint.Lat):
    x,y=noisemap(lon,lat)
    noisemap.plot(x,y,'ro',markersize=2)
plt.title('Noise Complaint by District')
plt.show()    

The red dots indicate all of the locations where there was a noise complaint reported. It appears that the highest density of these reports is in the downtown area of San Francisco. This is expected because this region of San Francisco has the highest density of population and businesses. There are many companies, restaurants, and homes in the financial district so it is realistic that this area would be so loud.

Parks

In [130]:
park=pd.read_sql('SELECT * FROM parks ',db)
In [135]:
fig,ax=plt.subplots(figsize=(10,20))
parkmap = Basemap(llcrnrlon = -122.553864, llcrnrlat = 37.675669, urcrnrlon=-122.334137,urcrnrlat = 37.816836,resolution = 'h', projection = 'merc')
parkmap.drawmapboundary(fill_color = 'white')
parkmap.drawcoastlines() 
parkmap.readshapefile("geo_export_9c1e2bde-8fa4-45cf-aa3a-7959911db8f7", 'districts')

for lon,lat in zip(park.Lon,park.Lat):
    x,y=parkmap(lon,lat)
    parkmap.plot(x,y,'go',markersize=2)
plt.title('Park by District')
plt.show()    

This plot shows the location of parks in San Francisco, split by the districts. There does not appear to be any sort of clustering of parks, which makes sense. It is unnecessary to have so many parks all in one close proximity.

How many schools are there?

In [143]:
pd.read_sql('SELECT * FROM schools LIMIT 3',db)
Out[143]:
Name Entity LowerGrade UpperGrade GradeRange Category LowerAge UpperAge GeneralType Address Lat Lon
0 Alamo Elementary School SFUSD 0 5 K-5 USD Grades K-5 5 10 PS 250 23RD AVE, San Francisco, CA 94121 37.783005 -122.482300
1 Alvarado Elementary School SFUSD 0 5 K-5 USD Grades K-5 5 10 PS 625 DOUGLASS ST, San Francisco, CA 94114 37.753681 -122.438194
2 Aptos Middle School SFUSD 6 8 6-8 USD Grades 6-8 11 13 PS 105 APTOS AVE, San Francisco, CA 94127 37.729672 -122.465782
In [111]:
school_type=pd.read_sql("SELECT GeneralType, COUNT(*) AS Count FROM schools GROUP BY GeneralType", db)
In [122]:
count_school=list(school_type['Count'])
count_school
label_school=list(school_type['GeneralType'])
plt.rcParams["figure.figsize"] = (6,6)
plt.pie(count_school,labels=label_school,autopct='%1.1F%%')
plt.title('Type of School')
plt.show()

The last graphic displayed is another pie chart. Because there were only four types of schools, they were all plotted together. It appears that CDC or development center are the most frequent in San Francisco. Another interesting statistic was the high frequency of private schools at 28.5%. For people that live in San Francisco, they know that the public school system is a lottery so many families take education into their own hands by sending their children to private schools.

In [ ]: