In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
crime |
Crime reports dating back to 2010. |
mobile_food_locations |
List of all locations where mobile food vendors sell. |
mobile_food_permits |
List of all mobile food vendor permits. More details here. |
mobile_food_schedule |
Schedules for mobile food vendors. |
noise |
Noise complaints dating back to August 2015. |
parking |
List of all parking lots. |
parks |
List of all parks. |
schools |
List of all schools. |
zillow |
Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
import sqlite3 as sql
import sqlalchemy as sqla
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (12, 12)
db = sqla.create_engine('sqlite:///sf_data.sqlite')
cursor = db.execute("SELECT * FROM sqlite_master")
rows = cursor.fetchall()
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
pd.read_sql("SELECT * FROM sqlite_master", db)
pd.read_sql("SELECT * FROM mobile_food_schedule LIMIT 1 ",db)
#nothing for 0040, but 0070 is off the grid. requesting permit still
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 1",db)
approved=pd.read_sql("SELECT * FROM mobile_food_permits WHERE Status = 'APPROVED' LIMIT 2",db)
approved
db.execute("CREATE TABLE ls AS SELECT * FROM mobile_food_schedule INNER JOIN mobile_food_locations ON mobile_food_locations.locationid = mobile_food_schedule.locationid ")
#db.execute("CREATE TABLE NewTable AS SELECT * FROM Parts WHERE Weight > 14")
#INNER JOIN SupplierParts ON Parts.PartID = SupplierParts.PartID", db
pd.read_sql("SELECT * FROM ls LIMIT 1",db)
db.execute("CREATE TABLE permitSchedule AS SELECT * FROM mobile_food_permits INNER JOIN ls ON ls.permit = mobile_food_permits.permit")
PS=pd.read_sql("SELECT LocationDescription,Applicant FROM permitSchedule WHERE LocationDescription IS NOT NULL ",db)
PS=PS.drop_duplicates(keep='last')
PS['Applicant'].value_counts()
May Catering is the food vendor that sells at the most locations of 52. The second largest food vendor would be Anas Goodies with 34 locations. In third is Natan's Catering with 25 different locations.
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
pd.read_sql("SELECT * FROM zillow LIMIT 2",db)
pd.read_sql("SELECT RegionName, AVG(MedianSoldPricePerSqft_AllHomes) AS Avg_Sqft_Home FROM zillow GROUP BY RegionName LIMIT 5", db)
avg_rent=pd.read_sql("SELECT RegionName, AVG(MedianSoldPricePerSqft_AllHomes) AS Avg_Sqft_Home FROM zillow GROUP BY RegionName", db)
avg_rent.max()
avg_rent.min()
The highest average rent was 1,336 sq ft and the zip code is 94158. And the lowest average rent per square ft was 297 at the zip code of 94080. The barplot below is a distribution of all 26 zip codes and their corresponding average median rent from Zillow.
x = list(avg_rent['RegionName'])
y=list(avg_rent['Avg_Sqft_Home'])
plt.bar(x, y)
plt.ylabel("Avg Sq Ft")
plt.xlabel("Zip Code (941__)")
plt.title("Average Rent by Sq Ft by Zip Code")
plt.show()
pd.read_sql("SELECT * FROM crime LIMIT 2 ",db)
crime_count=pd.read_sql("SELECT Category, COUNT(*) AS Count FROM crime GROUP BY Category", db)
crime_count=crime_count.sort_values(by=["Count"], ascending=[False])
crime_count.head()
crime1=crime_count.drop([20,21])
crime1['Count']
counts=list(crime1.head()['Count'])
counts
labels=list(crime1.head()['Category'])
plt.rcParams["figure.figsize"] = (6,6)
plt.pie(counts,labels=labels,autopct='%1.1F%%',startangle=140)
plt.show()
The pie chart above displays the 5 most common criminal charges in the San Francisco area. The two categories of reports that were removed were 'Other Offenses' and 'Non-Criminal' because they did not provide enough specific information regarding the offense. We see that some form Theft is the most frequent crime among the top 5, at nearly 60% of reports.
noise_complaint=pd.read_sql("SELECT * FROM noise",db)
from mpl_toolkits.basemap import Basemap
fig,ax=plt.subplots(figsize=(10,20))
noisemap = Basemap(llcrnrlon = -122.553864, llcrnrlat = 37.675669, urcrnrlon=-122.334137,urcrnrlat = 37.816836,resolution = 'h', projection = 'merc')
noisemap.drawmapboundary(fill_color = 'white')
noisemap.drawcoastlines() #maybe dun need
noisemap.readshapefile("geo_export_9c1e2bde-8fa4-45cf-aa3a-7959911db8f7", 'districts')
for lon,lat in zip(noise_complaint.Lon,noise_complaint.Lat):
x,y=noisemap(lon,lat)
noisemap.plot(x,y,'ro',markersize=2)
plt.title('Noise Complaint by District')
plt.show()
The red dots indicate all of the locations where there was a noise complaint reported. It appears that the highest density of these reports is in the downtown area of San Francisco. This is expected because this region of San Francisco has the highest density of population and businesses. There are many companies, restaurants, and homes in the financial district so it is realistic that this area would be so loud.
park=pd.read_sql('SELECT * FROM parks ',db)
fig,ax=plt.subplots(figsize=(10,20))
parkmap = Basemap(llcrnrlon = -122.553864, llcrnrlat = 37.675669, urcrnrlon=-122.334137,urcrnrlat = 37.816836,resolution = 'h', projection = 'merc')
parkmap.drawmapboundary(fill_color = 'white')
parkmap.drawcoastlines()
parkmap.readshapefile("geo_export_9c1e2bde-8fa4-45cf-aa3a-7959911db8f7", 'districts')
for lon,lat in zip(park.Lon,park.Lat):
x,y=parkmap(lon,lat)
parkmap.plot(x,y,'go',markersize=2)
plt.title('Park by District')
plt.show()
This plot shows the location of parks in San Francisco, split by the districts. There does not appear to be any sort of clustering of parks, which makes sense. It is unnecessary to have so many parks all in one close proximity.
pd.read_sql('SELECT * FROM schools LIMIT 3',db)
school_type=pd.read_sql("SELECT GeneralType, COUNT(*) AS Count FROM schools GROUP BY GeneralType", db)
count_school=list(school_type['Count'])
count_school
label_school=list(school_type['GeneralType'])
plt.rcParams["figure.figsize"] = (6,6)
plt.pie(count_school,labels=label_school,autopct='%1.1F%%')
plt.title('Type of School')
plt.show()
The last graphic displayed is another pie chart. Because there were only four types of schools, they were all plotted together. It appears that CDC or development center are the most frequent in San Francisco. Another interesting statistic was the high frequency of private schools at 28.5%. For people that live in San Francisco, they know that the public school system is a lottery so many families take education into their own hands by sending their children to private schools.