San Francisco Crime Classification


This dataset contains incidents derived from the San Francisco Police Department Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015. The data has already been divided into a training set and a test set: the training set contains 878050 records, each representing an incident, and the test set contains 884263 records. The goal of this project is to train a predictive model and use it to predict which category each record in the test set belongs to. A subset of the training data is shown below.

[Screenshot: a subset of the training data]

The incidents have the following attributes:

Dates – timestamp of the crime incident, format YYYY-MM-DD HH:MM:SS
Category – category of the crime incident, 39 categories in total
Descript – detailed description of the crime incident
DayOfWeek – the day of the week, 7 values
PdDistrict – name of the Police Department District
Resolution – how the crime incident was resolved
Address – the approximate street address of the crime incident
X – Longitude
Y – Latitude

Data Cleaning

Since the data is already well formatted, we did not perform any further data cleaning, because doing so might discard useful information and hurt accuracy. At this stage, we are therefore using all the data provided in the following visualization and machine learning steps.

Data Visualization

To uncover correlations between the attributes listed above, we’ve created several visualizations that show different dimensions of the training set.

Part 1 : Category

To begin with, we’d like to know the distribution of the crime categories and answer the question: what are the most common crimes? As mentioned, there are 39 crime categories in total, and we marked each category as follows:



We intend to visualize the crime categories in order from the highest count to the lowest. Since there are too many categories to examine individually, we pick the top 10 crime categories to see whether there is any interesting pattern within a specific category.
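The ranking step described above can be sketched in a few lines. This is only an illustration on a hypothetical toy list standing in for the Category column; the helper name top_categories is an assumption, not something from the project code.

```python
from collections import Counter

def top_categories(categories, n=10):
    """Return the n most frequent categories as (name, count) pairs, highest count first."""
    return Counter(categories).most_common(n)

# Hypothetical toy sample standing in for the Category column of the training set.
sample = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "WARRANTS",
          "LARCENY/THEFT", "ASSAULT", "ROBBERY"]

top2 = top_categories(sample, n=2)
print(top2)
```

On the real data, the same call with n=10 would give the ten categories to focus on.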

Part 2 : DayOfWeek

Next, we’d like to see whether there is a correlation between the attribute “DayOfWeek” and crime counts. The question here is: are most crimes committed on weekdays or on weekends?

In fact, the distribution of “DayOfWeek” vs “crime counts” is comparatively even. No day stands out as an extreme, with the highest count, 133734, on Friday and the lowest, 116707, on Sunday.
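To put a number on "comparatively even", the two counts quoted above can be compared directly. The short calculation below uses only those two figures; the counts for the other five days are not listed in the text, so they are left out.

```python
# Counts quoted in the text (highest and lowest day).
friday_count = 133734
sunday_count = 116707

# Ratio of the busiest day to the quietest day, and the relative gap between them.
ratio = friday_count / sunday_count
relative_gap = (friday_count - sunday_count) / sunday_count

print(f"Friday/Sunday ratio: {ratio:.3f}, relative gap: {relative_gap:.1%}")
```

The busiest day has roughly 15% more incidents than the quietest one, which is a modest spread.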

Part 3 : Hour

Since no obvious trend can be seen when only considering the day, we’d like to explore the time of each occurrence that belongs to the top 10 most common crimes in San Francisco. We made a dashboard for better visualization.
The following picture shows the plot of “hour” vs “crime counts” for all ten categories.


Next, given a specific category, we look at the distribution of “hour” vs “crime counts”.



  • 05:00:00 – 05:59:59 is the most peaceful time period, with the fewest crime incidents
  • Crimes are more likely to take place after 12pm than before 12pm
  • There are three peak periods for crime: around midnight (12am), around noon (12pm) and the evening period “17:00-18:00”
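The hour-of-day counts behind observations like these can be derived from the “Dates” timestamps. Below is a minimal sketch, assuming the YYYY-MM-DD HH:MM:SS format listed in the attribute table; the sample timestamps and the helper name hourly_counts are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps):
    """Count incidents per hour of day (0-23) from 'YYYY-MM-DD HH:MM:SS' strings."""
    hours = (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps)
    return Counter(hours)

# Hypothetical timestamps in the same format as the Dates column.
sample_dates = [
    "2015-05-13 23:53:00",
    "2015-05-13 17:30:00",
    "2015-05-13 17:05:00",
    "2015-05-13 05:12:00",
    "2015-05-13 12:00:00",
]

counts = hourly_counts(sample_dates)
# Hour with the most incidents in the sample; Counter returns 0 for hours with no incidents.
busiest = max(range(24), key=lambda h: counts[h])
print(busiest, counts[busiest])
```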

We can also check, for a specific hour period, the distribution of “category” vs “crime counts”. For example, “00” represents incidents taking place between “00:00:00” and “00:59:59”. You can click on the gallery to see the detailed statistics in the legend table of each visualization.


The overall distribution tends to be even and similar across hours. The larger the share a category has in the overall distribution, the larger its share within any specific hour period. For example, in each hour of the day, LARCENY/THEFT, OTHER OFFENSES, NON-CRIMINAL and ASSAULT account for the large majority of the data.

Part 4 : Police Department District

We use pie charts to look for a correlation between the crime category and the PD District.

We target 5 PD District groups (Southern, Northern, Bayview, Central, Others) and 6 crime categories, WARRANTS (Orange), VANDALISM (Green), VEHICLE THEFT (Purple), ROBBERY (Gray), LARCENY/THEFT (Blue) and BURGLARY (Red), and illustrate their proportions on a pie chart.

Each PD District has a slightly different proportion of the crime categories, but the overall distribution is fairly even. The most common crime category is LARCENY/THEFT, shown as the blue region of the pie chart.
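The proportions behind such pie charts amount to a per-district share of each category. The sketch below shows one way to compute them; the (district, category) pairs are a hypothetical toy sample and category_shares is an illustrative helper name.

```python
from collections import Counter, defaultdict

def category_shares(records):
    """Map each district to {category: share of that district's incidents}."""
    per_district = defaultdict(Counter)
    for district, category in records:
        per_district[district][category] += 1
    shares = {}
    for district, counts in per_district.items():
        total = sum(counts.values())
        shares[district] = {cat: n / total for cat, n in counts.items()}
    return shares

# Hypothetical (PdDistrict, Category) pairs.
sample = [
    ("SOUTHERN", "LARCENY/THEFT"), ("SOUTHERN", "LARCENY/THEFT"),
    ("SOUTHERN", "WARRANTS"), ("SOUTHERN", "ROBBERY"),
    ("BAYVIEW", "VEHICLE THEFT"), ("BAYVIEW", "BURGLARY"),
]

shares = category_shares(sample)
print(shares["SOUTHERN"])
```

Comparing these per-district dictionaries side by side is the numeric counterpart of comparing the pie charts.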

Part 5 : Longitude & Latitude

We explored the relationship between reported crimes of Prostitution and Sex Offenses Forcible (SOF) by geographically mapping the related data from 2010 – 2015 on the map of San Francisco. Each red dot represents one reported SOF crime, while each blue dot represents one reported Prostitution case.

2010 – Prostitution and SOF

2011 – Prostitution and SOF


2012 – Prostitution and SOF

2013 – Prostitution and SOF

2014 – Prostitution and SOF

2015 – Prostitution and SOF

Map for referencing districts in San Francisco:

It is easy to see from the maps that from 2010 to 2015, Prostitution crimes were almost exclusively concentrated in the downtown and Inner Mission areas (8 and 9 on the reference map). Although SOF cases seem to be scattered everywhere on the map, they are also denser in areas 8 and 9. However, it is very interesting that in area 9, SOF crimes hardly overlap with Prostitution geographically. One possible hypothesis from this observation is that Prostitution may help reduce the rate of SOF, although the maps alone cannot confirm it.

Part 6 : Parallel Coordinates of High-Dimensional Data

We use parallel coordinates to look for correlations among all the dimensions except ‘Descript’ and ‘Address’ for every category, which means that the following visualization covers six-dimensional data: “Dates”, “DayOfWeek”, “PdDistrict”, “Resolution”, “X” and “Y”. Here we target four crime categories: ‘WARRANTS’, ‘DRUNKENNESS’, ‘LARCENY/THEFT’ and ‘KIDNAPPING’.






The crime category is strongly related to ‘Resolution’. For example, ‘WARRANTS’ is mainly resolved by ‘Arrest, Booked’, ‘None’ and ‘Arrest, Cited’ in descending order, while ‘KIDNAPPING’ is resolved by ‘Arrest, Booked’, ‘None’ and ‘District Attorney Refuses to Prosecute’. For other fields like ‘PdDistrict’ and ‘DayOfWeek’, the visualization doesn’t indicate a strong relationship between them and the category.

Machine Learning


Our first trial, at the current stage of the project, is binary classification instead of multi-class classification. Our assumption is quite simple: we use the (DayOfWeek, PdDistrict) pair as features to predict the crime category. We choose the category ‘ROBBERY’, and the classification labels are a 0/1 array indicating whether a crime belongs to this category.
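As an illustration of this setup, the sketch below builds a 0/1 label array for ‘ROBBERY’ and turns the two categorical features into numeric columns. One-hot encoding is an assumption here, since the text does not say how the features were encoded; the helper names and sample rows are hypothetical.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value in vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def encode(records, target="ROBBERY"):
    """Turn (DayOfWeek, PdDistrict, Category) tuples into feature rows and 0/1 labels."""
    days = sorted({r[0] for r in records})
    districts = sorted({r[1] for r in records})
    features = [one_hot(d, days) + one_hot(p, districts) for d, p, _ in records]
    labels = [1 if c == target else 0 for _, _, c in records]
    return features, labels

# Hypothetical rows; real rows would come from the training file.
sample = [
    ("Friday", "SOUTHERN", "ROBBERY"),
    ("Sunday", "BAYVIEW", "LARCENY/THEFT"),
    ("Friday", "BAYVIEW", "ASSAULT"),
]

X, y = encode(sample)
print(X)
print(y)
```

Each feature row has one column per distinct day followed by one column per distinct district, which is a form that classifiers such as Naive Bayes, Logistic Regression and SVM can consume.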

Training Algorithms

We use supervised classification, and we’ve tried three algorithms so far: Naive Bayes, Logistic Regression and SVM.

We use the first half of the training data for training and the second half for validation, for the simple reason that test.csv does not have a label for category, so we can’t use it to compute accuracy.
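A minimal sketch of such a sequential split is shown below, using a short list as a stand-in for the labeled rows; split_in_half is an illustrative name. With 878050 labeled records, each half holds 439025 rows.

```python
def split_in_half(rows):
    """Use the first half of the rows for training and the second half for validation."""
    midpoint = len(rows) // 2
    return rows[:midpoint], rows[midpoint:]

# Stand-in for the labeled training rows.
rows = list(range(10))
train_rows, validation_rows = split_in_half(rows)
print(len(train_rows), len(validation_rows))

# Size of each half for the full labeled set described in the text.
half_size = 878050 // 2
print(half_size)
```

One design caveat: if the file happens to be sorted by date (an assumption worth checking), the two halves cover different time periods, so shuffling before splitting may give a more representative validation set.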

All three algorithms output the same training accuracy and validation accuracy.
Training accuracy: 0.974156
Validation accuracy: 0.973455


Why does this happen? Let’s look at the confusion matrix:

[[427371, 0]
[11654, 0]]

As we can see, the classifier simply labels everything as non-ROBBERY. The reason is that only about 2.62% (23000/878050 ≈ 0.026194) of all crimes belong to ROBBERY, so always predicting non-ROBBERY already gives an accuracy of about 97%. Thus we can conclude that the (DayOfWeek, PdDistrict) pair is not a good predictor for crime categories.
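The numbers in the confusion matrix can be checked directly. The calculation below assumes the usual layout (rows are true classes, columns are predicted classes, index 0 = non-ROBBERY, index 1 = ROBBERY), which the text does not state explicitly.

```python
# Confusion matrix from the text.
# Assumed layout: rows = true class, columns = predicted class,
# index 0 = non-ROBBERY, index 1 = ROBBERY.
confusion = [[427371, 0],
             [11654, 0]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts non-ROBBERY.
majority_baseline = (tn + fp) / total
# Share of true ROBBERY cases that the classifier actually found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.6f} baseline={majority_baseline:.6f} recall={recall:.1f}")
```

The accuracy equals the majority-class baseline and the recall for ROBBERY is zero, which is why accuracy alone is a misleading metric on data this imbalanced.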


What’s the hardest part of the project that you’ve encountered so far?

  • The dataset is huge, with more than 800 thousand labeled training records and more than 800 thousand test records, which makes both visualization and training more challenging.
  • The project was initially meant to deal with multi-class classification rather than binary classification, which is much more difficult. We are considering turning it into a binary classification problem, such as whether or not a specific incident is resolved (the target variable changes to “Resolution” rather than “Category”).
  • As for cross-validation, we initially intended to use 5-fold validation, which makes more sense, but our implementation still had some bugs that we could not fix in the limited time. We will fix it at a later stage.
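For reference, the index bookkeeping for 5-fold validation can be sketched as follows. This is a generic illustration with contiguous folds, not the project's implementation; k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print(validation, len(train))
```

Each record lands in exactly one validation fold, and the reported score would be the average over the five folds.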

What are your initial insights?

  • Our initial insights are described with each visualization above. We’ve tried to explore the relationship between different variables and the target label. It turns out that some of them are informative, while others seem useless, showing no obvious trends. At the very least, this sheds light on the direction we should focus on: attributes like “PdDistrict”, “Resolution” and “hour” (derived from the “Dates” variable) appear to matter most in determining the category of each incident.

Are there any concrete results you can show at this point? If not, why not?

  • At this stage, we can’t show very instructive results. However, we have at least tested our first assumption and concluded that the (DayOfWeek, PdDistrict) pair is not a good predictor for crime categories. At a later stage, adding other dimensions of the data may give better performance and accuracy.

Going forward, what are the current biggest problems you’re facing?

  • We are on track and have a clear plan for the rest of the project, so we do not see any major problem going forward.

Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

  • Yes, we are definitely on track with our project. But we also need to dedicate more time to the machine learning part, integrating the insights we’ve gained into feature selection and extraction so that we can further test the correctness of our ideas.

Given your initial exploration of the data, is it worth proceeding with your project?

  • Yes, it’s worth proceeding with the project, since we are discovering a number of interesting patterns through visualization. As long as we keep improving our training algorithm, we think we can make more progress and refine the massive raw data into little gems of genuinely useful insight.