Blog Post 2


Name                           CS login
Qian Mei                          qmei
Huan Lin                          hl18
Han Sha                           hsha
Xiaocheng Wang         xwang28

Machine Learning

Last week, we tried to build a multi-class classifier that labels each observation as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. The resulting accuracy of 22% on the test set was not promising.

So this week, we targeted the LARCENY/THEFT category and attempted to build a binary LARCENY/THEFT vs. NON-LARCENY/THEFT classifier. This category has 174,900 incidents in the training data, so with more data, and thus more information hidden in it, we can probably train a more accurate classifier.

We also adjusted the features we fed into the model. During feature extraction, we found that the raw geographic coordinates are not very useful: since all the crimes occurred in San Francisco, the longitudes and latitudes are nearly identical, which makes it hard to tell observations apart based on them. PdDistrict, in contrast, conveys the geographic location of crime zones much more meaningfully.

In addition, we added a new feature indicating whether the crime occurred in the daytime or at night. Our visualizations show that crime is, to some extent, related to time of day; in particular, 05:00:00 – 05:59:59 is the most peaceful hour.

So we converted the “Date” field into 4 discrete values indicating the time period when the crime occurred. This gave us three features: Time Period, Day of Week, and PdDistrict. Moreover, to balance the data from both classes and keep the classifier from simply favoring the majority class, we used only part of the NON-LARCENY/THEFT data.
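The bucketing and downsampling steps above can be sketched in pure Python. The exact hour boundaries of the four time periods are our own illustrative assumption, since the post does not spell them out:

```python
import random

def time_period(hour):
    """Map an hour (0-23) to one of 4 coarse time periods.
    The boundaries here are assumed for illustration."""
    if 0 <= hour < 6:
        return "night"
    elif hour < 12:
        return "morning"
    elif hour < 18:
        return "afternoon"
    else:
        return "evening"

def balance(positives, negatives, seed=0):
    """Downsample the larger class so both classes are equally sized."""
    random.seed(seed)
    n = min(len(positives), len(negatives))
    return random.sample(positives, n), random.sample(negatives, n)

# Toy usage: 3 larceny rows vs. 9 non-larceny rows, each row being
# the (Day of Week, PdDistrict, Time Period) feature triple.
larceny = [("Wednesday", "SOUTHERN", time_period(14))] * 3
other = [("Friday", "MISSION", time_period(2))] * 9
pos, neg = balance(larceny, other)
```

After balancing, both classes contribute the same number of rows to training, so a constant prediction no longer scores well.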

By adding the above intuitions to the model-building process, the resulting accuracy of predicting whether a crime belongs to LARCENY/THEFT is around 60%. The per-class performance statistics are as follows:


Class         Precision   Recall   F1-score
Class 1       0.62        0.67     0.64
Class 2       0.58        0.52     0.55

From the statistics above, we can see that this is much better than our previous attempt's 22% accuracy. Since the type I and type II error rates for the two classes are similar, the classifier predicts both classes about equally well. Considering that a certain amount of the data consists of observations with identical features but different class labels, the result is quite good.

For the next week, we plan to refine our features to improve the classifier's performance; with only the current features we cannot make further progress.


As you can see, we have made many visualizations presenting the data from different angles, and some of them genuinely helped us with feature engineering. But we also realize that we may need to drop those that did not reveal much about the patterns or interesting aspects of the data.

Thus, in the next phase of the project, we will not only try to improve the accuracy of our prediction model, but also revise the visualizations so that they are more informative and interpretable in the final presentation, either by adopting more effective visualization methods or by adding animation.


Blog Post 1

Name                           CS login
Qian Mei                          qmei
Huan Lin                          hl18
Han Sha                           hsha
Xiaocheng Wang         xwang28

In the last version, we only visualized the raw data, from which the patterns stayed hidden, and we did not make any significant discoveries. So this week we improved our visualizations and tried other predictive models based on the insights they gave us.

Crime Map

To begin with, we made a better crime map and fixed last time's bug that limited how much we could plot. We attempted one of the most promising ways of creating geographic maps for large datasets: R and its ggmap library. One of our previous problems with D3 is that no sufficiently detailed San Francisco topographic basemap is available, which means plotting the SF contours alone takes a tremendous amount of time, let alone adding more than 10k data points onto the basemap. In fact, the browser crashed when the number of data points exceeded 3k.


The ggmap library is easy to use: the OpenStreetMap package containing the SF basemap can be used directly in R, which means generating the following map of 10 years of Prostitution and Sex Offenses Forcible crimes took us less than 20 seconds.

However, we have found it is almost impossible to create interactive geographic maps in R and serve them on the client side the way D3 does. There may be other alternatives, but for now it looks like static .jpg files are the best R can generate.

We will keep looking for other solutions over the next week or so. Hopefully we can come across a framework that maps data geographically and efficiently without compromising any functionality.

Pattern of Date/Time

Since date/time serves as an important independent feature of crime, we further derived the hour, month, and year from the “Date” field, and normalized the hourly and monthly crime counts using the z-score formula (x − mean(X)) / std(X).

We picked the top 10 crime categories. Although their total crime counts vary a lot, similar patterns emerge after normalization.
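A minimal sketch of this z-score normalization in pure Python (using the population standard deviation, which is an assumption on our part). Two categories with very different scales but the same seasonal shape land on the same normalized curve:

```python
from statistics import mean, pstdev

def normalize(counts):
    """Z-score normalize a sequence of crime counts: (x - mean) / std."""
    m, s = mean(counts), pstdev(counts)
    return [(x - m) / s for x in counts]

# Hypothetical monthly counts: "assault" is one tenth of "theft"
# but follows the same shape, so the normalized curves coincide.
theft   = [900, 1100, 1000, 1200, 1400, 1000]
assault = [90, 110, 100, 120, 140, 100]
print(normalize(theft))
print(normalize(assault))
```

This is exactly why the normalized plots make the shared peaks visible even though the raw counts differ by an order of magnitude.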

Monthly Crime Counts of Top 10 Crimes
Normalized Monthly Crime Counts of Top 10 Crimes

Comparing the above two graphs, we found there are two peaks of crime throughout the year: May and October.

Similarly, the hourly crime counts show the same kind of pattern after normalization.

Hourly Crime Counts of Top 10 Crimes
Normalized Hourly Crime Counts of Top 10 Crimes

Finding this pattern may help us produce training features. Next week we plan to build predictive models using 5 independent features: hour, day of week, month, year, and PdDistrict.
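Deriving these five features from a timestamp is straightforward with the standard library. The timestamp format below is an assumption about how the “Date” field is formatted:

```python
from datetime import datetime

def extract_features(date_str, pd_district):
    """Derive the (hour, dayofweek, month, year, PdDistrict) feature tuple.
    dayofweek: 0 = Monday ... 6 = Sunday.
    The "%Y-%m-%d %H:%M:%S" format is an assumed layout of the Date field."""
    dt = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    return (dt.hour, dt.weekday(), dt.month, dt.year, pd_district)

print(extract_features("2015-05-13 23:53:00", "NORTHERN"))
# (23, 2, 5, 2015, 'NORTHERN')  -- 2 means Wednesday
```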

One more idea concerns the “Resolution” field. Intuitively, crimes that went unresolved might be more likely to occur again, since the perpetrator was not caught. So we may group crimes into a binary Resolved/Unresolved category and see whether it helps make more accurate predictions.
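Collapsing “Resolution” into that binary category could look like the sketch below; we are assuming here that the dataset marks unresolved incidents with the value "NONE":

```python
def resolution_flag(resolution):
    """Collapse the many Resolution values into Resolved/Unresolved.
    Assumes 'NONE' is the dataset's marker for an unresolved incident."""
    return "Unresolved" if resolution.strip().upper() == "NONE" else "Resolved"

print(resolution_flag("NONE"))            # Unresolved
print(resolution_flag("ARREST, BOOKED"))  # Resolved
```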

Machine Learning

One useful prediction task is predicting the crime category from Day of Week and PdDistrict. We have explored Naïve Bayes, Logistic Regression, and SVM so far. The result is a high accuracy (>0.97); however, we concluded it was a bad attempt after looking at the confusion matrix. We realized that the classifier simply labels every entry as negative. The reason is severe class imbalance: taking ROBBERY as an example, only 2.69% (23000/855049 ≈ 0.0269) of all crimes are ROBBERY.
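That >0.97 accuracy is exactly what a degenerate always-predict-NON_ROBBERY classifier achieves, which is what the confusion matrix exposes. A quick check using the counts quoted above:

```python
robbery, total = 23000, 855049

# A classifier that always predicts NON_ROBBERY is correct whenever
# the true label is anything other than ROBBERY.
baseline_accuracy = (total - robbery) / total
print(round(baseline_accuracy, 4))  # 0.9731

# Its confusion matrix: every true ROBBERY becomes a false negative.
true_pos, false_neg = 0, robbery
recall_on_robbery = true_pos / (true_pos + false_neg)
print(recall_on_robbery)  # 0.0
```

So accuracy alone is meaningless here; the per-class recall of 0.0 is what reveals the failure.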

During one of our group discussions, we decided to adjust the learning goal of our machine learning algorithm. Instead of a simple binary classifier outputting whether a crime is ROBBERY or NON-ROBBERY, we adopted multi-class classification, labeling each crime as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. In this attempt we still used the same (DayOfWeek, PdDistrict) feature set. The resulting accuracy is around 0.22 for Naive Bayes, Logistic Regression, and SVM alike.
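We cannot reproduce the trained models here, but a most-frequent-category-per-(DayOfWeek, PdDistrict) lookup table, which is not the post's actual models, just an illustration, shows the ceiling such coarse features impose: every observation sharing a feature pair gets the same prediction, no matter its true class.

```python
from collections import Counter, defaultdict

def fit_lookup(rows):
    """rows: iterable of ((dayofweek, pddistrict), category) pairs.
    Predict the most common category seen for each feature pair."""
    by_pair = defaultdict(Counter)
    for pair, cat in rows:
        by_pair[pair][cat] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in by_pair.items()}

# Hypothetical training rows: each feature pair mixes several classes,
# so even the best per-pair prediction misses the minority labels.
train = [
    (("Friday", "SOUTHERN"), "LARCENY/THEFT"),
    (("Friday", "SOUTHERN"), "LARCENY/THEFT"),
    (("Friday", "SOUTHERN"), "ASSAULT"),
    (("Monday", "MISSION"), "NON-CRIMINAL"),
    (("Monday", "MISSION"), "ASSAULT"),
    (("Monday", "MISSION"), "NON-CRIMINAL"),
]
model = fit_lookup(train)
acc = sum(model[p] == c for p, c in train) / len(train)
print(model[("Friday", "SOUTHERN")], acc)
```

When classes overlap heavily within each feature pair, no classifier on those two features alone can do much better than this lookup, which is consistent with the low 0.22 accuracy.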

This is clearly not an accurate classifier. We plan to investigate the issue further in our final project, possibly by using other columns of the data as learning features, or by trying other learning goals such as regional data, which could help the police department focus more on certain regions than others, since crimes are committed more in some places. If it turns out that the crime category is indeed hard to predict, we will try to analyze and explain why, and possibly draw a conclusion about what is missing from the data that leads to the poor performance. All three algorithms (Naïve Bayes, logistic regression, SVM) can still be used to predict the crime category.