Blog Post 2


Name              CS login
Qian Mei          qmei
Huan Lin          hl18
Han Sha           hsha
Xiaocheng Wang    xwang28

Machine Learning

Last week, we tried to build a multi-class classifier that labels each observation as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. Its performance was not promising: only 22% accuracy on the test set.

So this week, we targeted the LARCENY/THEFT category and built a binary LARCENY/THEFT vs. NON-LARCENY/THEFT classifier. This category has 174,900 incidents in the training data, so with more data behind a single class we can probably train a more accurate classifier.

We also adjusted the features fed into the model. During feature extraction, we found that the raw geographic coordinates are not very useful: since all the crimes occurred within San Francisco, the longitude and latitude values are nearly the same across observations, making it hard to tell classes apart based on them. It makes more sense to use PdDistrict to capture the geographic location of crime zones.
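A minimal sketch of how the PdDistrict feature can be turned into model input, assuming the data lives in a pandas DataFrame with a PdDistrict column (the toy frame below stands in for the real dataset):

```python
import pandas as pd

# Toy frame standing in for the SF crime data; only the PdDistrict
# column is assumed here.
df = pd.DataFrame({"PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN", "NORTHERN"]})

# One-hot encode the district so the model gets a usable categorical
# signal instead of near-constant latitude/longitude values.
district_dummies = pd.get_dummies(df["PdDistrict"], prefix="Pd")
df = pd.concat([df, district_dummies], axis=1)
print(sorted(district_dummies.columns.tolist()))
```

One-hot encoding avoids imposing a fake ordering on the districts, which an integer encoding would do.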

In addition, we added a new feature indicating whether the crime happened in daytime or at night. Our visualizations show that crime frequency is to some extent related to the time of day; specifically, 05:00:00 – 05:59:59 is the most peaceful hour.
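A day/night flag like this can be derived from the incident timestamp. The sketch below is an assumption about the cutoff hours (06:00–17:59 as daytime), which the post does not specify:

```python
from datetime import datetime

def is_daytime(timestamp: str) -> int:
    """Return 1 for daytime (06:00-17:59), 0 for night.
    The exact cutoff hours are illustrative assumptions."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    return 1 if 6 <= hour < 18 else 0

print(is_daytime("2015-05-13 05:30:00"))  # quiet early-morning hour -> 0
print(is_daytime("2015-05-13 14:00:00"))  # afternoon -> 1
```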

We therefore converted the “Date” field into 4 discrete values indicating the time period in which the crime occurred, giving us three features: Time Period, Day of Week, and PdDistrict. Moreover, to balance the two classes and keep the classifier from simply favoring the majority class, we used only part of the NON-LARCENY/THEFT data.
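The period bucketing and class balancing can be sketched as follows. The 4 period boundaries and the column names (Dates, Category) are assumptions for illustration; the post does not give the exact cut points:

```python
import pandas as pd

def time_period(timestamp: str) -> str:
    """Map a timestamp to one of 4 coarse periods; boundaries are
    illustrative assumptions."""
    hour = pd.Timestamp(timestamp).hour
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"

# Toy data; real labels come from the dataset's crime-category column.
df = pd.DataFrame({
    "Dates": ["2015-05-13 03:00:00", "2015-05-13 09:00:00",
              "2015-05-13 15:00:00", "2015-05-13 21:00:00",
              "2015-05-13 22:00:00", "2015-05-13 23:00:00"],
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT",
                 "NON-CRIMINAL", "VANDALISM", "ASSAULT"],
})
df["TimePeriod"] = df["Dates"].apply(time_period)
df["IsLarceny"] = (df["Category"] == "LARCENY/THEFT").astype(int)

# Undersample the majority (non-larceny) class down to the minority count.
pos = df[df["IsLarceny"] == 1]
neg = df[df["IsLarceny"] == 0].sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg])
print(balanced["IsLarceny"].value_counts().to_dict())
```

Undersampling throws away information from the majority class, but with ~175k larceny incidents there is still plenty of balanced data to train on.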

With these intuitions built into the model, the resulting accuracy of predicting whether a crime is LARCENY/THEFT is around 60%. The per-class performance statistics are as follows:


Class        Precision    Recall    F1-score
Class 1      0.62         0.67      0.64
Class 2      0.58         0.52      0.55

These statistics are much better than our previous attempt’s 22% accuracy. Since the type I and type II error rates are similar for both classes, the classifier predicts both classes about equally well. Considering that a fair number of observations from different classes share exactly the same feature values, the result is quite good.
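Per-class precision, recall, and F1 like the numbers above can be computed directly from predictions. A self-contained sketch on toy labels (the labels and predictions here are made up for illustration, not our real model output):

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy ground truth and predictions (1 = LARCENY/THEFT, 0 = NON-LARCENY/THEFT).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

for label in (1, 0):
    p, r, f = prf(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In practice scikit-learn’s classification_report produces the same per-class breakdown in one call.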

For next week, we plan to optimize our features to improve the classifier’s performance, since we cannot make further progress with the current features alone.


As you can see, we have made many visualizations to present the data from different aspects, and some of them genuinely helped with feature engineering. But we also realize that we may need to drop the ones that did not reveal much about the patterns or interesting aspects of the data.

Thus, in the next phase of the project, we will not only try to improve the accuracy of our prediction model but also rework the visualizations into ones that are more informative and interpretable for the final presentation, either by adopting more effective visualization methods or by adding animations.

