Name CS login
Qian Mei qmei
Huan Lin hl18
Han Sha hsha
Xiaocheng Wang xwang28
In the last version, we only visualized the raw data. No clear pattern emerged from those plots, and we made no significant discovery. This week we improved our visualizations and tried several predictive models based on the insights they gave us.
To begin with, we made a better crime map and fixed last version's bug that limited how many points could be plotted. We tried one of the most promising approaches to geographic mapping of large datasets: R with its ggmap library. One of our earlier problems with D3 was that no sufficiently detailed San Francisco topographic basemap was available. Drawing the SF contours alone would take a tremendous amount of time, let alone adding more than 10k data points on top of the basemap. In fact, the browser crashed when the number of data points exceeded 3k.
The ggmap library is easy to use. The OpenStreetMap package, which contains an SF basemap, can be used directly in R, so generating the following map of 10 years of data on Prostitution and Sex Offenses Forcible crimes took us less than 20 seconds.
However, we have found it almost impossible to create an interactive geographic map in R and serve it on the client side the way D3 does. There may be alternatives, but for now it looks like static .jpg files are the best R can generate.
We will keep looking for other solutions over the next week or so. Hopefully we can find a framework that maps data geographically in an efficient way without compromising any functionality.
Pattern of Date/Time
Following the idea that date/time serves as an important independent feature of crime, we further derived the hour, month and year from the “Date” field. We then normalized the hourly and monthly crime counts using the formula (x - mean(X)) / std(X).
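The derivation and normalization steps can be sketched in a few lines of Python. This is only a minimal sketch, not our actual pipeline: it assumes the normalization is the usual z-score, (x - mean(X)) / std(X), and that timestamps look like "2015-05-13 23:53:00". The function names and the timestamp format are hypothetical.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev


def date_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into (hour, month, year).

    The default format is an assumption about how the "Date" field is stored.
    """
    t = datetime.strptime(timestamp, fmt)
    return t.hour, t.month, t.year


def zscore(counts):
    """Standardize a list of counts: (x - mean) / population std."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All counts are identical, so there is no variation to rescale.
        return [0.0 for _ in counts]
    return [(x - mu) / sigma for x in counts]


def normalized_monthly_counts(timestamps):
    """Count records per month (January..December) and z-score the counts."""
    by_month = Counter(date_features(ts)[1] for ts in timestamps)
    counts = [by_month.get(month, 0) for month in range(1, 13)]
    return zscore(counts)
```

Running normalized_monthly_counts once per crime category puts categories with very different totals on the same scale, which is what makes their monthly shapes comparable. The same approach works for hours by counting date_features(ts)[0] over range(24).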
We picked the top 10 crime categories. Although their total counts vary a lot, similar patterns emerge after normalization.
Comparing the above two graphs, we found two peaks in crime over the year: May and October.
Similarly, the hourly crime counts show the same kind of pattern after normalization.
Finding this pattern may help us produce training features. Next week we plan to build predictive models using 5 independent features: hour, dayofweek, month, year and PdDistrict.
One more idea concerns how to handle the “Resolution” field. Common sense suggests that crimes that went unresolved might be more likely to occur again, since the perpetrator wasn’t caught. So we may group crimes into a binary category, Resolved/Unresolved, to see whether it helps make more accurate predictions.
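A possible way to do this grouping is sketched below in Python. It assumes that unresolved incidents carry the value "NONE" in the “Resolution” field. That value, the set name and the function name are assumptions for illustration and would need to be checked against the actual data.

```python
# Assumption: these raw "Resolution" values mean the case was not resolved.
# Extend the set if the data uses other markers for unresolved cases.
UNRESOLVED_VALUES = {"NONE"}


def resolution_flag(resolution):
    """Map a raw Resolution value to the binary label Resolved/Unresolved."""
    if resolution.strip().upper() in UNRESOLVED_VALUES:
        return "Unresolved"
    return "Resolved"
```

For example, resolution_flag("NONE") gives "Unresolved", while any other value, such as an arrest, gives "Resolved".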
One useful prediction is the crime category from Day of Week and PD District. We have explored Naïve Bayes, Logistic Regression and SVM thus far. The result was a high accuracy (>0.97). However, after looking at the confusion matrix we concluded that this was a bad attempt: the classifier simply labeled every entry as negative. The reason is class imbalance. Taking ROBBERY as an example, only 2.69% (23000/855049=0.026899) of all crimes are ROBBERY, so predicting NON_ROBBERY for every entry is already about 97% accurate.
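The arithmetic behind this can be checked with a short Python sketch. It builds the confusion matrix of a hypothetical classifier that always answers NON_ROBBERY, using the two counts quoted above. The function name and the dictionary layout are ours, chosen for illustration.

```python
def always_negative_report(n_positive, n_total):
    """Confusion-matrix summary for a classifier that never predicts positive."""
    n_negative = n_total - n_positive
    tp, fn = 0, n_positive   # every real positive is missed
    tn, fp = n_negative, 0   # every real negative is labeled correctly
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp,
            "accuracy": accuracy, "recall": recall}


# ROBBERY counts quoted in the text: 23000 positives out of 855049 records.
report = always_negative_report(23000, 855049)
print(report)
```

The accuracy comes out at about 0.973 while the recall for ROBBERY is 0, which is why accuracy alone is misleading on data this imbalanced.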
During one of our group discussions, we decided to change the learning goal of our machine learning algorithms. Instead of a simple binary classifier that outputs whether a crime is ROBBERY or NON_ROBBERY, we adopted multi-class classification, which classifies the crime as LARCENY/THEFT, NON-CRIMINAL or ASSAULT. In this attempt we still used the same feature set (DayOfWeek, PDDistrict). The resulting accuracy was around 0.22 for all three of Naïve Bayes, Logistic Regression and SVM.
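To make the multi-class setup concrete, here is a minimal Python sketch of a categorical Naive Bayes classifier over (DayOfWeek, PDDistrict) pairs. It is not the implementation we ran. The class name, the Laplace smoothing parameter alpha and the six toy rows are all made up for illustration, and the toy rows say nothing about the real data.

```python
import math
from collections import Counter, defaultdict


class CategoricalNB:
    """Naive Bayes for categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (an assumed default)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # feature_counts[(label, feature_index)][value] -> count
        self.feature_counts = defaultdict(Counter)
        # Distinct values seen for each feature, used in the smoothing term.
        self.values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(label) + sum over features of log P(value | label)
            score = math.log(count / total)
            for i, value in enumerate(row):
                seen = self.feature_counts[(label, i)][value]
                denom = count + self.alpha * len(self.values[i])
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy rows: (DayOfWeek, PDDistrict) -> category.
rows = [("Friday", "SOUTHERN"), ("Friday", "SOUTHERN"),
        ("Monday", "MISSION"), ("Monday", "MISSION"),
        ("Sunday", "TARAVAL"), ("Sunday", "TARAVAL")]
labels = ["LARCENY/THEFT", "LARCENY/THEFT",
          "ASSAULT", "ASSAULT",
          "NON-CRIMINAL", "NON-CRIMINAL"]
model = CategoricalNB().fit(rows, labels)
```

On toy rows like these the two features separate the classes perfectly. On the real data many categories share the same day and district, which may be part of why two features alone reach only about 0.22.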
This is clearly not an accurate classifier. In our final project we plan to investigate the issue further, possibly by using other columns of the data as features, or by trying other learning goals such as predicting the region of a crime. Regional predictions could help the police department focus on certain regions, since crimes are committed more often in some places than in others. If the crime category turns out to be genuinely hard to predict, we will try to analyze and explain why, and possibly conclude what is missing from the data that leads to the poor performance. All three algorithms (Naïve Bayes, Logistic Regression, SVM) can still be used to predict the crime category.