Final Report


Group Info

Name                CS login
Qian Mei            qmei
Huan Lin            hl18
Han Sha             hsha
Xiaocheng Wang      xwang28

Abstract

From 1934 to 1963, San Francisco was infamous for housing some of the world’s most notorious criminals. Although the city by the bay is nowadays known more for its tech scene than for its criminal past, there is still no scarcity of crime, owing to increasing wealth inequality and housing shortages.

Therefore, based on nearly 12 years of crime reports from across all of San Francisco’s neighborhoods, each providing a time and location, our project aims to explore the dataset visually, using D3 and MATLAB, so as to discover the overall trends and distributions. Furthermore, based on the insights gained from the visualizations, we build a predictive model for crime classification by applying machine learning techniques to various combinations of attributes. As for real-world applications, our prediction results could help the local police department in San Francisco assign police manpower more quickly and efficiently.

Data

The data we use in our project comes from the Kaggle data science competition platform (https://www.kaggle.com/c/sf-crime/data). The dataset contains criminal incidents derived from the San Francisco Police Department Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015, with the following attributes:

  • Dates – timestamp of the crime incident, format YYYY-MM-DD HH:MM:SS
  • Category – category of the crime incident (39 categories in total)
  • Descript – detailed description of the crime incident
  • DayOfWeek – the day of the week, 7 values
  • PdDistrict – name of the Police Department District
  • Resolution – how the crime incident was resolved
  • Address – the approximate street address of the crime incident
  • X – Longitude of crime scene
  • Y – Latitude of crime scene

Since the data is well formatted, we did not perform further data cleaning, which might have discarded useful information. However, because the label attribute “Category” is excluded from the test data, the provided test set is not usable for us, since we evaluate our machine learning models by accuracy. Thus our exploration focuses on the training data, which contains 878050 crime entries represented as (Dates, Descript, DayOfWeek, PdDistrict, Resolution, Address, X, Y). To prepare for feature extraction and machine learning, we therefore split the provided “train.csv” into new training and test data; our strategy is to use half for training and half for testing.

Hypothesis

  • We assume the total crime count decreased from 2003 to 2015.
  • We assume the Police Department District (PdDistrict) is the attribute most indicative of and correlated with the crime category.
  • We assume more crimes take place on weekdays than on weekends.
  • We assume Southern is the most dangerous district in San Francisco.

Methodology

  • Data Visualization:

In order to answer questions such as “what are the crime category distributions and patterns in the city by the bay?” and “what are the most relevant features for predicting crime categories?”, we start by exploring the data visually to see what hints and insights can be obtained. Our visualizations are mainly delivered with D3 and MATLAB.

  • Machine Learning:

We also perform some statistical analysis for a better understanding of the overall dataset. Then, based on the hypotheses listed above, we take the following steps to build and select a predictive model (a code sketch follows the list):

  1. Split the data into a train set and a test set. Learn a model on the training data and test it on unseen test data.
  2. Extract attributes to be used as input features for the machine learning algorithms.
  3. Use one-hot encoding to prepare the training features, so that learners based on standard distance metrics between samples (such as k-nearest neighbors) are not misled by arbitrary label values.
  4. Try different classification techniques, tune parameters, and use cross-validation to avoid overfitting.
  5. Compare and analyze the performance of the different classifiers.
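For concreteness, here is a minimal sketch of steps 1–3 and 5 with scikit-learn; the file name and column choices follow the data description above, while the classifier choice is an illustrative assumption rather than our exact code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("train.csv")

# Step 3: one-hot encode categorical attributes so that distance-based
# learners are not misled by arbitrary label values.
X = pd.get_dummies(df[["DayOfWeek", "PdDistrict"]])
y = df["Category"]

# Step 1: half for training, half for (labeled) testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Steps 4-5: fit a classifier and measure accuracy on unseen data.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```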

Data Visualization, Trends & Patterns

To begin with, we’d like to know the distribution of the crime categories and answer the question: what are the most common crimes? There are 39 crime categories in total, listed as follows (ordered by total crime count from highest to lowest):

[Figure: Total Crime Counts of the 39 Crime Categories, 2003-2015]

We can see that the total crime count over the 12 years differs significantly across crime types, suggesting the categories are unevenly distributed.

[Figure: Top 10 Crime Category Distribution]

Additionally, statistics show that nearly ¼ of all crimes belong to “Larceny/Theft”. Moreover, the top five crime types account for 60 percent, and the top 20 classes cover 97% of the entire dataset. In light of this, we realize that the training instances for the #21-#39 crime categories are too few, so predictions on this part of the data might hurt the overall performance of the model, considering that we have to build a 39-class classifier. We also believe theft, as the most dominant crime type, deserves further exploration.

After mastering the overall distribution of the target label “Category”, we move on to probe the hidden temporal and geographical patterns in the portion of the data belonging to the top 10 crime categories, for the simple reason that the data provides information along two axes: time and location.


In order to test the assumption that crimes follow temporal patterns, we derived hour, day of week, month and year from the two attributes “Dates” and “DayOfWeek” and produced the visualization above, which shows that there are indeed some interesting patterns:

Weekly Crime Count

  • The distribution of “dayofweek” vs “crime counts” is comparatively even. No extreme value appears on a specific day, with the highest crime count, 133734, on Friday and the lowest, 116707, on Sunday. So our third hypothesis appears to be incorrect.

Hourly Crime Count:

  • 5am is the most peaceful hour, with the fewest crime incidents.
  • Crimes are more likely to take place after 12pm than before 12pm.
  • There are three peak periods for committing crimes: midnight at around 12am, noon at around 12pm, and the evening period “17:00-18:00”.

Monthly Crime Count:

  • From the stacked area chart, we can see 2 peaks of crime through the year: May and October.

Yearly Crime Count:

  • The 13th year, which corresponds to 2015, shows far fewer crimes than the years before. This is not because the police department mounted an intensive fight against crime; the underlying reason is simply that we only have crime data through 05/13/2015.
  • More importantly, San Francisco has seen a rising crime rate since 2010: compared to 2010, the crime rate increased by 16.5% through 2014. When we investigate theft, which tops all 39 categories, it is obvious that theft has increased sharply since 2009, driving the overall rise. The vast majority of theft consists of property crimes: automobile break-ins, pickpocketing/purse-snatching and shoplifting.
  • After thorough analysis, we consider two possible causes of great significance:
  1. Electronic devices like smartphones and tablets are increasingly prevalent, and these items seem to be the easiest and most lucrative targets for thieves.
  2. Another possible reason is that victims or witnesses of crimes are now more willing to report them to the police, meaning the crime wave is actually an increase in crimes being reported rather than crimes committed.


Statistics also show that the Southern district has the highest property theft count, reaching 41845 over the 12-year period. This picture shows the yearly theft counts starting from 2003 (the full animation can be accessed via this link: https://embed.plnkr.co/GrPNfG7qQuwZwYKLr1D0/).

 


From the pie chart above, we also discovered that nearly a quarter of all crime occurs in the densely populated, transit-rich Southern Police Station district, which runs from The Embarcadero to south of Market Street. We can find further evidence by looking back at the hourly theft counts (refer to the bar chart below): the peak is 18:00 ~ 19:00, which turns out to be the evening rush hour when people are busy commuting.

[Figure: Hourly Larceny/Theft Counts]

Machine Learning

Our project went through a tough experience in attempting to build an effective and comparatively “accurate” predictive model. We mainly went through the four phases below; both the negative and positive results gained from our trials are recorded as follows.

Negative Results

  • “Robbery/Non-Robbery” Classifier

Our first trial is binary classification instead of multi-class classification. Our assumption is quite simple: we use the (DayOfWeek, PdDistrict) pair as features to predict the crime category. We choose the category ‘ROBBERY’ and use 0/1 classification labels indicating whether a crime belongs to this category.

Using these two selected features, we tried three algorithms: Naive Bayes, Logistic Regression and SVM. We use the first half of the training data for training and the second half for testing, for the simple reason that test.csv has no category label, so we can’t use it to compute accuracy. All three algorithms output the same training accuracy (0.974156) and validation accuracy (0.973455).

Why would this happen? Let’s take a look at the confusion matrix: [[427371, 0] [11654, 0]]. As we can see, the classifier simply labels everything as non-ROBBERY. The reason is that only 2.69% (23000/855049 = 0.026899) of all crimes belong to ROBBERY. Thus we can conclude that the (DayOfWeek, PdDistrict) pair is not a good predictor of crime categories.
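A self-contained sketch of this diagnosis, under the same assumptions as the pipeline sketch above (train.csv with the columns listed in the Data section):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("train.csv")
X = pd.get_dummies(df[["DayOfWeek", "PdDistrict"]])
y = (df["Category"] == "ROBBERY").astype(int)   # 1 = ROBBERY, 0 = everything else

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predictions; an all-zero second
# column means the model never predicts ROBBERY at all.
print(confusion_matrix(y_te, clf.predict(X_te)))

# Accuracy of always guessing "non-ROBBERY" -- nearly identical to the
# reported 0.97, since only ~2.7% of crimes are ROBBERY.
print("majority baseline:", 1 - y.mean())
```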

  • Three-class Classification

In the next phase, we decided to adjust the learning goal of our machine learning algorithm. Instead of a simple binary classifier outputting whether a crime is ROBBERY or NON-ROBBERY, we adopt multi-class classification, labeling each crime as LARCENY/THEFT, NON-CRIMINAL or ASSAULT. In this attempt, we still use the same (DayOfWeek, PdDistrict) feature set. The resulting accuracy is around 0.22 for all of Naive Bayes, Logistic Regression and SVM.

This is clearly not an accurate classifier. We suspect the problem is that we are using too few features, giving rise to underfitting. We could try more attributes as learning features, or try other targets, such as regional labels, which could be useful for the police department to focus more on certain regions than others, as crimes are committed more in some places.

Positive Results

  • “LARCENY/THEFT vs NON-LARCENY/THEFT” Classifier

Based on the category distribution we visualized earlier, we targeted the LARCENY/THEFT crime category, attempting to build a LARCENY/THEFT vs NON-LARCENY/THEFT classifier: this category has 174900 incidents in the training data, so we can probably get a more accurate classifier thanks to the extra information hidden in more data.

Also, we made some adjustments to the features we fed into the model. When performing feature extraction, we found that the raw geographical coordinates are not very useful: since all the crimes occurred within San Francisco, the longitude and latitude values are nearly the same, which makes it hard to discriminate between classes based on them. It seems to make more sense to use PdDistrict to convey the geographical location of crime zones.

In addition, we added a new feature indicating whether it’s daytime or night. From the visualizations, we found that crime is to some extent related to the time of day; specifically, 05:00:00 – 05:59:59 is the most peaceful time period. So we converted the “Dates” field into 4 discrete values indicating the time period when the crime occurred. Thus we had three features: Time Period, Day of Week and PdDistrict.

Moreover, in order to balance the two classes and keep the classifier from simply predicting the majority class, we only used part of the NON-LARCENY/THEFT data. By adding the above intuitions into the model-building process, the resulting accuracy of predicting whether a crime belongs to “LARCENY/THEFT” is around 60%. The performance statistics are as follows:

Binary Classification Performance on LARCENY/THEFT

                     Precision   Recall   F1-score
NON LARCENY/THEFT    0.62        0.67     0.64
LARCENY/THEFT        0.58        0.52     0.55

From the statistics above, we can see that this is much better than the previous attempt’s 22% accuracy. As the type I and type II error rates are similar for both classes, the classifier predicts both classes reasonably well. Considering that a fair number of records share identical features yet belong to different classes, the result is quite good. For the next week, we plan to optimize our features to improve the classifier’s performance; we can’t make further progress with the current features alone.

  • Ultimate Solution

Since the dataset is relatively large (over 800,000 entries), we preprocess and filter the input file before applying any learning algorithms. The basic idea is to generate a series of subsets of training and testing pairs. Furthermore, as discussed earlier, the crime distribution is uneven across categories, and the unpopular ones such as ‘GAMBLING’ and ‘BRIBERY’ may introduce outliers and decrease accuracy, so we decided to filter out those unpopular instances. The output of this preprocessing is a series of training and testing pairs for the top 20, top 10, top 5 and top 3 categories, and of course the original unfiltered training and testing pair.

After preprocessing, the next step is feature extraction. Raw input such as the date (2003-01-06) is split into 3 separate features (year, month and day). The Resolution feature is left out, since a fair number of entries have a NULL value. The address of the individual crime is left out, since there is generally about one crime per address, and the coordinates (X, Y) are left out for the same reason. However, we do keep PdDistrict as an important geo-related feature, as it is much easier to represent. The resulting feature space is [year, month, day, hour, DayOfWeek, PdDistrict].
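Here is a sketch of this extraction step, assuming pandas and the train.csv columns described in the Data section:

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["Dates"])

features = pd.DataFrame({
    "year":       df["Dates"].dt.year,
    "month":      df["Dates"].dt.month,
    "day":        df["Dates"].dt.day,
    "hour":       df["Dates"].dt.hour,
    "DayOfWeek":  df["DayOfWeek"],
    "PdDistrict": df["PdDistrict"],
})
# Resolution, Address and the raw (X, Y) coordinates are dropped,
# for the reasons discussed above.
```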

Another important optimization we utilize is one-hot encoding. The reason is that features such as DayOfWeek are categorical labels: it makes no sense to represent, for example, Friday (labeled 4) as twice as large as Wednesday (labeled 2). A better representation is a 7-bit vector with a single bit set, representing Friday as [0000100] and Wednesday as [0010000], and so on.
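A small sketch of this encoding with scikit-learn’s OneHotEncoder; the integer day labels below follow the 0 = Monday convention implied by the example above (pd.get_dummies would work equally well):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The 7 possible day-of-week labels: 0 = Monday ... 6 = Sunday.
days = np.arange(7).reshape(-1, 1)
onehot = OneHotEncoder().fit_transform(days).toarray()

print(onehot[4])  # Friday (label 4)    -> [0. 0. 0. 0. 1. 0. 0.]
print(onehot[2])  # Wednesday (label 2) -> [0. 0. 1. 0. 0. 0. 0.]
```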

Finally the dataset is ready to use. Thanks to the scikit-learn Python package, we are able to implement various machine learning algorithms easily, from basic ones like Logistic Regression to complicated ones like Random Forest. Here is the list of classifiers we have tried:

LARCENY/THEFT vs NON-LARCENY/THEFT accuracy:

Classifier            Accuracy
SVM (Linear SVC)      0.7973
Logistic Regression   0.7974
BernoulliNB           0.7973
Adaboost              0.7728
Random Forest         0.7972
Bagging               0.7974
Gradient Boosting     0.7972
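A sketch of how such a comparison can be scripted, assuming the one-hot feature matrices X_tr/X_te and labels y_tr/y_te prepared as above (default hyperparameters shown, not our tuned ones):

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

classifiers = {
    "SVM (Linear SVC)":    LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "BernoulliNB":         BernoulliNB(),
    "Adaboost":            AdaBoostClassifier(),
    "Random Forest":       RandomForestClassifier(),
    "Bagging":             BaggingClassifier(),
    "Gradient Boosting":   GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)                   # X_tr/y_tr from the steps above
    print(name, "accuracy:", clf.score(X_te, y_te))
```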

Last but not least, we revisit the failed classification attempts discussed earlier, using the models constructed above; this time we try to classify the most popular categories. Here are the results:

Classifier          Top 3     Top 5     Top 10    Top 20    All 39
SVM                 0.2907    0.2907    0.1969    0.1415    0.1477
LogReg              0.3329    0.3329    0.2389    0.2043    0.1998
BernoulliNB         0.3476    0.3476    0.2487    0.2117    0.2079
Adaboost            0.3767    0.3767    0.2662    0.2138    0.1825
Random Forest       0.2877    0.2877    0.1727    0.1559    0.1309
Bagging             0.4690    0.3412    0.2368    0.2026    0.1931
Gradient Boosting   0.4880    0.3760    0.2701    0.2316    0.2233

Analysis and Future Directions

Although it is difficult to classify most examples correctly in multi-class classification, our result is better than random guessing, as we extract several useful features like year, month, day and PdDistrict and ignore uninformative ones like X and Y. We employed several machine learning algorithms discussed in class, such as SVM, Naive Bayes and Logistic Regression. The best result we can reach when classifying the top 3 classes is around 0.48, using gradient boosting; on average we reach around 0.35.

We also tried to optimize the feature extraction, but the accuracy hit a bottleneck. The reason the accuracy of multi-class classification can’t be improved further is that the classifiers have high bias: the algorithms learn overly simple models from relatively insufficient features.

In addition, we found in the dataset quite a few examples that share the same feature combination but belong to different categories, which also caps the accuracy of multi-class classification. Also, looking at the weight vectors of the classifiers, the weight of each feature is small, indicating that our classifiers may underfit the data. One solution to the underfitting problem is to add more features related to the categories, such as the criminal environment, weather or temperature.

Moreover, as the number of categories increases, the accuracy of the classifiers drops considerably. We see two main reasons:

1) the features are not rich enough;

2) the numbers of examples per category differ greatly.

The first reason has been covered and discussed above. As for the second, we attribute it to the category distribution. As discussed earlier, training instances for the #21~#39 crime types are very few compared to the other types, which hurts accuracy by adding noise to the data. Lack of data is the most severe problem in multi-class classification: the classifier tends to label an unseen example with a category that has many examples rather than one with few. To classify better, the data should be evenly distributed among all categories.

 

Blog Post 3

Name                CS login
Qian Mei            qmei
Huan Lin            hl18
Han Sha             hsha
Xiaocheng Wang      xwang28

Machine Learning

First, a quick recap of our last attempt: we used SVM with three extracted features, namely Time, DayOfWeek and PdDistrict. For the “Time” field, we discretized the “Dates” timestamp by defining 00:00:00~06:59:59 as “Late Night”, 07:00:00~12:59:59 as “Morning”, 13:00:00~18:59:59 as “Afternoon”, and 19:00:00~23:59:59 as “Evening”, so the new field “Time” takes one of the values “Late Night”, “Morning”, “Afternoon” or “Evening”. This feature combination yields an accuracy of 0.66.
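A minimal sketch of this discretization, assuming pandas and the “Dates” format from the data description:

```python
import pandas as pd

def time_period(hour):
    """Map an hour of day to the four coarse periods described above."""
    if hour < 7:
        return "Late Night"   # 00:00:00 ~ 06:59:59
    if hour < 13:
        return "Morning"      # 07:00:00 ~ 12:59:59
    if hour < 19:
        return "Afternoon"    # 13:00:00 ~ 18:59:59
    return "Evening"          # 19:00:00 ~ 23:59:59

df = pd.read_csv("train.csv", parse_dates=["Dates"])
df["Time"] = df["Dates"].dt.hour.map(time_period)
```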

This week, we explore the “LARCENY/THEFT vs NON-LARCENY/THEFT” classifier further. This time, we decided to use more features to see whether performance can be improved. Given the insights from the visualizations, we consider the “Dates” field quite informative. Thus, we transformed “Dates”, with the format “YYYY-MM-DD HH:MM:SS”, into 4 separate time features: Hour, DayOfWeek, Month, Year. The final feature space is therefore (Hour, DayOfWeek, Month, Year, PdDistrict).

We plan to use Logistic Regression for the simple reason that, compared to SVM, it outputs not only a label but also a probability. Considering the later multi-class stage, we may also use Logistic Regression to train a separate classifier for each crime category; when testing an unseen record, we can run all the classifiers on it and assign the category with the highest probability.
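A sketch of this one-vs-rest idea, assuming feature matrices X_tr/X_te and a label series y_tr prepared as before (scikit-learn’s OneVsRestClassifier wraps the same pattern):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

categories = ["LARCENY/THEFT", "NON-CRIMINAL", "ASSAULT"]   # example subset

# One binary logistic model per category: "is this crime of category c?"
models = {c: LogisticRegression(max_iter=1000).fit(X_tr, y_tr == c)
          for c in categories}

# For each test record, pick the category whose model is most confident.
probs = np.column_stack([models[c].predict_proba(X_te)[:, 1] for c in categories])
pred = [categories[i] for i in probs.argmax(axis=1)]
```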

Through a number of experiments, we made some interesting discoveries.

Focusing on the Data belonging to Top 20 Crime Categories

We did some simple statistical analysis of the distribution of the 39 crime categories in the data.

————Top 1~20——————–

LARCENY/THEFT,174900
OTHER OFFENSES,126182
NON-CRIMINAL,92304
ASSAULT,76876
DRUG/NARCOTIC,53971
VEHICLE THEFT,53781
VANDALISM,44725
WARRANTS,42214
BURGLARY,36755
SUSPICIOUS OCC,31414
MISSING PERSON,25989
ROBBERY,23000
FRAUD,16679
FORGERY/COUNTERFEITING,10609
SECONDARY CODES,9985
WEAPON LAWS,8555
PROSTITUTION,7484
TRESPASS,7326
STOLEN PROPERTY,4540
SEX OFFENSES FORCIBLE,4388

————–TOP 21~39———————–
DISORDERLY CONDUCT,4320
DRUNKENNESS,4280
RECOVERED VEHICLE,3138
KIDNAPPING,2341
DRIVING UNDER THE INFLUENCE,2268
RUNAWAY,1946
LIQUOR LAWS,1903
ARSON,1513
LOITERING,1225
EMBEZZLEMENT,1166
SUICIDE,508
FAMILY OFFENSES,491
BAD CHECKS,406
BRIBERY,289
EXTORTION,256
SEX OFFENSES NON FORCIBLE,148
GAMBLING,146
PORNOGRAPHY/OBSCENE MAT,22
TREA,6

As you can see, the top 20 categories cover 851677 incidents in the training data, accounting for 96.99% of the total 878050. So we may consider the #21~#39 crime types as “outliers” or “exceptional cases”; getting rid of them and focusing on the top 20 categories might reduce some noise in the training data.

Splitting Data

Since the test set provided by Kaggle has no category label, that test data means little to us in the current situation. Thus, we have to split the original training set, which contains 878050 records in total, into new training data and test data.

We found that the accuracy of the classifier depends largely on the composition of the training data, more specifically the balance between the positive and negative classes. To begin with, we used all the training records and split them by selecting 1/3 of the total as test data and the rest as training data.

The confusion matrix is as follows:

[[223063      0]
 [ 57991      0]]
The problem here is that we have 676777 negative cases, since every record that doesn’t belong to “LARCENY/THEFT” is counted as class 0, while we only have 174900 positive cases. It’s easy to see that such training data is badly imbalanced. The solution is to adjust the proportion, so we take only a subset of 200000 negative cases.
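A sketch of this balancing step with pandas (df is the full training DataFrame; 200000 matches the figure above):

```python
import pandas as pd

pos = df[df["Category"] == "LARCENY/THEFT"]      # ~174900 positive rows
neg = df[df["Category"] != "LARCENY/THEFT"]      # ~677000 negative rows

# Keep only 200000 of the negatives, then shuffle the union.
neg = neg.sample(n=200000, random_state=0)
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=0)
```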

Feature Transformation

We also performed some feature transformation. As mentioned, we are using (Hour, DayOfWeek, Month, Year, PdDistrict), all five of which are discrete values. However, in order to fit the Logistic Regression model, we apply the following conversion:
Hour: an integer between 0~23
DayOfWeek: 1 represents Monday, 2 represents Tuesday, and so on, so the domain of this attribute is {1,2,3,4,5,6,7}
Month: 1 represents January, 2 represents February, and so on, so the domain is {1,2,...,12}
Year: the data ranges from 1/1/2003 to 5/13/2015, spanning 13 calendar years, so 1 represents 2003, 2 represents 2004, and so on; the domain of the Year attribute is {1,2,...,13}
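A sketch of these conversions with pandas (the day-name mapping is an assumption about the raw “DayOfWeek” strings):

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["Dates"])

dow = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
       "Friday": 5, "Saturday": 6, "Sunday": 7}

df["Hour"]      = df["Dates"].dt.hour           # 0 ~ 23
df["DayOfWeek"] = df["DayOfWeek"].map(dow)      # 1 ~ 7
df["Month"]     = df["Dates"].dt.month          # 1 ~ 12
df["Year"]      = df["Dates"].dt.year - 2002    # 2003 -> 1, ..., 2015 -> 13
```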

In this way, the features are much closer to the same scale, and it indeed works: if we use the original year values (2003~2015), the accuracy is only 0.54, but after the transformation the accuracy reaches 0.8.

By adopting the above strategies, our classifier for predicting whether a crime record belongs to “LARCENY/THEFT” or “NON LARCENY/THEFT” reached 0.8. The detailed performance statistics are listed below:

Train Accuracy: 0.792103764984
Test Accuracy:  0.790845235497
Confusion Matrix:
[[60087  6174]
 [19702 37754]]
Classification Report:
                    precision   recall   f1-score
Non LARCENY/THEFT   0.75        0.91     0.82
LARCENY/THEFT       0.86        0.66     0.74
avg/total           0.80        0.79     0.79

Failed Trial and Negative Results

We tried to further narrow the classes under investigation down to the top 10, whose training instance counts range from 174900 down to 31414, to see whether we could perform multi-class classification there.

From the category statistics mentioned above, we need to adjust the proportion of the different class instances in the training data, preferably toward an even distribution. So we build the new training data in several steps (a code sketch follows the list):

  1. Split the data into the top 10 classes according to the “Category” field value.
  2. For each class “.csv” file, shuffle the records, then select 1/3 as training data and 1/3 as test data.
  3. Merge the per-class training data into the whole training set, and likewise merge the individual test sets into the final test set.
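A pandas sketch of these steps, shown on a single DataFrame df for brevity (in practice we work with per-class “.csv” files):

```python
import pandas as pd

top10 = df["Category"].value_counts().index[:10]

train_parts, test_parts = [], []
for cat in top10:
    rows = df[df["Category"] == cat].sample(frac=1, random_state=0)  # shuffle
    third = len(rows) // 3
    train_parts.append(rows.iloc[:third])           # first 1/3 -> training
    test_parts.append(rows.iloc[third:2 * third])   # second 1/3 -> testing

train = pd.concat(train_parts)
test = pd.concat(test_parts)
```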

We found that if we didn’t balance the negative and positive class distributions, the accuracies of the top 10 per-category binary classifiers came out as follows:

#1
Class Distribution in Train:
Positive class num: 87449 Negative class num: 279109
Class Distribution in Test:
Positive class num: 87449 Negative class num: 279113
weight vector: [[ 0.02757002 0.01937428 0.0108423 0.04621315 -0.02015453]]
Train Acc: 0.7614320244
Test Acc: 0.761434627703
————-
#2
Class Distribution in Train:
Positive class num: 63091 Negative class num: 303467
Class Distribution in Test:
Positive class num: 63091 Negative class num: 303471
weight vector: [[-0.01379186 -0.03205096 -0.01228121 -0.00454187 0.02104554]]
Train Acc:0.827882627033
Test Acc:0.827884505213
————-
#3
Class Distribution in Train:
Positive class num: 46152 Negative class num: 320406
Class Distribution in Test:
Positive class num: 46152 Negative class num: 320410
weight vector:[[-0.01391621 0.0148329 0.00530832 0.04136114 0.00258047]]
Train Acc:0.874093595011
Test Acc:0.874094968927
————-
#4 (format similar to the above)
38438 328120
38438 328124
[[-0.02003717 0.02809951 0.00260936 -0.00377337 -0.00031551]]
0.895138013629
0.895139157905
————-
#5 (format similar to the above)
26985 339573
26986 339576
[[ 0.01246909 -0.05272665 -0.01737082 -0.05191246 0.02687313]]
0.926382727972
0.926380803247
————-
#6 (format similar to the above)
26890 339668
26891 339671
[[ 0.03214748 0.0258096 0.00812616 -0.11997263 -0.02422655]]
0.926641895689
0.926639968136
————-
#7 (format similar to the above)
22362 344196
22363 344199
[[ 0.00110817 0.03945355 0.0056513 0.00589651 0.00063329]]
0.938994647505
0.938992585156
————-
#8 (format similar to the above)
21107 345451
21107 345455
[[-0.00940511 -0.03385116 -0.01594655 -0.02091812 0.01988797]]
0.94241838945
0.942419017792
————-
#9 (format similar to the above)
18377 348181
18378 348184
[[-0.01654324 -0.02860837 0.00411129 -0.01296279 -0.03588328]]
0.949866051212
0.949863870232
————-
#10 (format similar to the above)
15707 350851
15707 350855
[[-0.02428364 -0.01588374 0.00211298 0.02220584 0.01938171]]
0.957150028099
0.957150495687
————-
But when we check the confusion matrices, both FP and TP are zero: the classifiers never predict the positive class, so they are not good classifiers. So we balance the positive and negative classes, and the results go like this:
#1 (format similar to the above)
117182 100000
57716 100000
[[ 0.03306154 0.03686975 0.01638531 0.03228388 -0.02917931]]
0.566211748672
0.486171345964
[[32453 67547]
[13492 44224]]
————-
#2 (format similar to the above)
84541 70000
41641 70000
[[-0.03271586 -0.04191189 -0.0186753 -0.03818411 0.03111692]]
0.568813454035
0.481418117
[[21415 48585]
[ 9310 32331]]
————-
#3 (format similar to the above)
61843 70000
30461 70000
[[-0.03532301 -0.00697401 -0.00432802 0.0012246 0.0137239 ]]
0.555509204129
0.622052338719
[[52994 17006]
[20963 9498]]
————-
#4 (format similar to the above)
51506 50000
25370 50000
[[-0.03906219 0.00205233 -0.00434782 -0.03714409 0.01437165]]
0.560232892637
0.561178187608
[[27958 22042]
[11032 14338]]
————-
#5 (format similar to the above)
36160 40000
17811 40000
[[-0.0103175 -0.07263776 -0.02290346 -0.08670964 0.03871946]]
0.570168067227
0.599176627285
[[26233 13767]
[ 9405 8406]]
————-
#6 (format similar to the above)
36033 30000
17748 30000
[[ 0.00940647 0.0025305 0.00200557 -0.12973415 -0.00914027]]
0.607408417004
0.564400603167
[[14352 15648]
[ 5151 12597]]
————-
#7 (format similar to the above)
29965 25000
14760 25000
[[-0.0171227 0.01607077 -0.001303 -0.02918711 0.01266467]]
0.546002001274
0.424295774648
[[ 4018 20982]
[ 1908 12852]]
————-
#8 (format similar to the above)
28283 20000
13931 20000
[[-0.03186482 -0.05198147 -0.0226214 -0.05187081 0.03208369]]
0.596524656711
0.467890719401
[[ 3585 16415]
[ 1640 12291]]
————-
#9 (format similar to the above)
24625 20000
12130 20000
[[-0.03797261 -0.04693154 -0.00353107 -0.04696549 -0.01818957]]
0.577008403361
0.49533146592
[[ 6481 13519]
[ 2696 9434]]
————-
#10 (format similar to the above)
21047 15000
10367 15000
[[-0.04610261 -0.0394896 -0.00536594 -0.01265957 0.03169399]]
0.594224207285
0.470217211338
[[ 2762 12238]
[ 1201 9166]]

Although the accuracy is not as good as above, the confusion matrices make more sense.
Besides that, we tried both Logistic Regression and Random Forest as multi-class classifiers, and the accuracy is only around 0.2.
So we think an important reason the other classes can’t perform very well is that there isn’t much information in the training data. If we use the unbalanced dataset, the classifier naively predicts the negative class; if we use the balanced dataset, the training data is not informative enough to build a decent predictive model.

What to do next

We may consider using other attributes such as “Descript”. Since it is a text feature that captures the essence of the crime, we could apply text analysis to derive class-indicative keywords and improve accuracy.
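For instance, a hedged sketch of deriving keyword features from “Descript” with TF-IDF; this is a possible direction, not something we have implemented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=500, stop_words="english")
X_text = vec.fit_transform(df["Descript"])   # df: the training DataFrame

# The highest-weighted terms per class could serve as class-indicative
# keywords, or X_text can simply be stacked with the existing features.
```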

Visualization

After a couple of unsuccessful attempts at finding an appropriate basemap and mapping the X, Y coordinates of the data points, we have decided to change our design. Our machine learning results suggested that Year is the most relevant feature and that, compared to the X, Y coordinates, PdDistrict is more relevant for the binary classification of our crime categories. Based on this, we have decided to shift our focus to mapping the total incidents of each crime type on a San Francisco contour map on a yearly basis. If time permits, we will also explore a monthly view option.


Blog Post 2

 

Name                CS login
Qian Mei            qmei
Huan Lin            hl18
Han Sha             hsha
Xiaocheng Wang      xwang28

Machine Learning

Last week, we tried to build a multi-class classifier that labels each observation as LARCENY/THEFT, NON-CRIMINAL or ASSAULT. It turned out that the performance, with an accuracy of 22% on the test set, was not promising.

So this week, we targeted the LARCENY/THEFT crime category, attempting to build a LARCENY/THEFT vs NON-LARCENY/THEFT classifier: this category has 174900 incidents in the training data, so we can probably get a more accurate classifier thanks to the extra information hidden in more data.

Also, we made some adjustments to the features we fed into the model. When performing feature extraction, we found that the raw geographical coordinates are not very useful: since all the crimes occurred within San Francisco, the longitude and latitude values are nearly the same, which makes it hard to discriminate between classes based on them. It seems to make more sense to use PdDistrict to convey the geographical location of crime zones.

In addition, we added a new feature indicating whether it’s daytime or night. From the visualizations, we found that crime is to some extent related to the time of day; specifically, 05:00:00 – 05:59:59 is the most peaceful time period.

So we converted the “Dates” field into 4 discrete values indicating the time period when the crime occurred, giving us three features: Time Period, Day of Week and PdDistrict. Moreover, in order to balance the two classes and keep the classifier from simply predicting the majority class, we only used part of the NON-LARCENY/THEFT data.

By adding the above intuitions into the model-building process, the resulting accuracy of predicting whether a crime belongs to “LARCENY/THEFT” is around 60%. The performance statistics are as follows:

NON LARCENY/THEFT:

Precision: 0.62
Recall: 0.67
F1-score: 0.64

LARCENY/THEFT:

Precision: 0.58
Recall: 0.52
F1-score: 0.55

From the statistics above, we can see that this is much better than the previous attempt’s 22% accuracy. As the type I and type II error rates are similar for both classes, the classifier predicts both classes reasonably well. Considering that a fair number of records share identical features yet belong to different classes, the result is quite good.

For the next week, we plan to optimize our features to improve the classifier’s performance; we can’t make further progress with the current features alone.

Visualization

As you can see, we’ve made many visualizations presenting different aspects of the data, and some of them indeed helped us a lot during feature engineering. But we also realize that we may need to drop those that don’t reveal much about the patterns or interesting aspects of the data.

Thus, in the next phase of the project, not only will we try to improve the accuracy of our prediction model, but we may also rework the visualizations into ones that are more informative and interpretable for the final presentation, either by adopting more effective visualization methods or by adding animations.

Blog Post 1

Name                CS login
Qian Mei            qmei
Huan Lin            hl18
Han Sha             hsha
Xiaocheng Wang      xwang28

In the last version, we only visualized the raw original data, in which the patterns remained hidden from us, and we didn’t make any significant discoveries. Thus, this week we improved our visualizations and tried other predictive models based on the insights they provided.

Crime Map

To begin with, we made a better crime map and fixed the limited-plotting bug from last time. We attempted one of the most promising ways of creating geographical mappings for large datasets: using R and its ggmap library. One of our previous problems with D3 was that no sufficiently detailed San Francisco topographic basemap is available. This means that plotting the SF contours alone would take a tremendous amount of time, let alone adding more than 10k data points onto the basemap; in fact, the browser crashed when the number of data points exceeded 3k.


The ggmap library is easy to use: the OpenStreetMap package that contains the SF basemap can be used directly in R, which means that generating the following 10-year data on Prostitution and Sex Offenses Forcible crimes took us less than 20 seconds.

However, we have found it is almost impossible to create interactive geographical mappings using R and serve them up on the client side the way D3 does. There may be other alternatives, but for now it looks like static .jpg files are the best R can generate.

We will keep looking for other solutions in the next week or so. Hopefully we can come across a framework that maps data geographically and efficiently without compromising functionality.

Pattern of Date/Time

In light of the idea that date/time serves as an important independent feature of crime, we further derived the hour, month and year from the “Dates” field, and normalized the hourly and monthly crime counts using the standard-score formula (x − mean(X)) / std(X).

We picked the top 10 crimes. Although their total crime counts vary a lot, similar patterns emerge after applying normalization.
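In code, the normalization is a one-liner, where counts is a hypothetical pandas Series of monthly (or hourly) crime counts for one category:

```python
# Standard-score normalization of the per-period counts.
normalized = (counts - counts.mean()) / counts.std()
```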

[Figure: Monthly Crime Counts of Top 10 Crimes]
[Figure: Normalized Monthly Crime Counts of Top 10 Crimes]

Comparing the two graphs above, we found there are 2 peaks of crime throughout the year: May and October.

Similarly, the hourly crime counts suggest the same kind of pattern after normalization.

[Figure: Hourly Crime Counts of Top 10 Crimes]
[Figure: Normalized Hourly Crime Counts of Top 10 Crimes]

Finding this pattern may help us produce training features. Next week we plan to build predictive models using 5 independent features: hour, dayofweek, month, year and PdDistrict.

One more idea concerns how to deal with the “Resolution” field. According to common sense, crimes that went unresolved might be more likely to occur again, since the perpetrator wasn’t caught. So we may group crimes into a binary category, Resolved/Unresolved, to see whether it helps make more accurate predictions.

Machine Learning

One useful prediction to make is the crime category from Day of Week and PD District. We have explored Naive Bayes, Logistic Regression and SVM thus far. The result is high accuracy (>0.97); however, we conclude it is a bad attempt after looking into the confusion matrix: the classifier simply labels every entry as negative. The reason is that, taking ROBBERY as an example, only 2.69% (23000/855049 = 0.026899) of all crimes belong to ROBBERY.

During one of our group discussions, we decided to adjust the learning goal of our machine learning algorithm. Instead of a simple binary classifier outputting whether a crime is ROBBERY or NON-ROBBERY, we adopt multi-class classification, labeling each crime as LARCENY/THEFT, NON-CRIMINAL or ASSAULT. In this attempt, we still use the same (DayOfWeek, PdDistrict) feature set. The resulting accuracy is around 0.22 for all of Naive Bayes, Logistic Regression and SVM.

This is clearly not an accurate classifier. We plan to investigate the issue further in our final project, possibly by using other columns of the data as learning features or trying other targets, such as regional labels, which could help the police department focus more on certain regions than others, as crimes are committed more in some places. If it turns out that the crime category is indeed hard to predict, we will try to analyze and explain why, and possibly draw conclusions about what is missing in the data that leads to the bad performance. All three algorithms (Naive Bayes, Logistic Regression, SVM) can still be exploited to predict the crime category.

 

 

 

San Francisco Crime Classification

Introduction

This dataset contains incidents derived from the San Francisco Police Department Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015, and has already been divided into a training set and a test set. The training set contains 878050 records, each representing an incident, and the test set contains 884263. The goal of this project is to train a predictive model and use it to predict which category each record in the test set belongs to. Here is a subset of the training data:


The incidents have the following attributes:

Dates – timestamp of the crime incident, format YYYY-MM-DD HH:MM:SS
Category – category of the crime incident (39 categories in total)
Descript – detailed description of the crime incident
DayOfWeek – the day of the week, 7 values
PdDistrict – name of the Police Department District
Resolution – how the crime incident was resolved
Address – the approximate street address of the crime incident
X – Longitude
Y – Latitude

Data Cleaning

Since the data is well formatted, we did not perform further data cleaning, which might discard useful information and hurt accuracy. So at this stage, we use all the provided data in the following visualization and machine learning processes.

Data Visualization

In order to dig out correlations between the attributes listed above, we created several visualizations showing the different dimensions of the train set.

Part 1 : Category

So to begin with, we’d like to know the distribution of the crime categories and answer the question: what are the most common crimes? As mentioned, there are 39 crime categories in total, and we marked each category as follows:


Observation:
We intend to visualize the crime categories ordered from the highest count to the lowest. Since there are too many categories, the idea is to pick the top 10 crime categories and see whether any interesting pattern exists within a specific category.

Part 2 : DayofWeek

First, we’d also like to see if there is a correlation between the attribute “DayOfWeek” and crime counts. So the question here is: are most crimes committed on weekdays or on the weekend?

Actually, the distribution of “dayofweek” vs “crime counts” is comparatively even. No extreme value appears on a specific day, with the highest crime count, 133734, on Friday and the lowest, 116707, on Sunday.

Part 3 : Hour

Since no obvious trend can be seen when considering only the day, we’d like to explore the time of each occurrence for the top 10 most common crimes in San Francisco, so we made a dashboard for better visualization.
The following picture shows the plot of “hour” vs “crime counts” for all ten categories.


Then we move on to: given a specific category, what is the distribution of “hour” vs “crime counts”?

[1] LARCENY/THEFT
[2] OTHER OFFENSES
[3] NON-CRIMINAL
[4] ASSAULT
[5] DRUG/NARCOTIC
[6] VEHICLE THEFT
[7] VANDALISM
[8] WARRANTS
[9] BURGLARY
[10] SUSPICIOUS OCC

Observation:

  • 05:00:00 – 05:59:59 is the most peaceful time period, with the fewest crime incidents
  • Crimes are more likely to take place after 12pm than before 12pm
  • There are three peak periods for committing crimes: midnight at around 12am, noon at around 12pm, and the evening period “17:00-18:00”

Also, given a specific hour period, we can check the distribution of “category” vs “crime counts”. For example, “00” represents incidents taking place between “00:00:00” and “00:59:59”. You can click on the gallery to see the detailed statistics in the legend table of each visualization.

Observation:

The overall distributions tend to be even and similar: the bigger a category’s share of the overall distribution, the bigger its share within any specific hour period. For example, in each hour of the day, LARCENY/THEFT, OTHER OFFENSES, NON-CRIMINAL and ASSAULT account for the large majority of the data.

Part 4 : Police Department District

We are using pie chart visualization to find the correlation between the crime category and the PD District.

We target 5 PD Districts (Southern, Northern, Bayview, Central, Others) and 6 crime categories: WARRANTS(Orange), VANDALISM(Green), VEHICLE THEFT(Purple), ROBBERY(Gray), LARCENY/THEFT(Blue), BURGLARY(Red) and illustrate their proportions on a pie chart.

Observation:
Each PD District has slightly different proportions of the crime categories, but the overall distribution is fairly even. The most common crime category is LARCENY/THEFT, shown as the blue region of the pie chart.

Part 5 : Longitude & Latitude

We have explored the relationship between reported crimes on Prostitution and Sex Offenses Forcible (SOF) by geographically mapping the related data from 2010 – 2015 on a map of San Francisco. Each red dot represents one reported SOF crime, while each blue dot represents a reported Prostitution case.

[Figures, one per year:]
2010 – Prostitution and SOF
2011 – Prostitution and SOF
2012 – Prostitution and SOF
2013 – Prostitution and SOF
2014 – Prostitution and SOF
2015 – Prostitution and SOF
[Figure: Map for referencing districts in San Francisco]

It is easy to see from the maps that during the past five years, crimes on Prostitution aggregated almost exclusively in the downtown and Inner Mission areas (8 and 9 on the reference map). Although SOF cases seem scattered everywhere on the map, they also aggregate more densely in areas 8 and 9. However, it is very interesting to see that in area 9, SOF crimes hardly overlap with Prostitution geographically. One possible conclusion from this observation is that Prostitution may help reduce the rate of SOF.

Part 6: Parallel Coordinates of High-Dimensional Data

We are using parallel coordinates to find correlations among all the dimensions except ‘Descript’ and ‘Address’ for each category, which means the following visualization covers six-dimensional data: “Dates”, “DayOfWeek”, “PdDistrict”, “Resolution”, “X” and “Y”. Here we target four crime categories: ‘WARRANTS’, ‘DRUNKENNESS’, ‘LARCENY/THEFT’ and ‘KIDNAPPING’.

[Figure: WARRANTS]

[Figure: DRUNKENNESS]

[Figure: LARCENY/THEFT]

[Figure: KIDNAPPING]

Observation:
The crime category is strongly related to ‘Resolution’: for instance, ‘WARRANTS’ is mainly resolved by ‘Arrest, Booked’, ‘None’ and ‘Arrest Cited’ in descending order, while ‘KIDNAPPING’ is resolved by ‘Arrest, Booked’, ‘None’ and ‘District Attorney Refuses to Prosecute’. For other fields like ‘PdDistrict’ and ‘DayOfWeek’, the visualization doesn’t indicate a strong relationship with the category.

Machine Learning

Assumption

Our first trial, at the current stage of the project, is binary classification rather than multi-class classification. Our assumption is quite simple: we use the (DayOfWeek, PdDistrict) pair as features to predict the crime category. We choose the category ‘ROBBERY’ and use 0/1 classification labels indicating whether a crime belongs to this category.

Training Algorithms

We are using supervised classification, and we have tried three algorithms so far: Naive Bayes, Logistic Regression and SVM.

We use the first half of the training data for training and the second half for testing, for the simple reason that test.csv has no category label, so we can’t use it to compute accuracy.

All three algorithms output the same training and validation accuracy:
Training accuracy: 0.974156
Validation accuracy: 0.973455

Analysis

Why does this happen? Let’s look into the confusion matrix:

[[427371      0]
 [ 11654      0]]

As we can see, the classifier simply labels everything as non-ROBBERY. The reason is that only 2.69% (23000/855049 = 0.026899) of all crimes belong to ROBBERY. Thus we can conclude that the (DayOfWeek, PdDistrict) pair is not a good predictor of crime categories.

Discussion

What’s hardest part of the project that you’ve encountered so far?

  • The dataset is huge, with roughly 880 thousand labeled training records and 880 thousand test records, which adds challenges to both the visualization and the training process.
  • The project is initially a multi-class classification rather than a binary one, which is much more difficult. We are considering converting it into a binary classification, such as whether or not a specific incident is resolved (the target variable becomes “Resolution” rather than “Category”).
  • As for cross-validation, we initially intended to use 5-fold validation, which makes more sense, but our implementation had some bugs due to limited time. We will fix it at a later stage.

What are your initial insights?

  • Our initial insights are proposed with each visualization above. We have tried to explore the relationship between the different variables and the target label. It turns out that some of them work while others seem useless, showing no obvious trends. But at least it sheds light on where we should focus: attributes like “PdDistrict”, “Resolution” and the “hour” derived from the “Dates” variable appear to have large significance in determining the category of each incident.

Are there any concrete results you can show at this point? If not, why not?

  • At this stage we can’t show very instructive results. However, we have at least tested our first assumption and concluded that the (DayOfWeek, PdDistrict) pair is not a good predictor for crime categories. In a later stage, adding other dimensions of the data may render better performance and accuracy.

Going forward, are the current biggest problems you’re facing?

  • We are on track and well planned for the rest of the project, so going forward is not the biggest problem we are facing.

Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

  • Yes, we are definitely on track with our project. But we need to dedicate more time to the machine learning parts, integrating the insights we’ve gained into feature selection and extraction so that we can further validate our ideas.

Given your initial exploration of the data, is it worth proceeding with your project?

  • Yes, it’s worth proceeding with the project, since we are discovering a number of interesting patterns through visualization. As long as we keep improving our training algorithms, we think we can make more progress and refine the massive raw data into little gems of genuinely useful insight.