Blog Post 3

Name              CS login
Qian Mei          qmei
Huan Lin          hl18
Han Sha           hsha
Xiaocheng Wang    xwang28

Machine Learning

First, a quick recap of our last attempt: we used an SVM with three extracted features, namely Time, DayOfWeek, and PdDistrict. For the "Time" feature, we discretized the "Date" field by defining 00:00:00~06:59:59 as "Late Night", 07:00:00~12:59:59 as "Morning", 13:00:00~18:59:59 as "Afternoon", and 19:00:00~23:59:59 as "Evening". The new "Time" field is thus one of "Late Night", "Morning", "Afternoon", or "Evening". This feature combination achieved an accuracy of 0.66.
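
As a concrete illustration, here is a minimal sketch of that discretization in Python (the file name, the "Dates" column, and the time_of_day helper follow the Kaggle data layout and our own naming; the actual pipeline may differ):

    import pandas as pd

    def time_of_day(hour):
        # Map an hour of day (0-23) to one of the four coarse buckets above.
        if hour < 7:
            return "Late Night"   # 00:00:00~06:59:59
        elif hour < 13:
            return "Morning"      # 07:00:00~12:59:59
        elif hour < 19:
            return "Afternoon"    # 13:00:00~18:59:59
        return "Evening"          # 19:00:00~23:59:59

    train = pd.read_csv("train.csv", parse_dates=["Dates"])
    train["Time"] = train["Dates"].dt.hour.map(time_of_day)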

This week, we explore the "LARCENY/THEFT vs. NON LARCENY/THEFT" classifier further. This time, we decided to use more features to see whether performance could be improved. Given the insight from our visualizations, we consider the "Date" field quite informative. We therefore transformed the "Date" field, which has the format "YYYY-MM-DD HH:MM:SS", into four separate time features: Hour, DayOfWeek, Month, and Year. The final feature space is thus (Hour, DayOfWeek, Month, Year, PdDistrict).
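
A sketch of that decomposition with pandas (column names follow the Kaggle data; "DayOfWeek_num" is our own name, since the raw "DayOfWeek" column holds day names as strings):

    import pandas as pd

    train = pd.read_csv("train.csv", parse_dates=["Dates"])

    # Decompose the timestamp into four discrete time features.
    train["Hour"] = train["Dates"].dt.hour                     # 0-23
    train["DayOfWeek_num"] = train["Dates"].dt.dayofweek + 1   # 1 = Monday ... 7 = Sunday
    train["Month"] = train["Dates"].dt.month                   # 1-12
    train["Year"] = train["Dates"].dt.year                     # 2003-2015

    features = train[["Hour", "DayOfWeek_num", "Month", "Year", "PdDistrict"]]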

We plan to use Logistic Regression for a simple reason: unlike an SVM, it outputs not only a label but also a probability. Looking ahead to the multi-class stage, we can use Logistic Regression to train a separate classifier for each crime category; when classifying an unseen record, we run all the classifiers on it and assign the category whose classifier reports the highest probability.
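
A minimal sketch of that one-vs-rest scheme with scikit-learn (here "categories", "X_train", and "y_train" are assumed to already hold the category names, the encoded feature matrix, and the category labels):

    from sklearn.linear_model import LogisticRegression

    # Train one binary Logistic Regression per crime category.
    classifiers = {}
    for category in categories:
        clf = LogisticRegression()
        clf.fit(X_train, (y_train == category).astype(int))
        classifiers[category] = clf

    def predict_category(x):
        # Run every classifier and keep the most confident positive probability.
        probs = {c: clf.predict_proba([x])[0][1] for c, clf in classifiers.items()}
        return max(probs, key=probs.get)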

Through a number of experiments, we made some interesting discoveries.

Focusing on the Data Belonging to the Top 20 Crime Categories

We did some simple statistical analysis of how the 39 crime categories are distributed in the data.

———— Top 1~20 ————

LARCENY/THEFT,174900
OTHER OFFENSES,126182
NON-CRIMINAL,92304
ASSAULT,76876
DRUG/NARCOTIC,53971
VEHICLE THEFT,53781
VANDALISM,44725
WARRANTS,42214
BURGLARY,36755
SUSPICIOUS OCC,31414
MISSING PERSON,25989
ROBBERY,23000
FRAUD,16679
FORGERY/COUNTERFEITING,10609
SECONDARY CODES,9985
WEAPON LAWS,8555
PROSTITUTION,7484
TRESPASS,7326
STOLEN PROPERTY,4540
SEX OFFENSES FORCIBLE,4388

———— Top 21~39 ————
DISORDERLY CONDUCT,4320
DRUNKENNESS,4280
RECOVERED VEHICLE,3138
KIDNAPPING,2341
DRIVING UNDER THE INFLUENCE,2268
RUNAWAY,1946
LIQUOR LAWS,1903
ARSON,1513
LOITERING,1225
EMBEZZLEMENT,1166
SUICIDE,508
FAMILY OFFENSES,491
BAD CHECKS,406
BRIBERY,289
EXTORTION,256
SEX OFFENSES NON FORCIBLE,148
GAMBLING,146
PORNOGRAPHY/OBSCENE MAT,22
TREA,6

As you can see, the top 20 categories account for 851,677 incidents in the training data, or 96.99% of the 878,050 total. We may therefore regard crime types 21~39 as "outliers" or "exceptional cases"; getting rid of them and focusing on the top 20 categories should reduce some noise in the training data.
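
A minimal sketch of that filter (continuing with the pandas DataFrame "train" and the Kaggle "Category" column; names are ours):

    # Keep only records whose category is among the 20 most frequent.
    top20 = train["Category"].value_counts().nlargest(20).index
    train_top20 = train[train["Category"].isin(top20)]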

Splitting Data

Since the test set provided by Kaggle has no category label, the test data means little to us in the current situation. We therefore have to split the original training set, which contains 884,263 records in total, into new training data and test data.

We found that the accuracy of the classifier depends heavily on the composition of the training data, more specifically on the balance between the positive and negative classes. To begin with, we used all 884,263 training records and split them by selecting 1/3 of the total as test data and the rest as training data.

The confusion matrix is as follows:
[[223063      0]
 [ 57991      0]]
The problem here is that we have 676,777 negative cases, since any record that does not belong to "LARCENY/THEFT" is counted as class 0, but only 174,900 positive cases. With such a skew, the classifier simply predicts the majority class for everything, as the confusion matrix shows, so this training data is not good. Our solution is to adjust the proportion: we keep only 200,000 of the negative cases.
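
A sketch of that downsampling plus split (pandas and scikit-learn assumed; the feature columns use the encodings described in the next section, and the random seeds are arbitrary):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Binary target: 1 for LARCENY/THEFT, 0 for everything else.
    train["label"] = (train["Category"] == "LARCENY/THEFT").astype(int)

    pos = train[train["label"] == 1]                                   # 174,900 positives
    neg = train[train["label"] == 0].sample(n=200000, random_state=0)  # keep 200,000 negatives
    balanced = pd.concat([pos, neg])

    X = balanced[["Hour", "DayOfWeek_num", "Month", "Year_idx", "PdDistrict_idx"]]
    y = balanced["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)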

Feature Transformation

We also performed some feature transformations. As mentioned, we are using (Hour, DayOfWeek, Month, Year, PdDistrict), all five of which are discrete values. However, in order to fit the Logistic Regression model, we have to do the following conversions:
Hour: an integer from 0 to 23.
DayOfWeek: 1 represents Monday, 2 represents Tuesday, and so on, so the domain of this attribute is {1, 2, 3, 4, 5, 6, 7}.
Month: 1 represents January, 2 represents February, and so on, so the domain is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
Year: the data ranges from 1/1/2003 to 5/13/2015, spanning 13 calendar years. So 1 represents 2003, 2 represents 2004, and so on; the domain of the Year attribute is {1, 2, ..., 13}.
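
A sketch of those conversions (pandas assumed; "Year_idx" and "PdDistrict_idx" are our own column names, and the district label encoding is added here for completeness, since Logistic Regression needs numeric inputs):

    # Map day names (the raw "DayOfWeek" column) to {1,...,7}.
    day_map = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
               "Friday": 5, "Saturday": 6, "Sunday": 7}
    train["DayOfWeek_num"] = train["DayOfWeek"].map(day_map)

    # Shift years so that 2003 -> 1, ..., 2015 -> 13.
    train["Year_idx"] = train["Dates"].dt.year - 2002

    # Label-encode the police district.
    district_map = {d: i + 1 for i, d in enumerate(sorted(train["PdDistrict"].unique()))}
    train["PdDistrict_idx"] = train["PdDistrict"].map(district_map)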

In this way, the features are much closer to the same scale, and it turned out that this indeed works: with the original year values (2003~2015) the accuracy was only 0.54, but after the transformation the accuracy reached 0.8.

By adopting the above strategies, our classifier for predicting whether a crime record belongs to "LARCENY/THEFT" or "NON LARCENY/THEFT" reached an accuracy of 0.8. The detailed performance statistics are listed below:

Train Accuracy: 0.792103764984
Test Accuracy:  0.790845235497

Confusion Matrix:
[[60087  6174]
 [19702 37754]]

Classification Report:
                     precision    recall    f1-score
Non LARCENY/THEFT         0.75      0.91        0.82
LARCENY/THEFT             0.86      0.66        0.74
avg / total               0.80      0.79        0.79

Failed Trial and Negative Results

We tried to narrow the classes under investigation down further, to the top 10 (whose training instance counts range from 174,900 down to 31,414), to see whether we could perform multi-class classification there.

Given the category statistics above, we need to adjust the proportion of instances of the different classes in the training data, preferably toward an even distribution. So we built the new training data in several steps (a code sketch follows the list):

  1. Split the data into the top 10 classes according to the "Category" field value.
  2. For each class's ".csv" file, shuffle the records, then select 1/3 as training data and 1/3 as test data.
  3. Merge the per-class training data into the whole training set, and likewise merge the per-class test data into the final test set.
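
A minimal sketch of those steps ("top10_files" is a hypothetical list of the per-class file names from step 1):

    import pandas as pd

    train_parts, test_parts = [], []
    for name in top10_files:
        df = pd.read_csv(name).sample(frac=1, random_state=0)  # shuffle (step 2)
        n = len(df) // 3
        train_parts.append(df.iloc[:n])        # first 1/3 -> training
        test_parts.append(df.iloc[n:2 * n])    # next 1/3 -> testing

    # Step 3: merge the per-class pieces.
    train_all = pd.concat(train_parts, ignore_index=True)
    test_all = pd.concat(test_parts, ignore_index=True)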

We found that if we did not balance the negative and positive class distributions, the accuracies of the ten binary crime classifiers were as follows:

#1
Class Distribution in Train:
Positive class num: 87449  Negative class num: 279109
Class Distribution in Test:
Positive class num: 87449  Negative class num: 279113
weight vector: [[ 0.02757002 0.01937428 0.0108423 0.04621315 -0.02015453]]
Train Acc: 0.7614320244
Test Acc: 0.761434627703
————-
#2
Class Distribution in Train:
Positive class num: 63091  Negative class num: 303467
Class Distribution in Test:
Positive class num: 63091  Negative class num: 303471
weight vector: [[-0.01379186 -0.03205096 -0.01228121 -0.00454187 0.02104554]]
Train Acc:0.827882627033
Test Acc:0.827884505213
————-
#3
Class Distribution in Train:
Positive class num: 46152  Negative class num: 320406
Class Distribution in Test:
Positive class num: 46152  Negative class num: 320410
weight vector:[[-0.01391621 0.0148329 0.00530832 0.04136114 0.00258047]]
Train Acc:0.874093595011
Test Acc:0.874094968927
————-
#4(format similar to the above)
38438 328120
38438 328124
[[-0.02003717 0.02809951 0.00260936 -0.00377337 -0.00031551]]
0.895138013629
0.895139157905
————-
#5 (format similar to the above)
26985 339573
26986 339576
[[ 0.01246909 -0.05272665 -0.01737082 -0.05191246 0.02687313]]
0.926382727972
0.926380803247
————-
#6 (format similar to the above)
26890 339668
26891 339671
[[ 0.03214748 0.0258096 0.00812616 -0.11997263 -0.02422655]]
0.926641895689
0.926639968136
————-
#7 (format similar to the above)
22362 344196
22363 344199
[[ 0.00110817 0.03945355 0.0056513 0.00589651 0.00063329]]
0.938994647505
0.938992585156
————-
#8 (format similar to the above)
21107 345451
21107 345455
[[-0.00940511 -0.03385116 -0.01594655 -0.02091812 0.01988797]]
0.94241838945
0.942419017792
————-
#9 (format similar to the above)
18377 348181
18378 348184
[[-0.01654324 -0.02860837 0.00411129 -0.01296279 -0.03588328]]
0.949866051212
0.949863870232
————-
#10 (format similar to the above)
15707 350851
15707 350855
[[-0.02428364 -0.01588374 0.00211298 0.02220584 0.01938171]]
0.957150028099
0.957150495687
————-
But when we checked the confusion matrices, both FP and TP were zero: each classifier simply predicts the negative class for everything, so the high accuracies above are meaningless. After balancing the positive and negative classes, the results look like this:
#1 (format similar to the above)
117182 100000
57716 100000
[[ 0.03306154 0.03686975 0.01638531 0.03228388 -0.02917931]]
0.566211748672
0.486171345964
[[32453 67547]
[13492 44224]]
————-
#2 (format similar to the above)
84541 70000
41641 70000
[[-0.03271586 -0.04191189 -0.0186753 -0.03818411 0.03111692]]
0.568813454035
0.481418117
[[21415 48585]
[ 9310 32331]]
————-
#3 (format similar to the above)
61843 70000
30461 70000
[[-0.03532301 -0.00697401 -0.00432802 0.0012246 0.0137239 ]]
0.555509204129
0.622052338719
[[52994 17006]
[20963 9498]]
————-
#4 (format similar to the above)
51506 50000
25370 50000
[[-0.03906219 0.00205233 -0.00434782 -0.03714409 0.01437165]]
0.560232892637
0.561178187608
[[27958 22042]
[11032 14338]]
————-
#5 (format similar to the above)
36160 40000
17811 40000
[[-0.0103175 -0.07263776 -0.02290346 -0.08670964 0.03871946]]
0.570168067227
0.599176627285
[[26233 13767]
[ 9405 8406]]
————-
#6 (format similar to the above)
36033 30000
17748 30000
[[ 0.00940647 0.0025305 0.00200557 -0.12973415 -0.00914027]]
0.607408417004
0.564400603167
[[14352 15648]
[ 5151 12597]]
————-
#7 (format similar to the above)
29965 25000
14760 25000
[[-0.0171227 0.01607077 -0.001303 -0.02918711 0.01266467]]
0.546002001274
0.424295774648
[[ 4018 20982]
[ 1908 12852]]
————-
#8 (format similar to the above)
28283 20000
13931 20000
[[-0.03186482 -0.05198147 -0.0226214 -0.05187081 0.03208369]]
0.596524656711
0.467890719401
[[ 3585 16415]
[ 1640 12291]]
————-
#9 (format similar to the above)
24625 20000
12130 20000
[[-0.03797261 -0.04693154 -0.00353107 -0.04696549 -0.01818957]]
0.577008403361
0.49533146592
[[ 6481 13519]
[ 2696 9434]]
————-
#10 (format similar to the above)
21047 15000
10367 15000
[[-0.04610261 -0.0394896 -0.00536594 -0.01265957 0.03169399]]
0.594224207285
0.470217211338
[[ 2762 12238]
[ 1201 9166]]

Although the accuracies are not as good as the ones above, the confusion matrices make much more sense.
Besides that, we tried both Logistic Regression and a Random Forest as multi-class classifiers; the accuracy was only around 0.2.
So we think an important reason the other classes can't perform very well is that there isn't much information in the training data: if we use the unbalanced dataset, the classifier naively predicts "no", and if we use the balanced dataset, the training data is not informative enough to build a decent, predictive model.

What to do next

We may consider using other attributes like "Description". Since it is a text feature tied to the essence of the crime, maybe we can perform text analysis on it and derive class-indicative keywords to improve accuracy.
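
As one possible starting point (not something we have implemented yet), class-indicative keywords could be ranked with a chi-squared test over bag-of-words counts; note that in the Kaggle data the column is actually named "Descript":

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    vec = CountVectorizer(min_df=5)
    X_text = vec.fit_transform(train["Descript"])
    y = (train["Category"] == "LARCENY/THEFT").astype(int)

    # Rank words by how strongly they associate with the positive class.
    scores, _ = chi2(X_text, y)
    order = np.argsort(scores)[::-1][:20]
    print(np.array(vec.get_feature_names_out())[order])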

Visualization

After a couple of unsuccessful attempts at finding an appropriate basemap and mapping the X, Y coordinates of the data points, we have decided to change our design. Our machine learning results suggested that Year is the most relevant feature and that, compared to the X, Y coordinates, PdDistrict is more relevant for binary classification of our crime categories. Based on this, we have decided to shift our focus to mapping the total incidents of each crime type on the San Francisco contour map on a yearly basis. If time permits, we will also explore a monthly view option.

[Figure: great_map]
