
On behavior of lbjava when dealing with real valued features.


Mixed feature experiments on LBJava examples

There has been speculation that LBJava fails when using real-valued features.

Thanks to @Slash0BZ (Ben Zhou), we ran a comprehensive set of experiments on a few example problems, with different numbers of real-valued features, across different algorithms and different numbers of training iterations.

NewsGroup Experiments

Methodology

The data I used was from http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz

I downloaded and unzipped the data and used the script https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/20news_data_parser.py to move a randomly selected 90% of the files (within each label) to a training data path and the rest to a testing data path.
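
For reference, the same per-label 90/10 split could be done with a short Java program like the sketch below (directory names are illustrative; the linked Python script is what was actually used):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class NewsgroupSplitter {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("20news-18828");     // one sub-directory per newsgroup label
        Path train  = Paths.get("data/20news/train");
        Path test   = Paths.get("data/20news/test");
        Random rand = new Random();
        try (DirectoryStream<Path> labels = Files.newDirectoryStream(source)) {
            for (Path label : labels) {
                List<Path> docs = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(label)) {
                    for (Path f : files) docs.add(f);
                }
                Collections.shuffle(docs, rand);
                int cut = (int) (docs.size() * 0.9);  // 90% train, 10% test
                for (int i = 0; i < docs.size(); i++) {
                    Path destDir = (i < cut ? train : test).resolve(label.getFileName());
                    Files.createDirectories(destDir);
                    Files.copy(docs.get(i), destDir.resolve(docs.get(i).getFileName()));
                }
            }
        }
    }
}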

The parsed data can be read and used directly through https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/java/edu/illinois/cs/cogcomp/lbjava/examples/DocumentReader.java

Then I first tested the original NewsGroupClassifier.lbj, defined at https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/lbj/NewsGroupClassifier.lbj

Then I added a single constant real feature to each example, for instance:

real[] RealFeatureConstant(Document d) <- { int k = 3; sense k; }

Using int here is fine because the LBJava parser converts the value to double.
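
For comparison, an explicitly double-typed version of the same constant feature (purely illustrative, not used in the experiments) would be:

real[] RealFeatureConstantExplicit(Document d) <- { double k = 3.0; sense k; }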

Then I tried multiple real features randomly generated from a Gaussian distribution. The code I used was:

import java.util.List;
import java.util.Random;
real[] GaussianRealFeatures(Document d) <- {
    List words = d.getWords();
    Random ran = new Random();
    // one Gaussian-distributed real feature (scaled by 10) per word in the document
    for (int i = 0; i < words.size() - 1; i++)
        sense ran.nextGaussian() * 10;
}

I used bash scripts to run tests for several algorithms. To make this easier, I wrote a Python script that modifies the .lbj file; it can be found at https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/modify_lbj_file.py
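
For each configuration, the classifier declaration in the .lbj file ends up looking roughly like the sketch below. This is a hand-written approximation of the linked NewsGroupClassifier.lbj, not its exact contents: the feature list after "using" and the number of rounds are what the script varies, and the learner name is swapped for each algorithm under test.

discrete NewsGroupClassifier(Document d) <-
    learn NewsGroupLabel
    using BagOfWords, GaussianRealFeatures   // mixed discrete and real features
    from new DocumentReader("data/20news/train") 10 rounds
    with SparseAveragedPerceptron
end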

Result Tables

NewsGroup (table for a single real feature)

| Condition \ Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | SparseConfidenceWeighted | BinaryMIRA |
|---|---|---|---|---|---|
| 1 round w/o real features | 48.916 | 92.597 | 19.038 | – | 33.739 |
| 1 round w/ real features | 47.753 | 92.491 | 23.268 | – | 32.364 |
| 10 rounds w/o real features | 82.390 | 91.539 | 24.802 | – | 76.891 |
| 10 rounds w/ real features | 82.126 | 91.529 | 12.427 | – | 75.939 |
| 50 rounds w/o real features | 84.823 | 91.592 | 14.120 | – | 77.208 |
| 50 rounds w/ real features | 85.299 | 91.433 | 19.566 | – | 76.891 |
| 100 rounds w/o real features | 85.828 | 91.433 | 12.956 | – | 76.574 |
| 100 rounds w/ real features | 84.770 | 91.486 | 15.442 | – | 61.026 |

All values are classification accuracies (%) on the test set. No results are reported for SparseConfidenceWeighted; see Problems below.

NewsGroup (table for the same number of Gaussian random real features as discrete ones)

| Condition \ Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | BinaryMIRA |
|---|---|---|---|---|
| 1 round w/o real features | 51.454 | 92.597 | 12.057 | 33.739 |
| 1 round w/ real features | 17.980 | 6.081 | 14.913 | 14.225 |
| 10 rounds w/o real features | 82.813 | 91.539 | 22.369 | 76.891 |
| 10 rounds w/ real features | 52.829 | – | 42.517 | 45.743 |
| 50 rounds w/o real features | 84.294 | 91.592 | 21.100 | 77.208 |
| 50 rounds w/ real features | 75.727 | – | 67.054 | 75.198 |
| 100 rounds w/o real features | 85.506 | 91.433 | 17.768 | 76.574 |
| 100 rounds w/ real features | 77.631 | – | 74.828 | 74.194 |

Missing SparseWinnow entries correspond to runs that crashed with a NullPointerException during testing; see Problems below.

Problems

  1. SparseConfidenceWeighted ran into a problem where training took too long on my server; I had to kill the process after waiting for a long time, so no results are reported for it.

  2. SparseWinnow throws a NullPointerException during testing (testDiscrete()) when multiple real features are added and the number of training rounds is greater than 1.

Badges Experiments

Methodology

The setup is similar to the NewsGroup experiments above, and the constant real features are defined in the same way.

The multiple constant real features are added through:

real[] RealFeatures3(String line) <- {
    // one constant real feature (value 3) per character of the input line
    for(int i = 0; i < line.length(); i++){
        int k = 3;
        sense k;
    }
}

The multiple Gaussian random real features are added through:

import java.util.Random;
real[] GaussianRealFeatures(String line) <- {
    Random ran = new Random();
    // one Gaussian-distributed real feature (scaled by 10) per character of the input line
    for (int i = 0; i < line.length(); i++){
        sense ran.nextGaussian() * 10;
    }
}

Result Tables

Badges (table for a single real feature)

| Condition \ Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
|---|---|---|---|
| 1 round w/o real features | 100.0 | 95.745 | 100.0 |
| 1 round w/ real features | 100.0 | 95.745 | 100.0 |
| 10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 10 rounds w/ real features | 100.0 | 100.0 | 100.0 |
| 50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
| 100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 100 rounds w/ real features | 100.0 | 100.0 | 100.0 |

Badges (table for the same number of constant real features as discrete features)

| Condition \ Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
|---|---|---|---|
| 1 round w/o real features | 100.0 | 95.745 | 100.0 |
| 1 round w/ real features | 74.468 | 100.0 | 100.0 |
| 10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 10 rounds w/ real features | 78.723 | 100.0 | 100.0 |
| 50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
| 100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 100 rounds w/ real features | 100.0 | 100.0 | 100.0 |

Badges (table for the same number of random Gaussian real features as discrete features)

| Condition \ Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
|---|---|---|---|
| 1 round w/o real features | 100.0 | 95.745 | 100.0 |
| 1 round w/ real features | 55.319 | 56.383 | 100.0 |
| 10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 10 rounds w/ real features | 62.766 | 100.0 | 100.0 |
| 50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 50 rounds w/ real features | 74.468 | 87.234 | 100.0 |
| 100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
| 100 rounds w/ real features | 86.170 | 100.0 | 100.0 |

Conclusions

The conclusion is that, as more real-valued features are added, more training iterations are needed to train the system, and there is no clear issue with real-valued features themselves.

Mixed feature experiments on the POS tagger

Introduction

The goal of this experiment is to append, for each word, the 25-dimensional real-valued phrase similarity vector defined at https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-phrasesim/blob/master/src/main/java/edu/illinois/cs/cogcomp/sim/PhraseSim.java to the end of the original feature vector in POSTaggerKnown, defined at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown.java

Methodology

Since the code that is supposed to be generated by LBJava from the .lbj files has been modified manually to a certain extent, and many of those changes lack documentation, I was unable to directly modify https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/lbj/POSKnown.lbj and replace the original classes with newly generated ones. So I took another approach: I first generated a paragamVector class from the following feature definition:


real[] paragamVector(Token w) <- {
    PhraseSim ps = null;
    try{
        ps = PhraseSim.getInstance();
    }
    catch (FileNotFoundException e){
        // the phrase similarity resource could not be loaded; ps stays null
    }
    double[] vec = ps.getVector(w.form);
    // one real feature per dimension of the 25-dimensional phrase similarity vector
    for(int i = 0; i < vec.length; i++){
        sense vec[i];
    }
}

The generated class can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/paragamVector.java

I manually added some code so that the generated class handles the case where a word does not have a corresponding vector.
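
The guard is roughly of the following shape; this is a sketch of the idea rather than the exact code in the generated class, and it assumes getVector returns null for words without a vector:

real[] paragamVector(Token w) <- {
    PhraseSim ps = null;
    try {
        ps = PhraseSim.getInstance();
    } catch (FileNotFoundException e) {
        // embedding resource missing: leave ps null and emit no real features
    }
    if (ps != null) {
        double[] vec = ps.getVector(w.form);
        // skip words that have no corresponding phrase similarity vector
        if (vec != null) {
            for (int i = 0; i < vec.length; i++)
                sense vec[i];
        }
    }
}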

Then I modified the code in POSTaggerKnown$$1.java, which is the extractor for the POSTaggerKnown class. The code can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown%24%241.java
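
Conceptually, the change appends the paragamVector features to whatever the original extractor produces. The following self-contained sketch shows that composition; it assumes the usual LBJava FeatureVector.addFeatures method and the generated paragamVector.classify(Object) signature, and the real POSTaggerKnown$$1 is more involved than this.

import edu.illinois.cs.cogcomp.lbjava.classify.FeatureVector;
import edu.illinois.cs.cogcomp.pos.lbjava.paragamVector;

public class MixedFeatureExtractorSketch {
    private final paragamVector phraseSimFeatures = new paragamVector();

    // Combine the features produced by the original extractor with the 25 phrase similarity features.
    public FeatureVector extract(Object example, FeatureVector originalFeatures) {
        FeatureVector result = new FeatureVector();
        result.addFeatures(originalFeatures);                     // original discrete POS features
        result.addFeatures(phraseSimFeatures.classify(example));  // appended real-valued features
        return result;
    }
}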

With some further changes to make the code work on my local machine, POSTrain could be run without errors.

Results

Original Result, 50 rounds of training, default settings


Label   Precision Recall    F1    LCount PCount
------------------------------------------------
#          100.000 100.000 100.000     15     15
$          100.000 100.000 100.000    943    943
''         100.000  99.904  99.952   1045   1044
,          100.000 100.000 100.000   6876   6876
-LRB-      100.000 100.000 100.000    186    186
-RRB-      100.000 100.000 100.000    187    187
.          100.000 100.000 100.000   5381   5381
:          100.000 100.000 100.000    752    752
CC          99.661  99.631  99.646   3250   3249
CD          99.230  98.860  99.044   4823   4805
DT          99.444  99.204  99.324  11183  11156
EX          94.656  98.413  96.498    126    131
FW          30.769  26.667  28.571     30     26
IN          96.644  98.607  97.615  13492  13766
JJ          91.992  91.029  91.508   8215   8129
JJR         80.042  89.125  84.340    423    471
JJS         92.336  94.757  93.530    267    274
LS          88.889  53.333  66.667     15      9
MD          99.293  99.763  99.528   1267   1273
NN          95.464  96.647  96.052  17834  18055
NNP         97.486  95.340  96.401  13177  12887
NNPS        26.254  52.353  34.971    170    339
NNS         98.086  97.891  97.988   8061   8045
PDT         61.702  65.909  63.736     44     47
POS         97.614  99.373  98.485   1276   1299
PRP         99.502  99.773  99.638   2205   2211
PRP$        99.532  99.532  99.532   1068   1068
RB          93.296  89.716  91.471   4405   4236
RBR         77.119  67.159  71.795    271    236
RBS         83.077  78.261  80.597     69     65
RP          71.279  68.766  70.000    397    383
SYM        100.000  90.909  95.238     11     10
TO          99.966 100.000  99.983   2913   2914
UH          76.923  58.824  66.667     17     13
VB          96.326  93.927  95.111   3573   3484
VBD         96.055  94.497  95.270   4561   4487
VBG         91.480  92.757  92.114   1933   1960
VBN         86.730  90.543  88.596   2707   2826
VBP         93.459  92.204  92.827   1565   1544
VBZ         96.880  96.476  96.677   2639   2628
WDT         97.798  91.267  94.420    584    545
WP          98.587  98.587  98.587    283    283
WP$        100.000 100.000 100.000     37     37
WRB         99.344  99.671  99.507    304    305
``         100.000 100.000 100.000   1074   1074
------------------------------------------------
Accuracy    96.532    -       -      -    129654

Result with additional features, 50 rounds of training, default settings


Label   Precision Recall    F1    LCount PCount
------------------------------------------------
#          100.000 100.000 100.000     15     15
$          100.000 100.000 100.000    943    943
''         100.000  99.617  99.808   1045   1041
,          100.000 100.000 100.000   6876   6876
-LRB-      100.000 100.000 100.000    186    186
-RRB-      100.000 100.000 100.000    187    187
.          100.000 100.000 100.000   5381   5381
:          100.000 100.000 100.000    752    752
CC          99.569  99.538  99.554   3250   3249
CD          98.823  99.191  99.007   4823   4841
DT          99.223  99.294  99.258  11183  11191
EX          92.647 100.000  96.183    126    136
FW          50.000  16.667  25.000     30     10
IN          95.896  99.059  97.452  13492  13937
JJ          92.989  90.091  91.517   8215   7959
JJR         86.649  78.251  82.236    423    382
JJS         95.785  93.633  94.697    267    261
LS          90.909  66.667  76.923     15     11
MD          98.521  99.921  99.216   1267   1285
NN          95.861  95.968  95.915  17834  17854
NNP         97.922  94.399  96.128  13177  12703
NNPS        27.350  75.294  40.125    170    468
NNS         98.423  97.531  97.975   8061   7988
PDT         53.226  75.000  62.264     44     62
POS         97.027  99.765  98.377   1276   1312
PRP         99.053  99.592  99.322   2205   2217
PRP$        99.532  99.625  99.579   1068   1069
RB          94.251  87.832  90.928   4405   4105
RBR         66.369  82.288  73.476    271    336
RBS         77.500  89.855  83.221     69     80
RP          66.514  73.048  69.628    397    436
SYM         90.000  81.818  85.714     11     10
TO          99.966 100.000  99.983   2913   2914
UH          90.909  58.824  71.429     17     11
VB          96.666  94.123  95.377   3573   3479
VBD         95.989  94.979  95.482   4561   4513
VBG         89.980  94.309  92.094   1933   2026
VBN         85.827  92.168  88.885   2707   2907
VBP         91.791  94.313  93.035   1565   1608
VBZ         96.045  96.628  96.335   2639   2655
WDT         95.922  92.637  94.251    584    564
WP          98.566  97.173  97.865    283    279
WP$        100.000 100.000 100.000     37     37
WRB         99.671  99.671  99.671    304    304
``         100.000 100.000 100.000   1074   1074
------------------------------------------------
Accuracy    96.420    -       -      -    129654

Result validation

I also validated the result to make sure that the new 25-dimensional vector is indeed added to the feature vector. I first checked the size of the lexicon and confirmed a growth of 25, from 51359 unique features to 51384 unique features.

Then I also checked the feature values and confirmed that the discrete features take values 1.0 or 0.0, while the real features carry double values.

Conclusions

The introduction of the new feature vector slightly decreased the performance of the POS tagger (accuracy dropped from 96.532 to 96.420). Thanks to Daniel and Bhargav, who greatly helped and guided me through this process.
