# On the behavior of LBJava when dealing with real-valued features
There has been speculation that LBJava fails when using real-valued features.
Thanks to @Slash0BZ (Ben Zhou), we ran a comprehensive set of experiments on a few example problems, with different numbers of real-valued features, across different algorithms and different numbers of training iterations.
The data I used was from http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
I downloaded and unzipped the data and used the script https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/20news_data_parser.py to move a randomly selected 90% of the files (in each category) to a training data path and the rest to a testing data path.
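For reference, a minimal Java sketch of the same per-category 90/10 split (the actual script is the linked Python file; the directory names here are hypothetical):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitNewsgroups {
    public static void main(String[] args) throws IOException {
        // Hypothetical directory names; the linked Python script uses its own paths.
        File root = new File("20news-18828");
        for (File category : root.listFiles(File::isDirectory)) {
            List<File> files = Arrays.asList(category.listFiles());
            Collections.shuffle(files);            // random selection within the category
            int cut = (int) (files.size() * 0.9);  // first 90% go to training
            for (int i = 0; i < files.size(); i++) {
                File destDir = new File(i < cut ? "train" : "test", category.getName());
                destDir.mkdirs();
                Files.move(files.get(i).toPath(),
                           new File(destDir, files.get(i).getName()).toPath());
            }
        }
    }
}
```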
The parsed data can be read and used directly through https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/java/edu/illinois/cs/cogcomp/lbjava/examples/DocumentReader.java
I first tested the original NewsGroupClassifier.lbj, defined at https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/lbj/NewsGroupClassifier.lbj
Then I added a constant real feature to each example, for instance:
```
real[] RealFeatureConstant(Document d) <- { int k = 3; sense k; }
```
The `int` here is fine because the LBJava parser will convert the value to a double.
Then I tried multiple real features randomly generated from a Gaussian distribution. The code I used was:
```
import java.util.Random;

real[] GaussianRealFeatures(Document d) <- {
    List words = d.getWords();
    Random ran = new Random();
    for (int i = 0; i < words.size() - 1; i++)
        sense ran.nextGaussian() * 10;
}
```
I used bash scripts to run the tests on several algorithms, and wrote a Python script that modifies the .lbj file to make this easier. The script can be found at https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/modify_lbj_file.py
NewsGroup (table for single real feature)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | SparseConfidenceWeighted | BinaryMIRA |
---|---|---|---|---|---|
1 round w/o real features | 48.916 | 92.597 | 19.038 | 33.739 | |
1 round w/ real features | 47.753 | 92.491 | 23.268 | 32.364 | |
10 rounds w/o real features | 82.390 | 91.539 | 24.802 | 76.891 | |
10 rounds w/ real features | 82.126 | 91.529 | 12.427 | 75.939 | |
50 rounds w/o real features | 84.823 | 91.592 | 14.120 | 77.208 | |
50 rounds w/ real features | 85.299 | 91.433 | 19.566 | 76.891 | |
100 rounds w/o real features | 85.828 | 91.433 | 12.956 | 76.574 | |
100 rounds w/ real features | 84.770 | 91.486 | 15.442 | 61.026 | |
NewsGroup (table with the same number of Gaussian random real features as discrete ones)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | BinaryMIRA |
---|---|---|---|---|
1 round w/o real features | 51.454 | 92.597 | 12.057 | 33.739 |
1 round w/ real features | 17.980 | 6.081 | 14.913 | 14.225 |
10 rounds w/o real features | 82.813 | 91.539 | 22.369 | 76.891 |
10 rounds w/ real features | 52.829 | | 42.517 | 45.743 |
50 rounds w/o real features | 84.294 | 91.592 | 21.100 | 77.208 |
50 rounds w/ real features | 75.727 | | 67.054 | 75.198 |
100 rounds w/o real features | 85.506 | 91.433 | 17.768 | 76.574 |
100 rounds w/ real features | 77.631 | | 74.828 | 74.194 |

The empty SparseWinnow cells correspond to the NullPointerException described below.
### Problems

- `SparseConfidenceWeighted` encountered a problem where training takes too long on my server; I had to kill the process after waiting for a long time.
- `SparseWinnow` will throw a NullPointerException during testing (`testDiscrete()`) once multiple real features are added, if the number of training rounds is larger than 1.
For the Badges example, the constant real feature is the same as in the NewsGroup experiments above.
The multiple constant real features are added through:

```
real[] RealFeatures3(String line) <- {
    for (int i = 0; i < line.length(); i++) {
        int k = 3;
        sense k;
    }
}
```
The random multiple real features are added through:

```
import java.util.Random;

real[] GaussianRealFeatures(String line) <- {
    for (int i = 0; i < line.length(); i++) {
        Random ran = new Random();
        sense ran.nextGaussian() * 10;
    }
}
```
### Result Tables
Badges (table for single real feature)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 100.0 | 95.745 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 100.0 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table with the same number of constant real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 74.468 | 100.0 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 78.723 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table with the same number of random Gaussian real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 55.319 | 56.383 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 62.766 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 74.468 | 87.234 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 86.170 | 100.0 | 100.0 |
The conclusion from these experiments is that as more real-valued features are added, more training iterations are needed to train the system; there is no clear issue with real-valued features themselves.
The goal of the next experiment was to append, for each word, the 25-dimensional real-valued phrase similarity vector defined at https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-phrasesim/blob/master/src/main/java/edu/illinois/cs/cogcomp/sim/PhraseSim.java to the end of the original feature vector in POSTaggerKnown, defined at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown.java
Since the code that is supposed to be generated by LBJava from the .lbj files has been manually modified to a certain extent, and many of the changes lack documentation, I was unable to directly modify https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/lbj/POSKnown.lbj and replace the original classes with the newly generated ones. So I took another approach: I first generated a paragamVector class from the following code:
```
real[] paragamVector(Token w) <- {
    PhraseSim ps = null;
    try {
        ps = PhraseSim.getInstance();
    } catch (FileNotFoundException e) {
    }
    double[] vec = ps.getVector(w.form);
    for (int i = 0; i < vec.length; i++) {
        sense vec[i];
    }
}
```
The generated class can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/paragamVector.java
I manually added some code so that it handles the case where a word does not have a corresponding vector.
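Those manual changes are not documented in the issue; a minimal sketch of one way to add such a guard at the .lbj level, assuming PhraseSim.getVector(...) returns null for out-of-vocabulary words, would be:

```
real[] paragamVector(Token w) <- {
    PhraseSim ps = null;
    try {
        ps = PhraseSim.getInstance();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    // Assumption: getVector(...) returns null when no vector exists for the word.
    double[] vec = (ps == null) ? null : ps.getVector(w.form);
    if (vec != null)
        for (int i = 0; i < vec.length; i++)
            sense vec[i];
    // When vec is null, the word simply contributes no real features.
}
```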
Then I modified the code in POSTaggerKnown$$1.java, which is the extractor for the POSTaggerKnown class. The code can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown%24%241.java
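Schematically, the change amounts to appending the new real features to the FeatureVector that the generated classify(...) method builds. This is a sketch under that assumption, not the actual generated code; `wordForm` stands in for the original set of child classifiers:

```java
// Sketch only: generated LBJava extractors compose child classifiers
// into a single FeatureVector via addFeatures(...).
public FeatureVector classify(Object __example) {
    FeatureVector __result = new FeatureVector();
    // Original child classifiers, represented here by wordForm:
    __result.addFeatures(new wordForm().classify(__example));
    // Appended at the end: the 25 real-valued paragamVector senses.
    __result.addFeatures(new paragamVector().classify(__example));
    return __result;
}
```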
With some further changes to make the code work on my local machine, POSTrain ran without errors, producing the following results:
Label | Precision | Recall | F1 | LCount | PCount |
---|---|---|---|---|---|
# | 100.000 | 100.000 | 100.000 | 15 | 15 |
$ | 100.000 | 100.000 | 100.000 | 943 | 943 |
'' | 100.000 | 99.904 | 99.952 | 1045 | 1044 |
, | 100.000 | 100.000 | 100.000 | 6876 | 6876 |
-LRB- | 100.000 | 100.000 | 100.000 | 186 | 186 |
-RRB- | 100.000 | 100.000 | 100.000 | 187 | 187 |
. | 100.000 | 100.000 | 100.000 | 5381 | 5381 |
: | 100.000 | 100.000 | 100.000 | 752 | 752 |
CC | 99.661 | 99.631 | 99.646 | 3250 | 3249 |
CD | 99.230 | 98.860 | 99.044 | 4823 | 4805 |
DT | 99.444 | 99.204 | 99.324 | 11183 | 11156 |
EX | 94.656 | 98.413 | 96.498 | 126 | 131 |
FW | 30.769 | 26.667 | 28.571 | 30 | 26 |
IN | 96.644 | 98.607 | 97.615 | 13492 | 13766 |
JJ | 91.992 | 91.029 | 91.508 | 8215 | 8129 |
JJR | 80.042 | 89.125 | 84.340 | 423 | 471 |
JJS | 92.336 | 94.757 | 93.530 | 267 | 274 |
LS | 88.889 | 53.333 | 66.667 | 15 | 9 |
MD | 99.293 | 99.763 | 99.528 | 1267 | 1273 |
NN | 95.464 | 96.647 | 96.052 | 17834 | 18055 |
NNP | 97.486 | 95.340 | 96.401 | 13177 | 12887 |
NNPS | 26.254 | 52.353 | 34.971 | 170 | 339 |
NNS | 98.086 | 97.891 | 97.988 | 8061 | 8045 |
PDT | 61.702 | 65.909 | 63.736 | 44 | 47 |
POS | 97.614 | 99.373 | 98.485 | 1276 | 1299 |
PRP | 99.502 | 99.773 | 99.638 | 2205 | 2211 |
PRP$ | 99.532 | 99.532 | 99.532 | 1068 | 1068 |
RB | 93.296 | 89.716 | 91.471 | 4405 | 4236 |
RBR | 77.119 | 67.159 | 71.795 | 271 | 236 |
RBS | 83.077 | 78.261 | 80.597 | 69 | 65 |
RP | 71.279 | 68.766 | 70.000 | 397 | 383 |
SYM | 100.000 | 90.909 | 95.238 | 11 | 10 |
TO | 99.966 | 100.000 | 99.983 | 2913 | 2914 |
UH | 76.923 | 58.824 | 66.667 | 17 | 13 |
VB | 96.326 | 93.927 | 95.111 | 3573 | 3484 |
VBD | 96.055 | 94.497 | 95.270 | 4561 | 4487 |
VBG | 91.480 | 92.757 | 92.114 | 1933 | 1960 |
VBN | 86.730 | 90.543 | 88.596 | 2707 | 2826 |
VBP | 93.459 | 92.204 | 92.827 | 1565 | 1544 |
VBZ | 96.880 | 96.476 | 96.677 | 2639 | 2628 |
WDT | 97.798 | 91.267 | 94.420 | 584 | 545 |
WP | 98.587 | 98.587 | 98.587 | 283 | 283 |
WP$ | 100.000 | 100.000 | 100.000 | 37 | 37 |
WRB | 99.344 | 99.671 | 99.507 | 304 | 305 |
\`\` | 100.000 | 100.000 | 100.000 | 1074 | 1074 |
Accuracy | 96.532 | - | - | - | 129654 |
I also validated the result to make sure the new 25-dimensional vector is actually added to the feature vector. I first checked the size of the lexicon and confirmed a growth of 25, from 51359 to 51384 unique features.
I then checked the feature values and confirmed that the discrete features take values 1.0/0.0 while the real features carry arbitrary double values.
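A hedged sketch of that lexicon-size check, assuming the trained learner exposes its lexicon via LBJava's Learner.getLexicon() (this usage is an assumption, not taken from the issue):

```java
import edu.illinois.cs.cogcomp.pos.lbjava.POSTaggerKnown;

public class LexiconCheck {
    public static void main(String[] args) {
        POSTaggerKnown tagger = new POSTaggerKnown();
        // Assumption: getLexicon() returns the learner's feature lexicon.
        // Expected: 51359 unique features without paragamVector, 51384 with it.
        System.out.println("lexicon size: " + tagger.getLexicon().size());
    }
}
```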
The introduction of the new feature vector slightly improved the performance of the POS tagger. Thanks to Daniel and Bhargav, who greatly helped and guided me through this process.