# On the behavior of LBJava when dealing with real-valued features
There was speculation that LBJava might fail when using real-valued features.
Thanks to @Slash0BZ (Ben Zhou), we ran a comprehensive set of experiments on a few example problems, with different numbers of real-valued features, across different algorithms, and with different numbers of training iterations.
The data I used was from http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
I downloaded and unzipped the data and used the script https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/20news_data_parser.py to move a randomly selected 90% of the files (in each newsgroup) to a training data path and the rest to a testing data path.
The parsed data can be read and used directly through https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/java/edu/illinois/cs/cogcomp/lbjava/examples/DocumentReader.java
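The reader can be exercised on the split data with a minimal driver like the sketch below. This is an illustration only: the directory argument, the Parser-style next() loop, and the Document cast are assumptions based on how DocumentReader and getWords() are used in the .lbj snippets, not documented guarantees.

```java
import edu.illinois.cs.cogcomp.lbjava.examples.Document;
import edu.illinois.cs.cogcomp.lbjava.examples.DocumentReader;

public class ReadNewsSketch {
    public static void main(String[] args) {
        // Assumed directory: the training path produced by the split script.
        DocumentReader train = new DocumentReader("data/20news/train");
        int documents = 0;
        long words = 0;
        for (Object example = train.next(); example != null; example = train.next()) {
            Document d = (Document) example;
            words += d.getWords().size();  // token list the discrete features are built from
            documents++;
        }
        System.out.println(documents + " documents, " + words + " tokens");
    }
}
```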
I first tested the original NewsGroupClassifier.lbj, defined at https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/lbj/NewsGroupClassifier.lbj
Then I added a constant real feature to each of the examples; for example:

```
real[] RealFeatureConstant(Document d) <- { int k = 3; sense k; }
```

Using `int` here is fine because the LBJava parser transforms the data type to double.
Then I tried multiple real features randomly generated from a Gaussian distribution. The code I used was:
```
import java.util.Random;

real[] GaussianRealFeatures(Document d) <- {
    List words = d.getWords();
    Random ran = new Random();
    for (int i = 0; i < words.size() - 1; i++)
        sense ran.nextGaussian() * 10;
}
```
I used bash scripts to run the tests on several algorithms, and I wrote a Python script that modifies the .lbj file to make things easier. The script can be found at https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/modify_lbj_file.py
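The actual automation lives in the linked Python script; purely as an illustration of the kind of edit it performs, a Java sketch might look like the following. The file path and the literal "10 rounds" token are placeholders and assumptions about how the .lbj file is written, not a description of the real script.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModifyLbjSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical path and tokens: the real modify_lbj_file.py may edit the file differently.
        Path lbj = Paths.get("lbjava-examples/src/main/lbj/NewsGroupClassifier.lbj");
        String source = new String(Files.readAllBytes(lbj), StandardCharsets.UTF_8);
        // Swap the learning algorithm and the round count before the next run.
        source = source.replace("SparseAveragedPerceptron", "SparseWinnow")
                       .replace("10 rounds", "50 rounds");
        Files.write(lbj, source.getBytes(StandardCharsets.UTF_8));
    }
}
```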
NewsGroup (table for a single constant real feature)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | SparseConfidenceWeighted | BinaryMIRA |
---|---|---|---|---|---|
1 round w/o real features | 48.916 | 92.597 | 19.038 | 33.739 | |
1 round w/ real features | 47.753 | 92.491 | 23.268 | 32.364 | |
10 rounds w/o real features | 82.390 | 91.539 | 24.802 | 76.891 | |
10 rounds w/ real features | 82.126 | 91.529 | 12.427 | 75.939 | |
50 rounds w/o real features | 84.823 | 91.592 | 14.120 | 77.208 | |
50 rounds w/ real features | 85.299 | 91.433 | 19.566 | 76.891 | |
100 rounds w/o real features | 85.828 | 91.433 | 12.956 | 76.574 | |
100 rounds w/ real features | 84.770 | 91.486 | 15.442 | 61.026 | |
NewsGroup (table for the same number of random Gaussian real features as discrete ones)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | BinaryMIRA |
---|---|---|---|---|
1 round w/o real features | 51.454 | 92.597 | 12.057 | 33.739 |
1 round w/ real features | 17.980 | 6.081 | 14.913 | 14.225 |
10 rounds w/o real features | 82.813 | 91.539 | 22.369 | 76.891 |
10 rounds w/ real features | 52.829 | 42.517 | 45.743 | |
50 rounds w/o real features | 84.294 | 91.592 | 21.100 | 77.208 |
50 rounds w/ real features | 75.727 | 67.054 | 75.198 | |
100 rounds w/o real features | 85.506 | 91.433 | 17.768 | 76.574 |
100 rounds w/ real features | 77.631 | 74.828 | 74.194 | |
### Problems
- `SparseConfidenceWeighted` encountered a problem where training takes too long on my server. I had to kill the process after waiting for a long time.
- `SparseWinnow` will throw a NullPointerException during the testing (`testDiscrete()`) process after multiple real features are added, if the training rounds are larger than 1.
For the Badges example, the constant real features are the same as in the NewsGroup examples above.
The multiple constant real features are added through:
```
real[] RealFeatures3(String line) <- {
    for (int i = 0; i < line.length(); i++) {
        int k = 3;
        sense k;
    }
}
```
The multiple random real features are added through:

```
import java.util.Random;

real[] GaussianRealFeatures(String line) <- {
    for (int i = 0; i < line.length(); i++) {
        Random ran = new Random();
        sense ran.nextGaussian() * 10;
    }
}
```
### Result Tables
Badges (table for a single constant real feature)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 100.0 | 95.745 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 100.0 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table for the same number of constant real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 74.468 | 100.0 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 78.723 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table for the same number of random Gaussian real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 55.319 | 56.383 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 62.766 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 74.468 | 87.234 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 86.170 | 100.0 | 100.0 |
The conclusion here is that, as more real-valued features are added, more training iterations are needed to train the system, and there are no clear issues with real-valued features.
The goal of the next experiment is to add the 25-dimensional real-valued phrase similarity vector defined at https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-phrasesim/blob/master/src/main/java/edu/illinois/cs/cogcomp/sim/PhraseSim.java for each word to the end of the original feature vector in POSTaggerKnown, defined at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown.java
Since the code that is supposed to be generated by LBJava from the .lbj files has been modified manually to a certain extent, and many of the changes lack documentation, I was unable to directly modify https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/lbj/POSKnown.lbj and replace the original classes with the newly generated ones. So I took another approach. I first generated a paragamVector class from the following code:
```
real[] paragamVector(Token w) <- {
    PhraseSim ps = null;
    try {
        ps = PhraseSim.getInstance();
    }
    catch (FileNotFoundException e) {
    }
    double[] vec = ps.getVector(w.form);
    for (int i = 0; i < vec.length; i++) {
        sense vec[i];
    }
}
```
The generated class can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/paragamVector.java
I manually added some code so that the generated class handles the case where a word does not have a corresponding vector.
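The exact lines added to the generated class are not shown here; the sketch below only illustrates one possible guard, assuming that PhraseSim.getVector returns null for a word without a vector (an assumption) and that an all-zero vector is an acceptable fallback. The helper name and the dimension parameter are hypothetical.

```java
import java.io.FileNotFoundException;

import edu.illinois.cs.cogcomp.sim.PhraseSim;

public class ParagramVectorGuard {
    // Returns the phrasesim vector for a word form, or an all-zero vector of the
    // expected dimensionality when no vector exists (assumed to be signalled by
    // getVector returning null).
    public static double[] vectorOrZeros(String form, int dimension) throws FileNotFoundException {
        PhraseSim ps = PhraseSim.getInstance();
        double[] vec = ps.getVector(form);
        if (vec == null) {
            return new double[dimension];  // zeros: the real features contribute nothing
        }
        return vec;
    }
}
```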
Then I modified the code in POSTaggerKnown$$1.java, which is the extractor for the POSTaggerKnown class. The code can be found at https://github.com/Slash0BZ/illinois-cogcomp-nlp/blob/master/pos/src/main/java/edu/illinois/cs/cogcomp/pos/lbjava/POSTaggerKnown%24%241.java
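I did not reproduce the generated extractor here; the following is only a rough sketch of the idea behind the modification, assuming LBJava's FeatureVector.addFeatures and Classifier.classify APIs: the features from the original extractor are emitted first, and the 25 real features from paragamVector are appended for the same token.

```java
import edu.illinois.cs.cogcomp.lbjava.classify.Classifier;
import edu.illinois.cs.cogcomp.lbjava.classify.FeatureVector;

public class ExtractorConcatSketch {
    // Builds the feature vector for one token: the features from the original
    // POSTaggerKnown extractor followed by the real features from paragamVector.
    public static FeatureVector extract(Classifier originalExtractor,
                                        Classifier paragamVector,
                                        Object token) {
        FeatureVector result = new FeatureVector();
        result.addFeatures(originalExtractor.classify(token));  // original discrete features
        result.addFeatures(paragamVector.classify(token));      // appended real-valued features
        return result;
    }
}
```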
With some further changes to make the code work on my local machine, POSTrain ran without errors.
rounds\features | without real features | with real features | Difference |
---|---|---|---|
50 | 96.525 | 96.420 | 0.105 |
100 | 96.600 | 96.556 | 0.044 |
150 | 96.599 | 96.568 | 0.031 |
200 | 96.610 | 96.588 | 0.022 |
250 | 96.613 | 96.593 | 0.02 |
300 | 96.610 | 96.596 | 0.014 |
350 | 96.609 | 96.593 | 0.016 |
400 | 96.604 | 96.583 | 0.021 |
450 | 96.598 | 96.587 | 0.011 |
500 | 96.589 | 96.582 | 0.007 |
Above is the results table. I ran two threads using different model files and the same data files. The second column (without real features) shows the accuracy of the original feature vector defined on the GitHub page of POS; the third column (with real features) shows the accuracy after introducing the real features produced from phrasesim, as described above.
Everything else except for the feature vector is the same across these two trials.
Some observations: the accuracy of both settings starts to drop around rounds 250 to 300, and the accuracy difference between the two settings keeps shrinking until both settings overfit. However, there is no sign that the new feature vector with the phrasesim features performs better than the original features.
I also validated the results to make sure that the new 25-dimensional vector is added to the feature vector. I first checked the size of the lexicon and confirmed a growth of 25, from 51359 unique features to 51384.
Then I also checked the feature values and confirmed that the discrete features have values of 1.0/0.0 and the real features have double values.