# On the behavior of lbjava when dealing with real-valued features
## Introduction
There has been speculation about lbjava failing when using real-valued features.
Thanks to @Slash0BZ (Ben Zhou), we ran a comprehensive set of experiments on a few example problems, with different numbers of real-valued features, across different algorithms and different numbers of training iterations.
The data I used was from http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
I downloaded and unzipped the data and used the script https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/20news_data_parser.py to move a randomly selected 90% of the files (in each tag) to a training data path and the rest to a testing data path.
The parsed data can be read and used directly through https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/java/edu/illinois/cs/cogcomp/lbjava/examples/DocumentReader.java
I first tested the original NewsGroupClassifier.lbj, defined at https://github.com/Slash0BZ/lbjava/blob/master/lbjava-examples/src/main/lbj/NewsGroupClassifier.lbj
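For orientation, that file follows the usual lbjava learner pattern; the sketch below shows its general shape rather than its exact contents (the feature generator `BagOfWords`, the label classifier `NewsGroupLabel`, and the data paths are assumed names for illustration):

```
// Minimal sketch of an lbjava learner specification; names and paths are illustrative.
discrete NewsGroupLabel(Document d) <- { return d.getLabel(); }

discrete NewsGroupClassifier(Document d) <-
learn NewsGroupLabel
  using BagOfWords                          // discrete bag-of-words features
  from new DocumentReader("data/20news/train") 10 rounds
  with SparseAveragedPerceptron             // the learning algorithm under test
  testFrom new DocumentReader("data/20news/test")
end
```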
Then I added a constant real feature to each of the examples; for instance:

```
real[] RealFeatureConstant(Document d) <- { int k = 3; sense k; }
```

Using `int` here is fine because the lbjava parser will transform the data type to `double`.
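To make that coercion concrete, the following should be an equivalent way to write the same feature with an explicit double (a minimal sketch for illustration; the classifier name is hypothetical and not taken from the experiments):

```
real[] RealFeatureConstantExplicit(Document d) <- { double k = 3.0; sense k; }
```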
Then I tried multiple real features randomly generated from a Gaussian distribution. The code I used was:

```
import java.util.List;
import java.util.Random;

real[] GaussianRealFeatures(Document d) <- {
    List words = d.getWords();
    Random ran = new Random();
    // emit one Gaussian-distributed real feature per word (minus one)
    for (int i = 0; i < words.size() - 1; i++)
        sense ran.nextGaussian() * 10;
}
```
I used bash scripts to run the tests on several algorithms. To make this easier, I wrote a Python script that modifies the .lbj file; it can be found at https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/LBJava/modify_lbj_file.py
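Conceptually, the only part of the .lbj file that needs to change between runs is the learner clause; a hedged sketch of the lines that vary (the exact strings follow the illustrative sketch above, and the script's precise edit strategy is an assumption):

```
  using BagOfWords, GaussianRealFeatures                  // feature generators toggled per run
  from new DocumentReader("data/20news/train") 50 rounds  // round count varied per run
  with SparseWinnow                                       // learning algorithm varied per run
```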
NewsGroup (table for a single constant real feature)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | SparseConfidenceWeighted | BinaryMIRA |
---|---|---|---|---|---|
1 round w/o real features | 48.916 | 92.597 | 19.038 | 33.739 | |
1 round w/ real features | 47.753 | 92.491 | 23.268 | 32.364 | |
10 rounds w/o real features | 82.390 | 91.539 | 24.802 | 76.891 | |
10 rounds w/ real features | 82.126 | 91.529 | 12.427 | 75.939 | |
50 rounds w/o real features | 84.823 | 91.592 | 14.120 | 77.208 | |
50 rounds w/ real features | 85.299 | 91.433 | 19.566 | 76.891 | |
100 rounds w/o real features | 85.828 | 91.433 | 12.956 | 76.574 | |
100 rounds w/ real features | 84.770 | 91.486 | 15.442 | 61.026 | |

(No results were recorded for BinaryMIRA in this table.)
NewsGroup (table for the same number of random Gaussian real features as discrete ones)
Condition\Algorithm | SparseAveragedPerceptron | SparseWinnow | PassiveAggressive | BinaryMIRA |
---|---|---|---|---|
1 round w/o real features | 51.454 | 92.597 | 12.057 | 33.739 |
1 round w/ real features | 17.980 | 6.081 | 14.913 | 14.225 |
10 rounds w/o real features | 82.813 | 91.539 | 22.369 | 76.891 |
10 rounds w/ real features | 52.829 | | 42.517 | 45.743 |
50 rounds w/o real features | 84.294 | 91.592 | 21.100 | 77.208 |
50 rounds w/ real features | 75.727 | | 67.054 | 75.198 |
100 rounds w/o real features | 85.506 | 91.433 | 17.768 | 76.574 |
100 rounds w/ real features | 77.631 | | 74.828 | 74.194 |

(The SparseWinnow cells are empty for the multi-round runs with real features because of the NullPointerException described under Problems below.)
### Problems
- `SparseConfidenceWeighted` encountered a problem where training took too long on my server; I had to kill the process after waiting for a long time.
- `SparseWinnow` throws a NullPointerException during the testing (`testDiscrete()`) process after multiple real features are added, if the number of training rounds is larger than 1.
## Badges
Similar to the NewsGroup example above, the single constant real feature is defined the same way. The multiple constant real features are added through:
```
real[] RealFeatures3(String line) <- {
    // emit one constant real feature per character of the input line
    for (int i = 0; i < line.length(); i++) {
        int k = 3;
        sense k;
    }
}
```
The multiple random real features are added through:

```
import java.util.Random;

real[] GaussianRealFeatures(String line) <- {
    // create the generator once, outside the loop
    Random ran = new Random();
    // emit one Gaussian-distributed real feature per character of the input line
    for (int i = 0; i < line.length(); i++)
        sense ran.nextGaussian() * 10;
}
```
### Result Tables
Badges (table for a single constant real feature)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 100.0 | 95.745 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 100.0 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table for the same number of constant real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 74.468 | 100.0 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 78.723 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 100.0 | 100.0 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 100.0 | 100.0 | 100.0 |
Badges (table for the same number of random Gaussian real features as discrete features)
Condition\Algorithm | SparsePerceptron | SparseWinnow | NaiveBayes |
---|---|---|---|
1 round w/o real features | 100.0 | 95.745 | 100.0 |
1 round w/ real features | 55.319 | 56.383 | 100.0 |
10 rounds w/o real features | 100.0 | 100.0 | 100.0 |
10 rounds w/ real features | 62.766 | 100.0 | 100.0 |
50 rounds w/o real features | 100.0 | 100.0 | 100.0 |
50 rounds w/ real features | 74.468 | 87.234 | 100.0 |
100 rounds w/o real features | 100.0 | 100.0 | 100.0 |
100 rounds w/ real features | 86.170 | 100.0 | 100.0 |
The conclusion is that, as more real-valued features are added, more training iterations are needed to train the system; apart from the specific problems noted above, there are no clear issues with lbjava's handling of real-valued features.