This project covers implementations of Simple Regression, Regression Weighted Least Squares, Ridge Regression, Lasso Regression, Quad Regression, and Response Surface in ScalaTion and R, applied to 10 datasets downloaded from the UCI Machine Learning Repository. The datasets include:
- Auto MPG (Instances: 406, Attributes: 8)
- Beijing PM2.5 Dataset (Instances: 43824, Attributes: 13)
- Concrete Compressive Strength Dataset (Instances: 1030, Attributes: 9)
- Real Estate Valuation Dataset (Instances: 414, Attributes: 7)
- Parkinson's Tele Monitoring (Instances: 5875, Attributes: 26)
- GPS Trajectories (Instances: 163, Attributes: 15)
- Appliances Energy Prediction (Instances: 19735, Attributes: 29)
- Combined Cycle Powerplant (Instances: 9568, Attributes: 4)
- CSM Dataset (Instances: 217, Attributes: 12)
- Naval Propulsion Dataset (Instances: 11934, Attributes: 16)
Apart from these, users can also run the models on their own datasets by supplying the correct path to the file in either environment.
These instructions describe the prerequisites and steps to get the project up and running.
This project has the following requirements for Scalation:
- Scala 2.12.8+
- Java 8
- sbt 1.0+
It also has the following requirements for executing the R script:
- R 3.5.2+
- Libraries: caret, lattice, ggplot2, lmridge, lars
After cloning the repository, navigate to the Scalation folder, which contains the build.sbt file, to generate the R2 - Rbar2 - RCV2 plots. Open a terminal there and run the command:
sbt run
This builds the Scalation project and prompts the user to select one of the 10 datasets. The user makes a choice by entering a number from '1' to '10', each corresponding to the respective dataset.
If the user wishes to use this project for their own dataset, they will have to enter '11' as their choice, which will prompt them to enter the path of their dataset (in CSV format). However, there are a few guidelines for the dataset that the user chooses to experiment on:
- it has to be a numeric dataset (data-encoding hasn't been implemented yet!)
- the first column of the dataset needs to be the 'Y' attribute.
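The two guidelines above can be checked before a custom CSV is handed to the project. The following is a minimal sketch, not code from the repository: the `DatasetCheck` object, the `splitResponse` function and the `hasHeader` flag are hypothetical names, and whether the project expects a header row is an assumption the caller has to settle.

```scala
import scala.util.Try

object DatasetCheck {
  // (y, X): response values and the matching rows of predictor values.
  type Parsed = (Vector[Double], Vector[Vector[Double]])

  // Treats the first column as the response 'Y' and the rest as predictors.
  // Returns Left with a message if any row is ragged or holds a non-numeric cell.
  def splitResponse(lines: Seq[String], hasHeader: Boolean = false): Either[String, Parsed] = {
    val body = (if (hasHeader) lines.drop(1) else lines).filter(_.trim.nonEmpty)
    val rows = body.map(_.split(",", -1).map(_.trim).toVector).toVector
    val width = rows.headOption.map(_.length).getOrElse(0)
    def isNumeric(cell: String): Boolean = Try(cell.toDouble).isSuccess

    if (width < 2) Left("need a response column plus at least one predictor column")
    else rows.indexWhere(r => r.length != width || !r.forall(isNumeric)) match {
      case -1 =>
        val nums = rows.map(_.map(_.toDouble))
        Right((nums.map(_.head), nums.map(_.tail)))
      case bad => Left(s"data row ${bad + 1} is ragged or has a non-numeric value")
    }
  }
}
```

The lines of a real file could be obtained with `scala.io.Source.fromFile(path).getLines().toVector` and passed to `DatasetCheck.splitResponse`; a `Left` result indicates that the file needs cleaning or encoding first.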
If users want to add their own dataset to the list, they need to go up one directory level, to the
/data
directory, and place the dataset there. The project's naming convention is "x.csv", where 'x' is the choice number that the user enters.
To inspect the Scala script, open Scalation/src/main/scala/RegressionProblem/regression.scala
Note that when this script runs on big datasets, the Quality of Fit becomes NaN as Feature Selection reaches the end. As a result, '-1' is appended to the feature-selected vector and an ArrayOutOfBounds error is raised. Because of this, I have been unable to run the script on a few of the bigger datasets from the UCI Machine Learning Repository. Following our discussions, Dr. Miller plans to check the ForwardSel class for the bug in order to address this issue.
To run the R script, use 'regression.R' in the /R
sub-directory of the repository. In the R console, enter:
source("/path/to/R/script/regression.R")
which will then generate the R2 - Rbar2 - RCV2 plots for the datasets of the user's choice, over five types of regression models. A 60-40 train-test split is used, and forward selection is performed on the train set, based on the maximum value of the R2 criterion. The R2 and Rbar2 plots are computed on the test set, and the RCV2 plot on the whole dataset. Each of these plots is saved in the /plots
sub-directory.
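The 60-40 split described above can be illustrated as follows. This is a sketch in Scala of one common approach, not the code in regression.R: the `TrainTestSplit` object, the `splitIndices` function, the random shuffling and the fixed seed are all assumptions made for illustration.

```scala
import scala.util.Random

object TrainTestSplit {
  // Shuffles the row indices 0 until n with a fixed seed, then cuts them
  // into a train part (trainFraction of the rows) and a test part (the rest).
  def splitIndices(n: Int, trainFraction: Double = 0.6, seed: Long = 42L): (Vector[Int], Vector[Int]) = {
    require(n > 0, "dataset must have at least one row")
    require(trainFraction > 0.0 && trainFraction < 1.0, "trainFraction must lie in (0, 1)")
    val shuffled = new Random(seed).shuffle((0 until n).toVector)
    val nTrain = math.round(n * trainFraction).toInt
    shuffled.splitAt(nTrain)
  }
}
```

For a dataset with 10 rows, `TrainTestSplit.splitIndices(10)` yields 6 train indices and 4 test indices that together cover every row exactly once. Fixing the seed keeps the split reproducible across runs.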
The crossVal class in ScalaTion might have a bug: when plotting RCV2, R2 and adjusted R2 (Rbar2), the line representing RCV2 lies above the other two. This is unexpected, as RCV2 values should normally be lower than the plain or adjusted values of R2. We observed the expected ordering only for a small number of folds k, such as k=2. We therefore show Cross-Validation plots for k=2 only on the different datasets. The plots generated through R, however, use Cross-Validation with k=10.
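For reference, the three quantities compared above are usually defined as shown below. These are the standard textbook forms, stated here as an assumption; the exact computation inside ScalaTion or the R libraries may differ (for example, some implementations average a per-fold R2 instead of pooling the errors).

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\qquad
R^2_{cv} = 1 - \frac{\sum_{j=1}^{k} \sum_{i \in F_j} \bigl(y_i - \hat{y}_i^{(-j)}\bigr)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Here n is the number of instances, p the number of predictors (excluding the intercept), F_j the j-th of the k folds, and \hat{y}_i^{(-j)} the prediction for instance i from a model fitted without fold F_j. Since (n - 1)/(n - p - 1) is at least 1, Rbar2 cannot exceed R2. RCV2 scores each instance with a model that never saw it, which is why it is expected to fall below the in-sample R2.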
Ridge regression was computationally very expensive in our experiments. We frequently got memory errors for large datasets with Ridge, probably because of the built-in implementation in the 'lmridge' library. A better library for ridge regression in R is therefore required. Surprisingly, on the bigger datasets in ScalaTion, ridge regression and lasso regression ran without difficulty compared to the other models. For the other models, the main problem was that, when the script runs on big datasets in ScalaTion, the Quality of Fit becomes NaN as Feature Selection reaches the end, so '-1' is appended to the feature-selected vector and an ArrayOutOfBounds error is raised. As a fix, Dr. Miller suggested ignoring the '-1' and generating the graphs, which was implemented accordingly.
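The idea of ignoring the '-1' entry can be sketched as below. This is an illustration only, not the code in regression.scala: the `FeatureIndexCleanup` object and the `validFeatures` function are hypothetical names, and the selected features are assumed to be available as a plain sequence of column indices.

```scala
object FeatureIndexCleanup {
  // Keeps only usable column indices, in their original selection order.
  // Drops the -1 sentinel and any other index outside 0 until nColumns,
  // so later column lookups cannot run out of bounds.
  def validFeatures(selected: Seq[Int], nColumns: Int): Vector[Int] =
    selected.filter(j => j >= 0 && j < nColumns).toVector
}
```

For example, `FeatureIndexCleanup.validFeatures(Seq(0, 3, 1, -1), 5)` returns `Vector(0, 3, 1)`, so the plots can be generated from the features that were selected before the Quality of Fit turned NaN.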
See the CONTRIBUTORS file for more details.
This project is licensed under the MIT License. See LICENSE for more details.