CSCI-6360-Project01 : Regression Problem in Scalation and R

This project implements Simple Regression, Regression Weighted Least Squares, Ridge Regression, Lasso Regression, Quad Regression, and Response Surface in Scalation and R, and applies them to 10 datasets downloaded from the UCI Machine Learning Repository. The datasets include:

Auto MPG (Instances: 406, Attributes: 8)
Beijing PM2.5 Dataset (Instances: 43824, Attributes: 13)
Concrete Compressive Strength Dataset (Instances: 1030, Attributes: 9)
Real Estate Valuation Dataset (Instances: 414, Attributes: 7)
Parkinson's Tele Monitoring (Instances: 5875, Attributes: 26)
GPS Trajectories (Instances: 163, Attributes: 15)
Appliances Energy Prediction (Instances: 19735, Attributes: 29)
Combined Cycle Powerplant (Instances: 9568, Attributes: 4)
CSM Dataset (Instances: 217, Attributes: 12)
Naval Propulsion Dataset (Instances: 11934, Attributes: 16)

In addition, users can run the models on their own datasets by supplying the correct path to the file in either environment.

Getting Started

These instructions describe the prerequisites and steps to get the project up and running.

Prerequisites

This project has the following requirements for Scalation:

Scala 2.12.8+
Java 8
sbt 1.0+

It also has the following requirements for executing the R script:

R 3.5.2+
Libraries: caret, lattice, ggplot2, lmridge, lars

Usage

After cloning the repository, navigate to the Scalation folder, which contains the build.sbt file, to generate the R² - R_bar² - R_CV² plots. Open a terminal there and run the command:

sbt run

This builds the Scalation project and prompts the user to select one of the 10 datasets. The user makes a choice by entering a number from '1' to '10', each corresponding to one dataset.

To run the project on their own dataset, the user enters '11' as their choice, which prompts them for the path to the dataset (in CSV format). The dataset must follow a few guidelines:

It has to be a numeric dataset (data encoding has not been implemented yet).
The first column of the dataset needs to be the 'Y' attribute.

If the user chooses to add their own dataset to the list, they will have to navigate one directory up, to the /data directory, and move the dataset there. The naming convention followed in the project is "x.csv", where 'x' is the choice that the user inputs.
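The two dataset guidelines can be checked before handing a file to either environment. The following is a minimal sketch in Python, used here only for illustration and not part of this project. The function name `load_numeric_csv` and the `has_header` parameter are hypothetical, and the sample values are made up.

```python
import csv
import io


def load_numeric_csv(text, has_header=False):
    """Parse CSV text into (y, X), treating the first column as the response Y.

    Raises ValueError if a cell is non-numeric or rows have unequal length.
    `has_header` is a hypothetical convenience flag, not a project option.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        rows = rows[1:]
    rows = [r for r in rows if r]  # skip blank lines
    if not rows:
        raise ValueError("dataset is empty")
    width = len(rows[0])
    if width < 2:
        raise ValueError("need a response column and at least one predictor")
    y, X = [], []
    for i, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {i} has {len(row)} columns, expected {width}")
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(f"row {i} contains a non-numeric value") from None
        y.append(values[0])   # first column is the 'Y' attribute
        X.append(values[1:])  # remaining columns are the predictors
    return y, X


# Made-up sample: three rows, response first, then two predictors.
sample = "18.0,8,307.0\n15.0,8,350.0\n16.0,6,250.0\n"
y, X = load_numeric_csv(sample)
```

A file with a categorical column (for example a row such as `1.0,red`) is rejected with a `ValueError`, which matches the note that data encoding is not implemented.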

The Scala script is located at Scalation/src/main/scala/RegressionProblem/regression.scala

Note that when this script runs on big datasets, the Quality of Fit becomes NaN as the Feature Selection reaches the end. As a result, '-1' is appended to the feature-selected vector, which causes an array-index-out-of-bounds error. Because of this, I have been unable to run the script on a few of the bigger datasets in the UCI Machine Learning Repository. Following our discussions, Dr. Miller plans to check the ForwardSel class for a bug as well, so as to address this issue.

The R script, 'regression.R', is in the /R sub-directory of the repository. To run it, the user needs to enter: source("/path/to/R/script/regression.R") (use forward slashes in the path, since R treats backslashes inside strings as escape characters). This generates the R² - R_bar² - R_CV² plots for the datasets of their choice, over five types of regression models. A train-test split ratio of 60-40 is used, and forward selection is performed on the train set, based on the maximum value of the R² criterion. The R² and R_bar² plots are recorded on the test set, and the R_CV² plot on the whole dataset. Each of these plots has been saved in the /plots sub-directory.
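The procedure described above (a 60-40 split, forward selection by maximum R² on the train set, then R² recorded on the test set) can be sketched as follows. This is an illustrative Python sketch, not the project's R or Scala code. All function names and the toy data are hypothetical, the split is taken in file order as an assumption, and the small normal-equations solver does not handle collinear columns.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular (collinear columns are not handled).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit(X, y, cols):
    """Least-squares coefficients (intercept first) using only columns `cols`."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    ZtZ = [[sum(z[a] * z[c] for z in Z) for c in range(p)] for a in range(p)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    return solve(ZtZ, Zty)


def r_squared(X, y, cols, beta):
    """R^2 = 1 - SSE / SST of the model `beta` evaluated on (X, y)."""
    mean_y = sum(y) / len(y)
    pred = [beta[0] + sum(w * row[j] for w, j in zip(beta[1:], cols)) for row in X]
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


def forward_select(X, y):
    """Greedily add the column that maximises R^2 on (X, y) at each step.

    Returns a list of (columns, coefficients, train R^2), one entry per step.
    """
    remaining = list(range(len(X[0])))
    chosen, steps = [], []
    while remaining:
        best = None
        for j in remaining:
            cols = chosen + [j]
            beta = fit(X, y, cols)
            score = r_squared(X, y, cols, beta)
            if best is None or score > best[2]:
                best = (cols, beta, score)
        chosen = best[0]
        remaining.remove(chosen[-1])
        steps.append(best)
    return steps


# Made-up toy data: y is roughly 2 * x0 + 1, and x1 is an unrelated column.
X = [[1, 3], [2, 1], [3, 4], [4, 1], [5, 5], [6, 9], [7, 2], [8, 6], [9, 5], [10, 3]]
y = [3.1, 4.9, 7.05, 8.95, 11.1, 12.9, 15.05, 16.9, 19.1, 20.95]

cut = len(X) * 6 // 10  # 60-40 train-test split, taken in file order
X_train, y_train = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]

steps = forward_select(X_train, y_train)
test_r2 = [r_squared(X_test, y_test, cols, beta) for cols, beta, _ in steps]
for (cols, _, train_r2), r2 in zip(steps, test_r2):
    print(cols, round(train_r2, 4), round(r2, 4))
```

On the train set, R² can only stay the same or rise as columns are added, so it is the test-set values that show whether an added feature actually helps.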

General Issues Faced

The crossVal class in Scalation might have a bug: when plotting R_CV², R² and R_bar², the line representing R_CV² lies above the other two, which should not happen, since the values of R_CV² should be lower than the plain or adjusted values of R². This ordering was satisfied only for a small value of k (i.e., the number of folds), such as 2. We are therefore showing the Scalation Cross-Validation plots for k=2 only on the different datasets. The plots generated through R, however, use Cross-Validation with k=10.
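For reference, the three quantities being compared are usually defined as below. These are the standard textbook forms, written with assumed symbols: n is the number of observations, k the number of predictors (excluding the intercept), and K the number of folds. The averaged form of R_CV² is an assumption, with SSE_f and SST_f computed on held-out fold f using a model fitted on the remaining folds; the crossVal class may aggregate the folds differently.

```latex
\begin{aligned}
R^2 &= 1 - \frac{SSE}{SST}, \qquad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \\
R^2_{CV} &= \frac{1}{K} \sum_{f=1}^{K} \left( 1 - \frac{SSE_f}{SST_f} \right)
\end{aligned}
```

Because (n - 1)/(n - k - 1) is at least 1, the adjusted value never exceeds R². R_CV² is measured on data the model did not see during fitting, which is why it is normally expected to fall below both.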

Ridge regression was computationally very expensive in our experiments. We frequently got memory errors for large datasets with Ridge, probably due to the built-in implementation in the 'lmridge' library. A better library for ridge regression is therefore required in R. Strangely enough, on the bigger datasets in ScalaTion, ridge regression and lasso regression ran easily compared to the other models. This was mainly because, when this script runs on big datasets in ScalaTion, the Quality of Fit becomes NaN as the Feature Selection reaches the end, so '-1' is appended to the feature-selected vector and an array-index-out-of-bounds error follows. As a fix, Dr. Miller suggested ignoring the '-1' and generating the graphs, which was implemented accordingly.
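The workaround of ignoring the '-1' amounts to filtering the sentinel out of the selected-feature vector before it is used as an index. A minimal sketch in Python, for illustration only: the names `drop_sentinels`, `selected` and `columns` are hypothetical and do not come from the ScalaTion code.

```python
def drop_sentinels(selected, sentinel=-1):
    """Return the selected column indices with the sentinel value removed."""
    return [j for j in selected if j != sentinel]


# Made-up example: forward selection appended -1 twice after the fit became NaN.
selected = [0, 3, 1, -1, -1]
columns = ["intercept", "x1", "x2", "x3"]

valid = drop_sentinels(selected)
names = [columns[j] for j in valid]  # safe: every index is a real column
```

On the JVM an index of -1 raises an exception, which is how the problem surfaced. In Python the same index silently wraps around to the last element, so the filter is needed there too, just for a different reason.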

Contributors

See CONTRIBUTORS file for more details.

Authors

License

This project is licensed under the MIT License. See LICENSE for more details.

aashishyadavally/Regression-in-ScalaTion-and-R