This repo contains all my assignments from the DSBDA Lab in Sem 6.

Sr. | Name | Description |
---|---|---|
1 | Data Wrangling I | Perform the following operations using Python on any open-source dataset (e.g., data.csv): 1. Import all required Python libraries. 2. Locate an open-source dataset (e.g., from https://www.kaggle.com) and provide a clear description and its source URL. 3. Load the dataset into a pandas DataFrame. 4. Data Preprocessing: Check for missing values using isnull(), get initial statistics using describe(), provide variable descriptions and types, and check the dimensions of the DataFrame. 5. Data Formatting and Normalization: Check and convert data types (character, numeric, integer, factor, logical). 6. Turn categorical variables into quantitative variables. In addition to the code and outputs, explain every operation clearly. |
2 | Data Wrangling II | Create an "Academic performance" dataset of students and: 1. Scan all variables for missing values and inconsistencies, handle appropriately. 2. Scan numeric variables for outliers and handle appropriately. 3. Apply transformations to variables (for better scaling, linearity, or normality). Document your approach properly. |
3 | Descriptive Statistics: Measures of Central Tendency and Variability | Perform the following: 1. On nba.csv: Provide summary statistics (mean, median, min, max, standard deviation) grouped by a categorical variable. 2. On iris.csv: Display basic statistics (percentile, mean, standard deviation) for each Iris species (setosa, versicolor, virginica). Provide code, outputs, and explanations. |
4 | Data Visualization I | Using the inbuilt titanic dataset (891 rows): 1. Use Seaborn to find patterns. 2. Plot a histogram for 'fare' to see price distribution. |
5 | Data Visualization II | Using titanic dataset: 1. Plot a box plot for 'age' distribution across 'sex' and survival status. 2. Write observations based on the plots. |
6 | Data Visualization III | Using the iris.csv dataset: 1. List all features and their types. 2. Create histograms for each feature. 3. Create box plots for each feature. 4. Compare distributions and identify outliers. |
7 | Data Analytics I | Create a Linear Regression Model in Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). Objective: Predict house prices using the features. |
8 | Data Analytics II | Problem Statement: 1. Implement Logistic Regression on Social_Network_Ads.csv. 2. Compute the confusion matrix and derive TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall. |
9 | Data Analytics III | Implement a Simple Naïve Bayes classifier using Python/R on iris.csv. Compute the confusion matrix and derive TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall. |
10 | Text Analytics | 1. Extract a sample document and apply: Tokenization, POS Tagging, Stop words removal, Stemming, Lemmatization. 2. Calculate Term Frequency and Inverse Document Frequency representations. |
11 | Hadoop Word Count | Write a Java program for a Word Count application using Hadoop Map-Reduce framework in a local-standalone setup. |
13 | Apache Spark Word Count | Write a simple program in Scala using the Apache Spark Framework for Word Count. |
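
Assignments 8 and 9 both ask for TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall derived from a confusion matrix. The following is a minimal sketch of that derivation in plain Python, not the code used in the assignments themselves. The function names `confusion_counts` and `derive_metrics`, the choice of `1` as the positive label, and the sample labels are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem.

    `positive` is the label treated as the positive class (assumed to be 1 here).
    """
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def derive_metrics(tp, fp, tn, fn):
    """Derive the standard metrics from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only (not taken from any dataset in this repo).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = derive_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

For a multi-class problem such as iris.csv, one common approach is to compute these counts once per class, treating that class as positive and all others as negative.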