diff --git a/Prediction Models/Crab_Age_Prediction/Readme.md b/Prediction Models/Crab_Age_Prediction/Readme.md
new file mode 100644
index 00000000..9dbb028e
--- /dev/null
+++ b/Prediction Models/Crab_Age_Prediction/Readme.md
@@ -0,0 +1,42 @@
+# Crab Age Prediction Model
+
+This repository contains a machine learning model that predicts the age of crabs based on various biological measurements. The project involves Exploratory Data Analysis (EDA), feature engineering, and multiple machine learning models to determine which factors most accurately predict crab age.
+
+## Table of Contents
+- [Introduction](#introduction)
+- [Problem Statement](#problem-statement)
+- [Solution Overview](#solution-overview)
+- [Data](#data)
+
+
+## Introduction
+
+Determining the age of marine species such as crabs is essential for studying population dynamics and ecological impacts. This project focuses on developing a machine learning model to predict crab age based on various biological characteristics, like size, weight, and shell dimensions. The model aims to help biologists and ecologists with accurate age estimations, facilitating better research and conservation efforts.
+
+## Problem Statement
+
+Age prediction in crabs is complex due to several challenges:
+- **Biological Variability**: Differences in growth rates across individual crabs due to genetics and environmental factors.
+- **Measurement Limitations**: Variability in available biological measurements.
+- **Feature Selection**: Identifying which measurements contribute most effectively to accurate age prediction.
+
+This project aims to address these challenges by leveraging machine learning techniques to create a predictive model for crab age.
+
+## Solution Overview
+
+The model uses various machine learning algorithms, including linear regression, decision trees, and ensemble methods. Steps taken include:
+1. **Exploratory Data Analysis (EDA)**: Identifying patterns, outliers, and relationships within the data.
+2. **Feature Engineering**: Selecting and transforming features to improve model accuracy.
+3. **Model Selection and Training**: Comparing multiple models to determine the best predictor of crab age.
+
+Key features may include measurements such as carapace length, width, weight, and other morphological characteristics.
+
+## Data
+
+The dataset contains various biological measurements for crabs, including:
+- **Carapace Dimensions**: Length, width, and height.
+- **Weight Measurements**: Including whole weight, shell weight, etc.
+- **Other Characteristics**: Information about species, habitat, or other ecological factors, if available.
+
+The dataset should be placed in the `data/` folder in CSV format.
+
diff --git a/Prediction Models/Crab_Age_Prediction/crab-age-predictions-eda-f-e-modeling-10th.ipynb b/Prediction Models/Crab_Age_Prediction/crab-age-predictions-eda-f-e-modeling-10th.ipynb
new file mode 100644
index 00000000..67ada5b5
--- /dev/null
+++ b/Prediction Models/Crab_Age_Prediction/crab-age-predictions-eda-f-e-modeling-10th.ipynb
@@ -0,0 +1,1898 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "64fad5f4",
+   "metadata": {
+    "papermill": {
+     "duration": 0.023427,
+     "end_time": "2023-06-13T09:19:59.138539",
+     "exception": false,
+     "start_time": "2023-06-13T09:19:59.115112",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "source": [
+    "# Import Libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0cf32024",
+   "metadata": {
+    "papermill": {
+     "duration": 20.705581,
+     "end_time": "2023-06-13T09:20:19.910608",
+     "exception": false,
+     "start_time": "2023-06-13T09:19:59.205027",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!pip install sklego\n",
+    "\n",
+    "import numpy as np # linear algebra\n",
+    "import pandas as pd # data processing\n",
+    "from pandas.api.types import is_numeric_dtype\n",
+    "import seaborn as sns\n",
+    "import matplotlib.pyplot as plt\n",
+    "import optuna\n",
+    "\n",
+    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler\n",
+    "from sklearn.metrics import mean_absolute_error\n",
+    "from sklearn.model_selection import KFold, train_test_split, GridSearchCV\n",
+    "\n",
+    "\n",
+    "# Models\n",
+    "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, ExtraTreesRegressor, BaggingRegressor, StackingRegressor\n",
+    "from lightgbm import LGBMRegressor\n",
+    "from xgboost import XGBRegressor\n",
+    "from sklego.linear_model import LADRegression\n",
+    "from catboost import CatBoostRegressor\n",
+    "\n",
+    "\n",
+    "# Ignore warnings ;)\n",
+    "import warnings\n",
+    "warnings.simplefilter(\"ignore\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5660f80e",
+   "metadata": {
+    "id": "9iEKB2Oh3uNF",
+    "papermill": {
+     "duration": 0.024267,
+     "end_time": "2023-06-13T09:20:19.959155",
+     "exception": false,
+     "start_time": "2023-06-13T09:20:19.934888",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "source": [
+    "# Import the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fd92fa96",
+   "metadata": {
+    "id": "Y-gW90p23uNH",
+    "papermill": {
+     "duration": 0.680544,
+     "end_time": "2023-06-13T09:20:20.663008",
+     "exception": false,
+     "start_time": "2023-06-13T09:20:19.982464",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# file paths\n",
+    "train_path = \"/kaggle/input/playground-series-s3e16/train.csv\"\n",
+    "test_path = \"/kaggle/input/playground-series-s3e16/test.csv\"\n",
+    "original_path = \"/kaggle/input/crab-age-prediction/CrabAgePrediction.csv\"\n",
+    "synthetic_path = \"/kaggle/input/ps-s3-e16-synthetic-train-data/train_synthetic.csv\"\n",
+    "\n",
+    "# function to import our datasets\n",
+    "def import_data(train_path, test_path, original_path, synthetic_path):\n",
+    "    train = pd.read_csv(train_path)\n",
+    "    test = pd.read_csv(test_path)\n",
+    "    original = pd.read_csv(original_path)\n",
+    "    synthetic = pd.read_csv(synthetic_path)\n",
+    "\n",
+    "    return train, test, original, synthetic\n",
+    "\n",
+    "train, test, original, synthetic = import_data(train_path, test_path, original_path, synthetic_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "729aaacc",
+   "metadata": {
+    "papermill": {
+     "duration": 0.023237,
+     "end_time": "2023-06-13T09:20:20.709912",
+     "exception": false,
+     "start_time": "2023-06-13T09:20:20.686675",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "source": [
+    "The train dataset is a synthetic dataset generated from the [Crab Age Prediction](https://www.kaggle.com/datasets/sidhus/crab-age-prediction) dataset (original). These are the descriptions of the variables in this dataset:\n",
+    "\n",
+    "