From a50ea18180000c2b268bf24c68c4545273e50eca Mon Sep 17 00:00:00 2001 From: RAGE UDAY KIRAN <52146396+udayRage@users.noreply.github.com> Date: Fri, 21 Jul 2023 22:20:42 +0900 Subject: [PATCH] Created using Colaboratory --- notebooks/FTApriori.ipynb | 483 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 483 insertions(+) create mode 100644 notebooks/FTApriori.ipynb diff --git a/notebooks/FTApriori.ipynb b/notebooks/FTApriori.ipynb new file mode 100644 index 000000000..222c7304d --- /dev/null +++ b/notebooks/FTApriori.ipynb @@ -0,0 +1,483 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true, + "authorship_tag": "ABX9TyM8UR/tvog692erME19vcT9", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Finding coverage patterns in transactional databases using CMine" + ], + "metadata": { + "id": "Aj6UkFAj3sHh" + } + }, + { + "cell_type": "markdown", + "source": [ + "This tutorial has two parts. In the first part, we describe the basic approach to find coverage patterns in a transactional database using the CMine algorithm. In the final part, we describe an advanced approach, where we evaluate the CMine algorithm on a dataset at different *minimum coverage support* threshold values." + ], + "metadata": { + "id": "X-YPyS6G4AVR" + } + }, + { + "cell_type": "markdown", + "source": [ + "***" + ], + "metadata": { + "id": "XkW0ZZ276JtD" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites:" + ], + "metadata": { + "id": "H8uDhbi55Use" + } + }, + { + "cell_type": "markdown", + "source": [ + "1. Installing the PAMI library" + ], + "metadata": { + "id": "z-avjjpTZzbf" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install -U pami #install the pami repository" + ], + "metadata": { + "id": "2PdVic4l3-DQ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "2. Downloading a sample dataset" + ], + "metadata": { + "id": "KZMafFx1Z7pn" + } + }, + { + "cell_type": "code", + "source": [ + "!wget -nc https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/Transactional_T10I4D100K.csv #download a sample transactional database" + ], + "metadata": { + "id": "SCtq3Erc5nEo" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "3. Printing few lines of a dataset to know its format." + ], + "metadata": { + "id": "nXgP5F4eaBTW" + } + }, + { + "cell_type": "code", + "source": [ + "!head -2 Transactional_T10I4D100K.csv" + ], + "metadata": { + "id": "7cHiIW59aS2T" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "_format:_ every row contains items seperated by a seperator.\n", + "\n", + "__Example:__\n", + "\n", + "item1 item2 item3 item4\n", + "\n", + "item1 item4 item6" + ], + "metadata": { + "id": "8HnTnR6MaebG" + } + }, + { + "cell_type": "markdown", + "source": [ + "***" + ], + "metadata": { + "id": "uJ8Z0ey_6FhV" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Part 1: Finding coverage patterns with CMine" + ], + "metadata": { + "id": "pCnHao5L6PRR" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Step 1: Understanding the statistics of a database to choose an appropriate *minimum coverage support* (*minCS*) value." + ], + "metadata": { + "id": "XzsAMOgb6jjQ" + } + }, + { + "cell_type": "code", + "source": [ + "#import the class file\n", + "import PAMI.extras.dbStats.transactionalDatabaseStats as stats\n", + "\n", + "#specify the file name\n", + "inputFile = 'Transactional_T10I4D100K.csv'\n", + "\n", + "#initialize the class\n", + "obj=stats.transactionalDatabaseStats(inputFile,sep='\\t')\n", + "\n", + "#execute the class\n", + "obj.run()\n", + "\n", + "#Printing each of the database statistics\n", + "print(f'Database size : {obj.getDatabaseSize()}')\n", + "print(f'Total number of items : {obj.getTotalNumberOfItems()}')\n", + "print(f'Database sparsity : {obj.getSparsity()}')\n", + "print(f'Minimum Transaction Size : {obj.getMinimumTransactionLength()}')\n", + "print(f'Average Transaction Size : {obj.getAverageTransactionLength()}')\n", + "print(f'Maximum Transaction Size : {obj.getMaximumTransactionLength()}')\n", + "print(f'Standard Deviation Transaction Size : {obj.getStandardDeviationTransactionLength()}')\n", + "print(f'Variance in Transaction Sizes : {obj.getVarianceTransactionLength()}')\n", + "\n", + "#saving the distribution of items' frequencies and transactional lengths\n", + "itemFrequencies = obj.getSortedListOfItemFrequencies()\n", + "transactionLength = obj.getTransanctionalLengthDistribution()\n", + "obj.save(itemFrequencies, 'itemFrequency.csv')\n", + "obj.save(transactionLength, 'transactionSize.csv')\n", + "\n", + "#Alternative apporach to derive the database statistics and plot the graphs\n", + "# obj.printStats()\n", + "# obj.plotGraphs()" + ], + "metadata": { + "id": "3isxz2qx54te" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 2: Draw the items' frequency graph and transaction length's distribution graphs for more information" + ], + "metadata": { + "id": "CnoHNwAC-kJQ" + } + }, + { + "cell_type": "code", + "source": [ + "import PAMI.extras.graph.plotLineGraphFromDictionary as plt\n", + "\n", + "itemFrequencies = obj.getFrequenciesInRange()\n", + "transactionLength = obj.getTransanctionalLengthDistribution()\n", + "plt.plotLineGraphFromDictionary(itemFrequencies, 100, 'Items\\' frequency graph', 'No of items', 'frequency')\n", + "plt.plotLineGraphFromDictionary(transactionLength, 100, 'transaction distribution graph', 'transaction length', 'frequency')" + ], + "metadata": { + "id": "xwZmfh4H7XSR" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 3: Choosing an appropriate *minCS* value\n", + "\n", + "_Observations_\n", + "\n", + " 1. The input dataset is sparse as the sparsity value is 0.988 (=98.8%)\n", + " 2. Many items have low frequencies as seen in the items' frequency graph\n", + " 3. The dataset is not high dimensional as the inverted curve is around 10.\n", + "\n", + " Based on the above observations, let us choose a _minCS_ value of 300 (in count). We can increase or decrease the _minCS_ based on the number of patterns being generated." + ], + "metadata": { + "id": "BtnhlIfLAEHR" + } + }, + { + "cell_type": "code", + "source": [ + "minimumCoverageSupport=0.08 #A coverage pattern must appear at least in 8% of the transactions" + ], + "metadata": { + "id": "MHGkrW5e-zYR" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 4: Choosing other parameters (minRF and maxOR) values" + ], + "metadata": { + "id": "oAzXb0MYQKoo" + } + }, + { + "cell_type": "code", + "source": [ + "minimumRelativeFrequency=0.02 #every item must appear at least 2% of the transactions\n", + "maximumOverlapRatio=0.8 #Overlap between an itemset and a new item must not be more than 80%" + ], + "metadata": { + "id": "s8xGRdo8QcrS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 4: Mining coverage patterns using CMine" + ], + "metadata": { + "id": "PlLmo9ZXDJkX" + } + }, + { + "cell_type": "code", + "source": [ + "from PAMI.coveragePattern.basic import CMine as alg #import the algorithm\n", + "\n", + "obj = alg.CMine(iFile=inputFile, minRF=minimumRelativeFrequency, minCS=minimumCoverageSupport, maxOR=maximumOverlapRatio, sep='\\t') #initialize\n", + "obj.startMine() #start the mining process\n", + "\n", + "obj.save('coveragePatterns.txt') #save the patterns\n", + "\n", + "\n", + "coveragePatternsDF= obj.getPatternsAsDataFrame() #get the generated frequent patterns as a dataframe\n", + "print('Total No of patterns: ' + str(len(coveragePatternsDF))) #print the total number of patterns\n", + "print('Runtime: ' + str(obj.getRuntime())) #measure the runtime\n", + "\n", + "print('Memory (RSS): ' + str(obj.getMemoryRSS()))\n", + "print('Memory (USS): ' + str(obj.getMemoryUSS()))" + ], + "metadata": { + "id": "pLV84IYcDHe3" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 5: Investigating the generated patterns\n", + "\n", + "Open the patterns' file and investigate the generated patterns. If the generated patterns were interesting, use them; otherwise, redo the Steps 3 and 4 with a different _minSup_ value." + ], + "metadata": { + "id": "wE3A15V3FHjO" + } + }, + { + "cell_type": "code", + "source": [ + "!head coveragePatterns.txt" + ], + "metadata": { + "id": "nZ6dfQshDc9E" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "The storage format is: _coveragePattern:coverageSupport_\n", + "\n" + ], + "metadata": { + "id": "OSrlE5hwGEnH" + } + }, + { + "cell_type": "markdown", + "source": [ + "***" + ], + "metadata": { + "id": "HcaEFLGCHBjP" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Part 2: Evaluating the CMiner algorithm on a dataset at different minSup values" + ], + "metadata": { + "id": "FUO1nBPoHXJN" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Step 1: Import the libraries and specify the input parameters" + ], + "metadata": { + "id": "g90LiLg1Hzb-" + } + }, + { + "cell_type": "code", + "source": [ + "#Import the libraries\n", + "from PAMI.coveragePattern.basic import CMine as alg #import the algorithm\n", + "import pandas as pd\n", + "\n", + "#Specify the input parameters\n", + "inputFile = 'Transactional_T10I4D100K.csv'\n", + "seperator='\\t'\n", + "minimumCoverageSupportValues = [0.09,0.08,0.07,0.06,0.05]\n", + "#minimumCoverageSupport is specified between 0 to 1." + ], + "metadata": { + "id": "FfYBLxBPG5n_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 2: Create a data frame to store the results of CMine" + ], + "metadata": { + "id": "2lvGcqTKJDee" + } + }, + { + "cell_type": "code", + "source": [ + "result = pd.DataFrame(columns=['algorithm', 'minCS', 'patterns', 'runtime', 'memory'])\n", + "#initialize a data frame to store the results of FPGrowth algorithm" + ], + "metadata": { + "id": "9Hl47i3ZICpp" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 3: Execute the algorithm at different minSup values" + ], + "metadata": { + "id": "Hah3RHNNJf0f" + } + }, + { + "cell_type": "code", + "source": [ + "for minimumCoverageSupport in minimumCoverageSupportValues:\n", + " obj = alg.CMine(inputFile,minRF=minimumRelativeFrequency,minCS=minimumCoverageSupport,maxOR=maximumOverlapRatio,sep=seperator)\n", + " obj.startMine()\n", + " #store the results in the data frame\n", + " result.loc[result.shape[0]] = ['CMine', minimumCoverageSupport, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]" + ], + "metadata": { + "id": "9ivJHoSEJlky" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 4: Print the result" + ], + "metadata": { + "id": "qs3jpnTBJSd6" + } + }, + { + "cell_type": "code", + "source": [ + "print(result)" + ], + "metadata": { + "id": "MReBFwDFJ_3x" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Step 5: Visualizing the results" + ], + "metadata": { + "id": "6ELjr0iPKDSY" + } + }, + { + "cell_type": "code", + "source": [ + "from PAMI.extras.graph import plotLineGraphsFromDataFrame as plt\n", + "\n", + "ab = plt.plotGraphsFromDataFrame(result)\n", + "ab.plotGraphsFromDataFrame()" + ], + "metadata": { + "id": "DvTQdsGLKJIU" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file