Tutorial notebook DAMP #920

Open · wants to merge 17 commits into base: main · Changes from 1 commit
276 changes: 276 additions & 0 deletions docs/Tutorial_DAMP.ipynb
@@ -0,0 +1,276 @@
{
@NimaSarajpoor (Collaborator, Author), Oct 14, 2023:
Line #24.        For the given index i, segmentize the array np.arange(i) into 

Is there a better way to describe the task of this function? How about:

"Given the index i, divide the array np.arange(i) into chunks. Ensure the last chunk's size is chunksize_init, with the preceding chunks doubling in size. The output consists of (start, stop) indices of each chunk."

Also, we may add:

The left-most chunk always starts at 0.
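For concreteness, here is one possible sketch of the chunking described above (the name `naive_get_range_damp` appears later in this thread; the exact signature is an assumption):

```python
def naive_get_range_damp(i, chunksize_init):
    # Walk backwards from index `i`: the right-most (last) chunk has size
    # `chunksize_init`, and each chunk to its left doubles in size. The
    # left-most chunk always starts at 0 and absorbs whatever remains.
    ranges = []
    chunksize = chunksize_init
    stop = i
    while stop > 0:
        start = max(0, stop - chunksize)
        ranges.append((start, stop))
        stop = start
        chunksize *= 2
    return ranges[::-1]  # left-most chunk first

print(naive_get_range_damp(7, 2))  # → [(0, 1), (1, 5), (5, 7)]
```

Note how the last chunk has size `chunksize_init`, the preceding chunks double in size, and the left-most chunk is simply whatever is left over down to index 0.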



@seanlaw (Contributor), Oct 16, 2023:
This function seems weird. It's not clear why you need to end with the last element being chunksize_init

Something feels backwards here. The intent is hard to follow.

Contributor:
Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

@NimaSarajpoor (Collaborator, Author), Oct 25, 2023:
Something feels backwards here

Well, it is backward (I mean... the process itself is backward). But I trust your lens! Please allow me to think more and see if I can come up with something more elegant.

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

I wanted to do that but I decided to start simple. What we can do is keep shifting the start/stop indices of all chunks by one in each iteration. So, for instance, if I get the chunks for the subsequences in T[:idx], then I can find the chunks for T[:(idx+1)] by just shifting the start/stop indices of the chunks by one to the right. We just need to keep track of the left-most chunk, though, as its size can become a power of two at some point. In that case, the number of chunks will increase by one.

@NimaSarajpoor (Collaborator, Author), Oct 25, 2023:
[FYR...before I forget]
Another approach is to ALWAYS OFFSET FROM THE QUERY INDEX idx by s, 2s, 4s, 8s, and so on, where s is a power of two. The good thing is that the difference between any two consecutive offsets is STILL a power of two. This may result in cleaner code. IIRC, I saw something similar in the MATLAB code before. Going to check that as well.
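A quick sanity check of that observation (`s = 8` is just an assumed example value):

```python
s = 8  # assumed: a power-of-two base offset
offsets = [s * 2 ** k for k in range(5)]  # offsets from the query index
gaps = [b - a for a, b in zip(offsets, offsets[1:])]

print(offsets)  # [8, 16, 32, 64, 128]
print(gaps)     # [8, 16, 32, 64]

# every gap between consecutive offsets is itself a power of two
assert all(g & (g - 1) == 0 for g in gaps)
```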

@NimaSarajpoor (Collaborator, Author), Oct 29, 2023:
The MATLAB code uses the following lines for updating the start / stop of each chunk:

# Query is at index `i`, i.e. `query = T(i:i+SubsequenceLength-1)`
# X is the chunk size, initially set to `2 ^ nextpow2(8 * SubsequenceLength)`

X_start = i - X + 1 + (expansion_num * SubsequenceLength)
X_end = i - (X / 2) + (expansion_num * SubsequenceLength)

# and then
approximate_distance = min( real(MASS_V2(T(X_start:X_end), query))); 
X = X * 2
expansion_num = expansion_num  + 1

The term (expansion_num * SubsequenceLength) is to take into account the elements of the last subsequence in the new chunk. To keep the length of chunk untouched, the X_start has the same shift.
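For readers less familiar with MATLAB's 1-based, inclusive indexing, here is a rough 0-based Python transliteration of that update rule (the helper names `next_pow2` and `chunk_bounds` are assumptions for illustration, not the actual notebook code):

```python
import math

def next_pow2(v):
    # smallest power of two that is >= v
    return int(2 ** math.ceil(math.log2(v)))

def chunk_bounds(i, m, num_expansions):
    # Reproduce the MATLAB X_start / X_end updates with 0-based indexing
    # and an exclusive stop. Each chunk's length is X // 2, i.e. always
    # a power of two, as required by the paper.
    bounds = []
    X = next_pow2(8 * m)
    for expansion_num in range(num_expansions):
        start = i - X + 1 + expansion_num * m
        stop = i - X // 2 + expansion_num * m + 1  # exclusive stop
        bounds.append((start, stop))
        X *= 2
    return bounds

bounds = chunk_bounds(i=10_000, m=50, num_expansions=3)
# every chunk length (stop - start) is a power of two
assert all(((stop - start) & (stop - start - 1)) == 0 for start, stop in bounds)
```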


Note 1:
According to the paper / MATLAB code, the length of the chunk, i.e. T in core.mass(Q, T) (NOT the number of subsequences), should be a power of two.

Note 2:
According to the authors, the reason for choosing a power-of-two chunk length is to speed up the MASS algorithm.

So, I did a quick check to see how much a power-of-two chunk length affects the performance.

seed = 0
np.random.seed(seed)

T = np.random.rand(10_000_000)
m = 50

T, M_T, Σ_T, T_subseq_isconstant = stumpy.core.preprocess(T, m)

query_idx = 600000
start = 0  # the chunk begins at 0, so the chunk length equals `stop`
t_lst = []
for stop in range(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1):
    time_start = time.time()

    QT = core.sliding_dot_product(
        T[query_idx : query_idx + m], 
        T[start : stop],
    )
    
    D = core._mass(
        T[query_idx : query_idx + m],
        T[start : stop],
        QT=QT,
        μ_Q=M_T[query_idx],
        σ_Q=Σ_T[query_idx],
        M_T=M_T[start : stop - m + 1],
        Σ_T=Σ_T[start : stop - m + 1],
        Q_subseq_isconstant=T_subseq_isconstant[query_idx],
        T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],
    )

    time_end = time.time()
    t_lst.append(time_end - time_start)

And then I plot it:

indices = np.arange(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1)
indices = indices[2:]
t_lst = t_lst[2:]

idx = np.flatnonzero(indices == 2 ** 20)[0]

plt.figure(figsize=(20, 5))
plt.scatter(indices[idx], t_lst[idx], color='r', label='chunksize = 2 ** 20')
plt.plot(indices[idx-200 : idx+200], t_lst[idx-200:idx+200]) 
plt.ylabel('running time')
plt.legend()
plt.show()
[image: running time vs. chunk length, with chunksize = 2 ** 20 highlighted in red]

Well, it seems the running time for the chunk size 2 ** 20 is low (not necessarily the lowest), but it should be okay. To remain faithful to the paper, I am going to use a power-of-two length for each chunk.

@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:
Line #28.        while start < chunk_stop:

Is this code readable? I tried to vectorize it but couldn't figure out "how". If the number of subsequences in each chunk were supposed to be a power of two, I think I would be able to vectorize it. However, according to the paper, the size of the chunk itself should be a power of two. In other words, the number of subsequences is like... 2 ** num - m + 1

Since I couldn't find a cleaner solution, I am going with naive_get_range_damp for now.



@seanlaw (Contributor), Nov 20, 2023:
I think I need to see more examples of what the inputs/outputs are supposed to be in order to understand what is expected. Right now, I'm very confused. Can you give me some concrete examples and any gotchas (i.e., when things aren't perfectly square)?

@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:

Line #67.            PL[start : stop - m + 1] = np.minimum(PL[start : stop - m + 1], D)

Previously, we had the variable pruned, which was a boolean vector (of length `len(T)-m+1`), where pruned[i] is True if the i-th subsequence is pruned. And, instead of the line above (i.e. line #67), we had:

mask = np.argwhere(D < bsf) + start

pruned[mask] = True

Recall that the paper's original algorithm does not compute PL[i] if pruned[i] is True. It just skips it. In that case, the original algorithm sets PL[i] as follows:

PL[i] = PL[i-1]

which makes it difficult to find the correct index of the discord. The MATLAB code does a hack instead, as follows:

PL[i] = PL[i-1] - 0.000001

and this does not seem to be a clean solution.

So, instead, I am updating the (approximate) matrix profile PL by using the computed distance profile D. This helps us avoid the hack, and I think it should not be computationally more expensive compared to np.argwhere(D < bsf)
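To illustrate the equivalence on a toy example (the variable names here are illustrative, not the notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random(8)   # distance profile of the current chunk
bsf = 0.5           # best-so-far discord distance
start = 2           # chunk's start index within PL

# MATLAB-style bookkeeping: a boolean "pruned" flag per subsequence
pruned = np.zeros(16, dtype=bool)
pruned[np.flatnonzero(D < bsf) + start] = True

# Alternative: store the distances themselves, so the exact discord
# index can be recovered later without the PL[i-1] - 0.000001 hack
PL = np.full(16, np.inf)
PL[start : start + D.size] = np.minimum(PL[start : start + D.size], D)

# both variants encode the same pruning information
assert np.array_equal(pruned[start : start + D.size], PL[start : start + D.size] < bsf)
```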




@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:

Lines 60-68 update the chunks_range by shifting it one to the right in each iteration. I feel the code is not that clean! Still trying to figure out if there is a better way to update the chunks range. Any thoughts?



"cells": [
{
"cell_type": "markdown",
"id": "0cdc3c0a",
"metadata": {},
"source": [
"# DAMP: Discord-Aware Matrix Profile"
]
},
{
"cell_type": "markdown",
"id": "1bcc1071",
"metadata": {},
"source": [
"Authors in [DAMP](https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf) presented a method for discord detection that is scalable and it can be used in offline and online mode.\n",
"\n",
"To better understand the mechanism behind this method, we should first understand the difference between the full matrix profile and the left matrix profile of a time series `T`. For a subsequence with length `m`, and the start index `i`, i.e. `S_i = T[i:i+m]`, there are two groups of neighbors, known as left and right neighbors. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`. And, the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, considering both left and right ones. However, in the left matrix profile, the `i`-th element is the minimum distance between the subsequence `S_i` and its left neighbors.\n",
"\n",
"One can use either the full matrix profile or the left matrix profile to find the top discord, a subsequence whose distance to its nearest neighbor is larger than the distance of any other subsequences to their nearest neighbors. However, using full matrix profile for detecting discords might result in missing the case where there are two rare subsequences that happen to also be similar to each other (a case that is known as \"twin freak\"). On the other hand, the left matrix profile resolves this problem by capturing the discord at its first occurance. Hence, even if there are two or more of such discords, we can still capture the first occurance by using the left matrix profile."
]
},
{
"cell_type": "markdown",
"id": "27bf47eb",
"metadata": {},
"source": [
"The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of split_idx is problem dependent. If split_idx is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n",
"\n",
"(1) `split_idx = 0`: In this case, the first subsequence can be a discord itself as it is a \"new\" pattern. <br>\n",
"(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all subsequences except the last one!\n"
]
},
{
"cell_type": "markdown",
"id": "ce3102c1",
"metadata": {},
"source": [
"# Getting Started"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c9564cff",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import stumpy\n",
"import time\n",
"\n",
"from stumpy import core\n",
"from scipy.io import loadmat"
]
},
{
"cell_type": "markdown",
"id": "ecad47df",
"metadata": {},
"source": [
"## Naive approach"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "409dad09",
"metadata": {},
"outputs": [],
"source": [
"def naive_DAMP(T, m, split_idx):\n",
" \"\"\"\n",
" Compute the top-1 discord in `T`, where the subsequence discord resides in T[split_index:]\n",
" \n",
" Parameters\n",
" ----------\n",
" T : numpy.ndarray\n",
" A time series for which the top discord will be computed.\n",
" \n",
" m : int\n",
" Window size\n",
" \n",
" split_idx : int\n",
" The split index between train and test. See note below for further details.\n",
" \n",
" Returns\n",
" -------\n",
" PL : numpy.ndarry\n",
" The [exact] left matrix profile. All infinite distances are ingored in computing\n",
" the discord.\n",
" \n",
" discord_dist : float\n",
" The discord's distance, which is the distance between the top discord and its\n",
" left nearest neighbor\n",
" \n",
" discord_idx : int\n",
" The start index of the top discord\n",
" \n",
" Note\n",
" ----\n",
" \n",
" \"\"\"\n",
" mp = stumpy.stump(T, m)\n",
" IL = mp[:, 2].astype(np.int64)\n",
" \n",
" k = len(T) - m + 1 # len(IL)\n",
" PL = np.full(k, np.inf, dtype=np.float64)\n",
" for i in range(split_idx, k):\n",
" nn_i = IL[i]\n",
" if nn_i >= 0:\n",
" PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))\n",
" \n",
" PL_modified = np.where(PL==np.inf, np.NINF, PL)\n",
" discord_idx = np.argmax(PL_modified)\n",
" discord_dist = PL_modified[discord_idx]\n",
" if discord_dist == np.NINF:\n",
" discord_idx = -1\n",
" \n",
" return PL, discord_dist, discord_idx"
]
},
{
"cell_type": "markdown",
"id": "505d4586",
"metadata": {},
"source": [
"## DAMP"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c21c4587",
"metadata": {},
"outputs": [],
"source": [
"def next_pow2(v):\n",
" \"\"\"\n",
" Compute the smallest \"power of two\" number that is greater than/ equal to `v`\n",
" \n",
" Parameters\n",
" ----------\n",
" v : float\n",
" A real positive value\n",
" \n",
" Returns\n",
" -------\n",
" out : int\n",
" An integer value that is power of two, and satisfies `out >= v`\n",
" \"\"\"\n",
" return int(math.pow(2, math.ceil(math.log2(v))))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79663447",
"metadata": {},
"outputs": [],
"source": [
"def _backward_process(\n",
" T, \n",
" m, \n",
" query_idx, \n",
" M_T, \n",
" Σ_T, \n",
" T_subseq_isconstant, \n",
" bsf,\n",
"):\n",
" \"\"\"\n",
" Compute the (approximate) left matrix profile value that corresponds to the subsequence \n",
" `T[query_idx:query_idx+m]` and update the best-so-far discord distance.\n",
" \n",
" Parameters\n",
" ----------\n",
" T : numpy.ndarray\n",
" A time series\n",
" \n",
" m : int\n",
" Window size\n",
" \n",
" query_idx : int\n",
" The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n",
" \n",
" M_T : np.ndarray\n",
" The sliding mean of `T`\n",
" \n",
" Σ_T : np.ndarray\n",
" The sliding standard deviation of `T`\n",
" \n",
" T_subseq_isconstant : numpy.ndarray\n",
" A numpy boolean array whose i-th element indicates whether the subsequence\n",
" `T[i : i+m]` is constant (True)\n",
" \n",
" bsf : float\n",
" The best-so-far discord distance\n",
" \n",
" Returns\n",
" -------\n",
" distance : float\n",
" The [approximate] left matrix profile value that corresponds to \n",
" the query, `T[query_idx : query_idx + m]`.\n",
" \n",
" bsf : float\n",
" The best-so-far discord distance \n",
" \"\"\"\n",
" nn_distance = np.inf # The query's distance to its 1nn.\n",
" \n",
" # To compute the distance between Q=T[query_idx : query_idx + m],\n",
" # and the subsequences in T[chunk_start : chunk_stop].\n",
" chunksize = next_pow2(m) \n",
" chunk_stop = query_idx\n",
" chunk_start = max(0, chunk_stop - chunksize)\n",
" \n",
" while nn_distance >= bsf:\n",
" QT = core.sliding_dot_product(\n",
" T[query_idx : query_idx + m], \n",
" T[chunk_start : chunk_stop],\n",
" )\n",
" D = core._mass(\n",
" T[query_idx : query_idx + m],\n",
" T[chunk_start : chunk_stop],\n",
" QT=QT,\n",
" μ_Q=M_T[query_idx],\n",
" σ_Q=Σ_T[query_idx],\n",
" M_T=M_T[chunk_start : chunk_stop - m + 1],\n",
" Σ_T=Σ_T[chunk_start : chunk_stop - m + 1],\n",
" Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n",
" T_subseq_isconstant=T_subseq_isconstant[chunk_start : chunk_stop - m + 1],\n",
" )\n",
" \n",
" nn_distance = np.min(D)\n",
" if chunk_start == 0: \n",
" # all neighbors of `Q` are visited. Hence, `nn_distance` is exact.\n",
" if nn_distance > bsf:\n",
" bsf = nn_distance\n",
" break\n",
" \n",
" else:\n",
" nn_distance = np.min(D)\n",
" if nn_distance < bfs:\n",
" break\n",
" else:\n",
" chunksize = 2 * chunksize\n",
" chunk_start = max(0, chunk_stop - chunksize)\n",
" \n",
" return nn_distance, bsf"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}