From b311533957b91b1397bcc41fd7ba44c81a107605 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Tue, 3 Oct 2023 01:05:34 -0400 Subject: [PATCH 01/17] Add tutorial notebook DAMP --- docs/Tutorial_DAMP.ipynb | 276 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 276 insertions(+) create mode 100644 docs/Tutorial_DAMP.ipynb diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb new file mode 100644 index 000000000..89fd569d7 --- /dev/null +++ b/docs/Tutorial_DAMP.ipynb @@ -0,0 +1,276 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0cdc3c0a", + "metadata": {}, + "source": [ + "# DAMP: Discord-Aware Matrix Profile" + ] + }, + { + "cell_type": "markdown", + "id": "1bcc1071", + "metadata": {}, + "source": [ + "Authors in [DAMP](https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf) presented a method for discord detection that is scalable and it can be used in offline and online mode.\n", + "\n", + "To better understand the mechanism behind this method, we should first understand the difference between the full matrix profile and the left matrix profile of a time series `T`. For a subsequence with length `m`, and the start index `i`, i.e. `S_i = T[i:i+m]`, there are two groups of neighbors, known as left and right neighbors. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`. And, the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, considering both left and right ones. However, in the left matrix profile, the `i`-th element is the minimum distance between the subsequence `S_i` and its left neighbors.\n", + "\n", + "One can use either the full matrix profile or the left matrix profile to find the top discord, a subsequence whose distance to its nearest neighbor is larger than the distance of any other subsequences to their nearest neighbors. However, using full matrix profile for detecting discords might result in missing the case where there are two rare subsequences that happen to also be similar to each other (a case that is known as \"twin freak\"). On the other hand, the left matrix profile resolves this problem by capturing the discord at its first occurance. Hence, even if there are two or more of such discords, we can still capture the first occurance by using the left matrix profile." + ] + }, + { + "cell_type": "markdown", + "id": "27bf47eb", + "metadata": {}, + "source": [ + "The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of split_idx is problem dependent. If split_idx is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n", + "\n", + "(1) `split_idx = 0`: In this case, the first subsequence can be a discord itself as it is a \"new\" pattern.
\n", + "(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all subsequences except the last one!\n" + ] + }, + { + "cell_type": "markdown", + "id": "ce3102c1", + "metadata": {}, + "source": [ + "# Getting Started" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c9564cff", + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import stumpy\n", + "import time\n", + "\n", + "from stumpy import core\n", + "from scipy.io import loadmat" + ] + }, + { + "cell_type": "markdown", + "id": "ecad47df", + "metadata": {}, + "source": [ + "## Naive approach" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "409dad09", + "metadata": {}, + "outputs": [], + "source": [ + "def naive_DAMP(T, m, split_idx):\n", + " \"\"\"\n", + " Compute the top-1 discord in `T`, where the subsequence discord resides in T[split_index:]\n", + " \n", + " Parameters\n", + " ----------\n", + " T : numpy.ndarray\n", + " A time series for which the top discord will be computed.\n", + " \n", + " m : int\n", + " Window size\n", + " \n", + " split_idx : int\n", + " The split index between train and test. See note below for further details.\n", + " \n", + " Returns\n", + " -------\n", + " PL : numpy.ndarry\n", + " The [exact] left matrix profile. All infinite distances are ingored in computing\n", + " the discord.\n", + " \n", + " discord_dist : float\n", + " The discord's distance, which is the distance between the top discord and its\n", + " left nearest neighbor\n", + " \n", + " discord_idx : int\n", + " The start index of the top discord\n", + " \n", + " Note\n", + " ----\n", + " \n", + " \"\"\"\n", + " mp = stumpy.stump(T, m)\n", + " IL = mp[:, 2].astype(np.int64)\n", + " \n", + " k = len(T) - m + 1 # len(IL)\n", + " PL = np.full(k, np.inf, dtype=np.float64)\n", + " for i in range(split_idx, k):\n", + " nn_i = IL[i]\n", + " if nn_i >= 0:\n", + " PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))\n", + " \n", + " PL_modified = np.where(PL==np.inf, np.NINF, PL)\n", + " discord_idx = np.argmax(PL_modified)\n", + " discord_dist = PL_modified[discord_idx]\n", + " if discord_dist == np.NINF:\n", + " discord_idx = -1\n", + " \n", + " return PL, discord_dist, discord_idx" + ] + }, + { + "cell_type": "markdown", + "id": "505d4586", + "metadata": {}, + "source": [ + "## DAMP" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c21c4587", + "metadata": {}, + "outputs": [], + "source": [ + "def next_pow2(v):\n", + " \"\"\"\n", + " Compute the smallest \"power of two\" number that is greater than/ equal to `v`\n", + " \n", + " Parameters\n", + " ----------\n", + " v : float\n", + " A real positive value\n", + " \n", + " Returns\n", + " -------\n", + " out : int\n", + " An integer value that is power of two, and satisfies `out >= v`\n", + " \"\"\"\n", + " return int(math.pow(2, math.ceil(math.log2(v))))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79663447", + "metadata": {}, + "outputs": [], + "source": [ + "def _backward_process(\n", + " T, \n", + " m, \n", + " query_idx, \n", + " M_T, \n", + " Σ_T, \n", + " T_subseq_isconstant, \n", + " bsf,\n", + "):\n", + " \"\"\"\n", + " Compute the (approximate) left matrix profile value that corresponds to the subsequence \n", + " `T[query_idx:query_idx+m]` and update the best-so-far discord distance.\n", + " \n", + " Parameters\n", + " ----------\n", + " T : numpy.ndarray\n", + " A time series\n", + " \n", + " m : int\n", + " Window size\n", + " \n", + " query_idx : int\n", + " The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n", + " \n", + " M_T : np.ndarray\n", + " The sliding mean of `T`\n", + " \n", + " Σ_T : np.ndarray\n", + " The sliding standard deviation of `T`\n", + " \n", + " T_subseq_isconstant : numpy.ndarray\n", + " A numpy boolean array whose i-th element indicates whether the subsequence\n", + " `T[i : i+m]` is constant (True)\n", + " \n", + " bsf : float\n", + " The best-so-far discord distance\n", + " \n", + " Returns\n", + " -------\n", + " distance : float\n", + " The [approximate] left matrix profile value that corresponds to \n", + " the query, `T[query_idx : query_idx + m]`.\n", + " \n", + " bsf : float\n", + " The best-so-far discord distance \n", + " \"\"\"\n", + " nn_distance = np.inf # The query's distance to its 1nn.\n", + " \n", + " # To compute the distance between Q=T[query_idx : query_idx + m],\n", + " # and the subsequences in T[chunk_start : chunk_stop].\n", + " chunksize = next_pow2(m) \n", + " chunk_stop = query_idx\n", + " chunk_start = max(0, chunk_stop - chunksize)\n", + " \n", + " while nn_distance >= bsf:\n", + " QT = core.sliding_dot_product(\n", + " T[query_idx : query_idx + m], \n", + " T[chunk_start : chunk_stop],\n", + " )\n", + " D = core._mass(\n", + " T[query_idx : query_idx + m],\n", + " T[chunk_start : chunk_stop],\n", + " QT=QT,\n", + " μ_Q=M_T[query_idx],\n", + " σ_Q=Σ_T[query_idx],\n", + " M_T=M_T[chunk_start : chunk_stop - m + 1],\n", + " Σ_T=Σ_T[chunk_start : chunk_stop - m + 1],\n", + " Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n", + " T_subseq_isconstant=T_subseq_isconstant[chunk_start : chunk_stop - m + 1],\n", + " )\n", + " \n", + " nn_distance = np.min(D)\n", + " if chunk_start == 0: \n", + " # all neighbors of `Q` are visited. Hence, `nn_distance` is exact.\n", + " if nn_distance > bsf:\n", + " bsf = nn_distance\n", + " break\n", + " \n", + " else:\n", + " nn_distance = np.min(D)\n", + " if nn_distance < bfs:\n", + " break\n", + " else:\n", + " chunksize = 2 * chunksize\n", + " chunk_start = max(0, chunk_stop - chunksize)\n", + " \n", + " return nn_distance, bsf" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 4ef1f5d443626c856b12970b21fa6d22fd6ae5a8 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sat, 7 Oct 2023 23:23:25 -0400 Subject: [PATCH 02/17] add function to compute range --- docs/Tutorial_DAMP.ipynb | 143 +++++++++++++++++++++++++++++++-------- 1 file changed, 114 insertions(+), 29 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 89fd569d7..a1624b30a 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -67,7 +67,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -133,7 +133,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 95, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -157,7 +157,94 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 103, + "id": "387ca010", + "metadata": {}, + "outputs": [], + "source": [ + "def naive_get_range_damp(i, m, chunksize_init=None):\n", + " if chunksize_init is None:\n", + " chunksize_init = next_pow2(m)\n", + " \n", + " lst = []\n", + " chunksize = chunksize_init\n", + " stop = i\n", + " for _ in range(i):\n", + " start = stop - chunksize\n", + " start = max(0, start)\n", + " lst.append([start, stop])\n", + " \n", + " if start <= 0:\n", + " break\n", + " \n", + " stop = start\n", + " chunksize = 2 * chunksize\n", + " \n", + " return np.array(lst)\n", + "\n", + "\n", + "def _get_range_damp(i, m, chunksize_init=None):\n", + " \"\"\"\n", + " For the given index `i`, segments the array `np.arange(i)` into \n", + " chunks such that the last chunk has size `chunksize_init`, and \n", + " the one before that has size `2 * chunksize_init` and so on. The\n", + " output contains the (start, stop) of chunks.\n", + " \n", + " Parameters\n", + " ----------\n", + " i : int\n", + " The stop index\n", + " \n", + " m : int\n", + " The window size\n", + " \n", + " chunksize_init : int\n", + " The initial chunksize, which is the size of the last chunk\n", + " \n", + " Returns\n", + " -------\n", + " out : np.ndarray\n", + " A 2D numpy array, where each row has (start, stop) index. The\n", + " very first chunk has start index `0`.\n", + " \"\"\"\n", + " if chunksize_init is None:\n", + " chunksize_init = next_pow2(m)\n", + " \n", + " n = int(math.ceil(math.log2(i / chunksize_init + 1))) \n", + " indices = i - (np.power(2, np.arange(n + 1)) - 1) * chunksize_init\n", + " start_indices = indices[1:]\n", + " stop_indices = indices[:-1]\n", + " \n", + " out = np.empty((n, 2), dtype=np.int64)\n", + " out[:, 0] = start_indices\n", + " out[:, 1] = stop_indices\n", + " \n", + " out[-1, 0] = 0\n", + " \n", + " return out\n", + "\n", + "\n", + "# test \n", + "for i in range(8, 65):\n", + " for m in range(3, 8):\n", + " ref = naive_get_range_damp(i, m)\n", + " cmp = _get_range_damp(i, m)\n", + " np.testing.assert_almost_equal(ref, cmp)\n", + " \n", + " chunksize_init = next_pow2(2 * m)\n", + " ref = naive_get_range_damp(i, m, chunksize_init)\n", + " cmp = _get_range_damp(i, m, chunksize_init)\n", + " np.testing.assert_almost_equal(ref, cmp)\n", + " \n", + " chunksize_init = 1\n", + " ref = naive_get_range_damp(i, m, chunksize_init)\n", + " cmp = _get_range_damp(i, m, chunksize_init)\n", + " np.testing.assert_almost_equal(ref, cmp)" + ] + }, + { + "cell_type": "code", + "execution_count": 86, "id": "79663447", "metadata": {}, "outputs": [], @@ -208,48 +295,46 @@ " bsf : float\n", " The best-so-far discord distance \n", " \"\"\"\n", - " nn_distance = np.inf # The query's distance to its 1nn.\n", - " \n", - " # To compute the distance between Q=T[query_idx : query_idx + m],\n", - " # and the subsequences in T[chunk_start : chunk_stop].\n", + " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunksize = next_pow2(m) \n", - " chunk_stop = query_idx\n", - " chunk_start = max(0, chunk_stop - chunksize)\n", " \n", - " while nn_distance >= bsf:\n", + " nn_distance = np.inf\n", + " for (start, stop) in _get_range_damp(query_idx - excl_zone, m, chunksize):\n", + " # The stop index is the last index from which a subsequence starts,\n", + " # and continues till `stop - 1 + m`\n", " QT = core.sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", - " T[chunk_start : chunk_stop],\n", + " T[start : stop - 1 + m],\n", " )\n", " D = core._mass(\n", " T[query_idx : query_idx + m],\n", - " T[chunk_start : chunk_stop],\n", + " T[start : stop - 1 + m],\n", " QT=QT,\n", " μ_Q=M_T[query_idx],\n", " σ_Q=Σ_T[query_idx],\n", - " M_T=M_T[chunk_start : chunk_stop - m + 1],\n", - " Σ_T=Σ_T[chunk_start : chunk_stop - m + 1],\n", + " M_T=M_T[start : stop],\n", + " Σ_T=Σ_T[start : stop],\n", " Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n", - " T_subseq_isconstant=T_subseq_isconstant[chunk_start : chunk_stop - m + 1],\n", + " T_subseq_isconstant=T_subseq_isconstant[start : stop],\n", " )\n", " \n", - " nn_distance = np.min(D)\n", - " if chunk_start == 0: \n", - " # all neighbors of `Q` are visited. Hence, `nn_distance` is exact.\n", - " if nn_distance > bsf:\n", - " bsf = nn_distance\n", + " nn_distance = min(nn_distance, np.min(D))\n", + " if nn_distance < bfs:\n", " break\n", - " \n", - " else:\n", - " nn_distance = np.min(D)\n", - " if nn_distance < bfs:\n", - " break\n", - " else:\n", - " chunksize = 2 * chunksize\n", - " chunk_start = max(0, chunk_stop - chunksize)\n", - " \n", + " \n", + " else:\n", + " bsf = nn_distance\n", + " \n", " return nn_distance, bsf" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c76317b5", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { From efeb68a5096ed7d3651864c5ec5038afb114df31 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Thu, 12 Oct 2023 22:39:25 -0400 Subject: [PATCH 03/17] Clean code --- docs/Tutorial_DAMP.ipynb | 61 +++++++++++++++++++++++++--------------- 1 file changed, 38 insertions(+), 23 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index a1624b30a..db090cf42 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -25,10 +25,12 @@ "id": "27bf47eb", "metadata": {}, "source": [ - "The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of split_idx is problem dependent. If split_idx is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n", + "The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of `split_idx` is problem dependent. If `split_idx` is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n", "\n", "(1) `split_idx = 0`: In this case, the first subsequence can be a discord itself as it is a \"new\" pattern.
\n", - "(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all subsequences except the last one!\n" + "(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all previous subsequences!\n", + "\n", + "As we can see, the problem-dependent parameter `split_idx` should be set to a proper value by the user." ] }, { @@ -41,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -67,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "409dad09", "metadata": {}, "outputs": [], @@ -133,7 +135,7 @@ }, { "cell_type": "code", - "execution_count": 95, + "execution_count": 5, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -157,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 103, + "execution_count": 16, "id": "387ca010", "metadata": {}, "outputs": [], @@ -172,23 +174,23 @@ " for _ in range(i):\n", " start = stop - chunksize\n", " start = max(0, start)\n", - " lst.append([start, stop])\n", " \n", - " if start <= 0:\n", + " lst.append([start, stop])\n", + " if start > 0:\n", + " stop = start\n", + " chunksize = 2 * chunksize\n", + " else:\n", " break\n", - " \n", - " stop = start\n", - " chunksize = 2 * chunksize\n", - " \n", + " \n", " return np.array(lst)\n", "\n", "\n", "def _get_range_damp(i, m, chunksize_init=None):\n", " \"\"\"\n", - " For the given index `i`, segments the array `np.arange(i)` into \n", + " For the given index `i`, segmentize the array `np.arange(i)` into \n", " chunks such that the last chunk has size `chunksize_init`, and \n", " the one before that has size `2 * chunksize_init` and so on. The\n", - " output contains the (start, stop) of chunks.\n", + " output contains the (start, stop) indices of chunks.\n", " \n", " Parameters\n", " ----------\n", @@ -210,17 +212,30 @@ " if chunksize_init is None:\n", " chunksize_init = next_pow2(m)\n", " \n", - " n = int(math.ceil(math.log2(i / chunksize_init + 1))) \n", - " indices = i - (np.power(2, np.arange(n + 1)) - 1) * chunksize_init\n", - " start_indices = indices[1:]\n", - " stop_indices = indices[:-1]\n", + " # The array [0, 1, 2, ..., i-1] is chunked as follows:\n", + " # `s` is the initial chunksize\n", " \n", - " out = np.empty((n, 2), dtype=np.int64)\n", - " out[:, 0] = start_indices\n", - " out[:, 1] = stop_indices\n", + " # chunks are:\n", + " # 1st chunk: [i - s: i],\n", + " # 2nd chunk: [i - s - (2s): i - s],\n", + " # 3rd chunk: [i - 3s - (4s) : i - 3s]\n", + " # 4th chunk: [i - 7s - (8s) : i - 7s]\n", + " # ...\n", + " \n", + " # The first index of the n-th chunk will be `i - (2^n - 1)s`. \n", + " # So, if we have `n` chunks, then `n` should be the answer to:\n", + " # minimum int `n` s.t. (2 ** n - 1) * s >= i \n", " \n", - " out[-1, 0] = 0\n", + " n_chunks = int(math.ceil(math.log2((i / chunksize_init) + 1)))\n", + " chunks_ticks = i - (np.power(2, np.arange(n_chunks + 1)) - 1) * chunksize_init\n", + " start_indices = chunks_ticks[1:]\n", + " stop_indices = chunks_ticks[:-1]\n", " \n", + " out = np.empty((n_chunks, 2), dtype=np.int64)\n", + " out[:, 0] = start_indices\n", + " out[:, 1] = stop_indices\n", + " out[-1, 0] = 0 # the most-left chunk starts at 0.\n", + " \n", " return out\n", "\n", "\n", @@ -244,7 +259,7 @@ }, { "cell_type": "code", - "execution_count": 86, + "execution_count": 17, "id": "79663447", "metadata": {}, "outputs": [], From f3c7f30032cb3e9343632b9a42ac6a35749eed4f Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Thu, 12 Oct 2023 23:09:44 -0400 Subject: [PATCH 04/17] enhance readability --- docs/Tutorial_DAMP.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index db090cf42..dfffe3135 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -315,8 +315,8 @@ " \n", " nn_distance = np.inf\n", " for (start, stop) in _get_range_damp(query_idx - excl_zone, m, chunksize):\n", - " # The stop index is the last index from which a subsequence starts,\n", - " # and continues till `stop - 1 + m`\n", + " # (start, stop) contains the \"start\" index of subsequences that should \n", + " # be compared against the query `Q = T[query_idx : query_idx + m]`\n", " QT = core.sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", " T[start : stop - 1 + m],\n", @@ -334,7 +334,7 @@ " )\n", " \n", " nn_distance = min(nn_distance, np.min(D))\n", - " if nn_distance < bfs:\n", + " if nn_distance < bsf:\n", " break\n", " \n", " else:\n", From 01bb14cc5bd195f6158f5b699a04f94598ebed01 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sat, 14 Oct 2023 15:26:34 -0400 Subject: [PATCH 05/17] improve docstring --- docs/Tutorial_DAMP.ipynb | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index dfffe3135..ddfc9e6b8 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -135,7 +135,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -159,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 5, "id": "387ca010", "metadata": {}, "outputs": [], @@ -206,8 +206,10 @@ " Returns\n", " -------\n", " out : np.ndarray\n", - " A 2D numpy array, where each row has (start, stop) index. The\n", - " very first chunk has start index `0`.\n", + " A 2D numpy array, where each row has (start, stop) index. \n", + " `out[0]` is the first chunk just before `i`. `out[-1]` is\n", + " the left-most chunk in the time series, with start index 0. \n", + " \n", " \"\"\"\n", " if chunksize_init is None:\n", " chunksize_init = next_pow2(m)\n", @@ -223,8 +225,8 @@ " # ...\n", " \n", " # The first index of the n-th chunk will be `i - (2^n - 1)s`. \n", - " # So, if we have `n` chunks, then `n` should be the answer to:\n", - " # minimum int `n` s.t. (2 ** n - 1) * s >= i \n", + " # So, if we have `n` chunks in total, `n` is the minimum int\n", + " # that satisfies `(2**n - 1) * s >= i`\n", " \n", " n_chunks = int(math.ceil(math.log2((i / chunksize_init) + 1)))\n", " chunks_ticks = i - (np.power(2, np.arange(n_chunks + 1)) - 1) * chunksize_init\n", @@ -316,7 +318,7 @@ " nn_distance = np.inf\n", " for (start, stop) in _get_range_damp(query_idx - excl_zone, m, chunksize):\n", " # (start, stop) contains the \"start\" index of subsequences that should \n", - " # be compared against the query `Q = T[query_idx : query_idx + m]`\n", + " # be compared against the query `Q = T[query_idx : query_idx + m]`.\n", " QT = core.sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", " T[start : stop - 1 + m],\n", From f871b4266d256835c676f9fe83d88aa6b1393440 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Mon, 6 Nov 2023 01:40:48 -0500 Subject: [PATCH 06/17] minor fix for finding dhunks for function --- docs/Tutorial_DAMP.ipynb | 136 +++++++++------------------------------ 1 file changed, 29 insertions(+), 107 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index ddfc9e6b8..cd3d4bb49 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -101,9 +101,6 @@ " \n", " discord_idx : int\n", " The start index of the top discord\n", - " \n", - " Note\n", - " ----\n", " \n", " \"\"\"\n", " mp = stumpy.stump(T, m)\n", @@ -159,109 +156,35 @@ }, { "cell_type": "code", - "execution_count": 5, - "id": "387ca010", + "execution_count": 1, + "id": "dd1f7b3d", "metadata": {}, "outputs": [], "source": [ - "def naive_get_range_damp(i, m, chunksize_init=None):\n", - " if chunksize_init is None:\n", - " chunksize_init = next_pow2(m)\n", - " \n", - " lst = []\n", - " chunksize = chunksize_init\n", - " stop = i\n", - " for _ in range(i):\n", - " start = stop - chunksize\n", - " start = max(0, start)\n", - " \n", - " lst.append([start, stop])\n", - " if start > 0:\n", - " stop = start\n", - " chunksize = 2 * chunksize\n", - " else:\n", - " break\n", - " \n", - " return np.array(lst)\n", - "\n", - "\n", - "def _get_range_damp(i, m, chunksize_init=None):\n", + "def naive_get_range_damp(stop, m, chunksize=None, start=0):\n", " \"\"\"\n", - " For the given index `i`, segmentize the array `np.arange(i)` into \n", - " chunks such that the last chunk has size `chunksize_init`, and \n", - " the one before that has size `2 * chunksize_init` and so on. The\n", - " output contains the (start, stop) indices of chunks.\n", - " \n", - " Parameters\n", - " ----------\n", - " i : int\n", - " The stop index\n", - " \n", - " m : int\n", - " The window size\n", - " \n", - " chunksize_init : int\n", - " The initial chunksize, which is the size of the last chunk\n", - " \n", - " Returns\n", - " -------\n", - " out : np.ndarray\n", - " A 2D numpy array, where each row has (start, stop) index. \n", - " `out[0]` is the first chunk just before `i`. `out[-1]` is\n", - " the left-most chunk in the time series, with start index 0. \n", - " \n", + " For the given indices range `(start, stop)`, returns the\n", + " (start, stop) of chunks, each with a length that is power\n", + " of two.\n", " \"\"\"\n", - " if chunksize_init is None:\n", - " chunksize_init = next_pow2(m)\n", - " \n", - " # The array [0, 1, 2, ..., i-1] is chunked as follows:\n", - " # `s` is the initial chunksize\n", - " \n", - " # chunks are:\n", - " # 1st chunk: [i - s: i],\n", - " # 2nd chunk: [i - s - (2s): i - s],\n", - " # 3rd chunk: [i - 3s - (4s) : i - 3s]\n", - " # 4th chunk: [i - 7s - (8s) : i - 7s]\n", - " # ...\n", - " \n", - " # The first index of the n-th chunk will be `i - (2^n - 1)s`. \n", - " # So, if we have `n` chunks in total, `n` is the minimum int\n", - " # that satisfies `(2**n - 1) * s >= i`\n", - " \n", - " n_chunks = int(math.ceil(math.log2((i / chunksize_init) + 1)))\n", - " chunks_ticks = i - (np.power(2, np.arange(n_chunks + 1)) - 1) * chunksize_init\n", - " start_indices = chunks_ticks[1:]\n", - " stop_indices = chunks_ticks[:-1]\n", + " if chunksize is None:\n", + " chunksize = next_pow2(m)\n", " \n", - " out = np.empty((n_chunks, 2), dtype=np.int64)\n", - " out[:, 0] = start_indices\n", - " out[:, 1] = stop_indices\n", - " out[-1, 0] = 0 # the most-left chunk starts at 0.\n", - " \n", - " return out\n", - "\n", - "\n", - "# test \n", - "for i in range(8, 65):\n", - " for m in range(3, 8):\n", - " ref = naive_get_range_damp(i, m)\n", - " cmp = _get_range_damp(i, m)\n", - " np.testing.assert_almost_equal(ref, cmp)\n", - " \n", - " chunksize_init = next_pow2(2 * m)\n", - " ref = naive_get_range_damp(i, m, chunksize_init)\n", - " cmp = _get_range_damp(i, m, chunksize_init)\n", - " np.testing.assert_almost_equal(ref, cmp)\n", + " lst = []\n", + " chunk_stop = stop\n", + " while start < chunk_stop:\n", + " chunk_start = max(chunk_stop - chunksize, start)\n", + " lst.append([chunk_start, chunk_stop])\n", " \n", - " chunksize_init = 1\n", - " ref = naive_get_range_damp(i, m, chunksize_init)\n", - " cmp = _get_range_damp(i, m, chunksize_init)\n", - " np.testing.assert_almost_equal(ref, cmp)" + " chunk_stop = chunk_start + m\n", + " chunksize = 2 * chunksize\n", + " \n", + " return np.array(lst)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 2, "id": "79663447", "metadata": {}, "outputs": [], @@ -274,6 +197,7 @@ " Σ_T, \n", " T_subseq_isconstant, \n", " bsf,\n", + " chunks_indices=None,\n", "):\n", " \"\"\"\n", " Compute the (approximate) left matrix profile value that corresponds to the subsequence \n", @@ -312,35 +236,33 @@ " bsf : float\n", " The best-so-far discord distance \n", " \"\"\"\n", - " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", - " chunksize = next_pow2(m) \n", + " if chunks_indices is None:\n", + " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", + " chunks_indices = naive_get_range_damp(stop=idx + m - excl_zone, start=0)\n", " \n", " nn_distance = np.inf\n", - " for (start, stop) in _get_range_damp(query_idx - excl_zone, m, chunksize):\n", - " # (start, stop) contains the \"start\" index of subsequences that should \n", - " # be compared against the query `Q = T[query_idx : query_idx + m]`.\n", + " for (start, stop) in chunks_indices:\n", " QT = core.sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", - " T[start : stop - 1 + m],\n", + " T[start : stop],\n", " )\n", " D = core._mass(\n", " T[query_idx : query_idx + m],\n", - " T[start : stop - 1 + m],\n", + " T[start : stop],\n", " QT=QT,\n", " μ_Q=M_T[query_idx],\n", " σ_Q=Σ_T[query_idx],\n", - " M_T=M_T[start : stop],\n", - " Σ_T=Σ_T[start : stop],\n", + " M_T=M_T[start : stop - m + 1],\n", + " Σ_T=Σ_T[start : stop - m + 1],\n", " Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n", - " T_subseq_isconstant=T_subseq_isconstant[start : stop],\n", + " T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],\n", " )\n", " \n", " nn_distance = min(nn_distance, np.min(D))\n", " if nn_distance < bsf:\n", " break\n", " \n", - " else:\n", - " bsf = nn_distance\n", + " bsf = max(bsf, nn_distance)\n", " \n", " return nn_distance, bsf" ] From d7d13149f8418e3974c5ced88ebac31a843c7a6b Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sat, 18 Nov 2023 23:14:15 -0500 Subject: [PATCH 07/17] improve docstring --- docs/Tutorial_DAMP.ipynb | 39 ++++++++++++++++++++++++++++++--------- 1 file changed, 30 insertions(+), 9 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index cd3d4bb49..4a5c694d4 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -161,25 +161,44 @@ "metadata": {}, "outputs": [], "source": [ - "def naive_get_range_damp(stop, m, chunksize=None, start=0):\n", + "def naive_get_range_damp(stop, m, start=0):\n", " \"\"\"\n", - " For the given indices range `(start, stop)`, returns the\n", - " (start, stop) of chunks, each with a length that is power\n", - " of two.\n", + " For the given range `(start, stop)`, segmentize it into \n", + " chunks, each with a length of power of two. Each chunk is\n", + " preceded by one that is twice in size.\n", + " \n", + " Parameters\n", + " ----------\n", + " stop : int\n", + " The stop index\n", + " \n", + " m : int\n", + " The window size of subsequences\n", + " \n", + " start : int, default 0\n", + " The start index\n", + " \n", + " Returns\n", + " -------\n", + " out : numpy.ndarray\n", + " A 2D numpy array, where the i-th row shows the \n", + " range (start, stop) of the i-th chunk.\n", " \"\"\"\n", - " if chunksize is None:\n", - " chunksize = next_pow2(m)\n", + " chunksize = next_pow2(m)\n", " \n", - " lst = []\n", + " out = []\n", " chunk_stop = stop\n", " while start < chunk_stop:\n", " chunk_start = max(chunk_stop - chunksize, start)\n", - " lst.append([chunk_start, chunk_stop])\n", + " chunks_range.append([chunk_start, chunk_stop])\n", " \n", + " # The chunk_stop of the preceding chunk\n", + " # is `chunk_start` but shifted by `m` to \n", + " # cover subsequences that are\n", " chunk_stop = chunk_start + m\n", " chunksize = 2 * chunksize\n", " \n", - " return np.array(lst)" + " return np.array(out)" ] }, { @@ -240,6 +259,8 @@ " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunks_indices = naive_get_range_damp(stop=idx + m - excl_zone, start=0)\n", " \n", + " # compare the query `T[query_idx : query_idx + m]` against its\n", + " # left neighbors chunk-by-chunk, in backward manner.\n", " nn_distance = np.inf\n", " for (start, stop) in chunks_indices:\n", " QT = core.sliding_dot_product(\n", From 80f214e4d178334db47596c15b5410e9734316d5 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sat, 18 Nov 2023 23:29:37 -0500 Subject: [PATCH 08/17] minor fix in code and docstring --- docs/Tutorial_DAMP.ipynb | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 4a5c694d4..a26ff2478 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -190,7 +190,7 @@ " chunk_stop = stop\n", " while start < chunk_stop:\n", " chunk_start = max(chunk_stop - chunksize, start)\n", - " chunks_range.append([chunk_start, chunk_stop])\n", + " out.append([chunk_start, chunk_stop])\n", " \n", " # The chunk_stop of the preceding chunk\n", " # is `chunk_start` but shifted by `m` to \n", @@ -216,7 +216,7 @@ " Σ_T, \n", " T_subseq_isconstant, \n", " bsf,\n", - " chunks_indices=None,\n", + " chunks_range=None,\n", "):\n", " \"\"\"\n", " Compute the (approximate) left matrix profile value that corresponds to the subsequence \n", @@ -246,6 +246,10 @@ " bsf : float\n", " The best-so-far discord distance\n", " \n", + " chunks_range : numpy.ndarray, default None\n", + " A 2D numpy array consisting of rows, each represents\n", + " the (start, stop) range of a chunk\n", + " \n", " Returns\n", " -------\n", " distance : float\n", @@ -255,14 +259,15 @@ " bsf : float\n", " The best-so-far discord distance \n", " \"\"\"\n", - " if chunks_indices is None:\n", + " if chunks_range is None:\n", " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", - " chunks_indices = naive_get_range_damp(stop=idx + m - excl_zone, start=0)\n", + " chunks_range = naive_get_range_damp(idx - excl_zone + m, m, start=0)\n", " \n", - " # compare the query `T[query_idx : query_idx + m]` against its\n", - " # left neighbors chunk-by-chunk, in backward manner.\n", + " # iterating through chunks in backward manner, against\n", + " # which the query `T[query_idx : query_idx + m]` is\n", + " # compared with.\n", " nn_distance = np.inf\n", - " for (start, stop) in chunks_indices:\n", + " for (start, stop) in chunks_range:\n", " QT = core.sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", " T[start : stop],\n", @@ -282,7 +287,8 @@ " nn_distance = min(nn_distance, np.min(D))\n", " if nn_distance < bsf:\n", " break\n", - " \n", + " \n", + " # update best-so-far discord distance\n", " bsf = max(bsf, nn_distance)\n", " \n", " return nn_distance, bsf" From d13015941f395ea0c794cbe7f07e081bbe620c40 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 19 Nov 2023 01:43:16 -0500 Subject: [PATCH 09/17] add functions to complete DAMP --- docs/Tutorial_DAMP.ipynb | 256 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 249 insertions(+), 7 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index a26ff2478..02d4e5954 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 16, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 17, "id": "409dad09", "metadata": {}, "outputs": [], @@ -132,7 +132,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 18, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -156,7 +156,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 34, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -192,6 +192,9 @@ " chunk_start = max(chunk_stop - chunksize, start)\n", " out.append([chunk_start, chunk_stop])\n", " \n", + " if chunk_start == start:\n", + " break\n", + " \n", " # The chunk_stop of the preceding chunk\n", " # is `chunk_start` but shifted by `m` to \n", " # cover subsequences that are\n", @@ -203,7 +206,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 35, "id": "79663447", "metadata": {}, "outputs": [], @@ -212,6 +215,7 @@ " T, \n", " m, \n", " query_idx, \n", + " excl_zone,\n", " M_T, \n", " Σ_T, \n", " T_subseq_isconstant, \n", @@ -233,6 +237,11 @@ " query_idx : int\n", " The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n", " \n", + " excl_zone : int\n", + " The exclusion size for which the neighbors of a subsequence is being ignored.\n", + " For the subsequence that starts at `s`, the range (s - excl_zone, s + excl_zone + 1)`\n", + " is ignored.\n", + " \n", " M_T : np.ndarray\n", " The sliding mean of `T`\n", " \n", @@ -260,7 +269,6 @@ " The best-so-far discord distance \n", " \"\"\"\n", " if chunks_range is None:\n", - " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunks_range = naive_get_range_damp(idx - excl_zone + m, m, start=0)\n", " \n", " # iterating through chunks in backward manner, against\n", @@ -296,10 +304,244 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 41, "id": "c76317b5", "metadata": {}, "outputs": [], + "source": [ + "def _foreward_process(\n", + " T, \n", + " m, \n", + " excl_zone,\n", + " M_T, \n", + " Σ_T, \n", + " T_subseq_isconstant, \n", + " query_idx, \n", + " PL,\n", + "):\n", + " \"\"\"\n", + " Update the (approximate) left matrix profile `PL` in place by checking the subsequences\n", + " that appear after `query_idx` in the data `T`, and their distance to `T[query_idx : query_idx + m]`\n", + " is less than `bsf`.\n", + " \n", + " Paramaters\n", + " ----------\n", + " T : numpy.ndarray\n", + " The time series or sequence for which the top discord will be computed.\n", + " \n", + " m : int\n", + " Window size\n", + " \n", + " excl_zone : int\n", + " The exclusion size for which the neighbors of a subsequence is being ignored.\n", + " For the subsequence that starts at `s`, the range (s - excl_zone, s + excl_zone + 1)`\n", + " is ignored.\n", + " \n", + " M_T : numpy.ndarray\n", + " The sliding mean\n", + " \n", + " Σ_T : numpy.ndarray\n", + " The sliding standard deviation\n", + " \n", + " T_subseq_isconstant : numpy.ndarray\n", + " A numpy boolean array whose i-th element indicates whether the subsequence\n", + " `T[i : i+m]` is constant (True)\n", + " \n", + " query_idx : int\n", + " The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n", + " \n", + " PL : numpy.ndarray\n", + " A 1D numpy array that contains approximate values of left matrix profile.\n", + " \n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\" \n", + " start = query_idx + excl_zone + 1 \n", + " \n", + " chunksize = next_pow2(m)\n", + " stop = min(start + chunksize, len(T))\n", + " if stop - start >= m:\n", + " QT = core.sliding_dot_product(T[query_idx : query_idx + m], T[start : stop])\n", + " D = core._mass(\n", + " T[query_idx : query_idx + m],\n", + " T[start : stop],\n", + " QT=QT,\n", + " μ_Q=M_T[query_idx],\n", + " σ_Q=Σ_T[query_idx],\n", + " M_T=M_T[start : stop - m + 1],\n", + " Σ_T=Σ_T[start : stop - m + 1],\n", + " Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n", + " T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],\n", + " )\n", + " \n", + " PL[start : stop - m + 1] = np.minimum(PL[start : stop - m + 1], D)\n", + "\n", + " return" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "769bddad", + "metadata": {}, + "outputs": [], + "source": [ + "def DAMP(T, m, split_idx): \n", + " \"\"\"\n", + " Compute the approximate left matrix profile for `T`. The global maximum\n", + " of this approximate profile is exact.\n", + " \n", + " Parameters\n", + " ----------\n", + " T : numpy.ndarray\n", + " A time series\n", + " \n", + " m : int\n", + " Window size\n", + " \n", + " split_idx : int\n", + " location of split point between train and test. The data `T[:split_idx]`\n", + " is considered as train and the remaining, i.e. `T[split_idx:]` is test.\n", + " \n", + " Returns\n", + " -------\n", + " PL : numpy.ndarry\n", + " The approximate left matrix profile. The finite global maximum of `PL` is exact,\n", + " and its corresponding index indicates the location of discord.\n", + " \n", + " discord_dist : float\n", + " The distance between the discord and its nearest neighbor\n", + " \n", + " discord_idx : int\n", + " The start index of discord.\n", + " \"\"\"\n", + " T, M_T, Σ_T, T_subseq_isconstant = core.preprocess(T, m) \n", + " l = len(T) - m + 1\n", + " \n", + " bsf = np.NINF\n", + " PL = np.full(l, np.inf, dtype=np.float64) \n", + " \n", + " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", + " chunks_range = naive_get_range_damp(split_idx - excl_zone + m, m, start=0)\n", + " n_chunks = len(chunks_range)\n", + " next_expected_chunk_size = next_pow2(m) * (2 ** (n_chunks - 1)) \n", + " for i in range(split_idx, l):\n", + " if PL[i] < bsf:\n", + " continue\n", + " PL[i], bsf = _backward_process(T, m, i, excl_zone, M_T, Σ_T, T_subseq_isconstant, bsf, chunks_range=chunks_range)\n", + " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, PL)\n", + " \n", + " # update chunks_range for next iteration by shifting (start, stop)\n", + " # of chunks by one. \n", + " chunks_range = chunks_range + 1 \n", + " if chunks_range[-1][1] - chunks_range[-1][0] < next_expected_chunk_size:\n", + " chunks_range[-1][0] = 0\n", + " else:\n", + " new_chunk = np.array([\n", + " [0, chunks_range[-1][0]]\n", + " ])\n", + " chunks_range = np.append(chunks_range, new_chunk, axis=0)\n", + " \n", + " P = np.where(PL==np.inf, np.NINF, PL)\n", + " discord_idx = np.argmax(P)\n", + " discord_dist = P[discord_idx]\n", + " if discord_dist == np.NINF:\n", + " discord_idx = -1\n", + " \n", + " return PL, discord_dist, discord_idx" + ] + }, + { + "cell_type": "markdown", + "id": "f67cd2b9", + "metadata": {}, + "source": [ + "## Example" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "2cfbc771", + "metadata": {}, + "outputs": [], + "source": [ + "seed = 100\n", + "np.random.seed(seed)\n", + "\n", + "T = np.random.rand(100000)\n", + "m = 50\n", + "split_index = 200" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "a935f31f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 8.500883427933502\n", + "discord_index: 209\n", + "running time [sec]: 8.424577951431274\n" + ] + } + ], + "source": [ + "naive_DAMP(T, m, split_index) # dummy run\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1.0\n", + "PL, discord_dist, discord_index = naive_DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_dist)\n", + "print('discord_index: ', discord_index)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "2b461592", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 8.500883427933504\n", + "discord_index: 209\n", + "running time [sec]: 2.302816152572632\n" + ] + } + ], + "source": [ + "DAMP(T, m, split_index) # dummpy run\n", + "\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1\n", + "PL, discord_score, discord_idx = DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_score)\n", + "print('discord_index: ', discord_idx)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0799d04", + "metadata": {}, + "outputs": [], "source": [] } ], From d3d7e7de16f88f08a00d8a0340ba4e73dbf092ba Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 19 Nov 2023 14:20:05 -0500 Subject: [PATCH 10/17] minor enhancements --- docs/Tutorial_DAMP.ipynb | 130 ++++++++++++++++++++++----------------- 1 file changed, 75 insertions(+), 55 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 02d4e5954..96fe4e12a 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -15,7 +15,7 @@ "source": [ "Authors in [DAMP](https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf) presented a method for discord detection that is scalable and it can be used in offline and online mode.\n", "\n", - "To better understand the mechanism behind this method, we should first understand the difference between the full matrix profile and the left matrix profile of a time series `T`. For a subsequence with length `m`, and the start index `i`, i.e. `S_i = T[i:i+m]`, there are two groups of neighbors, known as left and right neighbors. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`. And, the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, considering both left and right ones. However, in the left matrix profile, the `i`-th element is the minimum distance between the subsequence `S_i` and its left neighbors.\n", + "To better understand the mechanism behind this method, we should first understand the difference between the full matrix profile and the left matrix profile of a time series `T`. For a subsequence with length `m`, and the start index `i`, i.e. `S_i = T[i:i+m]`, there are two groups of neighbors, known as left and right neighbors. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`. And, the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, considering both left and right ones. However, in the left matrix profile, the `i`-th element is the minimum distance between the subsequence `S_i` and its left neighbors. In both cases, it is recommended to avoid considering subsequences that are close to `S_i`, known as trivial neighbors.\n", "\n", "One can use either the full matrix profile or the left matrix profile to find the top discord, a subsequence whose distance to its nearest neighbor is larger than the distance of any other subsequences to their nearest neighbors. However, using full matrix profile for detecting discords might result in missing the case where there are two rare subsequences that happen to also be similar to each other (a case that is known as \"twin freak\"). On the other hand, the left matrix profile resolves this problem by capturing the discord at its first occurance. Hence, even if there are two or more of such discords, we can still capture the first occurance by using the left matrix profile." ] @@ -25,7 +25,7 @@ "id": "27bf47eb", "metadata": {}, "source": [ - "The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of `split_idx` is problem dependent. If `split_idx` is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n", + "The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of `split_idx` is problem dependent. If `split_idx` is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If `split_idx` is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the importance of choosing proper `split_idx`.\n", "\n", "(1) `split_idx = 0`: In this case, the first subsequence can be a discord itself as it is a \"new\" pattern.
\n", "(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all previous subsequences!\n", @@ -43,17 +43,17 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 2, "id": "c9564cff", "metadata": {}, "outputs": [], "source": [ "import math\n", + "import time\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import stumpy\n", - "import time\n", "\n", "from stumpy import core\n", "from scipy.io import loadmat" @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -92,7 +92,7 @@ " Returns\n", " -------\n", " PL : numpy.ndarry\n", - " The [exact] left matrix profile. All infinite distances are ingored in computing\n", + " The (exact) left matrix profile. All infinite distances are ingored when computing\n", " the discord.\n", " \n", " discord_dist : float\n", @@ -113,10 +113,13 @@ " if nn_i >= 0:\n", " PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))\n", " \n", - " PL_modified = np.where(PL==np.inf, np.NINF, PL)\n", - " discord_idx = np.argmax(PL_modified)\n", - " discord_dist = PL_modified[discord_idx]\n", + " # finding the max of PL, while ignoring infinite values\n", + " discord_idx = np.argmax(\n", + " np.where(PL==np.inf, np.NINF, PL)\n", + " )\n", + " discord_dist = PL[discord_idx]\n", " if discord_dist == np.NINF:\n", + " discord_dist = np.inf\n", " discord_idx = -1\n", " \n", " return PL, discord_dist, discord_idx" @@ -132,7 +135,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 4, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -156,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 5, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -165,7 +168,8 @@ " \"\"\"\n", " For the given range `(start, stop)`, segmentize it into \n", " chunks, each with a length of power of two. Each chunk is\n", - " preceded by one that is twice in size.\n", + " preceded by one that is twice in size. The last chunk (i.e.\n", + " the left-most chunk) starts at `start`.\n", " \n", " Parameters\n", " ----------\n", @@ -173,7 +177,7 @@ " The stop index\n", " \n", " m : int\n", - " The window size of subsequences\n", + " Window size\n", " \n", " start : int, default 0\n", " The start index\n", @@ -182,7 +186,8 @@ " -------\n", " out : numpy.ndarray\n", " A 2D numpy array, where the i-th row shows the \n", - " range (start, stop) of the i-th chunk.\n", + " range (start, stop) of the i-th chunk. out[0][1]\n", + " must be `stop` and out[-1][0] must be `start`. \n", " \"\"\"\n", " chunksize = next_pow2(m)\n", " \n", @@ -197,7 +202,8 @@ " \n", " # The chunk_stop of the preceding chunk\n", " # is `chunk_start` but shifted by `m` to \n", - " # cover subsequences that are\n", + " # include new subsequences that end after\n", + " # current `chunk_start`.\n", " chunk_stop = chunk_start + m\n", " chunksize = 2 * chunksize\n", " \n", @@ -206,7 +212,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 6, "id": "79663447", "metadata": {}, "outputs": [], @@ -223,8 +229,8 @@ " chunks_range=None,\n", "):\n", " \"\"\"\n", - " Compute the (approximate) left matrix profile value that corresponds to the subsequence \n", - " `T[query_idx:query_idx+m]` and update the best-so-far discord distance.\n", + " Compute the (approximate) distance between the subsequence `T[query_idx:query_idx+m]`\n", + " and its nearest neighbor, and update the best-so-far discord distance. \n", " \n", " Parameters\n", " ----------\n", @@ -238,9 +244,7 @@ " The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n", " \n", " excl_zone : int\n", - " The exclusion size for which the neighbors of a subsequence is being ignored.\n", - " For the subsequence that starts at `s`, the range (s - excl_zone, s + excl_zone + 1)`\n", - " is ignored.\n", + " Size of the exclusion zone\n", " \n", " M_T : np.ndarray\n", " The sliding mean of `T`\n", @@ -294,9 +298,9 @@ " \n", " nn_distance = min(nn_distance, np.min(D))\n", " if nn_distance < bsf:\n", - " break\n", + " break # no need to explore the remaining neighbors\n", " \n", - " # update best-so-far discord distance\n", + " # update the best-so-far discord distance\n", " bsf = max(bsf, nn_distance)\n", " \n", " return nn_distance, bsf" @@ -304,7 +308,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 12, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -316,26 +320,24 @@ " M_T, \n", " Σ_T, \n", " T_subseq_isconstant, \n", - " query_idx, \n", + " query_idx,\n", + " lookahead,\n", " PL,\n", "):\n", " \"\"\"\n", - " Update the (approximate) left matrix profile `PL` in place by checking the subsequences\n", - " that appear after `query_idx` in the data `T`, and their distance to `T[query_idx : query_idx + m]`\n", - " is less than `bsf`.\n", + " Update the (approximate) left matrix profile `PL` in place by checking the right neighbors of\n", + " `T[query_idx : query_idx + m]`\n", " \n", " Paramaters\n", " ----------\n", " T : numpy.ndarray\n", - " The time series or sequence for which the top discord will be computed.\n", + " A time series\n", " \n", " m : int\n", " Window size\n", " \n", " excl_zone : int\n", - " The exclusion size for which the neighbors of a subsequence is being ignored.\n", - " For the subsequence that starts at `s`, the range (s - excl_zone, s + excl_zone + 1)`\n", - " is ignored.\n", + " Size of the exclusion zone\n", " \n", " M_T : numpy.ndarray\n", " The sliding mean\n", @@ -350,17 +352,20 @@ " query_idx : int\n", " The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n", " \n", + " lookahead : int\n", + " The size of chunk whose subsequences are compared with the query.\n", + " \n", " PL : numpy.ndarray\n", - " A 1D numpy array that contains approximate values of left matrix profile.\n", + " A 1D numpy array that contains the (approximate) values of\n", + " the left matrix profile.\n", " \n", " Returns\n", " -------\n", " None\n", " \"\"\" \n", - " start = query_idx + excl_zone + 1 \n", + " start = query_idx + excl_zone + 1\n", + " stop = min(start + lookahead, len(T))\n", " \n", - " chunksize = next_pow2(m)\n", - " stop = min(start + chunksize, len(T))\n", " if stop - start >= m:\n", " QT = core.sliding_dot_product(T[query_idx : query_idx + m], T[start : stop])\n", " D = core._mass(\n", @@ -382,7 +387,7 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 14, "id": "769bddad", "metadata": {}, "outputs": [], @@ -395,48 +400,62 @@ " Parameters\n", " ----------\n", " T : numpy.ndarray\n", - " A time series\n", + " A time series of interest\n", " \n", " m : int\n", " Window size\n", " \n", " split_idx : int\n", - " location of split point between train and test. The data `T[:split_idx]`\n", - " is considered as train and the remaining, i.e. `T[split_idx:]` is test.\n", + " The location of split point between train and test. The data `T[:split_idx]`\n", + " is considered as the train set and the remaining, i.e. `T[split_idx:]` is test.\n", " \n", " Returns\n", " -------\n", " PL : numpy.ndarry\n", - " The approximate left matrix profile. The finite global maximum of `PL` is exact,\n", + " The (approximate) left matrix profile. The finite global maximum of `PL` is exact,\n", " and its corresponding index indicates the location of discord.\n", " \n", " discord_dist : float\n", - " The distance between the discord and its nearest neighbor\n", + " The distance between the discord and its nearest neighbor.\n", " \n", " discord_idx : int\n", " The start index of discord.\n", " \"\"\"\n", - " T, M_T, Σ_T, T_subseq_isconstant = core.preprocess(T, m) \n", + " T, M_T, Σ_T, T_subseq_isconstant = core.preprocess(T, m)\n", + " \n", " l = len(T) - m + 1\n", - " \n", - " bsf = np.NINF\n", " PL = np.full(l, np.inf, dtype=np.float64) \n", " \n", " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunks_range = naive_get_range_damp(split_idx - excl_zone + m, m, start=0)\n", " n_chunks = len(chunks_range)\n", - " next_expected_chunk_size = next_pow2(m) * (2 ** (n_chunks - 1)) \n", + " next_expected_chunk_size = next_pow2(m) * (2 ** (n_chunks - 1))\n", + " \n", + " bsf = np.NINF # initializing best-so-far discord distance\n", + " lookahead = next_pow2(m) # use for forward processing\n", " for i in range(split_idx, l):\n", " if PL[i] < bsf:\n", " continue\n", - " PL[i], bsf = _backward_process(T, m, i, excl_zone, M_T, Σ_T, T_subseq_isconstant, bsf, chunks_range=chunks_range)\n", - " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, PL)\n", + " \n", + " PL[i], bsf = _backward_process(\n", + " T, \n", + " m, \n", + " i, \n", + " excl_zone, \n", + " M_T, \n", + " Σ_T, \n", + " T_subseq_isconstant, \n", + " bsf, \n", + " chunks_range=chunks_range\n", + " )\n", + " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", " \n", " # update chunks_range for next iteration by shifting (start, stop)\n", " # of chunks by one. \n", " chunks_range = chunks_range + 1 \n", - " if chunks_range[-1][1] - chunks_range[-1][0] < next_expected_chunk_size:\n", - " chunks_range[-1][0] = 0\n", + " last_chunk_size = chunks_range[-1][1] - chunks_range[-1][0]\n", + " if last_chunk_size < next_expected_chunk_size:\n", + " chunks_range[-1][0] = 0 # reset the start index of last chunk to 0.\n", " else:\n", " new_chunk = np.array([\n", " [0, chunks_range[-1][0]]\n", @@ -447,6 +466,7 @@ " discord_idx = np.argmax(P)\n", " discord_dist = P[discord_idx]\n", " if discord_dist == np.NINF:\n", + " discord_dist = np.inf\n", " discord_idx = -1\n", " \n", " return PL, discord_dist, discord_idx" @@ -462,7 +482,7 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 15, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -477,7 +497,7 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 16, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -487,7 +507,7 @@ "text": [ "discord_dist: 8.500883427933502\n", "discord_index: 209\n", - "running time [sec]: 8.424577951431274\n" + "running time [sec]: 8.424530267715454\n" ] } ], @@ -507,7 +527,7 @@ }, { "cell_type": "code", - "execution_count": 54, + "execution_count": 17, "id": "2b461592", "metadata": {}, "outputs": [ @@ -517,7 +537,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.302816152572632\n" + "running time [sec]: 2.2287747859954834\n" ] } ], From 50c6709a30b6bcd438ee97ffd657c14b3b2d7fd6 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 19 Nov 2023 16:20:20 -0500 Subject: [PATCH 11/17] enhance code readability --- docs/Tutorial_DAMP.ipynb | 55 ++++++++++++++++++++++------------------ 1 file changed, 30 insertions(+), 25 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 96fe4e12a..80d0941d4 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 4, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "id": "409dad09", "metadata": {}, "outputs": [], @@ -135,7 +135,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -159,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -212,7 +212,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "id": "79663447", "metadata": {}, "outputs": [], @@ -308,7 +308,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 9, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -387,7 +387,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 17, "id": "769bddad", "metadata": {}, "outputs": [], @@ -428,14 +428,18 @@ " \n", " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunks_range = naive_get_range_damp(split_idx - excl_zone + m, m, start=0)\n", - " n_chunks = len(chunks_range)\n", - " next_expected_chunk_size = next_pow2(m) * (2 ** (n_chunks - 1))\n", " \n", - " bsf = np.NINF # initializing best-so-far discord distance\n", - " lookahead = next_pow2(m) # use for forward processing\n", + " # Compute the last chunk size, and the expected size of the last chunk. The latter\n", + " # is basically the `n_chunks`-th item in a geometric series with coeficient of\n", + " # `next_pow2(m)` and the ratio of `2`.\n", + " last_chunk_size = chunks_range[-1][1] - chunks_range[-1][0]\n", + " expected_last_chunk_size = next_pow2(m) * (2 ** (len(chunks_range) - 1)) \n", + " \n", + " bsf = np.NINF \n", + " lookahead = next_pow2(m)\n", " for i in range(split_idx, l):\n", " if PL[i] < bsf:\n", - " continue\n", + " continue \n", " \n", " PL[i], bsf = _backward_process(\n", " T, \n", @@ -450,18 +454,19 @@ " )\n", " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", " \n", - " # update chunks_range for next iteration by shifting (start, stop)\n", - " # of chunks by one. \n", - " chunks_range = chunks_range + 1 \n", - " last_chunk_size = chunks_range[-1][1] - chunks_range[-1][0]\n", - " if last_chunk_size < next_expected_chunk_size:\n", - " chunks_range[-1][0] = 0 # reset the start index of last chunk to 0.\n", + " # update chunks_range\n", + " chunks_range = chunks_range + 1 \n", + " if last_chunk_size < expected_last_chunk_size:\n", + " chunks_range[-1][0] = 0 \n", + " last_chunk_size += 1\n", " else:\n", " new_chunk = np.array([\n", - " [0, chunks_range[-1][0]]\n", + " [0, chunks_range[-1][0] + m]\n", " ])\n", " chunks_range = np.append(chunks_range, new_chunk, axis=0)\n", - " \n", + " expected_last_chunk_size = 2 * expected_last_chunk_size\n", + " last_chunk_size = 1 \n", + " \n", " P = np.where(PL==np.inf, np.NINF, PL)\n", " discord_idx = np.argmax(P)\n", " discord_dist = P[discord_idx]\n", @@ -482,7 +487,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 18, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -497,7 +502,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 19, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -507,7 +512,7 @@ "text": [ "discord_dist: 8.500883427933502\n", "discord_index: 209\n", - "running time [sec]: 8.424530267715454\n" + "running time [sec]: 8.759973049163818\n" ] } ], @@ -527,7 +532,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 20, "id": "2b461592", "metadata": {}, "outputs": [ @@ -537,7 +542,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.2287747859954834\n" + "running time [sec]: 2.2458388805389404\n" ] } ], From 633cce8c6a57be43652c1f2c0bcd963821473845 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 19 Nov 2023 16:39:25 -0500 Subject: [PATCH 12/17] Revise comments --- docs/Tutorial_DAMP.ipynb | 38 ++++++++++++++++++-------------------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 80d0941d4..91d39810f 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 2, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -135,7 +135,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 4, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -159,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 5, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -212,7 +212,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 6, "id": "79663447", "metadata": {}, "outputs": [], @@ -265,19 +265,18 @@ " \n", " Returns\n", " -------\n", - " distance : float\n", - " The [approximate] left matrix profile value that corresponds to \n", + " nn_distance : float\n", + " The (approximate) left matrix profile value that corresponds to \n", " the query, `T[query_idx : query_idx + m]`.\n", " \n", " bsf : float\n", " The best-so-far discord distance \n", " \"\"\"\n", " if chunks_range is None:\n", - " chunks_range = naive_get_range_damp(idx - excl_zone + m, m, start=0)\n", + " # The avoid the trivial (left) neighbors of `T[query_idx : query_idx+m]`,\n", + " # the stop index of non-trivial subsequences is `query_idx - excl_zone + m`\n", + " chunks_range = naive_get_range_damp(query_idx - excl_zone + m, m, start=0)\n", " \n", - " # iterating through chunks in backward manner, against\n", - " # which the query `T[query_idx : query_idx + m]` is\n", - " # compared with.\n", " nn_distance = np.inf\n", " for (start, stop) in chunks_range:\n", " QT = core.sliding_dot_product(\n", @@ -298,9 +297,8 @@ " \n", " nn_distance = min(nn_distance, np.min(D))\n", " if nn_distance < bsf:\n", - " break # no need to explore the remaining neighbors\n", + " break\n", " \n", - " # update the best-so-far discord distance\n", " bsf = max(bsf, nn_distance)\n", " \n", " return nn_distance, bsf" @@ -308,7 +306,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 7, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -387,7 +385,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 8, "id": "769bddad", "metadata": {}, "outputs": [], @@ -487,7 +485,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 9, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -502,7 +500,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 10, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -512,7 +510,7 @@ "text": [ "discord_dist: 8.500883427933502\n", "discord_index: 209\n", - "running time [sec]: 8.759973049163818\n" + "running time [sec]: 8.726994037628174\n" ] } ], @@ -532,7 +530,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 11, "id": "2b461592", "metadata": {}, "outputs": [ @@ -542,7 +540,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.2458388805389404\n" + "running time [sec]: 2.205477237701416\n" ] } ], From c112477d1ca0ae8e4a8cb1ddaa61d334ba11cd6b Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 26 Nov 2023 00:52:52 -0500 Subject: [PATCH 13/17] add performant func and minor enhancements --- docs/Tutorial_DAMP.ipynb | 269 ++++++++++++++++++++++++++++----------- 1 file changed, 195 insertions(+), 74 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 91d39810f..ba7066298 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "409dad09", "metadata": {}, "outputs": [], @@ -105,21 +105,14 @@ " \"\"\"\n", " mp = stumpy.stump(T, m)\n", " IL = mp[:, 2].astype(np.int64)\n", + " PL = core._idx_to_mp(IL, T, m, check_neg=False)\n", + " \n", + " # ignore the values upto `split_idx`\n", + " PL[:split_idx] = np.inf\n", " \n", - " k = len(T) - m + 1 # len(IL)\n", - " PL = np.full(k, np.inf, dtype=np.float64)\n", - " for i in range(split_idx, k):\n", - " nn_i = IL[i]\n", - " if nn_i >= 0:\n", - " PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))\n", - " \n", - " # finding the max of PL, while ignoring infinite values\n", - " discord_idx = np.argmax(\n", - " np.where(PL==np.inf, np.NINF, PL)\n", - " )\n", + " discord_idx = np.where(np.isinf(PL), np.NINF, PL).argmax()\n", " discord_dist = PL[discord_idx]\n", - " if discord_dist == np.NINF:\n", - " discord_dist = np.inf\n", + " if discord_dist == np.inf:\n", " discord_idx = -1\n", " \n", " return PL, discord_dist, discord_idx" @@ -135,18 +128,18 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "id": "c21c4587", "metadata": {}, "outputs": [], "source": [ - "def next_pow2(v):\n", + "def next_pow2(val):\n", " \"\"\"\n", " Compute the smallest \"power of two\" number that is greater than/ equal to `v`\n", " \n", " Parameters\n", " ----------\n", - " v : float\n", + " val : float\n", " A real positive value\n", " \n", " Returns\n", @@ -154,22 +147,43 @@ " out : int\n", " An integer value that is power of two, and satisfies `out >= v`\n", " \"\"\"\n", - " return int(math.pow(2, math.ceil(math.log2(v))))" + " return int(math.pow(2, math.ceil(math.log2(val))))" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "id": "dd1f7b3d", "metadata": {}, "outputs": [], "source": [ - "def naive_get_range_damp(stop, m, start=0):\n", + "def naive_get_damp_range(stop, m, start=0):\n", + " \"\"\"\n", + " naive version of the function `get_damp_range`\n", + " \"\"\"\n", + " out = []\n", + " \n", + " chunksize = next_pow2(m)\n", + " chunk_stop = stop\n", + " while start < chunk_stop:\n", + " chunk_start = max(chunk_stop - chunksize, start)\n", + " out.append([chunk_start, chunk_stop])\n", + " \n", + " if chunk_start == start:\n", + " break\n", + " \n", + " chunk_stop = chunk_start + m - 1\n", + " chunksize = 2 * chunksize\n", + " \n", + " return np.array(out)\n", + "\n", + "\n", + "def get_damp_range(stop, m, start=0):\n", " \"\"\"\n", - " For the given range `(start, stop)`, segmentize it into \n", - " chunks, each with a length of power of two. Each chunk is\n", - " preceded by one that is twice in size. The last chunk (i.e.\n", - " the left-most chunk) starts at `start`.\n", + " For the given range `(start, stop)`, segment it into chunks,\n", + " each with a length of power of two. Each chunk is preceded \n", + " by one that is twice in size. The last chunk (i.e. the left-most \n", + " chunk) starts at index `start`.\n", " \n", " Parameters\n", " ----------\n", @@ -186,33 +200,63 @@ " -------\n", " out : numpy.ndarray\n", " A 2D numpy array, where the i-th row shows the \n", - " range (start, stop) of the i-th chunk. out[0][1]\n", - " must be `stop` and out[-1][0] must be `start`. \n", + " range (start, stop) of the i-th chunk. out[0, 1]\n", + " must be `stop` and out[-1, 0] must be `start`. \n", " \"\"\"\n", - " chunksize = next_pow2(m)\n", - " \n", " out = []\n", + " \n", + " n_lower = math.ceil(math.log2(next_pow2(m)))\n", + " n_upper = math.ceil(math.log2(stop + 2**n_lower)) + 1\n", " chunk_stop = stop\n", - " while start < chunk_stop:\n", - " chunk_start = max(chunk_stop - chunksize, start)\n", - " out.append([chunk_start, chunk_stop])\n", + " \n", + " for chunk_size in np.power(2, range(n_lower, n_upper)):\n", + " chunk_start = chunk_stop - chunk_size\n", + " out.append([max(chunk_start, start), chunk_stop])\n", + " chunk_stop = chunk_start + m - 1\n", " \n", - " if chunk_start == start:\n", - " break\n", + " if chunk_start <= 0:\n", + " break\n", " \n", - " # The chunk_stop of the preceding chunk\n", - " # is `chunk_start` but shifted by `m` to \n", - " # include new subsequences that end after\n", - " # current `chunk_start`.\n", - " chunk_stop = chunk_start + m\n", - " chunksize = 2 * chunksize\n", - " \n", " return np.array(out)" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, + "id": "5e45d72c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 92, 100],\n", + " [ 80, 96],\n", + " [ 52, 84],\n", + " [ 0, 56]])" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Example\n", + "chunks = get_damp_range(100, m=5)\n", + "chunks" + ] + }, + { + "cell_type": "markdown", + "id": "e4bfcc2f", + "metadata": {}, + "source": [ + "The lengths of chunks are: 8, 16, 32, and 56 (not 64 as the chunk must be truncated at `start=0`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, "id": "79663447", "metadata": {}, "outputs": [], @@ -306,7 +350,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -323,8 +367,8 @@ " PL,\n", "):\n", " \"\"\"\n", - " Update the (approximate) left matrix profile `PL` in place by checking the right neighbors of\n", - " `T[query_idx : query_idx + m]`\n", + " Update the (approximate) left matrix profile `PL` for a chunk of forthcoming subsequences\n", + " by computing their distance to their left neighbor T[query_idx : query_idx+m]\n", " \n", " Paramaters\n", " ----------\n", @@ -385,7 +429,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 14, "id": "769bddad", "metadata": {}, "outputs": [], @@ -425,13 +469,10 @@ " PL = np.full(l, np.inf, dtype=np.float64) \n", " \n", " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", - " chunks_range = naive_get_range_damp(split_idx - excl_zone + m, m, start=0)\n", + " chunks_range = get_damp_range(split_idx - excl_zone + m, m, start=0)\n", " \n", - " # Compute the last chunk size, and the expected size of the last chunk. The latter\n", - " # is basically the `n_chunks`-th item in a geometric series with coeficient of\n", - " # `next_pow2(m)` and the ratio of `2`.\n", - " last_chunk_size = chunks_range[-1][1] - chunks_range[-1][0]\n", - " expected_last_chunk_size = next_pow2(m) * (2 ** (len(chunks_range) - 1)) \n", + " last_chunk_size = chunks_range[-1, 1] - chunks_range[-1, 0]\n", + " last_chunk_size_cutoff = next_pow2(last_chunk_size)\n", " \n", " bsf = np.NINF \n", " lookahead = next_pow2(m)\n", @@ -452,24 +493,19 @@ " )\n", " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", " \n", - " # update chunks_range\n", " chunks_range = chunks_range + 1 \n", - " if last_chunk_size < expected_last_chunk_size:\n", - " chunks_range[-1][0] = 0 \n", + " # make a decision on whether to create new chunk or not for the subsequence [0, m]\n", + " if last_chunk_size < last_chunk_size_cutoff:\n", + " chunks_range[-1, 0] = 0 \n", " last_chunk_size += 1\n", " else:\n", - " new_chunk = np.array([\n", - " [0, chunks_range[-1][0] + m]\n", - " ])\n", - " chunks_range = np.append(chunks_range, new_chunk, axis=0)\n", - " expected_last_chunk_size = 2 * expected_last_chunk_size\n", - " last_chunk_size = 1 \n", + " chunks_range = np.append(chunks_range, np.array([[0, m]]), axis=0)\n", + " last_chunk_size = m\n", + " last_chunk_size_cutoff = 2 * last_chunk_size_cutoff\n", " \n", - " P = np.where(PL==np.inf, np.NINF, PL)\n", - " discord_idx = np.argmax(P)\n", - " discord_dist = P[discord_idx]\n", - " if discord_dist == np.NINF:\n", - " discord_dist = np.inf\n", + " discord_idx = np.where(np.isinf(PL), np.NINF, PL).argmax()\n", + " discord_dist = PL[discord_idx]\n", + " if discord_dist == np.inf:\n", " discord_idx = -1\n", " \n", " return PL, discord_dist, discord_idx" @@ -480,12 +516,12 @@ "id": "f67cd2b9", "metadata": {}, "source": [ - "## Example" + "## Example 1" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 15, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -500,7 +536,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 16, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -508,9 +544,9 @@ "name": "stdout", "output_type": "stream", "text": [ - "discord_dist: 8.500883427933502\n", + "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 8.726994037628174\n" + "running time [sec]: 4.833993911743164\n" ] } ], @@ -530,7 +566,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 17, "id": "2b461592", "metadata": {}, "outputs": [ @@ -540,7 +576,92 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.205477237701416\n" + "running time [sec]: 2.2122771739959717\n" + ] + } + ], + "source": [ + "DAMP(T, m, split_index) # dummpy run\n", + "\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1\n", + "PL, discord_score, discord_idx = DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_score)\n", + "print('discord_index: ', discord_idx)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "markdown", + "id": "90688aa9", + "metadata": {}, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "6d64f488", + "metadata": {}, + "outputs": [], + "source": [ + "seed = 100\n", + "np.random.seed(seed)\n", + "\n", + "T = np.random.rand(100000)\n", + "m = 500\n", + "split_index = 5000" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "159e03dc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 29.42331306408431\n", + "discord_index: 6110\n", + "running time [sec]: 5.537461042404175\n" + ] + } + ], + "source": [ + "naive_DAMP(T, m, split_index) # dummpy run\n", + "\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1\n", + "PL, discord_score, discord_idx = naive_DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_score)\n", + "print('discord_index: ', discord_idx)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "9ef28e10", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 29.423313064084333\n", + "discord_index: 6110\n", + "running time [sec]: 55.6033251285553\n" ] } ], @@ -562,7 +683,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b0799d04", + "id": "3ae28748", "metadata": {}, "outputs": [], "source": [] From 585fc4c25506c5aa632dcd493410e19abe4be487 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 26 Nov 2023 13:00:11 -0500 Subject: [PATCH 14/17] add new example --- docs/Tutorial_DAMP.ipynb | 125 ++++++++++++++++++++++++++++++++------- 1 file changed, 105 insertions(+), 20 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index ba7066298..d5f1e69d1 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -128,7 +128,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -152,7 +152,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -222,7 +222,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "id": "5e45d72c", "metadata": {}, "outputs": [ @@ -235,7 +235,7 @@ " [ 0, 56]])" ] }, - "execution_count": 7, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -256,7 +256,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "id": "79663447", "metadata": {}, "outputs": [], @@ -350,7 +350,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -429,7 +429,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 9, "id": "769bddad", "metadata": {}, "outputs": [], @@ -521,7 +521,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 10, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -536,7 +536,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 11, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -546,7 +546,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 4.833993911743164\n" + "running time [sec]: 5.248243093490601\n" ] } ], @@ -566,7 +566,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 12, "id": "2b461592", "metadata": {}, "outputs": [ @@ -576,7 +576,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.2122771739959717\n" + "running time [sec]: 2.2584328651428223\n" ] } ], @@ -597,7 +597,7 @@ }, { "cell_type": "markdown", - "id": "90688aa9", + "id": "d5fccfb3", "metadata": {}, "source": [ "## Example 2" @@ -605,7 +605,92 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 13, + "id": "d8e9fa63", + "metadata": {}, + "outputs": [], + "source": [ + "seed = 100\n", + "np.random.seed(seed)\n", + "\n", + "T = np.random.rand(100000)\n", + "m = 50\n", + "split_index = 5000" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "263bca9c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 7.606279752022369\n", + "discord_index: 8109\n", + "running time [sec]: 5.100044012069702\n" + ] + } + ], + "source": [ + "naive_DAMP(T, m, split_index) # dummpy run\n", + "\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1\n", + "PL, discord_score, discord_idx = naive_DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_score)\n", + "print('discord_index: ', discord_idx)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "78d09bda", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "discord_dist: 7.606279752022363\n", + "discord_index: 8109\n", + "running time [sec]: 6.70897912979126\n" + ] + } + ], + "source": [ + "DAMP(T, m, split_index) # dummpy run\n", + "\n", + "t_start = time.time()\n", + "excl_zone_denom = core.config.STUMPY_EXCL_ZONE_DENOM\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = 1\n", + "PL, discord_score, discord_idx = DAMP(T, m, split_index)\n", + "core.config.STUMPY_EXCL_ZONE_DENOM = excl_zone_denom\n", + "t_end = time.time()\n", + "\n", + "print('discord_dist: ', discord_score)\n", + "print('discord_index: ', discord_idx)\n", + "print('running time [sec]: ', t_end - t_start)" + ] + }, + { + "cell_type": "markdown", + "id": "90688aa9", + "metadata": {}, + "source": [ + "## Example 3" + ] + }, + { + "cell_type": "code", + "execution_count": 16, "id": "6d64f488", "metadata": {}, "outputs": [], @@ -620,7 +705,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 17, "id": "159e03dc", "metadata": {}, "outputs": [ @@ -630,7 +715,7 @@ "text": [ "discord_dist: 29.42331306408431\n", "discord_index: 6110\n", - "running time [sec]: 5.537461042404175\n" + "running time [sec]: 5.572257995605469\n" ] } ], @@ -651,7 +736,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 18, "id": "9ef28e10", "metadata": {}, "outputs": [ @@ -661,7 +746,7 @@ "text": [ "discord_dist: 29.423313064084333\n", "discord_index: 6110\n", - "running time [sec]: 55.6033251285553\n" + "running time [sec]: 55.545814990997314\n" ] } ], From 53f73469d1385e6da98aad5e2a0fc119af224f38 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Sun, 26 Nov 2023 19:24:29 -0500 Subject: [PATCH 15/17] minor fixes --- docs/Tutorial_DAMP.ipynb | 77 ++++++++++++++++++---------------------- 1 file changed, 35 insertions(+), 42 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index d5f1e69d1..85e9533fe 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 18, "id": "409dad09", "metadata": {}, "outputs": [], @@ -128,14 +128,14 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "id": "c21c4587", "metadata": {}, "outputs": [], "source": [ "def next_pow2(val):\n", " \"\"\"\n", - " Compute the smallest \"power of two\" number that is greater than/ equal to `v`\n", + " Compute the smallest \"power of two\" number that is greater than/ equal to `val`\n", " \n", " Parameters\n", " ----------\n", @@ -152,7 +152,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 19, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -214,7 +214,7 @@ " out.append([max(chunk_start, start), chunk_stop])\n", " chunk_stop = chunk_start + m - 1\n", " \n", - " if chunk_start <= 0:\n", + " if chunk_start <= start:\n", " break\n", " \n", " return np.array(out)" @@ -222,7 +222,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 20, "id": "5e45d72c", "metadata": {}, "outputs": [ @@ -235,7 +235,7 @@ " [ 0, 56]])" ] }, - "execution_count": 6, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } @@ -256,7 +256,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "id": "79663447", "metadata": {}, "outputs": [], @@ -269,8 +269,8 @@ " M_T, \n", " Σ_T, \n", " T_subseq_isconstant, \n", + " chunks_range,\n", " bsf,\n", - " chunks_range=None,\n", "):\n", " \"\"\"\n", " Compute the (approximate) distance between the subsequence `T[query_idx:query_idx+m]`\n", @@ -300,13 +300,13 @@ " A numpy boolean array whose i-th element indicates whether the subsequence\n", " `T[i : i+m]` is constant (True)\n", " \n", - " bsf : float\n", - " The best-so-far discord distance\n", - " \n", - " chunks_range : numpy.ndarray, default None\n", + " chunks_range : numpy.ndarray\n", " A 2D numpy array consisting of rows, each represents\n", " the (start, stop) range of a chunk\n", " \n", + " bsf : float\n", + " The best-so-far discord distance\n", + " \n", " Returns\n", " -------\n", " nn_distance : float\n", @@ -316,11 +316,6 @@ " bsf : float\n", " The best-so-far discord distance \n", " \"\"\"\n", - " if chunks_range is None:\n", - " # The avoid the trivial (left) neighbors of `T[query_idx : query_idx+m]`,\n", - " # the stop index of non-trivial subsequences is `query_idx - excl_zone + m`\n", - " chunks_range = naive_get_range_damp(query_idx - excl_zone + m, m, start=0)\n", - " \n", " nn_distance = np.inf\n", " for (start, stop) in chunks_range:\n", " QT = core.sliding_dot_product(\n", @@ -350,7 +345,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 21, "id": "c76317b5", "metadata": {}, "outputs": [], @@ -429,7 +424,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 32, "id": "769bddad", "metadata": {}, "outputs": [], @@ -470,7 +465,6 @@ " \n", " excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))\n", " chunks_range = get_damp_range(split_idx - excl_zone + m, m, start=0)\n", - " \n", " last_chunk_size = chunks_range[-1, 1] - chunks_range[-1, 0]\n", " last_chunk_size_cutoff = next_pow2(last_chunk_size)\n", " \n", @@ -487,14 +481,13 @@ " excl_zone, \n", " M_T, \n", " Σ_T, \n", - " T_subseq_isconstant, \n", - " bsf, \n", - " chunks_range=chunks_range\n", + " T_subseq_isconstant, \n", + " chunks_range,\n", + " bsf,\n", " )\n", " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", " \n", - " chunks_range = chunks_range + 1 \n", - " # make a decision on whether to create new chunk or not for the subsequence [0, m]\n", + " chunks_range[:] = chunks_range + 1 \n", " if last_chunk_size < last_chunk_size_cutoff:\n", " chunks_range[-1, 0] = 0 \n", " last_chunk_size += 1\n", @@ -521,7 +514,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 23, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -536,7 +529,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 24, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -546,7 +539,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 5.248243093490601\n" + "running time [sec]: 4.881464958190918\n" ] } ], @@ -566,7 +559,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 25, "id": "2b461592", "metadata": {}, "outputs": [ @@ -576,7 +569,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.2584328651428223\n" + "running time [sec]: 2.2210960388183594\n" ] } ], @@ -605,7 +598,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 26, "id": "d8e9fa63", "metadata": {}, "outputs": [], @@ -620,7 +613,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 27, "id": "263bca9c", "metadata": {}, "outputs": [ @@ -630,7 +623,7 @@ "text": [ "discord_dist: 7.606279752022369\n", "discord_index: 8109\n", - "running time [sec]: 5.100044012069702\n" + "running time [sec]: 4.996989965438843\n" ] } ], @@ -651,7 +644,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 28, "id": "78d09bda", "metadata": {}, "outputs": [ @@ -661,7 +654,7 @@ "text": [ "discord_dist: 7.606279752022363\n", "discord_index: 8109\n", - "running time [sec]: 6.70897912979126\n" + "running time [sec]: 6.67956805229187\n" ] } ], @@ -690,7 +683,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 33, "id": "6d64f488", "metadata": {}, "outputs": [], @@ -705,7 +698,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 34, "id": "159e03dc", "metadata": {}, "outputs": [ @@ -715,7 +708,7 @@ "text": [ "discord_dist: 29.42331306408431\n", "discord_index: 6110\n", - "running time [sec]: 5.572257995605469\n" + "running time [sec]: 5.483174085617065\n" ] } ], @@ -736,7 +729,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 35, "id": "9ef28e10", "metadata": {}, "outputs": [ @@ -746,7 +739,7 @@ "text": [ "discord_dist: 29.423313064084333\n", "discord_index: 6110\n", - "running time [sec]: 55.545814990997314\n" + "running time [sec]: 55.60335397720337\n" ] } ], From 99593e04d4b82a7098925d82fa902bc766a61760 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Tue, 28 Nov 2023 22:57:03 -0500 Subject: [PATCH 16/17] minor fix to make sure chunks get updated in each iter --- docs/Tutorial_DAMP.ipynb | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 85e9533fe..117276190 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -471,21 +471,19 @@ " bsf = np.NINF \n", " lookahead = next_pow2(m)\n", " for i in range(split_idx, l):\n", - " if PL[i] < bsf:\n", - " continue \n", - " \n", - " PL[i], bsf = _backward_process(\n", - " T, \n", - " m, \n", - " i, \n", - " excl_zone, \n", - " M_T, \n", - " Σ_T, \n", - " T_subseq_isconstant, \n", - " chunks_range,\n", - " bsf,\n", - " )\n", - " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", + " if PL[i] >= bsf: \n", + " PL[i], bsf = _backward_process(\n", + " T, \n", + " m, \n", + " i, \n", + " excl_zone, \n", + " M_T, \n", + " Σ_T, \n", + " T_subseq_isconstant, \n", + " chunks_range,\n", + " bsf,\n", + " )\n", + " _foreward_process(T, m, excl_zone, M_T, Σ_T, T_subseq_isconstant, i, lookahead, PL)\n", " \n", " chunks_range[:] = chunks_range + 1 \n", " if last_chunk_size < last_chunk_size_cutoff:\n", From 5c110a174886738dca42c5477b0c5673eef674e6 Mon Sep 17 00:00:00 2001 From: nimasarajpoor Date: Fri, 8 Dec 2023 01:11:37 -0500 Subject: [PATCH 17/17] use JIT compiled functions with fastmath flag where possible --- docs/Tutorial_DAMP.ipynb | 55 +++++++++++++++++++++------------------- 1 file changed, 29 insertions(+), 26 deletions(-) diff --git a/docs/Tutorial_DAMP.ipynb b/docs/Tutorial_DAMP.ipynb index 117276190..c57011438 100644 --- a/docs/Tutorial_DAMP.ipynb +++ b/docs/Tutorial_DAMP.ipynb @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "id": "c9564cff", "metadata": {}, "outputs": [], @@ -55,6 +55,7 @@ "import matplotlib.pyplot as plt\n", "import stumpy\n", "\n", + "from numba import njit\n", "from stumpy import core\n", "from scipy.io import loadmat" ] @@ -69,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 3, "id": "409dad09", "metadata": {}, "outputs": [], @@ -128,7 +129,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "c21c4587", "metadata": {}, "outputs": [], @@ -152,7 +153,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 5, "id": "dd1f7b3d", "metadata": {}, "outputs": [], @@ -222,7 +223,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 6, "id": "5e45d72c", "metadata": {}, "outputs": [ @@ -235,7 +236,7 @@ " [ 0, 56]])" ] }, - "execution_count": 20, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -256,11 +257,12 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "id": "79663447", "metadata": {}, "outputs": [], "source": [ + "@njit(fastmath={\"nnan\", \"nsz\", \"arcp\", \"contract\", \"afn\", \"reassoc\"})\n", "def _backward_process(\n", " T, \n", " m, \n", @@ -318,7 +320,7 @@ " \"\"\"\n", " nn_distance = np.inf\n", " for (start, stop) in chunks_range:\n", - " QT = core.sliding_dot_product(\n", + " QT = core._sliding_dot_product(\n", " T[query_idx : query_idx + m], \n", " T[start : stop],\n", " )\n", @@ -345,11 +347,12 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 8, "id": "c76317b5", "metadata": {}, "outputs": [], "source": [ + "@njit(fastmath={\"nnan\", \"nsz\", \"arcp\", \"contract\", \"afn\", \"reassoc\"})\n", "def _foreward_process(\n", " T, \n", " m, \n", @@ -404,7 +407,7 @@ " stop = min(start + lookahead, len(T))\n", " \n", " if stop - start >= m:\n", - " QT = core.sliding_dot_product(T[query_idx : query_idx + m], T[start : stop])\n", + " QT = core._sliding_dot_product(T[query_idx : query_idx + m], T[start : stop])\n", " D = core._mass(\n", " T[query_idx : query_idx + m],\n", " T[start : stop],\n", @@ -424,7 +427,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 9, "id": "769bddad", "metadata": {}, "outputs": [], @@ -512,7 +515,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 10, "id": "2cfbc771", "metadata": {}, "outputs": [], @@ -527,7 +530,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 11, "id": "a935f31f", "metadata": {}, "outputs": [ @@ -537,7 +540,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 4.881464958190918\n" + "running time [sec]: 4.886792898178101\n" ] } ], @@ -557,7 +560,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 12, "id": "2b461592", "metadata": {}, "outputs": [ @@ -567,7 +570,7 @@ "text": [ "discord_dist: 8.500883427933504\n", "discord_index: 209\n", - "running time [sec]: 2.2210960388183594\n" + "running time [sec]: 0.4022231101989746\n" ] } ], @@ -596,7 +599,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 13, "id": "d8e9fa63", "metadata": {}, "outputs": [], @@ -611,7 +614,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 14, "id": "263bca9c", "metadata": {}, "outputs": [ @@ -621,7 +624,7 @@ "text": [ "discord_dist: 7.606279752022369\n", "discord_index: 8109\n", - "running time [sec]: 4.996989965438843\n" + "running time [sec]: 5.08968710899353\n" ] } ], @@ -642,7 +645,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 15, "id": "78d09bda", "metadata": {}, "outputs": [ @@ -652,7 +655,7 @@ "text": [ "discord_dist: 7.606279752022363\n", "discord_index: 8109\n", - "running time [sec]: 6.67956805229187\n" + "running time [sec]: 2.9500508308410645\n" ] } ], @@ -681,7 +684,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 16, "id": "6d64f488", "metadata": {}, "outputs": [], @@ -696,7 +699,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 17, "id": "159e03dc", "metadata": {}, "outputs": [ @@ -706,7 +709,7 @@ "text": [ "discord_dist: 29.42331306408431\n", "discord_index: 6110\n", - "running time [sec]: 5.483174085617065\n" + "running time [sec]: 5.911180019378662\n" ] } ], @@ -727,7 +730,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 18, "id": "9ef28e10", "metadata": {}, "outputs": [ @@ -737,7 +740,7 @@ "text": [ "discord_dist: 29.423313064084333\n", "discord_index: 6110\n", - "running time [sec]: 55.60335397720337\n" + "running time [sec]: 34.58638381958008\n" ] } ],