Tutorial notebook DAMP #920

Open · wants to merge 17 commits into base: main · Changes from 1 commit
276 changes: 276 additions & 0 deletions docs/Tutorial_DAMP.ipynb
@@ -0,0 +1,276 @@
{
@NimaSarajpoor (Collaborator, Author), Oct 14, 2023:
Line #24.        For the given index i, segmentize the array np.arange(i) into 

Is there a better way to describe the task of this function? How about:

"Given the index i, divide the array np.arange(i) into chunks. Ensure the last chunk's size is chunksize_init, with the preceding chunks doubling in size. The output consists of (start, stop) indices of each chunk."

Also, we may add:

The left-most chunk always starts at 0.
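For concreteness, here is one possible sketch of the chunking described above (the name `naive_get_range_damp` appears later in this thread; the exact signature is an assumption):

```python
def naive_get_range_damp(i, chunksize_init):
    # Walk backwards from index `i`: the right-most (last) chunk has size
    # `chunksize_init`, and each chunk to its left doubles in size. The
    # left-most chunk always starts at 0 and absorbs whatever remains.
    ranges = []
    chunksize = chunksize_init
    stop = i
    while stop > 0:
        start = max(0, stop - chunksize)
        ranges.append((start, stop))
        stop = start
        chunksize *= 2
    return ranges[::-1]  # left-most chunk first

print(naive_get_range_damp(7, 2))  # → [(0, 1), (1, 5), (5, 7)]
```

Note how the last chunk has size `chunksize_init`, the preceding chunks double in size, and the left-most chunk is simply whatever is left over down to index 0.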



@seanlaw (Contributor), Oct 16, 2023:
This function seems weird. It's not clear why you need to end with the last element being chunksize_init

Something feels backwards here. The intent is hard to follow.

Contributor:
Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

@NimaSarajpoor (Collaborator, Author), Oct 25, 2023:
Something feels backwards here

Well, it is backward (I mean... the process itself is backward). But I trust your lens! Please allow me to think more and see if I can come up with something more elegant.

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

I wanted to do that but I decided to start simple. What we can do is keep shifting the start/stop indices of all chunks by one in each iteration. So, for instance, if I get the chunks for the subsequences in T[:idx], then I can find the chunks for T[:(idx+1)] by just shifting the start/stop indices of the chunks by one to the right. We just need to keep track of the left-most chunk, though, as its size can become a power of two at some point. In that case, the number of chunks will increase by one.

@NimaSarajpoor (Collaborator, Author), Oct 25, 2023:
[FYR...before I forget]
Another approach is to ALWAYS OFFSET FROM THE QUERY INDEX idx by s, 2s, 4s, 8s, and so on, where s is a power of two. The good thing is that the difference between any two consecutive offsets is STILL a power of two. This may result in cleaner code. IIRC, I saw something similar in the MATLAB code before. Going to check that as well.
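A quick sanity check of that observation (`s = 8` is just an assumed example value):

```python
s = 8  # assumed: a power-of-two base offset
offsets = [s * 2 ** k for k in range(5)]  # offsets from the query index
gaps = [b - a for a, b in zip(offsets, offsets[1:])]

print(offsets)  # [8, 16, 32, 64, 128]
print(gaps)     # [8, 16, 32, 64]

# every gap between consecutive offsets is itself a power of two
assert all(g & (g - 1) == 0 for g in gaps)
```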

@NimaSarajpoor (Collaborator, Author), Oct 29, 2023:
The MATLAB code uses the following lines for updating the start / stop of each chunk:

# Query is at index `i`, i.e. `query = T(i:i+SubsequenceLength-1)`
# X is the chunk size, initially set to `2 ^ nextpow2(8 * SubsequenceLength)`

X_start = i - X + 1 + (expansion_num * SubsequenceLength)
X_end = i - (X / 2) + (expansion_num * SubsequenceLength)

# and then
approximate_distance = min( real(MASS_V2(T(X_start:X_end), query))); 
X = X * 2
expansion_num = expansion_num  + 1

The term (expansion_num * SubsequenceLength) is to take into account the elements of the last subsequence in the new chunk. To keep the length of chunk untouched, the X_start has the same shift.
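For readers less familiar with MATLAB's 1-based, inclusive indexing, here is a rough 0-based Python transliteration of that update rule (the helper names `next_pow2` and `chunk_bounds` are assumptions for illustration, not the actual notebook code):

```python
import math

def next_pow2(v):
    # smallest power of two that is >= v
    return int(2 ** math.ceil(math.log2(v)))

def chunk_bounds(i, m, num_expansions):
    # Reproduce the MATLAB X_start / X_end updates with 0-based indexing
    # and an exclusive stop. Each chunk's length is X // 2, i.e. always
    # a power of two, as required by the paper.
    bounds = []
    X = next_pow2(8 * m)
    for expansion_num in range(num_expansions):
        start = i - X + 1 + expansion_num * m
        stop = i - X // 2 + expansion_num * m + 1  # exclusive stop
        bounds.append((start, stop))
        X *= 2
    return bounds

bounds = chunk_bounds(i=10_000, m=50, num_expansions=3)
# every chunk length (stop - start) is a power of two
assert all(((stop - start) & (stop - start - 1)) == 0 for start, stop in bounds)
```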


Note 1:
According to the paper / MATLAB code, the length of the chunk, i.e. T in core.mass(Q, T) (NOT the number of subsequences), should be a power of two.

Note 2:
According to the authors, the reason for choosing a power-of-two chunk length is to speed up the MASS algorithm.

So, I did a quick check to see how much a power-of-two chunk length affects the performance.

seed = 0
np.random.seed(seed)

T = np.random.rand(10_000_000)
m = 50

T, M_T, Σ_T, T_subseq_isconstant = stumpy.core.preprocess(T, m)

query_idx = 600000
start = 0  # the chunk begins at 0, so the chunk length equals `stop`
t_lst = []
for stop in range(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1):
    time_start = time.time()

    QT = core.sliding_dot_product(
        T[query_idx : query_idx + m], 
        T[start : stop],
    )
    
    D = core._mass(
        T[query_idx : query_idx + m],
        T[start : stop],
        QT=QT,
        μ_Q=M_T[query_idx],
        σ_Q=Σ_T[query_idx],
        M_T=M_T[start : stop - m + 1],
        Σ_T=Σ_T[start : stop - m + 1],
        Q_subseq_isconstant=T_subseq_isconstant[query_idx],
        T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],
    )

    time_end = time.time()
    t_lst.append(time_end - time_start)

And then I plot it:

indices = np.arange(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1)
indices = indices[2:]
t_lst = t_lst[2:]

idx = np.flatnonzero(indices == 2 ** 20)[0]

plt.figure(figsize=(20, 5))
plt.scatter(indices[idx], t_lst[idx], color='r', label='chunksize = 2 ** 20')
plt.plot(indices[idx-200 : idx+200], t_lst[idx-200:idx+200]) 
plt.ylabel('running time')
plt.legend()
plt.show()
[image: running time vs. chunk length, with chunksize = 2 ** 20 highlighted in red]

Well, it seems the running time for the chunk size 2 ** 20 is low (not necessarily the lowest), but it should be okay. To remain faithful to the paper, I am going to use a power-of-two length for each chunk.

@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:
Line #28.        while start < chunk_stop:

Is this code readable? I tried to vectorize it but couldn't figure out "how". If the number of subsequences in each chunk were supposed to be a power of two, I think I would be able to vectorize it. However, according to the paper, the size of the chunk itself should be a power of two. In other words, the number of subsequences is like... 2 ** num - m + 1

Since I couldn't find a cleaner solution, I am going with naive_get_range_damp for now.



@seanlaw (Contributor), Nov 20, 2023:
I think I need to see more examples of what the inputs/outputs are supposed to be in order to understand what is expected. Right now, I'm very confused. Can you give me some concrete examples and any gotchas (i.e., when things aren't perfectly square)?

@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:

Line #67.            PL[start : stop - m + 1] = np.minimum(PL[start : stop - m + 1], D)

Previously, we had the variable pruned, which was a boolean vector (of length `len(T)-m+1`), where pruned[i] is True if the i-th subsequence is pruned. And, instead of the line above (i.e. line #67), we had:

mask = np.argwhere(D < bsf) + start

pruned[mask] = True

Recall that the paper's original algorithm does not compute PL[i] if pruned[i] is True. It just skips it. In that case, the original algorithm sets PL[i] as follows:

PL[i] = PL[i-1]

which makes it difficult to find the correct index of the discord. The MATLAB code does a hack instead, as follows:

PL[i] = PL[i-1] - 0.000001

and this does not seem to be a clean solution.

So, instead, I am updating the (approximate) matrix profile PL by using the computed distance profile D. This helps us avoid the hack, and I think it should not be computationally more expensive compared to np.argwhere(D < bsf)
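To illustrate the equivalence on a toy example (the variable names here are illustrative, not the notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random(8)   # distance profile of the current chunk
bsf = 0.5           # best-so-far discord distance
start = 2           # chunk's start index within PL

# MATLAB-style bookkeeping: a boolean "pruned" flag per subsequence
pruned = np.zeros(16, dtype=bool)
pruned[np.flatnonzero(D < bsf) + start] = True

# Alternative: store the distances themselves, so the exact discord
# index can be recovered later without the PL[i-1] - 0.000001 hack
PL = np.full(16, np.inf)
PL[start : start + D.size] = np.minimum(PL[start : start + D.size], D)

# both variants encode the same pruning information
assert np.array_equal(pruned[start : start + D.size], PL[start : start + D.size] < bsf)
```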




@NimaSarajpoor (Collaborator, Author), Nov 19, 2023:

Lines 60-68 update the chunks_range by shifting it one to the right in each iteration. I feel the code is not that clean! Still trying to figure out if there is a better way to update the chunks range. Any thoughts?



"cells": [
{
"cell_type": "markdown",
"id": "0cdc3c0a",
"metadata": {},
"source": [
"# DAMP: Discord-Aware Matrix Profile"
]
},
{
"cell_type": "markdown",
"id": "1bcc1071",
"metadata": {},
"source": [
"Authors in [DAMP](https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf) presented a method for discord detection that is scalable and it can be used in offline and online mode.\n",
"\n",
"To better understand the mechanism behind this method, we should first understand the difference between the full matrix profile and the left matrix profile of a time series `T`. For a subsequence with length `m`, and the start index `i`, i.e. `S_i = T[i:i+m]`, there are two groups of neighbors, known as left and right neighbors. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`. And, the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, considering both left and right ones. However, in the left matrix profile, the `i`-th element is the minimum distance between the subsequence `S_i` and its left neighbors.\n",
"\n",
"One can use either the full matrix profile or the left matrix profile to find the top discord, a subsequence whose distance to its nearest neighbor is larger than the distance of any other subsequences to their nearest neighbors. However, using full matrix profile for detecting discords might result in missing the case where there are two rare subsequences that happen to also be similar to each other (a case that is known as \"twin freak\"). On the other hand, the left matrix profile resolves this problem by capturing the discord at its first occurance. Hence, even if there are two or more of such discords, we can still capture the first occurance by using the left matrix profile."
]
},
{
"cell_type": "markdown",
"id": "27bf47eb",
"metadata": {},
"source": [
"The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the train part is `T[:split_idx]` and the potential anomalies should be coming from `T[split_idx:]`. The value of split_idx is problem dependent. If split_idx is too small, then `T[:split_idx]` may not contain all different kinds of regular patterns. Hence, we may incorrectly select a subsequence as a discord. If split_idx is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios can help with understanding the rationale behind `split_idx`.\n",
"\n",
"(1) `split_idx = 0`: In this case, the first subsequence can be a discord itself as it is a \"new\" pattern. <br>\n",
"(2) `split_idx = len(T) - m` In such case, the last pattern is the only pattern that will be analyzed for the discord. It will be compared against all subsequences except the last one!\n"
]
},
{
"cell_type": "markdown",
"id": "ce3102c1",
"metadata": {},
"source": [
"# Getting Started"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c9564cff",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import stumpy\n",
"import time\n",
"\n",
"from stumpy import core\n",
"from scipy.io import loadmat"
]
},
{
"cell_type": "markdown",
"id": "ecad47df",
"metadata": {},
"source": [
"## Naive approach"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "409dad09",
"metadata": {},
"outputs": [],
"source": [
"def naive_DAMP(T, m, split_idx):\n",
" \"\"\"\n",
" Compute the top-1 discord in `T`, where the subsequence discord resides in T[split_index:]\n",
" \n",
" Parameters\n",
" ----------\n",
" T : numpy.ndarray\n",
" A time series for which the top discord will be computed.\n",
" \n",
" m : int\n",
" Window size\n",
" \n",
" split_idx : int\n",
" The split index between train and test. See note below for further details.\n",
" \n",
" Returns\n",
" -------\n",
" PL : numpy.ndarry\n",
" The [exact] left matrix profile. All infinite distances are ingored in computing\n",
" the discord.\n",
" \n",
" discord_dist : float\n",
" The discord's distance, which is the distance between the top discord and its\n",
" left nearest neighbor\n",
" \n",
" discord_idx : int\n",
" The start index of the top discord\n",
" \n",
" Note\n",
" ----\n",
" \n",
" \"\"\"\n",
" mp = stumpy.stump(T, m)\n",
" IL = mp[:, 2].astype(np.int64)\n",
" \n",
" k = len(T) - m + 1 # len(IL)\n",
" PL = np.full(k, np.inf, dtype=np.float64)\n",
" for i in range(split_idx, k):\n",
" nn_i = IL[i]\n",
" if nn_i >= 0:\n",
" PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))\n",
" \n",
" PL_modified = np.where(PL==np.inf, np.NINF, PL)\n",
" discord_idx = np.argmax(PL_modified)\n",
" discord_dist = PL_modified[discord_idx]\n",
" if discord_dist == np.NINF:\n",
" discord_idx = -1\n",
" \n",
" return PL, discord_dist, discord_idx"
]
},
{
"cell_type": "markdown",
"id": "505d4586",
"metadata": {},
"source": [
"## DAMP"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c21c4587",
"metadata": {},
"outputs": [],
"source": [
"def next_pow2(v):\n",
" \"\"\"\n",
" Compute the smallest \"power of two\" number that is greater than/ equal to `v`\n",
" \n",
" Parameters\n",
" ----------\n",
" v : float\n",
" A real positive value\n",
" \n",
" Returns\n",
" -------\n",
" out : int\n",
" An integer value that is power of two, and satisfies `out >= v`\n",
" \"\"\"\n",
" return int(math.pow(2, math.ceil(math.log2(v))))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79663447",
"metadata": {},
"outputs": [],
"source": [
"def _backward_process(\n",
" T, \n",
" m, \n",
" query_idx, \n",
" M_T, \n",
" Σ_T, \n",
" T_subseq_isconstant, \n",
" bsf,\n",
"):\n",
" \"\"\"\n",
" Compute the (approximate) left matrix profile value that corresponds to the subsequence \n",
" `T[query_idx:query_idx+m]` and update the best-so-far discord distance.\n",
" \n",
" Parameters\n",
" ----------\n",
" T : numpy.ndarray\n",
" A time series\n",
" \n",
" m : int\n",
" Window size\n",
" \n",
" query_idx : int\n",
" The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`\n",
" \n",
" M_T : np.ndarray\n",
" The sliding mean of `T`\n",
" \n",
" Σ_T : np.ndarray\n",
" The sliding standard deviation of `T`\n",
" \n",
" T_subseq_isconstant : numpy.ndarray\n",
" A numpy boolean array whose i-th element indicates whether the subsequence\n",
" `T[i : i+m]` is constant (True)\n",
" \n",
" bsf : float\n",
" The best-so-far discord distance\n",
" \n",
" Returns\n",
" -------\n",
" distance : float\n",
" The [approximate] left matrix profile value that corresponds to \n",
" the query, `T[query_idx : query_idx + m]`.\n",
" \n",
" bsf : float\n",
" The best-so-far discord distance \n",
" \"\"\"\n",
" nn_distance = np.inf # The query's distance to its 1nn.\n",
" \n",
" # To compute the distance between Q=T[query_idx : query_idx + m],\n",
" # and the subsequences in T[chunk_start : chunk_stop].\n",
" chunksize = next_pow2(m) \n",
" chunk_stop = query_idx\n",
" chunk_start = max(0, chunk_stop - chunksize)\n",
" \n",
" while nn_distance >= bsf:\n",
" QT = core.sliding_dot_product(\n",
" T[query_idx : query_idx + m], \n",
" T[chunk_start : chunk_stop],\n",
" )\n",
" D = core._mass(\n",
" T[query_idx : query_idx + m],\n",
" T[chunk_start : chunk_stop],\n",
" QT=QT,\n",
" μ_Q=M_T[query_idx],\n",
" σ_Q=Σ_T[query_idx],\n",
" M_T=M_T[chunk_start : chunk_stop - m + 1],\n",
" Σ_T=Σ_T[chunk_start : chunk_stop - m + 1],\n",
" Q_subseq_isconstant=T_subseq_isconstant[query_idx],\n",
" T_subseq_isconstant=T_subseq_isconstant[chunk_start : chunk_stop - m + 1],\n",
" )\n",
" \n",
" nn_distance = np.min(D)\n",
" if chunk_start == 0: \n",
" # all neighbors of `Q` are visited. Hence, `nn_distance` is exact.\n",
" if nn_distance > bsf:\n",
" bsf = nn_distance\n",
" break\n",
" \n",
" else:\n",
" nn_distance = np.min(D)\n",
" if nn_distance < bfs:\n",
" break\n",
" else:\n",
" chunksize = 2 * chunksize\n",
" chunk_start = max(0, chunk_stop - chunksize)\n",
" \n",
" return nn_distance, bsf"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}