Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tutorial notebook DAMP #920
base: main
Are you sure you want to change the base?
Tutorial notebook DAMP #920
Changes from 1 commit
b311533
4ef1f5d
efeb68a
f3c7f30
01bb14c
f871b42
d7d1314
80f214e
d130159
d3d7e7d
50c6709
633cce8
c112477
585fc4c
53f7346
99593e0
5c110a1
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to describe the task of this function? How about:
"Given the index
i
, divide the arraynp.arange(i)
into chunks. Ensure the last chunk's size ischunksize_init
, with the preceding chunks doubling in size. The output consists of (start, stop) indices of each chunk."Also, we may add:
The left-most chunk always start at 0.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems weird. It's not clear why you need to end with the last element being
chunksize_init
Something feels backwards here. The intent is hard to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it is backward (I mean... the process itself is backward). but, I trust your lens! Please allow me to think more and see if I can come up with something that is more elegant.
I wanted to do that but I decided to start simple. What we can do is to keep shifting the start, stop indices of all chunks by one in each iteration. So, for instance, if I get the chunks for the subsequences in
T[:idx]
. Then, I can find the chunks forT[:(idx+1)]
by just shifting start/stop indices of chunks by one to the right. We just need to keep track of the left-most chunk though as it can become the power two of a number at some point. In such case, the number of chunks will be increased by one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[FYR...before I forget]
Another approach is to ALWAYS OFFSET FROM THE QUERY INDEX
idx
bys
,2s
,4s
,8s
, and so on, wheres
is power of two. The good thing is that the difference between any two consecutive offset is STILL a power of two. This may result in cleaner code. IIRC, I think I saw something similar in the MATLAB code before. Going to check that as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MATLAB code uses the following lines for updating the start / stop of each chunk:
The term
(expansion_num * SubsequenceLength)
is to take into account the elements of the last subsequence in the new chunk. To keep the length of chunk untouched, theX_start
has the same shift.Note 1:
According to the paper / MATLAB code, the length of chunk,
T
incore.mass(Q, T)
, (NOT the number of subsequences) should be a power-of-two.Note 2:
The reason behind considering the length
power-of-two
for chunk is to speed up the MASS algorithm (according to the authors)So, I did a quick check to see how much having the length power-of-two affects the performance.
And, the I plot it:
Well, it seems the running time for the chunk size
2 ** 20
is low (not necessarily the lowest) but it should be okay. To remain faithful to the paper, I am going to consider the length power-of-two for each chunk size.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code readable? I tried to vectorize it but couldn't figure out "how". If the number of subsequences in each chunk were supposed to be a power of two, I think I would be able to vectorize it. However, according to the paper, the size of the chunk itself should be a power of two. In other words, the number of subsequences is like...
2 ** num - m + 1
Since I couldn't find out a cleaner solution, I am going with
naive_get_range_damp
for now.Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to see more examples of what the inputs/outputs are suppose to be in order to understand what is expected. Right now, I'm very confused. Can you give me some concrete examples and any gotchas (i.e., when things aren't perfectly square)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we had the variable
pruned
, which was a boolean vector (of length `len(T)-m+1`), wherepruned[i]
is True if the i-th subsequence is pruned. And, instead of line above (i.e. line #67), we had:Recall that the paper's original algorithm does not compute
PL[i]
ifpruned[i]
is True. It just skips it. In such case, the original algorithm setPL[i]
as follows:which makes it difficult to find the correct index of discord. The MATALB code does a hack instead as follows:
and this does not seem to be a clean solution.
So, instead, I am updating the (approximate) matrix profile
PL
by using the computed distance profile D . This helps us to avoid the hack, and I think It should not be computationally more expensive compared tonp.argwhere(D < bfs)
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 60-68 updates the
chunks_range
by shifting it by one to the right in each iteration. I feel the code is not that clean! Still trying to figure out if there is a better way for updating the chunks range. Any thoughts?Reply via ReviewNB