Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Programming c++ implementation #9

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
bb10165
added Dynp class
npkamath Jan 2, 2024
d13dfc7
reverted extra spaces
npkamath Jan 2, 2024
5bbbaf2
reverted spaces
npkamath Jan 2, 2024
930430a
removed read_input function
npkamath Jan 2, 2024
379897a
Merge remote-tracking branch 'refs/remotes/origin/main'
npkamath Jan 2, 2024
e35cff1
removed read_input function
npkamath Jan 2, 2024
e8ccb7c
Delete PULL_REQUEST_TEMPLATE.md
npkamath Jan 2, 2024
fccca35
refactor: Remove Dynp directory
b-butler Jan 9, 2024
0c64950
fixed namespace, naming, and cleaned up comments
npkamath Jan 12, 2024
fe6bac1
added dynp.py file, fixed constructors, added column_indices for fast…
npkamath Jan 14, 2024
245a92f
fixed index system
npkamath Jan 15, 2024
0af1b08
reorganized class variables and added dynp.py functions and fit
npkamath Jan 15, 2024
66e2104
fit function added with parameter input, removed whitespace with pre…
npkamath Jan 19, 2024
a4cff6f
Merge remote-tracking branch 'upstream/main'
b-butler Jan 24, 2024
9ccf1fe
build: Switch to scikit-build-core.
b-butler Jan 24, 2024
8da0fe0
ci: Add reusable workflow to install system packages
b-butler Jan 24, 2024
943a6c0
ci: Fix package installation by using composite action
b-butler Jan 24, 2024
98ee45c
ci: Correctly specify shell for custom action
b-butler Jan 24, 2024
744b928
ci: Fix action step name formatting
b-butler Jan 24, 2024
b8dd0f2
ci: Remove trailing ":"
b-butler Jan 24, 2024
c7e6920
ci: Update package manager caches before installing
b-butler Jan 24, 2024
7771157
ci: Fix apt-get package names
b-butler Jan 24, 2024
f7287f8
ci: Fix one last package name
b-butler Jan 24, 2024
47aedc0
ci: Yet another package name change
b-butler Jan 24, 2024
4fd3a58
documentation and renaming added
npkamath Feb 26, 2024
4dd27f8
upper triangular restructured and dynp restructured
npkamath Feb 26, 2024
f51ff1c
upper triangular restructured and dynp restructured
npkamath Feb 26, 2024
c0afe6f
conditionals
npkamath Feb 26, 2024
77e3e00
naming and other cpp changes
npkamath Feb 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
cmake_minimum_required(VERSION 3.8)
project(_dupin VERSION 0.0.1)

set(DEFAULT_BUILD_TYPE "Release")
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE ${DEFAULT_BUILD_TYPE})
endif()

find_package(Eigen3 REQUIRED)
find_package(TBB REQUIRED)
add_subdirectory(extern/pybind11)

include_directories(${PROJECT_SOURCE_DIR}/src)
add_subdirectory(src)
46 changes: 46 additions & 0 deletions dupin/detect/dynp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
import _DynP

class DynP:
"""Dynamic Programming class for calculating optimal segmentation

Attributes:
data (np.ndarray): Matrix storing the dataset.
num_bkps (int): Number of breakpoints to detect.
jump (int): Interval for checking potential breakpoints.
min_size (int): Minimum size of a segment.
"""

def __init__(self, data: np.ndarray, num_bkps: int, jump: int, min_size: int):
"""Initializes the DynamicProgramming instance with given parameters."""
self.dynp = _DynP.DynamicProgramming(data, num_bkps, jump, min_size)
npkamath marked this conversation as resolved.
Show resolved Hide resolved

def set_num_threads(self, num_threads: int):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term dev wise, if additional CPP methods will be added we will most likely want num_threads to be controlled on the level of the whole module? Would preprocessor be a good place to have this? @b-butler thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to as a _DynP module function and exporting it to dupin/util.py is probably the best solution in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would just move this function to util.py right? would just have to import _DynP in util.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

"""Sets the number of threads for parallelization.

Args:
num_threads (int): The number of threads to use.
"""
self.dynp.set_threads(num_threads)

def return_breakpoints(self) -> list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unnecessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify which part

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant return_breakpoints, I removed it and just kept fit.

"""Returns the optimal set of breakpoints after segmentation.

Returns:
list: A list of integers representing the breakpoints.
"""
return self.dynp.return_breakpoints()

def initialize_cost_matrix(self):
"""Initializes and fills the upper triangular cost matrix for all data segments."""
self.dynp.initialize_cost_matrix()

def fit(self) -> list:
"""Calculates the cost matrix and returns the breakpoints.

Returns:
list: A list of integers representing the breakpoints.
"""
return self.dynp.fit()
npkamath marked this conversation as resolved.
Show resolved Hide resolved



17 changes: 17 additions & 0 deletions src/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
pybind11_add_module(_dupin dupininterface.cpp
dupin.h dupin.cpp
npkamath marked this conversation as resolved.
Show resolved Hide resolved
)

set_target_properties(_dupin PROPERTIES
CXX_STANDARD 17
CMAKE_CXX_STANDARD_REQUIRED True
)

target_include_directories(_dupin PRIVATE
${EIGEN3_INCLUDE_DIR}
${TBB_INCLUDE_DIRS}
)

target_link_libraries(_dupin PRIVATE TBB::tbb)
target_compile_definitions(_dupin PRIVATE VERSION_INFO=${PROJECT_VERSION})
target_compile_options(_dupin PRIVATE -O2 -march=native)
173 changes: 173 additions & 0 deletions src/dupin.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
#include <iostream>
#include <iomanip>
#include <limits>
#include <unordered_map>
#include <vector>
#include <Eigen/Dense>
#include <tbb/blocked_range2d.h>
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include "dupin.h"

using namespace std;
using namespace Eigen;
Comment on lines +12 to +13
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use the namespace, explicitly scope them. It makes the code more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace std;
using namespace Eigen;

If you use std:: and Eigen:: then you don't need these lines.


DynamicProgramming::DynamicProgramming()
: num_bkps(1), num_parameters(0), num_timesteps(0), jump(1), min_size(3) {}

DynamicProgramming::DynamicProgramming(const Eigen::MatrixXd &data, int num_bkps_,
int jump_, int min_size_)
: data(data), num_bkps(num_bkps_),
jump(jump_), min_size(min_size_) {
num_timesteps = data.rows();
num_parameters = data.cols();
}

void DynamicProgramming::scale_data() {
Eigen::VectorXd min_val = data.colwise().minCoeff();
Eigen::VectorXd max_val = data.colwise().maxCoeff();
Eigen::VectorXd range = max_val - min_val;

for (int j = 0; j < num_parameters; ++j) {
if (range(j) == 0.0) {
data.col(j).setZero();
} else {
data.col(j) = (data.col(j).array() - min_val(j)) / range(j);
}
}
}
void DynamicProgramming::regression_setup(linear_fit_struct &lfit) {
lfit.x = Eigen::VectorXd::LinSpaced(num_timesteps, 0, num_timesteps - 1) /
(num_timesteps - 1);
lfit.y = data;
}

Eigen::VectorXd DynamicProgramming::regression_line(int start, int end, int dim,
linear_fit_struct &lfit) {
int n = end - start;
Eigen::VectorXd x = lfit.x.segment(start, n);
Eigen::VectorXd y = lfit.y.col(dim).segment(start, n);

double x_mean = x.mean();
double y_mean = y.mean();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably save some time by saving and computing the cumulative sum and then computing the mean through ((sum_[end] - sum_[begin]) / (end - begin)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were your thoughts @npkamath?


Eigen::VectorXd x_centered = x.array() - x_mean;
Eigen::VectorXd y_centered = y.array() - y_mean;

double slope = x_centered.dot(y_centered) / x_centered.squaredNorm();
double intercept = y_mean - slope * x_mean;

return x.unaryExpr(
[slope, intercept](double xi) { return slope * xi + intercept; });
}

double DynamicProgramming::l2_cost(Eigen::MatrixXd &predicted_y, int start, int end) {
Eigen::MatrixXd diff = predicted_y.block(start, 0, end - start, num_parameters) -
data.block(start, 0, end - start, num_parameters);
return std::sqrt(diff.array().square().sum());
}

Eigen::MatrixXd DynamicProgramming::predicted(int start, int end,
linear_fit_struct &lfit) {
Eigen::MatrixXd predicted_y(num_timesteps, num_parameters);
for (int i = 0; i < num_parameters; ++i) {
predicted_y.block(start, i, end - start, 1) =
regression_line(start, end, i, lfit);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should compute the regression for all dimensions simultaneously; it should be faster. Also, is there a way to avoid the copying of data and have the prediction method take a reference to the data to store in?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had an attempt at it but it is not returning the right type to pass into the next function; I will try a different approach, maybe not with Eigen

}
return predicted_y;
}

double DynamicProgramming::cost_function(int start, int end) {
linear_fit_struct lfit;
regression_setup(lfit);
Comment on lines +86 to +87
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the constructor for linear_fit_struct set up the values.

Eigen::MatrixXd predicted_y = predicted(start, end, lfit);
return l2_cost(predicted_y, start, end);
}

void DynamicProgramming::initialize_cost_matrix() {
scale_data();
cost_matrix.initialize(num_timesteps);
tbb::parallel_for(tbb::blocked_range<int>(0, num_timesteps),
[&](const tbb::blocked_range<int> &r) {
for (int i = r.begin(); i < r.end(); ++i) {
for (int j = i + min_size; j < num_timesteps; ++j) {
cost_matrix(i, j) = cost_function(i, j);
}
}
});
}

std::pair<double, std::vector<int>> DynamicProgramming::seg(int start, int end,
int num_bkps) {
MemoKey key = {start, end, num_bkps};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MemoKey key = {start, end, num_bkps};
MemoKey key{start, end, num_bkps};

auto it = memo.find(key);
if (it != memo.end()) {
return it->second;
}
if (num_bkps == 0) {
return {cost_matrix(start, end), {end}};
}

std::pair<double, std::vector<int>> best = {std::numeric_limits<double>::infinity(), {}};

for (int bkp = start + min_size; bkp < end; bkp++) {
if ((bkp - start) >= min_size && (end - bkp) >= min_size) {
npkamath marked this conversation as resolved.
Show resolved Hide resolved
auto left = seg(start, bkp, num_bkps - 1);
auto right = seg(bkp, end, 0);
double cost = left.first + right.first;
if (cost < best.first) {
best.first = cost;
best.second = left.second;
best.second.push_back(bkp);
best.second.insert(best.second.end(), right.second.begin(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't right always be one element long? If so, we could just push back.

right.second.end());
}
}
}

memo[key] = best;
return best;
}

std::vector<int> DynamicProgramming::compute_breakpoints() {
auto result = seg(0, num_timesteps - 1, num_bkps);
std::vector<int> breakpoints = result.second;
std::sort(breakpoints.begin(), breakpoints.end());
npkamath marked this conversation as resolved.
Show resolved Hide resolved
breakpoints.erase(std::unique(breakpoints.begin(), breakpoints.end()),
breakpoints.end());
npkamath marked this conversation as resolved.
Show resolved Hide resolved
return breakpoints;
}

std::vector<int> DynamicProgramming::fit(){
initialize_cost_matrix();
return compute_breakpoints();
}

void set_parallelization(int num_threads) {
static tbb::global_control gc(tbb::global_control::max_allowed_parallelism,
num_threads);
}

int DynamicProgramming::get_num_timesteps() { return num_timesteps; }

int DynamicProgramming::get_num_parameters() { return num_parameters; }

int DynamicProgramming::get_num_bkps() { return num_bkps; }

Eigen::MatrixXd &DynamicProgramming::getDatum() { return data; }

DynamicProgramming::UpperTriangularMatrix &
DynamicProgramming::getCostMatrix() {
return cost_matrix;
}

void DynamicProgramming::setDatum(const Eigen::MatrixXd &value) {
data = value;
}

void DynamicProgramming::setCostMatrix(
const DynamicProgramming::UpperTriangularMatrix &value) {
cost_matrix = value;
}
Comment on lines +165 to +168
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should anyone be setting the full cost matrix?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! honestly the setter functions aren't needed anymore. I can probably delete them and just set the getters as actual functions rather than properties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved but not deleted.


int main() { return 0; }
Loading
Loading