Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creating function main #4

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions diffpy/snmf/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,14 +44,16 @@ def initialize_variables(data_input, component_amount, data_type, sparsity=1, sm

component_matrix_guess = np.random.rand(signal_length, component_amount)
weight_matrix_guess = np.random.rand(component_amount, moment_amount)
stretching_matrix_guess = np.ones(component_amount, moment_amount) + np.random.randn(component_amount,
stretching_matrix_guess = np.ones((component_amount, moment_amount)) + np.random.randn(component_amount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably not be in io.py. I would suggest to make functions somewhere in subroutines that initialize specific arrays then here write something like:

stretching_matrix_guess = initialize_arrays(number_of_components, number_of_moments)

or sthg like that.

In the function in subroutines.py, put into docstring what the arrays are and how the decision were made on how to do it. Currently the code is a bit too hard to read as it is....I think that by component_amount , based on earlier conversations, you mean number_of_components. I have no idea what moment_amount means, nor why you initialize it with ones and then add some kind of random component, so a docstring could be veyr helpful here (but in subroutines, not in io.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create that function on a new branch. I will also rename the variables to make it more clear.

moment_amount) * 1e-3

diagonals = [np.ones(moment_amount - 2), -2 * np.ones(moment_amount - 2), np.ones(moment_amount - 2)]
smoothness_term = .25 * scipy.sparse.diags(diagonals, [0, 1, 2], shape=(moment_amount - 2, moment_amount))

hessian_helper_matrix = scipy.sparse.block_diag([smoothness_term.T @ smoothness_term] * component_amount)
sequence = np.arange(moment_amount * component_amount).reshape(component_amount, moment_amount).T.flatten()

hessian_helper_matrix = hessian_helper_matrix.tocsr()
hessian_helper_matrix = hessian_helper_matrix[sequence, :][:, sequence]

return {
Expand Down Expand Up @@ -102,7 +104,7 @@ def load_input_signals(file_path=None):
for item in directory_path.iterdir():
if item.is_file():
data = loadData(item.resolve())
if current_grid and current_grid != data[:, 0]:
if len(current_grid) != 0 and (current_grid != data[:, 0]).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment line would help here with the intent. I.e., why do we ignore and not handle if the data are on a different grid. If I remember correctly t here was a physics reason (mathematically it should be straightforward). Just recording for future people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment line to explain this

print(f"{item.name} was ignored as it is not on a compatible grid.")
continue
else:
Expand Down
34 changes: 34 additions & 0 deletions diffpy/snmf/stretchednmfapp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
import numpy as np
import argparse

from diffpy.snmf.io import load_input_signals, initialize_variables


def create_parser():
parser = argparse.ArgumentParser(
prog="stretched_nmf",
description="Stretched Nonnegative Matrix Factorization"
)
parser.add_argument('-v', '--version', action='version', help='Print the software version number')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this at the bottom, it will be rarely used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do that.

parser.add_argument('-d', '--directory', type=str,
help="Directory containing experimental data. Ensure it is in quotations or apostrophes.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it have to be in quotes? Call it maybe input-directory Give it a default value of None then later if args.data_directory == None set it to the cwd. Mention this in the help text.

I think it will end up being a good idea to add an argument output-directory as an optional argument. If it is not specified, maybe have the default behavior to either dump the results into input-directory or create a directory snmf_results that hangs off the input-directory?

Copy link
Contributor Author

@aajayi-21 aajayi-21 Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default, the default value of an optional argument is None (so it would be redundant to explicitly set that). The load_input_signal function handles the case where it is None as you described.

When I was using it, the cli wouldn't work when I specified a directory and it wasn't in quotes. I can see if I can fix this in the code.

I will rename the variables and add an argument as you described. Should I make a new function in io.py to handle dumping the results in a new directory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good.

The most important thing is that the users know what is happening. They don't need to know "the default value is None and this is handled in the code" they need to know the arg is optional and the default behavior is that it uses the current working directory.

It is always a good idea to make your intentions clearer for other devs who come later, so I could definitely argue in favor of explicitly defining default to be None. It just reduces cognitive overload on the person coming later....they don't have to scratch their head and think, "what is going on?" and having to remember what the default behavior of argparse is....


parser.add_argument('component_number', type=int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest maybe components to make it easier for the users. Later in the code change it number_of_components to make it clearer for future developers. I think component_number has a different meaning, for example, component-1 component-2 etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will rename it here and in the rest of the code.

help="The number of component signals to obtain from experimental "
"data. Must be an integer greater than 0.")
parser.add_argument('data_type', type=str, choices=['xrd', 'pdf'], help="The type of the experimental data.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the cli, always use a - rather than a _. Later argparse will replace it with _ for the variable name. So data-type would be here, but we would access it as args.data_type later. Like capitalization...if we always do it the same way it prevents aggravation later.

For this it is often present in the file header, so maybe better to make this optional, but raise an exception and ask the user to rerun specifying it if you can't find it in the file header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I will make it an optional argument and rename it as you described.

args = parser.parse_args()
return args


def main():
args = create_parser()

grid, data_input = load_input_signals(args.directory)
variables = initialize_variables(data_input, args.component_number, args.data_type)
lifted_data = data_input - np.ndarray.min(data_input[:])
return lifted_data


if __name__ == "__main__":
main()
4 changes: 2 additions & 2 deletions setup.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env python

# Installation script for diffpy.pdfmorph
# Installation script for diffpy.snmf

"""diffpy.snmf - a package implementing the stretched NMF algorithm.

Expand Down Expand Up @@ -28,7 +28,7 @@
entry_points={
# define console_scripts here, see setuptools docs for details.
'console_scripts': [
'pdfmorph = diffpy.pdfmorph.pdfmorphapp:main',
'stretched_nmf = diffpy.snmf.stretchednmfapp:main',
],
},
test_suite='tests',
Expand Down