Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creating function main #4

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions diffpy/snmf/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,14 +44,16 @@ def initialize_variables(data_input, component_amount, data_type, sparsity=1, sm

component_matrix_guess = np.random.rand(signal_length, component_amount)
weight_matrix_guess = np.random.rand(component_amount, moment_amount)
stretching_matrix_guess = np.ones(component_amount, moment_amount) + np.random.randn(component_amount,
stretching_matrix_guess = np.ones((component_amount, moment_amount)) + np.random.randn(component_amount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably not be in io.py. I would suggest to make functions somewhere in subroutines that initialize specific arrays then here write something like:

stretching_matrix_guess = initialize_arrays(number_of_components, number_of_moments)

or sthg like that.

In the function in subroutines.py, put into docstring what the arrays are and how the decision were made on how to do it. Currently the code is a bit too hard to read as it is....I think that by component_amount , based on earlier conversations, you mean number_of_components. I have no idea what moment_amount means, nor why you initialize it with ones and then add some kind of random component, so a docstring could be veyr helpful here (but in subroutines, not in io.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create that function on a new branch. I will also rename the variables to make it more clear.

moment_amount) * 1e-3

diagonals = [np.ones(moment_amount - 2), -2 * np.ones(moment_amount - 2), np.ones(moment_amount - 2)]
smoothness_term = .25 * scipy.sparse.diags(diagonals, [0, 1, 2], shape=(moment_amount - 2, moment_amount))

hessian_helper_matrix = scipy.sparse.block_diag([smoothness_term.T @ smoothness_term] * component_amount)
sequence = np.arange(moment_amount * component_amount).reshape(component_amount, moment_amount).T.flatten()

hessian_helper_matrix = hessian_helper_matrix.tocsr()
hessian_helper_matrix = hessian_helper_matrix[sequence, :][:, sequence]

return {
Expand Down Expand Up @@ -99,10 +101,12 @@ def load_input_signals(file_path=None):
values_list = []
grid_list = []
current_grid = []
file_extension = ""
for item in directory_path.iterdir():
if item.is_file():
file_extension = Path(item).suffix
data = loadData(item.resolve())
if current_grid and current_grid != data[:, 0]:
if len(current_grid) != 0 and (current_grid != data[:, 0]).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment line would help here with the intent. I.e., why do we ignore and not handle if the data are on a different grid. If I remember correctly t here was a physics reason (mathematically it should be straightforward). Just recording for future people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment line to explain this

print(f"{item.name} was ignored as it is not on a compatible grid.")
continue
else:
Expand All @@ -113,4 +117,10 @@ def load_input_signals(file_path=None):
grid_array = np.column_stack(grid_list)
grid_vector = np.unique(grid_array, axis=1)
values_array = np.column_stack(values_list)
return grid_vector, values_array
if file_extension in {'.gr','.chi'}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't hard-code this here. make globals at the top near the imports, something like PDF_CONTAINING_FILE_EXTENSION and DIFFRACTION_CONTAINING_FILE_EXTENSIONS. To make it easier to read use () or []rather than {} so it is explicityly a tuple/list.

Always try and use this structure, it makes the code easier to maintain (if I come up with a new file extension I want to add, I just add it once at the top!)

.chi contains diffraction data, not PDF

I think using file extensions for this is probably a bad idea. For example, iq just means I(Q) but it could be x-ray, electron or neutron data in principle. xye is a file format and could contain anything.

for our files (.iq, sq, gr, cgr) to the extent possible we use a header and metadata about the data is in the header. This is where you try and get default values from. Long is already doing this in the pdfitc code, so maybe try and steal from there. I am not sure if you have access, I can give it to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I will look at the pdfitc code to see if there's a better way of doing this. I will let you know if I am not able to access it.

data_type = 'pdf'
elif file_extension in {'iq','.xy','.xye','.xrd'}:
data_type = 'xrd'
else:
data_type = None
return grid_vector, values_array,data_type
43 changes: 43 additions & 0 deletions diffpy/snmf/stretchednmfapp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
import numpy as np
import argparse

from diffpy.snmf.io import load_input_signals, initialize_variables
from diffpy.snmf.subroutines import *
aajayi-21 marked this conversation as resolved.
Show resolved Hide resolved


def create_parser():
parser = argparse.ArgumentParser(
prog="stretched_nmf",
description="Stretched Nonnegative Matrix Factorization"
)
parser.add_argument('-v', '--version', action='version', help='Print the software version number')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this at the bottom, it will be rarely used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do that.

parser.add_argument('-i', '--input-directory', type=str,
help="Directory containing experimental data. Has a default value of None which sets the input as your current working directory.")
parser.add_argument('-o', '--output-directory', type=str,
help="The directory where the results will be dumped. Default behavior will create a new directory named 'smnf_results' inside the input directory.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumped is a developer work. Maybe written for the users?

For consistency maybe adopt uniform language (lowers cognitive overload of the user), so if we used Defaults to current working directory. above, use Defaults to '<input_directory>/snmf_results'".

I am not sure how you implemented it, but don't dump it off cwd but off input_directory. The code could be run from anywhere but the results should logically be associated with the inputs.

There was a typo in your help string (smnf), double check this typo isn't anywhere else in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the help string and will check for typos

parser.add_argument('-t', '--data-type', type=str, choices=['xrd', 'pdf'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for our purposes, the choices should be powder_diffraction or pdf. Explain what PDF means here for people using the code.

Out of interest, how is this info used? If we want this snmf to be useful more broadly (e.g., for Raman or NMR data or X-ray fluoresence or something) we may want to figure out in a more abstract way what these inputs are for and describe them that way. we can use the diffraction case as an example. e.g., if it is used later to determine if the data are sparse or not we may want to mention that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the choices and explain what PDF means.

The information is used later on. It is used to determine if the data is sparse or not. It also changes the way the algorithm runs. For example, power_diffraction data is not lifted and when the data is powder diffraction the algorithm will run stretched_nmf first and then run sparse_stretched_nmf.

help="The type of the experimental data.")
parser.add_argument('components', type=int,
sbillinge marked this conversation as resolved.
Show resolved Hide resolved
help="The number of component signals to obtain from experimental "
"data. Must be an integer greater than 0.")
args = parser.parse_args()
return args


def main():
args = create_parser()

grid, data_input, data_type = load_input_signals(args.input_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is maybe clearer to put the default logic here that turns input_directory from None to cwd. Or else, make it an optional argument in load_input_data.

Again, it is about code readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this.

if args.data_type is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for readability I would separate the action of handling the default behavior and calling the function. These are just general comments to make your code more readable, I hope you are taking them on board and reusing later...

What I mean is:

    grid, data_input, data_type = load_input_signals(args.input_directory)
    if args.data_type:
       data_type = args.data_type
   variables = initialize_variables(data_input, args.components, data_type)

if not args.data_type:
try:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. For example, it's not good practice to have the function call inside the if statement if that's not needed.

variables = initialize_variables(data_input, args.components, args.data_type)
else:
variables = initialize_variables(data_input, args.components, data_type)

if variables["data_type"] == 'pdf':
sbillinge marked this conversation as resolved.
Show resolved Hide resolved
lifted_data = data_input - np.ndarray.min(data_input[:])
maxiter = 300
return lifted_data


if __name__ == "__main__":
main()
4 changes: 2 additions & 2 deletions setup.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env python

# Installation script for diffpy.pdfmorph
# Installation script for diffpy.snmf

"""diffpy.snmf - a package implementing the stretched NMF algorithm.

Expand Down Expand Up @@ -28,7 +28,7 @@
entry_points={
# define console_scripts here, see setuptools docs for details.
'console_scripts': [
'pdfmorph = diffpy.pdfmorph.pdfmorphapp:main',
'stretched_nmf = diffpy.snmf.stretchednmfapp:main',
],
},
test_suite='tests',
Expand Down