Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creating function main #4

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft

Conversation

aajayi-21
Copy link
Contributor

No description provided.

@aajayi-21
Copy link
Contributor Author

There is an error in io.py that I corrected. I also feel that there is a much better way to handle user input for the package in my main file.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main issue is to use argparse for the cli inputs and not input.

Take a look at pdfmorph/talk to Andrew to see how it is done.

@@ -102,7 +104,7 @@ def load_input_signals(file_path=None):
for item in directory_path.iterdir():
if item.is_file():
data = loadData(item.resolve())
if current_grid and current_grid != data[:, 0]:
if len(current_grid) != 0 and (current_grid != data[:, 0]).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment line would help here with the intent. I.e., why do we ignore and not handle if the data are on a different grid. If I remember correctly t here was a physics reason (mathematically it should be straightforward). Just recording for future people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment line to explain this

diffpy/snmf/stretchednmfapp.py Outdated Show resolved Hide resolved
if not directory_path:
directory_path = None

data_type = input("Specify the data type ('xrd' or 'pdf'): ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can often be obtained from file header.

if data_type != 'xrd' and data_type != 'pdf':
raise ValueError("The data type must be 'xrd' or 'pdf'")

component_amount = input("\nEnter the amount of components to obtain:")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be "number" of components. "amount" sounds like a quantity of a continuous quantity, like amount of sand or amount of flour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the naming.

try:
component_amount = int(component_amount)
except TypeError:
raise TypeError("Please enter an integer greater than 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in the input instructions rather than have the poor user have to enter quantities and have the program fail to find out they were supposed to put in an integer...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will do this

@aajayi-21
Copy link
Contributor Author

I understand. For argparse should I create a separate function in io.py or put it in main

@sbillinge
Copy link
Contributor

sbillinge commented Jul 19, 2023 via email

@sbillinge
Copy link
Contributor

btw, your current main.py is acting as the "app" so stick with that I would say.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 you got it.....

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls see comments

)
parser.add_argument('-v', '--version', action='version', help='Print the software version number')
parser.add_argument('-d', '--directory', type=str,
help="Directory containing experimental data. Ensure it is in quotations or apostrophes.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it have to be in quotes? Call it maybe input-directory Give it a default value of None then later if args.data_directory == None set it to the cwd. Mention this in the help text.

I think it will end up being a good idea to add an argument output-directory as an optional argument. If it is not specified, maybe have the default behavior to either dump the results into input-directory or create a directory snmf_results that hangs off the input-directory?

Copy link
Contributor Author

@aajayi-21 aajayi-21 Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default, the default value of an optional argument is None (so it would be redundant to explicitly set that). The load_input_signal function handles the case where it is None as you described.

When I was using it, the cli wouldn't work when I specified a directory and it wasn't in quotes. I can see if I can fix this in the code.

I will rename the variables and add an argument as you described. Should I make a new function in io.py to handle dumping the results in a new directory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good.

The most important thing is that the users know what is happening. They don't need to know "the default value is None and this is handled in the code" they need to know the arg is optional and the default behavior is that it uses the current working directory.

It is always a good idea to make your intentions clearer for other devs who come later, so I could definitely argue in favor of explicitly defining default to be None. It just reduces cognitive overload on the person coming later....they don't have to scratch their head and think, "what is going on?" and having to remember what the default behavior of argparse is....

parser.add_argument('-d', '--directory', type=str,
help="Directory containing experimental data. Ensure it is in quotations or apostrophes.")

parser.add_argument('component_number', type=int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest maybe components to make it easier for the users. Later in the code change it number_of_components to make it clearer for future developers. I think component_number has a different meaning, for example, component-1 component-2 etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will rename it here and in the rest of the code.

parser.add_argument('component_number', type=int,
help="The number of component signals to obtain from experimental "
"data. Must be an integer greater than 0.")
parser.add_argument('data_type', type=str, choices=['xrd', 'pdf'], help="The type of the experimental data.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the cli, always use a - rather than a _. Later argparse will replace it with _ for the variable name. So data-type would be here, but we would access it as args.data_type later. Like capitalization...if we always do it the same way it prevents aggravation later.

For this it is often present in the file header, so maybe better to make this optional, but raise an exception and ask the user to rerun specifying it if you can't find it in the file header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I will make it an optional argument and rename it as you described.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see comment.

diffpy/snmf/stretchednmfapp.py Outdated Show resolved Hide resolved
Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great! Good work!

@@ -44,14 +44,16 @@ def initialize_variables(data_input, component_amount, data_type, sparsity=1, sm

component_matrix_guess = np.random.rand(signal_length, component_amount)
weight_matrix_guess = np.random.rand(component_amount, moment_amount)
stretching_matrix_guess = np.ones(component_amount, moment_amount) + np.random.randn(component_amount,
stretching_matrix_guess = np.ones((component_amount, moment_amount)) + np.random.randn(component_amount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably not be in io.py. I would suggest to make functions somewhere in subroutines that initialize specific arrays then here write something like:

stretching_matrix_guess = initialize_arrays(number_of_components, number_of_moments)

or sthg like that.

In the function in subroutines.py, put into docstring what the arrays are and how the decision were made on how to do it. Currently the code is a bit too hard to read as it is....I think that by component_amount , based on earlier conversations, you mean number_of_components. I have no idea what moment_amount means, nor why you initialize it with ones and then add some kind of random component, so a docstring could be veyr helpful here (but in subroutines, not in io.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create that function on a new branch. I will also rename the variables to make it more clear.

@@ -113,4 +117,10 @@ def load_input_signals(file_path=None):
grid_array = np.column_stack(grid_list)
grid_vector = np.unique(grid_array, axis=1)
values_array = np.column_stack(values_list)
return grid_vector, values_array
if file_extension in {'.gr','.chi'}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't hard-code this here. make globals at the top near the imports, something like PDF_CONTAINING_FILE_EXTENSION and DIFFRACTION_CONTAINING_FILE_EXTENSIONS. To make it easier to read use () or []rather than {} so it is explicityly a tuple/list.

Always try and use this structure, it makes the code easier to maintain (if I come up with a new file extension I want to add, I just add it once at the top!)

.chi contains diffraction data, not PDF

I think using file extensions for this is probably a bad idea. For example, iq just means I(Q) but it could be x-ray, electron or neutron data in principle. xye is a file format and could contain anything.

for our files (.iq, sq, gr, cgr) to the extent possible we use a header and metadata about the data is in the header. This is where you try and get default values from. Long is already doing this in the pdfitc code, so maybe try and steal from there. I am not sure if you have access, I can give it to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I will look at the pdfitc code to see if there's a better way of doing this. I will let you know if I am not able to access it.

prog="stretched_nmf",
description="Stretched Nonnegative Matrix Factorization"
)
parser.add_argument('-v', '--version', action='version', help='Print the software version number')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this at the bottom, it will be rarely used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do that.

)
parser.add_argument('-v', '--version', action='version', help='Print the software version number')
parser.add_argument('-i', '--input-directory', type=str, default=None,
help="Directory containing experimental data. Default before will cause the program to use the current working directory as the input directory.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:
"Directory containing experimental data. Defaults to current working directory."
Were you able to resolve the issue with the quotes? It is ok ot put that back if needed, but is a bit ugly in a user interface if it can be avoided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not. I will see if I can find a solution.

parser.add_argument('-i', '--input-directory', type=str, default=None,
help="Directory containing experimental data. Default before will cause the program to use the current working directory as the input directory.")
parser.add_argument('-o', '--output-directory', type=str,
help="The directory where the results will be dumped. Default behavior will create a new directory named 'smnf_results' inside the input directory.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumped is a developer work. Maybe written for the users?

For consistency maybe adopt uniform language (lowers cognitive overload of the user), so if we used Defaults to current working directory. above, use Defaults to '<input_directory>/snmf_results'".

I am not sure how you implemented it, but don't dump it off cwd but off input_directory. The code could be run from anywhere but the results should logically be associated with the inputs.

There was a typo in your help string (smnf), double check this typo isn't anywhere else in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the help string and will check for typos

diffpy/snmf/stretchednmfapp.py Show resolved Hide resolved
def main():
args = create_parser()

grid, data_input, data_type = load_input_signals(args.input_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is maybe clearer to put the default logic here that turns input_directory from None to cwd. Or else, make it an optional argument in load_input_data.

Again, it is about code readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this.

args = create_parser()

grid, data_input, data_type = load_input_signals(args.input_directory)
if args.data_type is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for readability I would separate the action of handling the default behavior and calling the function. These are just general comments to make your code more readable, I hope you are taking them on board and reusing later...

What I mean is:

    grid, data_input, data_type = load_input_signals(args.input_directory)
    if args.data_type:
       data_type = args.data_type
   variables = initialize_variables(data_input, args.components, data_type)

if not args.data_type:
try:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. For example, it's not good practice to have the function call inside the if statement if that's not needed.

diffpy/snmf/stretchednmfapp.py Show resolved Hide resolved

weights_matrix = variables["weight_matrix_guess"]
component_matrix = variables["component_matrix_guess"]
stretching_factor_matrix = variables["stretching_matrix_guess"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you did this for half the variables and not the others. I would just not do this and get the values from the dict below.

If you lint the code with black (whatever you do, do that on a separate branch! it makes the code review impossible to mix linting edits with functional edits!) it will arrange this in a nice readable way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, I will remove this.

I will look into setting up black to make my code more readable and better formatted.

… changed help string of output-directory default behavior for input_directory handled in main,
…d "moment_amount" to "number_of_moments" in io.py
Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments. Please read them. We can discuss how to do the refactor from the nice chart you made and shared.

diffpy/snmf/stretchednmfapp.py Show resolved Hide resolved
Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see my comment

@@ -18,6 +18,8 @@ def create_parser():
help="The directory where the results will be written. Defaults to '<input_directory>/snmf_results'.")
parser.add_argument('-t', '--data-type', type=str, choices=['powder_diffraction', 'pdf'],
help="The type of the experimental data.")
parser.add_argument('-l', '--lift', type=float, default=1,
help="The factor that determines how much the data is lifted. By default, the data will be vertically translated to make the minimum value 0.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about "The lifting factor. Data will be lifted by lifted data = data - min(data)*lift. Default is 1"

btw, it occurs to me that if the min(data) is positive, the "lift" will "lower" the data to zero if lift = 1 and will lower it to below zero if lift > 1. This is probably undesirable behavior. Make sure you have tests for this eventuality that give the behavior you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make this changes.

The same thought occurred to me. If the min(data) is positive, it's not apparent to me that the data would need to be lifted. My idea is that if the minimum is positive, then do nothing to the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is reasonable, though why not allow the user to lift the data if she wants to?

Maybe make sure the lift is a lift, so it adds the absolute value of min(data)*lift to the data, something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know what to do here. There is really too much going on for me to merge this PR, so it would be better to split things onto more different PRs, one for ach different bits of functionality.

On thing we don't want to forget is to refactor number_of_moments to number_of_inputs or something. I know that "moments" will confuse me (and others) in the future.

If you want I can merge this and you can make PRs fixing things. I don'r really mind doing that for ugly "main" functions that work but are ugly and you are gradually going through them clenaing things one thing at a time but want them to keep working "ugly" as you go.

@aajayi-21
Copy link
Contributor Author

I understand. I think it will be better for me to make a bunch of new PRs. Do I make these PRs one at a time or all at once?

Also, if we go with the approach where I have a ComponentSignal object, some of the functions have to change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants