Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General Refactoring TODO #1

Open
7 tasks
EdwardRaff opened this issue Mar 25, 2015 · 9 comments
Open
7 tasks

General Refactoring TODO #1

EdwardRaff opened this issue Mar 25, 2015 · 9 comments

Comments

@EdwardRaff
Copy link
Owner

EdwardRaff commented Mar 25, 2015

Notes for myself on some refactoring I want to do at some point. Probably going to do these all at once when I move to Java 8

  • use "size" as method name instead of "getSampleSize"
  • standardize Vec / Matrix method order to (Other obj, constant, Obj to mutate).
  • Make transforms have a train method that just takes a DataSet, and remove the factories used. It will be up to the implementation to throw an error if the data set isn't of the desired type.
  • Move ARFF into io package
  • Remove weight from data point object, make methods take in a weight vector if they are going to use weights. (Might be a Java 8 only one)
  • Remove factory concept from Transforms and just use a fit method.
  • Improve Text Loaders. Rename to TextCorpus for the current loaders. Make the base abstract classes non abstract so they can be used for unlabeled data loading.
@TKlerx
Copy link
Contributor

TKlerx commented Apr 17, 2015

May I add some things?!

I have seen some double comparisons with a ==, which is not numerically stable, e.g.

if(a == b && a == x)

in distributions.Uniform:26 (and in many other lines).

I would change every occurrence in the code, but usually I use apache-commons-math.Precision.
Is it your goal to keep JSAT without any external dependencies?
References:
[https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/]
[https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/util/Precision.html#equals%28double,%20double,%20int%29]
[http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#689]

In addition, I already corrected some java docs. You can see it in my dev branch. Hell, you wrote a lot of code. How long have you been developing JSAT?

@EdwardRaff
Copy link
Owner Author

May I add some things?!

Of course!

in distributions.Uniform:26 (and in many other lines).

In that case it is actually safe. if a == b && a == x is the case where the uniform distribution is over one infinitesimal point, which really doesn't make sense, but is needed because otherwise the (a <= x && x <= b) case would apply, and result in a NaN from a division by zero.

I'm generally aware of the dangers of floating point, but there are instances where such == checks are done intentionally, such as the case in the Uniform distribution. There are a few other instances of code smells in JSAT that are done intentionally, hopefully I did a better job documenting them then I did that one in Uniform.java

I already corrected some java docs.

Feel free to do a pull request on them!

Hell, you wrote a lot of code. How long have you been developing JSAT?

Been a free time project for just over 4 years now :) I'm very much a "learn by doing" kinda guy, so whenever I see a paper I'm interested in I just go and try and implement it when I have the time. Though I did have more free time before working for money :-/

@TKlerx
Copy link
Contributor

TKlerx commented Apr 18, 2015

Feel free to do a pull request on them!

I will try to do so, but I already added some things which I would say are necessary (apache-commons-math for safe double comparison; plan to add more, e.g. trove for primitive lists).
My plan is to create a Maven repo that contains my JSAT version so I also changed the pom.xml and will even add more stuff.
I will try to cherry pick the safe changes without adding dependencies etc.

I ran findbugs on the project and it found some bad double compares and x == Double.NaN checks.
What do you think about including some de facto standard libraries for such hassles?

@EdwardRaff
Copy link
Owner Author

As I said, some of those checks are done intentionally.

Right now I do not want to have any dependencies in JSAT.

Also, FYI - there is some code where you will experience a serious performance regression in using Trove instead of the code I have for primitive maps. (And some code that is on my local repo and not committed yet).

@TKlerx
Copy link
Contributor

TKlerx commented Apr 18, 2015

As I said, some of those checks are done intentionally.

I think I found some places where the checks should not be done but I will write a testcase first which may take some time. Will let you know when it's done.

Do you have any advice how to compare two doubles for equality in JSAT without an external library? I have an array of doubles and want to check whether they are all the same.

Right now I do not want to have any dependencies in JSAT.

Ok :(

Also, FYI - there is some code where you will experience a serious performance regression in using Trove instead of the code I have for primitive maps. (And some code that is on my local repo and not committed yet).

I would use them e.g. instead of List to prevent a lot of AutoBoxing (I had an algorithm where removing boxing/unboxing increased the algorithm by factor 2).

@EdwardRaff
Copy link
Owner Author

Feel free to email me the places where you have concerns and I will review them

You need to decide what level of "sameness" you need. In ML we usually don't care about the difference in value less than 1e-3, so doing abs(a-b) < 1e-3 should be good for most cases. In the Vec class there is a compare to method that takes a threshold on the absolute difference.

As I said, I have my own primitive collections in JSAT, I did not say to use generics.

@TKlerx
Copy link
Contributor

TKlerx commented Apr 18, 2015

As I said, I have my own primitive collections in JSAT, I did not say to use generics.

Ahh, found them by now. Is there a reason why you use List/List then? E.g. NaiveBayes.getSampleVariableVector (l. 388)

@EdwardRaff
Copy link
Owner Author

That code was written before I created the DoubleList class, and I missed changing that one.

@semper-omnia-paratus
Copy link

semper-omnia-paratus commented Jun 16, 2016

I would separate abstract classes/interfaces from concrete implementations by putting them into sub-packages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants