
Error with more than 2^31-1 observations #43

Open
mcaceresb opened this issue Aug 19, 2018 · 15 comments
Comments

@mcaceresb
Owner

A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.

I contacted StataCorp about it and they replied:

> The SPI can work with datasets containing up to 2^31-1 observations. Our development group is looking into modifying future versions of the SPI to allow more observations.

mcaceresb added a commit that referenced this issue Aug 19, 2018
…teger bug

* Closes #43
* Added option to selectively test via gtools, test[]
* Gtools exits with error if `_N > 2^31-1` and points the user to the
  pertinent bug report. The SPI uses long integers (32-bit) instead of
  long long integers (64-bit) for its observation numbers (not to
  mention they are signed integers, but alas). At any rate, this de facto
  limits the number of observations to 2^31-1.
@mcaceresb
Owner Author

Re-opening because technically it hasn't been fixed; it just throws the correct error now.

@mcaceresb mcaceresb reopened this Aug 21, 2018
@miserton

miserton commented Aug 9, 2022

Any update on this bug - is it correctable now?

@mcaceresb
Owner Author

@miserton Not afaik. I would contact StataCorp to ask about updates to this point, since I cannot fix it until they update the SPI. Sorry!

@miserton

I reached out to StataCorp:

> The stplugin.h has been frozen so that we do not make any changes that could break commands that are still using this code. For 64-bit integer support we recommend working with Java or Python code instead of the older 32-bit C plugin limits.

Unless gtools could call the C code through Java or Python, it doesn't look like the 2.1B limitation will be overcome.

@mcaceresb
Owner Author

@miserton That's a bit of a strange answer, since changing stplugin.h in particular would only involve adding a long long type (and on my system long is 64-bit anyway). I always thought the issue was the internals, and that changing the internal functions to take 64-bit input was difficult for some reason. But I don't really know.

In any case, you are right this is unlikely to be resolved any time soon, if ever.

@wbuchanan

@miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

@mcaceresb
Owner Author

@wbuchanan Porting to Java would take quite a bit of effort (partly since I'm not particularly familiar with Java; not sure if there would be a simple thin wrapper to execute C code from Java, but I expect it would still take some work).

The problem is that I cannot modify observations past 2^31-1. The only solution I can think of given the current limitations would be to create multiple separate datasets and append them at the end from Stata.

@miserton

> @miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

I got the answer from Stata technical support ([email protected]).

@wbuchanan

@mcaceresb that is what I tried the most recent time and there were still too many observations. There is something called the Java Native Interface, but I've not used that at all. However, even though it's been a while I'm still pretty familiar with the Java API for Stata. The only other thing that might be feasible would be trying to figure out how to implement things in a Cythonic way and using the Python API for the entry point, but I'm not sure how much performance degradation that would cause.

@mcaceresb
Owner Author

@wbuchanan

> that is what I tried the most recent time and there were still too many observations.

Wait, if k is the maximum number of variables that match a given stub, you can reshape long by looping ceil(_N * k / (2^31-1)) chunks of roughly equal size, no?

@wbuchanan

Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

@miserton

> Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

I was grappling with a similar issue when I ran into the maximum number of variables, needing to reshape a very large dataset and greshape was the only thing I thought might work. What I ultimately did was save the data as a CSV and then use some Bash code to write the equivalent of a reshape. I was going from long to wide in the example, but you can do the same thing going from wide to long:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1689382-how-to-identify-duplicate-combinations-of-observations-in-long-form-without-reshaping

@mcaceresb
Owner Author

mcaceresb commented Jun 21, 2024

@wbuchanan I had a reshape like this: 8 stubs, 144 variables per stub (monthly data over 12 years), a bit under 100M observations. The main thing in my favor was that a TON of the wide data was missing, so the long version was considerably smaller than the theoretical max of ~14B (having to do this is why I added the dropmiss option).

Anyway, I processed it by chunk without much issue, but you're right it should be much slower than doing it at once from another language. This suggests R's data.table is extremely fast, but it also shows pandas might give enough of a speed boost while having an easier Stata interface.

@wbuchanan

@mcaceresb I think there might be an older version of gtools in the FSRDCs computing environment since there isn't anything mentioning dropmiss in the documentation. That said, by chunking columns and dropping records missing all of the variables to be reshaped manually prior to the reshape I was able to get things to work well. If you wouldn't mind sending the current version of gtools to one of the FSRDC staff I can get you connected; they have an SOP that requires the author of a package to send the source to them.

@mcaceresb
Owner Author

Feel free to email me and I can send them the code in whatever format they like. @wbuchanan
