
Error with more than 2^31-1 observations #43

Open
mcaceresb opened this issue Aug 19, 2018 · 15 comments
Comments

@mcaceresb
Owner

A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.

I contacted StataCorp about it and they replied:

> The SPI can work with datasets containing up to 2^31-1 observations. Our development group is looking into modifying future versions of the SPI to allow more observations.

mcaceresb added a commit that referenced this issue Aug 19, 2018
…teger bug

* Closes #43
* Added option to selectively test via gtools, test[]
* Gtools exits with error if `_N > 2^31-1` and points the user to the
  pertinent bug report. The SPI uses long integers (32-bit) instead of
  long long integers (64-bit) for its observation numbers (not to
  mention they are signed integers, but alas). At any rate, this de facto
  limits the number of observations to 2^31-1.
@mcaceresb
Owner Author

Re-opening because technically it hasn't been fixed; it just throws the correct error now.

@mcaceresb mcaceresb reopened this Aug 21, 2018
@miserton

miserton commented Aug 9, 2022

Any update on this bug - is it correctable now?

@mcaceresb
Owner Author

@miserton Not afaik. I would contact StataCorp to ask about updates to this point, since I cannot fix it until they update the SPI. Sorry!

@miserton

I reached out to StataCorp:

> The stplugin.h has been frozen so that we do not make any changes that could break commands that are still using this code. For 64-bit integer support we recommend working with Java or Python code instead of the older 32-bit C plugin limits.

Unless gtools could call the C code through Java or Python, it doesn't look like the 2.1B limitation will be overcome.

@mcaceresb
Owner Author

@miserton That's a bit of a strange answer, since changing stplugin.h in particular would only involve adding a long long type (and on my system long is 64-bit anyway). I always thought the issue was the internals, and that changing the internal functions to take 64-bit input was difficult for some reason. But I don't really know.

In any case, you are right this is unlikely to be resolved any time soon, if ever.

@wbuchanan

@miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

@mcaceresb
Owner Author

@wbuchanan Porting to Java would take quite a bit of effort (partly since I'm not particularly familiar with Java; not sure if there would be a simple thin wrapper to execute C code from Java, but I expect it would still take some work).

The problem is that I cannot modify observations past 2^31-1. The only solution I can think of given the current limitations would be to create multiple separate datasets and append them at the end from Stata.

@miserton

> @miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

I got the answer from Stata technical support ([email protected]).

@wbuchanan

@mcaceresb that is what I tried the most recent time and there were still too many observations. There is something called the Java Native Interface, but I've not used that at all. However, even though it's been a while I'm still pretty familiar with the Java API for Stata. The only other thing that might be feasible would be trying to figure out how to implement things in a Cythonic way and using the Python API for the entry point, but I'm not sure how much performance degradation that would cause.

@mcaceresb
Owner Author

@wbuchanan

> that is what I tried the most recent time and there were still too many observations.

Wait, if k is the maximum number of variables that match a given stub, you can reshape long by looping ceil(_N * k / (2^31-1)) chunks of roughly equal size, no?

@wbuchanan

Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

@miserton

> Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

I was grappling with a similar issue when I ran into the maximum number of variables, needing to reshape a very large dataset and greshape was the only thing I thought might work. What I ultimately did was save the data as a CSV and then use some Bash code to write the equivalent of a reshape. I was going from long to wide in the example, but you can do the same thing going from wide to long:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1689382-how-to-identify-duplicate-combinations-of-observations-in-long-form-without-reshaping

@mcaceresb
Owner Author

mcaceresb commented Jun 21, 2024

@wbuchanan I had a reshape like this: 8 stubs, 144 variables per stub (monthly data over 12 years), a bit under 100M observations. The main thing in my favor was that a TON of the wide data was missing, so the long version was considerably smaller than the theoretical max of ~14B (having to do this is why I added the dropmiss option).

Anyway, I processed it by chunk without much issue, but you're right it should be much slower than doing it at once from another language. This suggests R's data.table is extremely fast, but it also shows pandas might give enough of a speed boost while having an easier Stata interface.

@wbuchanan

@mcaceresb I think there might be an older version of gtools in the FSRDCs computing environment since there isn't anything mentioning dropmiss in the documentation. That said, by chunking columns and dropping records missing all of the variables to be reshaped manually prior to the reshape I was able to get things to work well. If you wouldn't mind sending the current version of gtools to one of the FSRDC staff I can get you connected; they have an SOP that requires the author of a package to send the source to them.

@mcaceresb
Owner Author

Feel free to email me and I can send them the code in whatever format they like. @wbuchanan
