Error with more than 2^31-1 observations #43
Comments
…teger bug * Closes #43 * Added option to selectively test via gtools, test[] * Gtools exits with error if `_N > 2^31-1` and points the user to the pertinent bug report. The SPI uses long integers (32-bit) instead of long long integers (64-bit) for its observation numbers (not to mention they are signed integers, but alas). At any rate, this de facto limits the number of observations to 2^31-1.
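To illustrate why a signed 32-bit observation index caps out at 2^31-1, here is a minimal Python sketch. It is not gtools or SPI code: `SPI_MAX_OBS`, `as_int32`, and `check_obs_count` are hypothetical names used only for illustration, and `ctypes.c_int32` stands in for a 32-bit signed integer.

```python
import ctypes

# Largest value a signed 32-bit integer can hold (hypothetical constant name).
SPI_MAX_OBS = 2**31 - 1


def as_int32(n: int) -> int:
    """Return n as a signed 32-bit integer would store it (wraps silently)."""
    return ctypes.c_int32(n).value


def check_obs_count(n_obs: int) -> None:
    """Hypothetical guard: refuse to run when n_obs cannot be indexed in 32 bits."""
    if n_obs > SPI_MAX_OBS:
        raise OverflowError(
            f"{n_obs} observations exceeds the 32-bit limit of {SPI_MAX_OBS}"
        )
```

One observation past the limit wraps around to a negative index, which is why exiting with an explicit error is safer than letting the plugin run.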
Re-open because technically it hasn't been fixed. It just throws the correct error now.
Any update on this bug - is it correctable now?
@miserton Not afaik. I would contact StataCorp to ask about updates to this point, since I cannot fix it until they update the SPI. Sorry!
I reached out to StataCorp:
Unless gtools could call the C code through Java or Python, it doesn't look like the 2.1B limitation will be overcome.
@miserton That's a bit of a strange answer since changing stplugin.h in particular would only involve adding a long long type (and in my system long is 64-bit anyway). I always thought the issue was the internals, and that changing the internal function to take 64-bit input was difficult for some reason. But I don't really know. In any case, you are right this is unlikely to be resolved any time soon, if ever.
@miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).
@wbuchanan Porting to Java would take quite a bit of effort (partly since I'm not particularly familiar with Java; not sure if there would be a simple thin wrapper to execute C code in Java but I expect it would still take some work). The problem is that I cannot modify observations past
I got the answer from Stata technical support ([email protected]).
@mcaceresb that is what I tried the most recent time and there were still too many observations. There is something called the Java Native Interface, but I've not used that at all. However, even though it's been a while I'm still pretty familiar with the Java API for Stata. The only other thing that might be feasible would be trying to figure out how to implement things in a Cythonic way and using the Python API for the entry point, but I'm not sure how much performance degradation that would cause.
Wait, if
Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).
I was grappling with a similar issue when I ran into the maximum number of variables, needing to reshape a very large dataset and greshape was the only thing I thought might work. What I ultimately did was save the data as a CSV and then use some Bash code to write the equivalent of a reshape. I was going from long to wide in the example, but you can do the same thing going from wide to long:
@wbuchanan I had a reshape like this: 8 stubs, 144 variables per stub (monthly data over 12 years), a bit under 100M observations. The main thing in my favor was that a TON of the wide data was missing, so the long version was considerably smaller than the theoretical max of ~14B (having to do this is why I added the Anyway, I processed it by chunk without much issue, but you're right it should be much slower than doing it at once from another language. This suggests R's data.table is extremely fast, but it also shows pandas might give enough of a speed boost while having an easier Stata interface.
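The chunked approach described above can be sketched roughly in Python. This is only an illustration under stated assumptions, not the code actually used: the wide data is assumed to be a CSV whose columns are named stub plus period (for example `inc1`, `inc2`), missing values are assumed to be empty strings or `.`, and `max_chunk_rows` and `wide_to_long` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Sequence

INT32_MAX = 2**31 - 1  # observation limit discussed in this thread


def max_chunk_rows(periods: int, limit: int = INT32_MAX) -> int:
    """Most wide rows per chunk such that rows * periods stays within limit."""
    return limit // periods


def wide_to_long(
    rows: Iterable[Dict[str, str]],
    id_col: str,
    stubs: Sequence[str],
    periods: Sequence[int],
    missing: Sequence[str] = ("", "."),
) -> Iterator[Dict[str, object]]:
    """Yield one long row per (id, period), skipping periods where every stub is missing."""
    for row in rows:
        for t in periods:
            # Assumed naming convention: wide column = stub name + period number.
            values = [row.get(f"{stub}{t}", "") for stub in stubs]
            if all(v in missing for v in values):
                continue  # nothing observed in this period, so emit no row
            yield {id_col: row[id_col], "period": t, **dict(zip(stubs, values))}


# Small in-memory example; a real run would stream a file in chunks instead.
wide_csv = io.StringIO("id,inc1,inc2,exp1,exp2\n1,10,,5,\n2,,,,\n")
long_rows = list(wide_to_long(csv.DictReader(wide_csv), "id", ["inc", "exp"], [1, 2]))
```

With 144 periods per stub, `max_chunk_rows(144)` gives the largest number of wide rows whose worst-case long output still fits under 2^31-1. Dropping all-missing periods is what keeps the actual long data far below that worst case.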
@mcaceresb I think there might be an older version of gtools in the FSRDCs computing environment since there isn't anything mentioning
Feel free to email me and I can send them the code in whatever format they like. @wbuchanan
A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.
I contacted StataCorp about it and they replied: