-
These are all good questions that many of us (myself included) have asked and (maybe?) solved differently over the years. You are correct that managing CSV files is not a reasonable solution; a local SSURGO tabular database is well worth the effort. You might consider two options before rolling your own:
Note that you will have to set up and manage queries to these local databases on your own. You will also likely need to add some additional indexing to improve query performance (see the sketch below). I haven't personally tested a "full SSURGO" SQLite database, but I know that it is possible to make one. I'm sure Andrew will have more details on what is currently, or will soon be, possible with local databases.
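For example, a minimal sketch of adding indexes with DBI (the file path and column choices here are assumptions, and some of these may already be covered by the template's own indexes):

```r
# a sketch: add indexes on columns you frequently filter/join on
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.sqlite")  # path is an assumption
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_component_mukey ON component (mukey)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_chorizon_cokey ON chorizon (cokey)")
dbDisconnect(con)
```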
-
Yes, check out this link: https://nrcs.app.box.com/s/w0gtf7ooxqfcd0cgfht5zprcmc8qvu0i (from the SSURGO Portal page). This is a Geopackage of all of SSURGO: a special .sqlite database that has spatial data and other metadata in it. I have tested this with a handful of functions in soilDB that support a SQLite source. It works OK, though a little slower than I would like when run against the whole DB, because some of these queries were never really optimized for this (or at all, for that matter), and several things are just slower in SQLite in general (like string aggregation) or slower on your local machine than on the SDA server. I think running in chunks (sets of mukeys, or areasymbols) is generally still worthwhile even though the dataset is local.
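For instance, a minimal sketch of chunked queries by areasymbol against the local GeoPackage (the file path and chunk size are assumptions):

```r
# a sketch: query the whole-SSURGO GeoPackage in chunks of survey areas
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.gpkg")
areas <- dbGetQuery(con, "SELECT DISTINCT areasymbol FROM legend")$areasymbol
chunks <- split(areas, ceiling(seq_along(areas) / 50))

res <- lapply(chunks, function(a) {
  dbGetQuery(con, sprintf(
    "SELECT l.areasymbol, mu.mukey, mu.muname
       FROM legend l
       INNER JOIN mapunit mu ON mu.lkey = l.lkey
      WHERE l.areasymbol IN (%s)",
    paste0("'", a, "'", collapse = ", ")
  ))
})
res <- do.call(rbind, res)

dbDisconnect(con)
```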
You can use RSQLite/DBI functions directly. For example:

```r
library(RSQLite)

con <- dbConnect(SQLite(), "~/path/to/ssurgo.gpkg")
result <- dbGetQuery(con, "SELECT * FROM mapunit")
dbDisconnect(con)
```

I developed a simple wrapper function that does this source-switching internally for soilDB, BUT have not exported or used it widely yet. Functions in the get/fetch families of soilDB functions apply modifications to the default T-SQL used for SDA and NASIS to make the syntax compatible with the SQLite dialect. This works fairly well where it works, but does not work for every combination of function and aggregation method. Some queries that require e.g. intermediate tables are at this time completely unsupported for SQLite sources. Though, in many cases the method="none" (disaggregated) results can be easily queried and the aggregation then done on the R side rather than in SQL (see the sketch below). Please let me know if you bump up against functionality that needs upgrading; there is a related issue that goes into more detail: #250
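For example, a minimal sketch of the disaggregated-then-aggregate-in-R pattern (the file path is an assumption, and dominant component is just one possible aggregation):

```r
# a sketch: pull disaggregated component records, then aggregate in R
# (here, picking the dominant component per map unit)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.gpkg")
co <- dbGetQuery(con, "SELECT mukey, cokey, compname, comppct_r FROM component")
dbDisconnect(con)

dominant <- do.call(rbind, lapply(split(co, co$mukey), function(x) {
  x[which.max(x$comppct_r), ]
}))
```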
Building from the downloaded CSVs is possible: you can iterate over them and write tables with the same name as the basename of each file to the SQLite data source (see the sketch below). While that is possible, I would recommend using the SSURGO Portal tool and/or downloading the full SQLite DB linked above. This is primarily because it is our new tool for this purpose, and the relationships and indexes are automatically part of the template.
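If you did want to go the CSV route, a minimal sketch might look like this (the CSV directory and output file name are assumptions; note this will not create the relationships or indexes the template provides):

```r
# a sketch: write each downloaded CSV to a SQLite table named after the file
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ssurgo-tabular.sqlite")
csvs <- list.files("path/to/tabular-csv", pattern = "\\.csv$", full.names = TRUE)

for (f in csvs) {
  dbWriteTable(con, tools::file_path_sans_ext(basename(f)),
               read.csv(f), overwrite = TRUE)
}

dbDisconnect(con)
```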
Either the SSURGO Portal tool or the createSSURGO() function requires that you first download the individual SSAs from Web Soil Survey.
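A minimal sketch of that workflow (the argument names and area symbols here are assumptions; check ?downloadSSURGO and ?createSSURGO for the actual interface in your soilDB version):

```r
# a sketch only; argument names are assumptions, see ?downloadSSURGO / ?createSSURGO
library(soilDB)

# download and extract a few survey areas, then build the local database
downloadSSURGO(areasymbols = c("CA630", "CA649"), destdir = "SSURGO")
createSSURGO("ssurgo-local.sqlite", exdir = "SSURGO")
```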
-
@brownag thank you so much for this detailed response. It has been VERY helpful for me. I meant to respond sooner, but I have just been digging right in since reading all this. I think I have landed on using the "Geopackage of all of SSURGO" that you linked to above as my starting point. I also tried building my own SQLite database from the downloaded CSV files and, you were right, this is MUCH slower (~7x).

While the "Geopackage of all of SSURGO" works well and is fast, it is a MASSIVE file: 112 GB after unzipping. Just the process of unzipping it caused my computer to run out of memory the first time. I would like to benefit from the speed of using this database, but my use case cannot afford to host such a massive file. The only SSURGO tables that I need are: chorizon, chtexture, chtexturegrp, comonth, component, corestrictions, cosoilmoist, legend, mapunit, muaggatt, and just the NCCPI-related rows of cointerp. When I used the CSV files to build a SQLite database out of just these tables, the file size came to ~5 GB, which is MUCH better than 112 GB.

Is there a way to start with the "Geopackage of all of SSURGO", remove all of the tables I don't need, and still retain the speedy connectivity of the remaining tables that I care about? My first attempt at doing this successfully "removed" tables, but the file size did not decrease AT ALL! I'm guessing this process of "removal" just removes the tables from the index without actually deleting the data.
-
I think that you need to "vacuum" the database, or perform some similar cleanup operation, to free unused space and re-build indexes. Something like this; you may have to run it a couple of times:

```r
library(DBI)
db <- dbConnect(RSQLite::SQLite(), db.file)
dbExecute(db, 'VACUUM;')
dbDisconnect(db)
```

Honestly, it would be more efficient to do this kind of work in a real RDBMS vs. a file-based database.
-
That worked! Thank you. There are some "tables" returned by the table listing that I don't recognize. I'm wondering if these have to do with the connections between tables, and are therefore providing some of that speed efficiency in this database. Should I leave these alone in order to preserve that efficiency? Are there any other "tables" that I should not remove in order to avoid inefficiencies or errors during this slimming process?
-
I'd recommend adding additional indexes to those tables for any columns, other than the primary key, that you need to filter on. Creating an index in SQLite must be followed by a 'VACUUM' command. Since you are trying to do a lot with the cointerp table, consider indexing the columns you filter on there (see the sketch below).
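For example, a minimal sketch (the column choices and file path are assumptions):

```r
# a sketch: index cointerp on the columns used for filtering, then VACUUM
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo-slim.sqlite")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_cointerp_mrulename ON cointerp (mrulename)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_cointerp_cokey ON cointerp (cokey)")
dbExecute(con, "VACUUM")
dbDisconnect(con)
```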
-
Thanks, guys. I will need to look up what "adding additional indexes" means and how to do it. I'll do some learning...

On another note... am I going crazy, or is the column nationalmusym present in SDA's version of the mapunit table but NOT in the mapunit table of either the Geopackage of all of SSURGO or the downloaded CSV files?? Why is it not in the non-SDA data? I was using it!
-
Okay, so I have successfully taken the Geopackage of all of SSURGO, filtered out the tables I don't use, and run the VACUUM. This does indeed get me back to a database of ~5 GB! The crazy thing is that, when I run some queries using this reduced database, it is ~4 times slower than running those same queries against the Geopackage of all of SSURGO that it came from! So, it does seem like something about speed/efficiency is being lost when I do this.
-
As described in #310 by @dylanbeaudette, "we are now in the strange in-between time where gSSURGO/gNATSGO are out of sync with SSURGO proper (SDA), and it could be a couple of months before gSSURGO and gNATSGO are rebuilt."

This annual "in between" time (as well as other general instability of some of the web services) makes my use of SSURGO very difficult. To address this, I have downloaded the gSSURGO mukey raster and tabular data CSVs from here, as suggested by @dylanbeaudette. Using the mukey raster data from disk rather than via `soilDB::mukey.wcs()` has greatly improved stability and speed on my end. However, I am still reliant on `soilDB::SDA_query()` to access tabular data, as I would have to build out a ton of code to efficiently utilize the downloaded CSV files.

@brownag recently mentioned that "larger/local SSURGO tabular data are reasonably well covered by soilDB methods, e.g. by specifying `dsn` argument and pointing at "snapshot" databases stored locally rather than web services." This is new to me! A few questions: is there a `dsn` argument for `soilDB::SDA_query()`? If I were able to acquire a SQL database for SSURGO on disk, how would I use `soilDB` to access it?
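For reference, the kind of call being described might look like the following sketch, assuming a soilDB version where the get_SDA_* functions accept a dsn argument (the argument names and values here are assumptions; check ?get_SDA_property):

```r
# a hypothetical sketch, assuming dsn support in get_SDA_property()
library(soilDB)

prop <- get_SDA_property(
  property = "claytotal_r",            # example property; a chorizon column
  method = "Weighted Average",
  top_depth = 0, bottom_depth = 30,
  areasymbols = "CA630",
  dsn = "~/path/to/ssurgo.gpkg"        # local snapshot instead of the SDA web service
)
```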