-
These are all good questions that many of us (myself included) have asked and (maybe?) solved differently over the years. You are correct that managing CSV files is not a reasonable solution; a local SSURGO tabular database is well worth the effort. You might consider two options before rolling your own:
Note that you will have to set up and manage queries to these local databases on your own. You will also likely need to add some additional indexing to improve query performance (see the sketch below). I haven't personally tested a "full SSURGO" SQLite database, but I know that it is possible to make one. I'm sure Andrew will have more details on what is currently, or will soon be, possible with local databases.
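For example, a minimal sketch of adding indexes with DBI (the file path and column choices here are assumptions, and some of these may already be covered by the template's own indexes):

```r
# a sketch: add indexes on columns you frequently filter/join on
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.sqlite")  # path is an assumption
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_component_mukey ON component (mukey)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_chorizon_cokey ON chorizon (cokey)")
dbDisconnect(con)
```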
-
Yes, check out this link: https://nrcs.app.box.com/s/w0gtf7ooxqfcd0cgfht5zprcmc8qvu0i (from the SSURGO Portal page). This is a Geopackage of all of SSURGO: a special .sqlite database that has spatial data and other metadata in it. I have tested this with a handful of functions in soilDB that support a SQLite source. It works OK, though a little slower than I would like when run against the whole DB, because some of these queries were never really optimized for this (or at all, for that matter), and several things are just slower in SQLite in general (like string aggregation) or slower on your local machine than on the SDA server. I think running in chunks (sets of mukeys, or areasymbols) is generally still worthwhile even though the dataset is local.
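For instance, a minimal sketch of chunked queries by areasymbol against the local GeoPackage (the file path and chunk size are assumptions):

```r
# a sketch: query the whole-SSURGO GeoPackage in chunks of survey areas
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.gpkg")
areas <- dbGetQuery(con, "SELECT DISTINCT areasymbol FROM legend")$areasymbol
chunks <- split(areas, ceiling(seq_along(areas) / 50))

res <- lapply(chunks, function(a) {
  dbGetQuery(con, sprintf(
    "SELECT l.areasymbol, mu.mukey, mu.muname
       FROM legend l
       INNER JOIN mapunit mu ON mu.lkey = l.lkey
      WHERE l.areasymbol IN (%s)",
    paste0("'", a, "'", collapse = ", ")
  ))
})
res <- do.call(rbind, res)

dbDisconnect(con)
```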
You can use RSQLite/DBI functions directly. For example:

```r
library(RSQLite)

con <- dbConnect(SQLite(), "~/path/to/ssurgo.gpkg")
result <- dbGetQuery(con, "SELECT * FROM mapunit")
dbDisconnect(con)
```

I developed a simple wrapper function that does this source-switching internally for soilDB, BUT have not exported or used it widely yet. Functions in the get/fetch families of soilDB functions apply modifications to the default T-SQL used for SDA and NASIS to make the syntax compatible with the SQLite dialect. This works fairly well where it works, but does not work for every combination of function and aggregation method. Some queries that require e.g. intermediate tables are at this time completely unsupported for SQLite sources. Though, in many cases the method="none" (disaggregated) results can be easily queried and the aggregation then done on the R side rather than in SQL (see the sketch below). Please let me know if you bump up against functionality that needs upgrading; there is a related issue that goes into more detail: #250
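For example, a minimal sketch of the disaggregated-then-aggregate-in-R pattern (the file path is an assumption, and dominant component is just one possible aggregation):

```r
# a sketch: pull disaggregated component records, then aggregate in R
# (here, picking the dominant component per map unit)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo.gpkg")
co <- dbGetQuery(con, "SELECT mukey, cokey, compname, comppct_r FROM component")
dbDisconnect(con)

dominant <- do.call(rbind, lapply(split(co, co$mukey), function(x) {
  x[which.max(x$comppct_r), ]
}))
```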
Building from the downloaded CSVs is possible: you can iterate over them and write tables with the same name as the basename of each file to the SQLite data source (see the sketch below). While that is possible, I would recommend using the SSURGO Portal tool and/or downloading the full SQLite DB linked above. This is primarily because it is our new tool for this purpose, and the relationships and indexes are automatically part of the template.
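If you did want to go the CSV route, a minimal sketch might look like this (the CSV directory and output file name are assumptions; note this will not create the relationships or indexes the template provides):

```r
# a sketch: write each downloaded CSV to a SQLite table named after the file
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ssurgo-tabular.sqlite")
csvs <- list.files("path/to/tabular-csv", pattern = "\\.csv$", full.names = TRUE)

for (f in csvs) {
  dbWriteTable(con, tools::file_path_sans_ext(basename(f)),
               read.csv(f), overwrite = TRUE)
}

dbDisconnect(con)
```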
Either the SSURGO Portal tool or the createSSURGO() function requires that you first download the individual SSAs from Web Soil Survey.
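A minimal sketch of that workflow (the argument names and area symbols here are assumptions; check ?downloadSSURGO and ?createSSURGO for the actual interface in your soilDB version):

```r
# a sketch only; argument names are assumptions, see ?downloadSSURGO / ?createSSURGO
library(soilDB)

# download and extract a few survey areas, then build the local database
downloadSSURGO(areasymbols = c("CA630", "CA649"), destdir = "SSURGO")
createSSURGO("ssurgo-local.sqlite", exdir = "SSURGO")
```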
-
@brownag thank you so much for this detailed response. It has been VERY helpful for me. I meant to respond sooner, but I have just been digging right in since reading all this. I think I have landed on using the "Geopackage of all of SSURGO" that you linked to above as my starting point. I also tried building my own SQLite database from the downloaded CSV files and, you were right, this is MUCH slower (~7x).

While the "Geopackage of all of SSURGO" works well and is fast, it is a MASSIVE file: 112 GB after unzipping. Just the process of unzipping it caused my computer to run out of memory the first time. I would like to benefit from the speed of using this database, but my use case cannot afford to host such a massive file. The only SSURGO tables that I need are: chorizon, chtexture, chtexturegrp, comonth, component, corestrictions, cosoilmoist, legend, mapunit, muaggatt, and just the NCCPI-related rows of cointerp. When I used the CSV files to build a SQLite database out of just these tables, the file size came to ~5 GB, which is MUCH better than 112 GB.

Is there a way to start with the "Geopackage of all of SSURGO", remove all of the tables I don't need, and still retain the speedy connectivity of the remaining tables that I care about? My first attempt at doing this successfully "removed" tables, but the file size did not decrease AT ALL! I'm guessing this process of "removal" just removes the tables from the index without actually deleting the data.
-
I think that you need to "vacuum" the database, or perform some similar cleanup operation, to free unused space and re-build indexes. Something like this; you may have to run it a couple of times:

```r
library(DBI)
db <- dbConnect(RSQLite::SQLite(), db.file)
dbExecute(db, 'VACUUM;')
dbDisconnect(db)
```

Honestly, it would be more efficient to do this kind of work in a real RDBMS vs. a file-based database.
-
That worked! Thank you. There are some "tables" returned by the table listing that I don't recognize. I'm wondering if these have to do with the connections between tables, and are therefore providing some of that speed efficiency in this database. Should I leave these alone in order to preserve that efficiency? Are there any other "tables" that I should not remove in order to avoid inefficiencies or errors during this slimming process?
-
I'd recommend adding additional indexes to those tables for any columns, other than the primary key, that you need to filter on. Creating an index in SQLite must be followed by a 'VACUUM' command. Since you are trying to do a lot with the cointerp table, consider indexing the columns you filter on there (see the sketch below).
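For example, a minimal sketch (the column choices and file path are assumptions):

```r
# a sketch: index cointerp on the columns used for filtering, then VACUUM
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "~/path/to/ssurgo-slim.sqlite")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_cointerp_mrulename ON cointerp (mrulename)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_cointerp_cokey ON cointerp (cokey)")
dbExecute(con, "VACUUM")
dbDisconnect(con)
```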
-
Thanks, guys. I will need to look up what "adding additional indexes" means and how to do it. I'll do some learning...

On another note... am I going crazy, or is the column nationalmusym present in SDA's version of the mapunit table but NOT in the mapunit table of either the Geopackage of all of SSURGO or the downloaded CSV files?? Why is it not in the non-SDA data? I was using it!
-
Okay, so I have successfully taken the Geopackage of all of SSURGO, filtered out the tables I don't use, and run the VACUUM. This does indeed get me back to a database of ~5 GB! The crazy thing is that, when I run some queries using this reduced database, it is ~4 times slower than running those same queries against the Geopackage of all of SSURGO that it came from! So, it does seem like something about speed/efficiency is being lost when I do this.
-
As described in #310 by @dylanbeaudette, "we are now in the strange in-between time where gSSURGO/gNATSGO are out of sync with SSURGO proper (SDA), and it could be a couple of months before gSSURGO and gNATSGO are rebuilt."

This annual "in between" time (as well as other general instability of some of the web services) makes my use of SSURGO very difficult. To address this, I have downloaded the gSSURGO mukey raster and tabular data CSVs from here, as suggested by @dylanbeaudette. Using the mukey raster data from disk rather than via `soilDB::mukey.wcs()` has greatly improved stability and speed on my end. However, I am still reliant on `soilDB::SDA_query()` to access tabular data, as I would have to build out a ton of code to efficiently utilize the downloaded CSV files.

@brownag recently mentioned that "larger/local SSURGO tabular data are reasonably well covered by soilDB methods, e.g. by specifying `dsn` argument and pointing at "snapshot" databases stored locally rather than web services." This is new to me! A few questions: is there a `dsn` argument for `soilDB::SDA_query()`? If I were able to acquire a SQL database for SSURGO on disk, how would I use `soilDB` to access it?
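For reference, the kind of call being described might look like the following sketch, assuming a soilDB version where the get_SDA_* functions accept a dsn argument (the argument names and values here are assumptions; check ?get_SDA_property):

```r
# a hypothetical sketch, assuming dsn support in get_SDA_property()
library(soilDB)

prop <- get_SDA_property(
  property = "claytotal_r",            # example property; a chorizon column
  method = "Weighted Average",
  top_depth = 0, bottom_depth = 30,
  areasymbols = "CA630",
  dsn = "~/path/to/ssurgo.gpkg"        # local snapshot instead of the SDA web service
)
```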