New package concept: parametric conversions #1

Open
billdenney opened this issue Oct 16, 2023 · 21 comments

Comments

@billdenney

Related to r-quantities/units#134 (and others)

I'm not sure how to best discuss this. In the end, I don't think that it will be part of the units library, but I would like to engage both @Enchufa2, @edzer, and others to get this solution right. I hope that you think it's okay to have (or at least start) the discussion here.

The issue of units that are not directly convertible comes up often, as evidenced in r-quantities/units#134 and several other issues linking there. It signals a need for a method of keeping different unit conversions separated. A typical example is the mass-to-moles conversion that happens in many fields. For the data I work with (often laboratory measurements of blood tests), other types of conversions can exist, like activity-to-molar conversions (e.g. an activity of 1 mole per hour of X means that the concentration of Y is Z moles/L).

To accomplish this, I think that the best method would be the creation of a new package that would enable the following:

  • Keep different analytes separate
    • For example, "carbon dioxide" is different than "carbon" (quotes are used here and for the rest of this comment to indicate an analyte)
  • Allow conversion between different measurement types of the same analyte
    • For example, 1 mole of "carbon dioxide" is the same as 44.01 grams of "carbon dioxide"
  • Prevent conversion between analytes, unless there is a specific addition of a conversion
    • For example, "carbon dioxide" cannot be converted to "carbon" unless there is a specific conversion factor added
  • It is possible to convert between analytes, when the conversion is specifically added
    • For example, 1 mole of "carbon dioxide" is equal to 1 mole of "carbon"
  • Multiple different sets of conversions can be used within one R session
    • For example, one conversion system may only allow conversion between units for "carbon dioxide" and another may allow conversion between "carbon" and "carbon dioxide" as well because they may both be used for different purposes during one session.
  • Conversions may be more complex than multiplication or division and may use a function instead of a multiplication factor
    • For example, the laboratory measurement of HbA1c (a measurement used to detect diabetes) recently had a conversion created with an offset and conversion factor between units of "percent of hemoglobin that is glycated" and "mmol glycated hemoglobin/total mol hemoglobin". (Editorial note: yes, those seem like they should be convertible by multiplication or division alone, but according to standards defined by international bodies they are not; I think this is because they are actually measured a little differently in the machine.)
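
To make the requirements above concrete, here is a minimal base R sketch of two independent conversion sets living in one session. Everything in it (`new_conversion_set()`, `add_conversion()`, `convert()`, the column names) is hypothetical and only illustrates the idea; it is not an existing API. The strict set only knows conversions within "carbon dioxide", while the permissive set also carries an explicitly added jump to "carbon".

```r
# Hypothetical sketch: each conversion set is a plain data.frame, so several
# sets can coexist in one R session without interfering with each other.
new_conversion_set <- function() {
  data.frame(from_analyte = character(), from_unit = character(),
             to_analyte = character(), to_unit = character(),
             factor = numeric(), stringsAsFactors = FALSE)
}

# Register one conversion; nothing is convertible unless it was added here.
add_conversion <- function(set, from_analyte, from_unit, to_analyte, to_unit, factor) {
  rbind(set, data.frame(from_analyte = from_analyte, from_unit = from_unit,
                        to_analyte = to_analyte, to_unit = to_unit,
                        factor = factor, stringsAsFactors = FALSE))
}

# Convert a value, failing loudly when the set has no matching conversion.
convert <- function(set, value, from_analyte, from_unit, to_analyte, to_unit) {
  hit <- set$from_analyte == from_analyte & set$from_unit == from_unit &
    set$to_analyte == to_analyte & set$to_unit == to_unit
  if (!any(hit)) {
    stop("no conversion registered from ", from_unit, " '", from_analyte,
         "' to ", to_unit, " '", to_analyte, "'")
  }
  value * set$factor[which(hit)[1]]
}

strict <- new_conversion_set()
strict <- add_conversion(strict, "carbon dioxide", "mol", "carbon dioxide", "g", 44.01)
# The permissive set extends the strict one with a cross-analyte conversion.
permissive <- add_conversion(strict, "carbon dioxide", "mol", "carbon", "mol", 1)

co2_grams <- convert(strict, 2, "carbon dioxide", "mol", "carbon dioxide", "g")
carbon_moles <- convert(permissive, 2, "carbon dioxide", "mol", "carbon", "mol")
# The strict set refuses the cross-analyte jump.
strict_refuses <- inherits(
  try(convert(strict, 2, "carbon dioxide", "mol", "carbon", "mol"), silent = TRUE),
  "try-error"
)
```

Passing the set explicitly is one possible design choice; a real package might instead keep a registry internally and select the active set by name.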

Are there other features that should be supported?

@edzer
Member

edzer commented Oct 16, 2023

Cc @henningte

@Enchufa2
Member

Thanks for spawning this, @billdenney. Currently,

  • Different XMLs could be defined and maintained for different analytes. These could be part of a separate package (let's call it units.chem), and then loaded using units::load_units_xml (which would be automatically called by something like units.chem::load_units("carbon dioxide")). This addresses points 1-2, and partially 3-5 (i.e. by converting to moles and then switching to another system).
  • Complex conversions ($a + bx$) are already possible. That's what Celsius and Fahrenheit do.
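
As an illustration of the $a + bx$ form mentioned in the last point, the following base R sketch builds an affine conversion and its inverse. The helper name `affine_conversion()` is made up for this example and is not part of the units package; Celsius to Fahrenheit is used because it is the case named above.

```r
# Hypothetical helper: an affine conversion y = intercept + slope * x,
# returned together with its inverse x = (y - intercept) / slope.
affine_conversion <- function(slope, intercept = 0) {
  force(slope)
  force(intercept)
  list(
    forward = function(x) intercept + slope * x,
    inverse = function(y) (y - intercept) / slope
  )
}

# Celsius to Fahrenheit: F = 32 + (9 / 5) * C
celsius_to_fahrenheit <- affine_conversion(slope = 9 / 5, intercept = 32)

boiling_f <- celsius_to_fahrenheit$forward(100)
freezing_c <- celsius_to_fahrenheit$inverse(32)
```

A purely multiplicative conversion is the special case with `intercept = 0`.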

So if loading several units systems at the same time is not a requirement, I think that the current set of features of the units package are enough to implement such a set of requirements. Otherwise, a bit more work will be needed.

But regardless of this, a separate package to implement such XMLs would be a nice starting point. We can host it here, of course.

@billdenney
Author

Thanks for the quick thoughts.

I think that loading multiple unit systems at the same time is a requirement (at least it is for my use case of laboratory measurements). My initial thought of the implementation would look something like the following:

  • A units.systems package that gives the general support of having multiple systems in place simultaneously and then other packages (e.g. units.chem) that would support specific unit systems.
  • It would be possible to load more than one of the dependent packages simultaneously, so it would not be a problem to load units.chem and units.clinical at the same time. I think that it would be best if they both register their units with units.systems and either
    • units.systems checks for conflicts or
    • units.systems maintains package-specific unit libraries.
  • The units.systems package would look like the units library to users in that they would use the same generic functions, for the most part.

Beneath the surface, I was not thinking of trying to load XML. The way that I understand both UDUNITS and units, that would not allow multiple simultaneous systems to be loaded. So, we would not be able to have both "carbon dioxide" and "carbon" loaded at the same time. For my clinical laboratory example, I have a project now that needs to have 515 different systems (that is the largest example that I've ever had, but about 100 different systems at a time is normal).

My thought was that the unit system table would be a data.frame that looks something like the following:

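| system                        | unit_common | unit_alternate | slope     | intercept |
|-------------------------------|-------------|----------------|-----------|-----------|
| carbon_dioxide                | mol         | g              | 44.01     | 0         |
| carbon                        | mol         | g              | 12.01     | 0         |
| gamma_glutamyl_transpeptidase | U           | ukat           | 0.01667   | 0         |
| sodium                        | mol         | g              | 22.989769 | 0         |
| sodium                        | mol         | Eq             | 1         | 0         |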
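
A minimal sketch of how such a table could be used for a single-step conversion in base R. The function `convert_direct()` is hypothetical, and the sketch assumes the convention `unit_alternate = intercept + slope * unit_common`, which matches the mass-to-mole rows but is an assumption about the intended meaning of the columns.

```r
# The conversion table from the comment, as a data.frame.
conversions <- data.frame(
  system = c("carbon_dioxide", "carbon", "gamma_glutamyl_transpeptidase",
             "sodium", "sodium"),
  unit_common = c("mol", "mol", "U", "mol", "mol"),
  unit_alternate = c("g", "g", "ukat", "g", "Eq"),
  slope = c(44.01, 12.01, 0.01667, 22.989769, 1),
  intercept = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert within one system using a single table row,
# applied forwards (common -> alternate) or backwards (alternate -> common).
convert_direct <- function(value, system, from, to, table = conversions) {
  rows <- table[table$system == system, ]
  forward <- rows$unit_common == from & rows$unit_alternate == to
  backward <- rows$unit_alternate == from & rows$unit_common == to
  if (any(forward)) {
    r <- rows[which(forward)[1], ]
    return(r$intercept + r$slope * value)
  }
  if (any(backward)) {
    r <- rows[which(backward)[1], ]
    return((value - r$intercept) / r$slope)
  }
  stop("no direct conversion from ", from, " to ", to, " for ", system)
}

co2_g <- convert_direct(2, "carbon_dioxide", "mol", "g")
carbon_mol <- convert_direct(12.01, "carbon", "g", "mol")
```

Because the lookup is keyed on `system`, the two "mol to g" rows never collide, which is the separation the table is meant to provide.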

There are a couple of notable problems that will immediately show up:

  • For carbon_dioxide and carbon, they both have mass-to-mole conversions. That's the primary reason for the package.
  • For gamma_glutamyl_transpeptidase, both "U" and "ukat" are undefined in the base units package. It would be up to the package owner to make sure that the units were defined (e.g. that would happen in the units.clinical package-- maybe with some helper functions from units.systems).
  • The way that gamma_glutamyl_transpeptidase units are actually given are "U/L" and "ukat/L". units.systems would need to be able to pull out the "U" and "ukat" part of the unit to try conversion.
  • sodium may have three types of units: "mol", "g", and "Eq". To convert from "g" to "Eq" it would be necessary to find that "g" can convert to "mol" and "mol" can convert to "Eq". That would require more complex searching through the units available for conversion. (In practice, this is a directed, acyclic graph. Hopefully, implementation could be simpler than writing full DAG search into the package.)
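
The multi-step case in the last point (sodium "g" to "Eq" via "mol") can be sketched with a small breadth-first search in base R. The function `convert_chained()` is hypothetical. As simplifying assumptions, it treats every table row as usable in both directions and ignores intercepts, so it only covers purely multiplicative conversions.

```r
# Sodium rows from the table: mol <-> g and mol <-> Eq.
sodium <- data.frame(
  unit_common = c("mol", "mol"),
  unit_alternate = c("g", "Eq"),
  slope = c(22.989769, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find a chain of conversions between two units.
convert_chained <- function(value, from, to, table) {
  # Each row can be walked forwards (multiply) or backwards (divide).
  edges <- rbind(
    data.frame(from = table$unit_common, to = table$unit_alternate,
               factor = table$slope, stringsAsFactors = FALSE),
    data.frame(from = table$unit_alternate, to = table$unit_common,
               factor = 1 / table$slope, stringsAsFactors = FALSE)
  )
  # Breadth-first search; `reached` holds the cumulative factor per unit.
  reached <- c(1)
  names(reached) <- from
  queue <- from
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current == to) {
      return(value * reached[[to]])
    }
    out <- edges[edges$from == current & !(edges$to %in% names(reached)), ]
    for (i in seq_len(nrow(out))) {
      reached[out$to[i]] <- reached[[current]] * out$factor[i]
      queue <- c(queue, out$to[i])
    }
  }
  stop("no conversion path from ", from, " to ", to)
}

sodium_eq <- convert_chained(22.989769, "g", "Eq", sodium)
```

For tables of this size a plain queue is enough; no graph library is needed.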

Thanks for the offer to host it within r-quantities!

@edzer
Member

edzer commented Oct 16, 2023

Calling such packages units.xxx suggests they are about units. I think they are not, but rather about conversion constants that depend on the characteristics of the matter the unit refers to. Thinking for instance along the lines of BIPM SI, it seems there are no real "units for domain xxx". Could you think of (a) more descriptive package name(s)?

@billdenney
Author

That's a good point that they are not units like the BIPM SI units. They are more accurately described as unit conversions. I suggested the names of units.xxx to clearly link to this package.

When I just did a brief look at the list of R packages, I think that looking to highlight "unit conversion" would likely get people to the right place. So, perhaps unitconv.xxx for the specific packages and unitconv.systems for the support package? (I don't have a strong opinion here; my only goal is to make it easy for people to find the packages.)

@edzer
Member

edzer commented Oct 16, 2023

I think that is a good idea; the link to "units" will be clear from the package dependency and from its source repository being in this GH organisation.

@Enchufa2
Member

The basic support for several units systems must be provided by the units package. Otherwise, such a "systems" package would basically need to reimplement units. So "systems" may or may not be required.

To try to shape what's needed, I would need a short but comprehensive example of the set of conversions that would be defined for a couple of analytes, as well as the set of operations (within and across analytes).

@Enchufa2
Member

Because, @billdenney , when you say "515 units systems", you really mean 515 conversions between some form of unitless parametric quantity, such as moles, to some unit such as grams, right? Because that's not really a units system. ;)

@Enchufa2
Member

Also (I'm re-reading the previous discussion in r-quantities/units#134): it would be helpful to know your current workflow and what changed or what prompted you to raise this proposal. Because (I'm just thinking out loud here) if your current workflow works but it requires e.g. complicated parsing, that could just be abstracted in a separate package. But if you hit some fundamental limitation, then it would be helpful to know it.

@billdenney
Author

@Enchufa2, I was thinking that the "systems" package would do two main things (described below) to accomplish the list of goals above, and thereby prevent the rework needed to implement the jumps between analytes.

I have been using the word "systems" because they are multiple disconnected sets of conversions, each of which is a "system". It seems like the word "system" is causing issues for this discussion, and I'm happy to choose a different word, but I don't know what that better word would be.

My specific, typical workflow is that I receive data with three columns (among many more) where one column is the analyte (e.g. "LDL cholesterol" or "sodium"), one column is the measurement value as a number, and one column is the units. Some of the units may be the same when analytes are different (e.g. "150 mg/dL LDL cholesterol" and "130 mg/dL sodium", both have units of "mg/dL"). I then need to convert both to standard units of "mmol/L", but the conversion for each is different:

  • 1 mmol/L LDL cholesterol = 38.66976 mg/dL
  • 1 mmol/L sodium = 2.2989769 mg/dL
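
The two factors above follow from the molar mass: 1 mmol/L equals M mg/L, which is M / 10 mg/dL, where M is the molar mass in g/mol. A small base R check of that arithmetic, using the example values from the previous paragraph (the function name is made up for illustration):

```r
# mg/dL per mmol/L for a substance with molar mass M (g/mol):
# 1 mmol/L = M mg/L = M / 10 mg/dL
mg_dl_per_mmol_l <- function(molar_mass_g_per_mol) molar_mass_g_per_mol / 10

sodium_factor <- mg_dl_per_mmol_l(22.989769)

# Convert the example values from mg/dL to mmol/L by dividing by the factor.
sodium_mmol_l <- 130 / sodium_factor
ldl_mmol_l <- 150 / 38.66976  # factor for LDL cholesterol as quoted in the list
```

The same numeric value in mg/dL therefore maps to very different molar concentrations depending on the analyte.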

My current workflow is always one-off. Nothing has really changed other than the fact that I had some more thoughts about how to generalize the solution better than I previously had considered. My workflow to standardize the units for a measurement is that I look at the dataset and make individual case_when() calls like the following to set the values and units:

library(tidyverse)
library(assertr)

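# `data` is assumed to have columns analyte, value, and unit
data %>%
  mutate(
    # Convert the numeric value; anything not listed is left unchanged
    value =
      case_when(
        analyte == "sodium" & unit == "mg/dL" ~ value / 2.2989769,
        analyte == "LDL cholesterol" & unit == "mg/dL" ~ value / 38.66976,
        TRUE ~ value
      ),
    # Update the unit label to match; the fallback must be `unit`, not `value`
    unit =
      case_when(
        analyte == "sodium" & unit == "mg/dL" ~ "mmol/L",
        analyte == "LDL cholesterol" & unit == "mg/dL" ~ "mmol/L",
        TRUE ~ unit
      )
  ) %>%
  # Verify that each analyte ends up with exactly one unit
  group_by(analyte) %>%
  mutate(
    unit_count = length(unique(unit))
  ) %>%
  ungroup() %>%
  verify(unit_count == 1)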

I then will relatively often use one measurement type and combine it with another. For example, I may have the concentration of sodium (mg/mL) in urine and the total urine volume for the day (mL/24 h), multiply them together, and then convert the units to mmol/day. And I may have many analytes in the urine where I want to do this (sodium, potassium, glucose, albumin [a protein], etc.).
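
A worked version of that calculation in base R, for two of the analytes. The concentrations and the urine volume are made-up illustration values, not data from the discussion; the molar masses are standard values (the molar mass in g/mol is numerically equal to mg/mmol).

```r
# Hypothetical urine measurements (illustration values only)
conc_mg_per_ml <- c(sodium = 3, potassium = 2)
urine_volume_ml_per_day <- 1500

# Molar masses in g/mol, numerically equal to mg/mmol
mg_per_mmol <- c(sodium = 22.989769, potassium = 39.0983)

# (mg/mL) * (mL/day) = mg/day
excretion_mg_per_day <- conc_mg_per_ml * urine_volume_ml_per_day

# (mg/day) / (mg/mmol) = mmol/day; names keep each analyte with its own factor
excretion_mmol_per_day <- excretion_mg_per_day / mg_per_mmol[names(excretion_mg_per_day)]
```

Indexing the factors by analyte name is what a parametric-unit package would automate; done by hand, it is easy to divide by the wrong molar mass.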

The unitconv.systems (or whatever it is called) would:

  1. Enable creation of subclasses for each "system" (or whatever we want to call it) so that "LDL cholesterol" does not accidentally become "sodium". There would be support functions for creating a class called "unitconv.systems LDL cholesterol" and "unitconv.systems sodium", etc. (The class names would look messy similar to those shown here-- not like typical class names because they would store the analyte name without modification to prevent accidental collisions.) These subclasses would be storable in a single vector, similar to the mixed_units class in the units package.
  2. Find the way to jump across SI unit boundaries (or any defined unit boundaries, e.g. the "ukat" example above) to allow conversion from mass to moles, for example. And also, find the way to jump between analytes (the "carbon dioxide" to "carbon" example above.) This would happen by the (simplified) directed acyclic graph search described above.

I was suggesting that these two features would exist in a separate package because they could be shared across many inherited packages (e.g. unitconv.clinical, unitconv.chem, etc.).

(I think that I covered everything you just asked for, but if I missed something, please let me know.)

@Enchufa2 changed the title from "New package concept: Multiple unit systems" to "New package concept: parametric conversions" on Oct 17, 2023
@Enchufa2
Member

Enchufa2 commented Oct 17, 2023

I think that the main obstacle to this discussion is that we are merging low-level concepts and implementation details with requirements, and as a result we are going in circles here. Please forget for now about systems, boundaries, vectors, subclasses, and acyclic graphs, and let's talk about the workflow, the high-level interface. The code above is what you do now, so let's define what a better workflow should look like. Then we can assess what's available and what's missing, and what would be the best implementation.

Also: I've changed the title because, if we want to generalize this, I think we should be talking about parametric conversions instead of systems. Correct me please if I'm wrong, but every single conversion you are dealing with involves some kind of parametric unit, such as the mole, that requires a different parametrization (e.g. mol/g) for different substances.

@billdenney
Author

Good point about starting with the requirements. Thanks.

My high-level workflow is:

  • Load a data.frame or tibble (typically from a .csv or .sas7bdat file)
    • The loaded data will have three columns defining an analyte (e.g. "sodium"), measurement value (the number), and the units as text (e.g. "mg/dL"). There will be many other columns unrelated to the unit operations.
    • There will usually be many different analytes in the loaded data.frame.
  • I then need to convert the values/units from the source values/units to preferred values/units. (The preferred units are usually but not always SI units.) The conversion is different for different analytes. The conversion needs to be able to jump between different types of units (e.g. grams to moles or mass per volume to moles per volume concentrations).
    • I need to be able to store these unit conversions for use between different projects.
      • For example, the conversion between sodium mass and moles is always the same. And, the preferred units to use for sodium will usually be "mmol/L" regardless of the input units. (The preferred units to use will not always be the same between different projects, though.)
  • I then need to be able to perform other unit-based operations using normal units methods such as multiplying concentrations (e.g. moles per volume) times volume to get the total amount (e.g. moles).
    • I may also want to convert these final units to a different parametric unit (e.g. convert the moles calculated to grams).
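
One way to picture this workflow without any new package is a reusable conversion table joined onto the data, sketched below in base R. The function `standardize_units()` and the column names `preferred_unit` and `divisor` are hypothetical; the factors are the ones quoted earlier in the thread, and the data rows are illustration values.

```r
# Loaded data: analyte, value, unit (other columns omitted)
lab <- data.frame(
  analyte = c("sodium", "LDL cholesterol", "sodium"),
  value = c(130, 150, 140),
  unit = c("mg/dL", "mg/dL", "mmol/L"),
  stringsAsFactors = FALSE
)

# Reusable table: source unit, preferred unit, and the divisor to get there
conversions <- data.frame(
  analyte = c("sodium", "LDL cholesterol"),
  unit = c("mg/dL", "mg/dL"),
  preferred_unit = c("mmol/L", "mmol/L"),
  divisor = c(2.2989769, 38.66976),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert every row that has a matching table entry.
standardize_units <- function(data, conversions) {
  data$row_id <- seq_len(nrow(data))  # remember the input order
  out <- merge(data, conversions, by = c("analyte", "unit"), all.x = TRUE)
  out <- out[order(out$row_id), ]
  needs_conversion <- !is.na(out$divisor)
  out$value[needs_conversion] <- out$value[needs_conversion] / out$divisor[needs_conversion]
  out$unit[needs_conversion] <- out$preferred_unit[needs_conversion]
  # Fail loudly if any analyte still has more than one unit
  units_per_analyte <- tapply(out$unit, out$analyte, function(u) length(unique(u)))
  stopifnot(all(units_per_analyte == 1))
  rownames(out) <- NULL
  out[, c("analyte", "value", "unit")]
}

standardized <- standardize_units(lab, conversions)
```

The conversion table can be saved and shared between projects, while the preferred unit stays a per-project column that can be edited.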

Does that clarify the workflow sufficiently?

And yes, "parametric units" are what I'm talking about throughout this discussion. Thanks for helping clarify the terminology.

@Enchufa2
Member

Enchufa2 commented Oct 17, 2023

Thanks for the clear specification of the workflow. Let's say that such columns are analyte, value, unit. What would you expect your code to look like? Maybe something along the following lines? (Don't take the new function names too seriously for now, I'm brainstorming).

library(<new package>)

set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_substance_units(analyte, src, new_unit))

EDIT: A more specific proposal, maybe more in-line with the units workflow:

library(substances)

load_substances_df(<data frame of parametric conversions>)

df |>
  mutate(src = set_substances(mixed_units(value, unit), analyte)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_units(src, new_unit, analyte)) # analyte here is optional

@billdenney
Author

(Thank you for the edit to use the units workflow, when possible. I was thinking that we should use units generic functions whenever feasible, too.)

I would hope that the code would be a little simpler than what you suggest:

library(<new package>)

# This would not be required if using the specific package (the xxx.clinical package would already
# have these conversions built-in), but it would be required if using the general package
set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_units(src, new_unit))

I dropped the "analyte" from the second call to set_substance_units() because it would already be contained within src.

Another use would be standardize_substance_units() which would choose the unit that is indicated as "typical" during the call to set_substance_conversions(). So, a simpler workflow could be:

library(<new package>)

set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(dst = standardize_substance_units(src))

The methods would also need accessors to the attributes:

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(dst = standardize_substance_units(src)) |>
  mutate(
    dst_value = as.numeric(dst),
    dst_unit = as.character(units(dst))
  )

The above does not cover the conversion between analytes (e.g. "1 mole carbon dioxide" = "1 mole carbon" and "1 mole carbon dioxide" = "2 moles oxygen"), but that is a much bigger lift to get right and maybe it should not be included at this time. This suggestion is not simply interface-bloat; I do use that type of conversion between analytes.

My specific use case for needing to convert between analytes is a medicine and its metabolite. I need to calculate the amount of a medicine that comes out in urine. I receive data like: "10 mg simvastatin" was dosed, and we measured "10 ng/mL beta-hydroxy simvastatin" in 500 mL of urine. What fraction of the dose came out in urine as "beta-hydroxy simvastatin"? The process is

  • Convert "10 mg simvastatin" to "X moles simvastatin";
  • Multiply "10 ng/mL beta-hydroxy simvastatin" times "500 mL" = "5000 ng beta-hydroxy simvastatin"
  • Convert "5000 ng beta-hydroxy simvastatin" to "Y moles beta-hydroxy simvastatin"
  • Calculate the molar fraction that comes out in urine as Y/X
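
The four steps can be followed as plain arithmetic in base R. The molar masses below are assumptions added for illustration and do not come from the thread: about 418.57 g/mol for simvastatin, and about 436.58 g/mol for the metabolite if it is taken to be the hydroxy acid form. They should be replaced with verified values for real work.

```r
# Assumed molar masses in g/mol (illustration only, verify before use)
simvastatin_g_per_mol <- 418.57
metabolite_g_per_mol <- 436.58  # assumes the hydroxy acid form of the metabolite

# Step 1: 10 mg simvastatin -> X mol simvastatin
dose_mol <- 10e-3 / simvastatin_g_per_mol

# Step 2: 10 ng/mL * 500 mL -> ng of metabolite in urine
metabolite_ng <- 10 * 500

# Step 3: ng metabolite -> Y mol metabolite
metabolite_mol <- metabolite_ng * 1e-9 / metabolite_g_per_mol

# Step 4: molar fraction of the dose recovered in urine, Y / X
fraction_excreted <- metabolite_mol / dose_mol
```

The final division is only meaningful because both quantities are in moles; this is the cross-substance step that a mass-based comparison would get slightly wrong.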

I think that would need another method like

set_between_substance_conversions(<data frame of parametric conversions between substances>)

@Enchufa2 transferred this issue from r-quantities/units on Oct 17, 2023
@Enchufa2
Member

We have a nice initial specification, so I have transferred the issue to a separate repo (thoughts on the name?). The MVP would be conversions within the same substance, so I'll try to address that first. We can iterate and address cross-substance conversions later.

@billdenney
Author

I like the name. Let's keep it! :) (I'm also happy to entertain other names; I can't think of a better one right now.)

I agree that cross-substance conversions can come later. I wanted to make sure that they were considered throughout the process so that we don't end up making an API that can't work with the concept.

The other thing that we should ensure is that we keep the naming as consistent as feasible. In the drafting discussion, we used several terms. I think that the terms we settled on are:

  • "parametric units" describe the link between a substance and the unit conversions that are specific to that substance.
  • "substance" is the term for something that has a different set of unit conversions.

@billdenney
Author

I'm tagging several people who had similar questions to the proposal above in case they have additional ideas that may help:

@ilikegitlab

so now i'm Cc i feel obliged to say something about my workflow:
I'm mainly working with gas at the moment. Units are mol fractions, concentrations and partial pressures. So not only do I want to go from mols to grams, but maybe also to partial pressure (which luckily is often the same at sea level), and for water we can also go from mol fraction to a percentage (relative humidity), and for isotopes we use permil (which surprisingly is not even defined in units). I have no problem doing conversions manually, but the consequence is that I need to have mol per mol in my data or things become confusing. More annoyingly, the mol of air is not even well defined, because sometimes I may have used "different" air, with a different gram/mol.

I'm not sure I would really need another package. What would it bring me other than a simpler way of converting (it's not that often that I need to do it)? I also had a dataset recently where I had m^2/m^2, but still two very different areas, so the problem is not exclusive to substances.

@Enchufa2
Member

Enchufa2 commented Oct 17, 2023

My aim here is specifically to work around the problem of "counts of things" that have a translation to SI units (typically mass, but could be e.g. electric charge). The relationship of those parametric units, as defined by Johansson, to the SI units depends on the things being considered, and this is why I called this substances for now.

Fractions of the same units are a different beast... When you define a parametric conversion (the aim here), you are referring to a single thing (e.g. g/mol or mol/g of atoms of oxygen). But when you define a fraction like g/g, you are referring to two different things. And unfortunately this is much much harder to handle based on a system like UDUNITS (or any units systems out there that I know for that matter, because not even the SI takes these things into consideration). I'll keep that in mind, but in principle, this is not the goal of this package proposal.

BTW, @ilikegitlab, could you please comment further on that example of m^2/m^2?

@henningte
Contributor

henningte commented Oct 17, 2023

I think it is a good idea to collect conversions for some often-used parametric units in a dedicated package. I'm not yet sure if I fully understand the scope and the sketched implementation of the planned package, and the main reason certainly is that I'm not too familiar with formal definitions of unit systems or parametric units.

In any case, I think that defining a naming scheme for the substances and limiting the scope of considered substances are important tasks because otherwise things may get too complicated if one considers the diversity of chemical substances alone (e.g. the same compound (same chemical formula and bonds) but with different charges, etc.). This is another reason why I'm unsure about the scope.

One problem that came to my mind was how to avoid automatic conversion if one installs conversions like mol of CO2 to mol of O and mol of H2O to mol of O. Wouldn't it in such a case be possible to create ambiguities, e.g. that it is possible to compute 2 mol of CO2 + 8 mol of H2O = 12 mol of O, which may be desired behavior in some cases, but not in others?

I noticed that @billdenney said this may be too complicated to consider in a first sketch of the package, but perhaps similar things could happen in other contexts (e.g. conversion of mols of hydrated compounds to mols of water).

I'm not sure at all how likely such things are, but if they are, it may be better to force explicit unit conversion. For example, one could do something like this (i.e., provide a table/list conversion_constants in the planned package from which one can access conversion constants explicitly, but these constants are not installed via the 'units' package):

library(units)
#> udunits database from C:/Users/henni/AppData/Local/R/win-library/4.3/units/share/udunits/udunits2.xml

# units which need to be installed
install_unit("mol_CO2_")
install_unit("mol_water_") # I got an error that the unit is not defined with install_unit("mol_H2O_")
install_unit("mol_O")

# example for a conversion table holding conversion constants
conversion_constants <- 
  list(
    CO2 =
      list(O = units::set_units(2, mol_O/mol_CO2_)),
    H2O =
      list(O = units::set_units(1, mol_O/mol_water_))
  )

# Then nonsense is avoided by default:
units::set_units(1, mol_CO2_) + units::set_units(1, mol_water_)
#> Error: cannot convert mol_water_ into mol_CO2_

# But you can make the conversions explicitly (this can certainly be simplified)
units::set_units(1, mol_CO2_) * conversion_constants$CO2$O + units::set_units(1, mol_water_) * conversion_constants$H2O$O
#> 3 [mol_O]

Created on 2023-10-17 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 11 x64 (build 22621)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-10-17
#>  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
#>  htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
#>  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.24    2023-08-14 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
#>  units       * 0.8-3   2023-09-06 [1] Github (billdenney/units@d57f54d)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.1)
#>  xfun          0.40    2023-08-09 [1] CRAN (R 4.3.1)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/henni/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Just a first thought from my side, in case it is useful.

@ilikegitlab

@Enchufa2: The aims sound sensible, but I wonder: in the case of grams of oxygen per m^3 of air, or mols of sugar per kg of water (osmolality), would I not also essentially be referring to two different things?

as for m^2/m^2 (or g/kg) this comes up in allometric relationships describing ratios of body parts of plants and animals. You are right they are two different things. Although it may make sense to simplify it to [1], in practice this breaks current math with udunits because:

(g leaf)/(g plant) * (m^2 leaf)/(g leaf) = (m^2 leaf)/(g plant)
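
The cancellation in that product can be sketched in base R by representing a unit as a named vector of exponents, where the labels keep "g leaf" and "g plant" distinct. This is a toy representation invented for illustration (the function `multiply_units()` is hypothetical) and is not how udunits or the units package store units.

```r
# Toy representation: a unit is a named numeric vector of exponents.
# Multiplying two units adds exponents label by label and drops zeros.
multiply_units <- function(a, b) {
  labels <- union(names(a), names(b))
  exponents <- vapply(labels, function(label) {
    exponent_a <- if (label %in% names(a)) a[[label]] else 0
    exponent_b <- if (label %in% names(b)) b[[label]] else 0
    exponent_a + exponent_b
  }, numeric(1))
  exponents[exponents != 0]
}

leaf_mass_fraction <- c("g leaf" = 1, "g plant" = -1)  # (g leaf)/(g plant)
specific_leaf_area <- c("m^2 leaf" = 1, "g leaf" = -1) # (m^2 leaf)/(g leaf)

# Only "g leaf" cancels, leaving (m^2 leaf)/(g plant)
leaf_area_ratio <- multiply_units(leaf_mass_fraction, specific_leaf_area)

# Without labels, g/g has already collapsed to [1] (an empty exponent vector),
# and the product wrongly stays at m^2/g
unlabelled <- multiply_units(c(g = 1)[0], c("m^2" = 1, g = -1))
```

The unlabelled case shows why simplifying the mass fraction to [1] up front breaks the later arithmetic.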

I agree one could go through the trouble of redefining a gleaf and gplant unit, but care should be taken not to make things too rigid or complex because then many people may just drop units at the earliest convenience (I admit I found myself wanting to write a dispense_units(math, reapply="units") method at some point, which I still have managed to avoid!)
