
Commit

Fixed latitude weighting in losses and metrics.
geronimocharlie committed Jan 8, 2024
2 parents a709661 + f6291ff commit 6506153
Showing 159 changed files with 18,679 additions and 8,099 deletions.
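For context: on a regular latitude-longitude grid, cells shrink toward the poles, so losses and metrics are typically area-weighted by the cosine of latitude. Below is a minimal sketch of a cosine-latitude-weighted RMSE; the function name and array layout are hypothetical, and the actual loss/metric code changed by this commit is not shown in this excerpt.

```python
import numpy as np

def lat_weighted_rmse(pred: np.ndarray, target: np.ndarray, lats: np.ndarray) -> float:
    """RMSE over a (lat, lon) grid, weighting each latitude row by the
    cosine of its latitude so near-polar cells do not dominate the error."""
    weights = np.cos(np.deg2rad(lats))   # one weight per latitude row, lats in degrees
    weights = weights / weights.mean()   # normalize weights to mean 1
    sq_err = (pred - target) ** 2        # squared error, shape (lat, lon)
    return float(np.sqrt((sq_err * weights[:, None]).mean()))
```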
22 changes: 0 additions & 22 deletions .circleci/config.yml

This file was deleted.

10 changes: 10 additions & 0 deletions .github/workflows/diagram.yml
@@ -1,4 +1,8 @@
<<<<<<< HEAD
name: create diagram
=======
name: Create diagram
>>>>>>> f6291ff9afa023b7888456d700119b5c25816b48
on:
workflow_dispatch: {}
push:
@@ -13,5 +17,11 @@ jobs:
- name: Update diagram
uses: githubocto/repo-visualizer@main
with:
<<<<<<< HEAD
excluded_paths: "ignore,.github,causal, data, prepare_data.py, requirements.txt, test.txt, tests.py, .circleci"

=======
max_depth: 7
excluded_globs: "frontend/*.spec.js;**/*.{png,jpg};**/!(*.module).ts/**/*.{txt,md}"
excluded_paths: ".esg, .circleci, emulator/logs, ignore,.github, causal, causalpaca2, data_building, deprecated, env39, notebooks, diagram.svg, download_cliateset.sh, requirements.txt, requirements2.txt, requirements37.txt, requirements_data.txt, setup.sh, tests.py"
>>>>>>> f6291ff9afa023b7888456d700119b5c25816b48
20 changes: 20 additions & 0 deletions .gitignore
100644 → 100755
@@ -1,3 +1,22 @@
# All data files
*.nc
RAW_DATA
PROCESSED_DATA
tmp/*
logs/*
*.out
*.nc
internal
data
pretrained_models
Climateset_DATA
# deprecated files
deprecated/
notes/

# privacy files
data_generation/parameters/credentials.txt

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -106,6 +125,7 @@ celerybeat.pid
.env
.venv
env/
env39/
venv/
ENV/
env.bak/
25 changes: 18 additions & 7 deletions README.md
100644 → 100755
@@ -1,11 +1,22 @@
# causalpaca
Creating an Ensemble Climate Emulator that can incorporate causality.
# ***ClimateSet***: A Large-Scale Climate Model Dataset for Machine Learning

## Best coding practices
Here are few best practices to keep in mind. It will help us to maintain a more consistent code and an overall better code :).
## Official implementation for the data downloader & processor

- __Git__: Try to make small commits with meaningful messages. When adding a new functionnality, make a pull request and add a short description. You can also assign the revision to other contributors.
Abstract: *Climate models have been key for assessing the impact of climate change and simulating future climate scenarios depending on humanity’s socioeconomic choices.

- __Continuous integration__: When pushing your code, CircleCI routines will make sure that you follow the PEP8 guidelines and you can also run unit tests from `tests.py`. It is possible to change CircleCI configurations in `.circleci/config.yml`. To see the results: `https://app.circleci.com/`.
The machine learning (ML) community has taken an increased interest in supporting climate scientists’ efforts on various tasks such as climate emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and the ML communities have communicated that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the CMIP6 and Input4MIPs archives. In addition, we provide a modular dataset pipeline for retrieving and pre-processing additional climate models and scenarios.

- __Comments__: For function's docstrings comments, let's use the google format with python annotation (as "PEP 484 type annotations" in https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html).
We showcase the potential of our dataset by using it as a benchmark for ML-based climate emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing them across different climate models. Furthermore, the dataset is used to train an ML model on all 36 climate models, i.e. not only one specific climate model but the entire CMIP6 archive is emulated. With this, we can quickly project new climate scenarios capturing the inter-model variability of climate models - similar to the “averaged climate scenarios” provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate model related tasks at scale.*

This repository contains two independent pathways: data building and emulation.

### Data Building
If you wish to create an individual climate dataset or extend the core dataset provided by ClimateSet, please refer to the [downloader](README_downloader.md) and [preprocessor](README_preprocessor.md) pages for further information.

### Emulation
If you wish to set up your own experiments, or reproduce our benchmarks on the core dataset or on your own dataset, please refer to the [emulator](README_emulator.md) page for further information.

## Development

This repository is currently under active development, and you may encounter bugs with some functionality.
Any feedback, extensions & suggestions are welcome!
78 changes: 78 additions & 0 deletions README_downloader.md
@@ -0,0 +1,78 @@
# ***ClimateSet***: A Large-Scale Climate Model Dataset for Machine Learning

## Official implementation for the data downloader

Abstract: *Climate models have been key for assessing the impact of climate change and simulating future climate scenarios depending on humanity’s socioeconomic choices.

The machine learning (ML) community has taken an increased interest in supporting climate scientists’ efforts on various tasks such as climate emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and the ML communities have communicated that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the CMIP6 and Input4MIPs archives. In addition, we provide a modular dataset pipeline for retrieving and pre-processing additional climate models and scenarios.

We showcase the potential of our dataset by using it as a benchmark for ML-based climate emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing them across different climate models. Furthermore, the dataset is used to train an ML model on all 36 climate models, i.e. not only one specific climate model but the entire CMIP6 archive is emulated. With this, we can quickly project new climate scenarios capturing the inter-model variability of climate models - similar to the “averaged climate scenarios” provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate model related tasks at scale.*

## Usage
### Create an environment

To set up the environment for the downloader, we use python>=3.10.

Follow these steps to create the environment:

```bash
python -m venv env_downloader
source env_downloader/bin/activate
pip install -r requirements_downloader.txt
```

### Downloader

To download data, run the downloader module with a config specifying which climate models, experiments, and variables you want to download data for.
You can also specify a list of ensemble members or a maximum number of ensemble members per climate model.

The downloader uses the following default parameters:

```python
experiments: List[str] = [
    "historical",
    "ssp370",
    "hist-GHG",
    "piControl",
    "ssp434",
    "ssp126",
],  # sub-selection of the ClimateBench defaults
vars: List[str] = ["tas", "pr", "SO2", "BC"],
data_dir: str = os.path.join(ROOT_DIR, "data"),
max_ensemble_members: int = 10,  # if -1, take all available ensemble members
ensemble_members: List[str] = None,  # preferred ensemble members; if None, not considered
overwrite: bool = False,  # whether existing files should be overwritten
download_biomassburning: bool = True,  # get biomass-burning data for input4mips
download_metafiles: bool = True,  # get input4mips meta files
plain_emission_vars: bool = True,  # if True, plain emission variable ids are given and the rest is inferred; if False, all variables must be fully specified

```

To run the downloader, create a config in which you specify which models you want to download data for. If no model is given, we assume you only want to download "input4mips" data.
You can override any of the downloader kwargs in this config.
As an example, to download the files needed to create the core dataset, see this [example downloader config](data_building/configs/downloader/core_dataset.yaml).
To run the downloader with this default example, execute the following command:

```bash
python -m data_building.builders.downloader --cfg data_building/configs/downloader/core_dataset.yaml
```

Feel free to create new configs to change the variables and experiments being downloaded.

By default, the downloader expects plain variable names for the emission variables, for easier usage, and infers the other variable names needed to build the full dataset. For example, passing ```BC``` will download data for ```BC_em_anthro```, ```BC_em_AIR_anthro``` and ```BC_em_openburning```, as well as the biomass-burning (```BC```) and percentage files if desired.
If you wish to change this behavior and be specific about which variables to download, pass ```plain_emission_vars: False``` in your config.
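A minimal sketch of the expansion described above, assuming a plain variable id like ```BC```; the helper name is illustrative, and the real inference logic inside the downloader is not shown here.

```python
def expand_emission_var(var: str) -> list[str]:
    """Expand a plain emission variable id (e.g. "BC") into the
    input4mips variable ids described above."""
    return [f"{var}_em_anthro", f"{var}_em_AIR_anthro", f"{var}_em_openburning"]

print(expand_emission_var("BC"))
# ['BC_em_anthro', 'BC_em_AIR_anthro', 'BC_em_openburning']
```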

By default, the downloader will create two subfolders in your specified directory: one named ```raw```, containing unprocessed ```input4mips``` and ```CMIP6``` files, and one named ```meta```, containing files concerning fire emission data and other files needed to achieve consistent preprocessing of the emission data.

By default, the downloader will create a folder structure that already encodes most of the needed meta information for each file, such as nominal resolution, temporal resolution, experiment, and source. Please do not change this structure if you wish to use the preprocessing module out of the box.

#### Available Variables

To check which CMIP6 variables are available, refer to this [table](data_building/data_glossary/mappings/variableid2tableid.csv) in our data glossary, which maps long variable names to ids and units. Please use the ids to prompt the downloader.

We provide more detailed information on all variables available for our example model ```NorESM2-LM``` in our [data glossary](data_building/data_glossary/), along with a collection of [helpful links](data_building/data_glossary/helpful_links.txt) to get you started.

## Development

This repository is currently under active development, and you may encounter bugs with some functionality.
Any feedback, extensions & suggestions are welcome!