Merge pull request #119 from MetOffice/develop
Merge all changes from develop onto main before merging CORDEX changes
nhsavage authored Mar 21, 2022
2 parents dd42a86 + c3c66be commit 6b4c3d6
Showing 19 changed files with 889 additions and 99 deletions.
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -20,11 +20,11 @@ conda activate pyprecis-environment
:exclamation: *Note: As of v1.0 we are unable to provision the model data necessary for reproducing the full PyPRECIS learning environment via GitHub due to its large file size. Contact the PRECIS team for more information.*

## Before you start...
- Read through the current issues to see what you can help with. If you have your own ideas for improvements, please start a new issues so we can track and discuss your improvement. You must create a new branch for any changes you make.
+ Read through the current issues to see what you can help with. If you have your own ideas for improvements, please start a new issue so we can track and discuss your improvement. You must create a new branch for any changes you make.

**Please take note of the following guidelines when contributing to the PyPRECIS repository.**

- * Please do **not** make changes to the `master` branch. The `master` branch is reserved for files and code that has been fully tested and reviewed. Only the core PyPRECIS developers can/should push to the `master` branch.
+ * Please do **not** make changes to `main` or `develop` branches. The `main` branch is reserved for files and code that has been fully tested and reviewed. Only the core PyPRECIS developers can push to the `main` and `develop` branches.

* The `develop` branch contains the latest holistic version of the `PyPRECIS` repository. Please branch off `develop` to fix a particular issue or add a new feature.
* Please use the following tokens at the start of a new branch name to help sign-post and group branches:
@@ -66,5 +66,5 @@ have questions.**
<h5 align="center">
<img src="notebooks/img/MO_MASTER_black_mono_for_light_backg_RBG.png" width="200" alt="Met Office"> <br>
- &copy; British Crown Copyright 2018 - 2019, Met Office
+ &copy; British Crown Copyright 2018 - 2022, Met Office
</h5>
17 changes: 12 additions & 5 deletions README.md
@@ -31,7 +31,7 @@ PyPRECIS is built on [Jupyter Notebooks](https://jupyter.org/), with data proces
Further information about PRECIS can be found on the [Met Office website](https://www.metoffice.gov.uk/precis).

## Contents
- The teaching elements of PyPRECIS are contained in the `notebooks` directory. The primary worksheets are:
+ The teaching elements of PyPRECIS are contained in the `notebooks` directory. The core primary worksheets are:

Worksheet | Aims
:----: | -----------
@@ -42,7 +42,7 @@ Worksheet | Aims
[5](notebooks/worksheet5.ipynb) | <li>Have an appreciation for working with daily model data</li><li>Understand how to calculate some useful climate extremes statistics</li><li>Be aware of some coding strategies for dealing with large data sets</li></ul>
[6](notebooks/worksheet6.ipynb) | An extended coding exercise designed to allow you to put everything you've learned into practice

- Additional tutorials specific to the CSSP 20th Century reanalysis datasets:
+ Additional tutorials specific to the CSSP 20th Century reanalysis dataset:

Worksheet | Aims
:----: | -----------
@@ -55,10 +55,17 @@ Three additional worksheets are available for use by workshop instructors:

* `makedata.ipynb`: Provides scripts for preparing raw model output for use in notebook exercises.
* `worksheet_solutions.ipyn`: Solutions to worksheet exercises.
- * `worksheet6example.ipynb`: Example code for Worksheet 6.
+ * `worksheet6example.ipynb`: Example code for Worksheet 6.

## Data
- The data used in the worksheets is currently only available within the Met Office. Data relating to the CSSP_20CRDS_Tutorials is also available in Zarr format in an Azure Blob Storage Service. See the `data/DATA-ACESS.md` for further details.
+ Data relating to the PyPRECIS project is currently held internally at the Met Office.

+ The total data volume for the core worksheets is 36.68 GB, of which ~20 GB is raw pp data. This is too large to be stored on GitHub, or via Git LFS.
+ As of v2.0, the storage solution for making this data available alongside the notebooks is still under investigation.

+ Data relating to the **CSSP 20CRDS** tutorials is held online in an Azure Blob Storage Service. To access this data, users will need a valid shared access signature (SAS) token. The data is in [Zarr](https://zarr.readthedocs.io/en/stable/) format and the total volume is ~2 TB. The data is stored at hourly, 3-hourly, 6-hourly, daily and monthly frequencies, held separately under the `metoffice-20cr-ds` container on MS Azure. Monthly data is also available via [Zenodo](https://zenodo.org/record/2558135).
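The commit itself does not include code for reading these stores, but as a rough illustration a Zarr store like this could be opened with `xarray` plus `fsspec`/`adlfs`; the storage account, store path and SAS token below are placeholders and not taken from the repository:

```python
# Illustrative only: reading one of the CSSP 20CRDS Zarr stores with xarray.
# The container name comes from the text above; the storage account, store
# path and SAS token are placeholders and will differ in practice.
import fsspec
import xarray as xr

store = fsspec.get_mapper(
    "az://metoffice-20cr-ds/monthly/<store-name>.zarr",  # placeholder path
    account_name="<storage-account>",                    # placeholder account
    sas_token="<your-SAS-token>",                        # obtained from the PRECIS team
)
ds = xr.open_zarr(store)
print(ds)
```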



## Contributing
Information on how to contribute can be found in the [Contributing guide](CONTRIBUTING.md).
@@ -69,5 +76,5 @@ PyPRECIS is licenced under BSD 3-clause licence for use outside of the Met Offic

<h5 align="center">
<img src="notebooks/img/MO_MASTER_black_mono_for_light_backg_RBG.png" width="200" alt="Met Office"> <br>
- &copy; British Crown Copyright 2018 - 2020, Met Office
+ &copy; British Crown Copyright 2018 - 2022, Met Office
</h5>
12 changes: 0 additions & 12 deletions data/DATA-ACCESS.md

This file was deleted.

23 changes: 23 additions & 0 deletions dockerfile
@@ -0,0 +1,23 @@
FROM continuumio/miniconda3

RUN apt-get update

# Set working directory for the project
WORKDIR /app

SHELL ["/bin/bash", "--login", "-c"]

RUN apt-get install -y git

# Create Conda environment from the YAML file
COPY environment.yml .
RUN pip install --upgrade pip

RUN conda env create -f environment.yml

RUN conda init bash
RUN conda activate pyprecis-environment

RUN pip install ipykernel && \
python -m ipykernel install --name pyprecis-training

22 changes: 14 additions & 8 deletions environment.yml
@@ -1,11 +1,17 @@
name: pyprecis-environment
channels:
  - conda-forge
-   - defaults
- dependencies:
-   - python=3.6.6
-   - numpy
-   - matplotlib
-   - cartopy=0.16.0
-   - dask=0.19.4
-   - iris=2.2.0
+ dependencies:
+   - python=3.6.10
+   - iris=2.4.0
+   - numpy=1.17.4
+   - matplotlib=3.1.3
+   - nc-time-axis=1.2.0
+   - jupyter_client=6.1.7
+   - jupyter_core=4.6.3
+   - dask=2.11.0
+   - notebook=5.7.8
+   - mo_pack=0.2.0
+   - boto3
+   - botocore
+   - tqdm
129 changes: 129 additions & 0 deletions notebooks/awsutils/README-AWS.md
@@ -0,0 +1,129 @@

## AWS

### Create an EC2 instance

* Select the eu-west-2 (London) region from the top right of the navigation bar
* Click on "Launch instance"
* Choose the Amazon Linux 2 AMI (HVM) Kernel 5.10 64-bit (x86) machine, then click "Select"
* Choose t2.2xlarge and click "Next: Configure instance details"
* Choose subnet default eu-west-2c
* In IAM role, choose the existing trainings-ec2-dev role and click "Next: Add storage"
* 8 GB is fine; click "Next: Add tags"
* Add the following tags:
  * Name: [unique instance name]
  * Tenable: FA
  * ServiceOwner: [firstname.lastname]
  * ServiceCode: PABCLT
* Add a security group: select the existing security group IAStrainings-ec2-mo
* Click "Review and Launch", then select "Launch"
* It will prompt you to set a key pair (to allow SSH). Create a new key and download it.

This creates the instance. To see it, go to "Instances"; the instance state will be "Running". A boto3 equivalent of these console steps is sketched below.
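The sketch below is a rough boto3 equivalent of the console steps above, not a script from this repository; the AMI ID, key pair and instance-profile names are placeholders:

```python
# Illustrative only: launching a t2.2xlarge instance in eu-west-2 with boto3.
# The AMI ID, key pair, security group and IAM instance profile are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",      # Amazon Linux 2 AMI (placeholder ID)
    InstanceType="t2.2xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="<your-key-pair>",
    SecurityGroups=["IAStrainings-ec2-mo"],
    IamInstanceProfile={"Name": "trainings-ec2-dev"},
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [
            {"Key": "Name", "Value": "<unique-instance-name>"},
            {"Key": "Tenable", "Value": "FA"},
            {"Key": "ServiceOwner", "Value": "<firstname.lastname>"},
            {"Key": "ServiceCode", "Value": "PABCLT"},
        ],
    }],
)
print(response["Instances"][0]["InstanceId"])
```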

### SSH to the instance from the VDI


* Save the key (.pem) to `~/.ssh` and set the permissions: `chmod 0400 ~/.ssh/your_key.pem`
* Open `~/.ssh/config` and add the following:

```
Host ec2-*.eu-west-2.compute.amazonaws.com
IdentityFile ~/.ssh/your_key.pem
User ec2-user
```

* Find the public IPv4 DNS and SSH in using it: `ssh ec2-<ip address>.eu-west-2.compute.amazonaws.com`. The public IPv4 DNS can be found in the instance details on AWS: click on your instance and it will open the details.

* Remember to shut down the instance when you are not using it, to save costs.

### Create an S3 bucket

* Go to the S3 service and press "Create bucket"
* Name the bucket
* Set the region to EU (London) eu-west-2
* Add tags:
  * Name: [name of bucket or any unique name]
  * ServiceOwner: [your-name]
  * ServiceCode: PABCLT
  * Tenable: FA
* Click on "Create bucket" (a boto3 equivalent of these steps is sketched below)
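As a rough boto3 equivalent of the console steps above (illustrative, not part of this repository; the bucket name and tag values are placeholders):

```python
# Illustrative only: creating and tagging a bucket in eu-west-2 with boto3.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")
bucket = "<unique-bucket-name>"  # placeholder
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
)
s3.put_bucket_tagging(
    Bucket=bucket,
    Tagging={"TagSet": [
        {"Key": "Name", "Value": bucket},
        {"Key": "ServiceOwner", "Value": "<your-name>"},
        {"Key": "ServiceCode", "Value": "PABCLT"},
        {"Key": "Tenable", "Value": "FA"},
    ]},
)
```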

### Key configurations


The AWS scripts in this repository run only when the config files contain the latest keys. To update the keys:

* Go to AB climate training dev --> Administrator access --> command line or programmatic access
* Copy the keys in "Option 1: Set AWS environment variables"
* In the VDI, paste these keys into ~/.aws/config (replacing any existing keys)
  * Add [default] on the first line
* Copy the keys in "Option 2: Add a profile to your AWS credentials file"
* In the VDI, paste the keys into the credentials file ~/.aws/credentials (remove the first copied line, which looks something like [198477955030_AdministratorAccess])
  * Add [default] on the first line

The config and credentials files should look like this (with your own keys):

```
[default]
export AWS_ACCESS_KEY_ID="ASIAS4NRVH7LD2RRGSFB"
export AWS_SECRET_ACCESS_KEY="rpI/dxzQWhCul8ZHd18n1VW1FWjc0LxoKeGO50oM"
export AWS_SESSION_TOKEN="IQoJb3JpZ2luX2VjEGkaCWV1LXdlc3QtMiJH"
```
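A quick way to confirm that the pasted keys work (illustrative, not part of this repository) is an STS identity call; boto3 picks up the `[default]` profile from `~/.aws/credentials` automatically:

```python
# Illustrative only: verify the [default] credentials are valid.
import boto3

identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```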

### Loading data onto an S3 bucket from the VDI (using boto3)

To upload file(s) to S3, use: `/aws-scripts/s3_file_upload.py`
To upload directory(s) to S3, use: `/aws-scripts/s3_bulk_data_upload.py`
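Those scripts are not reproduced here, but the core of such an upload with boto3 looks roughly like the sketch below; the local path and destination key are placeholders, and the bucket name is the one used elsewhere in these notes:

```python
# Illustrative only: uploading a single file to S3 with boto3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/path/to/local/file.nc",   # placeholder local file
    Bucket="ias-pyprecis",               # bucket used elsewhere in these notes
    Key="data/cmip5/file.nc",            # placeholder destination key
)
```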

### AWS Elastic container repository

The following instructions are for creating an image repository on ECR and uploading a container image:

* SSH to the previously created EC2 instance and make an empty Git repo:

```
sudo yum install -y git
git init
```
* On the VDI, run the following command to push the PyPRECIS repo containing the dockerfile to the EC2 instance:
```
git push <ec2 host name>:~
```

* Now check out the branch on the EC2 instance: `git checkout [branch-name]`
* Install Docker and start the Docker service:

```
sudo amazon-linux-extras install docker
sudo service docker start
```

* Build the Docker image:

```
sudo docker build .
```

* Go to the AWS ECR console and "create repository"; make it private and name it (a boto3 alternative is sketched after these steps)

* Once created, press "push commands"

* Copy the commands and run them on the EC2 instance; this will push the container image to the repository. If you get a "permission denied" error, add "sudo" before "docker" in the command.
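If you prefer to create the repository programmatically rather than in the console, a rough boto3 sketch (illustrative only; the repository name is a placeholder) is:

```python
# Illustrative only: create an ECR repository and print its URI.
import boto3

ecr = boto3.client("ecr", region_name="eu-west-2")
repo = ecr.create_repository(repositoryName="pyprecis-training")  # placeholder name
print(repo["repository"]["repositoryUri"])
```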



### AWS SageMaker: run a notebook using a custom kernel
The instructions below follow this tutorial:
https://aws.amazon.com/blogs/machine-learning/bringing-your-own-custom-container-image-to-amazon-sagemaker-studio-notebooks/

* Go to SageMaker and "Open SageMaker domain"
* Add a user
* Set the name and select the AmazonSageMaker execution role (the default one)

* Once the user is created, go to "Attach image"
* Select "New image" and add the image URI (copy it from the image repository)
* Give the new image a name and display name, set the SageMaker execution role, add tags and attach the image
* Add a kernel name and display name (both can be the same)
* Now launch the app -> Studio and it will open the notebook dashboard.
* Select a Python notebook and add your custom-named kernel
111 changes: 111 additions & 0 deletions notebooks/awsutils/fetch_s3_file.py
@@ -0,0 +1,111 @@

import io
import os
import boto3
from urllib.parse import urlparse
from fnmatch import fnmatch
from shutil import copyfile


def _fetch_s3_file(s3_uri, save_to):

bucket_name, key = _split_s3_uri(s3_uri)
print(f"Fetching s3 object {key} from bucket {bucket_name}")

client = boto3.client("s3")
obj = client.get_object(
Bucket=bucket_name,
Key=key,
)
with io.FileIO(save_to, "w") as f:
for i in obj["Body"]:
f.write(i)


def _save_s3_file(s3_uri, out_filename, file_to_save="/tmp/tmp"):
bucket, folder = _split_s3_uri(s3_uri)
out_filepath = os.path.join(folder, out_filename)
print(f"Save s3 object {out_filepath} to bucket {bucket}")
client = boto3.client("s3")
client.upload_file(
Filename=file_to_save,
Bucket=bucket,
Key=out_filepath
)


def _split_s3_uri(s3_uri):
parsed_uri = urlparse(s3_uri)
return parsed_uri.netloc, parsed_uri.path[1:]


def find_matching_s3_keys(in_fileglob):

bucket_name, file_and_folder_name = _split_s3_uri(in_fileglob)
folder_name = os.path.split(file_and_folder_name)[0]
all_key_responses = _get_all_files_in_s3_folder(bucket_name, folder_name)
matching_keys = []
for key in [k["Key"] for k in all_key_responses]:
if fnmatch(key, file_and_folder_name):
matching_keys.append(key)
return matching_keys


def _get_all_files_in_s3_folder(bucket_name, folder_name):
client = boto3.client("s3")
response = client.list_objects_v2(
Bucket=bucket_name,
Prefix=folder_name,
)
all_key_responses = []
if "Contents" in response:
all_key_responses = response["Contents"]
while response["IsTruncated"]:
continuation_token = response["NextContinuationToken"]
response = client.list_objects_v2(
Bucket=bucket_name,
Prefix=folder_name,
ContinuationToken=continuation_token,
)
if "Contents" in response:
all_key_responses += response["Contents"]
return all_key_responses


def copy_s3_files(in_fileglob, out_folder):
    '''
    Copy files from an S3 bucket to a local directory (or to another S3 location).
    args
    ---
    in_fileglob: s3 uri of files (wildcards can be used)
    out_folder: local path (or s3 uri) where the data will be stored
    '''
matching_keys = find_matching_s3_keys(in_fileglob)
in_bucket_name = _split_s3_uri(in_fileglob)[0]
out_scheme = urlparse(out_folder).scheme
for key in matching_keys:
new_filename = os.path.split(key)[1]
temp_filename = os.path.join("/tmp", new_filename)
in_s3_uri = os.path.join(f"s3://{in_bucket_name}", key)
_fetch_s3_file(in_s3_uri, temp_filename)
if out_scheme == "s3":
_save_s3_file(
out_folder,
new_filename,
temp_filename,
)
else:
copyfile(
temp_filename, os.path.join(out_folder, new_filename)
)
os.remove(temp_filename)


def main():
in_fileglob = 's3://ias-pyprecis/data/cmip5/*.nc'
out_folder = '/home/h01/zmaalick/myprojs/PyPRECIS/aws-scripts'
copy_s3_files(in_fileglob, out_folder)


if __name__ == "__main__":
main()
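For context (not part of this commit's file), `copy_s3_files` can also be pointed at an `s3://` destination, in which case each matched file is re-uploaded rather than copied locally. A hypothetical usage, assuming the module is importable and the source and destination locations exist, might be:

```python
# Illustrative only: both calls assume the module is on the Python path and
# that the source/destination locations exist.
from fetch_s3_file import copy_s3_files

# S3 -> local directory
copy_s3_files("s3://ias-pyprecis/data/cmip5/*.nc", "/tmp/cmip5")

# S3 -> another S3 prefix (detected from the "s3" scheme of the destination)
copy_s3_files("s3://ias-pyprecis/data/cmip5/*.nc", "s3://ias-pyprecis/data/cmip5-copy")
```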