Commit

Merge branch 'main' into feature/mount-plugin-on-vfolder
fregataa committed Oct 23, 2023
2 parents fce6b5b + 429c155 commit befeee2
Showing 354 changed files with 18,032 additions and 7,994 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/coverage.yml
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@ on:
jobs:
test-coverage:
if: ${{ contains(fromJson('["schedule", "workflow_dispatch"]'), github.event_name) || (!contains(github.event.pull_request.labels.*.name, 'skip:ci') && github.event.pull_request.merged == true) }}
runs-on: [ubuntu-latest-8-cores, self-hosted]
runs-on: [ubuntu-latest-8-cores]
steps:
- uses: actions/checkout@v4
with:
7 changes: 6 additions & 1 deletion .github/workflows/default.yml
@@ -53,7 +53,7 @@ jobs:
- name: Check BUILD files
run: pants tailor --check update-build-files --check '::'
- name: Check forbidden cross imports
run: ./scripts/check-cross-imports.sh
run: pants dependencies '::'
- name: Lint
run: |
if [ "$GITHUB_EVENT_NAME" == "pull_request" -a -n "$GITHUB_HEAD_REF" ]; then
@@ -394,6 +394,11 @@ jobs:
PYTHON_VERSION=$(grep -m 1 -oP '(?<=CPython==)([^"]+)' pants.toml)
echo "PANTS_CONFIG_FILES=pants.ci.toml" >> $GITHUB_ENV
echo "PROJECT_PYTHON_VERSION=$PYTHON_VERSION" >> $GITHUB_ENV
- name: Download wheels
uses: actions/download-artifact@v3
with:
name: wheels
path: dist
- name: Build and push
uses: docker/build-push-action@v4
with:
29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,35 @@ Changes

<!-- towncrier release notes start -->

## 23.09.0 (2023-09-28)

### Features
* Add option for roundrobin agent selection strategy ([#1405](https://github.com/lablup/backend.ai/issues/1405))
* Add health check and manual trigger API for the manager scheduler ([#1444](https://github.com/lablup/backend.ai/issues/1444))
* Implement VAST storage backend. ([#1577](https://github.com/lablup/backend.ai/issues/1577))

### Fixes
* Apply the jinja `string` filter to a `yarl.URL()`-typed field in webserver.conf to make it serializable ([#1595](https://github.com/lablup/backend.ai/issues/1595))


## 23.09.0b3 (2023-09-22)

### Features
* Add a GraphQL query to get the information of a virtual folder by ID. ([#432](https://github.com/lablup/backend.ai/issues/432))
* Implement limitation of the number of containers per agent. ([#1338](https://github.com/lablup/backend.ai/issues/1338))
* Introduce the k8s agent backend mode to `install-dev.sh` with `--agent-backend` option ([#1526](https://github.com/lablup/backend.ai/issues/1526))
* Improve the resource metadata API (`/config/resource-slots/details`) to include only explicitly reported resource slots and be able to filter by the agent availability in a resource group ([#1589](https://github.com/lablup/backend.ai/issues/1589))

### Fixes
* Enable `ResourceSlotColumn` to return `None` since we need to distinguish between an empty `ResourceSlot` value and `None`.
Alter the `kernels.requested_slots` column to be non-nullable since its value should never be null. ([#1469](https://github.com/lablup/backend.ai/issues/1469))
* Update outdated nfs mount for kubernetes agent backend ([#1527](https://github.com/lablup/backend.ai/issues/1527))
* Collect orphan routings (routes whose owning session has already been terminated) ([#1590](https://github.com/lablup/backend.ai/issues/1590))
* Handle external errors from the storage proxy by returning an error response with a detailed message instead of leaving them unhandled. ([#1591](https://github.com/lablup/backend.ai/issues/1591))
* Add `pipeline.endpoint` default value to `configs/webserver/halfstack.conf` to be able to run immediately after install ([#1592](https://github.com/lablup/backend.ai/issues/1592))
* Make `RedisHelperConfig` optional and give default values when it is not specified. ([#1593](https://github.com/lablup/backend.ai/issues/1593))


## 23.09.0b2 (2023-09-20)

### Fixes
1 change: 1 addition & 0 deletions MIGRATION.md
@@ -9,6 +9,7 @@ Backend.AI Migration Guide
# 23.03 to 23.09
* webserver configuration scheme updated
- `webserver`, `logging` and `debug` categories added, with all of those marked as required.
- `session.redis.host` and `session.redis.port` settings are now part of `session.redis.addr`
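
The `session.redis.addr` change above can be sketched as a small conversion step. This is a minimal illustration, not the project's own migration code: the helper name `migrate_session_redis` is hypothetical, the `host:port` string format is inferred from the `redis.addr = "localhost:6379"` setting in the sample webserver configs, and falling back to port 6379 (the Redis default) when only a host is given is an assumption.

```python
from typing import Any


def migrate_session_redis(session: dict[str, Any]) -> dict[str, Any]:
    """Fold legacy `redis.host` / `redis.port` keys into `redis.addr`.

    `session` is the parsed `[session]` table of a webserver config.
    The input is not modified; a converted copy is returned.
    """
    migrated = dict(session)
    redis = dict(migrated.get("redis", {}))
    host = redis.pop("host", None)
    port = redis.pop("port", None)
    # Only synthesize `addr` when it is not already set explicitly.
    if host is not None and "addr" not in redis:
        # Assumption: use the Redis default port when none was configured.
        redis["addr"] = f"{host}:{port if port is not None else 6379}"
    migrated["redis"] = redis
    return migrated


old_session = {"redis": {"host": "localhost", "port": 6379, "db": 0}, "max_age": 604800}
new_session = migrate_session_redis(old_session)
print(new_session["redis"]["addr"])
```

Keys other than `host` and `port` are carried over unchanged, so the sketch can be applied to a config that has already been migrated without altering it.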

# 22.09 to 23.03
* All running containers **MUST** be shut down before starting 23.03 version of Backend.AI Agent.
2 changes: 1 addition & 1 deletion README.md
@@ -270,7 +270,7 @@ Python Version Compatibility

| Backend.AI Core Version | Compatible Python Version |
|:-----------------------:|:-------------------------:|
| 23.03.x | 3.11.x |
| 23.03.x / 23.09.x | 3.11.x |
| 22.03.x / 22.09.x | 3.10.x |
| 21.03.x / 21.09.x | 3.8.x |

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
23.09.0b2
24.03.0dev0
3 changes: 2 additions & 1 deletion backend.ai-client.Dockerfile
@@ -1,7 +1,8 @@
ARG PYTHON_VERSION
FROM python:${PYTHON_VERSION} AS builder
ARG PKGVER
RUN pip wheel --wheel-dir=/wheels --no-cache-dir backend.ai-client==${PKGVER}
COPY dist /dist
RUN pip wheel --wheel-dir=/wheels --no-cache-dir backend.ai-client==${PKGVER} --find-links=/dist

FROM python:${PYTHON_VERSION}
COPY --from=builder /wheels /wheels
1 change: 1 addition & 0 deletions changes/1401.fix.md
@@ -0,0 +1 @@
Fix wrong exception handling logic in `SchedulerDispatcher`
1 change: 1 addition & 0 deletions changes/1417.feature.md
@@ -0,0 +1 @@
Add `max_vfolder_count` to `ProjectResourcePolicy` and migrate the same option to `UserResourcePolicy`
1 change: 0 additions & 1 deletion changes/1589.feature.md

This file was deleted.

1 change: 1 addition & 0 deletions changes/1599.feature.md
@@ -0,0 +1 @@
Refactor the initialization logic of model session DB models so that errors raised while creating the session can also be stored and shown to the user
1 change: 1 addition & 0 deletions changes/1603.misc.md
@@ -0,0 +1 @@
Bump base Python version from 3.11.4 to 3.11.6 to resolve potential bugs.
1 change: 1 addition & 0 deletions changes/1606.feature.md
@@ -0,0 +1 @@
Check health status of model service actively
1 change: 1 addition & 0 deletions changes/1610.fix.md
@@ -0,0 +1 @@
Update the default API endpoint of the client SDK (`api.cloud.backend.ai`)
1 change: 1 addition & 0 deletions changes/1612.fix.md
@@ -0,0 +1 @@
Fix the mock accelerator plugin to properly set the environment variables without removing existing ones such as `LOCAL_USER_ID`. Also add explicit logging and warning about such situations.
1 change: 1 addition & 0 deletions changes/1613.fix.md
@@ -0,0 +1 @@
Clean up `entrypoint.sh` (our custom container entrypoint), including fixes to avoid non-mandatory recursive file operations on `/home/work`
1 change: 1 addition & 0 deletions changes/1616.fix.md
@@ -0,0 +1 @@
Update GPFS storage client for compatibility.
1 change: 1 addition & 0 deletions changes/1617.fix.md
@@ -0,0 +1 @@
Enable exhaustive search for recursive session termination regardless of each session's status
1 change: 1 addition & 0 deletions changes/1619.fix.md
@@ -0,0 +1 @@
Remove legacy name-based container image exclusion filter to prevent unexpected exclusion of user-built images with names containing "base-" or "common"
1 change: 1 addition & 0 deletions changes/1620.fix.md
@@ -0,0 +1 @@
Improve logging when retrying redis connections during failover and use explicit names for all redis connection pools
1 change: 1 addition & 0 deletions changes/1624.fix.md
@@ -0,0 +1 @@
Allow sessions to have dependencies on stale sessions during the `_post_enqueue()` process.
1 change: 1 addition & 0 deletions changes/835.feature.md
@@ -0,0 +1 @@
Add vfolder purge API for permanent vfolder removal and change original vfolder delete API to update vfolder status only.
1 change: 1 addition & 0 deletions changes/962.feature.md
@@ -0,0 +1 @@
Support session-based usage stats for a given period.
2 changes: 2 additions & 0 deletions configs/agent/halfstack.toml
@@ -31,6 +31,8 @@ sandbox-type = "docker"
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"
scratch-nfs-address = ""
scratch-nfs-options = ""


[watcher]
13 changes: 13 additions & 0 deletions configs/storage-proxy/sample.toml
@@ -196,3 +196,16 @@ gpfs_password = "admin" # Spectrum Scale GUI Password
gpfs_fs_name = "example_fs" # Target filesystem to use as vFolder, must match with the one mounted under `volume.gpfs.path`
gpfs_verify_ssl = false # Skips GPFS API's SSL Certificate validation if set to true. Defaults to false.
gpfs_owner = "1000:1000" # Default ownership to created fileset

[volume.vast]
backend = "vast"
path = "/vfroot/vast"

[volume.vast.options]
vast_endpoint = "https://vast.example.com" # Endpoint to Vast mgmt system API
vast_username = "" # Vast mgmt system Username
vast_password = "" # Vast mgmt system Password
vast_verify_ssl = false # If set to true, allows communicating with the server using an insecure SSL context. Defaults to false.
vast_api_version = "v2" # Vast mgmt system API version. Defaults to `v2`
vast_cluster_id = 1 # Vast mgmt system Cluster ID
vast_storage_base_dir = "/" # Vast storage base directory that is mounted to our volume base path
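
The `[volume.vast]` sample above lends itself to a quick sanity check before deployment. The following is a rough sketch only: `check_vast_volume` and `EXPECTED_VAST_OPTIONS` are hypothetical names, treating every listed option as required is an assumption, and the storage proxy's real validation is not part of this diff. The input dict mirrors what a TOML parser would produce for the `volume.vast` table.

```python
from typing import Any

# Option names and value types are taken from the sample config above.
# Treating all of them as required is an assumption of this sketch.
EXPECTED_VAST_OPTIONS: dict[str, type] = {
    "vast_endpoint": str,
    "vast_username": str,
    "vast_password": str,
    "vast_verify_ssl": bool,
    "vast_api_version": str,
    "vast_cluster_id": int,
    "vast_storage_base_dir": str,
}


def check_vast_volume(volume: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the table looks sane."""
    problems: list[str] = []
    if volume.get("backend") != "vast":
        problems.append("backend must be 'vast'")
    if not volume.get("path"):
        problems.append("path is missing")
    options = volume.get("options", {})
    for key, expected_type in EXPECTED_VAST_OPTIONS.items():
        if key not in options:
            problems.append(f"missing option: {key}")
        elif not isinstance(options[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# Parsed form of the sample `[volume.vast]` table shown above.
sample_volume = {
    "backend": "vast",
    "path": "/vfroot/vast",
    "options": {
        "vast_endpoint": "https://vast.example.com",
        "vast_username": "",
        "vast_password": "",
        "vast_verify_ssl": False,
        "vast_api_version": "v2",
        "vast_cluster_id": 1,
        "vast_storage_base_dir": "/",
    },
}
print(check_vast_volume(sample_volume))
```

Returning a list of problems rather than raising on the first one lets an operator see every misconfigured key in a single pass.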
8 changes: 5 additions & 3 deletions configs/webserver/halfstack.conf
@@ -32,6 +32,7 @@ max_file_upload_size = 4294967296
[plugin]

[pipeline]
endpoint = "http://127.0.0.1:9500"
jwt.secret = "7<:~[X,^Z1XM!*,Pe:PHR!bv,H~Q#l177<7gf_XHD6.<*<.t<[o|V5W(=0x:jTh-"

[ui]
@@ -50,9 +51,10 @@ redis.addr = "localhost:6379"
# redis.service_name = "mymaster"
# redis.sentinel = "127.0.0.1:9503,127.0.0.1:9504,127.0.0.1:9505"

redis.redis_helper_config.socket_timeout = 5
redis.redis_helper_config.socket_connect_timeout = 2
redis.redis_helper_config.reconnect_poll_timeout = 0.3
# redis.redis_helper_config.socket_timeout = 5
# redis.redis_helper_config.socket_connect_timeout = 2
# redis.redis_helper_config.reconnect_poll_timeout = 0.3

max_age = 604800 # 1 week
flush_on_startup = false
login_block_time = 1200 # 20 min (in sec)
13 changes: 8 additions & 5 deletions configs/webserver/sample.conf
@@ -95,8 +95,10 @@ max_file_upload_size = 4294967296
#page = ""

[pipeline]
#endpoint = "http://mlops.com:9500"
jwt.secret = "7<:~[X,^Z1XM!*,Pe:PHR!bv,H~Q#l177<7gf_XHD6.<*<.t<[o|V5W(=0x:jTh-"
# Endpoint to the pipeline service
#endpoint = "http://127.0.0.1:9500"
# A secret to sign JWTs used to authenticate users from the pipeline service
#jwt.secret = "7<:~[X,^Z1XM!*,Pe:PHR!bv,H~Q#l177<7gf_XHD6.<*<.t<[o|V5W(=0x:jTh-"

[ui]
brand = "Lablup Cloud"
@@ -121,9 +123,10 @@ redis.addr = "localhost:6379"
# redis.db = 0
# redis.password = "mysecret"

redis.redis_helper_config.socket_timeout = 5
redis.redis_helper_config.socket_connect_timeout = 2
redis.redis_helper_config.reconnect_poll_timeout = 0.3
# Customizes the settings of the Redis connection object used in the web server.
# redis.redis_helper_config.socket_timeout = 5
# redis.redis_helper_config.socket_connect_timeout = 2
# redis.redis_helper_config.reconnect_poll_timeout = 0.3

max_age = 604800 # 1 week
flush_on_startup = false
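
Commenting out the `redis.redis_helper_config.*` lines matches the changelog entry that makes `RedisHelperConfig` optional with default values. A minimal sketch of that idea follows, under stated assumptions: `RedisHelperSettings` and `load_helper_settings` are hypothetical names, and the defaults used here are simply the values from the commented-out sample lines; the defaults the web server actually applies are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RedisHelperSettings:
    # Assumed defaults, copied from the commented-out sample values.
    socket_timeout: float = 5.0
    socket_connect_timeout: float = 2.0
    reconnect_poll_timeout: float = 0.3


def load_helper_settings(redis_section: Mapping[str, Any]) -> RedisHelperSettings:
    """Build helper settings from a parsed `redis` table.

    A missing `redis_helper_config` key yields all defaults, and a partial
    table overrides only the keys it contains.
    """
    raw = redis_section.get("redis_helper_config") or {}
    return RedisHelperSettings(**raw)


defaults = load_helper_settings({"addr": "localhost:6379"})
custom = load_helper_settings(
    {"addr": "localhost:6379", "redis_helper_config": {"socket_timeout": 10}}
)
print(defaults, custom)
```

An unknown key inside `redis_helper_config` raises `TypeError` in this sketch, which surfaces typos early instead of silently ignoring them.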
8 changes: 4 additions & 4 deletions fixtures/manager/example-keypairs.json
@@ -14,13 +14,15 @@
"user_resource_policies": [
{
"name": "default",
"max_vfolder_size": -1
"max_vfolder_count": 0,
"max_quota_scope_size": -1
}
],
"project_resource_policies": [
{
"name": "default",
"max_vfolder_size": -1
"max_vfolder_count": 0,
"max_quota_scope_size": -1
}
],
"groups": [
@@ -154,8 +156,6 @@
"max_session_lifetime": 0,
"max_concurrent_sessions": 5,
"max_containers_per_session": 1,
"max_vfolder_count": 10,
"max_vfolder_size": 0,
"idle_timeout": 3600,
"allowed_vfolder_hosts": {
"local:volume1": [
127 changes: 63 additions & 64 deletions fixtures/manager/example-session-templates.json
@@ -1,71 +1,70 @@
{
"session_templates": [
{
"id": "c1b8441a-ba46-4a83-8727-de6645f521b4",
"is_active": true,
"domain_name": "default",
"group_id": "2de2b969-1d04-48a6-af16-0bc8adb3c831",
"user_uuid": "f38dea23-50fa-42a0-b5ae-338f5f4693f4",
"type": "TASK",
"name": "jupyter",
"template": {
"api_version": "6",
"kind": "task_template",
"metadata": {
"name": "cr.backend.ai/testing/ngc-pytorch",
"tag": "20.11-py3"
[
{
"id": "c1b8441a-ba46-4a83-8727-de6645f521b4",
"is_active": true,
"domain_name": "default",
"group_id": "2de2b969-1d04-48a6-af16-0bc8adb3c831",
"user_uuid": "f38dea23-50fa-42a0-b5ae-338f5f4693f4",
"type": "TASK",
"name": "python_x86",
"template": {
"api_version": "6",
"kind": "task_template",
"metadata": {
"name": "cr.backend.ai/multiarch/python",
"tag": "3.10-ubuntu20.04"
},
"spec": {
"session_type": "interactive",
"kernel": {
"image": "cr.backend.ai/multiarch/python:3.10-ubuntu20.04",
"environ": {},
"architecture": "x86_64",
"run": null,
"git": null
},
"spec": {
"session_type": "interactive",
"kernel": {
"image": "cr.backend.ai/testing/ngc-pytorch:20.11-py3",
"environ": {},
"run": null,
"git": null
},
"scaling_group": "default",
"mounts": {
},
"resources": {
"cpu": "2",
"mem": "4g",
"cuda.shares": "0.2"
}
"scaling_group": "default",
"mounts": {
},
"resources": {
"cpu": "2",
"mem": "4g"
}
}
},
{
"id": "59062449-4f57-4434-975d-add2a593438c",
"is_active": true,
"domain_name": "default",
"group_id": "2de2b969-1d04-48a6-af16-0bc8adb3c831",
"user_uuid": "f38dea23-50fa-42a0-b5ae-338f5f4693f4",
"type": "TASK",
"name": "rstudio",
"template": {
"api_version": "6",
"kind": "task_template",
"metadata": {
"name": "cr.backend.ai/cloud/r-base",
"tag": "4.0"
}
},
{
"id": "59062449-4f57-4434-975d-add2a593438c",
"is_active": true,
"domain_name": "default",
"group_id": "2de2b969-1d04-48a6-af16-0bc8adb3c831",
"user_uuid": "f38dea23-50fa-42a0-b5ae-338f5f4693f4",
"type": "TASK",
"name": "python_aarch64",
"template": {
"api_version": "6",
"kind": "task_template",
"metadata": {
"name": "cr.backend.ai/multiarch/python",
"tag": "3.10-ubuntu20.04"
},
"spec": {
"session_type": "interactive",
"kernel": {
"image": "cr.backend.ai/multiarch/python:3.10-ubuntu20.04",
"environ": {},
"architecture": "aarch64",
"run": null,
"git": null
},
"scaling_group": "default",
"mounts": {
},
"spec": {
"session_type": "interactive",
"kernel": {
"image": "cr.backend.ai/cloud/r-base:4.0",
"environ": {},
"run": null,
"git": null
},
"scaling_group": "default",
"mounts": {
},
"resources": {
"cpu": "1",
"mem": "2g"
}
"resources": {
"cpu": "2",
"mem": "4g"
}
}
}
]
}
}
]
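
The reworked fixture is a top-level list whose entries carry an `architecture` field under `template.spec.kernel`. As a hedged illustration of how such a fixture might be consumed, the sketch below filters templates by architecture. `find_templates` is a hypothetical helper, and the embedded JSON is a trimmed copy of the two new entries, not the full fixture.

```python
import json
from typing import Any

# Trimmed copy of the two new fixture entries (only the fields used below).
TEMPLATES_JSON = """
[
  {
    "name": "python_x86",
    "template": {
      "spec": {
        "kernel": {
          "image": "cr.backend.ai/multiarch/python:3.10-ubuntu20.04",
          "architecture": "x86_64"
        },
        "resources": {"cpu": "2", "mem": "4g"}
      }
    }
  },
  {
    "name": "python_aarch64",
    "template": {
      "spec": {
        "kernel": {
          "image": "cr.backend.ai/multiarch/python:3.10-ubuntu20.04",
          "architecture": "aarch64"
        },
        "resources": {"cpu": "2", "mem": "4g"}
      }
    }
  }
]
"""


def find_templates(templates: list[dict[str, Any]], architecture: str) -> list[str]:
    """Return the names of templates whose kernel targets `architecture`."""
    return [
        t["name"]
        for t in templates
        # `.get()` tolerates older entries that lack the `architecture` field.
        if t["template"]["spec"]["kernel"].get("architecture") == architecture
    ]


templates = json.loads(TEMPLATES_JSON)
print(find_templates(templates, "aarch64"))
```

Both entries share the same multi-arch image reference, so the `architecture` field is what distinguishes them.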