1 | | -# AI on GKE Assets |
2 | | - |
3 | | -This repository contains assets related to AI/ML workloads on |
4 | | -[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/integrations/ai-infra). |
5 | | - |
6 | | -## Overview |
7 | | - |
8 | | -Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: |
9 | | - |
10 | | -- Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale |
11 | | -- Flexible integration with distributed computing and data processing frameworks |
12 | | -- Support for multiple teams on the same infrastructure to maximize utilization of resources |
13 | | - |
14 | | -## Infrastructure |
15 | | - |
16 | | -The AI-on-GKE application modules assume you already have a functional GKE cluster. If not, follow the instructions under [infrastructure/README.md](./infrastructure/README.md) to install a Standard or Autopilot GKE cluster. |
17 | | - |
18 | | -```bash |
19 | | -. |
20 | | -├── LICENSE |
21 | | -├── README.md |
22 | | -├── infrastructure |
23 | | -│ ├── README.md |
24 | | -│ ├── backend.tf |
25 | | -│ ├── main.tf |
26 | | -│ ├── outputs.tf |
27 | | -│ ├── platform.tfvars |
28 | | -│ ├── variables.tf |
29 | | -│ └── versions.tf |
30 | | -├── modules |
31 | | -│ ├── gke-autopilot-private-cluster |
32 | | -│ ├── gke-autopilot-public-cluster |
33 | | -│ ├── gke-standard-private-cluster |
34 | | -│ ├── gke-standard-public-cluster |
35 | | -│ ├── jupyter |
36 | | -│ ├── jupyter_iap |
37 | | -│ ├── jupyter_service_accounts |
38 | | -│ ├── kuberay-cluster |
39 | | -│ ├── kuberay-logging |
40 | | -│ ├── kuberay-monitoring |
41 | | -│ ├── kuberay-operator |
42 | | -│ └── kuberay-serviceaccounts |
43 | | -└── tutorial.md |
44 | | -``` |
45 | | - |
46 | | -To deploy a new GKE cluster, update the `platform.tfvars` file with the appropriate values, then run the following Terraform commands: |
47 | | -``` |
48 | | -terraform init |
49 | | -terraform apply -var-file platform.tfvars |
50 | | -``` |
51 | | - |
52 | | - |
53 | | -## Applications |
54 | | - |
55 | | -The repo structure looks like this: |
56 | | - |
57 | | -```bash |
58 | | -. |
59 | | -├── LICENSE |
60 | | -├── Makefile |
61 | | -├── README.md |
62 | | -├── applications |
63 | | -│ ├── jupyter |
64 | | -│ └── ray |
65 | | -├── contributing.md |
66 | | -├── dcgm-on-gke |
67 | | -│ ├── grafana |
68 | | -│ └── quickstart |
69 | | -├── gke-a100-jax |
70 | | -│ ├── Dockerfile |
71 | | -│ ├── README.md |
72 | | -│ ├── build_push_container.sh |
73 | | -│ ├── kubernetes |
74 | | -│ └── train.py |
75 | | -├── gke-batch-refarch |
76 | | -│ ├── 01_gke |
77 | | -│ ├── 02_platform |
78 | | -│ ├── 03_low_priority |
79 | | -│ ├── 04_high_priority |
80 | | -│ ├── 05_compact_placement |
81 | | -│ ├── 06_jobset |
82 | | -│ ├── Dockerfile |
83 | | -│ ├── README.md |
84 | | -│ ├── cloudbuild-create.yaml |
85 | | -│ ├── cloudbuild-destroy.yaml |
86 | | -│ ├── create-platform.sh |
87 | | -│ ├── destroy-platform.sh |
88 | | -│ └── images |
89 | | -├── gke-disk-image-builder |
90 | | -│ ├── README.md |
91 | | -│ ├── cli |
92 | | -│ ├── go.mod |
93 | | -│ ├── go.sum |
94 | | -│ ├── imager.go |
95 | | -│ └── script |
96 | | -├── gke-dws-examples |
97 | | -│ ├── README.md |
98 | | -│ ├── dws-queues.yaml |
99 | | -│ ├── job.yaml |
100 | | -│ └── kueue-manifests.yaml |
101 | | -├── gke-online-serving-single-gpu |
102 | | -│ ├── README.md |
103 | | -│ └── src |
104 | | -├── gke-tpu-examples |
105 | | -│ ├── single-host-inference |
106 | | -│ └── training |
107 | | -├── indexed-job |
108 | | -│ ├── Dockerfile |
109 | | -│ ├── README.md |
110 | | -│ └── mnist.py |
111 | | -├── jobset |
112 | | -│ └── pytorch |
113 | | -├── modules |
114 | | -│ ├── gke-autopilot-private-cluster |
115 | | -│ ├── gke-autopilot-public-cluster |
116 | | -│ ├── gke-standard-private-cluster |
117 | | -│ ├── gke-standard-public-cluster |
118 | | -│ ├── jupyter |
119 | | -│ ├── jupyter_iap |
120 | | -│ ├── jupyter_service_accounts |
121 | | -│ ├── kuberay-cluster |
122 | | -│ ├── kuberay-logging |
123 | | -│ ├── kuberay-monitoring |
124 | | -│ ├── kuberay-operator |
125 | | -│ └── kuberay-serviceaccounts |
126 | | -├── saxml-on-gke |
127 | | -│ ├── httpserver |
128 | | -│ └── single-host-inference |
129 | | -├── training-single-gpu |
130 | | -│ ├── README.md |
131 | | -│ ├── data |
132 | | -│ └── src |
133 | | -├── tutorial.md |
134 | | -└── tutorials |
135 | | - ├── e2e-genai-langchain-app |
136 | | - ├── finetuning-llama-7b-on-l4 |
137 | | - └── serving-llama2-70b-on-l4-gpus |
138 | | -``` |
139 | | - |
140 | | - |
141 | | -### JupyterHub |
142 | | - |
143 | | -This repository contains a Terraform template for running JupyterHub on Google Kubernetes Engine. We've also included some example notebooks (under `applications/ray/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see here for the original notebook). To run these, follow the instructions at [applications/ray/README.md](./applications/ray/README.md) to install a Ray cluster. |
144 | | - |
145 | | -This Jupyter module deploys the following resources, once per user: |
146 | | -- JupyterHub deployment |
147 | | -- User namespace |
148 | | -- Kubernetes service accounts |
149 | | - |
150 | | -Learn more [about JupyterHub on GKE here](./applications/jupyter/README.md) |
151 | | - |
152 | | -### Ray |
153 | | - |
154 | | -This repository contains a Terraform template for running Ray on Google Kubernetes Engine. |
155 | | - |
156 | | -This module deploys the following, once per user: |
157 | | -- User namespace |
158 | | -- Kubernetes service accounts |
159 | | -- KubeRay cluster |
160 | | -- Prometheus monitoring |
161 | | -- Logging container |
162 | | - |
163 | | -Learn more [about Ray on GKE here](./applications/ray/README.md) |
164 | | - |
165 | | -## Important Considerations |
166 | | -- Configure the Terraform backend to use a GCS bucket so that Terraform state persists across environments. |
167 | | - |
168 | | - |
169 | | -## Licensing |
170 | | - |
171 | | -* The use of the assets contained in this repository is subject to compliance with [Google's AI Principles](https://ai.google/responsibility/principles/) |
172 | | -* See [LICENSE](/LICENSE) |
| 1 | +# AI on GKE (Archived) |
| 2 | + |
| 3 | +>[!WARNING] |
| 4 | +>This repository has been archived to preserve its contents and is **no longer actively maintained**. It is now **read-only**, meaning no further changes or contributions can be made.
| 5 | +> |
| 6 | +> You can still freely browse all files, commit history, and issues. Please note that most of this repository's content has been migrated to new repositories under the [AI on GKE GitHub Organization](https://github.com/ai-on-gke). |
| 7 | +
| 8 | +## Content Migration Update |
| 9 | + |
| 10 | +All content, including open PRs and issues, has been successfully migrated and updated! You can now find everything in the new repositories within the [AI on GKE GitHub Organization](https://github.com/ai-on-gke) and on the [GKE AI Labs website](https://gke-ai-labs.dev).
| 11 | + |
| 12 | +#### Looking for Older Content? |
| 13 | +If you're searching for a previous folder or guide, start by checking the main README.md file of the specific folder you're looking for. This file should include a direct link to where the code was migrated. If you can't find it there, please refer to the table below for overall guidance. |
| 14 | + |
| 15 | +#### Repository Migration Table |
| 16 | +Below is a breakdown of how the content from older folders has been migrated to new repositories within the [AI on GKE GitHub Organization](https://github.com/ai-on-gke). This table will help you locate content that has been moved or updated. |
| 17 | + |
| 18 | +| Original ai-on-gke folder | New Repository | |
| 19 | +| :----------| :--------------------- | |
| 20 | +| `benchmarks` | [scalability-benchmarks](https://github.com/ai-on-gke/scalability-benchmarks) | |
| 21 | +| `gke-batch-refarch` | [batch-reference-architecture](https://github.com/ai-on-gke/batch-reference-architecture) | |
| 22 | +| `ml-platform` | [GoogleCloudPlatform/accelerated-platforms](https://github.com/GoogleCloudPlatform/accelerated-platforms) | |
| 23 | +| `tutorials-and-examples/nvidia*` | [nvidia-ai-solutions](https://github.com/ai-on-gke/nvidia-ai-solutions) | |
| 24 | +| `applications` | [quick-start-guides](https://github.com/ai-on-gke/quick-start-guides) | |
| 25 | +| `ray-on-gke` | [quick-start-guides](https://github.com/ai-on-gke/quick-start-guides) | |
| 26 | +| `slurm-on-gke` | [slurm-on-gke](https://github.com/ai-on-gke/slurm-on-gke) | |
| 27 | +| `tools` | [tools](https://github.com/ai-on-gke/tools) | |
| 28 | +| `tpu-provisioner` | [tpu-provisioner](https://github.com/ai-on-gke/tpu-provisioner) | |
| 29 | +| `tutorials-and-examples` | [tutorials-and-examples](https://github.com/ai-on-gke/tutorials-and-examples) | |
| 30 | +| `ray-on-gke/tpu/kuberay-tpu-webhook` | [kuberay-tpu-webhook](https://github.com/ai-on-gke/kuberay-tpu-webhook) |
| 31 | +| `modules`, `scripts`, `charts` | [common-infra](https://github.com/ai-on-gke/common-infra) | |
| 32 | +| `website` | [website](https://github.com/ai-on-gke/website) | |