It provisions all necessary Azure resources; the main ones are an AKS cluster, an Azure PostgreSQL Flexible Server, and Azure Blob Storage.

Before you begin, run `az login` as an Azure account with sufficient privileges to administer the necessary resources.
The templates are organized into two modules, `infra` and `services`.
Before you do anything, create a Terraform vars file `FILE.tfvars` (`FILE` could be something else), with this content:
```
org_prefix = "yourorg"  # use something short and distinctive
```
This is used to help generate unique names for resources that require them (for example, the storage account).
Next, apply the `infra` module (this creates Azure cloud resources only):
terraform apply -target="module.infra" -var-file=FILE.tfvars
If you do not create Azure PostgreSQL Flexible Server instances often, the Azure API may be flaky initially:
```
| Error: waiting for creation of the Postgresql Flexible Server "metaflow-database-server-xyz" (Resource Group "rg-db-metaflow-xyz"):
| Code="InternalServerError" Message="An unexpected error occured while processing the request. Tracking ID: 'xyz'"
|
| with module.infra.azurerm_postgresql_flexible_server.metaflow_database_server,
| on infra/database.tf line 20, in resource "azurerm_postgresql_flexible_server" "metaflow_database_server":
| 20: resource "azurerm_postgresql_flexible_server" "metaflow_database_server" {
```

In our experience, waiting 20 minutes and trying again resolves this issue. This appears to be a one-time phenomenon: future stack spin-ups do not encounter such `InternalServerError`s.
We have hardcoded default instance types to be used for K8s nodes as well as worker pools (`taskworkers`). Depending on the real-time availability of such instances in your region or availability zone, you may want to choose alternate instance types.
VM availability issues might look something like this:
```
| Error: waiting for creation of Node Pool: (Agent Pool Name "taskworkers" / Managed Cluster Name "metaflow-kubernetes-xyz" /
| Resource Group "rg-k8s-metaflow-xyz"): Code="ReconcileVMSSAgentPoolFailed" Message="Code=\"AllocationFailed\" Message=\"Allocation failed.
| We do not have sufficient capacity for the requested VM size in this region. Read more about improving likelihood of allocation success
| at http://aka.ms/allocation-guidance\""
```
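If you hit this, one way to look for alternative VM sizes is to query SKU availability with the Azure CLI. A hedged example (substitute your own region; the `--size` value here is just a filter prefix):

```
az vm list-skus --location westeurope --size Standard_D --output table
```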
VM quotas may also cause provisioning to fail:
```
| Error: creating Node Pool: (Agent Pool Name "taskworkers" / Managed Cluster Name "metaflow-kubernetes-default" / Resource Group "rg-k8s-metaflow-default"):
| containerservice.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="PreconditionFailed"
| Message="Provisioning of resource(s) for Agent Pool taskworkers failed. Error: {\n \"code\": \"InvalidTemplateDeployment\",\n
| \"message\": \"The template deployment '8b1a99f1-e35e-44be-a8ac-0f82009b7149' is not valid according to the validation procedure.
| The tracking id is 'xyz'. See inner errors for details.\",\n \"details\":
| [\n {\n \"code\": \"QuotaExceeded\",\n \"message\": \"Operation could not be completed as it results in exceeding approved standardDv5Family Cores quota.
| Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 0, Current Usage: 0,
| Additional Required: 4, (Minimum) New Limit Required: 4.
| Submit a request for Quota increase at https://<AZURE_LINK> by specifying parameters listed in the ‘Details’ section for deployment to succeed.
| Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n }\n ]\n }"
```
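Before requesting a quota increase, you can check your current vCPU usage and limits for a region with the Azure CLI (substitute your own region):

```
az vm list-usage --location westeurope --output table
```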
Then, apply the `services` module (this deploys Metaflow services to AKS):
terraform apply -target="module.services" -var-file=FILE.tfvars
The step above will output next steps for Metaflow end users.
The recommended way to orchestrate Metaflow workloads on Kubernetes is via Argo Workflows. However, Airflow is also supported as an alternative.
The template also provides the `deploy_airflow` and `deploy_argo` flags as variables. These are booleans that specify whether Airflow or Argo Workflows will be deployed in the Kubernetes cluster along with the Metaflow-related services. By default, `deploy_argo` is set to true and `deploy_airflow` is set to false.
To change these, set them in your `FILE.tfvars` file (or via another Terraform variable-passing mechanism).
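For example, to deploy Airflow instead of Argo Workflows, your `FILE.tfvars` might look like this (the `org_prefix` value is the placeholder from earlier):

```
org_prefix     = "yourorg"
deploy_argo    = false
deploy_airflow = true
```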
Argo Workflows is installed by default on the AKS cluster as part of the `services` submodule. Setting the `deploy_argo` variable controls whether Argo is deployed in the AKS cluster. No additional configuration is done in the `infra` module to support Argo.
After you have changed the value of `deploy_argo`, re-apply Terraform for both `infra` and `services`.
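These are the same apply commands used in the provisioning steps above:

```
terraform apply -target="module.infra" -var-file=FILE.tfvars
terraform apply -target="module.services" -var-file=FILE.tfvars
```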
This is a quickstart template only; it is not recommended for real production deployments.
If `deploy_airflow` is set to true, then the `infra` module will create one more storage blob container, named `airflow-logs`, and grant blob-container read/write permissions to the service principal. We create this extra blob container because Airflow expects the blob container where it ships logs on Azure to be named `airflow-logs`.
The `services` module will deploy Airflow via a Helm chart into the Kubernetes cluster (the one deployed by the `infra` module). The Airflow installation will store all of its logs in the `airflow-logs` blob container. The Terraform template deploys Airflow configured with a `LocalExecutor`; Metaflow can work with any Airflow executor, and this template uses the `LocalExecutor` for simplicity.
After you have changed the value of `deploy_airflow`, re-apply Terraform for both `infra` and `services` (the same two apply commands shown above).
Airflow expects Python files with Airflow DAGs to be present in the `dags_folder`. By default, this Terraform template uses the default set in the Airflow Helm chart, which is `{AIRFLOW_HOME}/dags` (`/opt/airflow/dags`).
The metaflow-tools repository also ships an `airflow_dag_upload.py` file that can help sync Airflow DAG files generated by Metaflow to the Airflow scheduler deployed by this template. Under the hood, `airflow_dag_upload.py` uses the `kubectl cp` command to copy files from the local machine to the Airflow scheduler's container. Example of how to use the file:
```
python airflow_dag_upload.py my-dag.py /opt/airflow/dags/my-dag.py
```
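To confirm the DAG file landed, you can list the DAGs folder inside the scheduler container. This is a hedged sketch: the namespace and deployment name depend on your Helm release, so substitute the values from your own cluster:

```
kubectl exec --namespace <airflow-namespace> deploy/<release-name>-scheduler -- ls /opt/airflow/dags
```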
Terraform manages the state of Azure resources in `tfstate` files, stored locally by default. If you plan to maintain the minimal stack for any significant period of time, it is highly recommended that these state files be stored in cloud storage (e.g., Azure Blob Storage) instead; among other reasons, this protects against losing the local state files (and with them, the ability to manage the stack with Terraform) and allows multiple people to administer the same stack. For more details, see the Terraform docs.
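As a sketch, remote state on Azure can be configured with Terraform's `azurerm` backend. The resource group, storage account, and container names below are placeholders for resources you must create beforehand; they are not part of this template:

```
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"          # placeholder: a resource group you manage
    storage_account_name = "yourorgtfstate"              # placeholder: must be globally unique
    container_name       = "tfstate"                     # placeholder: blob container for state files
    key                  = "metaflow.terraform.tfstate"  # name of the state blob
  }
}
```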