External Storage & Multi-Attach Errors

This page provides detailed guidance, examples, and migration steps for handling storage issues when running multiple MMS pods across nodes.

External Storage & Multi-Attach Troubleshooting

When running Model Management Server (artifact-management) in a Kubernetes cluster with autoscaling enabled, pods may be rescheduled or rebalanced across different nodes (for example, after a crash, node drain, or scaling event). If the underlying storage class does not support multi-node access (e.g., ReadWriteMany), you may encounter Multi-Attach errors when pods are scheduled on different nodes. This is common with default disk-based storage classes in AKS and EKS.

When to use external storage:
If you see Multi-Attach errors during autoscaling, you must use a network file system that supports multi-node access (such as Azure Files or AWS EFS).
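
A quick way to confirm the problem is to inspect the events of the affected pod. The following is a minimal sketch (the pod name is a placeholder; FailedAttachVolume is the standard Kubernetes event reason for this failure):

    kubectl -n <NAMESPACE> describe pod <MMS_POD_NAME> | grep -A2 "Multi-Attach"
    kubectl -n <NAMESPACE> get events --field-selector reason=FailedAttachVolume

A typical event reads similar to: Multi-Attach error for volume "pvc-..." Volume is already exclusively attached to one node and can't be attached to another.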

General Steps:

  1. Provision External Storage (for example, an Azure Files share or an AWS EFS file system that supports the ReadWriteMany access mode)
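
    For example, an Azure Files share can be provisioned with the Azure CLI. This is a sketch only; the account name, resource group, share name, SKU, and quota are placeholders to adapt:

    # Create a storage account (skip if one already exists)
    az storage account create \
        --name <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --sku Standard_LRS \
        --kind StorageV2

    # Create one file share per volume (artifact-management and deploy-storage)
    az storage share-rm create \
        --storage-account <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --name <SHARE_NAME> \
        --quota 20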

  2. Create Kubernetes Secret (if required)

    • Store storage account credentials as a Kubernetes Secret.

    For example, to create a secret for Azure Files:

    kubectl create secret generic azure-files-secret \
        --from-literal=azurestorageaccountname=<STORAGE_ACCOUNT_NAME> \
        --from-literal=azurestorageaccountkey=<STORAGE_KEY> \
        --namespace <NAMESPACE>
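
    If you do not have the storage account key at hand, it can be retrieved with the Azure CLI (a sketch; names are placeholders):

    az storage account keys list \
        --account-name <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --query "[0].value" -o tsv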
    
  3. Create PersistentVolume (PV) and PersistentVolumeClaim (PVC)

    • Define a static PV that points to your external storage.
    • Create a PVC that uses the PV or a StorageClass that supports ReadWriteMany.
    • Note: You need to create PV and PVC for both artifact-management and deploy-storage volumes to ensure all MMS data is accessible across nodes.

    For example, to define a static PersistentVolume and PersistentVolumeClaim for Azure Files:

    # artifact-management-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: artifact-management-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: azurefile-csi-static
      mountOptions:
        - dir_mode=0777
        - file_mode=0777
        - uid=1000
        - gid=1000
        - mfsymlinks
        - cache=strict
        - nosharesock
        - nobrl
      csi:
        driver: file.csi.azure.com
        volumeHandle: <VOLUME_HANDLE> # <-- make sure this volume handle is unique for every identical share in the cluster
        readOnly: false
        volumeAttributes:
          secretName: azure-files-secret
          secretNamespace: <NAMESPACE>
          shareName: <SHARE_NAME>
    ---
    # artifact-management-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: artifact-management-rwx
      namespace: <NAMESPACE>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: azurefile-csi-static
      volumeName: artifact-management-pv   # bind explicitly to the PV above so the two claims cannot cross-bind
      resources:
        requests:
          storage: 20Gi
    ---
    # deploy-storage-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: deploy-storage-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: azurefile-csi-static
      mountOptions:
        - dir_mode=0777
        - file_mode=0777
        - uid=1000
        - gid=1000
        - mfsymlinks
        - cache=strict
        - nosharesock
        - nobrl
      csi:
        driver: file.csi.azure.com
        volumeHandle: <VOLUME_HANDLE> # <-- make sure this volume handle is unique for every identical share in the cluster
        readOnly: false
        volumeAttributes:
          secretName: azure-files-secret
          secretNamespace: <NAMESPACE>
          shareName: <SHARE_NAME>
    ---
    # deploy-storage-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: deploy-storage-rwx
      namespace: <NAMESPACE>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: azurefile-csi-static
      volumeName: deploy-storage-pv        # bind explicitly to the PV above so the two claims cannot cross-bind
      resources:
        requests:
          storage: 20Gi
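
    After saving each manifest to the file named in its comment, apply them and confirm that both claims report a Bound status before you continue:

    kubectl apply -f artifact-management-pv.yaml -f artifact-management-pvc.yaml
    kubectl apply -f deploy-storage-pv.yaml -f deploy-storage-pvc.yaml
    kubectl -n <NAMESPACE> get pvc artifact-management-rwx deploy-storage-rwx
    # Both PVCs should show STATUS 'Bound'.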
    
  4. Update MMS Deployment

    • Ensure the MMS (artifact-management) deployment uses the new PVCs for its data volumes.
    • In your deployment spec, update the volumes section to reference the new PVCs:
    # Example snippet from your deployment spec:
    spec:
      template:
        spec:
          volumes:
            - name: artifact-management
              persistentVolumeClaim:
                claimName: artifact-management-rwx   # <-- new PVC name
            - name: deploy-storage
              persistentVolumeClaim:
                claimName: deploy-storage-rwx        # <-- new PVC name
    
    • After making these changes, restart the deployment to pick up the new storage.
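
    For example, with a standard rollout restart:

    kubectl -n <NAMESPACE> rollout restart deployment artifact-management
    kubectl -n <NAMESPACE> rollout status deployment artifact-management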

    • Alternatively, you can patch the deployment using kubectl:

    kubectl -n <NAMESPACE> patch deployment artifact-management \
      --type='strategic' \
      -p='
        spec:
          template:
            spec:
              volumes:
                - name: artifact-management
                  persistentVolumeClaim:
                    claimName: artifact-management-rwx
                - name: deploy-storage
                  persistentVolumeClaim:
                    claimName: deploy-storage-rwx
        '
    
    • This ensures both artifact-management and deploy-storage volumes use the new external PVCs.
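
    To double-check that the pod template now references the new claims, you can inspect it with a JSONPath query (purely a verification step):

    kubectl -n <NAMESPACE> get deployment artifact-management \
      -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'
    # Expected output: artifact-management-rwx deploy-storage-rwx
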
  5. Migrate Existing Data

    • To migrate data from the old PVCs to the new external storage, follow these steps:

    1. Scale down the deployment to 0:

      kubectl -n <NAMESPACE> scale deployment artifact-management --replicas=0
      
    2. Create a temporary Pod for data copy:

      # artifact-migration.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: artifact-move-data
        namespace: <NAMESPACE>
      spec:
        restartPolicy: Never
        containers:
          - name: mover
            image: busybox
            command: ["/bin/sh", "-c", "cp -av /old-data/. /var/opt/spotfire/streaming-web/data/"]
            volumeMounts:
              - name: new
                mountPath: /var/opt/spotfire/streaming-web/data
              - name: old
                mountPath: /old-data
        volumes:
          - name: new
            persistentVolumeClaim:
              claimName: artifact-management-rwx      # new PVC
          - name: old
            persistentVolumeClaim:
              claimName: artifact-management          # old PVC
      

      Apply and monitor the pod:

      kubectl apply -f artifact-migration.yaml
      kubectl -n <NAMESPACE> logs artifact-move-data
      # Wait until the pod status is 'Succeeded' before deleting it.
      kubectl -n <NAMESPACE> delete pod artifact-move-data
      
      # deploy-storage-migration.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: deploy-storage-move-data
        namespace: <NAMESPACE>
      spec:
        restartPolicy: Never
        containers:
          - name: mover
            image: busybox
            command: ["/bin/sh", "-c", "cp -av /old-data/. /var/opt/spotfire/streaming-web/deploy-storage/"]
            volumeMounts:
              - name: new
                mountPath: /var/opt/spotfire/streaming-web/deploy-storage
              - name: old
                mountPath: /old-data
        volumes:
          - name: new
            persistentVolumeClaim:
              claimName: deploy-storage-rwx           # new PVC
          - name: old
            persistentVolumeClaim:
              claimName: deploy-storage               # old PVC
      

      Apply and monitor the pod:

      kubectl apply -f deploy-storage-migration.yaml
      kubectl -n <NAMESPACE> logs deploy-storage-move-data
      # Wait until the pod status is 'Succeeded' before deleting it.
      kubectl -n <NAMESPACE> delete pod deploy-storage-move-data
      

      Note:
      Ensure that the migration/copy pods for both artifact-management and deploy-storage complete successfully (pod status is Succeeded) before deleting them. Deleting the pod before completion may result in incomplete data transfer.
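
      One way to wait for completion from a script is kubectl wait on the pod phase (requires kubectl 1.23 or later; pod names match the manifests above):

      kubectl -n <NAMESPACE> wait --for=jsonpath='{.status.phase}'=Succeeded \
        pod/artifact-move-data --timeout=10m
      kubectl -n <NAMESPACE> wait --for=jsonpath='{.status.phase}'=Succeeded \
        pod/deploy-storage-move-data --timeout=10m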

    3. Scale up the deployment:

      kubectl -n <NAMESPACE> scale deployment artifact-management --replicas=2
      

    This ensures your data is safely migrated to the new external storage before restarting the application.
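
    Once the deployment is scaled back up, you can verify that the replicas are spread across nodes and that no new attach failures are reported:

    kubectl -n <NAMESPACE> get pods -o wide | grep artifact-management
    kubectl -n <NAMESPACE> get events --field-selector reason=FailedAttachVolume
    # The events query should return nothing once ReadWriteMany storage is in use.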

Note:
This approach is only required if you encounter multi-attach errors due to autoscaling and pod distribution across nodes. For single-node or non-autoscaled clusters, the default storage class may be sufficient.