External Storage & Multi-Attach Errors

This page provides detailed guidance, examples, and migration steps for handling storage issues when running multiple MMS pods across nodes.

External Storage & Multi-Attach Troubleshooting

When running Model Management Server (artifact-management) in a Kubernetes cluster with autoscaling enabled, pods may be rescheduled or rebalanced across different nodes (for example, after a crash, node drain, or scaling event). If the underlying storage class does not support multi-node access (e.g., ReadWriteMany), you may encounter Multi-Attach errors when pods are scheduled on different nodes. This is common with default disk-based storage classes in AKS and EKS.

When to use external storage:
If you see Multi-Attach errors during autoscaling, you must use a network file system that supports multi-node access (such as Azure Files or AWS EFS).
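
A quick way to confirm the problem is to inspect the events of the affected pod. The following is a minimal sketch (the pod name is a placeholder; FailedAttachVolume is the standard Kubernetes event reason for this failure):

    kubectl -n <NAMESPACE> describe pod <MMS_POD_NAME> | grep -A2 "Multi-Attach"
    kubectl -n <NAMESPACE> get events --field-selector reason=FailedAttachVolume

A typical event reads similar to: Multi-Attach error for volume "pvc-..." Volume is already exclusively attached to one node and can't be attached to another.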

General Steps:

  1. Provision External Storage (for example, an Azure Files share or an AWS EFS file system that supports the ReadWriteMany access mode)
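
    For example, an Azure Files share can be provisioned with the Azure CLI. This is a sketch only; the account name, resource group, share name, SKU, and quota are placeholders to adapt:

    # Create a storage account (skip if one already exists)
    az storage account create \
        --name <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --sku Standard_LRS \
        --kind StorageV2

    # Create one file share per volume (artifact-management and deploy-storage)
    az storage share-rm create \
        --storage-account <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --name <SHARE_NAME> \
        --quota 20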

  2. Create Kubernetes Secret (if required)

    • Store storage account credentials as a Kubernetes Secret.

    For example, to create a secret for Azure Files:

    kubectl create secret generic azure-files-secret \
        --from-literal=azurestorageaccountname=<STORAGE_ACCOUNT_NAME> \
        --from-literal=azurestorageaccountkey=<STORAGE_KEY> \
        --namespace <NAMESPACE>
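
    If you do not have the storage account key at hand, it can be retrieved with the Azure CLI (a sketch; names are placeholders):

    az storage account keys list \
        --account-name <STORAGE_ACCOUNT_NAME> \
        --resource-group <RESOURCE_GROUP> \
        --query "[0].value" -o tsv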
    
  3. Create PersistentVolume (PV) and PersistentVolumeClaim (PVC)

    • Define a static PV that points to your external storage.
    • Create a PVC that uses the PV or a StorageClass that supports ReadWriteMany.
    • Note: You need to create PV and PVC for both artifact-management and deploy-storage volumes to ensure all MMS data is accessible across nodes.

    For example, to define a static PersistentVolume and PersistentVolumeClaim for Azure Files:

    # artifact-management-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: artifact-management-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: azurefile-csi-static
      mountOptions:
        - dir_mode=0777
        - file_mode=0777
        - uid=1000
        - gid=1000
        - mfsymlinks
        - cache=strict
        - nosharesock
        - nobrl
      csi:
        driver: file.csi.azure.com
        volumeHandle: <VOLUME_HANDLE> # <-- make sure this volume handle is unique for every identical share in the cluster
        readOnly: false
        volumeAttributes:
          secretName: azure-files-secret
          secretNamespace: <NAMESPACE>
          shareName: <SHARE_NAME>
    ---
    # artifact-management-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: artifact-management-rwx
      namespace: <NAMESPACE>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: azurefile-csi-static
      volumeName: artifact-management-pv   # bind explicitly to the PV above so the two claims cannot cross-bind
      resources:
        requests:
          storage: 20Gi
    ---
    # deploy-storage-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: deploy-storage-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: azurefile-csi-static
      mountOptions:
        - dir_mode=0777
        - file_mode=0777
        - uid=1000
        - gid=1000
        - mfsymlinks
        - cache=strict
        - nosharesock
        - nobrl
      csi:
        driver: file.csi.azure.com
        volumeHandle: <VOLUME_HANDLE> # <-- make sure this volume handle is unique for every identical share in the cluster
        readOnly: false
        volumeAttributes:
          secretName: azure-files-secret
          secretNamespace: <NAMESPACE>
          shareName: <SHARE_NAME>
    ---
    # deploy-storage-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: deploy-storage-rwx
      namespace: <NAMESPACE>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: azurefile-csi-static
      volumeName: deploy-storage-pv        # bind explicitly to the PV above so the two claims cannot cross-bind
      resources:
        requests:
          storage: 20Gi
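
    After saving each manifest to the file named in its comment, apply them and confirm that both claims report a Bound status before you continue:

    kubectl apply -f artifact-management-pv.yaml -f artifact-management-pvc.yaml
    kubectl apply -f deploy-storage-pv.yaml -f deploy-storage-pvc.yaml
    kubectl -n <NAMESPACE> get pvc artifact-management-rwx deploy-storage-rwx
    # Both PVCs should show STATUS 'Bound'.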
    
  4. Update MMS Deployment

    • Ensure the MMS (artifact-management) deployment uses the new PVCs for its data volumes.
    • In your deployment spec, update the volumes section to reference the new PVCs:
    # Example snippet from your deployment spec:
    spec:
      template:
        spec:
          volumes:
            - name: artifact-management
              persistentVolumeClaim:
                claimName: artifact-management-rwx   # <-- new PVC name
            - name: deploy-storage
              persistentVolumeClaim:
                claimName: deploy-storage-rwx        # <-- new PVC name
    
    • After making these changes, restart the deployment to pick up the new storage.
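
    For example, with a standard rollout restart:

    kubectl -n <NAMESPACE> rollout restart deployment artifact-management
    kubectl -n <NAMESPACE> rollout status deployment artifact-management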

    • Alternatively, you can patch the deployment using kubectl:

    kubectl -n <NAMESPACE> patch deployment artifact-management \
      --type='strategic' \
      -p='
        spec:
          template:
            spec:
              volumes:
                - name: artifact-management
                  persistentVolumeClaim:
                    claimName: artifact-management-rwx
                - name: deploy-storage
                  persistentVolumeClaim:
                    claimName: deploy-storage-rwx
        '
    
    • This ensures both artifact-management and deploy-storage volumes use the new external PVCs.
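
    To double-check that the pod template now references the new claims, you can inspect it with a JSONPath query (purely a verification step):

    kubectl -n <NAMESPACE> get deployment artifact-management \
      -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'
    # Expected output: artifact-management-rwx deploy-storage-rwx
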
  5. Migrate Existing Data

    • To migrate data from the old PVCs to the new external storage, follow these steps:

    1. Scale down the deployment to 0:

      kubectl -n <NAMESPACE> scale deployment artifact-management --replicas=0
      
    2. Create a temporary Pod for data copy:

      # artifact-migration.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: artifact-move-data
        namespace: <NAMESPACE>
      spec:
        restartPolicy: Never
        containers:
          - name: mover
            image: busybox
            command: ["/bin/sh", "-c", "cp -av /old-data/. /var/opt/spotfire/streaming-web/data/"]
            volumeMounts:
              - name: new
                mountPath: /var/opt/spotfire/streaming-web/data
              - name: old
                mountPath: /old-data
        volumes:
          - name: new
            persistentVolumeClaim:
              claimName: artifact-management-rwx      # new PVC
          - name: old
            persistentVolumeClaim:
              claimName: artifact-management          # old PVC
      

      Apply and monitor the pod:

      kubectl apply -f artifact-migration.yaml
      kubectl -n <NAMESPACE> logs artifact-move-data
      # Wait until the pod status is 'Succeeded' before deleting it.
      kubectl -n <NAMESPACE> delete pod artifact-move-data
      
      # deploy-storage-migration.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: deploy-storage-move-data
        namespace: <NAMESPACE>
      spec:
        restartPolicy: Never
        containers:
          - name: mover
            image: busybox
            command: ["/bin/sh", "-c", "cp -av /old-data/. /var/opt/spotfire/streaming-web/deploy-storage/"]
            volumeMounts:
              - name: new
                mountPath: /var/opt/spotfire/streaming-web/deploy-storage
              - name: old
                mountPath: /old-data
        volumes:
          - name: new
            persistentVolumeClaim:
              claimName: deploy-storage-rwx           # new PVC
          - name: old
            persistentVolumeClaim:
              claimName: deploy-storage               # old PVC
      

      Apply and monitor the pod:

      kubectl apply -f deploy-storage-migration.yaml
      kubectl -n <NAMESPACE> logs deploy-storage-move-data
      # Wait until the pod status is 'Succeeded' before deleting it.
      kubectl -n <NAMESPACE> delete pod deploy-storage-move-data
      

      Note:
      Ensure that the migration/copy pods for both artifact-management and deploy-storage complete successfully (pod status is Succeeded) before deleting them. Deleting the pod before completion may result in incomplete data transfer.
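
      One way to wait for completion from a script is kubectl wait on the pod phase (requires kubectl 1.23 or later; pod names match the manifests above):

      kubectl -n <NAMESPACE> wait --for=jsonpath='{.status.phase}'=Succeeded \
        pod/artifact-move-data --timeout=10m
      kubectl -n <NAMESPACE> wait --for=jsonpath='{.status.phase}'=Succeeded \
        pod/deploy-storage-move-data --timeout=10m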

    3. Scale up the deployment:

      kubectl -n <NAMESPACE> scale deployment artifact-management --replicas=2
      

    This ensures your data is safely migrated to the new external storage before restarting the application.
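
    Once the deployment is scaled back up, you can verify that the replicas are spread across nodes and that no new attach failures are reported:

    kubectl -n <NAMESPACE> get pods -o wide | grep artifact-management
    kubectl -n <NAMESPACE> get events --field-selector reason=FailedAttachVolume
    # The events query should return nothing once ReadWriteMany storage is in use.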

Note:
This approach is only required if you encounter multi-attach errors due to autoscaling and pod distribution across nodes. For single-node or non-autoscaled clusters, the default storage class may be sufficient.