From ec42fdf4b2671760eff1d067a140f5d47b54eaa4 Mon Sep 17 00:00:00 2001 From: Jinpei Su Date: Fri, 6 Feb 2026 17:09:45 +0800 Subject: [PATCH 1/5] support to deploy Milvus in ACP --- docs/en/solutions/How_to_Use_Milvus.md | 719 +++++++++++++++++++++++++ 1 file changed, 719 insertions(+) create mode 100644 docs/en/solutions/How_to_Use_Milvus.md diff --git a/docs/en/solutions/How_to_Use_Milvus.md b/docs/en/solutions/How_to_Use_Milvus.md new file mode 100644 index 0000000..6f9edbd --- /dev/null +++ b/docs/en/solutions/How_to_Use_Milvus.md @@ -0,0 +1,719 @@ +--- +kind: + - Solution +products: + - Alauda Application Services +ProductsVersion: + - 4.x +--- + +# Milvus Vector Database Solution Guide + +## Background + +### The Challenge + +Modern AI/ML applications require efficient similarity search and vector operations at scale. Traditional databases struggle with: + +- **Vector Search Performance**: Inability to efficiently search through millions of high-dimensional vectors +- **Scalability Limitations**: Difficulty scaling vector operations across multiple nodes +- **Complex Deployment**: Challenges in deploying and managing distributed vector databases +- **Integration Complexity**: Hard to integrate with existing ML pipelines and AI frameworks + +### The Solution + +Milvus is an open-source vector database built for scalable similarity search and AI applications, providing: + +- **High-Performance Vector Search**: Billion-scale vector search with millisecond latency +- **Multiple Index Types**: Support for various indexing algorithms (IVF, HNSW, ANNOY, DiskANN) +- **Cloud-Native Architecture**: Kubernetes-native design with automatic scaling and fault tolerance +- **Rich Ecosystem**: Integrations with popular ML frameworks (PyTorch, TensorFlow, LangChain, LlamaIndex) + +## Environment Information + +Applicable Versions: >=ACP 4.2.0, Milvus: >=v2.4.0 + +## Quick Reference + +### Key Concepts +- **Collection**: A container for a set of vectors and their associated schema +- **Vector Embedding**: Numerical representation of data (text, images, audio) for similarity search +- **Index**: Data structure that accelerates vector similarity search +- **Partition**: Logical division of a collection for improved search performance and data management +- **Message Queue**: Required for cluster mode. 
Options include: + - **Woodpecker**: Embedded WAL in Milvus 2.6+ (simpler deployment) + - **Kafka**: External distributed event streaming platform (battle-tested, production-proven) + +### Common Use Cases + +| Scenario | Recommended Approach | Section Reference | +|----------|---------------------|------------------| +| **Semantic Search** | Create collection with text embeddings | [Basic Operations](https://milvus.io/docs/) | +| **Image Retrieval** | Use vision model embeddings | [Image Search](https://milvus.io/docs/) | +| **RAG Applications** | Integrate with LangChain/LlamaIndex | [RAG Pipeline](https://milvus.io/docs/) | +| **Production Deployment** | Use cluster mode with appropriate message queue | [Production Workflows](https://milvus.io/docs/) | + +### Message Queue Selection Guide + +| Factor | Woodpecker | Kafka | +|--------|-----------|-------| +| **Operational Overhead** | Low (embedded) | High (external service) | +| **Production Maturity** | New (Milvus 2.6+) | Battle-tested | +| **Scalability** | Good with object storage | Excellent horizontal scaling | +| **Deployment Complexity** | Simple | Complex | +| **Best For** | Simplicity, lower cost, new deployments | Mission-critical workloads, existing Kafka users | + +## Prerequisites + +Before implementing Milvus, ensure you have: + +- ACP v4.2.0 or later +- Basic understanding of vector embeddings and similarity search concepts + +> **Note**: ACP v4.2.0 and later supports in-cluster MinIO and etcd deployment through the Milvus Operator. External storage (S3-compatible) and external message queue (Kafka) are optional. + +### Storage Requirements + +- **etcd**: Minimum 10GB storage per replica for metadata (in-cluster deployment) +- **MinIO**: Sufficient capacity for your vector data and index files (in-cluster deployment) +- **Memory**: RAM should be 2-4x the vector dataset size for optimal performance + +### Resource Recommendations + +| Deployment Mode | Minimum CPU | Minimum Memory | Recommended Use | +|-----------------|-------------|----------------|-----------------| +| **Standalone** | 4 cores | 8GB | Development, testing | +| **Cluster** | 16+ cores | 32GB+ | Production, large-scale | + +## Installation Guide + +### Chart Upload + +Download the Milvus Operator chart from the Marketplace in the Alauda Customer Portal and upload the chart to your ACP catalog. To download the `violet` tool and find usage information, refer to [Violet CLI Tool Documentation](https://docs.alauda.io/container_platform/4.2/ui/cli_tools/violet.html): + +```bash +CHART=chart-milvus-operator.ALL.1.3.5.tgz +ADDR="https://your-acp-domain.com" +USER="admin@cpaas.io" +PASS="your-password" + +violet push $CHART \ +--platform-address "$ADDR" \ +--platform-username "$USER" \ +--platform-password "$PASS" +``` + +### Backend Storage Configuration + +#### External S3-Compatible Storage (Optional) + +For production deployments requiring external storage, you can use existing S3-compatible storage services. This requires: + +1. Create a Kubernetes secret with storage credentials: + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: milvus-storage-secret + namespace: milvus +type: Opaque +stringData: + accesskey: "" + secretkey: "" +``` + +2. Configure Milvus to use external storage in the custom resource (see Option 2B below) + +#### Ceph RGW (Not Verified) + +Ceph RGW should work with Milvus but is not currently verified. If you choose to use Ceph RGW: + +1. 
Deploy Ceph storage system following the [Ceph installation guide](https://docs.alauda.io/container_platform/4.2/storage/storagesystem_ceph/installation/create_service_stand.html) + +2. [Create Ceph Object Store User](https://docs.alauda.io/container_platform/4.2/storage/storagesystem_ceph/how_to/create_object_user): + +```yaml +apiVersion: ceph.rook.io/v1 +kind: CephObjectStoreUser +metadata: + name: milvus-user + namespace: rook-ceph +spec: + store: my-store + displayName: milvus-storage-pool + quotas: + maxBuckets: 100 + maxSize: -1 + maxObjects: -1 + capabilities: + user: "*" + bucket: "*" +``` + +3. Retrieve access credentials: + +```bash +user_secret=$(kubectl -n rook-ceph get cephobjectstoreuser milvus-user -o jsonpath='{.status.info.secretName}') +ACCESS_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.AccessKey}' | base64 -d) +SECRET_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.SecretKey}' | base64 -d) +``` + +4. Create a Kubernetes secret with the Ceph RGW credentials: + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: milvus-storage-secret + namespace: milvus +type: Opaque +stringData: + accesskey: "" + secretkey: "" +``` + +5. Configure Milvus to use Ceph RGW in the custom resource by setting the storage endpoint to your Ceph RGW service (e.g., `rook-ceph-rgw-my-store.rook-ceph.svc:7480`) + +### Message Queue Options + +Milvus requires a message queue for cluster mode deployments. You can choose between: + +#### Option 1: Woodpecker + +Woodpecker is an embedded Write-Ahead Log (WAL) in Milvus 2.6+. It's a cloud-native WAL designed for object storage. + +**Characteristics:** +- **Simplified Deployment**: No external message queue service required +- **Cost-Efficient**: Lower operational overhead +- **High Throughput**: Optimized for batch operations with object storage +- **Storage Options**: Supports MinIO/S3-compatible storage or local file system +- **Availability**: Introduced in Milvus 2.6 as an optional WAL + +Woodpecker is enabled by default in Milvus 2.6+ and uses the same object storage (MinIO) configured for your Milvus deployment. For more details, see the [Milvus Woodpecker documentation](https://milvus.io/docs/use-woodpecker.md). + +**Considerations:** +- Newer technology with less production history compared to Kafka +- May require evaluation for your specific production requirements +- Best suited for deployments prioritizing simplicity and lower operational overhead + +#### Option 2: Kafka + +Kafka is a distributed event streaming platform that can be used as the message queue for Milvus. Kafka is a mature, battle-tested solution widely used in production environments. + +**Characteristics:** +- **Production-Proven**: Battle-tested in enterprise environments for years +- **Scalability**: Horizontal scaling with multiple brokers +- **Ecosystem**: Extensive tooling, monitoring, and operational experience +- **ACP Integration**: Supported as a service on ACP + +**Setup:** +1. Deploy Kafka following the [Kafka installation guide](https://docs.alauda.io/kafka/4.1/) + +2. Retrieve the Kafka broker service endpoint: + +```bash +# Get Kafka broker service endpoint +kubectl get svc -n kafka-namespace +``` + +3. Use the Kafka broker endpoint in your Milvus custom resource (e.g., `kafka://kafka-broker.kafka.svc.cluster.local:9092`) + +> **Important**: Although the Milvus CRD field is named `pulsar`, it supports both Pulsar and Kafka. 
The endpoint scheme determines which message queue type is used: +> - `kafka://` for Kafka brokers +> - `pulsar://` for Pulsar brokers + +**Considerations:** +- Requires additional operational overhead for Kafka cluster management +- Best suited for organizations with existing Kafka infrastructure and expertise +- Recommended for mission-critical production workloads requiring proven reliability + +### Milvus Deployment + +#### Option 1: Standalone Mode (Development/Testing) + +1. Access ACP web console and navigate to "Applications" → "Create" → "Create from Catalog" + +2. Select the Milvus Operator chart and deploy the operator first + +3. Create a Milvus custom resource with standalone mode: + +```yaml +apiVersion: milvus.io/v1beta1 +kind: Milvus +metadata: + name: milvus-standalone + namespace: milvus +spec: + mode: standalone + components: + image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + + dependencies: + etcd: + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/etcd + tag: 3.5.25-r1 + replicaCount: 1 + persistence: + size: 5Gi + + storage: + type: MinIO + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/minio + tag: RELEASE.2024-12-18T13-15-44Z + mode: standalone + persistence: + size: 20Gi + resources: + requests: + cpu: 100m + memory: 128Mi + + config: + milvus: + log: + level: info +``` + +#### Option 2: Cluster Mode (Production) + +For production deployments, use cluster mode. Below are common production configurations: + +**Option 2A: Production with In-Cluster Dependencies (Recommended)** + +This configuration uses in-cluster etcd and MinIO, with Woodpecker as the embedded message queue: + +```yaml +apiVersion: milvus.io/v1beta1 +kind: Milvus +metadata: + name: milvus-cluster + namespace: milvus +spec: + mode: cluster + components: + image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + + dependencies: + # Use in-cluster etcd + etcd: + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/etcd + tag: 3.5.25-r1 + replicaCount: 3 + persistence: + size: 10Gi + resources: + requests: + cpu: 250m + memory: 256Mi + limits: + cpu: 1000m + memory: 1Gi + + # Use in-cluster MinIO for production + storage: + type: MinIO + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/minio + tag: RELEASE.2024-12-18T13-15-44Z + mode: standalone + persistence: + size: 100Gi + resources: + requests: + cpu: 250m + memory: 256Mi + limits: + cpu: 1000m + memory: 1Gi + + config: + milvus: + log: + level: info + + # Resource allocation for production + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +**Option 2B: Production with External S3-Compatible Storage** + +This configuration uses in-cluster etcd with external S3-compatible storage: + +```yaml +apiVersion: milvus.io/v1beta1 +kind: Milvus +metadata: + name: milvus-cluster + namespace: milvus +spec: + mode: cluster + components: + image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + + dependencies: + # Use in-cluster etcd + etcd: + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/etcd + tag: 3.5.25-r1 + replicaCount: 3 + persistence: + size: 10Gi + resources: + requests: + cpu: 250m + memory: 256Mi + limits: + cpu: 1000m + memory: 1Gi + + # Use external S3-compatible storage + storage: + type: S3 + external: true + endpoint: minio-service.minio.svc:9000 + secretRef: milvus-storage-secret + + config: + milvus: + log: + level: info + + # Resource allocation 
for production + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +4. For external storage, create the storage secret (skip for in-cluster MinIO): + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: milvus-storage-secret + namespace: milvus +type: Opaque +stringData: + accesskey: "" + secretkey: "" +``` + +> **Note**: Skip this step if using in-cluster MinIO (Option 2A). The secret is only required for external storage (Option 2B). + +**Option 2C: Production with Kafka Message Queue** + +If you prefer to use Kafka instead of Woodpecker (recommended for mission-critical production workloads): + +```yaml +apiVersion: milvus.io/v1beta1 +kind: Milvus +metadata: + name: milvus-cluster + namespace: milvus +spec: + mode: cluster + components: + image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + + dependencies: + # Use in-cluster etcd + etcd: + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/etcd + tag: 3.5.25-r1 + replicaCount: 3 + persistence: + size: 10Gi + resources: + requests: + cpu: 250m + memory: 256Mi + limits: + cpu: 1000m + memory: 1Gi + + # Use in-cluster MinIO for production + storage: + type: MinIO + inCluster: + values: + image: + repository: build-harbor.alauda.cn/middleware/minio + tag: RELEASE.2024-12-18T13-15-44Z + mode: standalone + persistence: + size: 100Gi + resources: + requests: + cpu: 250m + memory: 256Mi + limits: + cpu: 1000m + memory: 1Gi + + # Use external Kafka for message queue + # Note: The field is named 'pulsar' for historical reasons, but supports both Pulsar and Kafka + # Use 'kafka://' scheme for Kafka, 'pulsar://' scheme for Pulsar + pulsar: + external: true + endpoint: kafka://kafka-broker.kafka.svc.cluster.local:9092 + + config: + milvus: + log: + level: info + + # Resource allocation for production + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +5. Deploy and verify the Milvus cluster reaches "Ready" status: + +```bash +# Check Milvus custom resource status +kubectl get milvus -n milvus + +# Check all pods are running +kubectl get pods -n milvus + +# View Milvus components +kubectl get milvus -n milvus -o yaml +``` + +## Configuration Guide + +### Accessing Milvus + +1. Retrieve the Milvus service endpoint: + +```bash +# For standalone mode +kubectl get svc milvus-standalone-milvus -n milvus + +# For cluster mode +kubectl get svc milvus-cluster-milvus -n milvus +``` + +2. The default Milvus port is **19530** for gRPC API + +3. Use port-forwarding for local access: + +```bash +kubectl port-forward svc/milvus-standalone-milvus 19530:19530 -n milvus +``` + +### Getting Started with Milvus + +For detailed usage instructions, API reference, and advanced features, please refer to the official [Milvus documentation](https://milvus.io/docs/). 
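Before moving on, you can smoke-test connectivity from Python. This is a minimal sketch, assuming `pymilvus` is installed and the port-forward from the previous step is active; the URI is illustrative:

```python
from pymilvus import MilvusClient

# Connect through the local port-forward (use the in-cluster service DNS name
# instead when running inside the cluster)
client = MilvusClient(uri="http://localhost:19530")

# A successful call confirms the gRPC endpoint is reachable;
# a fresh deployment returns an empty list
print(client.list_collections())
```
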
The official documentation covers:
- Basic operations (create collections, insert vectors, search)
- Advanced features (index types, partitioning, replication)
- Client SDKs (Python, Java, Go, Node.js, C#)
- Integration with AI frameworks (LangChain, LlamaIndex, Haystack)
- Performance tuning and best practices

#### Quick Start Example (Python)

```python
from pymilvus import MilvusClient
import random

# Connect to Milvus
client = MilvusClient(
    uri="http://milvus-standalone-milvus.milvus.svc.cluster.local:19530"
)

# Create a collection
client.create_collection(
    collection_name="demo_collection",
    dimension=384  # Match your embedding model
)

# Insert vectors (random placeholders shown here; substitute your real embeddings)
vectors = [[random.random() for _ in range(384)] for _ in range(2)]
data = [{"id": i, "vector": v, "text": "sample"} for i, v in enumerate(vectors)]
client.insert(collection_name="demo_collection", data=data)

# Search similar vectors
query_vector = [[random.random() for _ in range(384)]]
results = client.search(
    collection_name="demo_collection",
    data=query_vector,
    limit=5
)
```

## Troubleshooting

### Common Issues

#### Pod Not Starting

**Symptoms**: Milvus pods stuck in Pending or CrashLoopBackOff state

**Solutions**:
- Check resource allocation (memory and CPU limits)
- Verify storage classes are available
- Ensure image pull secrets are configured correctly
- Review pod logs: `kubectl logs -n milvus <pod-name>`

#### Connection Refused

**Symptoms**: Unable to connect to Milvus service

**Solutions**:
- Verify Milvus service is running: `kubectl get svc -n milvus`
- Check network policies allow traffic
- Ensure port-forwarding is active if using local access
- Verify no firewall rules blocking port 19530

#### Poor Search Performance

**Symptoms**: Slow vector search queries

**Solutions**:
- Create appropriate indexes for your collections
- Increase query node resources
- Use partitioning to limit search scope
- Optimize search parameters (nprobe, ef)
- Consider using GPU-enabled indices for large-scale deployments

### Diagnostic Commands

Check Milvus health:

```bash
# Check all Milvus components
kubectl get milvus -n milvus -o wide

# Check pod status
kubectl get pods -n milvus

# Check component logs
kubectl logs -n milvus <pod-name> -c milvus

# Describe Milvus resource
kubectl describe milvus <instance-name> -n milvus
```

Verify dependencies:

```bash
# Check in-cluster etcd pods
kubectl get pods -n milvus -l app=etcd

# Check in-cluster MinIO pods
kubectl get pods -n milvus -l app=minio

# Check Kafka connectivity (if using Kafka)
kubectl exec -it <pod-name> -n milvus -- nc -zv <kafka-broker-host> 9092
```

## Best Practices

### Collection Design

- **Schema Planning**: Define appropriate vector dimensions and field types before creating collections
- **Index Selection**: Choose index types based on your use case (HNSW for high recall, IVF for balance)
- **Partitioning**: Use partitions to logically separate data and improve search performance
- **Consistency Level**: Set appropriate consistency levels (Strong, Bounded, Eventually, Session)

### Resource Optimization

- **Memory Sizing**: Allocate memory 2-4x your vector dataset size for optimal performance
- **Query Nodes**: Scale query nodes based on search QPS requirements
- **Index Building**: Use dedicated index nodes for large collections
- **Monitoring**: Implement monitoring for resource utilization and query latency

### Security Considerations

- **Network Policies**: Restrict network access to Milvus services
- **Authentication**: Enable TLS and authentication for 
production deployments +- **Secrets Management**: Use Kubernetes secrets for sensitive credentials +- **RBAC**: Implement role-based access control for Milvus operator + +### Backup Strategy + +- **etcd Backups**: Regular backups of in-cluster etcd persistent volumes +- **MinIO Replication**: Enable replication on MinIO or use redundant storage backend +- **Collection Export**: Periodically export collection data for disaster recovery +- **Testing**: Regularly test restoration procedures + +## Reference + +### Configuration Parameters + +**Milvus Deployment:** +- `mode`: Deployment mode (standalone, cluster) +- `components.image`: Milvus container image +- `dependencies.etcd`: etcd configuration for metadata +- `dependencies.storage`: Object storage configuration +- `dependencies.pulsar`: Message queue configuration (field named `pulsar` for historical reasons, supports both Pulsar and Kafka) +- `config.milvus`: Milvus-specific configuration + +**Message Queue Options:** +- **Woodpecker**: Embedded WAL enabled by default in Milvus 2.6+, uses object storage +- **Kafka**: External Kafka service, set `pulsar.external.endpoint` to Kafka broker with `kafka://` scheme (e.g., `kafka://kafka-broker.kafka.svc.cluster.local:9092`) +- **Pulsar**: External Pulsar service, set `pulsar.external.endpoint` to Pulsar broker with `pulsar://` scheme (e.g., `pulsar://pulsar-broker.pulsar.svc.cluster.local:6650`) + +> **Important**: The CRD field is named `pulsar` for backward compatibility, but you can configure either Pulsar or Kafka by using the appropriate endpoint scheme (`pulsar://` or `kafka://`). + +**Index Types:** +- **FLAT**: Exact search, 100% recall, slow for large datasets +- **IVF_FLAT**: Balanced performance and accuracy +- **IVF_SQ8**: Compressed vectors, lower memory usage +- **HNSW**: High performance, high recall, higher memory usage +- **DISKANN**: Disk-based index for very large datasets + +### Useful Links + +- [Milvus Documentation](https://milvus.io/docs/) - Comprehensive usage guide and API reference +- [Milvus Woodpecker Guide](https://milvus.io/docs/use-woodpecker.md) - Woodpecker WAL documentation +- [Milvus Bootcamp](https://github.com/milvus-io/bootcamp) - Tutorial notebooks and examples +- [PyMilvus SDK](https://milvus.io/api-reference/pymilvus/v2.4.x/About.md) - Python client documentation +- [Milvus Operator GitHub](https://github.com/zilliztech/milvus-operator) - Operator source code +- [ACP Kafka Documentation](https://docs.alauda.io/kafka/4.2/) - Kafka installation on ACP + +## Summary + +This guide provides comprehensive instructions for implementing Milvus on Alauda Container Platform. The solution delivers a production-ready vector database for AI/ML applications, enabling: + +- **Scalable Vector Search**: Billion-scale similarity search with millisecond latency +- **Flexible Deployment**: Support for both development (standalone) and production (cluster) modes +- **Cloud-Native Architecture**: Kubernetes-native design with automatic scaling and fault tolerance +- **Rich AI Integration**: Seamless integration with popular ML frameworks and LLM platforms + +By following these practices, organizations can build robust AI applications including semantic search, recommendation systems, RAG applications, and image retrieval while maintaining the scalability and reliability required for production deployments. 
From 6ee7b0d9ad48b6f95f51ee19014dbad2189d83ca Mon Sep 17 00:00:00 2001 From: Jinpei Su Date: Fri, 6 Feb 2026 18:39:14 +0800 Subject: [PATCH 2/5] add troubleshooting --- docs/en/solutions/How_to_Use_Milvus.md | 402 +++++++++++++++++++++++++ 1 file changed, 402 insertions(+) diff --git a/docs/en/solutions/How_to_Use_Milvus.md b/docs/en/solutions/How_to_Use_Milvus.md index 6f9edbd..e4d70e1 100644 --- a/docs/en/solutions/How_to_Use_Milvus.md +++ b/docs/en/solutions/How_to_Use_Milvus.md @@ -69,9 +69,12 @@ Before implementing Milvus, ensure you have: - ACP v4.2.0 or later - Basic understanding of vector embeddings and similarity search concepts +- Access to your cluster's container image registry (registry addresses vary by cluster) > **Note**: ACP v4.2.0 and later supports in-cluster MinIO and etcd deployment through the Milvus Operator. External storage (S3-compatible) and external message queue (Kafka) are optional. +> **Important**: Different ACP clusters may use different container registry addresses. The documentation uses `build-harbor.alauda.cn` as an example, but you may need to replace this with your cluster's registry (e.g., `registry.alauda.cn:60070`). See the [Troubleshooting section](#image-pull-authentication-errors) for details. + ### Storage Requirements - **etcd**: Minimum 10GB storage per replica for metadata (in-cluster deployment) @@ -85,6 +88,47 @@ Before implementing Milvus, ensure you have: | **Standalone** | 4 cores | 8GB | Development, testing | | **Cluster** | 16+ cores | 32GB+ | Production, large-scale | +### Pre-Deployment Checklist + +Before deploying Milvus, complete this checklist to ensure a smooth deployment: + +- [ ] **Cluster Registry Address**: Verify your cluster's container registry address + ```bash + # Check existing deployments for registry address + kubectl get deployment -n -o jsonpath='{.items[0].spec.template.spec.containers[0].image}' + ``` + +- [ ] **Storage Class**: Verify storage classes are available and check binding mode + ```bash + kubectl get storageclasses + kubectl get storageclass -o jsonpath='{.volumeBindingMode}' + ``` + Prefer storage classes with `Immediate` binding mode. + +- [ ] **Namespace**: Create a dedicated namespace for Milvus + ```bash + kubectl create namespace milvus + ``` + +- [ ] **PodSecurity Policy**: Verify if your cluster enforces PodSecurity policies + ```bash + kubectl get namespace -o jsonpath='{.metadata.labels}' + ``` + If `pod-security.kubernetes.io/enforce=restricted`, be ready to apply security patches. + +- [ ] **Message Queue Decision**: Decide which message queue to use for cluster mode: + - Woodpecker (embedded, simpler) - No additional setup required + - Kafka (external, production-proven) - Deploy Kafka service first + +- [ ] **Storage Decision**: Decide storage configuration: + - In-cluster MinIO (simpler, recommended for most cases) + - External S3-compatible storage (for production with existing storage infrastructure) + +- [ ] **Resource Availability**: Ensure sufficient resources in the cluster + ```bash + kubectl top nodes + ``` + ## Installation Guide ### Chart Upload @@ -103,6 +147,8 @@ violet push $CHART \ --platform-password "$PASS" ``` +> **Important**: Before deploying, verify the image registry address in the chart matches your cluster's registry. If your cluster uses a different registry (e.g., `registry.alauda.cn:60070` instead of `build-harbor.alauda.cn`), you'll need to update the image references. 
See [Image Pull Authentication Errors](#image-pull-authentication-errors) in the Troubleshooting section. + ### Backend Storage Configuration #### External S3-Compatible Storage (Optional) @@ -578,6 +624,21 @@ results = client.search( ## Troubleshooting +### Quick Troubleshooting Checklist + +Use this checklist to quickly identify and resolve common deployment issues: + +| Symptom | Likely Cause | Solution Section | +|---------|--------------|------------------| +| Pods stuck in Pending with PodSecurity violations | PodSecurity policy | [PodSecurity Admission Violations](#podsecurity-admission-violations) | +| Pods fail with ErrImagePull or ImagePullBackOff | Wrong registry or authentication | [Image Pull Authentication Errors](#image-pull-authentication-errors) | +| PVCs stuck in Pending with "waiting for consumer" | Storage class binding mode | [PVC Pending - Storage Class Binding Mode](#pvc-pending---storage-class-binding-mode) | +| etcd pods fail with "invalid reference format" | Image prefix bug | [etcd Invalid Image Name Error](#etcd-invalid-image-name-error) | +| Multi-Attach volume errors | Storage class access mode | [Multi-Attach Volume Errors](#multi-attach-volume-errors) | +| Milvus pod crashes with exit code 134 | Permission issues | See notes in [PodSecurity Admission Violations](#podsecurity-admission-violations) | +| Cannot connect to Milvus service | Network or service issues | [Connection Refused](#connection-refused) | +| Slow vector search performance | Index or resource issues | [Poor Search Performance](#poor-search-performance) | + ### Common Issues #### Pod Not Starting @@ -611,6 +672,347 @@ results = client.search( - Optimize search parameters (nprobe, ef) - Consider using GPU-enabled indices for large-scale deployments +#### PodSecurity Admission Violations + +**Symptoms**: All Milvus components (operator, etcd, MinIO, Milvus) fail to create with PodSecurity errors: + +``` +Error creating: pods is forbidden: violates PodSecurity "restricted:latest": +- unrestricted capabilities (container must set securityContext.capabilities.drop=["ALL"]) +- seccompProfile (pod or container must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") +- allowPrivilegeEscalation != false (container must set securityContext.allowPrivilegeEscalation=false) +- runAsNonRoot != true (pod or container must set securityContext.runAsNonRoot=true) +``` + +**Cause**: ACP clusters enforce PodSecurity "restricted:latest" policy. All Helm charts deployed by Milvus Operator require security context patches to comply with this policy. + +**Comprehensive Solution** - Patch all components: + +```bash +NAMESPACE="" + +# 1. Patch Milvus Operator +kubectl patch deployment milvus-operator -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/securityContext/seccompProfile", + "value": {"type": "RuntimeDefault"} + }, + { + "op": "add", + "path": "/spec/template/spec/containers/0/securityContext/capabilities", + "value": {"drop": ["ALL"]} + } +]' + +# 2. 
Patch etcd StatefulSet +kubectl patch statefulset milvus--etcd -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/securityContext", + "value": {"runAsNonRoot": true, "seccompProfile": {"type": "RuntimeDefault"}} + }, + { + "op": "add", + "path": "/spec/template/spec/containers/0/securityContext", + "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000} + }, + { + "op": "add", + "path": "/spec/template/spec/securityContext/fsGroup", + "value": 1000 + } +]' + +# 3. Patch MinIO Deployment +kubectl patch deployment milvus--minio -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/securityContext", + "value": {"fsGroup": 1000, "runAsNonRoot": true, "seccompProfile": {"type": "RuntimeDefault"}} + }, + { + "op": "add", + "path": "/spec/template/spec/containers/0/securityContext", + "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000} + } +]' + +# 4. Patch Milvus Standalone/Cluster Deployment +kubectl patch deployment milvus--milvus-standalone -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/securityContext", + "value": {"runAsNonRoot": true, "fsGroup": 1000, "seccompProfile": {"type": "RuntimeDefault"}} + } +]' + +kubectl patch deployment milvus--milvus-standalone -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/containers/0/securityContext", + "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000} + } +]' + +kubectl patch deployment milvus--milvus-standalone -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/containers/1/securityContext", + "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000} + } +]' + +# 5. Patch initContainer (config) for Milvus Standalone +kubectl patch deployment milvus--milvus-standalone -n $NAMESPACE --type='json' -p=' +[ + { + "op": "add", + "path": "/spec/template/spec/initContainers[0]/securityContext/allowPrivilegeEscalation", + "value": false + }, + { + "op": "add", + "path": "/spec/template/spec/initContainers[0]/securityContext/capabilities", + "value": {"drop": ["ALL"]} + } +]' +``` + +**Verification**: +```bash +# Check if all pods are running +kubectl get pods -n $NAMESPACE + +# Verify security contexts are applied +kubectl get pod -n $NAMESPACE -o jsonpath='{.spec.securityContext}' +kubectl get pod -n $NAMESPACE -o jsonpath='{.spec.containers[*].securityContext}' +``` + +> **Important**: These patches must be applied after each deployment. For a permanent solution, the Helm charts need to be updated to include these security contexts by default. + +> **Known Issue**: Some containers may fail with permission errors when running as non-root (user 1000). This is a known limitation that requires chart updates to resolve properly. + +#### Image Pull Authentication Errors + +**Symptoms**: Pods fail with `ErrImagePull` or `ImagePullBackOff` status. + +**Error Message**: +``` +Failed to pull image "build-harbor.alauda.cn/middleware/milvus-operator:v1.3.5": +authorization failed: no basic auth credentials +``` + +**Cause**: The image registry requires authentication or the registry address is incorrect for your cluster. + +**Solutions**: + +1. 
**Check if the registry address is correct for your cluster**: + ```bash + # Describe the pod to see which registry it's trying to pull from + kubectl describe pod -n | grep "Image:" + ``` + +2. **If the registry address is wrong**, patch the deployment with the correct registry: + ```bash + # Replace registry.alauda.cn:60070 with your cluster's registry + kubectl patch deployment milvus-operator -n --type='json' -p=' + [ + { + "op": "replace", + "path": "/spec/template/spec/containers/0/image", + "value": "registry.alauda.cn:60070/middleware/milvus-operator:v1.3.5" + } + ]' + ``` + +3. **If authentication is required**, create an image pull secret: + ```bash + # Create a docker-registry secret + kubectl create secret docker-registry harbor-pull-secret \ + --docker-server= \ + --docker-username= \ + --docker-password= \ + -n + + # Patch the deployment to use the secret + kubectl patch deployment milvus-operator -n --type='json' -p=' + [ + { + "op": "add", + "path": "/spec/template/spec/imagePullSecrets", + "value": [{"name": "harbor-pull-secret"}] + } + ]' + ``` + +4. **For existing Milvus deployments**, update image references in the Milvus CR: + ```yaml + spec: + components: + image: registry.alauda.cn:60070/middleware/milvus:v2.6.7 + dependencies: + etcd: + inCluster: + values: + image: + repository: registry.alauda.cn:60070/middleware/etcd + storage: + inCluster: + values: + image: + repository: registry.alauda.cn:60070/middleware/minio + ``` + +> **Note**: Different ACP clusters may use different registry addresses. Common registries include: +> - `build-harbor.alauda.cn` - Default in documentation +> - `registry.alauda.cn:60070` - Private registry +> - Other custom registries per cluster configuration + +#### etcd Invalid Image Name Error + +**Symptoms**: etcd pods fail to start with invalid image reference format error: + +``` +Error: failed to pull and unpack image "docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1": +failed to resolve reference "docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1": +"docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1": invalid reference format +``` + +**Cause**: The etcd Helm chart automatically prepends "docker.io/" to the image repository, creating an invalid image reference when using a custom registry. + +**Solution**: Patch the etcd StatefulSet to use the correct image reference without the "docker.io/" prefix: + +```bash +# Patch the etcd StatefulSet image +kubectl patch statefulset milvus--etcd -n --type='json' -p=' +[ + { + "op": "replace", + "path": "/spec/template/spec/containers/0/image", + "value": "registry.alauda.cn:60070/middleware/etcd:3.5.25-r1" + } +]' +``` + +Replace `registry.alauda.cn:60070` with your cluster's registry address. + +#### PVC Pending - Storage Class Binding Mode + +**Symptoms**: PersistentVolumeClaims remain in Pending state with events like: + +``` +Warning ProvisioningFailed persistentvolumeclaim +storageclass.storage.k8s.io "" is waiting for a consumer to be found +``` + +**Cause**: Some storage classes (e.g., Topolvm) use `volumeBindingMode: WaitForFirstConsumer`, which delays PVC binding until a Pod using the PVC is scheduled. However, some controllers and operators may have issues with this delayed binding mode. + +**Solution**: Use a storage class with `volumeBindingMode: Immediate` for Milvus deployments: + +1. **List available storage classes**: + +```bash +kubectl get storageclasses +``` + +2. 
**Check storage class binding mode**:

```bash
kubectl get storageclass <storage-class-name> -o jsonpath='{.volumeBindingMode}'
```

3. **Use Immediate binding storage class** in your Milvus CR:

```yaml
dependencies:
  etcd:
    inCluster:
      values:
        persistence:
          storageClass: <storage-class-name> # e.g., jpsu2-rook-cephfs-sc
  storage:
    inCluster:
      values:
        persistence:
          storageClass: <storage-class-name>
```

Common storage classes with Immediate binding include CephFS-based storage classes (e.g., `jpsu2-rook-cephfs-sc`).

#### Multi-Attach Volume Errors

**Symptoms**: Pods fail with multi-attach error:

```
Warning FailedMount Unable to attach or mount volumes:
unmounted volumes=[], unattached volumes=[]:
timed out waiting for the condition
Multi-Attach error: Volume is already used by pod(s)
```

**Cause**: This occurs when multiple Pods attempt to use the same PersistentVolume simultaneously with a storage class that doesn't support the read-write-many (RWX) access mode.

**Solution**: Verify that the access mode in use matches your deployment:

1. **Check the access modes on the Milvus PVCs** (access modes are recorded on the PVC/PV, not on the StorageClass object):

```bash
kubectl get pvc -n milvus -o custom-columns='NAME:.metadata.name,ACCESS:.spec.accessModes'
```

2. **Use appropriate storage class** for your deployment:
   - **Standalone mode**: ReadWriteOnce (RWO) is sufficient
   - **Cluster mode**: Use ReadWriteMany (RWX) if multiple pods need shared access, or ensure each pod has its own PVC

3. **For CephFS storage classes**, RWX is typically supported and recommended for Milvus cluster deployments.

### Deployment Verification

After deploying Milvus, verify the deployment is successful:

```bash
# 1. Check Milvus custom resource status
# Should show "Ready" or "Running"
kubectl get milvus -n <namespace>

# 2. Check all pods are running
# All pods should be in "Running" state with no restarts
kubectl get pods -n <namespace>

# 3. Check services are created
kubectl get svc -n <namespace>

# 4. Verify PVCs are bound
kubectl get pvc -n <namespace>

# 5. Check Milvus logs for errors
kubectl logs -n <namespace> deployment/milvus-<instance>-milvus-standalone -c milvus --tail=50

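# 6. Check etcd endpoint health from inside the pod
# (optional extra check, not in the original guide; assumes etcdctl is on
# PATH in the etcd image and client auth is disabled, as in the in-cluster
# deployment shown above)
kubectl exec -n <namespace> milvus-<instance>-etcd-0 -- etcdctl endpoint health

# 7. 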
Port-forward and test connectivity +kubectl port-forward svc/milvus--milvus 19530:19530 -n +``` + +Expected output for a healthy standalone deployment: + +``` +# kubectl get pods -n milvus +NAME READY STATUS RESTARTS AGE +milvus-standalone-etcd-0 1/1 Running 0 5m +milvus-standalone-minio-7f6f9d8b4c-x2k9q 1/1 Running 0 5m +milvus-standalone-milvus-standalone-6b8c9d 1/1 Running 0 3m + +# kubectl get milvus -n milvus +NAME MODE STATUS Updated +milvus-standalone standalone Ready True +``` + ### Diagnostic Commands Check Milvus health: From 83fdefa0648a6a44730e65c0467cf33df6dc62aa Mon Sep 17 00:00:00 2001 From: Jinpei Su Date: Fri, 6 Feb 2026 18:53:27 +0800 Subject: [PATCH 3/5] refine doc --- docs/en/solutions/How_to_Use_Milvus.md | 55 +++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/docs/en/solutions/How_to_Use_Milvus.md b/docs/en/solutions/How_to_Use_Milvus.md index e4d70e1..fab4fe7 100644 --- a/docs/en/solutions/How_to_Use_Milvus.md +++ b/docs/en/solutions/How_to_Use_Milvus.md @@ -635,7 +635,7 @@ Use this checklist to quickly identify and resolve common deployment issues: | PVCs stuck in Pending with "waiting for consumer" | Storage class binding mode | [PVC Pending - Storage Class Binding Mode](#pvc-pending---storage-class-binding-mode) | | etcd pods fail with "invalid reference format" | Image prefix bug | [etcd Invalid Image Name Error](#etcd-invalid-image-name-error) | | Multi-Attach volume errors | Storage class access mode | [Multi-Attach Volume Errors](#multi-attach-volume-errors) | -| Milvus pod crashes with exit code 134 | Permission issues | See notes in [PodSecurity Admission Violations](#podsecurity-admission-violations) | +| Milvus standalone pod crashes (exit code 134) | Health check & non-root compatibility | [Milvus Standalone Pod Crashes (Exit Code 134)](#milvus-standalone-pod-crashes-exit-code-134) | | Cannot connect to Milvus service | Network or service issues | [Connection Refused](#connection-refused) | | Slow vector search performance | Index or resource issues | [Poor Search Performance](#poor-search-performance) | @@ -903,6 +903,59 @@ kubectl patch statefulset milvus--etcd -n --type='json' -p=' Replace `registry.alauda.cn:60070` with your cluster's registry address. +#### Milvus Standalone Pod Crashes (Exit Code 134) + +**Symptoms**: Milvus standalone pod repeatedly crashes with exit code 134 (SIGABRT) and logs show: + +``` +check health failed] [UnhealthyComponent="[querynode,streamingnode]"] +Set runtime dir at /run/milvus failed, set it to /tmp/milvus directory +``` + +**Cause**: This is a known compatibility issue with Milvus v2.6.7 when running under PodSecurity "restricted" policies: + +1. **Health Check Issue**: The `/healthz` endpoint returns HTTP 500 because it checks for cluster-mode components (querynode, streamingnode) that don't exist in standalone mode +2. **Non-Root Compatibility**: The container has permission issues running as non-root user (UID 1000) as required by PodSecurity policies +3. 
**Operator Management**: Manual patches to the deployment are reverted by the Milvus operator + +**Workarounds**: + +**Option 1**: Use cluster mode instead of standalone mode (Recommended) +- Cluster mode has better compatibility with ACP's PodSecurity policies +- The health check properly detects cluster components + +**Option 2**: Request namespace exemption from PodSecurity policies (for testing only): +```bash +# Remove PodSecurity enforcement (NOT recommended for production) +kubectl label namespace pod-security.kubernetes.io/enforce- +kubectl label namespace pod-security.kubernetes.io/enforce=privileged +``` + +Then delete and recreate the Milvus CR. + +**Option 3**: Disable readiness/startup probes (temporary workaround): +```bash +# Patch deployment to use TCP probe instead of HTTP probe +kubectl patch deployment milvus--milvus-standalone -n --type=json -p=' +[ + { + "op": "replace", + "path": "/spec/template/spec/containers/0/readinessProbe", + "value": { + "tcpSocket": {"port": 9091}, + "initialDelaySeconds": 30, + "periodSeconds": 15, + "timeoutSeconds": 3, + "failureThreshold": 10 + } + } +]' +``` + +> **Note**: The operator may revert this change. Monitor and re-apply if needed. + +> **Known Issue**: This is a documented limitation in Milvus v2.6.7. Future versions should address the non-root compatibility and health check issues. For production deployments, use cluster mode or wait for updated charts. + #### PVC Pending - Storage Class Binding Mode **Symptoms**: PersistentVolumeClaims remain in Pending state with events like: From 87f36f8b9d12ca46dc5960d98ba1deefeff2c771 Mon Sep 17 00:00:00 2001 From: Jinpei Su Date: Fri, 6 Feb 2026 22:25:59 +0800 Subject: [PATCH 4/5] refine doc --- docs/en/solutions/How_to_Use_Milvus.md | 113 +++++++++++++++++++++++-- 1 file changed, 107 insertions(+), 6 deletions(-) diff --git a/docs/en/solutions/How_to_Use_Milvus.md b/docs/en/solutions/How_to_Use_Milvus.md index fab4fe7..31874fe 100644 --- a/docs/en/solutions/How_to_Use_Milvus.md +++ b/docs/en/solutions/How_to_Use_Milvus.md @@ -63,6 +63,18 @@ Applicable Versions: >=ACP 4.2.0, Milvus: >=v2.4.0 | **Deployment Complexity** | Simple | Complex | | **Best For** | Simplicity, lower cost, new deployments | Mission-critical workloads, existing Kafka users | +### Important Deployment Considerations + +| Aspect | Standalone Mode | Cluster Mode | +|--------|----------------|--------------| +| **PodSecurity Compatibility** | ⚠️ Requires namespace exemption for testing | ✓ Better compatibility | +| **Production Readiness** | Development/testing only | Production-ready | +| **Resource Requirements** | Lower (4 cores, 8GB RAM) | Higher (16+ cores, 32GB+ RAM) | +| **Scalability** | Limited | Horizontal scaling | +| **Complexity** | Simple to deploy | More components to manage | + +> **⚠️ Warning**: Milvus v2.6.7 standalone mode has known compatibility issues with ACP's PodSecurity "restricted" policy. For production deployments, use cluster mode or apply namespace exemptions (see [Troubleshooting](#milvus-standalone-pod-crashes-exit-code-134)). + ## Prerequisites Before implementing Milvus, ensure you have: @@ -71,6 +83,16 @@ Before implementing Milvus, ensure you have: - Basic understanding of vector embeddings and similarity search concepts - Access to your cluster's container image registry (registry addresses vary by cluster) +> **⚠️ PodSecurity Policy Requirement**: ACP clusters enforce PodSecurity policies by default. 
Milvus v2.6.7 standalone mode has known compatibility issues with PodSecurity "restricted" policy. For testing/development, you may need to exempt your namespace: +> ```bash +> kubectl label namespace pod-security.kubernetes.io/enforce=privileged +> ``` +> See [Milvus Standalone Pod Crashes](#milvus-standalone-pod-crashes-exit-code-134) for details and workarounds. + +- ACP v4.2.0 or later +- Basic understanding of vector embeddings and similarity search concepts +- Access to your cluster's container image registry (registry addresses vary by cluster) + > **Note**: ACP v4.2.0 and later supports in-cluster MinIO and etcd deployment through the Milvus Operator. External storage (S3-compatible) and external message queue (Kafka) are optional. > **Important**: Different ACP clusters may use different container registry addresses. The documentation uses `build-harbor.alauda.cn` as an example, but you may need to replace this with your cluster's registry (e.g., `registry.alauda.cn:60070`). See the [Troubleshooting section](#image-pull-authentication-errors) for details. @@ -635,6 +657,7 @@ Use this checklist to quickly identify and resolve common deployment issues: | PVCs stuck in Pending with "waiting for consumer" | Storage class binding mode | [PVC Pending - Storage Class Binding Mode](#pvc-pending---storage-class-binding-mode) | | etcd pods fail with "invalid reference format" | Image prefix bug | [etcd Invalid Image Name Error](#etcd-invalid-image-name-error) | | Multi-Attach volume errors | Storage class access mode | [Multi-Attach Volume Errors](#multi-attach-volume-errors) | +| Milvus panic: MinIO PutObjectIfNoneMatch failed | MinIO PVC corruption | [MinIO Storage Corruption Issues](#minio-storage-corruption-issues) | | Milvus standalone pod crashes (exit code 134) | Health check & non-root compatibility | [Milvus Standalone Pod Crashes (Exit Code 134)](#milvus-standalone-pod-crashes-exit-code-134) | | Cannot connect to Milvus service | Network or service issues | [Connection Refused](#connection-refused) | | Slow vector search performance | Index or resource issues | [Poor Search Performance](#poor-search-performance) | @@ -1026,30 +1049,108 @@ kubectl get storageclass -o jsonpath='{.allowedTopologies}' 3. **For CephFS storage classes**, RWX is typically supported and recommended for Milvus cluster deployments. +#### MinIO Storage Corruption Issues + +**Symptoms**: Milvus standalone pod crashes with panic related to MinIO: + +``` +panic: CheckIfConditionWriteSupport failed: PutObjectIfNoneMatch not supported or failed. +BucketName: milvus-test, ObjectKey: files/wp/conditional_write_test_object, +Error: Resource requested is unreadable, please reduce your request rate +``` + +Or MinIO logs show: + +``` +Error: Following error has been printed 3 times.. UUID on positions 0:0 do not match with +expected... inconsistent drive found +Error: Storage resources are insufficient for the write operation +``` + +**Cause**: The MinIO persistent volume claim (PVC) has corrupted data from previous deployments. This can happen when: +- The MinIO deployment was deleted but the PVC was retained +- Multiple MinIO deployments used the same PVC +- The MinIO data became inconsistent due to incomplete writes or crashes + +**Solution**: Completely recreate MinIO by uninstalling the Helm release and deleting the PVC: + +```bash +# 1. Check MinIO Helm release +helm list -n + +# 2. Uninstall the MinIO Helm release (keeps PVC by default) +helm uninstall milvus--minio -n + +# 3. 
List PVCs to find the MinIO PVC +kubectl get pvc -n | grep minio + +# 4. Delete the corrupted MinIO PVC +kubectl delete pvc -n milvus--minio + +# 5. Delete the Milvus CR to trigger full recreation +kubectl delete milvus -n + +# 6. Recreate the Milvus instance +kubectl apply -f .yaml +``` + +The Milvus operator will automatically: +- Deploy a fresh MinIO instance using Helm +- Create a new PVC with clean data +- Initialize the MinIO bucket properly + +**Verification**: +```bash +# Check new MinIO pod is running +kubectl get pods -n -l app.kubernetes.io/instance= + +# Verify MinIO Helm release is deployed +helm list -n | grep minio + +# Check Milvus can connect to MinIO +kubectl logs -n deployment/milvus--milvus-standalone | grep -i minio +``` + +> **Note**: Always delete both the Helm release AND the PVC when encountering MinIO corruption. Deleting only the deployment or pod will not fix the underlying data corruption. + ### Deployment Verification After deploying Milvus, verify the deployment is successful: ```bash # 1. Check Milvus custom resource status -# Should show "Ready" or "Running" +# Should show "Healthy" status kubectl get milvus -n # 2. Check all pods are running # All pods should be in "Running" state with no restarts kubectl get pods -n -# 3. Check services are created +# 3. Verify all dependencies are healthy +# etcd should be 1/1 Ready +kubectl get pod -n -l app.kubernetes.io/component=etcd + +# MinIO should be Running +kubectl get pod -n | grep minio + +# 4. Check services are created kubectl get svc -n -# 4. Verify PVCs are bound +# 5. Verify PVCs are bound kubectl get pvc -n -# 5. Check Milvus logs for errors -kubectl logs -n deployment/milvus--milvus-standalone -c milvus --tail=50 +# 6. Check MinIO health for corruption +kubectl logs -n deployment/milvus--minio | grep -i "error\|inconsistent\|corrupt" +# Should return no errors -# 6. Port-forward and test connectivity +# 7. Check Milvus logs for errors +kubectl logs -n deployment/milvus--milvus-standalone -c milvus --tail=50 | grep -i "panic\|fatal\|error" + +# 8. Port-forward and test connectivity kubectl port-forward svc/milvus--milvus 19530:19530 -n + +# In another terminal, test the connection +nc -zv localhost 19530 ``` Expected output for a healthy standalone deployment: From b4431be51c1c0e178406a04b1902b0097f5ab521 Mon Sep 17 00:00:00 2001 From: Jinpei Su Date: Wed, 11 Feb 2026 09:57:00 +0800 Subject: [PATCH 5/5] refine doc --- docs/en/solutions/How_to_Use_Milvus.md | 377 ++++++++----------------- 1 file changed, 124 insertions(+), 253 deletions(-) diff --git a/docs/en/solutions/How_to_Use_Milvus.md b/docs/en/solutions/How_to_Use_Milvus.md index 31874fe..ac9c8bc 100644 --- a/docs/en/solutions/How_to_Use_Milvus.md +++ b/docs/en/solutions/How_to_Use_Milvus.md @@ -67,13 +67,13 @@ Applicable Versions: >=ACP 4.2.0, Milvus: >=v2.4.0 | Aspect | Standalone Mode | Cluster Mode | |--------|----------------|--------------| -| **PodSecurity Compatibility** | ⚠️ Requires namespace exemption for testing | ✓ Better compatibility | +| **PodSecurity Compatibility** | ✓ Supported (set `runAsNonRoot: true`) | ✓ Supported | | **Production Readiness** | Development/testing only | Production-ready | | **Resource Requirements** | Lower (4 cores, 8GB RAM) | Higher (16+ cores, 32GB+ RAM) | | **Scalability** | Limited | Horizontal scaling | | **Complexity** | Simple to deploy | More components to manage | -> **⚠️ Warning**: Milvus v2.6.7 standalone mode has known compatibility issues with ACP's PodSecurity "restricted" policy. 
For production deployments, use cluster mode or apply namespace exemptions (see [Troubleshooting](#milvus-standalone-pod-crashes-exit-code-134)). +> **✓ PodSecurity Compliance**: Both standalone and cluster modes are fully compatible with ACP's PodSecurity "restricted" policy. Simply add `components.runAsNonRoot: true` to your Milvus custom resource (see deployment examples below). ## Prerequisites @@ -83,12 +83,6 @@ Before implementing Milvus, ensure you have: - Basic understanding of vector embeddings and similarity search concepts - Access to your cluster's container image registry (registry addresses vary by cluster) -> **⚠️ PodSecurity Policy Requirement**: ACP clusters enforce PodSecurity policies by default. Milvus v2.6.7 standalone mode has known compatibility issues with PodSecurity "restricted" policy. For testing/development, you may need to exempt your namespace: -> ```bash -> kubectl label namespace pod-security.kubernetes.io/enforce=privileged -> ``` -> See [Milvus Standalone Pod Crashes](#milvus-standalone-pod-crashes-exit-code-134) for details and workarounds. - - ACP v4.2.0 or later - Basic understanding of vector embeddings and similarity search concepts - Access to your cluster's container image registry (registry addresses vary by cluster) @@ -132,11 +126,11 @@ Before deploying Milvus, complete this checklist to ensure a smooth deployment: kubectl create namespace milvus ``` -- [ ] **PodSecurity Policy**: Verify if your cluster enforces PodSecurity policies +- [ ] **PodSecurity Policy**: Verify if your cluster enforces PodSecurity policies (most ACP clusters do by default) ```bash kubectl get namespace -o jsonpath='{.metadata.labels}' ``` - If `pod-security.kubernetes.io/enforce=restricted`, be ready to apply security patches. + If `pod-security.kubernetes.io/enforce=restricted`, the Milvus operator will automatically handle PodSecurity compliance when you set `components.runAsNonRoot: true` in your Milvus CR. No manual patching required. - [ ] **Message Queue Decision**: Decide which message queue to use for cluster mode: - Woodpecker (embedded, simpler) - No additional setup required @@ -247,6 +241,8 @@ stringData: Milvus requires a message queue for cluster mode deployments. You can choose between: +> **Important**: For cluster mode, set `dependencies.msgStreamType: woodpecker` to use Woodpecker as the message queue. Do **not** use `msgStreamType: rocksmq` for cluster mode - rocksmq is only for standalone mode. + #### Option 1: Woodpecker Woodpecker is an embedded Write-Ahead Log (WAL) in Milvus 2.6+. It's a cloud-native WAL designed for object storage. @@ -316,6 +312,7 @@ spec: mode: standalone components: image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + runAsNonRoot: true # Enable PodSecurity compliance dependencies: etcd: @@ -349,11 +346,13 @@ spec: level: info ``` +> **Important**: The `components.runAsNonRoot: true` setting enables PodSecurity compliance. The operator will automatically apply all required security contexts to the Milvus containers and their dependencies (etcd, MinIO). + #### Option 2: Cluster Mode (Production) For production deployments, use cluster mode. 
Below are common production configurations: -**Option 2A: Production with In-Cluster Dependencies (Recommended)** +**Option 2A: Production with Woodpecker (Recommended)** This configuration uses in-cluster etcd and MinIO, with Woodpecker as the embedded message queue: @@ -367,8 +366,12 @@ spec: mode: cluster components: image: build-harbor.alauda.cn/middleware/milvus:v2.6.7 + runAsNonRoot: true # Enable PodSecurity compliance dependencies: + # Enable Woodpecker as message queue (recommended for cluster mode) + msgStreamType: woodpecker + # Use in-cluster etcd etcd: inCluster: @@ -421,6 +424,8 @@ spec: memory: "16Gi" ``` +> **Note**: Woodpecker is set with `msgStreamType: woodpecker`. Woodpecker uses the same MinIO storage for its WAL, providing a simpler deployment without external message queue services. + **Option 2B: Production with External S3-Compatible Storage** This configuration uses in-cluster etcd with external S3-compatible storage: @@ -659,6 +664,7 @@ Use this checklist to quickly identify and resolve common deployment issues: | Multi-Attach volume errors | Storage class access mode | [Multi-Attach Volume Errors](#multi-attach-volume-errors) | | Milvus panic: MinIO PutObjectIfNoneMatch failed | MinIO PVC corruption | [MinIO Storage Corruption Issues](#minio-storage-corruption-issues) | | Milvus standalone pod crashes (exit code 134) | Health check & non-root compatibility | [Milvus Standalone Pod Crashes (Exit Code 134)](#milvus-standalone-pod-crashes-exit-code-134) | +| Milvus cluster pod panic with "mq rocksmq is only valid in standalone mode" | Incorrect message queue type | [Cluster Mode Message Queue Configuration](#cluster-mode-message-queue-configuration) | | Cannot connect to Milvus service | Network or service issues | [Connection Refused](#connection-refused) | | Slow vector search performance | Index or resource issues | [Poor Search Performance](#poor-search-performance) | @@ -697,287 +703,140 @@ Use this checklist to quickly identify and resolve common deployment issues: #### PodSecurity Admission Violations -**Symptoms**: All Milvus components (operator, etcd, MinIO, Milvus) fail to create with PodSecurity errors: +**Symptoms**: Milvus pods fail to create with PodSecurity errors: ``` Error creating: pods is forbidden: violates PodSecurity "restricted:latest": -- unrestricted capabilities (container must set securityContext.capabilities.drop=["ALL"]) -- seccompProfile (pod or container must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") -- allowPrivilegeEscalation != false (container must set securityContext.allowPrivilegeEscalation=false) - runAsNonRoot != true (pod or container must set securityContext.runAsNonRoot=true) ``` -**Cause**: ACP clusters enforce PodSecurity "restricted:latest" policy. All Helm charts deployed by Milvus Operator require security context patches to comply with this policy. +**Cause**: The Milvus custom resource is missing the `components.runAsNonRoot: true` setting. -**Comprehensive Solution** - Patch all components: +**Solution**: Add `components.runAsNonRoot: true` to your Milvus custom resource: -```bash -NAMESPACE="" - -# 1. Patch Milvus Operator -kubectl patch deployment milvus-operator -n $NAMESPACE --type='json' -p=' -[ - { - "op": "add", - "path": "/spec/template/spec/securityContext/seccompProfile", - "value": {"type": "RuntimeDefault"} - }, - { - "op": "add", - "path": "/spec/template/spec/containers/0/securityContext/capabilities", - "value": {"drop": ["ALL"]} - } -]' - -# 2. 
 **Option 2B: Production with External S3-Compatible Storage**

 This configuration uses in-cluster etcd with external S3-compatible storage:

@@ -659,6 +664,7 @@ Use this checklist to quickly identify and resolve common deployment issues:
 | Multi-Attach volume errors | Storage class access mode | [Multi-Attach Volume Errors](#multi-attach-volume-errors) |
 | Milvus panic: MinIO PutObjectIfNoneMatch failed | MinIO PVC corruption | [MinIO Storage Corruption Issues](#minio-storage-corruption-issues) |
 | Milvus standalone pod crashes (exit code 134) | Health check & non-root compatibility | [Milvus Standalone Pod Crashes (Exit Code 134)](#milvus-standalone-pod-crashes-exit-code-134) |
+| Milvus cluster pod panic with "mq rocksmq is only valid in standalone mode" | Incorrect message queue type | [Cluster Mode Message Queue Configuration](#cluster-mode-message-queue-configuration) |
 | Cannot connect to Milvus service | Network or service issues | [Connection Refused](#connection-refused) |
 | Slow vector search performance | Index or resource issues | [Poor Search Performance](#poor-search-performance) |

@@ -697,287 +703,140 @@ Use this checklist to quickly identify and resolve common deployment issues:

 #### PodSecurity Admission Violations

-**Symptoms**: All Milvus components (operator, etcd, MinIO, Milvus) fail to create with PodSecurity errors:
+**Symptoms**: Milvus pods fail to create with PodSecurity errors:

 ```
 Error creating: pods is forbidden: violates PodSecurity "restricted:latest":
-- unrestricted capabilities (container must set securityContext.capabilities.drop=["ALL"])
-- seccompProfile (pod or container must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
-- allowPrivilegeEscalation != false (container must set securityContext.allowPrivilegeEscalation=false)
- runAsNonRoot != true (pod or container must set securityContext.runAsNonRoot=true)
 ```

-**Cause**: ACP clusters enforce PodSecurity "restricted:latest" policy. All Helm charts deployed by Milvus Operator require security context patches to comply with this policy.
+**Cause**: The Milvus custom resource is missing the `components.runAsNonRoot: true` setting.

-**Comprehensive Solution** - Patch all components:
+**Solution**: Add `components.runAsNonRoot: true` to your Milvus custom resource:

-```bash
-NAMESPACE="<namespace>"
-
-# 1. Patch Milvus Operator
-kubectl patch deployment milvus-operator -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/securityContext/seccompProfile",
-    "value": {"type": "RuntimeDefault"}
-  },
-  {
-    "op": "add",
-    "path": "/spec/template/spec/containers/0/securityContext/capabilities",
-    "value": {"drop": ["ALL"]}
-  }
-]'
-
-# 2. Patch etcd StatefulSet
-kubectl patch statefulset milvus-<name>-etcd -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/securityContext",
-    "value": {"runAsNonRoot": true, "seccompProfile": {"type": "RuntimeDefault"}}
-  },
-  {
-    "op": "add",
-    "path": "/spec/template/spec/containers/0/securityContext",
-    "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000}
-  },
-  {
-    "op": "add",
-    "path": "/spec/template/spec/securityContext/fsGroup",
-    "value": 1000
-  }
-]'
-
-# 3. Patch MinIO Deployment
-kubectl patch deployment milvus-<name>-minio -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/securityContext",
-    "value": {"fsGroup": 1000, "runAsNonRoot": true, "seccompProfile": {"type": "RuntimeDefault"}}
-  },
-  {
-    "op": "add",
-    "path": "/spec/template/spec/containers/0/securityContext",
-    "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000}
-  }
-]'
-
-# 4. Patch Milvus Standalone/Cluster Deployment
-kubectl patch deployment milvus-<name>-milvus-standalone -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/securityContext",
-    "value": {"runAsNonRoot": true, "fsGroup": 1000, "seccompProfile": {"type": "RuntimeDefault"}}
-  }
-]'
-
-kubectl patch deployment milvus-<name>-milvus-standalone -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/containers/0/securityContext",
-    "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000}
-  }
-]'
-
-kubectl patch deployment milvus-<name>-milvus-standalone -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/containers/1/securityContext",
-    "value": {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"]}, "runAsNonRoot": true, "runAsUser": 1000}
-  }
-]'
-
-# 5. Patch initContainer (config) for Milvus Standalone
-kubectl patch deployment milvus-<name>-milvus-standalone -n $NAMESPACE --type='json' -p='
-[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/initContainers[0]/securityContext/allowPrivilegeEscalation",
-    "value": false
-  },
-  {
-    "op": "add",
-    "path": "/spec/template/spec/initContainers[0]/securityContext/capabilities",
-    "value": {"drop": ["ALL"]}
-  }
-]'
-```
+```yaml
+spec:
+  components:
+    runAsNonRoot: true # Required for PodSecurity compliance
+    image: build-harbor.alauda.cn/middleware/milvus:v2.6.7
+```

+The Milvus operator will automatically apply all required security contexts:
+- `runAsNonRoot: true` (pod and container level)
+- `runAsUser: 1000` (matching the upstream default)
+- `allowPrivilegeEscalation: false`
+- `capabilities.drop: [ALL]`
+- `seccompProfile.type: RuntimeDefault`
+
+This applies to:
+- Milvus standalone/cluster deployments
+- Init containers (config)
+- etcd StatefulSets
+- MinIO deployments
+
 **Verification**:

 ```bash
 # Check if all pods are running
-kubectl get pods -n $NAMESPACE
+kubectl get pods -n <namespace>

 # Verify security contexts are applied
-kubectl get pod -n $NAMESPACE -o jsonpath='{.spec.securityContext}'
-kubectl get pod -n $NAMESPACE -o jsonpath='{.spec.containers[*].securityContext}'
+kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.securityContext}'
+kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].securityContext}'
 ```
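+
+If pods are still rejected after setting `runAsNonRoot: true`, the admission events name the exact violating field. A quick way to surface them (replace the namespace placeholder with your own):
+
+```bash
+# Show recent pod-creation failures, which include PodSecurity violations
+kubectl get events -n <namespace> --field-selector reason=FailedCreate \
+  --sort-by=.lastTimestamp | grep -i podsecurity
+```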
-> **Important**: These patches must be applied after each deployment. For a permanent solution, the Helm charts need to be updated to include these security contexts by default.
-
-> **Known Issue**: Some containers may fail with permission errors when running as non-root (user 1000). This is a known limitation that requires chart updates to resolve properly.
-
-#### Image Pull Authentication Errors
-
-**Symptoms**: Pods fail with `ErrImagePull` or `ImagePullBackOff` status.
-
-**Error Message**:
-```
-Failed to pull image "build-harbor.alauda.cn/middleware/milvus-operator:v1.3.5":
-authorization failed: no basic auth credentials
-```
-
-**Cause**: The image registry requires authentication or the registry address is incorrect for your cluster.
-
-**Solutions**:
-
-1. **Check if the registry address is correct for your cluster**:
-   ```bash
-   # Describe the pod to see which registry it's trying to pull from
-   kubectl describe pod <pod-name> -n <namespace> | grep "Image:"
-   ```
-
-2. **If the registry address is wrong**, patch the deployment with the correct registry:
-   ```bash
-   # Replace registry.alauda.cn:60070 with your cluster's registry
-   kubectl patch deployment milvus-operator -n <namespace> --type='json' -p='
-   [
-     {
-       "op": "replace",
-       "path": "/spec/template/spec/containers/0/image",
-       "value": "registry.alauda.cn:60070/middleware/milvus-operator:v1.3.5"
-     }
-   ]'
-   ```
-
-3. **If authentication is required**, create an image pull secret:
-   ```bash
-   # Create a docker-registry secret
-   kubectl create secret docker-registry harbor-pull-secret \
-     --docker-server=<registry-address> \
-     --docker-username=<username> \
-     --docker-password=<password> \
-     -n <namespace>
-
-   # Patch the deployment to use the secret
-   kubectl patch deployment milvus-operator -n <namespace> --type='json' -p='
-   [
-     {
-       "op": "add",
-       "path": "/spec/template/spec/imagePullSecrets",
-       "value": [{"name": "harbor-pull-secret"}]
-     }
-   ]'
-   ```
-
-4. **For existing Milvus deployments**, update image references in the Milvus CR:
-   ```yaml
-   spec:
-     components:
-       image: registry.alauda.cn:60070/middleware/milvus:v2.6.7
-     dependencies:
-       etcd:
-         inCluster:
-           values:
-             image:
-               repository: registry.alauda.cn:60070/middleware/etcd
-       storage:
-         inCluster:
-           values:
-             image:
-               repository: registry.alauda.cn:60070/middleware/minio
-   ```
-
-> **Note**: Different ACP clusters may use different registry addresses. Common registries include:
-> - `build-harbor.alauda.cn` - Default in documentation
-> - `registry.alauda.cn:60070` - Private registry
-> - Other custom registries per cluster configuration
-#### etcd Invalid Image Name Error
-
-**Symptoms**: etcd pods fail to start with invalid image reference format error:
-
-```
-Error: failed to pull and unpack image "docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1":
-failed to resolve reference "docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1":
-"docker.io/registry.alauda.cn:60070/middleware/etcd:3.5.25-r1": invalid reference format
-```
-
-**Cause**: The etcd Helm chart automatically prepends "docker.io/" to the image repository, creating an invalid image reference when using a custom registry.
-
-**Solution**: Patch the etcd StatefulSet to use the correct image reference without the "docker.io/" prefix:
-
-```bash
-# Patch the etcd StatefulSet image
-kubectl patch statefulset milvus-<name>-etcd -n <namespace> --type='json' -p='
-[
-  {
-    "op": "replace",
-    "path": "/spec/template/spec/containers/0/image",
-    "value": "registry.alauda.cn:60070/middleware/etcd:3.5.25-r1"
-  }
-]'
-```
-
-Replace `registry.alauda.cn:60070` with your cluster's registry address.
-
+#### Milvus Standalone Pod Crashes (Exit Code 134)
+
+**Symptoms**: The Milvus standalone pod repeatedly crashes with exit code 134 (SIGABRT).
+
+**Cause**: This was a known compatibility issue with Milvus v2.6.7 when running under PodSecurity "restricted" policies. The issue has been fixed in updated Milvus operator images.
+
+**Solution**:
+
+1. Ensure you are using the updated Milvus operator image (v1.3.5-6e82465e or later)
+2. Add `components.runAsNonRoot: true` to your Milvus custom resource:
+
+```yaml
+spec:
+  components:
+    runAsNonRoot: true # Required for PodSecurity compliance
+    image: build-harbor.alauda.cn/middleware/milvus:v2.6.7
+```
+
+3. Delete and recreate the Milvus CR if you previously deployed without this setting:
+
+```bash
+kubectl delete milvus <name> -n <namespace>
+kubectl apply -f <milvus-cr>.yaml
+```
+
+The operator will automatically handle all PodSecurity requirements when `runAsNonRoot: true` is set.
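+
+To confirm step 1, you can read back the image the operator is actually running; the `milvus-operator` deployment name follows the examples in this guide, and the namespace placeholder is yours to fill in:
+
+```bash
+# Print the operator image currently in use
+kubectl get deployment milvus-operator -n <namespace> \
+  -o jsonpath='{.spec.template.spec.containers[0].image}'
+```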
-#### Milvus Standalone Pod Crashes (Exit Code 134)
-
-**Symptoms**: Milvus standalone pod repeatedly crashes with exit code 134 (SIGABRT) and logs show:
-
-```
-check health failed] [UnhealthyComponent="[querynode,streamingnode]"]
-Set runtime dir at /run/milvus failed, set it to /tmp/milvus directory
-```
-
-**Cause**: This is a known compatibility issue with Milvus v2.6.7 when running under PodSecurity "restricted" policies:
-
-1. **Health Check Issue**: The `/healthz` endpoint returns HTTP 500 because it checks for cluster-mode components (querynode, streamingnode) that don't exist in standalone mode
-2. **Non-Root Compatibility**: The container has permission issues running as non-root user (UID 1000) as required by PodSecurity policies
-3. **Operator Management**: Manual patches to the deployment are reverted by the Milvus operator
-
-**Workarounds**:
-
-**Option 1**: Use cluster mode instead of standalone mode (Recommended)
-- Cluster mode has better compatibility with ACP's PodSecurity policies
-- The health check properly detects cluster components
-
-**Option 2**: Request namespace exemption from PodSecurity policies (for testing only):
-```bash
-# Remove PodSecurity enforcement (NOT recommended for production)
-kubectl label namespace <namespace> pod-security.kubernetes.io/enforce-
-kubectl label namespace <namespace> pod-security.kubernetes.io/enforce=privileged
-```
-
-Then delete and recreate the Milvus CR.
-
-**Option 3**: Disable readiness/startup probes (temporary workaround):
-```bash
-# Patch deployment to use TCP probe instead of HTTP probe
-kubectl patch deployment milvus-<name>-milvus-standalone -n <namespace> --type=json -p='
-[
-  {
-    "op": "replace",
-    "path": "/spec/template/spec/containers/0/readinessProbe",
-    "value": {
-      "tcpSocket": {"port": 9091},
-      "initialDelaySeconds": 30,
-      "periodSeconds": 15,
-      "timeoutSeconds": 3,
-      "failureThreshold": 10
-    }
-  }
-]'
-```
-
-> **Note**: The operator may revert this change. Monitor and re-apply if needed.
-
-> **Known Issue**: This is a documented limitation in Milvus v2.6.7. Future versions should address the non-root compatibility and health check issues. For production deployments, use cluster mode or wait for updated charts.
+#### Cluster Mode Message Queue Configuration
+
+**Symptoms**: Milvus cluster component pods (mixcoord, datanode, proxy, querynode) panic with the following error:
+
+```
+panic: mq rocksmq is only valid in standalone mode
+```
+
+**Cause**: The Milvus custom resource is configured with `msgStreamType: rocksmq` for cluster mode. The `rocksmq` message stream type is only valid in standalone mode; for cluster mode, you must use `woodpecker` instead.
+
+**Solution**: Change `dependencies.msgStreamType` from `rocksmq` to `woodpecker`:
+
+**Incorrect (for cluster mode)**:
+```yaml
+spec:
+  dependencies:
+    msgStreamType: rocksmq # WRONG - only for standalone mode
+```
+
+**Correct (for cluster mode)**:
+```yaml
+spec:
+  dependencies:
+    msgStreamType: woodpecker # Use woodpecker for cluster mode
+```
+
+**Complete cluster mode example with Woodpecker**:
+```yaml
+apiVersion: milvus.io/v1beta1
+kind: Milvus
+metadata:
+  name: milvus-cluster
+  namespace: milvus
+spec:
+  mode: cluster
+  components:
+    image: build-harbor.alauda.cn/middleware/milvus:v2.6.7
+    runAsNonRoot: true
+  dependencies:
+    msgStreamType: woodpecker # Woodpecker for cluster mode
+    etcd:
+      inCluster:
+        values:
+          image:
+            repository: build-harbor.alauda.cn/middleware/etcd
+            tag: 3.5.25-r1
+          replicaCount: 3
+    storage:
+      inCluster:
+        values:
+          image:
+            repository: build-harbor.alauda.cn/middleware/minio
+            tag: RELEASE.2024-12-18T13-15-44Z
+```
+
+After correcting the configuration, delete and recreate the Milvus instance:
+
+```bash
+kubectl delete milvus <name> -n <namespace>
+kubectl apply -f <milvus-cr>.yaml
+```
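+
+After the instance is recreated, you can read the effective setting straight from the CR to confirm the fix took hold; placeholders as above:
+
+```bash
+# Should print "woodpecker" for cluster mode
+kubectl get milvus <name> -n <namespace> \
+  -o jsonpath='{.spec.dependencies.msgStreamType}'
+```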
+**Message Queue Type Reference**:
+- **Standalone mode**: Use `msgStreamType: rocksmq` (or omit it; the default is rocksmq)
+- **Cluster mode**: Use `msgStreamType: woodpecker`
+- **External Kafka/Pulsar**: Use `dependencies.pulsar.external.endpoint` with the appropriate scheme

 #### PVC Pending - Storage Class Binding Mode

@@ -1164,7 +1023,18 @@ milvus-standalone-milvus-standalone-6b8c9d   1/1     Running   0          3m

 # kubectl get milvus -n milvus
 NAME                MODE         STATUS    Updated
-milvus-standalone   standalone   Ready     True
+milvus-standalone   standalone   Healthy   True
+```
+
+**Verify PodSecurity Compliance**:
+
+```bash
+# Check pod security contexts (all should show PodSecurity-compliant settings)
+kubectl get pod milvus-standalone-milvus-standalone-<pod-suffix> -n milvus -o jsonpath='{.spec.securityContext}'
+# Output should include: {"runAsNonRoot":true,"runAsUser":1000}
+
+kubectl get pod milvus-standalone-milvus-standalone-<pod-suffix> -n milvus -o jsonpath='{.spec.containers[0].securityContext}'
+# Output should include: {"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"seccompProfile":{"type":"RuntimeDefault"}}
 ```

 ### Diagnostic Commands

@@ -1235,15 +1105,16 @@ kubectl exec -it <pod-name> -n milvus -- nc -zv <kafka-broker> 9092

 **Milvus Deployment:**
 - `mode`: Deployment mode (standalone, cluster)
 - `components.image`: Milvus container image
+- `dependencies.msgStreamType`: Message queue type; one of `woodpecker` (recommended, embedded), `pulsar` (external), or `kafka` (external)
 - `dependencies.etcd`: etcd configuration for metadata
 - `dependencies.storage`: Object storage configuration
-- `dependencies.pulsar`: Message queue configuration (field named `pulsar` for historical reasons, supports both Pulsar and Kafka)
+- `dependencies.pulsar`: External message queue configuration (field named `pulsar` for historical reasons, supports both Pulsar and Kafka)
 - `config.milvus`: Milvus-specific configuration

 **Message Queue Options:**
-- **Woodpecker**: Embedded WAL enabled by default in Milvus 2.6+, uses object storage
-- **Kafka**: External Kafka service, set `pulsar.external.endpoint` to Kafka broker with `kafka://` scheme (e.g., `kafka://kafka-broker.kafka.svc.cluster.local:9092`)
-- **Pulsar**: External Pulsar service, set `pulsar.external.endpoint` to Pulsar broker with `pulsar://` scheme (e.g., `pulsar://pulsar-broker.pulsar.svc.cluster.local:6650`)
+- **Woodpecker** (`msgStreamType: woodpecker`): Embedded WAL in Milvus 2.6+; uses object storage; supports both standalone and cluster modes
+- **Kafka** (via `pulsar.external.endpoint`): External Kafka service; set the endpoint to `kafka://kafka-broker.kafka.svc.cluster.local:9092`
+- **Pulsar** (via `pulsar.external.endpoint`): External Pulsar service; set the endpoint to `pulsar://pulsar-broker.pulsar.svc.cluster.local:6650`

> **Important**: The CRD field is named `pulsar` for backward compatibility, but you can configure either Pulsar or Kafka by using the appropriate endpoint scheme (`pulsar://` or `kafka://`).
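+
+For reference, the same dependency block pointed at an external Pulsar broker instead; a minimal sketch using the example endpoint from the note above, with the field shape following this guide's parameter reference:
+
+```yaml
+spec:
+  dependencies:
+    # pulsar:// scheme selects Pulsar as the external message queue
+    pulsar:
+      external:
+        endpoint: pulsar://pulsar-broker.pulsar.svc.cluster.local:6650
+```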