How do you securely deploy a large number of Kubernetes components in isolation?



  • Preface: I am not experienced in designing large-scale infrastructure deployments.

    The assumptions for the questions:

    • I have read that it is a good practice to host Kubernetes components in isolation from each other on the network as the network provides a layer of security control.

    • In a large K8S environment, you may run multiple Kubernetes deployments. Each deployment has components including etcd, the kube API server, the scheduler, the controller manager, etc.

    If we consider both points above, then the questions are:

    Q1a) How do you scale the Kubernetes administration/control plane? How do you scale from 1 etcd server to 10 etcd servers for example?

    Q1b) In a large organization where there are different business units, do you deploy one K8S instance (active/passive) for each business unit, or multiple K8S instances serving the entire organization?

    Q2) For each deployment method described in part (1b), how do you reconcile multiple instances of Kubernetes to get a master view in order to monitor all the instances of containers running on Kubernetes?



  • Q1a) How do you scale the Kubernetes administration/control plane? How do you scale from 1 etcd server to 10 etcd servers for example?

    You won't. etcd needs to be fast, especially with a growing cluster, and its raft consensus gets slower as you add members: every write must be acknowledged by a quorum, so more members means more latency, not more write capacity. Three members (five at most) is what you'll want. Usually, as a result, you would also deploy 3x kube-apiserver, controller-manager and scheduler pods.
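
    To make the topology concrete, here is a minimal sketch of bootstrapping a 3-member etcd cluster; the hostnames and IPs are placeholders, and in practice kubeadm or your distribution generates these flags for you:

    ```shell
    # Hypothetical 3-member etcd cluster (IPs are placeholders).
    # Every member gets the same --initial-cluster list; only --name
    # and the advertise/listen URLs differ per member.
    etcd --name etcd-1 \
      --initial-advertise-peer-urls https://10.0.0.1:2380 \
      --listen-peer-urls https://10.0.0.1:2380 \
      --advertise-client-urls https://10.0.0.1:2379 \
      --listen-client-urls https://10.0.0.1:2379 \
      --initial-cluster etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
      --initial-cluster-state new
    ```

    Each kube-apiserver replica then points at all three client URLs, so losing one etcd member costs you neither quorum nor API availability.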

    Q1b) In a large organization where there are different business units, do you deploy one K8S instance (active/passive) for each business unit, or multiple K8S instances serving the entire organization?

    Large organizations usually go with several clusters. Don't put all your eggs in the same basket: make sure you can upgrade some of your clusters without impacting the others, and let end-users implement their own disaster recovery by managing resources in "sibling" clusters -- a kind of active/passive setup, without designating either cluster as the passive one, where you instead plan for each of your clusters to be able to double in size overnight, should you need to fail over.

    Which doesn't mean they would be small clusters. You could easily have hundreds of workers (my current customer's largest cluster has between 350 and 400 nodes, with cluster-autoscaler adjusting the size based on requested resources). But as much as possible, you want to avoid beasts like those: the monitoring or logging stack would consume a lot, requiring larger nodes to host infra components, and operations on those components would become slower and more painful. Better to have two small clusters than one large one.

    Although here, automation is critical: your team probably can't afford to micro-manage 40, 80 or 200 clusters. You would have to figure out a way to roll out changes to your clusters with little effort, and consistently. This might involve tools such as Ansible, Terraform, Tekton or Argo CD.
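
    As one sketch of that kind of fan-out, Argo CD's ApplicationSet can stamp the same app onto every cluster it knows about; the repo URL and app name below are placeholders:

    ```yaml
    # Hypothetical ApplicationSet: one Application per cluster
    # registered in Argo CD, all synced from the same Git path.
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: monitoring-stack
    spec:
      generators:
        - clusters: {}            # enumerates every registered cluster
      template:
        metadata:
          name: 'monitoring-{{name}}'
        spec:
          project: default
          source:
            repoURL: https://git.example.com/infra/monitoring.git  # placeholder
            targetRevision: main
            path: manifests
          destination:
            server: '{{server}}'
            namespace: monitoring
          syncPolicy:
            automated: {}
    ```

    Adding cluster number 41 then just means registering it with Argo CD; the rollout happens on its own.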

    Q2) How do you reconcile multiple instances of Kubernetes to get a master view in order to monitor all the instances of containers running on Kubernetes?

    I wouldn't. The more you grow, the more metrics/data you collect, the more complicated it becomes to monitor everything from one point, and the more likely rule evaluation will eventually fall behind and miss alerts.

    Better to deploy one Prometheus (or more) per cluster, self-monitoring that cluster. Then pick two or three "ops" clusters, where you deploy another Prometheus whose job is to check that the Prometheuses in your other clusters work as expected (the server answers queries, Alertmanager is present, Alertmanager can relay alerts, ...). This could be based almost entirely on the blackbox exporter.
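
    A minimal sketch of that blackbox-exporter probe job in the "ops" Prometheus; the target URLs and exporter address are placeholders:

    ```yaml
    # Hypothetical scrape job: the ops Prometheus asks the blackbox
    # exporter to probe each per-cluster Prometheus health endpoint.
    scrape_configs:
      - job_name: probe-cluster-prometheus
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://prometheus.cluster-a.example.com/-/healthy  # placeholder
              - https://prometheus.cluster-b.example.com/-/healthy  # placeholder
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115  # placeholder exporter address
    ```

    An alert on `probe_success == 0` for a few minutes then tells you a cluster's monitoring went dark, without the ops Prometheus ingesting any of that cluster's actual metrics.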

    Configure your Alertmanagers to centralize alerts into a single place (Rocket.Chat, a Slack channel, tools such as Opsgenie).
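
    As a sketch, each cluster's Alertmanager could route everything to the same channel; the webhook URL and channel name are placeholders:

    ```yaml
    # Hypothetical Alertmanager config: all alerts fan in to one Slack channel,
    # grouped so each cluster/alert pair produces a single notification.
    route:
      receiver: ops-chat
      group_by: [cluster, alertname]
    receivers:
      - name: ops-chat
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX  # placeholder webhook
            channel: '#k8s-alerts'
    ```

    Attaching a `cluster` external label in each Prometheus makes the origin obvious once alerts from dozens of clusters land in the same place.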

    Going further, you could look at solutions such as Thanos to aggregate metrics from several Prometheuses, although you'll need lots of RAM to run the Thanos components, and a reliable S3 bucket gathering those metrics into one location. It's definitely not something I would recommend for monitoring purposes, but it can be nice for metrology and building cool Grafana dashboards.
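
    The S3 dependency boils down to one small file shared by the Thanos sidecar, store and compactor; bucket name, endpoint and credentials below are placeholders:

    ```yaml
    # Hypothetical Thanos object-store config (objstore.yml): every
    # Prometheus sidecar uploads its TSDB blocks to this one bucket.
    type: S3
    config:
      bucket: thanos-metrics          # placeholder bucket
      endpoint: s3.example.com        # placeholder endpoint
      access_key: PLACEHOLDER
      secret_key: PLACEHOLDER
    ```

    A central Thanos Query then fans out over the sidecars and the store gateway, which is what makes those cross-cluster Grafana dashboards possible.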



