Cluster Autoscaler: How It Works and Solving Common Problems

What Is Cluster Autoscaler?

Kubernetes provides several mechanisms for scaling workloads. The three main ones are the Vertical Pod Autoscaler (VPA), the Horizontal Pod Autoscaler (HPA), and the Cluster Autoscaler (CA).

CA automatically adjusts the number of nodes in the cluster to match demand. When pods are pending because they cannot be scheduled, which indicates that the cluster lacks resources, CA adds new nodes to the cluster. It can also reduce the number of nodes if they remain underutilized for an extended period.

Cluster Autoscaler is typically installed as a Deployment object in the cluster. It runs with a single active replica and uses leader election to ensure high availability.

How Cluster Autoscaler Works

For simplicity, we'll explain the Cluster Autoscaler process in a scale-out scenario. When the number of pending (unschedulable) pods in the cluster increases, indicating a lack of resources, CA automatically starts new nodes.

This occurs in four steps:

CA checks for pending pods, scanning at an interval of 10 seconds (configurable using the --scan-interval flag).
If there are pending pods, CA spins up new nodes to scale out the cluster, within the constraints configured by the administrator. CA integrates with public cloud platforms such as AWS and Azure, using their autoscaling capabilities to add more virtual machines.
Kubernetes registers the new virtual machines as nodes in the control plane, allowing the Kubernetes scheduler to run pods on them.
The Kubernetes scheduler assigns the pending pods to the new nodes.
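
The four steps above can be sketched as a toy simulation. This is a simplified illustration, not how CA is implemented: the `NodeGroup` class, the `pods_per_node` capacity figure, and both function names are assumptions made for the example, and node registration and scheduling (steps 3 and 4) are treated as instantaneous.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    size: int           # current number of nodes
    max_size: int       # upper bound configured by the administrator
    pods_per_node: int  # simplified capacity: pending pods that fit on one new node


def nodes_to_add(pending_pods: int, group: NodeGroup) -> int:
    """Steps 1-2: decide how many nodes to request, capped by the group's max size."""
    if pending_pods <= 0:
        return 0
    wanted = math.ceil(pending_pods / group.pods_per_node)
    headroom = max(group.max_size - group.size, 0)
    return min(wanted, headroom)


def scan_once(pending_pods: int, group: NodeGroup) -> int:
    """One scan iteration; returns the number of pods still pending afterwards."""
    added = nodes_to_add(pending_pods, group)
    group.size += added  # step 3: new VMs register as nodes (simulated)
    scheduled = min(pending_pods, added * group.pods_per_node)  # step 4 (simulated)
    return pending_pods - scheduled
```

With a group at 3 of 5 nodes and room for 4 pods per new node, `scan_once(10, group)` adds two nodes and leaves two pods pending, because the administrator's maximum size caps the scale-out.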

Diagnosing Issues with Cluster Autoscaler

CA is a useful mechanism, but it can sometimes work differently than the administrator expects. Here are the primary ways to diagnose an issue with CA:

Logs on control plane nodes

Kubernetes control plane nodes write logs of Cluster Autoscaler activity to the following path: /var/log/cluster-autoscaler.log

Events on control plane nodes

The kube-system/cluster-autoscaler-status ConfigMap emits the following events:

ScaledUpGroup - this event means CA increased the size of the node group (provides previous size and current size).
ScaleDownEmpty - this event means CA removed a node that did not have any user pods running on it (only system pods).
ScaleDown - this event means CA removed a node that had user pods running on it. The event includes the names of all pods that are rescheduled as a result.

Events on nodes

ScaleDown - this event means CA is scaling down the node. There can be multiple events, indicating different stages of the scale-down operation.
ScaleDownFailed - this event means CA tried to remove the node but did not succeed. It provides the resulting error message.

Events on pods

TriggeredScaleUp - this event means CA scaled up the cluster to enable this pod to schedule.
NotTriggerScaleUp - this event means CA was not able to scale up a node group to allow this pod to schedule.
ScaleDown - this event means CA tried to evict this pod from a node, in order to drain it and then scale it down.
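
When many events are present, it can help to filter them down to the CA event reasons named in this section. The sketch below assumes each event is a dictionary shaped like an item from `kubectl get events -o json`, with `reason` and `involvedObject` fields; the function name `group_ca_events` is hypothetical.

```python
from collections import defaultdict

# CA event reasons named in this section.
CA_EVENT_REASONS = {
    "ScaledUpGroup", "ScaleDownEmpty", "ScaleDown",
    "ScaleDownFailed", "TriggeredScaleUp", "NotTriggerScaleUp",
}


def group_ca_events(events):
    """Group CA events by (object kind, reason); values are the object names."""
    grouped = defaultdict(list)
    for event in events:
        reason = event.get("reason")
        if reason not in CA_EVENT_REASONS:
            continue  # ignore events unrelated to CA
        obj = event.get("involvedObject", {})
        grouped[(obj.get("kind"), reason)].append(obj.get("name"))
    return dict(grouped)
```

Grouping by object kind keeps the `ScaleDown` reason unambiguous, since the same reason name has a different meaning depending on the object it is attached to.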

Cluster Autoscaler: Troubleshooting for Specific Error Scenarios

Here are specific error scenarios that can occur with the Cluster Autoscaler, and how to perform initial troubleshooting.

These instructions will allow you to debug simple error scenarios, but for more complex errors involving multiple moving parts in the cluster, you might need automated troubleshooting tools.

Nodes with Low Utilization Are Not Scaled Down

Here are reasons why CA might fail to scale down a node, and what you can do about them.

Reason cluster doesn't scale down	What you can do
Pod specs indicate it should not be evicted from the node.	Check for settings that block eviction, such as the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation on the pod or a restrictive PodDisruptionBudget, and relax them if the pod can safely move.
Node group already has the minimum size.	Reduce the minimum size in the CA configuration.
The node has the "scale-down disabled" annotation.	Remove the annotation from the node.
CA is waiting for the duration specified in one of these flags: `--scale-down-unneeded-time`, `--scale-down-delay-after-add`, `--scale-down-delay-after-failure`, `--scale-down-delay-after-delete`, `--scan-interval`	Reduce the time specified in the relevant flag, or wait the specified time after the relevant event.
Failed attempt to remove the node (CA will wait another 5 minutes before trying again).	Wait 5 minutes and see if the issue is resolved.
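
The table above can be read as a checklist. The following sketch walks through it for a single node; it is an illustration only, not CA's actual decision logic. The `NodeState` fields, the function name, and the default durations are assumptions (10 minutes stands in for `--scale-down-unneeded-time`, and 5 minutes is the retry delay mentioned in the table).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class NodeState:
    unneeded_since: datetime     # when the node was first seen as underutilized
    scale_down_disabled: bool    # the "scale-down disabled" annotation is set
    has_unevictable_pod: bool    # some pod on the node must not be evicted
    last_failed_removal: Optional[datetime] = None


def scale_down_blockers(node, group_size, group_min, now,
                        unneeded_time=timedelta(minutes=10),  # assumed value
                        retry_delay=timedelta(minutes=5)):    # from the table
    """Return the reasons, in table order, that keep this node from being removed."""
    blockers = []
    if node.has_unevictable_pod:
        blockers.append("pod cannot be evicted")
    if group_size <= group_min:
        blockers.append("node group at minimum size")
    if node.scale_down_disabled:
        blockers.append("scale-down disabled annotation")
    if now - node.unneeded_since < unneeded_time:
        blockers.append("waiting for unneeded time")
    if node.last_failed_removal and now - node.last_failed_removal < retry_delay:
        blockers.append("waiting after failed removal")
    return blockers
```

An empty list means none of the listed reasons applies, so the cause is likely elsewhere.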

Pending Pods Exist But Cluster Does Not Scale Up

Here are reasons why CA might fail to scale up the cluster, and what you can do about them.

Reason cluster doesn't scale up	What you can do
Existing pods have high resource requests, which won't be satisfied by new nodes.	Enable CA to add larger nodes, or reduce the resource requests of the pods.
All suitable node groups are at maximum size.	Increase the maximum size of the relevant node group.
Existing pods are not able to schedule on new nodes due to selectors or other settings.	Modify pod manifests to enable the pods to schedule on the new nodes.
NoVolumeZoneConflict error - indicates that a StatefulSet needs to run in the same zone as a PersistentVolume (PV), but that zone has already reached its scaling limit.	From Kubernetes 1.13 onwards, you can run separate node groups per zone and use the --balance-similar-node-groups flag to keep them balanced across zones.
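
The first three rows of the table can be expressed as a small check of one pending pod against one node group. This is a minimal sketch under stated assumptions: the `PendingPod` and `GroupTemplate` classes and their fields are hypothetical, resources are reduced to CPU cores and GiB of memory, and the NoVolumeZoneConflict case is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PendingPod:
    cpu_request: float     # cores
    memory_request: float  # GiB
    node_selector: dict = field(default_factory=dict)


@dataclass
class GroupTemplate:
    size: int
    max_size: int
    cpu: float             # allocatable cores on a new node
    memory: float          # allocatable GiB on a new node
    labels: dict = field(default_factory=dict)


def scale_up_blocker(pod, group):
    """Return why `group` cannot help `pod`, or None if a new node would fit it."""
    if pod.cpu_request > group.cpu or pod.memory_request > group.memory:
        return "pod requests exceed node capacity"
    if any(group.labels.get(k) != v for k, v in pod.node_selector.items()):
        return "node selector does not match"
    if group.size >= group.max_size:
        return "node group at maximum size"
    return None
```

If every node group returns a blocker for a pod, no scale-up can make that pod schedulable, which matches the situation the table describes.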

Cluster Autoscaler Stops Working

If CA appears to have stopped working, follow these steps to debug the problem:

Check if CA is running - you can check the latest events emitted by the kube-system/cluster-autoscaler-status ConfigMap. The most recent one should be no more than 3 minutes old.
Check if the cluster and node groups are in a healthy state - this should also be reported by the ConfigMap.
Check if there are unready nodes (CA version 1.24 and later) - if some nodes appear unready, check the resourceUnready count. If any nodes are marked as resourceUnready, the problem is likely with a device driver failing to install a required hardware resource.
If both the cluster and CA are healthy, check:
- Nodes with low utilization - if these nodes are not being scaled down, see the section above on nodes with low utilization.
- Pending pods that do not trigger a scale-up - see the section above on pending pods that do not scale up the cluster.
- Control plane CA logs - could indicate what problem is preventing CA from scaling up or down, why it cannot remove a pod, or what the scale-up plan was.
- CA events on the pod object - could provide clues as to why CA could not reschedule the pod.
- Cloud provider resource quota - if there are failed attempts to add nodes, the problem could be a resource quota with the public cloud provider.
- Networking issues - if the cloud provider is managing to create nodes but they are not connecting to the cluster, this could indicate a networking issue.
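
The order of the checks above can be captured in a short triage helper. This is only a sketch: it assumes the last status update time, the health flag, and the resourceUnready count have already been read from the status ConfigMap by some other means, and the function names and return strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def status_is_fresh(last_update, now=None, max_age=timedelta(minutes=3)):
    """True if the last CA status update is no older than max_age (3 minutes)."""
    now = now or datetime.now(timezone.utc)
    return now - last_update <= max_age


def triage(last_update, cluster_healthy, resource_unready, now=None):
    """Return the first debugging step that applies, in the order given above."""
    if not status_is_fresh(last_update, now):
        return "CA may not be running"
    if not cluster_healthy:
        return "cluster or node groups are unhealthy"
    if resource_unready > 0:
        return "nodes are resourceUnready"
    return "check utilization, pending pods, logs, events, quota, and networking"
```

Both timestamps should be timezone-aware (or both naive) so that the subtraction is valid.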