Senior Site Reliability Engineer - Amman, Jordan - Quadcode

    Posted 3 weeks ago

    Description

    Senior Site Reliability Engineer

    Tech stack:

    • OS: Linux Ubuntu;
    • Web server: Nginx;
    • Monitoring: Grafana, Prometheus, Graylog, Jaeger;
    • CI/CD: Jenkins, Git, GitLab, Docker;
    • Automation: Python, Bash;
    • SCM: Ansible, Chef;
    • IaC: Terraform, Pulumi;
    • DB: PostgreSQL, Redis, KeyDB, MySQL;
    • Cloud: OpenStack, AWS, GCP, DO.

    Examples of first tasks in the role:

    • Review of processes, platform, and infrastructure;
    • Implementation of Grafana OnCall;
    • Review and, if needed, rework of ITSM processes.

    Responsibilities in the role:

    • Identification of bottlenecks and preparation of recommendations to improve the reliability of services;
    • Responding to platform emergencies, isolating and resolving the causes of failures, and writing postmortem reports;
    • Development of monitoring and alerting tools that ensure high availability and quick detection of potential issues (Grafana, Grafana OnCall, Prometheus Alertmanager, etc.);
    • Active participation in change management processes, including assessment and coordination of changes to the infrastructure within Change Advisory Board (CAB) sessions;
    • Implementation and support of ITSM processes to optimize team workflow and enhance service quality;
    • Development and maintenance of up-to-date documentation.

    Requirements:

    • 3+ years of experience in SRE/DevOps;
    • Understanding of SRE principles, practical experience in implementing SRE practices;
    • Understanding of principles and practical experience in building resilient systems;
    • Experience with monitoring and logging systems (Prometheus, Graylog, Grafana);
    • Experience with automation tools for software build and deployment (CI/CD): GitLab, Jenkins;
    • Understanding of virtualization and containerization principles;
    • Understanding of Infrastructure as Code (IaC) approaches and hands-on experience applying them;
    • Proficiency in a programming language for developing automation scripts (Python, Node.js, Go, etc.), and the ability to read and understand service code;
    • Understanding of network protocols, topologies, and network models;
    • Experience with configuration management tools: Ansible, Chef;
    • Basic experience with relational databases, such as PostgreSQL;
    • Experience in administering Linux operating systems;
    • Fluency in English and Russian (B2 minimum).

    As an advantage:

    • Experience in implementing monitoring and logging systems from scratch;
    • Experience with Kubernetes (k8s) and OpenStack;
    • Advanced programming skills in any language.