
Workinvirtual
We are seeking a highly proficient Site Reliability Engineer (SRE) to join our client’s distributed team. The SRE will be instrumental in maintaining and improving the reliability, performance, and scalability of our client’s production systems. This role demands a proactive approach to automation, incident response, and performance optimization, coupled with seamless collaboration with development and operations teams.
Key Responsibilities:
- System Reliability and Availability:
  - Architect and maintain highly available and scalable infrastructure solutions.
  - Implement comprehensive monitoring and alerting systems to ensure proactive issue identification.
  - Develop and enforce strategies to maximize system uptime and minimize service disruptions.
  - Participate in an on-call rotation to address and resolve production incidents.
- Automation and Infrastructure as Code:
  - Automate infrastructure provisioning, configuration, and deployment using tools such as Terraform, Ansible, or CloudFormation.
  - Design and implement robust CI/CD pipelines utilizing tools like Jenkins, GitLab CI, or similar platforms.
  - Develop and maintain scripts to streamline repetitive operational tasks.
- Performance Engineering:
  - Conduct in-depth performance analysis to identify and address system bottlenecks.
  - Implement performance optimization strategies to enhance efficiency and reduce latency.
  - Perform capacity planning and ensure systems can accommodate future growth.
- Incident Management and Response:
  - Lead incident response efforts, ensuring timely resolution and effective communication.
  - Conduct thorough post-incident reviews to identify root causes and implement corrective measures.
  - Develop and maintain comprehensive incident response plans and procedures.
- Collaborative Teamwork and Communication:
  - Foster close collaboration with development and operations teams to ensure seamless deployments and issue resolution.
  - Communicate effectively with stakeholders regarding system performance and incident status.
  - Maintain comprehensive documentation for processes and procedures.
- Security Best Practices:
  - Implement and enforce security best practices across infrastructure and applications.
  - Collaborate with security teams to address vulnerabilities and ensure compliance.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- Minimum of 5 years of experience in a Site Reliability Engineering or DevOps role.
- Proficiency with cloud platforms (AWS, Azure, GCP).
- Strong scripting skills in languages such as Python or Bash.
- Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack).
- Solid understanding of networking and security principles.
- Exceptional problem-solving and troubleshooting abilities.
- Excellent communication and collaboration skills.
Preferred Qualifications:
- Experience with infrastructure-as-code tools (Terraform, CloudFormation).
- Experience with configuration management tools (Ansible, Chef, Puppet).
- Experience with database administration (MySQL, PostgreSQL).
- Experience with serverless architecture.
Apply Now: https://workinvirtual.com/application-tracking-system/