Do you dream in code and fix problems for fun? Are you passionate about Artificial Intelligence and pushing the boundaries of what’s possible?
If so, then NVIDIA’s Senior Site Reliability Engineer (SRE) – DGX Cloud position could be your perfect match! This exciting role offers the chance to join a world-class team and play a pivotal role in building and maintaining the infrastructure that powers cutting-edge AI research and development.
The Lowdown on SRE at NVIDIA
Site Reliability Engineering (SRE) is a unique blend of software engineering and systems administration that focuses on keeping large-scale systems running smoothly and efficiently. Here at NVIDIA, our SRE team is the backbone of our DGX Cloud, a powerful platform that allows developers to access the computing power they need to tackle the most challenging AI projects.
What You’ll Be Doing:
- You’ll be a master architect, designing, implementing, and supporting massive Kubernetes clusters, keeping them performing at their peak and monitored in real time.
- From concept to launch and beyond, you’ll be involved in the entire lifecycle of our services, constantly refining and improving their functionality.
- Before services go live, you’ll be a trusted advisor, providing invaluable expertise on system design, tool development, and capacity management.
- Once services are operational, you’ll become their guardian angel, monitoring availability, latency, and overall health with a keen eye.
- Sustainability is key! You’ll automate processes and champion changes that enhance system reliability and development speed.
- When the unexpected strikes, you’ll be part of our on-call rotation, ensuring a smooth and efficient incident response.
The Skills You Bring to the Table:
- A BS degree in Computer Science or a related technical field (think physics or math), or equivalent experience that proves your coding prowess, is a must.
- 5+ years of experience is crucial, with a proven track record in infrastructure automation, distributed system design, and building tools for large-scale production environments (cloud or private).
- Python, Go, Perl, or Ruby – you speak at least one of these programming languages fluently.
- When it comes to Linux, networking, and containers, you’re a walking encyclopedia.
What Makes You Shine Brighter?
- You have a burning desire to tinker with, analyze, and fix large-scale distributed systems.
- You’re a problem-solving whiz with excellent communication skills and a strong sense of ownership.
- Debugging and optimizing code? Automating routine tasks? These are your superpowers!
- Experience with Kubernetes, OpenStack, and Docker? Bonus points!
Why NVIDIA?
NVIDIA is a dream destination for many in the tech world, attracting some of the most innovative and dedicated minds around. Here, you’ll be surrounded by a culture that encourages creativity, autonomy, and a love for a good challenge.
Ready to Join the Revolution?
The base salary for this role is highly competitive, ranging from $148,000 to $276,000 USD depending on your location, experience, and market value. Plus, you’ll enjoy a comprehensive benefits package that includes equity. The best part? Remote positions are available!
Don’t wait! NVIDIA is always looking for top talent, so apply today.