Senior Cloud Architect, SRE - DGX Cloud: Shaping the Future of Cloud Computing and AI Infrastructure

Remote Full-time
Join the Ranks of the World's Most Innovative Technology Company

NVIDIA is at the forefront of technological advancements, driving innovations in AI, computing, and beyond. We're seeking a highly skilled and experienced Senior Cloud Architect to join our DGX Cloud Site Reliability Engineering (SRE) team. As a Senior Cloud Architect, SRE - DGX Cloud, you will play a pivotal role in designing, building, and maintaining large-scale production systems that power NVIDIA's GPU cloud services. This is an exceptional opportunity to leverage your technical expertise, creativity, and passion for cloud computing to shape the future of AI infrastructure.

About the Role

The Senior Cloud Architect, SRE - DGX Cloud role is a key position within NVIDIA's SRE team, responsible for ensuring the reliability, efficiency, and scalability of our DGX Cloud solutions. As a Senior Cloud Architect, you will lead the technical architecture for DGX Cloud solutions on top of cloud service providers like AWS, GCP, Azure, and OCI. You will work closely with cross-functional teams to design, implement, and support operational and reliability aspects of large-scale GPU training clusters.

Key Responsibilities

Lead technical architecture for DGX Cloud solutions on top of cloud service providers like AWS, GCP, Azure, and OCI.

Provide fast and creative solutions for complex problems and write effective, clear, and reliable architecture specifications.

Design, implement, and support operational and reliability aspects of large-scale GPU training clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.

Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.

Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.

Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

Practice sustainable incident response and blameless postmortems.

Requirements and Qualifications

To be successful in this role, you should possess a strong technical background with a focus on cloud computing, distributed systems, and site reliability engineering. The ideal candidate will have:

Essential Qualifications

B.Sc./M.Sc./Ph.D. degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

8+ years of proven experience in cloud computing, distributed systems, or a related field.

Experience with infrastructure automation and distributed systems design, and with designing and developing tools for running large-scale private or public cloud systems in production.

Experience in one or more of the following: Python, Go.

In-depth knowledge of Linux, networking, and cloud native technologies.

Preferred Qualifications

Interest in crafting, analyzing, and fixing large-scale distributed systems.

Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

Ability to debug and optimize code and automate routine tasks.

Experience in using or running large private and public cloud systems based on Kubernetes or Slurm.

What We Offer

NVIDIA is committed to providing a comprehensive compensation and benefits package that reflects our employees' skills, experience, and contributions. The base salary range for this role is $220,000 - $419,750 USD. You will also be eligible for equity and benefits. We accept applications on an ongoing basis, so we encourage you to apply as soon as possible.
Our Culture and Work Environment

At NVIDIA, we pride ourselves on fostering a diverse and inclusive work environment that encourages creativity, innovation, and collaboration. Our SRE team is no exception, with a culture that values intellectual curiosity, problem-solving, and openness. We promote self-direction, allowing our engineers to work on meaningful projects while providing the support and mentorship needed to learn and grow. As a remote team, we offer the flexibility to work from anywhere, at any time, as long as you're committed to delivering exceptional results. We're committed to building a community that is diverse, inclusive, and respectful, where everyone can thrive and grow.

Career Growth and Development

At NVIDIA, we're committed to helping our employees grow and develop their careers. As a Senior Cloud Architect, SRE - DGX Cloud, you will have the opportunity to work on complex, challenging projects that will help you develop your technical skills and expertise. You will also have access to our comprehensive training and development programs, designed to help you stay up-to-date with the latest technologies and trends.

Join Our Team!

If you're a motivated, talented, and experienced Senior Cloud Architect looking to shape the future of cloud computing and AI infrastructure, we want to hear from you! Apply today to join our team and be part of a community that is driving innovation and excellence in the tech industry.

NVIDIA is an equal opportunity employer and welcomes applications from diverse candidates. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.