Senior Systems Reliability Engineer
Mountain View, CA, USA
Systems Reliability Engineer (Technical Support):
About Us:
ThoughtSpot is an AI-powered analytics platform that enables users to explore and analyze data through natural language queries, making insights accessible to all. Our mission is to deliver reliable, high-performing applications that empower our customers.
The Role:
As part of the ThoughtSpot SRE team, you will be on the cutting edge of operational intelligence. You will not only ensure service reliability but also act as a trusted partner for our customers — proactively leveraging AI/ML to deliver timely updates, meaningful solutions, and predictive improvements. You are the bridge between our customers and engineering, combining deep systems expertise with a genuine passion for customer success. If you thrive in dynamic environments and are committed to building resilient, self-optimizing systems, this role is for you.
What You'll Do:
Technical & Customer Support:
- Act as the primary point of contact for customer-facing technical issues related to our SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges.
- Understand and empathize with the challenges ThoughtSpot users face, offering tailored solutions to improve their experience.
- Provide timely, accurate, and clear updates to customers, consistently meeting SLAs and driving issues through to full resolution via tickets and calls.
- Translate complex technical issues into clear, concise updates for both technical and non-technical stakeholders.
- Create and maintain knowledge-base articles to empower customer self-service and improve support efficiency.
System Reliability & Monitoring:
- Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk.
- Monitor system health and performance through metrics, logs, and dashboards to detect and prevent issues proactively.
- Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce Mean Time to Resolution (MTTR).
- Understand and apply NetOps and SecOps principles for cloud and on-premises deployments.
- Develop and implement automation and best practices to streamline operations and strengthen system reliability.
- Optimize SRE workflows with AI tools to boost operational effectiveness.
Incident Management & Continuous Improvement:
- Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement.
- Work cross-functionally with Engineering to define and implement tools that enhance debuggability, supportability, availability, scalability, and performance.
- Build deep expertise in both cloud and on-premises infrastructure, and use it to develop automation and best practices for each environment.
What You'll Bring:
- B.S. in Computer Science or equivalent relevant experience.
- Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP).
- Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk.
- Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation.
- Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses.
- Strong problem-solving and algorithmic thinking with a solid understanding of system internals.
- Excellent verbal and written communication skills with the ability to work independently and cross-functionally in fast-paced environments.
- Familiarity with scripting and programming languages such as Python, Go, Bash, or Java.
- Exposure to infrastructure and service monitoring frameworks, with the ability to analyze monitoring data to maintain high availability.
Good to Have:
- Experience partnering with Engineering to design and implement mission-critical tooling and automation that advances system debuggability, high availability, elastic scalability, and performance.
- Experience with alerting strategies and monitoring system tuning to minimize alert fatigue and optimize Mean Time to Acknowledge (MTTA).
- Familiarity with C/C++ or other low-level systems languages.
Ideal Candidate Profile:
You have a balanced mix of technical expertise in cloud operations and a proven record of handling support incidents and end-user queries. This sets you apart from candidates with purely systems or cloud engineering backgrounds. You move fluidly between deep technical investigation and customer-facing communication — equally at home diagnosing a complex infrastructure issue and presenting findings clearly to an enterprise stakeholder.
What We Offer:
- Competitive salary and benefits package.
- Opportunities for professional growth and career advancement.
- A collaborative work environment where your input and expertise directly impact customer experience and platform reliability.
If you're ready to leverage your technical skills in a role that directly influences customer success and BI user satisfaction, we'd love to hear from you.