Senior Systems Reliability Engineer

ThoughtSpot

Mountain View, CA, USA

Posted on Jun 30, 2026

Systems Reliability Engineer (Technical Support)

About Us:

ThoughtSpot is an AI-powered analytics platform that enables users to explore and analyze data through natural language queries, making insights accessible to all. Our mission is to deliver reliable, high-performing applications that empower our customers.

The Role:

As part of the ThoughtSpot SRE team, you will be on the cutting edge of operational intelligence. You will not only ensure service reliability but also act as a trusted partner for our customers — proactively leveraging AI/ML to deliver timely updates, meaningful solutions, and predictive improvements. You are the bridge between our customers and engineering, combining deep systems expertise with a genuine passion for customer success. If you thrive in dynamic environments and are committed to building resilient, self-optimizing systems, this role is for you.

What You'll Do:

Technical & Customer Support:

  • Act as the primary point of contact for customer-facing technical issues related to our SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges.
  • Understand and empathize with the challenges ThoughtSpot users face, offering tailored solutions to improve their experience.
  • Provide timely, accurate, and clear updates to customers, consistently meeting SLAs and driving issues through to full resolution via tickets and calls.
  • Translate complex technical issues into clear, concise updates for both technical and non-technical stakeholders.
  • Create and maintain knowledge-base articles to empower customer self-service and improve support efficiency.

System Reliability & Monitoring:

  • Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk.
  • Monitor system health and performance through metrics, logs, and dashboards to detect and prevent issues proactively.
  • Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce Mean Time to Resolution (MTTR).
  • Understand and apply NetOps and SecOps principles for cloud and on-premises deployments.
  • Develop and implement automation and best practices to streamline operations and strengthen system reliability.
  • Optimize SRE workflows with AI tools to boost operational effectiveness.

Incident Management & Continuous Improvement:

  • Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement.
  • Work cross-functionally with Engineering to define and implement tools that enhance debuggability, supportability, availability, scalability, and performance.
  • Build and maintain expertise in both cloud and on-premises infrastructure while developing automation and best practices for each.

What You'll Bring

  • B.S. in Computer Science or equivalent relevant experience.
  • Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP).
  • Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk.
  • Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation.
  • Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses.
  • Strong problem-solving and algorithmic thinking with a solid understanding of system internals.
  • Excellent verbal and written communication skills with the ability to work independently and cross-functionally in fast-paced environments.
  • Familiarity with scripting and programming languages such as Python, Go, Bash, or Java.
  • Exposure to infrastructure and service monitoring frameworks with the ability to analyze data to ensure high availability.

Good to Have

  • Experience partnering with Engineering to design and implement mission-critical tooling and automation that advances system debuggability, high availability, elastic scalability, and performance.
  • Experience with alerting strategies and monitoring system tuning to minimize alert fatigue and optimize Mean Time to Acknowledge (MTTA).
  • Familiarity with C/C++ or other low-level systems languages.

Ideal Candidate Profile

You have a balanced mix of technical expertise in cloud operations and a proven record of handling support incidents and end-user queries. This sets you apart from candidates with purely systems or cloud engineering backgrounds. You move fluidly between deep technical investigation and customer-facing communication — equally at home diagnosing a complex infrastructure issue and presenting findings clearly to an enterprise stakeholder.

What We Offer

  • Competitive salary and benefits package.
  • Opportunities for professional growth and career advancement.
  • A collaborative work environment where your input and expertise directly impact customer experience and platform reliability.

If you're ready to leverage your technical skills in a role that directly influences customer success and BI user satisfaction, we'd love to hear from you.