Phil Liu: Proactive Building for New Infrastructure at SignalFx

September 2022

By: Terah Lyons

Technology has traveled a long way for cloud computing to become the industry norm–a fact easy to forget sitting here in 2022, when almost every application on the internet runs on cloud architecture. For that, we have several impactful companies, products, and founders to thank. These are the folks who helped build the musculoskeletal system of the internet: the largely unsung heroes of backend infrastructure development.

Phillip Liu is one such hero. It is said that necessity is the mother of invention, and few founders bear the adage out better than Phil. When he co-founded SignalFx in 2012, the infrastructure landscape was rapidly evolving, and Software as a Service (SaaS) was a newly emerging category of business in Silicon Valley.

SignalFx followed a classic entrepreneurship success formula: Phil saw a technology and market shift underway — the move from the datacenter to the cloud, combined with the move from monolithic backends to microservices architectures. Application performance monitoring (APM) tools were already a big business with many successful leaders — but Phil saw that the platform shift would create an opening to introduce new monitoring tools relevant to new applications built on a new architecture. He also had a technology-based insight about how that new need could be addressed, based on his work as the leader of Facebook’s Infrastructure-as-a-Service platform and as Chief Architect of Opsware, a pioneer in datacenter optimization. Based on a key market insight, a key technology insight, and an unbelievable amount of hard work, Phil and his co-founder Karthik grew SignalFx into a successful SaaS-based monitoring and analytics platform, and in 2019, Splunk acquired the company for $1 billion.

Phil is now back in the entrepreneurship ring, as founder and CEO of Trustero, a startup working to revolutionize SOC 2 compliance, using ML to automate part of the process (in which, full disclosure, Zetta is an investor).

Over a career spanning several decades in distributed systems–and several startups–Phil has learned to recognize important patterns in what it takes to successfully build and sell a SaaS product. He has also developed a great sense of humor, which is, apparently, a necessity when you are a deeply technical founder trying to patch the proverbial potholes of the internet.

Phil connected with us to talk about the earliest days of SignalFx: what observations led to the founding of the company back when he was an architect at Facebook; what lessons he has gleaned selling to engineers over a span of rapid technology development; and why stoplights are a relevant analogy for the product design questions his team faced.

Terah Lyons: Phil, welcome. Let’s start with you telling us a little about yourself and your career journey through the founding of SignalFx.

Philip Liu: My background has been in cloud computing and distributed systems over the past two decades or so here in the Valley. I've been through many startups, three IPOs, and two north-of-a-billion-dollar exits, so my background has been at startups, period. I most recently started Trustero, and prior to Trustero, I spent some time at one of my friends' companies, Datrium. And then, before that, I founded SignalFx.

When we started SignalFx back in late 2012 and early 2013, I think there were maybe three very successful SaaS businesses: ServiceNow, Salesforce, and the version of AWS that existed then. At that time, people weren't comfortable running their full business on AWS outside of basically marketing websites–that might have been the primary use case of AWS initially. Coming out of Facebook and my previous experience at Opsware and Loudcloud, I saw that there was an acute change in how people were building back-end infrastructure in applications, mostly going to a model of microservices.

At Facebook, I started to understand that if AWS is going to be successful, there will be a mushrooming of services which are running out there in the ecosystem. So, if this is a new architecture — what are the effective monitoring tools and configuration tools that will work in these vastly distributed applications in large numbers? One of the things that worked well for us at Facebook was monitoring a large number of servers using statistical analysis.

If the industry was moving toward this microservices environment, with a proliferation of thousands of instances that make up an application, then the traditional way of monitoring will no longer work, and we will require a more data-centric way of monitoring ecosystems. Those were really the motivating observations for the start of SignalFx.

“If the industry was moving toward this microservices environment, with a proliferation of thousands of instances that make up an application, then the traditional way of monitoring will no longer work, and we will require a more data-centric way of monitoring ecosystems. Those were really the motivating observations for the start of SignalFx.”

That was the thesis, and how we got started. One of the things that really didn't work well for us at Facebook was the delayed nature of how we computed the statistics to find anomalies. We figured we should do a lot better. So the challenge we took on was trying to use Bayesian statistics en masse, predictive Bayesian statistics in real time, to be able to predict what is happening with a massively complicated and distributed application spanning many, many microservices.

Throughout the years of SignalFx, we built more and more around statistics. We built a lot of very complicated predictive models by looking at historical behavior, and then charting predictions of what will happen if a trend continues. We did this with fine granularity, so we were able to receive and process data points at a per-second granularity. That standard probably continues to be the best in the industry today. That's SignalFx in a nutshell.
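To make the per-second idea concrete: each new data point is scored against a rolling window of recent history. The sketch below is purely illustrative — a simple z-score detector, not SignalFx's actual (far more sophisticated, Bayesian) approach — and the window size and threshold are invented parameters.

```python
from collections import deque
import statistics

class RollingAnomalyDetector:
    """Score each per-second data point against a rolling one-minute window."""

    def __init__(self, window_seconds=60, threshold=3.0):
        self.window = deque(maxlen=window_seconds)  # one data point per second
        self.threshold = threshold                  # z-score cutoff

    def observe(self, value):
        """Return True if `value` is anomalous versus recent history."""
        anomalous = False
        if len(self.window) >= 10:  # wait for some history before judging
            mean = statistics.fmean(self.window)
            stdev = statistics.pstdev(self.window)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous

# A steady (slightly noisy) latency stream, then a sudden spike:
detector = RollingAnomalyDetector()
stream = [100.0 + (i % 5) for i in range(30)] + [500.0]
flags = [detector.observe(v) for v in stream]
print(flags[-1])  # True: the spike stands out against the window
```

The delayed-statistics problem Phil mentions shows up even here: the window only "knows" about the past, which is exactly why SignalFx invested in predictive models rather than purely reactive ones.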

TL: I want to dig deeper into the core observations you had behind the founding of the company. Can you say more about the insights you had about infrastructure while working in prior roles? What factors made you decide that there was really a company-worthy business idea here worth pursuing?

PL: It was really a combination of things. One was a new technique that proved to work well for large scale applications that I observed at Facebook. Another influence was my background dealing with inter-application communications at Opsware and Loudcloud throughout the years, and also, even previous to that, at Marimba. And I saw a distinct change in how people were building applications. I think this was really the key takeaway: If you look at the industry, whenever there is a shift in infrastructure, and how people are building infrastructure and applications, it pulls along all the supporting tools.

“I think this was really the key takeaway: If you look at the industry, whenever there is a shift in infrastructure, and how people are building infrastructure and applications, it pulls along all the supporting tools.”

TL: Have you extended any of those observations to the founding of your most recent company, Trustero?

PL: We started Trustero because the way that companies think about compliance has changed. In the old days, it was something that you didn't have to worry about until you were relatively large and selling into regulated environments. Now, there's a business platform change happening: all the businesses now are SaaS. These SaaS businesses are all interconnected, forming a new platform, and a key requirement of this new platform is trust. So compliance becomes a concern practically from day one. Why should you trust others with your data? And that's why it’s the right time to start a company like Trustero.

At the same time, we have the opportunity to use modern technologies to solve these problems. For example, natural language processing (NLP) is something that we are investing quite heavily in, taking what used to be a non-normalized suite of information, and then normalizing that using NLP, and then being able to perform matches. For example, you have a policy document about what your employee should do. How does that translate into a set of objectives your business should have, and how do you measure that set of objectives using, again, monitoring techniques that we've learned from the past?
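As a toy illustration of that normalize-and-match idea: the sketch below matches a policy sentence to its closest control description using bag-of-words cosine similarity. The policy and control text are invented for the example, and real systems (including, presumably, Trustero's) use far richer NLP than word counts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Normalize free-form text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(policy_sentence, controls):
    """Pick the control description most similar to a policy sentence."""
    p = Counter(tokenize(policy_sentence))
    return max(controls, key=lambda c: cosine(p, Counter(tokenize(c))))

controls = [
    "Access to production systems requires multi-factor authentication",
    "Employee laptops must have full-disk encryption enabled",
    "Backups are tested for recoverability every quarter",
]
policy = "All staff laptops are encrypted with full disk encryption"
print(best_match(policy, controls))  # → the full-disk encryption control
```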

TL: When you were thinking about starting SignalFx, what were your biggest questions as a founder at the time? Who was the first person you talked to, and what finally made you decide to go all in?

PL: It was a personal goal of mine to always start a company, so that was motivating me. I was looking for the right opportunity, and thought, "Hey, this is a pattern that fits." At the time, I approached Ben Horowitz, who was a mentor of mine–he was my CEO at Opsware/Loudcloud. I didn't quite specifically tell him the idea, but it was more about, "What are some of the things that I should be thinking about? Is this whole startup thing going to be really difficult?"

Some very good conversations came out of that. I didn't come from a business background; I'm more of an engineer by trade. So I was advised to go find a partner who has more of a business background, so they can help me along. That was huge.

Financing is also a big, big deal in the beginning. You want to build momentum in the beginning– you need to have accredited, trusted financial people who are backing you. It helps you attract the proper talent.

I have learned some lessons, though. At the time [we raised the SignalFx Series A], we thought, naively, it was validation for us. We thought, "People are willing to give us this much money, so therefore, we've done it." But we still had such a long way to go, and we gave up a big chunk of the company… Still, Ben was a great board member, and I think the outcome of SignalFx speaks for itself in the end. So, it did work out for everybody. I’ve just approached fundraising slightly differently since then.

TL: Let’s go back to the tech for a few minutes. You spoke about the statistical models that you built for SignalFx, which became more and more predictive over time. Those models start from pure math, but at some point, were they getting better because you were seeing more and more traffic across more and more customers?

PL: Yes. The very classical, simple thing to understand is percentiles. You look at whether some specific metric's percentile has reached a certain point. We asked: If you look at it for one server instance, or one application instance, that works well, but what happens if you have a dozen small server or application instances? Then, what is an anomaly amongst the percentiles? The model becomes a little more complicated. Do you build an average of the system, and then compare against the average? Or do you build a median within that population, and compare against the median? And then, what is the median window? Is it a minute? Is it 10 minutes? Is it an hour?

The parameterization of the statistical model becomes more and more difficult, and it's very specific to the type of applications that you're monitoring. It was configurable by our customers, for one. We were also trying to build what was basically a learning model, to determine the right window to select for each application to properly be able to find the anomaly in time for the application.

Those were some of the ideas that evolved over time as we looked at different types of applications.
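The population-median idea can be sketched in a few lines: compare each instance against the fleet's median, rather than against its own history. This is a hedged, minimal illustration — instance names and the tolerance are invented, and the window-selection learning Phil describes is omitted.

```python
import statistics

def population_outliers(fleet_metrics, tolerance=3.0):
    """Flag instances whose metric deviates from the fleet median by more
    than `tolerance` times the median absolute deviation (MAD)."""
    values = list(fleet_metrics.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return {name: v for name, v in fleet_metrics.items()
            if abs(v - med) / mad > tolerance}

# A dozen instances with similar latencies, one of them misbehaving:
fleet = {f"web-{i}": 50.0 + i for i in range(12)}
fleet["web-3"] = 400.0
print(population_outliers(fleet))  # → {'web-3': 400.0}
```

Using the median and MAD rather than the mean and standard deviation keeps the baseline itself from being dragged around by the very outlier you are trying to detect.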

Another thing that evolved during our time at SignalFx was the liveliness of an entire infrastructure. As AWS evolved, there were fewer static virtual machines hanging around to run an application, and it became more of a just-in-time provisioning of containers and of lambda applications.

To further explain: We see more and more containers designed to solve a problem in production, and as a result, that further complicates how you calculate the median and the mean. If you spike from 10 instances to 100 instances, what is the median, what is the mean? And do you take it from the 10 instances, or do you take it from the 100 instances? Now, you have 10 instances giving you metadata about a time series. And then you have 100 different instances that give you 10 times more of the metadata about the same time series. So, those are the mechanical things we had to deal with…it’s actually quite a difficult technical problem. And that is one of the reasons Splunk paid so much money [to acquire] SignalFx. We have a lot of patents on the systems that we built, and it's a pretty impressive system, which is why it’s continually in use today. It's a multi-generational time series service.

TL: Would you say the techniques that you're describing were obviously going to work from your perspective back when you were starting the company, circa 2013? What were some of the surprises you encountered along the way, either related to technical development or the building of the business?

PL: There were a lot of lessons to learn. When I started the company, it was obvious to me that the technology was going to win. That's the reason I made the bet, and I started the business.

However, executing is another matter. A big group of us came from a highly technical engineering organization at Facebook — the engineering population in general at Facebook was a highly skilled and experienced group of people. Most of us understood the importance of statistics: how to think about statistics, read meaning from the data, and be able to infer. We understood the meaning of a time series, and groups of time series, and the percentiles, and outliers, and specific statistical algorithms. But at SignalFx, we found that most of our customers weren’t like that. That was one key finding right away.

We built a super cool system where you could create and stitch together a very complex statistical algorithm. But most people were going, "Oh. What does that mean?" So, that's what we ran into: how do we make it more consumable for people?

“We built a super cool system where you could create and stitch together a very complex statistical algorithm. But most people were going, ‘Oh. What does that mean?’ So, that's what we ran into: how do we make it more consumable for people?”

The second lesson, related to the first one, came from feedback we heard from customers. They said "Well, that's great that I can do this with your tool. Why don't you do it for me?" Basically: Why do I need to build up these statistical algorithms specifically for my application? Why isn't there some out-of-the-box thing for the application I'm using?

We heard a lot of feedback along the lines of, "A lot of people are using open-source tools like Cassandra, like Kafka. If I'm running that, why don't you just give me a dashboard, tell me what is going wrong?" And, “When I should worry, and then, what should I do when something goes wrong? Just tell me." That was the other lesson: We thought that most of the apps were custom, but as it turned out, there were a lot of open source frameworks being run out there, and people wanted to know what was wrong with them. This meant we could provide a complete solution without having to do a lot of custom dashboard-building for each customer.

Those were two key things that we learned early on, and they led us to pivot some of the application side of development: away from the specific building of charts and statistical functions, toward more of a prepackaged set of dashboards that let our customers spin up what they needed based on what they were running. Then they immediately had the charts that mattered, and all the monitoring statistics (red, green, yellow) that helped people deal with the issue if something was red for a period of time.
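At its simplest, that red/green/yellow packaging reduces to thresholding plus a "sustained" condition, so a single spike doesn't page anyone. A minimal sketch, with invented thresholds:

```python
def status(value, warn=0.7, critical=0.9):
    """Map a raw utilization-style metric onto the red/yellow/green scale."""
    if value >= critical:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

def should_page(recent_values, sustained=3, warn=0.7, critical=0.9):
    """Alert only when the last `sustained` readings were all red,
    i.e. red for a period of time rather than a momentary spike."""
    tail = recent_values[-sustained:]
    return len(tail) == sustained and all(
        status(v, warn, critical) == "red" for v in tail)

readings = [0.55, 0.95, 0.60, 0.93, 0.94, 0.96]
print(status(readings[-1]))   # → red
print(should_page(readings))  # → True (three consecutive red readings)
```

The hard product work, of course, was choosing the thresholds per application so that "red" actually meant something, which is where the statistical machinery stayed relevant under the hood.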

TL: Did any of those lessons carry over when you were starting your current company, Trustero?

PL: Absolutely, there were lessons learned from SignalFx that we're applying to Trustero. One of the key things that we did is that we spent a lot of time in the UX at the beginning to make sure that we're targeting the right person, and then making sure that we support the fewest clicks needed to complete their job, rather than having to think about what’s under the hood. We like to think about all of the awesome things that us engineers are building to make the sale–but our customers want their problems solved, and so we learned to emphasize different things.

TL: Let’s talk more about customer needs, since you’re heading there anyways. Can you talk us through how you approached the customer discovery and acquisition process in the early days? What did you have to change your mind about as you started trying to sell?

PL: Actually, we were a bit naive starting out at SignalFx. We thought that all engineers are like us, and therefore, if we build a tool that we like, then everybody else will like it. It started out that way, and then we had to make some adjustments in packaging the product and in user experience.

“We thought that all engineers are like us, and therefore, if we build a tool that we like, then everybody else will like it.”

As SignalFx was growing, the software industry was also re-shaping. Really for the first time, there were people going into DevOps, which was a word that didn't really exist when SignalFx started. We had to figure out what DevOps was, and who the people were in this new category. Sometimes they were engineers, sometimes they were system administrators; some traditional infrastructure roles became DevOps. They had different skill sets, and they were not the core infrastructure engineers that we were. All of these discoveries helped shape the early product.

We also learned a lot about the way our customers viewed the world. They told us, "I have a lot of things to do. When I buy a tool, I want to see red, green, or yellow, and then, let me take action. That's where I find value.” That became the primary audience versus the developer who was building a tool and trying to find out what happened with their app. It became more of a DevOps-type of requirement, and they ended up controlling the budgets.

“We also learned a lot about the way our customers viewed the world. They told us, ‘I have a lot of things to do. When I buy a tool, I want to see red, green, or yellow, and then, let me take action. That's where I find value.’”

TL: The whole red, yellow, green thing takes a lot of trust, right? Customers have to really believe that your categorization is correct. Do you feel like you had to earn that, and that they needed to be able to see under the hood initially in order to trust the classifications? Or do you think that from the outset, the desire for ease of use trumped the need to trust the product?

PL: I actually think that people naturally just want to trust you. They look for ease, and they look at the credibility of the people building the system. They naturally just want to trust you.

That was then, at least–SaaS was a relatively new segment and everyone was trying a lot of new things, and given that it was a new ecosystem, people just were inclined to trust each other. Now, more and more as things evolve, we see a lot of security problems…for example, Capital One leaking information about their clients indirectly through a member of their organization. Those types of incidents disrupt the natural inclination toward trust. I think it would have been hard to convince our buyers to dive in with us if we had started later, because the ecosystem just has more experience now and there have been more breaches and more reasons to be skeptical and wary.

TL: Your point is really interesting: that customers are prone to want ease of use, and to trust first until you break that trust. Applied AI startups especially run into this: To what extent do I need to prove the predictions of my model? Will customers take a prediction at face value, or does it need to be proven first? And this dynamic, by the way, when there is an inclination toward trust, can be beneficial for growing businesses, but produce some really adverse societal consequences as AI applications scale.

PL: Yes, exactly. During the evaluation process, we found that customers effectively want to run through a proof of concept. Not on their data, but just a general proof that it works. If they're convinced in that eval process that it should work for them, then they’ll trust the product... But in our particular product area, that only lasts until it breaks. For a monitoring example, if you start to get storms of alerts from a monitoring system, then as a customer you naturally say, "Well, what's going on? Isn't this tool supposed to smooth out all the noise, and only show me what the true anomalies are?"

There are ways that you can get ahead of those challenges with the company’s tools and with well-engineered assistance. We had to have the stance that we knew it was going to be a problem. We didn’t want our customers to distrust us as a result, and we wanted to build a great product, so we were motivated to really ensure customer trust and to really build those principles into the product from a user experience standpoint.
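One concrete way to get ahead of alert storms — a generic technique, not necessarily what SignalFx shipped — is to debounce alerts: deliver a given alert at most once per cooldown window. A sketch with an injectable clock so the behavior is deterministic:

```python
import time

class AlertDebouncer:
    """Suppress repeats of the same alert inside a cooldown window."""

    def __init__(self, cooldown_seconds=300.0, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock      # injectable for testing
        self.last_sent = {}     # alert key -> timestamp of last delivery

    def should_send(self, alert_key):
        now = self.clock()
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False        # still inside the cooldown: suppress
        self.last_sent[alert_key] = now
        return True

# Simulate time with a fake clock:
t = [0.0]
deb = AlertDebouncer(cooldown_seconds=300.0, clock=lambda: t[0])
print(deb.should_send("checkout-latency-high"))  # → True (first occurrence)
t[0] = 10.0
print(deb.should_send("checkout-latency-high"))  # → False (suppressed)
t[0] = 400.0
print(deb.should_send("checkout-latency-high"))  # → True (cooldown elapsed)
```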

TL: What type of customer feedback did you get along the way that you all found most useful? Were there any significant product pivot points that are worth sharing as lessons learned?

PL: If you’ll let me geek out a little bit: SignalFx itself is a pretty large distributed microservices infrastructure. Basically, you have layers of different types of services interconnected to one another. We ran into a classical distributed systems problem, where there was a fault. Someone had introduced a bug in one of the time series calculation systems, which then, basically, had a cyclic effect. The cyclic effect caused the customers to retry, and the retries caused an even bigger storm of input. So, basically, the bug introduced a self-DDoS. That was interesting.
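The classic client-side mitigation for that kind of retry amplification is exponential backoff with jitter, so clients spread their retries out instead of hammering a struggling service in lockstep. A generic sketch (not the actual SignalFx fix):

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, seed=None):
    """Return randomized retry delays: uniform jitter drawn from an
    exponentially growing (but capped) window per attempt."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

for i, delay in enumerate(backoff_delays(seed=42)):
    print(f"attempt {i}: retry after {delay:.2f}s")
```

The jitter matters as much as the exponent: without it, all clients that failed together retry together, recreating the storm on every cycle.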

We used another instance of SignalFx to monitor a production instance of SignalFx, and we found that things were spiking quite a bit, but we didn't find the source until later on. It made us realize that we weren't quite doing our job, because we should have found the source pretty early on, before it got out of hand. That was an interesting technical lesson learned: that we should actually be able to catch that effect ourselves.

On the business side, we already talked about how we had a pivot of sorts, from a more statistical-focused tool to a dashboard-focused tool. We ended up having to build dashboard support, with per-application knowledge. It used the same underlying technology, but from a product perspective, it was a pivot in how we presented the solution.

That wasn't very clear in the beginning. We thought we were building an insights tool. But customers would say things like, "Oh, that's great that you could show a trend, but I want to know what's happening right now.” They were really asking for a pure monitoring system.

So at that point we went from an analysis tool to a monitoring tool, which was a big change. That was our a-ha moment. Customers got accurate alerts, and then could also look at a trend.

“So at that point we went from an analysis tool to a monitoring tool, which was a big change. That was our a-ha moment. Customers got accurate alerts, and then could also look at a trend.”

TL: You’re right — that might sound subtle, but that’s actually a huge paradigm shift for the product and the company, which must have been challenging at the time.

One last question, about today, and your latest venture. You saw the trade winds shifting around infrastructure when you founded SignalFx. What are some of the patterns that led to your founding Trustero this time around?

PL: A few things. One is that I ran a SOC 2 compliance process a couple of times before Trustero: once at Datrium, and once at SignalFx. Both times we needed a SOC 2, but it showed up at different phases of the selling process. At SignalFx, we followed a typical buying process; only after customers had selected our product to buy, and wanted due diligence on the business, would they ask for it. At Datrium, just a few years later, it came much earlier, during the evaluation phase itself. Basically, they wouldn’t even bother evaluating you unless you had a SOC 2 report, or at least a plan for a SOC 2 report. That’s because it's reached a point where that trust is so important between businesses. It's just a non-starter for a lot of products unless you have that in place.

So it became clear to me that a great solution needed to exist; that’s the market insight. And then, number two was the technology insight about how to do it better, how to make the process less manual. The status quo for getting a SOC 2 assessment was a professional services engagement in which someone performs a human examination of your infrastructure. That used to work when you had large IT teams, but modern SaaS ecosystems consume services differently, right? Professional services are completely foreign to me. I see it as an archaic way of doing business, and I saw an opportunity to modernize the process.

Which brings us back to the green, yellow, red analogy. In selling to prior customers, I found that they just want to get the job done, and they don't want to talk to a lot of humans to accomplish it. If I’m a customer, I don't want to see reports, either in PDF form or in paper form. I want to consume it like SaaS. The way you think about GRC, it can't be the traditional approach. It has to be delivered with a new platform. In this case, the new platform is inter-connected SaaS businesses.

TL: There’s a takeaway uniting many of the examples that you’ve given in describing the businesses that you’ve built: When a generational technology platform shift occurs, there’s a market opening to reinvent the categories for all of the supporting tools and infrastructure. This seems like it has been pervasive across your career, and you’ve really moved to act on these shifts when you recognize them, in SignalFx and most recently in Trustero.

This seems like a great note to end on. Thanks for your time today, Phil — we really appreciate it, and can’t wait to track your journey with Trustero.