Episode 5: Cloudera – Machine Learning & Big Data with Mike Olson

Mike Olson co-founded Cloudera in 2008 and served as CEO until 2013, when he took on his current role of Chief Strategy Officer (CSO). Cloudera delivers enterprise tools that leverage the open source Apache Hadoop platform for big data analytics. In this episode, Mike describes how Cloudera contributes to the open source community while also holding back enough proprietary IP to build one of the most successful open source software businesses of all time.

Transcript

Introduction

Michael Schwartz: Welcome to Open Source Underdogs, the podcast where we dig into the business models of the best open source software companies in the world.

Today I’m excited to be with Mike Olson, Founder and Chief Strategy Officer at Cloudera.

Cloudera provides Enterprise tools for data engineering, data warehousing, machine learning, and analytics that leverage the open source Apache Hadoop platform.

Mike Olson, thank you so much for joining the podcast today.

Mike Olson: Michael, thanks for coming by. I’m excited to talk.

How Did Cloudera Get Started?

Michael Schwartz: How did Cloudera get started?

Mike Olson: The company’s ten years and a few months old at this point. We started in 2008, early summer.

The conviction that I had, along with my three co-founders, Amr Awadallah, Christophe Bisciglia, and Jeff Hammerbacher, was that big data was going to be a big deal.

So if you looked at Yahoo, and Google, and Facebook and others, they were collecting and analyzing data at enormous scale – much bigger than banks or hospitals were doing at the time. And they had invented a collection of new tools to do that.

So Apache Hadoop was sort of the foundational project of this ecosystem. We all believed that banks, hospitals, insurance companies would want to get lots of data and then analyze it in powerful new ways. And this wonderful open source Hadoop project was ideally designed for that.

So our idea was let’s bring those capabilities to traditional Enterprises.

Now I had had a long career in database technology, I had worked for Informix, and Oracle, and a bunch of other database startups you never heard of. I had a long career in open source, so I had worked at Berkeley Unix, on Postgres, Berkeley DB.

So I understood the big Enterprise data consumption behavior, and I kind of got open source. I didn’t know anything about Hadoop.

I had no code running in there, I had not developed it. But I understood what it did, how it worked. And my co-founders had all been actively involved in its development and its use at Facebook, Yahoo, Google.

So together I think we were uniquely positioned at that time to start the business and to try to bring this web technology to traditional Enterprises.

Cloudera Customers

Michael Schwartz: Who’s using Cloudera’s product today?

Mike Olson: Let me talk about what we’ve built for the market, because it’s been heavily influenced by our target customer; and that’ll give me a chance to explain then, who we chose, why and so on.

When we started the company, all we had was Apache Hadoop. That is two open source components: a file system, HDFS, and a distributed processing engine, MapReduce.

Those two things together were what Google invented for big data processing and what took over the consumer internet. Fast forward ten years, we have 26 different open source projects in the bundle now.

In general, all of the new stuff has swiped the fundamental design ideas of Hadoop. Scale out, so deploy and run on a whole bunch of servers, and bring the processing to the data. So MapReduce let you plow through huge amounts of data by every server basically running its own tiny little plow-through job, and then you combine the results at the end.
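The pattern Mike describes here, where each server runs a small job over its own data and the partial results are combined at the end, can be sketched roughly as follows. This is only an illustration in plain Python, not Hadoop code: the `shards` data, and the `local_job`, `combine`, and `run` names, are hypothetical, and the "servers" are simulated as lists inside a single process rather than separate machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands in for the records
# that happen to be stored on one server in the cluster.
shards = [
    ["error", "ok", "ok"],
    ["ok", "error", "timeout"],
    ["ok", "ok"],
]


def local_job(records):
    """Map step: count values using only the records on one server."""
    return Counter(records)


def combine(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def run(shards):
    # In a real cluster each local_job would run on the machine that
    # holds the data; here they simply run one after another.
    partials = [local_job(shard) for shard in shards]
    return reduce(combine, partials, Counter())


totals = run(shards)
print(dict(totals))
```

The point of the design is that only the small partial counts have to move between machines, while the bulk data stays where it is stored.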

Well, there are other processing engines available now. There is Apache Impala for high-performance distributed SQL analysis. There’s Apache Spark for stream processing, and for marshalling like model training in machine learning. There’s the SolrCloud Lucene search engine that does document search, and so on.

So, a whole bunch of scale-out projects now can collaborate on the data on that cluster, on those servers. So you don’t need to make multiple copies of the data; you can do lots of different processing at massive scale by taking advantage of all of the CPUs that you have attached to the storage.

So, the platform’s gotten a lot more interesting and handles way more analytic workloads than ever before. We’ve built a collection of services that complement that core open source offering.

For data governance, compliance, and living up to regulatory regimes in these big industries like healthcare and financial services, we built tools on the platform that let administrators, chief information officers, and the privacy team be sure that the rules are getting followed and policies are enforced.

We’ve got high-performance security and encryption services, and services to deploy and manage this infrastructure at massive scale – if you’ve got a thousand computers running a mission-critical job, how do you ensure that the job’s going to finish on time, and so on.

So we’ve built a collection of large Enterprise-focused software complements to that open source platform. The data storage, the processing, and so on, all of that is in the open source, and we’re substantial contributors to that ecosystem, and we benefit from the great work of that ecosystem.

And then we build a product that adds some proprietary IP. That is differentiated and not nearly as easy for a competitor to pick up and use against us as the open source would otherwise be. So we maintain some differentiated IP and some reasons for customers to buy from us.

We chose those capabilities: Regulatory support, operations at massive scale, compliance, and so on, security, because we aim to sell to large Enterprises.

Banks and hospitals and insurance companies have lots of data, complex businesses with lots of opportunities to do analysis, so they’re good customers, and crassly, they’ve got a lot of money.

So we designed our product strategy, and we did a lot of really innovative stuff expressly to go after, kind of that same large Enterprise buyer that I had been servicing for my entire career as a relational database industry player.

Competition

Michael Schwartz: There must be a lot of competition in this market. How do you differentiate Cloudera from some of the other offerings that are out there?

Mike Olson: In 2008 when we started the company nobody had heard of Hadoop.

It was just, you know, it was some nerd technology, if you were an engineer at Yahoo you were up to date. But the industry at large hadn’t heard of it.

The meme of Big Data didn’t even exist yet, so those words weren’t being used to apply to any kind of business problem.

That changed in a hurry. So many competitors have entered the market, lots of folks have recognized the value of Big Data. A number of companies have even stepped in and joined us in delivering solutions that build on the Apache Hadoop ecosystem.

Look, I think there’s a bunch of reasons we’re good. We’re deeply involved in the open source community; in order to take advantage of and help drive that benefit, we’ve got to do that. But our differentiation, how we set ourselves apart, is really based on how we’ve been thinking about the customer.

So large Enterprises need the security, governance, compliance, and regulatory support we’ve got. We genuinely believe that, given the decade that we’ve been building the product and the expertise we’ve got, we’re differentiated there.

In addition, however, our large Enterprise clients have data centers, and we think they’re going to have data centers for a long time, so they want to be able to run stuff on premises.

They need to be able to take advantage of the public cloud, so spin jobs up on Google or on Amazon or on Microsoft Azure. And then, they’d like to be able to move those workloads around among those places – on prem, the various public cloud providers, as business requirements demand.

We support this sort of hybrid capability, on-prem and cloud, and also multi-cloud, where you can move among the cloud vendors, so that our customers get all that security, governance, compliance, and regulatory support, and aren’t locked into a single cloud provider.

There are outstanding offerings that are single cloud only, so you know Amazon’s got a rich collection of great data services, and we integrate with and complement those.

But the promise we make our CIO customers is, you can deploy with confidence on any of those venues, because if you decide later to move, your applications don’t change. You haven’t coded to any sort of built-in, proprietary, native cloud APIs. You’re building on the open source ecosystem of Hadoop, and your app will move easily.

Customer Interaction

Michael Schwartz: Are there any grades of interaction, maybe for smaller customers or the community?

Mike Olson: So of the 26 open source projects that we distribute, some were ingested from the open source community, created outside of Cloudera entirely – so Apache Spark’s a great example. Now we’re major contributors in the Spark ecosystem as well, but that was created at Berkeley and then released through the Apache Software Foundation.

One of the two original founders of the Hadoop project, Doug Cutting, actually works at Cloudera so you know, that predated the company, but obviously we were engaged in that from the very earliest days.

And we’ve created open source projects. So a great example is Apache Impala, our distributed query processing engine. Or Apache Kudu, an IoT-based storage engine. Those were conceived and originally developed here, and released through the Apache Software Foundation to the open source community, and we’ve built communities around them.

We benefit from the great work of the global open source development community. We earn our credibility by contributing to that community as well, so both of those are important.

We don’t really engage commercially with say, small companies with small data problems. Remember, we’re aimed at these large organizations and regulated industries with complex business.

So we really sell to very, very large Enterprises and we focus on those problems, and our interactions with those customers are aimed at basically that kind of buyer.

So we don’t really do much with open source consumption of the platform, although a great deal of it goes on.

We’re concentrating on monetizing and easing the adoption of the open source tech by those big guys.

What Goes In Open Source?

Michael Schwartz: How do you decide how much to invest in the open source projects versus your commercial project?

Mike Olson: This is going to sound glib, but actually a whole bunch of thought has gone into it. Very briefly, if it’s about data storage, data analysis, data processing – CIOs don’t want to be locked in.

They don’t want a single vendor proprietary solution.

I mean, I spent a bunch of my career working for great big database vendors, and we taught CIOs that single vendor proprietary lock-in was a mistake. So everybody’s learned: they don’t want me to be able to turn off their data access, they don’t want me to be able to shut down their analytic workload.

So if it’s data storage, data processing, data analysis, really it needs to be in the open source.

If it’s addressing the unique requirements of a large Enterprise, like if you need to be able to answer the demands of regulators, or if law enforcement shows up and you’ve got to do data lineage work.

Well, look man, that’s fair game. Because first of all, the open source community isn’t going to spend a lot of time on that kind of problem. And second, it’s expensive and difficult to develop those solutions, and so I should legitimately be able to be paid for those.

And so in general we’ve got a philosophy on what needs to be open source and what can be closed source, proprietary IP.

Channels

Michael Schwartz: Have you developed channel partners, or do you rely on the open source to be a distribution channel?

Mike Olson: We’ve got good channel relationships with some of the big systems vendors.

So you know Dell is a fantastic partner, HP’s a fantastic partner, even Oracle and Teradata offer appliances that bundle the Cloudera platform and they’ll sell through their sales force to their customer base our IP on those appliances.

The global systems integrators are likewise a really good channel for us; they turn up a bunch of opportunity.

The bulk of our revenue is from direct sales.

So we’ve got a substantial global sales force with concentrations in EMEA, in Asia-Pacific, and then the Americas, and leaders identified there. And we have a bunch of direct sellers and technical folks in the field that engage with our large Enterprise customers.

Initial Sales Strategy

Michael Schwartz: When you were first getting started was it really hard to figure out how to manage the long sales cycle of a large Enterprise? How did it go in the early days?

Mike Olson: Bear in mind that had been the market that I sold to for my entire career, so I knew what to expect in terms of sales cycle. And I was an engineer to begin with, right, but in the mid-to-late ’90s I stopped being an engineer and became more and more field-focused.

Anyway, I was used to the buying cycle of that community. I’ll say that we got lucky in a weird way.

So you remember 2008 was the Global Financial Market meltdown. It had a really important knock-on effect: everybody was really afraid of what was going to happen further in the economy.

So large Enterprises were looking for ways to shed expense and, if they had to do innovative stuff, to do it in novel and much less expensive ways.

So you know, it used to be that you could legitimately go charge somebody $40,000 to manage a terabyte of data. You know there’s a lot of terabytes of data in the world these days, and at 40,000 bucks a terabyte, there’s a huge penalty for having data.

Well, this platform was designed to make lots of data cheap: cheap to accumulate, and then cheap to operate on. So the market crashed, a bunch of CIOs got very cost-conscious, and here we were with this open source foundation and a new way of building scale-out, distributed systems that was vastly cheaper than what came before.

Look I mean, the market crash was a disaster for a bunch of reasons but it was a little bit lucky for us. We tried to stay rigorously focused on what we thought customers needed and that varies by time.

You know the needs of customers today are very different from what they were a decade ago, because everybody gets open source now, right. Everybody gets big data.

We’ve evolved our product strategy along with the maturing market. And I’d, maybe it’s vain, I’d claim we helped to mature the market as we went.

Red Hat IoT Partnership

Michael Schwartz: One of the new trends is IoT. To expand a little bit on the partnership angle, we were reading about a partnership with Red Hat.

Do you see that as the new direction, that better solutions are needed?

Mike Olson: I actually do and unsurprisingly that’s why we got involved in, that’s why I’m so excited about the work we’re doing with Red Hat.

I’d take a step back. IoT, as a broad secular trend, is a huge boon to those of us in the big data industry. Right, I mean, that data’s got to come from somewhere.

It’s nice if we just put sensors all over the place, if we were able to ingest you know, stock trader market data at very fine grain, and then we can store huge amounts of it. So the more things there are on the internet making data, the better off that the big data platform is going to be. So, very excited about it on that basis.

If you know much about the Hadoop architecture, it’s kind of this big back-end, where data lands and then you analyze it. What we hadn’t done, what we don’t build, is the end-to-end architecture.

So what do you rely on to get data off of the sensor, or off of the device? What happens to it as it runs over the network? Is there analysis and aggregation of data in flight?

Well it turns out the folks at Red Hat and a number of other industry partners, were thinking about those capabilities and taking advantage of their infrastructure software, to deliver some interesting services there.

If we can make our analytic platform a good destination for that data, and then train models using machine learning techniques to spot anomalous activity in those sensors in real-time – well that’s awesome, right?

So the combination of these technologies, really the partnership around IoT data, enables applications that we couldn’t have enabled on our own.

And then obviously, you know, Red Hat’s been a fantastic partner in Enterprise software for open source, forever and ever.

Other Partnerships

Michael Schwartz: Are there other partners that you think have been helpful?

Mike Olson: So in the IoT space in particular, Eurotech and the Eurotech Everyware family, is an important part of that partnership. Looking more broadly than just at that IoT activity though, like any platform software company we rely on a rich ecosystem of other companies in order to succeed.

So if you squint your eyes, you’re allowed to think of Cloudera Enterprise as a database. It’s just not a 1980s database, it’s a 2018 database: lots more data with lots more powerful analytic capabilities. But you need the application that runs on that database.

I want to predict which customers will churn out of my mobile service. I want to spot fraudulent transactions in data flows. I want to guess which patient at my hospital is likeliest to be readmitted based on past behavior.

We don’t build that app but we have a collection of systems integrator partners, and independent software vendors that do build those applications. And they’re absolutely essential to our success, right – nobody buys the platform, everybody buys the application, but the platform has to be there to build the application on.

License Strategy

Michael Schwartz: One of the popular licensing strategies in open source is open core. Do you have some thoughts on that strategy, or maybe licensing in general?

Mike Olson: Couple things. So first of all – this is just a super fraught emotional area. People have very strongly held views. I’m pretty pragmatic. I deeply believe in the strategic value of open source.

You get this global community of really smart folks. Anyone who cares enough about a problem can join the community and concentrate on solving that problem; and because they care a lot about it, they’re probably going to be better at it than someone you choose randomly.

So open source innovates in ways that proprietary software can’t because you can harness the whole planet, and you get to take advantage of distributed expertise. I love it.

The challenge is that a purely open source product is able to be picked up and given away by any vendor who wants to do that.

So you can think of a mega vendor grabbing a collection of open source IP, pricing it at zero, and then monetizing that by selling a proprietary database that connects to it, or a whole bunch of expensive services. If you invest a lot in open source development, you run the risk of being commoditized when someone else takes that IP and drives its cost way down, so you wind up competing with them.

Our decision to complement our open source platform with proprietary IP is not intended to lock customers in, it’s intended to lock competitors out.

I want to have some reason that customers will come to talk to me and not go talk to one of those big bad commoditizing vendors. And that’s why we build our product, that’s why we have the IP strategy that we do.

We’ll always have proprietary IP that complements the open core, the open source.

What that is will evolve over time as our business evolves. And the open source ecosystem will also evolve over time.

People are going to innovate in new ways; maybe some stuff that you used to only be able to get one way, you’ll be able to get four or five different ways. But I believe it’s a good, sustainable long-term model. Certainly it’s served us well in the decade since we started the business.

Tools?

Michael Schwartz: Would it be oversimplifying to say that the strategy is to build proprietary tools around the open source?

Mike Olson: It’s a fair view. I mean, if I used those terms in front of my marketing guys, they’d say, but we build way more than tools, we build a platform. Fair enough, right.

But complementary code aimed at the specific requirements of large Enterprises, that complements all the power, flexibility, scale of the open source ecosystem.

Closing Advice

Michael Schwartz: So if you are an entrepreneur who wanted to use open source as part of your business?

Mike Olson: I’d say first of all, you know, 20 years ago… Oh heck, man. Yeah 20 years ago, 30 years ago, when I was working on Postgres, when I was working on Berkeley DB, there was this question you know, what is the open source business model?

I think that question is poorly formed. A business model is actually a complex construct. Open source is a really important component of strategic thinking.

It’s a great distributed development model, it’s a genius low-cost distribution model – anybody can download your software at very low cost and with very low friction.

And those have a bunch of advantages, right. You need to think about how you’re going to get paid. So what is it that people will give you money for? It can’t just be because you’re good at what you do, because sooner or later somebody else is going to get good at that too, and competition is going to make it tough to maintain attractive margins.

So you need to be thoughtful about the unique value that you’re adding. You need to think about who the customer is, and what they require, and why they will uniquely buy from you.

The particulars of the license that you choose matter a lot.

The GPL is coercive in a way that a lot of people like but that freaks out, for example, a lot of cloud vendors. They won’t pick up GPL’d code because they’re concerned about getting infected with the GPL requirements on code that they developed. The Apache license I think is actually a great license and has some good IP protection around patents, but is much more permissive.

So you want to think about what the license requires relative to what you’re going to deliver to your customers, and why they’re going to buy, and how that’s going to stay defensible. I think there’s a lot of really good thinking on this.

Clearly Red Hat is world-class at building an open source business, but it’s much more than – what should I do with open source? You want to think about who you’re going to sell to, why they’re going to buy, and why you’ll be able to preserve differentiation, and your advantage long-term in that market.

Michael Schwartz: Wise words from Mike Olson. Thank you so much for sharing your thoughts with us.

Mike Olson: Thank you Michael, I really enjoyed it. Thank you for coming by.

Michael Schwartz: That’s it for episode five. Transcription and episode audio can be found on opensourceunderdogs.com.

Special thanks to the Linux Journal for co-sponsoring this podcast, and to the All Things Open conference for helping us publicize the launch.

Music from Broke for Free, by Chris Zabriskie and Lee Rosevere.

Production assistance from Natalie Lowe. Operational support from William Lowe. Thanks to the staff of Cloudera for schedule juggling.

Next week we’re leaving the Bay Area and we’re off to the Big Apple, where we’ll visit with Eliot Horowitz, one of the founders of MongoDB.

Thanks for tuning in.
