As you may know, we’ve worked closely with Microsoft to bring the ORTEC Big Data Portal to market. The Portal runs on Microsoft’s Azure Cloud and combines data storage, processing, analysis and advanced visualization in a user-friendly interface. Naturally, we are keen on sharing what makes Microsoft Azure such a powerful platform, so we were delighted to have the opportunity to discuss its capabilities with Raghu Ramakrishnan. Raghu leads the team behind Microsoft’s Data offering. In this interview, he elaborates on his life’s work, his product vision and his outlook on how big data is disrupting the way we do business.
Raghu, you graduated from one of the most prestigious universities in the world: the Indian Institute of Technology Madras. Can you explain what makes your alma mater so special and what other universities can learn from it?
Part of the reason the Indian Institutes of Technology are so well-known is that the people they have admitted have gone on to do pretty remarkable things. But if an institution wants to be elite, it needs to focus on getting the very best students in from the start. The Indian Institutes have historically had a very rigorous entry process. Getting into an Institute is based on a nationwide exam and it’s very competitive.
So I would say the biggest take-away is that initially, you have to bootstrap yourself by attracting those very good students. Once you get past that, you have some momentum. You can use your alumni network to help new graduates get going. Your reputation will help attract the next generation of strong students and you can put them in place to really succeed.
You have been an entrepreneur, a professor, have worked at Yahoo and now work at Microsoft. In addition, you have written one of the most influential academic books on databases: “Database Management Systems.” For most people, a single one of these accomplishments would fulfill a lifetime of ambitions. Do you still have big plans?
Absolutely! Let me put it this way. When I graduated, I wanted to be doing research and I thought about both academia and industry. I considered IBM, Microsoft and a handful of universities and I ended up teaching, which made me very happy. Along the way, I did a startup and that whetted my appetite and made me consider going into industry for a while. So when Yahoo had an opportunity to rebuild their Research Lab from scratch, I decided to go there. I enjoyed it and over time I got more and more interested in the Engineering and Product side of things.
When I first came to Microsoft, I started an Applied Research team, but now I also have significant Product and Product Delivery responsibilities. The one common thread has been: I like to work on interesting things with interesting people, and I’ve also been able to change my environment. I think this variety has kept me fresh. So, in a nutshell, I’ll continue to work hard as long as I enjoy it.
Let’s talk a bit more about Microsoft. For some time, Microsoft was viewed as a company that was lagging somewhat behind compared to Apple, Google and other tech giants. How did it turn itself around to become a hot company again?
That’s an interesting question. I came here 4 years ago and I would say the turnaround had actually begun around that time. I think the company recognized that the world had changed. It recognized that no one company at this point should be able to dictate what customers prefer in terms of their product choices. The change at Microsoft developed with this awareness and there are a lot of examples. Office runs on iOS and Android, for instance. We offer Azure services on Linux. We are also a leading contributor to Open Source. We have a number of people who are committers for Hadoop projects. We work closely with Jupyter Notebooks and we recently acquired a company called Revolution Analytics that supports Open Source R – the world’s most popular programming language for statistical computing and predictive analytics.
So, fundamentally, we want to be where the customer wants to be. We’ve gone through some pivotal changes when it comes to the cloud and mobile devices. That has resulted in really concrete outcomes. We walk the walk and people are giving us credit for that. We have been willing to take leaps into emerging directions even while knowing that these could very well cannibalize existing areas. In short, the company has been willing to disrupt itself. I give Satya [Microsoft’s CEO] a lot of credit for that. We haven’t hesitated to make trade-offs of short-term gain for what we believe would be the long-term path going forward.
Sketching Microsoft’s Big Data Landscape
You are the CTO of Data at Microsoft. Most people know Microsoft, not many know what a CTO of Data does. Can you explain what you do and how your department fits with Microsoft’s strategy?
I am the CTO for The Data Group, which is part of Cloud and Enterprise. Now, the Data Group includes pretty much all our assets in the broad data space. People might not know this, but internally Microsoft operates one of the largest – if not the largest – cloud infrastructures. We have our internal Big Data Analysis Platform called Cosmos, which is used for storing and processing data for applications like Bing, Xbox and Skype; and we also have SCOPE which is like Hive. All of these assets and our Big Data offerings are run by The Data Group and we are working now to converge these offerings into what we call Azure Data Lake.
We also have offerings around R, DocDB, Azure Search, Data Orchestration, and so on. So whether it’s the cloud or on-prem software, whether it’s transactional or analytics software, whether it’s for our very extensive internal or external customer base, The Data Group is responsible for this large suite of data products and services.
In my role as CTO, I run the Engineering side of all the big data products and make sure that what we’re doing comes together. As you can imagine, this diverse portfolio is wonderful, but it can be confusing from a customer’s point of view. So my job is to make sure that this offering is as seamless and as scalable as we can make it. Besides this, I manage an Applied Research team for The Data Group.
Can you tell us a bit more about Cosmos? Are there features from this internal analytics platform that have already found their way to the market?
Yes, Azure Data Lake is the evolution of Cosmos. Many components of Cosmos are shared in Azure Data Lake Store. It’s not quite the same because we are also re-architecting it to be completely compliant with HDFS and we’re infusing it with some new capabilities – streaming right into the file system for one, much greater scale for another. Our internal Cosmos users will soon be able to use Azure Data Lake as well. Ultimately, we want our external customers to count on the fact that they ride on the same platform that our internal business rides on.
Microsoft’s Big Data Landscape looks impressive, providing tools and services ranging from capture and storage to interpretation and visualization. Could you briefly sketch it for us?
So, we’ve talked about the storage component. There is Cosmos and Azure Data Lake. From Azure, of course, you have the full array of virtual machines and the like. The Data Group provides a number of higher-level services that complement all of this. We offer:
- Tools for query processing, every flavor of SQL you can imagine in the Open Source world including Hive, Impala, Spark, and our own SQL
- Extremely sophisticated Machine Learning tools
- Streaming solutions, including Open Source Storm as well as our proprietary Stream Analytics
- Document processing capabilities in DocDB
- A built-in enterprise search capability
- Orchestration capabilities where different pieces of transformation or data loading can be strung together to create workflows
- A comprehensive suite of transactional services
- Visualization through Power BI and more
All of this describes our Azure Suite, and there are a number of third party vendors that complement this. What is unique about us is that we are both an established on-prem data management vendor and an enterprise-grade cloud vendor. This allows us to offer unique hybrid opportunities. If you look at other vendors, they have strength in one or the other, but not both.
In one of your blogs, you praised the open source culture at Yahoo and stated that you “wanted to bring the power of this ecosystem to Microsoft’s big data efforts.” As you mentioned, Microsoft’s Big Data offering is open to tools and services from many vendors. Is that what you had in mind when you joined Microsoft? And how do you keep this ecosystem open?
Well, let’s look at a couple of different things. SQL has gone through a standardization process. It would be great if some of the SQL dialects in Open Source also started snapping to those standards. We are seeing such a plethora of different dialects…Hive, Impala’s version of SQL, Spark’s version of SQL, and so on. Frankly, there is no fundamentally sound reason why everyone has to have a different dialect. So if we could see some emergence of standardization, I think this would be to everyone’s benefit. And more than just the SQL syntax, the underlying data standards like Parquet and ORC are coming up but all of the SQL vendors have their own formats as well.
From a customer’s point of view, if they want to move across different tools for analyzing and querying their data, having all those tools operate on a common representation would save the super extensive task of transforming data formats every time they want to use a different tool. So there are opportunities, both in terms of the language and the data representation, to make the ecosystem much more open and synergistic.
As for being open to many vendors, when Satya recruited me he convinced me that the company was really going the open route. We offer Open-Source Apache Hadoop distributed by Hortonworks. We also include offerings from Cloudera, Spark and others. We have made contributions to ORC and HDFS. All of these things I mentioned just now are supported as product services on Azure. Side by side, we offer some proprietary services. Customers have the choice to use our tools, use Open Source tools, or mix and match them with as little friction as we can possibly manage. Ultimately, it’s about the customer and how they want to leverage their data.
Staying one step ahead by fostering a Data Culture
You mentioned your CEO Satya Nadella earlier in our conversation. In 2014, he wrote a blog about the importance of creating a data culture. In your view, what exactly is a data culture?
Data is emanating from the tools we use, our means of transportation (whether it’s airplanes or cars), our homes, even our own bodies. Every facet of the world is becoming observable. According to Gartner, there are more than 20 billion connected devices expected by the year 2020. The amount of data exchanged by sensors is going to be way more than the amount of data exchanged by human beings. So, our world is permeated with data. Most activity is observable and can be learned from to act more intelligently and efficiently. Having a data culture is recognizing the power of observations and getting into the habit of looking at them to ground your thinking.
Many experts say 3 key sectors will be revolutionized by Big Data: Healthcare, Finance and Retail. How will Big Data change these industries and what role will Microsoft play in this disruption?
Let’s start with retail. When you think about what Amazon has done to disrupt traditional retailers, a lot of it comes from the use of their own cloud. Everything from processing retail transactions at scale cheaply to looking at what’s selling where, managing their supply chain, making product recommendations to customers, planning programs like PRIME…all of this is based on the use of data. Other retailers clearly want the same advantage. Players like Walmart are all moving to compete very aggressively and investing in their own cloud strategy and that’s an important market for us.
It’s a similar story in finance. Financial companies are among the pioneers in recognizing that you can learn from what you observe. For years, they have been building predictive models for stock markets. It’s a very data-driven industry, and being able to process this data at scale, elastically, is crucial. The financial industry is therefore another big market for big data and the cloud. Again, we think we have some of the very best offerings in the space.
In healthcare, when you think about personalized health and preventive measures, there’s so much that can be done. But here you have to be super careful about security and privacy. Microsoft has had a traditional emphasis on enterprise security and privacy. I hope that some of what we have learned will be attractive to players in the healthcare industry who want to leverage big data and the cloud as well.
What do you think are the top 3 enablers that allow a company to gain value from using Big Data and Analytics?
First, recognize that you can’t leverage big data if you don’t have it. Think carefully about the data your business would benefit most from having. Then, take steps to instrument, capture and maintain that data with high fidelity and keep it current.
Second, you need to understand the importance of infusing data into your everyday operations and your thinking. You might be a package delivery company, but now you have the opportunity to look at what kind of packages are delivered when, how, and to whom. That information can be used to optimize your business and create new data products. Thinking about the possibilities offered by data can be transformative.
For instance, we recently acquired LinkedIn. LinkedIn’s value is all around their people, the people who have profiles on LinkedIn and their relationships. They can create so many valuable things around this data and without this data culture, they wouldn’t be the same. So ask yourself – what are the most important decisions that affect my company? Are people gathering the right data and validating their next steps based on that data or acting solely on instinct? Instincts are good, but instincts grounded in data are usually more reliable.
Third, you need to have the ability to ride technology curves and understand when something has become cost-effective to leverage. Things that were too expensive to be worth the ROI 5 years ago may now be incredibly cheap. The cost of hardware and the cost of computing are going down while the value of data is increasing. Ultimately, you should strive to become a company that constantly keeps an eye on the horizon and knows where that shifting point of profitability is, where the ROI justifies using your data in new ways.
In your opinion, what can traditional companies do to avoid being limited by the shortage in analytical talent?
That’s a great question. I think there are two basic things organizations should do. First, do more training. Luckily, this is one of the hottest industries right now. Universities across the world are introducing new curricula to produce more people who are trained in understanding and interpreting data in order to meet this need.
Second, we need to make the whole ecosystem of tools easier to use. That’s on us – the technology companies. Right now, the tools require a high degree of sophistication and we simply don’t have enough people for all the companies that need these capabilities. We need to reduce the level of abstraction in the tools we have today. We need to make it easier for people to do things end-to-end. There’s no doubt that people need to be trained on their underlying business and how data can affect it. And there’s no getting away from the fact that they need to be trained in the basics of statistics and interpreting data. But we can certainly do more to bring out tools that are straightforward to use, so people can focus on their domain knowledge and solving critical business problems without extensive technical know-how.
Interview written and edited by: Gloria Quintanilla, Chirppoint.
- Large picture above: Courtesy of the German Center for Research and Innovation New York, © Nathalie Schueller
- Profile picture: Courtesy of Microsoft