Imagine a word cloud representing discussions about how to make data useful in business right now. Based on what I saw at Strata in late February in Santa Clara, you would see “big” in letters about 4 inches high and the word “data” in regular 12-point type. As my pals at Gawker say, “Thatz Not Okay.”
The analytical nerds over at Cheezburger recommend solving the “data” problem before tackling the challenge of “big.” To me this seems like good advice, but carrying it out may be a bit tricky because of a disconnect in the way that data is analyzed at different sizes. Ask yourself this: if you took all the money you were going to spend on “big” and instead spent it on “data,” would you be better off?
First, some basic assumptions:
What does all of this mean?
I’m going to leave assumption 8 out of this discussion for the most part and deal with it in detail in another article.
The problem with the current focus on “big” is that it addresses only two of these assumptions, 5 and 7. (Remember that the way “big” is used most often means both voluminous and unstructured.)
The second problem is that most of the time the methods used to handle “big” are specialized and can only be done by high priests of programming, which creates a bottleneck.
My goal for this series of articles is to present a few ways that various types of technology could be used together to create an infrastructure that provides the most possible value for a group of workers. The collections of technology I’m going to propose seek to meet the following criteria:
The Pentaho Way
Pentaho is an open source business intelligence and big data analytics company that was founded by five deeply nerdy people. Three of the original founders, Richard Daley, Doug Moran, and James Dixon, are focused on Pentaho’s big data technology and strategy.
Unlike many other companies in the realm of data science, Pentaho is focused on what users do to get value from data and how to make it easier, and it shows in their product.
“Pentaho is taking the responsible, adult approach to tackling big data head-on,” says Dave Henry, Senior Vice President, Enterprise Solutions at Pentaho. “We offer great connectivity, easy-to-use data development apps and, by putting data integration and visualization so close together, deliver a productive experience that lets more people work together.”
Pentaho’s secret sauce is a system known as Pentaho Data Integration, which is essentially a visually-oriented toolkit for massaging data that has the following characteristics:
So, how well does Pentaho meet my criteria?
One powerful aspect of Pentaho Data Integration is that because of its simple, drag-and-drop visual interface, it can be used by analysts in all areas of the organization. Pentaho can serve the business analyst who needs to quickly grab some information, add some context and then analyze and share it using Pentaho Instaview. It can also be used by a developer who is using the full Pentaho Business Analytics platform to build a formal data warehouse.
Data integration processes (often called ETL when building a data warehouse, or data prep when an analyst is massaging data) can take place in Pentaho, so that the massive work of massaging and cleaning data does not have to be tightly bound to a specific data warehouse technology or to complicated programming methods. This will be increasingly important for solution building in the cloud, where developers cannot pre-integrate all of the data. Instead, data integration must occur on demand (e.g., combining data from Salesforce.com with data from Amazon Redshift), and Pentaho makes it easy for this to happen as part of an orchestrated process.
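As a rough illustration of what such an on-demand integration step does, here is a minimal Python sketch that joins freshly fetched CRM records (the kind one might pull from Salesforce.com) with warehouse rows (as from Amazon Redshift). The field names and sample data are invented for illustration; PDI would express the same join as visual transformation steps rather than hand-written code.

```python
# Hypothetical on-demand integration: join CRM accounts with warehouse
# orders at query time, instead of pre-integrating them in a warehouse.

def integrate(crm_accounts, warehouse_orders):
    """Join two freshly fetched data sets on account_id."""
    # Index orders by account so the join is a single pass over each set.
    orders_by_account = {}
    for order in warehouse_orders:
        orders_by_account.setdefault(order["account_id"], []).append(order)

    merged = []
    for account in crm_accounts:
        for order in orders_by_account.get(account["account_id"], []):
            merged.append({
                "account_name": account["name"],
                "region": account["region"],
                "order_total": order["total"],
            })
    return merged

crm = [{"account_id": 1, "name": "Acme", "region": "West"}]
orders = [{"account_id": 1, "total": 250.0}, {"account_id": 2, "total": 99.0}]

print(integrate(crm, orders))
# [{'account_name': 'Acme', 'region': 'West', 'order_total': 250.0}]
```

The point is not the join itself but where it happens: at request time, against live sources, as one step in a larger orchestrated process.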
Someone exploring external data sets to see what they offer can do the same thing. Pentaho offers native connectivity to popular structured and semi-structured data sources including:
In addition, Pentaho Data Integration (PDI) can access unstructured, raw data such as tweets, do pattern matching, find the structure, and perform sentiment analysis. The Pentaho Instaview template for Twitter provides an environment to play around with the data. Bringing more people in direct contact with data is vital to solving the data problem first.
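To make “pattern matching and sentiment analysis” concrete, here is a deliberately simple keyword-based sentiment sketch in Python. The word lists and scoring rule are invented for illustration and are far cruder than what a real sentiment pipeline, or the Pentaho Instaview template for Twitter, would use.

```python
# Toy sentiment scoring on raw tweet text via keyword pattern matching.
import re

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "awful", "broken"}

def sentiment(tweet):
    """Return 'positive', 'negative', or 'neutral' for one tweet."""
    # Extract lowercase word tokens, then count matches against each list.
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))    # positive
print(sentiment("The app is awful and broken"))  # negative
```

Even at this toy scale, the shape of the work is visible: find structure in raw text, match patterns, and attach a derived value that downstream analysis can use.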
The most distinctive aspect of PDI’s power for big data processing is its integration with MapReduce programming in Hadoop. Both Map and Reduce programs can be created using PDI; these work within the MapReduce 1.0 framework (MRv1), with plans to support MRv2 later this year. This opens the power of Hadoop to a much wider audience than if it required traditional MapReduce programming in Java or similar languages.
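The division of labor that a PDI-built job fills can be sketched with the classic word-count example. In a real Hadoop job the framework shuffles map output by key across the cluster; in this pure-Python illustration (not PDI’s actual generated code), a dictionary stands in for that shuffle so both phases run in one process.

```python
# Word count as two phases: Map emits (key, value) pairs, Reduce
# aggregates all values emitted under one key.
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for each word in a line of input."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

lines = ["big data", "data problem", "data first"]

shuffled = defaultdict(list)  # stand-in for Hadoop's shuffle/sort step
for line in lines:
    for word, count in map_phase(line):
        shuffled[word].append(count)

results = dict(reduce_phase(w, c) for w, c in shuffled.items())
print(results)  # {'big': 1, 'data': 3, 'problem': 1, 'first': 1}
```

PDI’s contribution is that an analyst assembles the equivalent of `map_phase` and `reduce_phase` from drag-and-drop transformation steps rather than writing Java against the Hadoop API.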
PDI can also serve as an embedded data transformation engine within a real-time application, which means PDI can be inserted into a business process. For instance, PDI could be embedded in a storage appliance and used on demand for device data analysis, such as capacity forecasting or failure prediction.
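As a hypothetical example of the kind of on-demand analysis such an embedded engine might run, here is a capacity forecast that fits a least-squares line to daily usage readings and estimates days until the device fills. The readings and capacity figure are invented for illustration.

```python
# Hypothetical embedded transformation: forecast days until a storage
# device fills, from a short history of daily usage readings.

def forecast_full(used_gb, capacity_gb):
    """Fit a least-squares line to usage and estimate days until full."""
    n = len(used_gb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_gb) / n
    # Slope of the ordinary least-squares fit: daily growth in GB.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_gb)) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # usage is flat or shrinking; nothing to forecast
    return (capacity_gb - used_gb[-1]) / slope

readings = [100, 110, 120, 130, 140]  # GB used on each of 5 days
print(forecast_full(readings, 200))   # 6.0 days until the 200 GB device fills
```

Embedding this kind of logic in the appliance itself, rather than shipping telemetry to a separate analytics system, is what makes the analysis available on demand.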
PDI reduces the time necessary to create a new analytic tool from two days to an hour or two. It makes it quick and easy to access big data sources and enrich them with Pentaho’s analytic and visualization tools. As a result, PDI encourages more experimentation and increases the likelihood that something useful will be discovered.
Essentially, PDI allows one technique to be used over and over in a large number of contexts. A company can build expertise in many different departments, with users helping one another out and training other users.
The ideal result is that more and more people can meet their own needs and perform the all-important function of playing with the data, then apply the same techniques when it comes time to get serious or to scale.
The challenge that many companies face when attempting to create a data-driven culture is that the glamorous part, the chart and graph at the end of the process, is really about 10 percent or less of the work. The analysis may be only 20 percent of the work. The other 70 percent is massaging data from wherever it comes from and getting it into shape to go on stage, as it were. That’s what Pentaho helps with. It may not be glamorous, but it is the sort of work that is essential to solving the data problem first and getting as many people as possible involved.
Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership. For more stories like this one visit www.CITOResearch.com.