Ideas for Solving the 'Data' Problem First, the 'Big' Problem Second: The Pentaho Way
Imagine a word cloud to represent discussions about how to make data useful in business right now. Based on what I saw at Strata in late February in Santa Clara, you would see “big” in letters about 4 inches high and the word “data” in regular 12-point type. As my pals at Gawker say, “Thatz Not Okay.”
The analytical nerds over at Cheezburger recommend solving the “data” problem before tackling the challenge of “big.” To me this seems like good advice, but carrying it out may be a bit tricky because of a disconnect that occurs in the way that data is analyzed in different sizes. Ask yourself this: if you took all the money you were going to spend on “big” and instead spent it on “data” would you be better off?
First, some basic assumptions:
1. No dataset comes ready to provide answers, no matter how small or big it is.
2. Most of the work in analysis involves massaging data and getting it ready so that you can ask questions.
3. The quality of analysis improves with better data.
4. The quality of analysis can improve with more datasets, but not always.
5. The quality of analysis can improve with higher volumes of data, but not always.
6. The more people involved – from business, IT, and development – at all stages, the better.
7. Unstructured data is becoming more important.
8. Real-time analysis is becoming more important.
What does all of this mean?
Assumptions 1 and 2 mean that we will gain productivity if we can focus on compressing the early stages of the analysis pipeline, meaning the data preparation, manipulation and transformation needed to get data ready before analyzing it.
Assumptions 3, 4, and 6 suggest that we should arm as many people as possible to evaluate new datasets to see if they can help.
Assumptions 5 and 7 suggest that sometimes this new data will be big data.
I’m going to leave assumption 8 out of this discussion for the most part and deal with it in detail in another article.
The problem with the current focus on “big” is that it addresses only two of these assumptions, 5 and 7. (Remember that the way “big” is used most often means both voluminous and unstructured.)
The second problem is that most of the time the methods used to handle “big” are specialized and can only be done by high priests of programming, which creates a bottleneck.
My goal for this series of articles is to present a few ways that various types of technology could be used together to create an infrastructure that provides the most possible value for a group of workers. The collections of technology I’m going to propose seek to meet the following criteria:
Increase the number of people involved at all stages.
As much as possible, use the same techniques for massaging and analyzing data at all stages.
As much as possible, use the same techniques for processing both small and big data sets.
The Pentaho Way
Pentaho is an open source business intelligence and big data analytics company that was founded by five deeply nerdy people. Three of the original founders, Richard Daley, Doug Moran, and James Dixon, are focused on Pentaho’s big data technology and strategy.
Unlike many other companies in the realm of data science, Pentaho is focused on what users do to get value from data and how to make it easier, and it shows in their product.
“Pentaho is taking the responsible, adult approach to tackling big data head-on,” says Dave Henry, Senior Vice President, Enterprise Solutions at Pentaho. “We offer great connectivity, easy-to-use data development apps and, by putting data integration and visualization so close together, deliver a productive experience that lets more people work together.”
Pentaho’s secret sauce is a system known as Pentaho Data Integration, which is essentially a visually-oriented toolkit for massaging data that has the following characteristics:
A multitude of connectors bring data in from a huge variety of sources, and you can build your own if a connector is missing.
Transformations can be applied to the data by dragging and dropping functions of various sorts to perform the transformation in a visual interface.
The functions range from simple transformations to those for more complex techniques like regular expressions and machine-learning algorithms.
Pentaho Data Integration can be applied to a single file, to data from any number of sources, from spreadsheets to MPP databases, and also to MapReduce programming.
So, how well does Pentaho meet my criteria?
One powerful aspect of Pentaho Data Integration is that because of its simple, drag-and-drop visual interface, it can be used by analysts in all areas of the organization. Pentaho can serve the business analyst who needs to quickly grab some information, add some context and then analyze and share it using Pentaho Instaview. It can also be used by a developer who is using the full Pentaho Business Analytics platform to build a formal data warehouse.
Data integration processes (often called ETL when building a data warehouse or data prep when an analyst is massaging data) can take place in Pentaho so that the massive work of massaging and cleaning data does not have to be tightly bound to a specific data warehouse technology or to complicated programming methods. This will be increasingly important for solution building in the cloud where developers cannot pre-integrate all of the data. Instead, data integration must occur on demand (for example, combining data from Salesforce.com with data from Amazon Redshift), and Pentaho makes it easy for this to happen as part of an orchestrated process.
Someone exploring external data sets to see what they offer can do the same thing. Pentaho offers native connectivity to popular structured and semi-structured data sources including:
Native connectivity to Hadoop (e.g., Apache Hadoop, Cloudera, Hortonworks, MapR)
Native connectivity to NoSQL databases (e.g., MongoDB, Cassandra, HBase)
Native connectivity to analytic databases (e.g., Vertica, Greenplum, Teradata)
Connectivity to enterprise applications (e.g., SAP)
Connectivity to cloud-based and SaaS applications (e.g., Salesforce, Amazon Web Services)
In addition, Pentaho Data Integration (PDI) can access unstructured, raw data such as tweets, do pattern matching, find the structure, and perform sentiment analysis. The Pentaho Instaview template for Twitter provides an environment to play around with the data. Bringing more people in direct contact with data is vital to solving the data problem first.
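To make "do pattern matching, find the structure" a little more concrete, here is a minimal sketch in plain Python. It is not PDI code and does not reflect how Pentaho implements these steps; the word lists, field names, and the structure_tweet function are all hypothetical, and the "sentiment" is a deliberately crude word count rather than a real model.

```python
import re

# Hypothetical word lists for illustration only; a real sentiment step
# would use a far richer model than counting a handful of words.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
WORD = re.compile(r"[a-z']+")


def structure_tweet(text):
    """Turn one raw tweet into a record with named fields."""
    words = WORD.findall(text.lower())
    # Score = positive words minus negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "sentiment": sentiment,
    }
```

Calling structure_tweet on a made-up string such as "@acme I love the new dashboard, so fast! #bigdata" gives a record with a mentions list, a hashtags list, and a sentiment label. The raw text has become rows and columns, which is the point at which ordinary analysis tools can take over.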
The most distinctive thing about PDI’s power for big data processing is its integration with MapReduce programming in Hadoop. Both Map and Reduce programs can be created using PDI; these work within the MapReduce 1.0 framework (MRv1) with plans to support MRv2 later this year. This opens the power of Hadoop to a much wider audience than if traditional MapReduce programming in Java or other such languages is required.
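For readers who have not written a MapReduce job, the sketch below shows the shape of the two programs using a word count. It is a single-process Python illustration of the map and reduce pattern, not Hadoop code and not what PDI generates; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and shuffle only stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even in this toy form, the author of the job has to think in terms of key and value pairs and two separate functions. That is the mental overhead a visual tool aims to hide from people who just want to transform data.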
PDI can be an embedded data transformation engine that is part of a real time application. As a result, PDI can be inserted into a business process. For instance, PDI could be embedded in a storage appliance used on demand for device data analysis, such as capacity forecasting or failure prediction.
PDI reduces the time necessary to create a new analytic tool from two days to an hour or two. It makes it quick and easy to access big data sources and enrich them with Pentaho’s analytic and visualization tools. As a result, PDI encourages more experimentation and increases the likelihood that something useful will be discovered.
Essentially, PDI allows one technique to be used over and over in a large number of contexts. A company can build expertise in many different departments and allow people to help each other out and have users train other users.
The ideal result is that more and more people can meet their own needs and perform the all-important function of playing with the data, and then, when it comes time to get serious or to scale, keep using the same techniques.
The challenge that many companies have when attempting to create a data-driven culture is that the glamorous part, the chart and graph at the end of the process, is really about 10 percent or less of the work. The analysis may be only 20 percent of the work. The other 70 percent of the work is massaging data from wherever it comes from and getting it into shape to go on stage, as it were. That’s what Pentaho helps with. It may not be glamorous, but it is the sort of work that is essential to solving the data problem first and getting as many people as possible involved.