Professor Coruja - Business and Open Source Technology: March 2014

Monday, March 24, 2014

Openings at IT4biz to work with BI, Pentaho, ETL, DW, web design, front end, sales, etc.

Hi everyone,

We are hiring!!!

In 2014, IT4biz is hiring people interested in working with open source, BI, Pentaho, ETL, DW, web design, front end, sales, web development, mobile development, etc.

If you would like to help IT4biz grow, send your résumé in PDF format to rh@it4biz.com.br.

We need help with marketing, sales, development, BI projects, consulting, infrastructure, databases, and many other things.

We are looking for young professionals who are proactive, honest, committed, eager to learn, passionate about free software, and who want to make a difference in the world.

More information is available on the site:

http://www.it4biz.com.br/novosite/company/trabalhe-conosco/

Friday, March 21, 2014

Escritório Geek is a company specializing in shared office space (co-working) for geeks and for information technology and innovation startups.

Dear reader,

I am helping to spread the word about Escritório Geek.

What is Escritório Geek?

Escritório Geek is a company specializing in shared office space (co-working) for geeks and for information technology and innovation startups. Escritório Geek is the first Brazilian company 100% focused on geek entrepreneurs and companies. We provide all the services a geek company needs to get started, from a shared office (co-working) to many other services.

There are four locations: downtown São Paulo, Berrini, Paulista, and Butantã.

The idea is to bring geeks together.

Do you like free software? Programming? Web design? Mobile app development? Open data? Linux? Java?

You have to get to know Escritório Geek.

The address is www.escritoriogeek.com or facebook.com/escritoriogeek

To join, you need to be a geek and be accepted as a member of the Escritório Geek club.

Thursday, March 20, 2014

Saiku Chart Plus is available at Pentaho Marketplace

Saiku Chart Plus is an open source project that helps Pentaho BI users to create other types of charts and maps based on Saiku Project, Highcharts and Google Maps.

Developed and maintained by the IT4biz team since 03/12/2013, it has been released in its RC4 version with good news for Pentaho BA server users: Saiku Chart Plus is available at Pentaho Marketplace.

Thanks to the @webdetails team and the @SaikuAnalytics team for their help in this process.

Requirements and Installation

Uninstall your current Saiku version and install the latest trunk snapshot; this step is really necessary!
Then install Saiku Chart Plus (all through Pentaho Marketplace).

Read More

Access the project’s website http://it4biz.github.io/SaikuChartPlus/

Enjoy it!

Source:
http://blog.it4biz.com.br/2014/03/20/saiku-chart-plus-available-at-pentaho-marketplace/

Wednesday, March 19, 2014

Big Data: future or present? Why should I care about this topic?

Friends of the Pentaho community,

Have you heard of Big Data?

Big Data: future or present? Why should I care about this topic?

I have gathered some links on the subject:

1) Job of the future: Digital Garbage Collector. Why the cover of Veja magazine is about Big Data.

http://blog.professorcoruja.com/2013/06/profissao-do-futuro-lixeiro-digital-por.html

2) Pentaho Big Data

http://blog.professorcoruja.com/2012/01/pentaho-big-data-h.html

3) Video - See real-world applications of Big Data
http://olhardigital.uol.com.br/pro/video/39376/39376

4) Big Data Explained: What is Big Data?
http://www.mongodb.com/learn/big-data

5) WHAT IS BIG DATA?
http://www.sas.com/offices/latinamerica/brazil/solucoes/bigdata/

6) How Big Data works
http://oglobo.globo.com/infograficos/bigdata/

7) Course teaches big data analysis on Twitter
http://info.abril.com.br/noticias/extras/curso-ensina-analise-de-big-data-no-twitter-10072013-4.shl

8) Info 290. Analyzing Big Data with Twitter
http://www.ischool.berkeley.edu/newsandevents/audiovideo/webcast/21963

9) UC Berkeley Course Lectures: Analyzing Big Data With Twitter
http://blogs.ischool.berkeley.edu/i290-abdt-s12/

10) Big data case studies and numbers in Brazil in 2013
http://info.abril.com.br/noticias/ti/2013/09/os-cases-e-numeros-do-big-data-no-brasil-em-2013.shtml

11) Schools improve teaching and create new businesses with big data
http://info.abril.com.br/noticias/it-solutions/2013/08/escolas-melhoram-ensino-e-criam-novos-negocios-com-big-data.shtml

12) How big data security can protect your business
http://info.abril.com.br/noticias/ti/2013/09/como-a-seguranca-do-big-data-pode-proteger-seu-negocio.shtml

13) Exame Abril - Big Data

http://exame.abril.com.br/topicos/big-data

14) Big Data on AWS
http://aws.amazon.com/pt/big-data/

15) Oracle and Big Data
Big Data for the Enterprise
http://www.oracle.com/br/technologies/big-data/index.html

16) Big data: The next frontier for innovation, competition, and productivity
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

17) CREATE NEW BUSINESS VALUE USING BIG DATA
http://brazil.emc.com/campaign/bigdata/index.htm

18) Big Data + Sparkl + Pentaho course

http://www.it4biz.com.br/novosite/treinamentos/treinamentos/curso-de-big-data-sparkl-por-pedro-alves/

19) Pentaho Big Data

http://pentahobigdata.com/

Confirmed: Big Data + Sparkl training with Pedro Alves, May 19 to 21, 2014, at IT4biz in São Paulo

Pentaho community,

The Big Data + Sparkl training with Pedro Alves is confirmed for May 19 to 21, 2014, at IT4biz in São Paulo.

Learn more on the site:

http://www.webdetails.pt/training/

Or send an e-mail to treinamentos@it4biz.com.br

Course location: IT4biz São Paulo

Friday, March 14, 2014

Leonine clause (cláusula leonina)

A leonine clause is a term inserted unilaterally into a contract that harms the rights of the other party, usually by taking advantage of an unequal position between the contracting parties. Such abusive clauses violate good faith and cause a serious imbalance in the rights and obligations of the parties, to the detriment of the weaker party. The law considers them void, although this does not void the contract as a whole.

The expression "leonine clause" comes from one of Aesop's fables: a cow, a goat, and a sheep had made a pact with a lion, and together they caught a deer. When it had been divided into four parts and each wanted to take its share, the lion said: the first part is mine, for it is my right as a lion; the second belongs to me because I am stronger than you; the third I also take because I worked harder than all of you; and whoever touches the fourth will have me as an enemy. And so he took the whole deer for himself.

Source:

http://pt.wikipedia.org/wiki/Cl%C3%A1usula_leonina

Tuesday, March 11, 2014

Piwik Open Source Web Analytics

What is Piwik?

Piwik is an open analytics platform currently used by individuals, companies and governments all over the world. With Piwik, your data will always be yours. Learn why Piwik is the right web analytics tool for you below.

https://piwik.org/what-is-piwik/

Some Reasons To Choose Piwik Analytics over Google Analytics
http://www.statstory.com/some-reasons-to-choose-piwik-analytics-over-google-analytics/

Optimizing Mondrian Performance

By Sherman Wood and Julian Hyde; last updated November, 2007.

Introduction

As with any data warehouse project, dealing with volumes is always the make or break issue. Mondrian has its own issues, based on its architecture and goals of being cross platform. Here are some experiences and comments.

From the Mondrian developer's mailing list in February, 2005 - an example of unoptimized performance:

When Mondrian initializes and starts to process the first queries, it makes SQL calls to get member lists and determine cardinality, and then to load segments into the cache. When Mondrian is closed and restarted, it has to do that work again. This can be a significant chunk of time depending on the cube size. For example, in one test an 8GB cube (55M row fact table) took 15 minutes (mostly doing a group by) before it returned results from its first query, and absent any caching on the database server would take another 15 minutes if you closed it and reopened the application. Now, this cube was just one month of data; imagine the time if there was 5 years' worth.

Since this time, Mondrian has been extended to use aggregate tables and materialized views, which have a lot of performance benefits that address the above issue.

From Julian:

I'm surprised that people can run 10m+ row fact tables on Mondrian at all, without using aggregate tables or materialized views.

From Sherman:

Our largest site has a cube with currently ~6M facts on a single low end Linux box running our application with Mondrian and Postgres (not an ideal configuration), without aggregate tables, and gets sub second response times for the user interface (JPivot). This was achieved by tuning the database to support the queries being executed, modifying the OS configuration to best support Postgres execution (thanks Josh!) and adding as much RAM as possible.

A generalized tuning process for Mondrian

The process for addressing performance of Mondrian is a combination of design, hardware, database and other configuration tuning. For really large cubes, the performance issues are driven more by the hardware, operating system and database tuning than anything Mondrian can do.

1) Have a reasonable physical design for requirements, such as a data warehouse and specific data marts
2) Architect the application effectively
   - Separate the environment where Mondrian is executing from the DBMS
   - If possible: separate UI processing from the environment where Mondrian is caching
3) Have adequate hardware for the DBMS
4) Tune the operating system for the DBMS
5) Add materialized views or aggregate tables to support specific MDX queries (see Aggregate Tables and AggGen below)
6) Tune the DBMS for the specific SQL queries being executed: that is, indexes on both the dimensions and fact table
7) Tune the Mondrian cache: the larger the better

Recommendations for database tuning

As part of database tuning process, enable SQL tracing and tail the log file. Run some representative MDX queries and watch which SQL statements take a long time. Tune the database to fix those statements and rerun.

- Indexes on primary and foreign keys
- Consider enabling foreign keys
- Ensure that columns are marked NOT NULL where possible
- If a table has a compound primary key, experiment with indexing subsets of the columns with different leading edges. For example, for columns (a, b, c) create a unique index on (a, b, c) and non-unique indexes on (b, c) and (c, a). Oracle can use such indexes to speed up counts.
- On Oracle, consider using bitmap indexes for low-cardinality columns. (Julian implemented Oracle's bitmap index feature, and he's rather proud of it!)
- On Oracle, Postgres and other DBMSs, analyze tables, otherwise the cost-based optimizers will not be used

Mondrian currently uses 'count(distinct ...)' queries to determine the cardinality of dimensions and levels as it starts, and for your measures that are counts, that is, aggregator="count". Indexes might speed up those queries, although performance is likely to vary between databases, because optimizing count-distinct queries is a tricky problem.
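
As a rough illustration of the indexing advice above, here is a minimal sketch using Python's built-in sqlite3 module. The table `fact`, its columns and the index names are hypothetical, and SQLite stands in for whatever DBMS you actually run, so treat the statements as a pattern rather than tuning advice for a specific database.

```python
import sqlite3

# Hypothetical fact table with a compound primary key (a, b, c).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (
        a INTEGER NOT NULL,
        b INTEGER NOT NULL,
        c INTEGER NOT NULL,
        amount REAL NOT NULL,
        PRIMARY KEY (a, b, c)  -- acts as the unique index on (a, b, c)
    );
    -- Non-unique indexes with different leading columns.
    CREATE INDEX fact_b_c ON fact (b, c);
    CREATE INDEX fact_c_a ON fact (c, a);
""")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [(a, b, c, 1.0) for a in range(3) for b in range(4) for c in range(5)],
)
# Refresh optimizer statistics (SQLite's counterpart of analyzing tables).
conn.execute("ANALYZE")

# The kind of cardinality query described above.
distinct_b = conn.execute("SELECT COUNT(DISTINCT b) FROM fact").fetchone()[0]
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list('fact')"))
print(distinct_b, index_names)
```

Whether the extra indexes actually help depends on the database and the queries, so it is worth checking the SQL trace before and after adding them.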

Aggregate Tables, Materialized Views and Mondrian

The best way to increase the performance of Mondrian is to build a set of aggregate (summary) tables that coexist with the base fact table. These aggregate tables contain pre-aggregated measures built from the fact table.

Some databases, particularly Oracle, can automatically create these aggregations through materialized views, which are tables created and synchronized from views. Otherwise, you will have to maintain the aggregation tables through your data warehouse load processes, usually by clearing them and rerunning aggregating INSERTs.
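
The "clear and rerun aggregating INSERTs" approach can be sketched as follows. This is a minimal example under stated assumptions: the tables `sales_fact` and `agg_sales_by_store`, their columns and the sample rows are all hypothetical, and SQLite (via Python's sqlite3 module) stands in for the warehouse database. The `fact_count` column is included on the assumption that the aggregate table should record how many fact rows each summary row covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (
        time_id INTEGER NOT NULL,
        store_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        unit_sales REAL NOT NULL
    );
    CREATE TABLE agg_sales_by_store (
        store_id INTEGER NOT NULL,
        unit_sales REAL NOT NULL,
        fact_count INTEGER NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 10, 100, 2.0), (1, 10, 101, 3.0), (2, 11, 100, 4.0)],
)

def rebuild_aggregate(conn):
    """Clear the aggregate table and repopulate it from the fact table."""
    with conn:  # commit the DELETE and the INSERT together
        conn.execute("DELETE FROM agg_sales_by_store")
        conn.execute("""
            INSERT INTO agg_sales_by_store (store_id, unit_sales, fact_count)
            SELECT store_id, SUM(unit_sales), COUNT(*)
            FROM sales_fact
            GROUP BY store_id
        """)

rebuild_aggregate(conn)
rows = conn.execute(
    "SELECT store_id, unit_sales, fact_count "
    "FROM agg_sales_by_store ORDER BY store_id"
).fetchall()
print(rows)
```

In a real load process this function would run after each fact table load, so that the summary never drifts from the detail rows.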

Aggregate tables are introduced in the Schema Guide, and described in more detail in their own document, Aggregate Tables.

Choosing aggregate tables

It isn't easy to choose the right aggregate tables. For one thing, there are so many to choose from: even a modest cube with six dimensions each with three levels has 4⁶ = 4096 possible aggregate tables, since each dimension can be kept at any of its three levels or dropped entirely! And aggregate tables interfere with each other. If you add a new aggregate table, Mondrian may use an existing aggregate table less frequently.

Missing aggregate tables may not even be the problem. Choosing aggregate tables is part of a wider performance tuning process, where finding the problem is more than half of the battle. The real cause may be a missing index on your fact table, a cache that isn't large enough, or (if you're running Oracle) the fact that you forgot to compute statistics. (See recommendations, above.)

Performance tuning is an iterative process. The steps are something like this:

1) Choose a few queries which are typical for those the end-users will be executing.
2) Run your set of sample queries, and note how long they take. Now that the cache has been primed, run the queries again: has performance improved?
3) Is the performance good enough? If it is, stop tuning now! If your data set isn't very large, you probably don't need any aggregate tables.
4) Decide which aggregate tables to create. If you turn on SQL tracing, looking at the GROUP BY clauses of the long-running SQL statements will be a big clue here.
5) Register the aggregate tables in your catalog, create the tables in the database, populate the tables, and add indexes.
6) Restart Mondrian, to flush the cache and re-read the schema, then go to step 2 to see if things have improved.
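
The "run, note the time, run again" part of this loop is easy to script. The sketch below is only a timing harness: `fake_run_query` is a hypothetical stand-in for whatever client call sends an MDX query to your server, and the sample query text is an assumption, not taken from a real schema.

```python
import time

def time_queries(run_query, queries):
    """Run each query once and return {query: elapsed_seconds}."""
    timings = {}
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        timings[query] = time.perf_counter() - start
    return timings

def compare_cold_and_warm(run_query, queries):
    """Time a first pass, then a second pass with the cache primed."""
    cold = time_queries(run_query, queries)
    warm = time_queries(run_query, queries)
    return [(q, cold[q], warm[q]) for q in queries]

# Stand-in for a real MDX client call; replace with your own function.
def fake_run_query(query):
    return len(query)

sample_queries = [
    "SELECT {[Measures].[Unit Sales]} ON COLUMNS FROM [Sales]",
]
report = compare_cold_and_warm(fake_run_query, sample_queries)
for query, cold_s, warm_s in report:
    print(f"cold={cold_s:.6f}s warm={warm_s:.6f}s  {query}")
```

Keeping the same query list between tuning rounds makes it possible to compare results after each new aggregate table or index.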

AggGen

AggGen is a tool that generates SQL to support the creation and maintenance of aggregate tables, and provides a template for the creation of materialized views for databases that support them. Given an MDX query, the generated create/insert SQL is optimal for the given query. The generated SQL covers both the "lost" and "collapsed" dimensions. For usage, see the documentation for CmdRunner.

Optimizing Calculations with the Expression Cache

Mondrian may have performance issues if your schema makes intensive use of calculations. Mondrian executes calculations very efficiently, so usually the time spent calculating expressions is insignificant compared to the time spent executing SQL, but if you have many layers of calculated members and sets, in particular set-oriented constructs like the Aggregate function, it is possible that many thousands of calculations will be required for each cell.

To see whether calculations are causing your performance problem, turn on SQL tracing and measure what proportion of the time is spent executing SQL. If SQL is less than 50% of the time, it is possible that excessive calculations are responsible for the rest. (If the result set is very large, and if you are using JPivot or XML/A, the cost of generating HTML or XML is also worth investigating.)

Mondrian caches cell values retrieved from the database, but it does not generally cache the results of calculations. (The sole case where mondrian caches expression results automatically is for the second argument of the Rank(<Member>, <Set>[, <Calc Expression>]) function, since this function is typically evaluated many times for different members over the same set.)

Since calculations are very efficient, this is generally the best policy: it is better for mondrian to use the available memory to cache values retrieved from the database, which are much slower to re-create.

The expression cache only caches expression results for the duration of a single statement. The results are not available for other statements. The expression cache also takes into account the evaluation context, and the known dependencies of particular functions and operators. For example, the expression

Filter([Store].[City].Members, ([Store].CurrentMember.Parent, [Time].[1997].[Q1]) > 100)

depends on all dimensions besides [Store] and [Time], because the expression overrides the value of the [Store] and [Time] dimensions inherited from the context, but the implicit evaluation of a cell pulls in all other dimensions. If the expression result has been cached for the context ([Store].[USA], [Time].[1997].[Q2], [Gender].[M]), the cache knows that it will return the same value for ([Store].[USA].[CA], [Time].[1997].[Q3], [Gender].[M]); however, ([Store].[USA], [Time].[1997].[Q2], [Gender].[F]) will require a new cache value, because the dependent dimension [Gender] has a different value.
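
The dependency idea can be illustrated with a toy cache whose key keeps only the dimensions an expression depends on. This is a sketch of the concept, not Mondrian's implementation: the class `ExpressionCache`, the function `cache_key` and the plain-string member names are all hypothetical.

```python
def cache_key(context, dependent_dims):
    """Keep only the dimensions the expression actually depends on."""
    return tuple(
        sorted((dim, member) for dim, member in context.items()
               if dim in dependent_dims)
    )

class ExpressionCache:
    """Toy cache: contexts that agree on the dependent dimensions share a value."""

    def __init__(self, dependent_dims):
        self.dependent_dims = frozenset(dependent_dims)
        self.values = {}
        self.evaluations = 0

    def get(self, context, compute):
        key = cache_key(context, self.dependent_dims)
        if key not in self.values:
            self.evaluations += 1
            self.values[key] = compute(context)
        return self.values[key]

# The expression overrides Store and Time, so only Gender matters here.
cache = ExpressionCache(dependent_dims={"Gender"})
compute = lambda ctx: f"result for Gender={ctx['Gender']}"

first = cache.get({"Store": "USA", "Time": "1997.Q2", "Gender": "M"}, compute)
second = cache.get({"Store": "USA.CA", "Time": "1997.Q3", "Gender": "M"}, compute)
third = cache.get({"Store": "USA", "Time": "1997.Q2", "Gender": "F"}, compute)
print(cache.evaluations)
```

The first two contexts differ only in Store and Time, so they share one cached value; the third changes Gender and triggers a second evaluation.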

However, if your application is very calculation intensive, you can use the Cache() function to tell mondrian to store the results of the expression in the expression cache. The first time this function is called, it evaluates its argument and stores it in the expression cache; subsequent calls within an equivalent context will retrieve the value from the cache. We recommend that you use this function sparingly. If you have cached a frequently evaluated expression, then it will not be necessary to cache sub-expressions or super-expressions; the sub-expressions will be evaluated less frequently, and the super-expressions will evaluate more quickly because their expensive argument has been cached.

Source:

http://mondrian.pentaho.com/documentation/performance.php

Mondrian Architecture - Storage and aggregation strategies

Storage and aggregation strategies

OLAP Servers are generally categorized according to how they store their data:

A MOLAP (multidimensional OLAP) server stores all of its data on disk in structures optimized for multidimensional access. Typically, data is stored in dense arrays, requiring only 4 or 8 bytes per cell value.
A ROLAP (relational OLAP) server stores its data in a relational database. Each row in a fact table has a column for each dimension and measure.

Three kinds of data need to be stored: fact table data (the transactional records), aggregates, and dimensions.

MOLAP databases store fact data in multidimensional format, but if there are more than a few dimensions, this data will be sparse, and the multidimensional format does not perform well. A HOLAP (hybrid OLAP) system solves this problem by leaving the most granular data in the relational database, but stores aggregates in multidimensional format.

Pre-computed aggregates are necessary for large data sets, otherwise certain queries could not be answered without reading the entire contents of the fact table. MOLAP aggregates are often an image of the in-memory data structure, broken up into pages and stored on disk. ROLAP aggregates are stored in tables. In some ROLAP systems these are explicitly managed by the OLAP server; in other systems, the tables are declared as materialized views, and they are implicitly used when the OLAP server issues a query with the right combination of columns in the group by clause.

The final component of the aggregation strategy is the cache. The cache holds pre-computed aggregations in memory so subsequent queries can access cell values without going to disk. If the cache holds the required data set at a lower level of aggregation, it can compute the required data set by rolling up.
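
Rolling up from a cached aggregation can be sketched in a few lines. The cell values and the (state, quarter) granularity below are made-up illustration data, and the approach assumes an additive measure such as a sum; measures like distinct counts cannot be rolled up this way.

```python
from collections import defaultdict

# Cached aggregation at (state, quarter) granularity: hypothetical numbers.
cached = {
    ("CA", "Q1"): 10.0,
    ("CA", "Q2"): 12.0,
    ("WA", "Q1"): 5.0,
    ("WA", "Q2"): 7.0,
}

def roll_up(cells, keep):
    """Sum cells over the key positions not listed in `keep`."""
    totals = defaultdict(float)
    for key, value in cells.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

by_state = roll_up(cached, keep=[0])     # drop the quarter
by_quarter = roll_up(cached, keep=[1])   # drop the state
grand_total = roll_up(cached, keep=[])   # drop both
print(by_state, by_quarter, grand_total)
```

All three coarser results come from memory, with no further query against the fact table.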

The cache is arguably the most important part of the aggregation strategy because it is adaptive. It is difficult to choose a set of aggregations to pre-compute which speed up the system without using huge amounts of disk, particularly those with a high dimensionality or if the users are submitting unpredictable queries. And in a system where data is changing in real-time, it is impractical to maintain pre-computed aggregates. A reasonably sized cache can allow a system to perform adequately in the face of unpredictable queries, with few or no pre-computed aggregates.

Mondrian's aggregation strategy is as follows:

Fact data is stored in the RDBMS. Why develop a storage manager when the RDBMS already has one?
Read aggregate data into the cache by submitting group by queries. Again, why develop an aggregator when the RDBMS has one?
If the RDBMS supports materialized views, and the database administrator chooses to create materialized views for particular aggregations, then Mondrian will use them implicitly. Ideally, Mondrian's aggregation manager should be aware that these materialized views exist and that those particular aggregations are cheap to compute. It should even offer tuning suggestions to the database administrator.

The general idea is to delegate unto the database what is the database's. This places additional burden on the database, but once those features are added to the database, all clients of the database will benefit from them. Multidimensional storage would reduce I/O and result in faster operation in some circumstances, but I don't think it warrants the complexity at this stage.

A wonderful side-effect is that because Mondrian requires no storage of its own, it can be installed by adding a JAR file to the class path and be up and running immediately. Because there are no redundant data sets to manage, the data-loading process is easier, and Mondrian is ideally suited to do OLAP on data sets which change in real time.

Source:

http://mondrian.pentaho.com/documentation/architecture.php

Mondrian Architecture - Layers of a Mondrian system

Layers of a Mondrian system

A Mondrian OLAP System consists of four layers; working from the eyes of the end-user to the bowels of the data center, these are as follows: the presentation layer, the dimensional layer, the star layer, and the storage layer. (See figure 1.)

The presentation layer determines what the end-user sees on his or her monitor, and how he or she can interact to ask new questions. There are many ways to present multidimensional datasets, including pivot tables (an interactive version of the table shown above), pie, line and bar charts, and advanced visualization tools such as clickable maps and dynamic graphics. These might be written in Swing or JSP, charts rendered in JPEG or GIF format, or transmitted to a remote application via XML. What all of these forms of presentation have in common is the multidimensional 'grammar' of dimensions, measures and cells in which the presentation layer asks the question, and the OLAP server returns the answer.

The second layer is the dimensional layer. The dimensional layer parses, validates and executes MDX queries. A query is evaluated in multiple phases. The axes are computed first, then the values of the cells within the axes. For efficiency, the dimensional layer sends cell-requests to the aggregation layer in batches. A query transformer allows the application to manipulate existing queries, rather than building an MDX statement from scratch for each request. And metadata describes the dimensional model, and how it maps onto the relational model.

The third layer is the star layer, and is responsible for maintaining an aggregate cache. An aggregation is a set of measure values ('cells') in memory, qualified by a set of dimension column values. The dimensional layer sends requests for sets of cells. If the requested cells are not in the cache, or derivable by rolling up an aggregation in the cache, the aggregation manager sends a request to the storage layer.

The storage layer is an RDBMS. It is responsible for providing aggregated cell data, and members from dimension tables. I describe below why I decided to use the features of the RDBMS rather than developing a storage system optimized for multidimensional data.

These components can all exist on the same machine, or can be distributed between machines. Layers 2 and 3, which comprise the Mondrian server, must be on the same machine. The storage layer could be on another machine, accessed via remote JDBC connection. In a multi-user system, the presentation layer would exist on each end-user's machine (except in the case of JSP pages generated on the server).

Source:
http://mondrian.pentaho.com/documentation/architecture.php

OpenNMS is the world’s first enterprise grade network management application platform (Open Source)

OpenNMS is an award-winning network management application platform with a long track record of providing solutions for enterprises and carriers.

OpenNMS is the world’s first enterprise grade network management application platform developed under the open source model.

Well, what does that mean?

World’s First: The OpenNMS Project was started in July of 1999 and registered on SourceForge in March of 2000. It has years of experience on the alternatives.

Enterprise Grade: It was designed from “day one” to monitor tens of thousands to ultimately unlimited devices with a single instance. It brings the power, scalability and flexibility that enterprises and carriers demand.

Application Platform: While OpenNMS is useful “out of the box,” it is designed to be highly customizable to create a unique and integrated management solution.

Open Source: OpenNMS is 100% Free and Open Source software, with no license fees, software subscriptions or special “enterprise” versions.

Link:

http://www.opennms.org/about/

http://pt.wikipedia.org/wiki/Simple_Network_Management_Protocol

How to generate a password using PDI's encr.sh

Everyone,

To generate an encrypted password using PDI's encr.sh, run the command below, replacing sua_senha ("your password") with the password to encrypt:

./encr.sh -kettle sua_senha

Result:

MacBook-Air-de-Caio:data-integration caiomsouza$ ./encr.sh -kettle sua_senha
/Applications/Pentaho/pdi-ce-5.0.1-stable/data-integration
Encrypted 2be98afc86aa7f297be189163db9ca7db
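
If you want to pick the encrypted value out of that output in a script, a small parser is enough. This is a minimal sketch that assumes the output looks like the run shown above (a path line followed by a line starting with "Encrypted "); the function name `extract_encrypted_password` is hypothetical.

```python
def extract_encrypted_password(output):
    """Return the first output line that starts with 'Encrypted ', else None."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Encrypted "):
            return line
    return None

# Output copied from the run shown above.
sample_output = """/Applications/Pentaho/pdi-ce-5.0.1-stable/data-integration
Encrypted 2be98afc86aa7f297be189163db9ca7db
"""
print(extract_encrypted_password(sample_output))
```

The text passed in could come from capturing the script's standard output, for example with Python's subprocess module.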

PDI 5.0.1 Error: Oracle / PostgreSQL / MySQL date / timestamp problem using "Dimension lookup/update"

Solution:

DB Connection properties->Advanced->Uncheck "Supports the timestamp datatype" option.

Log error:
Timestamp : There was a data type error: the data type of java.util.Date

Problem:

This is a reproducible issue tested in the following conditions:

OS: Mac OS X Mavericks
MySQL: 5.5.29
PDI: 5.0.1

The Dimension Lookup/Update Step fails when trying to update a row. It will work fine on initial load and if there are no changes found in the target dimension table. As soon as there is any update, an error is thrown with the following text:

Caused by: java.lang.RuntimeException: date_from Timestamp : There was a data type error: the data type of java.util.Date object [Mon Jan 01 00:00:00 EST 1900] does not correspond to value meta [Timestamp]

Sample transformation attached

Solution:

Found the fix for this problem (thanks mainly to kgdeck on the forums):

I had to go under the DB Connection properties->Advanced->Uncheck "Supports the timestamp datatype" option. All fixed.

That being said, I'm leaving this issue open to essentially ask the question of why that was enabled by default. I don't know much about other MySQL versions or if this is a global option for all DB connection types but in my case it should have been switched off by default. At the very least, I think it might be worth noting in the Dimension Lookup/Update step docs so that others understand why they get this error. Just my two cents..

Links:

http://forums.pentaho.com/showthread.php?156107-Oracle-date-timestamp-problem-using-quot-Dimension-lookup-update-quot

Dimension Lookup/Update Step Fails with MySQL
http://jira.pentaho.com/browse/PDI-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
