Thursday, April 30, 2015

Rserve

What is Rserve?
        
Rserve is a TCP/IP server which allows other programs to use the facilities of R (see www.r-project.org) from various languages without the need to initialize R or link against the R library. Every connection has a separate workspace and working directory. Client-side implementations are available for popular languages such as C/C++, PHP and Java. Rserve supports remote connections, authentication and file transfer. A typical use is to integrate an R backend for computing statistical models, plots, etc. into other applications.


2015 Gartner Magic Quadrant for Business Intelligence report

Saturday, April 25, 2015

Kitematic - The easiest way to use Docker on your Mac.




Brief introduction to Kitematic - Get from zero to having a Docker container running in minutes!

Kitematic is free and open source. If you'd like to contribute please visit our GitHub page at https://github.com/kitematic/kitematic

R with Pentaho Course (Data Science)

R is a free programming language and software environment for statistical analysis and graphics. The R language is widely used by statisticians and data analysts for data analysis and for developing statistical software, and its popularity has been growing significantly every year.

Because R has been adopted by the scientific community, virtually every new algorithm or technique is released and developed in R, making it an extremely dynamic and up-to-date piece of software. Besides being free, R integrates easily with SPSS, SAS and Microsoft Excel (among others), letting users keep working with more familiar tools while adding all of R's functionality.

OBJECTIVE

Train students in the use of R and its main features, with a focus on using the software in a corporate environment. The course also covers techniques needed for Big Data analysis and ways to integrate R with other popular statistical tools such as SPSS, SAS and Microsoft Excel, greatly expanding students' analytical capabilities.

PREREQUISITES

There are no prerequisites for this course. However, to get the most out of the training, students are advised to have basic notions of programming logic.

COURSE DURATION

Three days (24 hours)

PROGRAM

•    Introduction to R and the RStudio GUI;
•    Installing R/RStudio;
•    Libraries;
•    Importing/exporting local data;
•    Extracting and cleaning data from the internet:
RCurl and web scraping (XML and HTML)
•    Advanced graphics:
o    ggplot2
o    Lattice
•    Data manipulation/analysis with real-world examples:
o    Revenue and sales forecasting (regression models)
o    Selecting samples from a population (sampling)
o    Customer segmentation (cluster analysis)
o    Price optimization (choice analysis)
•    Integrating R with other software
o    Excel
o    SPSS
o    SAS
o    Pentaho Data Integration

SCHEDULE

Day 1
Basic R/RStudio training + data manipulation + exercises
Day 2
Optimized programming and extending R's functionality + exercises
Day 3
Graphics and integrating R with other software + exercises

CERTIFICATION

A certificate of participation will be issued to all students who attend all 3 days of the course.

COURSE MATERIALS

•    Slides used in the training (in PDF format);
•    Supplementary materials and documentation produced by the R community (CRAN);

AUDIENCE

Minimum of 10 students per class.

MINIMUM REQUIREMENTS

Each student must bring his or her own computer/laptop to class. The computers must meet the following requirements:
•    Microsoft Office 2007 or newer installed;
•    At least 4 GB of RAM;
•    At least 20 GB of free disk space (HD);
•    Operating system: Windows XP, Vista, 7 or 8;
•    A user account with privileges to install programs on the operating system.
•    Course content will be delivered according to the program presented above.
•    All materials needed for the contracted course will be provided.

LANGUAGES

The training can be delivered in Portuguese or English, by agreement between the parties.

For more information, contact IT4biz at treinamentos@it4biz.com.br

Source:
http://www.it4biz.com.br/novosite/treinamentos/treinamentos/curso-de-r-com-pentaho-data-science/

Tails - Privacy for anyone anywhere

Tails is a live operating system that you can start on almost any computer from a DVD, USB stick, or SD card. It aims at preserving your privacy and anonymity, and helps you to:

  • use the Internet anonymously and circumvent censorship (all connections to the Internet are forced to go through the Tor network);
  • leave no trace on the computer you are using unless you ask it explicitly;
  • use state-of-the-art cryptographic tools to encrypt your files, emails and instant messaging.

https://tails.boum.org

Mendeley is a free reference manager and academic social network.

Mendeley is a free reference manager and academic social network. Make your own fully-searchable library in seconds, cite as you write, and read and annotate your PDFs on any device.

https://www.mendeley.com


PythonAnywhere: Host, run, and code Python in the cloud!

PythonAnywhere: the perfect place to code, host and run your Python code in the cloud



Cascalog and Cascading (Big Data Tools)

Cascading
Cascading is the proven application development platform for building data applications on Hadoop.

http://www.cascading.org/

Cascalog
The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading, and it operates at a significantly higher level of abstraction than those tools.

http://cascalog.org/

Friday, April 24, 2015

Hadoop Tutorial: Analyzing Sentiment Data

Hadoop Tutorial: Analyzing Server Logs

Hadoop Tutorial: Analyzing Geolocation Data

How to become a data scientist

A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it.
In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.
And so, how does one go about that? Your approach will likely depend on your previous experience. Here are some perspectives, from developers to business analysts.

Java Developers

 

If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.
A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve, and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you're not already familiar with Hadoop, learning MapReduce, Pig, Hive and Mahout will be valuable.
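To make "how they are implemented" concrete, here is a minimal k-nearest-neighbors classifier sketched in plain Python (no libraries; the data and names are invented for illustration). Libraries like WEKA or Scikit-learn implement the same idea with far more care around scaling and distance metrics.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; features are numeric tuples."""
    dists = sorted((math.dist(features, query), label) for features, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two tiny clusters: class "a" near (1, 1), class "b" near (8, 8)
train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((8.0, 8.0), "b"), ((7.5, 8.2), "b")]
print(knn_predict(train, (1.1, 1.0)))  # prints: a
```

The whole algorithm is a sort plus a vote; understanding implementations at this level makes the knobs exposed by real libraries (k, the distance metric, weighting) much less mysterious.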

Python Developers

 

If you’re a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are often used in data science such as NumPy and SciPy.
Python has great support for data science applications, especially with libraries such as NumPy/SciPy, Pandas and Scikit-learn, IPython for exploratory analysis, and Matplotlib for visualizations.
To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
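The Hadoop streaming model is simple enough to sketch: the mapper emits key/value pairs, Hadoop sorts them by key, and the reducer aggregates each key's group. The word-count example below simulates that pipeline locally in one file (in practice the mapper and reducer are separate scripts reading stdin and writing tab-separated lines, passed via the streaming jar's `-mapper`/`-reducer` options).

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum counts per word; Hadoop delivers pairs sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["to be or not to be", "that is the question"]
    pairs = sorted(mapper(lines))  # simulates Hadoop's shuffle-and-sort
    for word, total in reducer(pairs):
        print(f"{word}\t{total}")
```

The same two functions, wrapped in small stdin/stdout scripts, run unchanged on a cluster; that is the appeal of streaming for Python developers.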

Statisticians and applied scientists

 

If you’re coming from a statistics or machine-learning background, it’s likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, clustering analysis, classification or similar machine learning tasks.
R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.
However, these tools are typically used for data exploration and model development, and rarely used in isolation to build production-grade data products. In most cases, when building end-to-end data products, you need to mix in other software components written in languages like Java or Python and integrate with data platforms like Hadoop.
Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products. 

Business analysts

 

If your background is SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.
Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and become familiar with current implementations by tools such as Mahout, WEKA, or Python’s Scikit-learn.
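A small example of the kind of logic SQL struggles to express: a toy item-based recommender using cosine similarity between users' ratings (the data, user names and scoring rule here are invented for illustration; real engines in Mahout or Scikit-learn are far more sophisticated).

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating dicts {item: rating}."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def recommend(target, others, ratings):
    """Score items the target hasn't rated by similarity-weighted votes."""
    scores = {}
    for user in others:
        sim = cosine(ratings[target], ratings[user])
        for item, rating in ratings[user].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return max(scores, key=scores.get) if scores else None

ratings = {
    "ann": {"hadoop": 5, "hive": 4},
    "bob": {"hadoop": 5, "hive": 5, "pig": 4},
    "cat": {"excel": 5, "sas": 3},
}
print(recommend("ann", ["bob", "cat"], ratings))  # prints: pig
```

Joins and GROUP BY get you part of the way, but the similarity weighting and per-user scoring loop are exactly the pieces that push you from SQL toward a general-purpose language.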

Hadoop developers

 

If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.
A good first step is to gain deep understanding of machine learning and statistics, and how these algorithms can be implemented efficiently for large datasets. A good first place to look is Mahout which implements many of these algorithms over Hadoop.
Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure to the data before modeling begins. Unfortunately, real-life data is quite “dirty”, and making it ready for modeling tends to take a large share of the work in data science. Hadoop is often the tool of choice for large-scale data cleanup and pre-processing prior to modeling.
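As a toy illustration of the kind of per-record cleanup step that precedes modeling (the field layout and rules below are invented for the example; at scale, the same function would run inside a Hadoop map task):

```python
def clean_record(raw):
    """Normalize one messy 'name,age,city' record, or return None to drop it.
    Illustrative rules: trim whitespace, title-case text, coerce age to int."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 3:
        return None                      # malformed row: drop it
    name, age, city = parts
    if not name or not age.isdigit():
        return None                      # missing name or non-numeric age
    return name.title(), int(age), city.title()

raw_rows = ["  alice , 34, new york", "BOB,, london", "carol,29,paris", "broken row"]
cleaned = [r for r in (clean_record(row) for row in raw_rows) if r]
print(cleaned)  # prints: [('Alice', 34, 'New York'), ('Carol', 29, 'Paris')]
```

Most of the decisions in real cleanup jobs are exactly these: what counts as malformed, what to coerce, and what to drop, which is why this stage consumes so much of a data scientist's time.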

Final thoughts

 

The road to data science is not a walk in the park. You have to learn many new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort and personal investment. But what you find at the end of the road is quite rewarding.
There are many resources you might find useful: books, training, and presentations.
And one more thing: a great way to get started on real world problems is to participate in a data science competition hosted on Kaggle.com. If you do it with a friend, it’s twice the fun.


Source:
http://hortonworks.com/blog/how-to-get-started-in-data-science/

"Data Science: Where are We Going?" - Dr. DJ Patil (Strata + Hadoop 2015)

Saturday, April 11, 2015

Run notebook with ipython



1) Create a folder called notebook on your computer (Mac OS):

mkdir /Users/caiomsouza/python/notebook

2) Start the notebook server from your terminal (Mac OS):

ipython notebook

Real log:

Caios-MacBook-Pro:notebook caiomsouza$ ipython notebook
[I 13:36:04.314 NotebookApp] Using existing profile dir: u'/Users/caiomsouza/.ipython/profile_default'
[I 13:36:04.317 NotebookApp] Writing notebook server cookie secret to /Users/caiomsouza/.ipython/profile_default/security/notebook_cookie_secret
[I 13:36:04.318 NotebookApp] Using MathJax from CDN: https://cdn.mathjax.org/mathjax/latest/MathJax.js
[I 13:36:04.357 NotebookApp] Serving notebooks from local directory: /Users/caiomsouza/python/notebook
[I 13:36:04.357 NotebookApp] 0 active kernels
[I 13:36:04.357 NotebookApp] The IPython Notebook is running at: http://localhost:8888/
[I 13:36:04.357 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

The notebook interface is now available at http://localhost:8888/, where you can run your code.


Link:
http://ipython.org/notebook.html



Friday, April 10, 2015

How to install Anaconda (Python)

What is Anaconda?

Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes over 195 of the most popular Python packages for science, math, engineering and data analysis.

How to install Anaconda (Python) step by step

1) Download Anaconda-2.2.0-MacOSX-x86_64.pkg from http://continuum.io/downloads

2) Double-click the file Anaconda-2.2.0-MacOSX-x86_64.pkg and click through the installer.

3) In your terminal, type spyder to launch the Spyder IDE and confirm the installation, and that's it.



Thursday, April 09, 2015

So live a life you will remember

"When I was 16, my father said, 'You can do anything you want with your life. You just have to be willing to work hard to get it.' That's when I decided, when I die, I want to be remembered for the life I lived—not the money I made." (from the song "The Nights" by Avicii)