Purpose of this Page

This page documents my journey into the Data Science / Data Analytics specialization and consolidates relevant, one-stop information that may be useful to anyone who wishes to do the same. This Kaggle page also provides very comprehensive information for aspiring Data Scientists. Below are some skillsets that are important for a Data Scientist / Data Analyst.

As Deep Learning and Artificial Intelligence (AI) have grown in importance, I have recently added a section on them. Since Apache Spark is faster than other big data processing frameworks (10x to 100x faster than Hadoop MapReduce), I have included some information on it as well. Additionally, as Apache Airflow is attracting more attention worldwide as a standard ETL platform, a dedicated section covers it too.

As Docker and Kubernetes are becoming the de facto technologies for containerization and orchestration in the automation journey, I have added a section on them as well.

Statistical Tool Skills
Using R, Python, SAS, Tableau, Qlik, RapidMiner, SPSS, Excel (VBA), a database querying language like SQL, or other tools to piece together your propositions and uncover potential patterns and correlations through statistics is the heart of working with data, and where you discover and apply your creativity.

R and Python are the two most popular programming languages used by data analysts and data scientists. Both are free and open source.
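As a minimal sketch of combining SQL querying with Python's built-in statistics, the snippet below uses a hypothetical in-memory sales table (the table name, columns, and figures are invented for illustration):

```python
import sqlite3
import statistics

# Hypothetical in-memory table of regional sales, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("North", 150.0), ("South", 90.0), ("South", 110.0)],
)

# SQL handles the aggregation; the statistics module summarizes raw values.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
amounts = [row[0] for row in conn.execute("SELECT amount FROM sales")]

print(totals)                    # {'North': 270.0, 'South': 200.0}
print(statistics.mean(amounts))  # 117.5
```

The same GROUP BY pattern scales from this toy table to a production warehouse query; only the connection string changes.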

Statistics and Math skills
Understanding correlation, multivariate regression, and all the ways of massaging data to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge, and really step one of revealing intelligence.
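To make the correlation idea concrete, here is a from-first-principles Pearson correlation in plain Python (the toy data is invented; in practice you would reach for R, pandas, or statsmodels):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition:
    covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship yields r = 1.0 (up to rounding).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(pearson_r(hours, score))
```

Multivariate regression generalizes this single-predictor picture to many predictors at once, which is why the two concepts are usually learned together.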

Visualization Tool Skills and Storytelling Skills
Visualizing and communicating data is incredibly important, especially at young companies making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions. When it comes to communicating, this means describing your findings, or the way techniques work, to both technical and non-technical audiences. On the visualization side, it can be immensely helpful to be familiar with data visualization tools like ggplot2 in R. It is important to know not just the tools for visualizing data, but also the principles behind visually encoding data and communicating information. Another visualization tool worth looking into is D3.js (Data-Driven Documents), a JavaScript library for manipulating documents based on data. D3 helps bring data to life using HTML, SVG, and CSS.

With the flood of supply chain data available to businesses these days, companies are turning to analytics solutions to extract meaning from huge volumes of data and improve decision making, transforming data into valuable information and insights.

Looking at all the analytic options can be a daunting task. Luckily, these options can be grouped at a high level into three distinct types. No one type of analytics is better than another; in fact, they co-exist with and complement each other. For a business to have a holistic view of the market, and to compete efficiently within it, a robust analytic environment is required which includes:

Descriptive Analytics, which use data aggregation and data mining techniques to provide insight into the past and answer: “What has happened?” Descriptive analytics are useful because they allow us to learn from past behaviors, and understand how they might influence future outcomes.
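A minimal sketch of the data-aggregation step behind descriptive analytics, using an invented order history (the months and revenue figures are purely illustrative):

```python
from collections import defaultdict

# Hypothetical order history; descriptive analytics summarizes the past.
orders = [
    {"month": "Jan", "revenue": 1000},
    {"month": "Jan", "revenue": 1500},
    {"month": "Feb", "revenue": 1200},
    {"month": "Feb", "revenue": 800},
]

# Aggregating revenue per month answers "What has happened?"
totals = defaultdict(int)
for order in orders:
    totals[order["month"]] += order["revenue"]

print(dict(totals))  # {'Jan': 2500, 'Feb': 2000}
```

Real descriptive work runs the same aggregation over a warehouse with SQL or pandas, but the logic is identical.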

Predictive Analytics, which use statistical models and forecasting techniques to understand the future and answer: “What could happen?” Predictive analytics can be used throughout the organization, from forecasting customer behavior and purchasing patterns to identifying trends in sales activities. They also help forecast demand for inputs from the supply chain, operations and inventory. This document provides a good overview of forecasting methodology. A highly recommended book on Predictive Analytics is available here.
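One of the simplest forecasting techniques is a moving average; the sketch below forecasts next month's demand from invented history (the figures and window size are assumptions for illustration, not a recommendation):

```python
def moving_average_forecast(history, window=3):
    """Naive forecast: the next value is the mean of the last `window`
    observations. Real forecasting would model trend and seasonality too."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly demand; forecasting answers "What could happen?"
demand = [100, 120, 110, 130, 125]
print(moving_average_forecast(demand))  # (110 + 130 + 125) / 3 ≈ 121.67
```

Methods such as exponential smoothing or ARIMA refine this same idea by weighting recent observations more heavily.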

Prescriptive Analytics, which use optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?” Prescriptive analytics helps an organization evaluate different scenarios and seeks to determine the best course of action to achieve optimal outcomes, given known variables and estimates of unknown ones.

Prescriptive analytics is built on top of predictive analytics. In turn, predictive analytics is built upon descriptive analytics. Finally, descriptive analytics is built upon a foundation of data. If that data is incomplete, tainted, unstructured, or otherwise suspect in quality, we launch a domino effect: bad data quality begets poor descriptive analytics, which begets poorer predictive analytics, which begets the poorest prescriptive analytics.

To borrow Thomas Edison’s famous quote, “Genius is 1% inspiration and 99% perspiration”, we can accurately describe big data analytics as a whole lot of messing with data, interspersed with a bit of data science. In fact, we could honestly say: “Big data analytics is 1% data science and 99% data wrangling”.

Read also about the Data Science Pyramid, which notes that Insight and Strategy (for business users) are just as important as the tools, algorithms and products used. Similarly, the Data Science Hierarchy of Needs shows that its foundations, the Data Pipeline and Extract, Transform, Load (ETL), are equally important.