Components of Data Science


What is Data

Data is everywhere, and we cannot get by without it. When you speak, you are giving out data. When you listen to others or watch something, you are receiving data. People talking on mobile phones, shopping online, making bank transactions and so on generate a lot of data. By one widely quoted estimate, Internet users generate about 2.5 quintillion bytes of data each day, which is a huge amount.

Types of Data

Data is the foundation of Data science. It is the material that is collected, cleaned and analyzed. In the context of Data science, there are two types of data: traditional data and big data.

Traditional data is data that is structured and stored in databases which analysts can manage from one computer; it is in table format, containing numeric or text values.

Big data, on the other hand, is bigger than traditional data, and not in the trivial sense. In most cases, big data is stored on more than one computer or on several hard disks. To work with such data, we may have to use big data technologies such as Hadoop or Teradata.

Data Science

Simply put, ‘Data Science’ is the science of data. Here, ‘science’ means a methodology for working with data to gain useful insights from it.

Any organization contains data, both structured and unstructured. After collecting the data, it should be cleaned and made ready for analysis.

A Data scientist is a person whose work starts with analysing the data. A Data scientist primarily uses predictive analytics and prescriptive analytics, along with machine learning, to derive useful information from the data.

In predictive analytics, the likelihood of a particular event is predicted. For example, when issuing credit cards, a bank should know what percentage of customers are likely to pay on time. Predictive analytics can estimate this.
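As a minimal sketch of the credit card example, the snippet below estimates the share of customers who paid on time from past records and treats it as the predicted rate for new customers in the same group. The records, the income bands and the function name `on_time_rate` are invented for illustration; a real bank would use a much richer model.

```python
from collections import defaultdict

# Hypothetical historical records: (income_band, paid_on_time)
history = [
    ("low", True), ("low", False), ("low", True), ("low", False),
    ("high", True), ("high", True), ("high", True), ("high", False),
]

def on_time_rate(records):
    """Return the fraction of on-time payers for each income band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [on_time, total]
    for band, paid in records:
        counts[band][0] += int(paid)
        counts[band][1] += 1
    return {band: on_time / total for band, (on_time, total) in counts.items()}

rates = on_time_rate(history)
for band, rate in rates.items():
    print(f"{band}: {rate:.0%} expected to pay on time")
```

The historical frequency is used here as the simplest possible predictor; it only illustrates the idea of predicting from past data.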

In prescriptive analytics, the likelihood of an event is predicted along with the possible actions and the results of those actions. For example, in a game like chess, when a player makes a move, the possible moves of the other player can be predicted.
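One way to picture prescriptive analytics is to score every possible action by its predicted result and recommend the best one. The sketch below does this with an expected-value calculation. The actions, probabilities and amounts are made-up assumptions, and `expected_value` and `recommend` are hypothetical helper names.

```python
# Hypothetical predictions for each action: the probability that things go
# well, the gain if they do, and the loss if they do not.
actions = {
    "approve_full_limit": {"p_success": 0.70, "gain": 100.0, "loss": -300.0},
    "approve_low_limit":  {"p_success": 0.90, "gain": 40.0,  "loss": -50.0},
    "reject":             {"p_success": 1.00, "gain": 0.0,   "loss": 0.0},
}

def expected_value(outcome):
    """Average result of an action, weighted by the predicted probability."""
    p = outcome["p_success"]
    return p * outcome["gain"] + (1 - p) * outcome["loss"]

def recommend(options):
    """Return the name of the action with the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))

best = recommend(actions)
print("Recommended action:", best)
```

The prediction step supplies the probabilities; the prescription step is the comparison of actions based on them.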

In machine learning, the data scientist provides algorithms that act on previous data, from which the machine (or robot) learns on its own to make predictions or prescriptions.
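To make "learning from previous data" concrete, here is a small sketch of one of the simplest learning algorithms, least-squares linear regression, written in plain Python. The spend and sales figures are invented, and `fit_line` and `predict` are names chosen for this example only.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical previous data: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

# "Learning": the parameters are derived from the data, not hard-coded.
slope, intercept = fit_line(spend, sales)

def predict(x):
    """Predict sales for a spend value not seen in the data."""
    return slope * x + intercept

print(predict(5.0))
```

The programmer supplies only the algorithm; the slope and intercept used for prediction come from the data.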

After taking the data, analyzing it and showing it in the form of graphs, the Data scientist will come up with useful conclusions for the organization.

Hence, we can say that Data science is a combination of various tools, algorithms and machine learning principles with the goal of discovering meaningful insights from data.

The two programming languages mainly used by Data scientists are R and Python.

Data Science life cycle

Figure: Detailed life cycle of a data science project

R Language

R is a programming language and environment commonly used in statistical computing, data analytics and scientific research. It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.

Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.

Companies like Twitter, Ford, Microsoft, the New York Times and Google already use R in various projects. Nowadays, many programmers prefer to use Python in addition to R and, in some cases, in place of R.

Statistical Computing

R is one of the most popular programming languages among statisticians. In fact, it was initially built by statisticians for statisticians. It has a rich package repository with more than 9100 packages, covering almost every statistical function we can imagine. R’s expressive syntax allows researchers, even those from non-computer-science backgrounds, to quickly import, clean and analyze data from various data sources. R also has charting capabilities, which means we can plot our data and create interesting visualizations from any dataset.

Machine Learning

Another field that is trending all around the world is Machine Learning. In simple words, machine learning gives a machine the ability to learn on its own. This is done through various algorithms that parse data; the machine then learns from the data and makes informed decisions on its own.

The internal systems of companies like Amazon, Google and Facebook are driven by machine learning algorithms. The languages R and Python have found a lot of use in predictive analytics and machine learning. They have good libraries for common machine learning tasks like linear and non-linear regression, decision trees, linear and non-linear classification and many more.

Artificial Intelligence (AI)

The goal of AI is to create systems that can function intelligently and independently, like human beings. AI is often regarded as the advanced part of Data Science.

Humans can speak and listen. This helps them to express their ideas through language. In AI, this corresponds to speech recognition, much of which is done using statistical methods.

Humans can read and write in a language. This is NLP (Natural Language Processing) in AI.

Humans see with their eyes and process what they see. This falls under Computer vision, which is grouped here under Symbolic learning, although modern computer vision systems also rely heavily on machine learning.

Humans understand their environment and move around fluently. This falls under Robotics.

Humans can recognize patterns such as groups of objects, individual objects and their shapes. This is the field of Pattern recognition. Machines can be even better at this, since they can analyze more data and more dimensions of data. This is called Machine learning.

The human brain is a network of neurons. These networks are replicated in machines to give them cognitive (recognizing and storing) abilities. When these networks are complex and have many layers, this becomes Deep learning.
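The idea of layering artificial neurons can be sketched with a tiny network that computes XOR ("one or the other, but not both"). A single neuron of this kind cannot compute XOR, but two layers can. The weights below are hand-picked for illustration rather than learned, and `step`, `neuron` and `xor_net` are names assumed for this example.

```python
def step(z):
    """Activation: the neuron fires (1) only if its input is positive."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def xor_net(x1, x2):
    # Hidden layer: one neuron behaves like OR, the other like AND.
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_and = neuron([x1, x2], [1, 1], -1.5)
    # Output layer: fires when OR is true but AND is not.
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

In a real deep learning system the weights are learned from data, and there are far more neurons and layers.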

Humans can remember past events. In AI, this is modeled using a Recurrent Neural Network (RNN).

The two main parts in AI are Symbolic learning and Machine learning. Machine learning is based on data. Hence we need to feed lots of data into the computers. Based on this data, they determine patterns and provide predictions.

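A well-known AI robot is Sophia, a humanoid robot activated in 2016 by the Hong Kong-based company Hanson Robotics. In 2017, Sophia became the first robot to be granted citizenship of a country (Saudi Arabia).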

Data analytics vs Data Science

Data analytics is closely related to Data Science, but it applies at an earlier stage. Data analytics focuses on analysing the data to answer questions posed by people in the organization. It draws conclusions based on past data and does not make any future predictions.

In Data science, by contrast, the data is analysed to learn about future trends and the prospects of the organization. Data scientists pose questions themselves and investigate the answers in order to provide sound guidance to the management.

Skills Required

Skills needed to become a Data Scientist

A Data scientist is expected to have in-depth knowledge of statistics, computer programming in Python, SAS and R, as well as machine learning and AI. Knowledge of Big data is not required, but it may be useful in some cases.

Skills needed to become a Data Analyst

A Data analyst should have knowledge of statistics, computer programming in Python, SAS and R, and data wrangling (the ability to map raw data and convert it into another format to make it useful).
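As a rough illustration of data wrangling, the sketch below turns messy raw text rows into clean, typed records that are ready for analysis. The input rows, the field names and the function `wrangle` are all assumptions made up for this example.

```python
# Hypothetical raw rows: name, date, amount (with stray spaces, mixed case,
# a thousands separator and one missing amount).
raw_rows = [
    " Alice , 2021-03-05 , 1,200.50 ",
    "bob,2021-03-06,300",
    "CAROL, 2021-03-07 ,",
]

def wrangle(rows):
    """Convert raw comma-separated rows into a list of clean dictionaries."""
    records = []
    for row in rows:
        # Split into at most three fields so "1,200.50" stays in one piece.
        name, date, amount = [field.strip() for field in row.split(",", 2)]
        if not amount:
            continue  # skip rows with a missing amount
        records.append({
            "name": name.title(),                      # normalise the case
            "date": date,
            "amount": float(amount.replace(",", "")),  # "1,200.50" -> 1200.5
        })
    return records

clean = wrangle(raw_rows)
for record in clean:
    print(record)
```

Dropping incomplete rows is only one possible policy; depending on the analysis, missing values might instead be filled in or flagged.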