Data Science 

Data Science — including machine learning, data analysis, and data visualization

First of all, let’s review what machine learning is.

I think the best way to explain what machine learning is would be to give you a simple example.
Let’s say you want to develop a program that automatically detects what’s in a picture.
So, given this picture below (Picture 1), you want your program to recognize that it’s a dog.




Picture 1

Given this other one below (Picture 2), you want your program to recognize that it’s a table.




Picture 2

You might say, well, I can just write some code to do that. For example, maybe if there are a lot of light brown pixels in the picture, then we can say that it’s a dog.
Or maybe, you can figure out how to detect edges in a picture. Then, you might say, if there are many straight edges, then it’s a table.
However, this kind of approach gets tricky pretty quickly. What if there’s a white dog in the picture with no brown hair? What if the picture shows only the round parts of the table?
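To make the problem concrete, here is a minimal sketch of such a hand-written rule. Everything in it is an assumption made for illustration: the reference color, the tolerance, the threshold, and the tiny fake "pictures" (plain lists of RGB tuples). It is not how a real image program would work, only a way to see the rule fail.

```python
# Each pixel is an (R, G, B) tuple with values from 0 to 255.
LIGHT_BROWN = (181, 136, 99)  # assumed reference color, chosen for illustration


def is_light_brown(pixel, tolerance=40):
    """Return True if every channel is within `tolerance` of LIGHT_BROWN."""
    return all(abs(c - ref) <= tolerance for c, ref in zip(pixel, LIGHT_BROWN))


def looks_like_dog(pixels, threshold=0.3):
    """Hand-written rule: call it a dog if enough pixels are light brown."""
    brown = sum(is_light_brown(p) for p in pixels)
    return brown / len(pixels) >= threshold


# Two fake 100-pixel pictures: fur pixels followed by grass pixels.
brown_dog = [(180, 140, 100)] * 60 + [(30, 120, 40)] * 40
white_dog = [(245, 245, 240)] * 60 + [(30, 120, 40)] * 40

print(looks_like_dog(brown_dog))  # True
print(looks_like_dog(white_dog))  # False: the rule misses the white dog
```

You could keep patching the rule for white dogs, black dogs, and so on, but every patch tends to create new exceptions.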
This is where machine learning comes in.
A machine learning algorithm automatically detects patterns in the data you give it, instead of relying on rules you write by hand.
You can give, say, 1,000 pictures of a dog and 1,000 pictures of a table to a machine learning algorithm. Then, it will learn the difference between a dog and a table. When you give it a new picture of either a dog or a table, it will be able to recognize which one it is.
I think this is somewhat similar to how a baby learns new things. How does a baby learn that one thing looks like a dog and another a table? Probably from a bunch of examples.
You probably don’t explicitly tell a baby, “If something is furry and has light brown hair, then it’s probably a dog.”
You would probably just say, “That’s a dog. This is also a dog. And this one is a table. That one is also a table.”
Machine learning algorithms work much the same way.
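The "learning from examples" idea can be sketched in a few lines of plain Python. This is a toy nearest-centroid classifier, not one of the algorithms listed in this article, and the two features per picture (share of fur-like texture, share of straight edges) are hypothetical numbers I made up. Real systems learn from raw pixels and far more examples.

```python
def centroid(rows):
    """Average each feature column across a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(examples):
    """Learn one average feature vector per label from (features, label) pairs."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Return the label whose learned average is closest to `features`."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical features: (share of fur-like texture, share of straight edges)
examples = [
    ((0.9, 0.1), "dog"), ((0.8, 0.2), "dog"), ((0.7, 0.1), "dog"),
    ((0.1, 0.9), "table"), ((0.2, 0.8), "table"), ((0.1, 0.6), "table"),
]
model = train(examples)

print(predict(model, (0.75, 0.15)))  # dog
print(predict(model, (0.15, 0.70)))  # table
```

Notice that nobody wrote a rule like "furry means dog." The program only saw labeled examples and worked out the rest.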
You can apply the same idea to:
  • recommendation systems (think YouTube, Amazon, and Netflix)
  • face recognition
  • voice recognition
among other applications.
Popular machine learning algorithms you might have heard about include:
  • Neural networks
  • Deep learning (neural networks with many layers)
  • Support vector machines
  • Random forest
You can use any of the above algorithms to solve the picture-labeling problem I explained earlier.

Python for machine learning

There are popular machine learning libraries and frameworks for Python.
Two of the most popular ones are scikit-learn and TensorFlow.
  • scikit-learn comes with some of the more popular machine learning algorithms built-in. I mentioned some of them above.
  • TensorFlow is more of a low-level library that allows you to build custom machine learning algorithms. It is mostly used together with Keras, a higher-level library that makes it easier to build neural networks.
If you’re just getting started with a machine learning project, I would recommend starting with scikit-learn. If you run into efficiency issues, you can then look into TensorFlow.
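Here is a rough sketch of what a first scikit-learn project might look like. I don't have a dog-and-table dataset to share, so this uses the small handwritten-digits dataset bundled with scikit-learn as a stand-in. The random forest settings and the 75/25 split are my own choices for illustration, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small grayscale images of handwritten digits, bundled with scikit-learn.
# Each image is flattened into a row of 64 pixel values.
digits = load_digits()

# Hold back a quarter of the pictures so we can test on images the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Learn from the labeled training pictures.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Fraction of held-out pictures labeled correctly.
accuracy = model.score(X_test, y_test)
print(f"Accuracy on unseen images: {accuracy:.2f}")
```

Swapping in a different algorithm, such as a support vector machine, usually means changing only the line that creates the model.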

How should I learn machine learning?

To learn machine learning fundamentals, I would recommend either Stanford’s or Caltech’s machine learning course.
  • Please note that you need basic knowledge of calculus and linear algebra to understand some of the materials in those courses.
Then, you can practice what you’ve learned from one of those courses with Kaggle. It’s a website where people compete to build the best machine learning algorithm for a given problem. They have nice tutorials for beginners, too.

What about data analysis and data visualization?

To help you understand what these might look like, let me give you a simple example here.
Let’s say you’re working for a company that sells some products online.
Then, as a data analyst, you might draw a bar graph like this.




Bar Chart 1 — generated with Python

From this graph, we can tell that on this particular Sunday, men bought just over 400 units of this product and women bought about 350.
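A chart like the one described above takes only a few lines of Matplotlib. This is a minimal sketch with made-up numbers that roughly match the description; the file name and labels are also my own placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt

# Made-up Sunday totals, chosen to roughly match the chart described above
units_sold = {"Men": 410, "Women": 350}

fig, ax = plt.subplots()
bars = ax.bar(list(units_sold.keys()), list(units_sold.values()))
ax.set_ylabel("Units sold")
ax.set_title("Units sold on Sunday")
fig.savefig("bar_chart.png")
```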
As a data analyst, you might come up with a few possible explanations for this difference.
One obvious possible explanation is that this product is more popular with men than with women. Another possible explanation might be that the sample size is too small and this difference was caused just by chance. And yet another possible explanation might be that men tend to buy this product more only on Sunday for some reason.
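For the "just by chance" explanation, you can do a quick back-of-the-envelope check. The sketch below assumes the rounded numbers 400 and 350, and it also assumes that each unit is equally likely to be bought by a man or a woman and that purchases are independent. Both are simplifications (one person can buy several units), so treat the result as a rough guide only.

```python
import math

# Rounded from the chart; the exact values are an assumption.
men_units, women_units = 400, 350
total = men_units + women_units

# If gender made no difference, we would expect half of the units on each side.
expected = total * 0.5
std_dev = math.sqrt(total * 0.5 * 0.5)  # standard deviation of a binomial count

# How many standard deviations is the men's count above the expectation?
z = (men_units - expected) / std_dev

# Two-sided p-value using the normal approximation to the binomial.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these rounded numbers, the p-value comes out somewhat above the conventional 0.05 cutoff, so one Sunday of data is not enough to rule out chance.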
To understand which of these explanations is correct, you might draw another graph like this one.




Line Chart 1 — generated with Python

Instead of showing the data for Sunday only, we’re looking at the data for a full week. This graph shows that the difference is pretty consistent across days.
From this little analysis, you might conclude that the most convincing explanation for this difference is that this product is simply more popular with men than with women.
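Here is a minimal sketch of how a weekly line chart along these lines could be drawn with Matplotlib. The daily numbers are made up for illustration, and the list of differences at the end is just one simple way to check whether the gap holds up from day to day.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Made-up daily unit counts
men = [390, 405, 398, 412, 401, 420, 410]
women = [340, 352, 345, 360, 348, 365, 350]

fig, ax = plt.subplots()
ax.plot(days, men, marker="o", label="Men")
ax.plot(days, women, marker="o", label="Women")
ax.set_ylabel("Units sold")
ax.legend()
fig.savefig("line_chart.png")

# Difference between men and women for each day of the week
gaps = [m - w for m, w in zip(men, women)]
print(gaps)
```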
On the other hand, what if you see a graph like this one instead?




Line Chart 2 — also generated with Python

Then, what explains the difference on Sunday?
You might say, perhaps men tend to buy more of this product only on Sunday for some reason. Or, perhaps it was just a coincidence that men bought more of it on Sunday.
So, this is a simplified example of what data analysis might look like in the real world.
In practice, you would typically use SQL to pull the data from a database. Then, you can use either Python with Matplotlib or JavaScript with D3.js to visualize and analyze it.
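To give you a feel for the SQL step, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and rows are all hypothetical; a real company database would look different and would usually be a separate database server.

```python
import sqlite3

# An in-memory database standing in for the company's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, gender TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("Sun", "M", 250), ("Sun", "M", 160), ("Sun", "F", 200), ("Sun", "F", 150)],
)

# Pull Sunday's total units per gender.
rows = conn.execute(
    "SELECT gender, SUM(units) FROM orders "
    "WHERE day = ? GROUP BY gender ORDER BY gender",
    ("Sun",),
).fetchall()
conn.close()

totals = dict(rows)
print(totals)  # {'F': 350, 'M': 410}
```

The resulting totals are exactly the kind of numbers you would then hand to a plotting library.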

Data analysis / visualization with Python

One of the most popular libraries for data visualization is Matplotlib.
It’s a good library to get started with because some other libraries such as seaborn are based on it. So, learning Matplotlib will help you learn these other libraries later on.
