
Starting Data Science / Machine Learning from 0


My intention with this blog is to improve my skills and knowledge in Data Science and to showcase my projects to inspire and help others. I’m currently finishing my Master of Science in Computer Science and recently started working part time on a project that tries to improve the digitalisation of East Germany. I always wondered how to start this blog and put my intentions into practice. Last week I read an article about starting from scratch, which inspired me to write this series. Then I dug a little deeper and found more articles, like the one from mrdbourke. I want to start all over again with the fundamentals and work my way up. In the beginning I’ll probably be a little bored, as most of the material should be familiar, but a solid foundation is important for complex tasks. Moreover, it will help me in my part-time job: understanding how to learn things from scratch should improve my sense of how to teach basic knowledge.

Learning python and coding environment

This tutorial uses Python as its programming language. Usually you would start here and learn the fundamentals, but I will skip this step: I learned enough Python in the past, and there are more than enough tutorials on learning it. I also won’t show how to set up the environment, because there are plenty of tutorials on that as well. Personally I will use a Jupyter notebook, and I recommend you do too.

Resources for python

Resources for coding environment

Libraries numpy and pandas

These libraries are essential for working with data and getting it into the shape you need. numpy is basically a math library for working with vectors, matrices and multidimensional arrays. Similar to pandas, it is very time efficient and can handle big chunks of data; 1 million rows are no problem. You can easily generate data from different distributions or create a matrix filled with placeholder values.
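As a small illustration of the last two points, here is a minimal sketch. The seed, array shapes and variable names are my own choices for the example, not anything prescribed by numpy.

```python
import numpy as np

# Seeded random generator so the example is reproducible (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)

# One million samples from a normal distribution.
normal_samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# A small 3x4 matrix drawn from a uniform distribution.
uniform_samples = rng.uniform(low=0.0, high=1.0, size=(3, 4))

# Matrices filled with placeholder values.
zeros_matrix = np.zeros((2, 3))
filled_matrix = np.full((2, 3), fill_value=-1.0)

# Vectorised arithmetic: the whole array is doubled without a Python loop.
doubled = normal_samples * 2

print(normal_samples.shape, uniform_samples.shape)
print(zeros_matrix)
print(filled_matrix)
```

The vectorised multiplication at the end is the kind of operation that makes numpy feel fast compared to looping over a plain Python list.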

pandas is basically Excel within the Python space. It is the place to store your data, with lots of functionality on top. You have Series and DataFrames as the two options, but a DataFrame is really made up of several Series (its columns). pandas is built on top of numpy, which is why it is very fast.
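To make the relationship between DataFrame, Series and numpy concrete, here is a minimal sketch. The column names and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# A tiny DataFrame with made-up example data.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1.0, 2.0, 3.0],
})

# Selecting a single column gives back a Series.
value_column = df["value"]
print(type(value_column))

# The numbers in a Series can be handed to numpy as an array.
value_array = value_column.to_numpy()
print(type(value_array))

# Typical spreadsheet-like operation: an aggregate over one column.
mean_value = df["value"].mean()
print(mean_value)
```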

In the past I used both libraries in every project. Even so, I’m still far from being an expert. I know the fundamentals, but I need more knowledge to tackle more complex tasks.

To learn more about these two libraries I want to do the Kaggle pandas course and learn from the freeCodeCamp video about numpy. In the past I watched a Udemy course about Data Science, which was really good, but the section about these two libraries was very small. Maybe I will find another video on Udemy with more specific information about numpy.

Resources

Explore and visualise data with seaborn / matplotlib

Understanding the data is crucial for finding patterns. Humans have a hard time visualising an Excel sheet in their heads, which is why you need to know how to use these libraries. There are different charts, from bar to pie, and each one has its purpose. In the past I often used matplotlib or the integrated plot functions from pandas to visualise and understand my data, but I mostly just googled what I needed and didn’t take the time to fully understand everything.
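The two approaches mentioned above can be put side by side in a short sketch. The data is invented for the example, and the `Agg` backend line is only there so the code also runs as a plain script without a display; in a Jupyter notebook you can drop it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up example data.
df = pd.DataFrame({"category": ["a", "b", "c"], "count": [3, 7, 5]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Option 1: plain matplotlib.
ax_left.bar(df["category"], df["count"])
ax_left.set_title("matplotlib bar chart")

# Option 2: the plot function built into pandas (it draws with matplotlib underneath).
df.plot(kind="bar", x="category", y="count", ax=ax_right,
        legend=False, title="pandas plot")

fig.tight_layout()
# In a script, call fig.savefig("some_name.png") to write the figure to disk.
```

Both panels show the same bars; the pandas version is simply the shorter way to get there when the data already lives in a DataFrame.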

I have heard lots of good things about seaborn and want to compare it to other libraries (matplotlib, pandas). I will do the Kaggle data visualisation tutorial (free) for seaborn and watch sentdex’s video series. The part about pandas will probably be shorter, but since it lets you create a simple chart in seconds, it has its place in the visualisation world.

Resources

Introduction to machine learning

After learning the boring but important fundamentals I want to dip my toes into the world of machine learning. There are a lot of complex algorithms to find outliers, create abstract paintings or predict the stock market, but I want to start with small steps, so I don’t trip and fall. I will mostly use the scikit-learn library, as it includes loads of different algorithms to predict and understand patterns. The library has a simple, consistent syntax that is easy to pick up. It includes algorithms like k-means, decision trees, random forests, support vector machines and so on.
I’ll start again by doing the two free courses on Kaggle and then watch the series by Data School on YouTube.
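To give a feeling for that syntax, here is a minimal sketch that trains one of the algorithms named above, a decision tree, on the iris dataset that ships with scikit-learn. The dataset, the split size and the tree depth are my own choices for the example and not taken from any of the courses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small built-in example dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data to check the model on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create the model, fit it on the training data, then score it on the test data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same create, `fit`, then `predict` or `score` pattern, so swapping in a random forest or a support vector machine mostly means changing the import and the line that builds the model.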

Resources

More will come

Keep in mind that this is an introduction for beginners. There will probably be an update with a series about neural networks, heavy and complex algorithms, natural language processing, deep learning and more. As some might have noticed, this guide leaves out the math behind the algorithms. In the beginning I want to focus on practice and learning on the job rather than on theory. For now I will stick to the fundamentals, put one foot in front of the other and document my footsteps.
