The absolute fastest way to learn R for Data Science using the 80/20 Principle.
This post takes an 80/20 approach to learning Data Science. The Pareto principle states that roughly 80% of your results come from 20% of your effort.
After finishing this post you will be able to do roughly 80% of day-to-day data science work, and it will take you a mere fraction of the time. There will be things in the remaining 20% that you don’t know, but if you get stuck you can turn to Google and StackExchange. We will cover the main tools Data Scientists actually use. This post will be updated as time goes on. But for now, the most important things for learning R as fast as possible are:
1) Use the tools pros actually use (dplyr, ggplot2, and the rest of the tidyverse).
2) Create muscle memory for the commands you use. Never ever ever copy and paste commands you’re trying to learn.
3) Use scientifically proven memorization techniques.
Data Science is a huge field. The best things to learn are the tools that you’ll actually use on a daily basis. Many Data Science 101 courses teach you the most basic things, which you will probably never use in practice. dataset$column is the base R way of selecting a column, but since dplyr is more intuitive, it’s more commonly used among working data scientists. Learning things you’ll rarely use is mostly a waste of time, even if they are ‘building blocks’. I suggest jumping straight into the tools the pros actually use, which are RStudio and the tidyverse. Both are covered in this post. The philosophy of this post is to cover the most used tools in DS first, then move on to the lesser ones. As of now, the tools covered are dplyr and ggplot2, using RStudio as an IDE.
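To make the dataset$column comparison concrete, here is a small sketch of both styles side by side. It uses the built-in mtcars data, which is my choice for illustration, and it assumes dplyr is already installed (installation is covered further down).

```r
library(dplyr)

# Base R: `$` pulls a single column out as a plain vector
hp_vector <- mtcars$hp

# dplyr: select() takes column names directly and returns a data frame,
# so you can grab several columns in one go
hp_and_mpg <- mtcars %>%
  select(hp, mpg)

head(hp_vector)
head(hp_and_mpg)
```

Both get you the data. The dplyr version just reads closer to plain English and chains nicely with other steps.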
To download RStudio, click here.
As you progress you can move on to more advanced IDEs like Atom and extremely advanced software like vim with Nvim-R, which may take months to learn. Or stay with RStudio. Many professionals use RStudio, and it’s great software.
Once you have RStudio installed, install the tidyverse.
In the R console, type install.packages("tidyverse") to install it, then library(tidyverse) to load it. It’s also a good idea to find your .Rprofile file in your home directory (how to find: Mac, Linux, Windows) and add the same library command, which makes the tidyverse load automatically whenever you open R. Open .Rprofile in a text editor and add the line library(tidyverse).
You can add any commands you want to this file, as long as they’re actually commands you can use in R, one on each line. When R opens it will go through the file and run each command you’ve entered. Think of it as a file that says “Every time I open R, run this set of commands.” Since you’ll use the tidyverse almost all the time, I suggest adding that line.
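As a rough sketch, a startup file along these lines would do the job. The interactive() and requireNamespace() guard and the options() line are my additions for illustration, not something the file needs; a bare library(tidyverse) works too.

```r
# ~/.Rprofile  (R runs this file top to bottom every time it starts)

# Load the tidyverse, but only in interactive sessions and only if it is installed
if (interactive() && requireNamespace("tidyverse", quietly = TRUE)) {
  suppressPackageStartupMessages(library(tidyverse))
}

# Hypothetical second setting, just to show that any valid R command can go here
options(digits = 4)
```

Restart R after saving the file and the tidyverse functions should be available without typing library(tidyverse) first.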
But what is the tidyverse?
The tidyverse is a collection of packages created by R superstar Hadley Wickham and his collaborators, including ggplot2 (plotting), dplyr (data manipulation), readr (reading various file types), tibble (improved data frames), and more.
RStudio itself was not created by Wickham, but he is chief scientist at the company that develops it. Most of these packages are used daily by people who work in R. They are hands down the most important tools in R. The best way to learn them? Right from the horse’s mouth: the creator, Hadley Wickham.
It’s very important with these tutorials to import your data, follow along in RStudio, and TYPE each command. The scripts and data are included, but you need to type out each command manually in order to create muscle memory. This is crucially important. Which brings me to my next point. When you are learning R, do not copy and paste commands. Ever.
Important commands need to be manually typed ad nauseam. When you learn something, put it into practice immediately. Our cognitive memory is terrible, but our experiential memory is great.
Things need to be ingrained in your mind. You create different neural pathways when you do something. You don’t even necessarily have to intellectually know something to have the neural pathways. Have you ever needed to give someone a phone number, but without the number pad in front of you, you couldn’t remember it? When there’s a phone in front of you, you can dial the number just fine, but when someone asks you, you’re like ‘uhhhh, let me go look at a number pad.’ The same goes for directions sometimes. You can’t give someone directions, but you ‘know the way if I just drove there.’ This is why doing is so important.
You have a muscle memory where you know exactly what you’re doing... but you can’t intellectualize it. And the opposite can happen; you can intellectualize something, but have no idea how to actually do it. Don’t make that mistake!
That’s not to pooh-pooh note-taking. It’s just that doing is very important, since it commits things to muscle memory. Note-taking is very important too. Both methods have their place.
A few years back, when I was learning programming, at first I just took notes, did the quizzes, etc. But then when I had to do it on my own, I got lost a lot. So I took a new approach. Rather than just taking notes and memorizing, whenever I learned a new technique, I would put it into practice immediately. I would do it over and over until I could practically do it with my eyes closed. Seriously, you have to do things over and over until you’re sick with boredom. Once you can do the command correctly every time, and you become mind-numbingly bored, you can stop. Boredom is nature’s signal that you’re not benefiting anymore.
How to succeed at data science (or anything): Read, Do, Read, Do, Read, Do, Read, Do
How to stay stuck in data science (or anything): Read, Read, Read, Read, Do, Do, Do, Do.
How to REALLY stay stuck in data science (or anything): Read, Copy and Paste code, Read, Copy and Paste Code
My technique? I touched on this earlier: when I’m learning a new command, I go right into the console and run that command over and over and over again.
Do it until you can practically do it with your eyes closed.
Why? That way, when you’re typing, the words “data”, “dataset”, “rowname” and “column name” get programmed into your brain.
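Which do you think is a better way to remember a function: as the specific example, filter(mtcars, hp > 180), or as the general pattern, filter(dataset, columnnamehere > 180)?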
By doing it the second way you ingrain into your brain exactly what goes where. You will have an ingrained mental file that says “filter(dataset, columnnamehere > 180)” rather than “filter(mtcars, hp > 180)”, the former being more useful. It’s also very important to do this in the R console and not in a notepad.
You need to run the command immediately after typing it to see if there are any errors. You absolutely need feedback. If you type it into a script and don’t run it, you can easily ingrain an incorrect syntax into your brain over and over. Do it in the console. Or at the very least hit cmd+Enter (ctrl+Enter on Windows) to send the command from the script to R.
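One way to drill the filter(dataset, columnnamehere > 180) pattern is to run it against several datasets in a row, so the shape of the command sticks rather than one particular example. A minimal sketch, assuming dplyr is loaded; mtcars, iris and airquality all ship with R, and the columns and cutoffs are just ones I picked for illustration.

```r
library(dplyr)

# Same pattern every time: filter(dataset, columnnamehere > value)
fast_cars   <- filter(mtcars, hp > 180)
long_sepals <- filter(iris, Sepal.Length > 7)
hot_days    <- filter(airquality, Temp > 90)

# Immediate feedback: check how many rows survived each filter
nrow(fast_cars)
nrow(long_sepals)
nrow(hot_days)
```

Type each line yourself and run it right away. If you get an error, that is exactly the feedback you want before the wrong syntax sets in.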
Now that we have a good way to learn data manipulation with dplyr, the next most common skill used in the field is Data Visualization. For that we’ll use ggplot2.
Now there are 2 approaches you can take. You can take a handful of classes on ggplot and learn everything, or you can spend far less time and just grab the essentials. We’re, of course, going to take the latter approach here.
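There’s a package called ggplotgui that gives ggplot2 a point-and-click interface. Type install.packages("ggplotgui") into your R console,
wait for it to install, then enter library(ggplotgui) followed by ggplot_shiny().
You can also put your data between the parentheses, for example ggplot_shiny(mtcars), if you want to load a specific dataset.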
This is a quick and easy way to understand ggplot. Think of it as ggplot training wheels. Play with your data and get a feel for the ggplot format.
Then, after you find a plot you like, select “R-Code” up top and manually type it into R.
Again, manually type. You need the commands completely committed to memory. The format of ggplot is as follows:
ggplot(yourdataset, aes(x = column1, y = column2)) +
  geom_point() +   # type of plot
  theme_bw()       # ggplot theme to use
You’ll get used to this as you play with it. Note: aes stands for aesthetics. A geom is the type of plot to use (bar, line, box plot, etc.).
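Here is that format filled in with real names, as a sketch you can type out and run. The dataset (mtcars) and the columns (wt, mpg, cyl) are my picks for illustration; swap in your own.

```r
library(ggplot2)

# yourdataset = mtcars, column1 = wt (weight), column2 = mpg (miles per gallon)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +   # type of plot: scatter plot
  theme_bw()       # ggplot theme to use

print(p)

# Swapping the geom changes the plot type, here a box plot of mpg per cylinder count
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  theme_bw()

print(p_box)
```

Notice that the skeleton never changes: dataset, aes(), a geom, a theme. Only the names you plug in do.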
You can also find all of the ggplot chart styles in this list from the main tidyverse site. Pick out the ones you’ll use most, and practice them.
Use flashcards. Flashcards (practice testing) are scientifically proven to be one of the most effective memorization tools available.
On one side write the task, on the other write the function/syntax.
Side 1: “Selecting only specific columns by name”
Side 2: dataset %>%
Side 1: “Dropping certain columns, but keeping the rest”
Side 2: dataset %>%
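As a sketch of how the back of those two cards might read, assuming dplyr’s select() is the function being drilled. The dataset (mtcars) and the column names are stand-ins I chose, and the results are stored in variables only so you can inspect them afterwards.

```r
library(dplyr)

# Card 1: selecting only specific columns by name
kept <- mtcars %>%
  select(mpg, hp)

# Card 2: dropping certain columns, but keeping the rest
dropped <- mtcars %>%
  select(-mpg, -hp)

names(kept)
names(dropped)
```

On the actual card, write the generic version (dataset and column names) rather than mtcars, for the same reason as with filter earlier.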
Or another self-testing idea I like to use is:
Take a piece of paper and divide it in half. On one side of the paper is a question, on the other side is the answer. That way you can cover up the “answer” side and your notes double as flashcards, which again, are proven to be one of the single most effective ways of learning.
Combine one of these with Spaced repetition (Distributed Practice) and you have memorization magic.
That’s all for now. Oh yeah, and don’t cheat.
Get your Tidyverse Brownbelt:
In summary, this is the way to learn about 80% of everything you need to know about Data Science:
1: Follow this article and the dplyr tutorial from Hadley Wickham.
2: Check out “R for Data Science” by Hadley Wickham.
3: Check out RStudio’s list of cheatsheets by Hadley Wickham (noticing a trend here?). Most importantly these cheatsheets: Data Visualization, Data Wrangling, Data Import, and Data Transformation. Print them if you need to. If you can master all the techniques on those cheatsheets, you’re likely ready to take your first Data Science job, where you can learn further.
More Tutorials from Hadley Wickham.