12 - A brief introduction to Random Forests
Today an amazing post introducing the Random Forests method is waiting for you to read! Yep, I know, I promised this post a long time ago, but seriously, crew: I had no time for it!
Random Forest is a machine learning method mainly used for classification and regression. This kind of statistical method learns from a training set of data and then answers questions about new data based on what the algorithm learnt. Yep, crew, it sounds very similar to what Artificial Neural Networks do. In fact, Artificial Neural Networks are also a machine learning method used for classification and regression. However, the two methods are based on different concepts.
From a historical point of view, the first concepts behind the theory of "Random Forests" were introduced by Ho in 1995, but it was only in 2001 that Breiman defined them as we currently know them.
The algorithm itself is based on the theory of Decision Trees… Oh, you never heard about "Decision Trees" either? Well, maybe not in detail, but, trust me, you probably use the technique without even knowing it. A decision tree is basically a flow chart (see picture below) composed of multiple choices. The final result of a decision tree depends on all the choices made along the chart. In machine learning, each choice the algorithm makes is based on a particular statistic of a certain feature, for example:
[Figure: Example of a decision tree flow chart.]
In the flow chart shown above we make a choice (or the algorithm makes a choice for us) based on the number of pears that Bob bought.
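For the coders in the crew, here is a minimal sketch of how such a flow chart looks in Python. Careful: the function name `classify_purchase`, the thresholds and the labels are all hypothetical (I made them up for illustration, they are not taken from the picture). The only point is that every "if" is one choice in the chart, and the answer depends on the path you follow.

```python
def classify_purchase(pears_bought: int) -> str:
    """Walk a tiny hand-written decision tree: one question per node."""
    # Hypothetical thresholds and labels, chosen only for illustration.
    if pears_bought == 0:      # first choice: did Bob buy any pears at all?
        return "none"
    if pears_bought <= 3:      # second choice: only a handful?
        return "a few"
    return "many"              # leaf reached when both tests above fail


# Follow the chart for three different shopping trips.
for n in (0, 2, 10):
    print(n, "->", classify_purchase(n))
```

In a real machine learning setting nobody writes these thresholds by hand: the algorithm picks them from the training data.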
In fact, Decision Trees are a popular method commonly used for a number of machine learning tasks. However, trees that are grown very deep (e.g. with many features to take into account and/or many split points inside each feature) tend to learn highly irregular patterns, usually ending up overfitting the training dataset. Building on the same concepts, Random Forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance of the final result. Basically, random subsets are drawn from the initial set of data and a decision tree is built on each subset. The algorithm finally gives its answer as an average over the answers obtained from all the decision trees (or, for classification, as a majority vote).
This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model. In fact, as in the case of Artificial Neural Networks, a Random Forest works as a black box. However, it has the main advantages of usually being more accurate than a single decision tree, of being able to deal with non-numerical variables and to “measure” variable importance, and of being much less prone to overfitting the training data set (this is why I’m investigating the performance of this technique when applied to my research!!!).
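To make the "many trees, then average" idea concrete, here is a rough sketch in plain Python. It is not the full Random Forest algorithm: it only shows the bagging part (random resampling of the training set plus averaging), it uses one-split regression trees ("stumps") instead of deep trees, and it skips the random choice of features that a real Random Forest also makes. All the function names and the toy data are my own assumptions.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising the squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split contain at least one point.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: always predict the overall mean.
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Train each stump on a random resample (with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """The forest's answer is the average of the answers of all its trees."""
    return mean(predict_stump(stump, x) for stump in forest)


# Toy data (hypothetical): y jumps from 1 to 5 halfway along x.
xs = list(range(10))
ys = [1.0] * 5 + [5.0] * 5
forest = fit_forest(xs, ys, n_trees=50, seed=42)
print(predict_forest(forest, 0), predict_forest(forest, 9))
```

Each stump sees a slightly different version of the data, so each one puts its split in a slightly different place; averaging them smooths out those individual quirks, which is the variance reduction mentioned above.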
Do you want to know more about the methods that I mentioned? Here is a list of good websites that you can visit, which also give you some good examples of how to apply them using R:
https://en.wikipedia.org/wiki/Decision_tree (for a brief overview)
https://en.wikipedia.org/wiki/Random_forest (for a brief overview)
http://www.statmethods.net/advstats/cart.html (for some examples using R)
http://machinelearningmastery.com/how-to-get-started-with-machine-learning-algorithms-in-r/ (very good reference for all the machine learning methods)
or read: James G. et al. (2013) “An Introduction to Statistical Learning: With Applications in R”, Springer, DOI 10.1007/978-1-4614-7138-7 (as I did ;))
Of course you can find more, but that’s all from me today, crew! (Please have a look at the list of references below if you want to understand more about the theory behind Random Forests.)
Ah, crew! Today I started my secondment at TRL! It is an opportunity for me to increase my knowledge about road condition monitoring and life-cycle analysis of road pavements. TRL (the Transport Research Laboratory Ltd) is one of the leading research companies in the transport field. Obviously I'll let you know soon how it is going here in Wokingham!
So, have a great day, crew, and as usual stay tuned if you want to receive updates about my research project!
Breiman, Leo (2001). "Random Forests". Machine Learning. 45 (1): 5–32. doi:10.1023/A:1010933404324.
Ho, Tin Kam (1995). "Random Decision Forests". Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.
Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832–844. doi:10.1109/34.709601.
James, G. et al. (2013). "An Introduction to Statistical Learning: With Applications in R". Springer. doi:10.1007/978-1-4614-7138-7.