Before my course on « big data and economics » at the university of Barcelona in July, I wanted to upload a series of posts on classification techniques, to get an insight on machine learning tools.

According to some common idea, machine learning algorithms are black boxes. I wanted to get back on that saying. First of all, isn’t it the case also for regression models, like generalized additive models (with splines) ? Do you really know what the algorithm is doing ? Even the logistic regression. In textbooks, we can easily find math formulas. But what is really done when I run it, in R ?

When I started working on academia, someone told me something like « **if you really want to understand a theory, teach it** ». And that has been my moto for more than 15 years. I wanted to add a second part to that statement: « **if you really want to understand an algorithm, recode it **». So let’s try this… My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. What I plan to mention, within the next two weeks, will be

- the logistic regression
- the logistic regression with splines
- the logistic regression with kernels (and knn)
- the penalized logistic regression, ridge
- the penalized logistic regression, lasso
- the heuristics of neural nets
- an introduction to SVM
- classification trees
- bagging and random forests
- gradient boosting (and adaboost)

I will use two datasets to illustrate. The first one is inspired by the cover of « Foundations of Machine Learning » by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. At least, with this dataset, it will be possible to plot predictions (since there are only two – continuous – features)

x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) z = c(1,1,1,1,1,0,0,1,0,0) df = data.frame(x1=x,x2=y,y=as.factor(z)) plot(x,y,pch=c(1,19)[1+z]) |

Here is some code to get a visualization of the prediction (here the probability to be a black point)

rmatrix_model = function(model){ u = seq(0,1,length=101) p = function(x,y) predict(model,newdata=data.frame(x1=x,x2=y),type="response") v = outer(u,u,p) return(v)} nice_graph=function(v){ u = seq(0,1,length=101) image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10[c(1,10)],breaks=c(0,5,10)/10) points(x,y,pch=19,cex=1.5,col="white") points(x,y,pch=c(1,19)[1+z],cex=1.5) contour(u,u,v,levels = .5,add=TRUE) } reg = glm(y~x1+x2,data=df,family=binomial) nice_graph(rmatrix_model(reg)) |

Note that colors are defined here as

clr10= c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b") |

or with some nonlinear model

The second one is a dataset I got from Gilbert Saporta, about heart attacks and decease (our binary variable).

myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";") myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1 y = myocarde$PRONO X = as.matrix(cbind(1,myocarde[,1:7])) |

So far, I do not plan to talk (too much) on the choice of tunning parameters (and cross-validation), on comparing models, etc. The goal here is simply to understand what’s going on when we call either glm, glmnet, gam, random forest, svm, xgboost, or any function to get a predict model.

So what was your feeling as you recoded a simple log reg into a glm()….have you got more insight in the way the information was revealed to you? … daniele

actually, the goal was more to understand penalized algorithm… but even on the simple logistic regression, coding it helps to understand when it can go wrong…

Well Arthur, as we epistemologically accept that a map (e.g r code) is not the territory (the reference map for the code) I get your insight as good as the translation preserves the map. : )…

Your interpretation and visualization are very clear and attractive . Thank you !

What a great idea and what a service to your readers!!

Concerning gradient boosting, what do you think about xgboost – would it fit here too? It is my (and of many others) favourite boosting algorithm as it has so many ways of regularization – a feature that is so much needed when boosting.

Thank you for all this!

Ultra cool !