  3. Regression

Transcript

- Hello! In this lesson we will cover Regression. As in the case of Classification, this is also a supervised learning method in which we are interested in mapping a set of inputs to outputs, but here we assume that the outputs are continuous real values. Let's dive in. In regression, we would like to write the numeric output, called the dependent variable y, as a function of the input, called the independent variable x. We assume that the numeric output is the sum of a deterministic function of the input and random noise. Additionally, this function f can be linear or nonlinear, continuous or discontinuous, and both x and y can be multivariate.

In order to find the model closest to a given set of points, we use least squares. The linear model is just a multivariate polynomial of degree one. The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by the model. In a three-dimensional setting, with two predictors and one response, the least-squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation, shown in red, and the plane.

When the model overfits the data, or when there are issues with collinearity, the linear regression parameter estimates may become inflated, that is, unreasonably large. As such, we may want to control the magnitude of these estimates. Controlling, or regularizing, the parameter estimates can be accomplished by adding a penalty term to the sum of squared errors when the estimates become large. Ridge regression adds a penalty on the sum of the squared regression parameters. While ridge regression shrinks the parameter estimates towards zero, it does not set any of them exactly to zero for any value of the penalty. A popular alternative to ridge regression is the least absolute shrinkage and selection operator model, frequently called the lasso. This model uses a similar penalty, based on the L1 norm. By adding the penalty, we are making a trade-off between model variance and bias: by sacrificing some bias, we can often reduce the variance enough to make the overall squared error lower than that of unbiased models.

Linear models are the simplest regression models, and there is a large body of theory, algorithms and code available in this setting. Geoscientists and engineers love to work with linear models to explain data relations and make predictions, since they are easy to calculate and communicate to others. When models are not linear, they often devote considerable time to transforming the data to make it linear. However, linearity assumptions may play nasty tricks when the underlying relations are nonlinear: variables may appear collinear or lack independence, inducing instabilities and hence the need for penalized models. It is interesting to recall Anscombe's quartet before going further. Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when plotted. This is important to keep in mind whenever we analyze datasets that show similar summary statistics.
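To make the ridge and lasso penalties described above concrete, here is a minimal scikit-learn sketch on synthetic data; the dataset, the number of predictors and the penalty strengths (alpha values) are illustrative assumptions, not values from the lecture.

```python
# Minimal sketch: ordinary least squares vs. ridge (L2 penalty) vs. lasso (L1 penalty)
# on synthetic data with a nearly collinear pair of predictors. All values are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)        # nearly collinear pair
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.5, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=1.0, size=n)    # deterministic part + random noise

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # penalizes the sum of squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)     # penalizes the sum of absolute coefficients

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk, but not exactly zero
print("Lasso coefficients:", np.round(lasso.coef_, 2))   # some coefficients driven to zero
```

Note how the ridge coefficients are shrunk toward zero but remain nonzero, while the lasso typically drives several of them exactly to zero, which is the shrinkage-versus-selection behavior described above.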
Decision trees are perhaps the simplest nonlinear regression models. The main features of this family of methods are: they are non-parametric, so they do not require much tweaking; they are applicable to both classification and regression problems, which makes them quite flexible; and they are highly interpretable, which is a desirable feature for engineers and geoscientists. The figure below shows an illustration of a recursive binary partitioning of the input space, along with the corresponding tree structure. The recursive subdivision can be described by the traversal of the binary tree shown in the right figure. For any new input x, we determine which region it falls into by starting at the top of the tree at the root node and following a path down to a specific leaf node according to the decision criteria established at each node. Note that such decision trees are not probabilistic graphical models. Within each region, there is a separate model to predict the target variable. For instance, in regression we might simply predict a constant over each region, whereas in classification we might assign each region to a specific class. We now present how decision trees can be used to fit gas production rates. Clearly, as we allow the tree to have more splits, and therefore more depth, the accuracy increases. Nevertheless, note that noise and outliers are also captured by the decision tree regression. When we talk about ensemble methods, we will see that combining decision trees can lead to very powerful regression approaches.

Support Vector Machines are a class of powerful, highly flexible modeling techniques. The theory behind Support Vector Machines was originally developed in the context of classification models. We can list the following features associated with this approach: the regression function is expressed as a sum of kernel functions; the key idea is that the approximated solution should not produce errors greater than a predefined threshold; points that go beyond the specified threshold are defined as support vectors; and the weights associated with the function should be kept small to ensure flatness. The Support Vector Machines formulation leads to the solution of a convex optimization problem, that is, a problem with a unique global solution. These points are illustrated in the bottom plot: the Support Vector Machines regression lives within the boundaries of the threshold value between the blue lines, and the red circles mark the support vectors.

So, let us illustrate how Support Vector Machines can be used in a couple of practical problems. At the right, the first example seeks to predict NPV values from the height of the formation, porosity, initial saturation, ISOR, and well length for 1000 well measurements. We train the SVM model with approximately 600 measurements and evaluate the predictions on the remaining 400 measurements. The plot compares predicted against actual NPV values. The results are around 85% accurate up to NPV values on the order of 1.5x10^7; we can see that the SVM tends to notably under-predict beyond this value. The second application, at the right, deals with the prediction of permeability from log measurements, namely Gamma Ray, Neutron porosity, Sonic porosity, Bulk density and Formation resistivity. The training was based on information from one well in order to predict responses on another well. The Support Vector Machines results, indicated in red, were compared against other regression methods. We can observe that Support Vector Machines produce results closer to the core measurements indicated by the black dots.
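As a hands-on companion to the decision tree and support vector regression ideas above, here is a minimal sketch on a noisy synthetic signal; the data, the tree depths and the kernel settings are illustrative assumptions rather than values from the lecture.

```python
# Minimal sketch: decision tree depth vs. noise, and epsilon-insensitive SVR.
# Synthetic 1-D data; all settings are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # smooth signal + noise
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

shallow_tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
deep_tree = DecisionTreeRegressor(max_depth=12).fit(X_train, y_train)   # starts chasing the noise
svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X_train, y_train)      # tolerance tube of width epsilon

for name, model in [("shallow tree", shallow_tree),
                    ("deep tree", deep_tree),
                    ("SVR", svr)]:
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.3f}")

# Only training points lying on or outside the epsilon tube become support vectors.
print("number of support vectors:", len(svr.support_))
```

The deeper tree typically drives its training error down while its test error stays similar or gets worse, because it also captures the noise, and only the points outside the SVR tolerance tube end up as support vectors, matching the description above.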
Neural networks may not require an introduction given their popularity over many years. Neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables, called hidden variables or hidden nodes here. Each hidden unit is a linear combination of some or all of the predictor variables. However, this linear combination is typically transformed by a nonlinear function, such as the logistic, or sigmoidal, function, to capture the intrinsic nonlinearity of the problem. Given the recent great advances in deep learning architectures such as recurrent and convolutional neural networks, neural networks represent the best regression choice for many complex problems.

In this example, the goal is to generate a data-driven simulation model that can aid in predicting future torque and acceleration values given measurements of RPM, weight on bit and PID control parameters in the drill string. From 13550 samples we have selected 9000 samples for training and 4550 samples for validation. At the left, we can see how the residual errors build up as we predict torque values further into the future. However, we can increase the predictability by retraining the network every 2000 samples. At the right, we can see how the previous residual errors have been mitigated by the retraining process. Artificial neural networks are frequently used in time series analysis to predict non-stationary and nonlinear behavior. In real time, it is important to be able to retrain as more data comes along.

A natural and appealing idea is to combine multiple models into one single model that can improve the overall predictability. Ensemble methods use multiple learning algorithms, hyper-parameters, inputs or training sets to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Ensembles combine multiple models to form, hopefully, a better model. The term ensemble is usually reserved for methods that generate multiple models using the same base learner. Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used with ensembles, for example Random Forest, which relies on this type of tree, although slower algorithms can benefit from ensemble techniques as well. Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble on a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification or regression accuracy. Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far, the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results. An example of boosting is Gradient Boosting Regression Trees, also named GBRT, which we will discuss in the next few slides.
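As a minimal sketch of the bagging and boosting ideas just described, the snippet below fits a random forest (bagging of decision trees) and a gradient boosting regressor (GBRT-style boosting) on synthetic data; the dataset and hyper-parameters are illustrative assumptions, not the drilling example from the lecture.

```python
# Minimal sketch: bagging (random forest) vs. boosting (gradient boosting trees).
# Synthetic regression data; all settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging: many deep trees trained independently on bootstrap samples, then averaged.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: shallow trees added sequentially, each correcting the residuals of the previous ones.
gbrt = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                 learning_rate=0.1, random_state=0).fit(X_train, y_train)

for name, model in [("random forest", forest), ("gradient boosting", gbrt)]:
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.2f}")
```

The random forest averages many independently grown deep trees, whereas the gradient boosting model adds shallow trees sequentially, which mirrors the bagging-versus-boosting distinction discussed next.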
The philosophies behind bagging and boosting are quite different. Bagging partitions the data so that the assessment of each model is totally independent. In contrast, boosting iterates to reduce the prediction error on the original dataset. Bagging provides room for using a complex learner, whereas boosting may be constrained to simple learners to maintain computational efficiency. Also, bagging focuses on decreasing variance, and the final learner is basically an average learner. Boosting tries to maximize the margin or, equivalently, to minimize the generalization error of the learner. In this sense, boosting methods tend to perform slightly better than bagging methods.

Now that we have covered the regression models, we need to be able to measure their performance. The RMSE provides an absolute measure of the lack of fit of the model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RMSE. The R2 statistic provides an alternative measure of fit. It takes the form of a proportion, the proportion of variance explained, and so it always takes on a value between zero and one, and is independent of the scale of Y. These two metrics are the most important ones for measuring the performance of any regression algorithm. After trying several models, it is important to develop criteria to pick the best one. This is called model selection. One has to be careful to select the best predictive model rather than the best fitting model. Given a set of candidate models for the data, the preferred model is the one with the minimum Akaike information criterion, or AIC, value. Hence AIC rewards goodness of fit, as assessed by the likelihood function, but it also includes a penalty that is an increasing function of the number of estimated parameters. The Bayesian information criterion, or BIC, is a criterion closely related to the AIC that also accounts for the number of samples used. When multiple models perform similarly, we need to follow Occam's Razor, that is: select the simplest model that describes the data sufficiently well.

Assuming that the data points are statistically independent and that the residuals have a theoretical mean of zero and constant variance, the expected squared error can be decomposed into three parts: irreducible noise, squared bias and model variance. The first part is usually called the irreducible noise and cannot be eliminated by modeling. The second term is the squared bias of the model; this reflects how close the functional form of the model can get to the true relationship between the predictors and the outcome. The last term is the model variance. It is generally true that more complex models can have very high variance, which leads to overfitting. On the other hand, simple models tend not to overfit, but they underfit if they are not flexible enough to model the true relationship, thus leading to high bias. Also, highly correlated predictors can lead to collinearity issues, and this can greatly increase the model variance. This is referred to as the bias-variance trade-off, as depicted in the left figure. Another way to see the issue is to associate bias with accuracy and variance with precision, as I show in the right figure.
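To make these metrics concrete, here is a minimal sketch that computes RMSE, R2, and Gaussian-likelihood forms of AIC and BIC for a few candidate models; the data, the polynomial models, and the specific AIC/BIC formulas used (AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), up to an additive constant) are illustrative assumptions rather than material from the lecture.

```python
# Minimal sketch: RMSE, R^2, AIC and BIC for competing polynomial models.
# Synthetic data and Gaussian-likelihood AIC/BIC forms; all settings are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=150)
y = 2.0 + 0.5 * x + 0.1 * x**2 + rng.normal(scale=1.0, size=150)

def fit_and_score(degree):
    """Fit a polynomial of the given degree and report RMSE, R^2, AIC and BIC."""
    X = PolynomialFeatures(degree, include_bias=False).fit_transform(x.reshape(-1, 1))
    model = LinearRegression().fit(X, y)
    resid = y - model.predict(X)
    n, k = len(y), X.shape[1] + 1            # +1 for the intercept
    rss = float(np.sum(resid**2))
    rmse = (rss / n) ** 0.5
    r2 = r2_score(y, model.predict(X))
    aic = n * np.log(rss / n) + 2 * k        # Gaussian-likelihood form, up to a constant
    bic = n * np.log(rss / n) + k * np.log(n)
    return rmse, r2, aic, bic

for degree in (1, 2, 6):
    rmse, r2, aic, bic = fit_and_score(degree)
    print(f"degree {degree}: RMSE={rmse:.3f}  R2={r2:.3f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

On data like this, the higher-degree fit usually nudges RMSE and R2 slightly in its favor, while AIC and BIC penalize the extra parameters, which is the model-selection behavior described above.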
Since we are required to select appropriate performance metrics to evaluate our models, we also need to ensure that we are using appropriate sampling. The basic goal is to obtain the best possible estimate of how a prediction model will perform when deployed on unseen cases. Let us discuss the three most popular sampling approaches. Hold-Out Sampling: this is the best known and simplest one. The samples are divided into training, validation and test sets; in some cases, only a training set and a test set are considered. The training set is used to learn the relation between inputs and outputs, the validation set is used to tune or calibrate the model, and the test set is used to measure how closely the model predicts unseen outputs. This approach is suitable when the number of samples is large. Bootstrapping: this is random sampling with replacement, and it is recommended when the number of samples is limited. In general, this technique allows estimation of the distribution of almost any statistic using random sampling methods. K-Fold Cross-Validation: the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the test data, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times, these repetitions being the folds, with each of the k subsamples used exactly once as the test data. The k results from the folds can then be averaged, or otherwise combined, to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and testing, and each observation is used for testing exactly once; a minimal code sketch of this procedure is included at the end of this transcript.

With this slide we conclude the lecture on Regression Models, and with it this initial series of video lectures on Data Science for Oil & Gas. There are tons of concepts that have not been included in this series, but I hope the material serves as a good starting point for learning more exciting concepts in Data Science and Machine Learning that could aid in improving your own daily work. Hope to see you next time in another series of video lectures. Thank you very much!
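As a final hands-on note, here is the minimal sketch of the k-fold cross-validation procedure referenced above, using scikit-learn; the model, the synthetic data, and the choice of k = 5 are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch: k-fold cross-validation of a regression model.
# Each observation is used for testing exactly once across the k folds.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=6, noise=15.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# Score each of the 5 folds with RMSE (scikit-learn reports it as a negative value).
scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("mean RMSE over folds:", round(rmse_per_fold.mean(), 2))
```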