
  3. Data Transformation


Transcript

- Even though the data can be cleaned via imputation or outlier removal, it may still not be ready to facilitate further analysis. In that case, we may need to transform the data. Data is usually transformed to facilitate interpretability, statistical analysis, or visualization. Generally, the transformation has to be continuous and invertible to be of practical use on any arbitrary set of values. Invertibility ensures that conclusions made on the transformed data also hold in the original data.

Three popular transformations are usually carried out to normalize the data. The z-score transformation maps each value with reference to the mean of the data set; the resulting sequence of normalized values has mean zero and a standard deviation equal to one. The min-max transformation maps each normalized value to lie between the min and max values, either contained in the original set or provided by the user. The Box-Cox transformation allows mapping an arbitrary distribution of values to one that resembles a symmetric Gaussian shape; it seeks to remove skewness and other distributional features that may complicate analysis, and requires finding a suitable exponent lambda, usually bounded between minus five and five.

It is important to remark that data transformations have wide applicability in oil and gas, given the recurrent need to generate interpretations and results in dimensionless form. For example, type curves have been valuable for representing dimensionless or normalized flow rate solutions on properly scaled plots to support decline curve analysis. Another example has to do with the treatment of temporal series, such as those arising in drilling. The analysis of drilling series requires normalization to facilitate the detection of outliers or dysfunctions whenever the responses deviate from a predefined stationary behavior.

Transformations are not only used for changing scales, but also to reduce variance in the data, typically due to random noise. The process of reducing or eliminating frequent and large fluctuations in data is also known as denoising or data smoothing. In data analytics, a moving average (also called a rolling or running average) is a calculation that analyzes data points by creating a series of averages from different subsets of the full data set. Moving averages are commonly used with time series data to smooth out short-term fluctuations and highlight long-term trends or cycles. The threshold between short-term and long-term depends on the application, and the parameters of the moving average can be adjusted accordingly.

Going back to our familiar DCA case, we can see that averaging neighboring points in the sequence generates a smoothing effect, indicated in blue here. Moreover, we can see that we have implicitly removed the effect of outliers. The main advantages of moving averages are, one, they smooth out irregular fluctuations in time series data, and, two, they are easy to understand and do not require any complex mathematical calculation.

A more sophisticated approach is to use spectral methods, such as Fourier or wavelet transforms. The idea is to map the data into a frequency-amplitude domain and chop off either high frequencies or high amplitudes. After this procedure is done, the data is mapped back to the original domain. We can see that the effect is similar to performing a moving average.
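To make these ideas concrete, here is a minimal sketch, assuming NumPy, pandas, and SciPy are available, that applies the three normalizations, a moving average, and a simple Fourier-based smoothing to a synthetic, noisy decline-curve-like rate series. The data, window size, and frequency cutoff are illustrative assumptions, not values from the lesson.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, noisy decline-curve-like rate series (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(1, 201)
rate = 1000.0 * np.exp(-0.01 * t) + rng.normal(0.0, 25.0, t.size)
rate = np.clip(rate, 1e-3, None)          # Box-Cox requires strictly positive values

# z-score: zero mean, unit standard deviation
z = (rate - rate.mean()) / rate.std()

# min-max: rescale to [0, 1] using the bounds of the data itself
mm = (rate - rate.min()) / (rate.max() - rate.min())

# Box-Cox: the exponent lambda is estimated automatically
bc, lam = stats.boxcox(rate)

# Moving average: centered rolling mean to smooth short-term fluctuations
smooth = pd.Series(rate).rolling(window=11, center=True).mean()

# Spectral smoothing sketch: drop high-frequency Fourier coefficients,
# then map the data back to the original (time) domain
coeffs = np.fft.rfft(rate)
coeffs[20:] = 0.0                         # arbitrary cutoff for illustration
spectral = np.fft.irfft(coeffs, n=rate.size)

print(f"z-score: mean={z.mean():.2f}, std={z.std():.2f}")
print(f"min-max: range=[{mm.min():.2f}, {mm.max():.2f}]")
print(f"Box-Cox: lambda={lam:.3f}")
```

Plotting `rate`, `smooth`, and `spectral` against `t` reproduces the qualitative effect discussed above: both the rolling mean and the truncated Fourier reconstruction damp short-term fluctuations, with the spectral version showing the mild sinusoidal character mentioned next.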
Nevertheless, smoothing with spectral methods may introduce a kind of sinusoidal behavior in the resulting denoised data.

Before jumping into other ways to transform data, let us look at an issue that is latently present in any of the data problems we face in oil and gas. Imagine that all the data we encounter can be arranged in a table where the columns indicate the variables, attributes, features, or dimensions associated with our problem, and the rows denote the points or samples that we have. Intuitively, if we keep the number of samples fixed as we grow the number of variables or dimensions, that is, as the table becomes increasingly wider and flatter, the points end up embedded in a much bigger volume of empty space. Moreover, as we increase the number of variables or dimensions, the number of models that could fit the data grows rapidly. Think, for example, about how many lines can intersect a point in 2D, or how many planes can intersect a point in 3D. Conversely, if our table instead becomes taller, the possible models tend to converge to the same level of accuracy. Consequently, as the number of variables increases, the volume of data required to preserve the same accuracy increases exponentially. The picture summarizes this fact as we go from 1D to 2D and 3D representations for the same number of points. This is what we call the curse of dimensionality: it refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. Note that this may not be a strong issue in the 2D or 3D physical space we manage from our everyday experience; in fact, this is why we tend to overlook this intrinsic effect in data.

What do we do when we deal with a large number of variables or dimensions, then? The key is to reduce the dimensionality of the problem without affecting the essence of the results. Dimensionality reduction can be seen as another way to transform data. There are two ways to do this: by variable selection or by variable extraction. In engineering and geoscience practice, we are surely more familiar with selecting the variables we need. We usually do that by experience, or by performing a statistical and visual inspection when the number of variables or dimensions is not overwhelming, say less than a dozen. However, in data science, variable selection is a broad and rigorous subject that can provide additional insight into how we can proceed more effectively. I will give you just a byte of this. Variable selection relies on information gain or value of information metrics that depend heavily on the application of numerical or analytical models. This is not a one-shot business, as it demands an iterative process; a sketch of such a loop follows this passage. We resort to variable selection when the number of variables is much larger than the number of samples. The picture below shows that the selection of the right subset of variables is indeed an iterative process governed by how well the model performs.

In contrast, variable extraction takes all the variables and maps them into a smaller set. It is a true transformation of the whole set. There are several methods to do this, ranging from linear ones, such as Principal Component Analysis, better known as PCA, to highly sophisticated nonlinear methods, such as Kernel PCA, Isomap, et cetera. The linear methods are relatively faster, more intuitive, and more practical than the nonlinear ones.
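Before continuing with extraction, here is a rough sketch of the iterative variable selection loop just described. The lesson does not prescribe a particular algorithm, so this example uses a simple greedy forward selection with a linear model and cross-validation; the synthetic table, the choice of model, and the scoring metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic table: 40 samples, 15 candidate variables, only 3 truly informative
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 15))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.8 * X[:, 7] + rng.normal(0.0, 0.1, 40)

# Greedy forward selection: iteratively add the variable that most improves
# cross-validated model performance, and stop when no candidate helps
selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                 cv=5, scoring="r2").mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:
        break                              # model no longer improves; stop iterating
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("Selected variable indices:", selected, "| CV R^2:", round(best_score, 3))
```

The loop embodies the picture described above: candidate subsets are proposed, a model is fitted and scored, and the subset is revised until performance stops improving.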
Returning to variable extraction: the point is that the transformation is invertible in the linear case, whereas this may not be so in the nonlinear one. On the other hand, nonlinear variable extraction methods can be very effective for interpreting and discovering non-obvious relations in 2D and 3D.

As an example, consider 100 different geological realizations of a channelized reservoir. Suppose that each map consists of 100 by 100 grid cells, where each cell may have a high (indicated in red) or low (indicated in blue) permeability value. This results in a table of 100 rows with 100 times 100 columns, for a total of one million values. How do we visualize this in a simple 2D plot? If we apply PCA or Kernel PCA, we can basically collapse each map into a point. That is, the original huge table is mapped into a table of just 100 rows and two columns. In this 2D representation, we can group the points according to their separation, that is, simply by performing clustering. The 2D grouping gives us a way to group the original 100 maps according to their mutual resemblance. Note that Kernel PCA seems to provide a better visual representation of the structure than PCA. Nevertheless, any operation in the 2D representation yielded by Kernel PCA may not be transformable or interpretable back in the original maps. With PCA, this is not the case.

So, how does PCA work? Rather than presenting the elegant mathematics behind the method, I will give you a simple geometrical interpretation that makes it easier to understand what is going on when a PCA function is applied to the data. The basic idea is that PCA captures the largest data variations by aligning the axes with the main directions of variation. The method is quite general, since it can be applied to data tables, linear algebra matrices, or just images. Suppose that you have a cloud of points in 3D and you can rotate the axes in any direction, as shown in the video. There is a rotation that gives you the best angle, that is, the one that shows the most points aligned along a particular direction. That representation will have the three axes modified with respect to the original coordinates. You can pick the two most important axes, namely the principal components, and project the cloud of points onto 2D. In higher dimensions, the PCA method will automatically provide this for you via orthogonal transformations. Of course, we are assuming that there is a linear relationship among possibly correlated variables, which are mapped into a set of linearly uncorrelated values called principal components.
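A minimal sketch of the reservoir-realization workflow, assuming scikit-learn is available and using random binary maps as a stand-in for the actual channelized realizations shown in the lesson:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.cluster import KMeans

# Stand-in for 100 channelized-reservoir realizations: each 100 x 100 binary
# permeability map (high = 1, low = 0) is flattened into one row, giving a
# 100 x 10,000 table, i.e., one million values
rng = np.random.default_rng(2)
maps = (rng.random((100, 100, 100)) > 0.7).astype(float)
table = maps.reshape(100, -1)                  # 100 rows, 10,000 columns

# Collapse each map into a single 2D point (100 rows, 2 columns)
pca_2d = PCA(n_components=2).fit_transform(table)
kpca_2d = KernelPCA(n_components=2, kernel="rbf").fit_transform(table)

# Group the 2D points, and hence the original maps, by mutual resemblance
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pca_2d)

print("2D PCA table shape:", pca_2d.shape)     # (100, 2)
print("Cluster sizes:", np.bincount(labels))
```

Scattering `pca_2d` or `kpca_2d` and coloring the points by `labels` gives the kind of 2D grouping of realizations discussed above; with real channelized maps, the Kernel PCA projection would typically reveal more visual structure, at the cost of not being directly invertible back to the map domain.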