4. Data Visualization

Transcript

- Data visualization adds a new level to our understanding of data. In fact, this body of data analysis is now known as Exploratory Data Analysis, or EDA. This is, of course, not a new area, as EDA was defined as graphical detective work in 1977. Nowadays, we can count on a plethora of public and commercial software resources to perform EDA. EDA therefore allows us to better understand many of the previously discussed issues, including data distributions and data quality issues such as missing values, noise, and outliers. It also lets us discover correlations and different types of functional relations among variables, find subsets of interest in the data, evaluate results upon transforming the data, discover solution drivers, and perform what-if scenarios with different subsets of the data. In many cases, EDA may even be sufficient for achieving data analysis goals, as indicated in the flow chart on the right. There we can see that EDA may provide sufficient or complementary insights to proceed directly with [inaudible]. Let's go over a few visualization techniques. A very familiar one is the pie chart. A pie chart is the simplest statistical graphic; it is divided into slices to illustrate numerical proportion. I illustrate the concept with a simple flat pie chart to help us understand the proportion of rock and fluids in a given asset. However, a word of caution is in order here: mislabeling or providing the wrong perspective in a pie chart can lead to highly misleading results. Finally, it is important to keep in mind that pie charts can be replaced in most cases by a bar plot. A bar plot is a chart that presents grouped data with rectangular bars whose lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally. Bar plots provide more flexibility than pie charts, as other types of visualizations can be overlaid to provide further insights.
In the example shown, we can see not only the rise in average production in 2016 but also, by introducing different colors in each bar, a rise in average water, gas, and oil during the same year. A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. To construct a histogram, the first step is to divide the entire range of values into a series of small intervals, or bins, and then count how many values fall into each bin. A histogram may also be normalized for the purpose of displaying relative frequencies. It is key to select the right bin size in histograms. For instance, using histograms may not be a good idea for small data sets or with a small number of bins, since artificial distributions can be generated. In the current figure, we see that using a small number of bins gives the impression of a smooth gas production distribution shifted to the right, when in reality the distribution is quite uneven across the whole range of values. A very useful resource to discover relationships and correlations between each pair of variables is the scatter plot. A scatter plot is very useful when we wish to see how two comparable data sets agree with each other. The more the two data sets agree, the more the points tend to concentrate in the vicinity of the identity line, that is, the diagonal line y = x. If the two data sets are numerically identical, the points fall exactly on the identity line. When dealing with multiple variables, it is convenient to generate a matrix of scatter plots between each variable pair. The scatter plot of a variable against itself is usually replaced by the histogram of that variable. In the slide, we can see a strong linear correlation between pressure measurements and oil and gas production. In other cases, the correlation is weak and the points tend to fall along a horizontal trend.
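
The histogram construction described above (divide the range into bins, count the values in each bin, optionally normalize) can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the lecture: the function names and the gas rate values are made up for the example, and equal-width bins are an assumption. It shows how a coarse binning can hide gaps that a finer binning reveals.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical, so all fall in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is assigned to the last bin rather than a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def relative_frequencies(counts):
    """Normalize bin counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical gas production rates, invented for illustration only.
gas_rates = [1, 1, 2, 2, 2, 9, 9, 10, 10, 10, 18, 19]

coarse = histogram(gas_rates, 2)  # [7, 5]: looks like one smooth distribution
fine = histogram(gas_rates, 6)    # [5, 0, 2, 3, 0, 2]: empty bins expose the gaps

print(coarse, fine, relative_frequencies(coarse))
```

With two bins the data looks evenly spread, while six bins show three separate clusters, which is the kind of artifact the lecture warns about.
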
One of the most powerful aspects of a scatter plot, however, is its ability to show nonlinear relationships between variables. Furthermore, if the data is represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns. For comparing the spread of multiple variables at the same time, it is very convenient to use box plots. Box plots allow us to graphically depict groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes, called whiskers, indicating variability outside the upper and lower quartiles. Outliers may be plotted as individual points. For this reason, it is also called a box-and-whisker plot. The interquartile range, or IQR, is the difference between the third quartile and the first quartile. Box plots are non-parametric. This means that they display variation in samples of a statistical population without making any assumptions about the underlying probability distribution. The spacings between the different parts of the box indicate the degree of statistical dispersion, or spread, and skewness in the data, and show outliers. In the example given, we can see that the flowing bottomhole pressure data shows less dispersion and skewness than the tubing and casing pressure. A very useful graphical resource to discover trends, dependencies, and even solution drivers is the parallel coordinates plot. Assuming that the data is arranged in a table with variables as columns and samples as rows, the parallel coordinates plot maps each row to a line, or horizontal profile. This means that each variable is represented by a point on the horizontal axis, and there are as many lines as samples in the data set. This visualization is closely related to time series visualization, except that it is applied to data where the axes do not correspond to points in time and therefore do not have a natural order. Different axis arrangements may be of interest to discover relationships.
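
The box plot quantities mentioned above (quartiles, IQR, whiskers, outliers) can be computed with Python's standard library. The sketch below is only an illustration under stated assumptions: the 1.5 × IQR fence for flagging outliers is a common convention that the lecture does not specify, the "inclusive" quantile method is one of several possible choices, and the pressure readings are invented.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Return the quartiles, IQR, fences, and outliers used to draw a box plot.

    whisker_factor=1.5 is an assumed convention, not taken from the lecture.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # third quartile minus first quartile
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    # Points beyond the fences are the ones drawn individually as outliers.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical tubing pressure readings, invented for illustration only.
tubing_pressure = [100, 110, 120, 130, 140, 150, 160, 170, 400]

summary = box_plot_summary(tubing_pressure)
print(summary)
```

Because only order statistics are used, no assumption about the underlying distribution is needed, which is what makes the box plot non-parametric.
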
The big picture can be seen through the patterns of lines, and lines can be highlighted to see the total performance of specific data groups. To demonstrate this, we show three different sets of lines, in yellow, red, and blue, corresponding to different rock grain sizes. The variables here are given by different statistical moments associated with the grain distribution, namely mean, variance, skewness, and kurtosis. The values across these variables have been normalized. We can observe that the grain size is inversely proportional to skewness and kurtosis. On the other hand, the mid-grain size shows the lightest variance, or dispersion. There are a few things to keep in mind when using parallel coordinates plots. Large data sets create a lot of visual clutter. The order of the axes impacts how the reader understands the data. Relationships between adjacent variables are easier to perceive than those between non-adjacent variables. Depending on the data, each axis can have a different scale, which may be difficult to display. This concludes our video lecture on data exploration. With this knowledge in mind, you are now ready to perform inference and prediction from data. In the next videos, we will cover clustering, classification, and regression. Thanks and see you soon.
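
As a companion to the parallel coordinates discussion in this transcript, the following sketch shows one way to prepare a table for such a plot: normalize each variable, then map each row to a line across the axes. It is an assumption-laden illustration. Min-max scaling is one possible normalization (the lecture does not say which was used), the function names are hypothetical, and the sample values are invented rather than read from the lecture's figure.

```python
def min_max_normalize(table):
    """Rescale every column of a list-of-dicts table to the range [0, 1]."""
    columns = list(table[0].keys())
    lows = {c: min(row[c] for row in table) for c in columns}
    highs = {c: max(row[c] for row in table) for c in columns}
    normalized = []
    for row in table:
        normalized.append({
            # A constant column has no spread, so map it to 0.0 to avoid dividing by zero.
            c: 0.0 if highs[c] == lows[c]
            else (row[c] - lows[c]) / (highs[c] - lows[c])
            for c in columns
        })
    return normalized


def to_polylines(table, axis_order):
    """Map each row to a list of (axis position, value) points, one line per sample."""
    return [
        [(position, row[name]) for position, name in enumerate(axis_order)]
        for row in table
    ]


# Hypothetical moments of three grain size distributions, invented for illustration.
samples = [
    {"mean": 1, "variance": 1, "skewness": 3, "kurtosis": 9},
    {"mean": 3, "variance": 2, "skewness": 2, "kurtosis": 5},
    {"mean": 5, "variance": 4, "skewness": 1, "kurtosis": 1},
]

normalized = min_max_normalize(samples)
polylines = to_polylines(normalized, ["mean", "variance", "skewness", "kurtosis"])
reordered = to_polylines(normalized, ["skewness", "mean", "kurtosis", "variance"])

for line in polylines:
    print(line)
```

Passing a different `axis_order` changes which variables sit next to each other, which matters because relationships between adjacent axes are the easiest to perceive.
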