The first step in solving problems in public health and making evidence-based decisions is to collect accurate data and to describe, summarize, and present it in such a way that it can be used to address problems.
Compute a mean, median, standard deviation, quartiles, and range for a continuous variable. Construct a frequency distribution table for dichotomous, categorical, and ordinal variables.
Give an example of when the mean is a better measure of central tendency (location) than the median. Procedures to summarize data and to perform subsequent analysis differ depending on the type of data (or variables) that are available. 1) Discrete Variables: variables that assume only a finite number of values, for example, race categorized as non-Hispanic white, Hispanic, black, Asian, other. 3) Time to Event Variables: these reflect the time to a particular event such as a heart attack, cancer remission or death. Frequency distribution tables are a common and useful way of summarizing discrete variables. The investigators also recorded whether or not the subjects were being treated with antihypertensive medication, as shown below. Note that for dichotomous and categorical variables there should be a space in between the response options. In contrast, figure 2 below illustrates a relative frequency bar chart of the distribution of treatment with antihypertensive medications. Consider the graphical representation of the data in Table 3 above, comparing the relative frequency of antihypertensive medications between men and women. A distinguishing feature of bar charts for dichotomous and non-ordered categorical variables is that the bars are separated by spaces to emphasize that they describe non-ordered categories. The first summary statistic that is important to report for a continuous variable (as well as for any discrete variable) is the sample size (in the example here, sample size is n=10).
Diastolic blood pressures <80 mm Hg are considered normal, and we can see that the last two exceed the upper limit just barely. In biostatistics, the term 'average' is a very general term that can be addressed by several statistics. Table 9 displays the sample means and medians for each of the continuous measures for the sample of n=10 in Table 8. Table 10 displays the sample ranges for each of the continuous measures in the subsample of n=10 observations.
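The summary statistics named above (sample size, mean, median, range, standard deviation, quartiles) can be sketched in a few lines of Python. The ten values below are illustrative only: they were chosen to be consistent with the reported mean of 71.3 and are not confirmed as the study's data. The quartile helper uses the median-of-halves convention, which is one of several in common use; other software may report slightly different quartiles.

```python
import statistics

# Illustrative values only: ten diastolic blood pressures (mm Hg) chosen to be
# consistent with the reported mean of 71.3; not confirmed as the study's data.
dbp = [62, 63, 64, 67, 70, 72, 76, 77, 81, 81]


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    This is one convention among several; it assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]  # the middle value is left out when n is odd
    return statistics.median(lower), statistics.median(upper)


n = len(dbp)
mean = statistics.mean(dbp)
median = statistics.median(dbp)
sd = statistics.stdev(dbp)  # sample standard deviation (divides by n - 1)
value_range = max(dbp) - min(dbp)
q1, q3 = quartiles_by_halves(dbp)
iqr = q3 - q1

print(f"n={n} mean={mean:.1f} median={median} sd={sd:.1f} "
      f"range={value_range} Q1={q1} Q3={q3} IQR={iqr}")
```

Each deviation from the mean is simply `x - mean`; `statistics.stdev` squares those deviations, sums them, divides by `n - 1`, and takes the square root.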
When discussing the sample mean, we found that the sample mean for diastolic blood pressure was 71.3. The deviations from the mean reflect how far each individual's diastolic blood pressure is from the mean diastolic blood pressure. Table 13 displays the means, standard deviations, medians, quartiles and interquartile ranges for each of the continuous variables in the subsample of n=10 participants who attended the seventh examination of the Framingham Offspring Study. Since there are no suspected outliers in the subsample of n=10 participants, the mean and standard deviation are the most appropriate statistics to summarize average values and dispersion, respectively, of each of these characteristics. For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to illustrate calculations of summary statistics and determination of outliers. Because men are taller, a more appropriate comparison is of body mass index; see Figure 15 below. The following table summarizes key statistics and graphical displays organized by variable type. Note that the RR produced by proc freq is 4.15, because the procedure is comparing ASA to GG!
SAS calculates the odds ratio assuming that column 1 is the event of interest and row 1 is the treatment group of interest (meaning that column 2 is the reference event and row 2 is the reference treatment group).
We should re-format to ensure that we are comparing the new treatment, GG, to the usual treatment, ASA.
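A small sketch can show why the row order matters. The function below computes the relative risk and odds ratio from a 2x2 table under the convention described above (row 1 is the group of interest, column 1 is the event). The counts are hypothetical and chosen only for illustration; they are not the trial's data and do not reproduce the RR of 4.15.

```python
def risk_ratio_and_odds_ratio(table):
    """Return (RR, OR) for table = [[a, b], [c, d]].

    Row 0 is the group of interest, row 1 the reference group;
    column 0 is the event, column 1 the non-event.
    """
    (a, b), (c, d) = table
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio


# Hypothetical counts, for illustration only: [events, non-events]
asa = [40, 60]
gg = [10, 90]

# ASA in row 1: the ratios compare ASA to GG
rr_asa_vs_gg, or_asa_vs_gg = risk_ratio_and_odds_ratio([asa, gg])

# GG in row 1: the ratios compare GG to ASA (the intended comparison)
rr_gg_vs_asa, or_gg_vs_asa = risk_ratio_and_odds_ratio([gg, asa])

print(f"ASA vs GG: RR={rr_asa_vs_gg:.2f} OR={or_asa_vs_gg:.2f}")
print(f"GG vs ASA: RR={rr_gg_vs_asa:.2f} OR={or_gg_vs_asa:.2f}")
```

Swapping the rows gives the reciprocal of each ratio, which is why a table in the wrong order produces a valid number with the opposite interpretation.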
I don’t know about you … but I get hundreds of e-mails a day! Not only e-mail, but text messages, tweets, e-mail advertising, notices from blogs, RSS feeds … coming to me at a numbingly fast pace day and night. It is a lot easier to understand and grasp concepts through pictures than through text. Big data, on a corporate level, is far more challenging than my own personal big data issues. Billions upon billions of rows of data flow in from all channels, faster than it’s possible to wade through it all.
It’s not a fantasy … it’s reality, with game-changing new software called SAS Visual Analytics.
One clear indication of the current oversupply is how the price of crude reacts to news headlines. The market impact of lower volatility and tighter OPEC controls is reflected in the flatness seen in the latest VirtualOil simulation (see Fig. One example is analysis of downhole sensor data in wells employing steam-assisted gravity drainage (SAGD) to produce heavy oil and bitumen. The hypothetical derivatives-based oil production firm VirtualOil simulates the performance of a generic crude oil asset, and delivers sectorial exposure to the commodity oil market.

Like spaghetti on your plate, they can be hard to unravel, yet for many analysts they are a delicious staple of data visualization. The data set in this article contains World Bank data about the average life expectancy (at birth) for more than 200 countries.
The line plot enables you to easily track the rise and fall of life expectancy over time for each of these 10 countries.
The fact that the labels for Peru and Kuwait overlap is a sign that the line plot is starting to transition to a spaghetti plot. When there are many individual countries (or stocks or patients), the line plot no longer reveals the behavior of each individual unit, but it can still reveal trends.
For the life expectancy data, an Income variable records the relative wealth of each country. The following call to PROC SGPLOT creates a spaghetti plot that contains lines for 207 countries.
Thirty-two nations are wealthy and belong to the Organization for Economic Co-operation and Development (OECD).
Fifty-one countries are upper-middle income, 51 are lower-middle income, and 31 are low income.
The GROUPLC= option (available in SAS 9.4M2) is used to color the lines by the five levels of the Income variable. One possible alternative is to draw the curves for each category in a separate panel rather than overlaying the categories. By using the tool tips, you can discover the names of countries that experienced drops in life expectancy due to conflict or environmental disasters. For example, Hans Rosling famously used an animated bubble plot to show the relationship between life expectancy and average income over time. For an alternative visualization of this time series data, you might consider lasagna plots.
I would use a heat map if there are too many lines to draw, since overlapping lines can become a mess. Information consists of data elements or data points which represent the variables of interest. As a result, it is important to have a clear understanding of how variables are classified. Total serum cholesterol level, height, weight and systolic blood pressure are examples of continuous variables. Because the numbers of men and women are unequal, the relative frequency of treatment for each sex must be calculated by dividing the number on treatment by the sample size for the sex. The mutually exclusive and exhaustive categories are shown in the first column of the table. Figure 1 below is a frequency bar chart which corresponds to the tabular presentation in Table 1 above.
The analogous graphical representation for an ordinal variable does not have spaces between the bars in order to emphasize that there is an inherent order. This graphical representation corresponds to the tabular presentation in the last column of Table 2 above. However, the bar chart on the left minimizes the difference, because the vertical scale is too expansive, ranging from 0 - 100%. When one is dealing with ordinal variables, however, the appropriate graphical format is a histogram. The rightmost column contains the body mass index (BMI) computed using the height and weight measurements. The table below shows each of the observed values along with its respective deviation from the sample mean. Expert panel on detection, evaluation and treatment of high blood cholesterol in adults: summary of the second report of the NCEP expert panel (Adult Treatment Panel II). They can explore relationships among hundreds of variables and determine their relative importance in order to quickly build predictive models and make iterative changes on the fly.
There is no doubt that a major terrorist attack as horrific as the tragic events in Paris will have serious consequences for the Middle East.
Using analytics to achieve the optimal play between injected steam and production flow can have a significant effect on the cost of producing a barrel of oil. The reorganized VirtualOil structure starts up with an investment of \$200MM in monthly average price call options with a strike price of \$25 per barrel on the price of West Texas Intermediate (WTI) light sweet crude oil. This article presents the good, the bad, and the messy about spaghetti plots and shows how to create basic and advanced spaghetti plots in SAS. The response variable might be the price of a stock, the temperature in a city, or the blood pressure of a patient. As you increase the number of curves in a line plot, more labels will overlap and it becomes harder to distinguish colors and to trace a country's curve from beginning to end. You can see that the highest life expectancy belongs to OECD nations whereas the lowest belongs to low-income nations.
The upper- and lower-middle income panels are a mixture of countries from diverse geographic locations. For example, the life expectancy in Equatorial Guinea is low relative to other high-income non-OECD countries.

Instead of plotting a response versus time, you can show changes in time by using an animation.
Sanjay Matange showed how to use PROC SGPLOT to create Rosling's animated bubble plot in SAS. When dealing with public health problems the units of measurement are most often individual people, although if we were studying differences in medical practice across the US, the subjects, or units of measurement, might be hospitals. The numbers of men and women being treated (frequencies) are almost identical, but the relative frequencies indicate that a higher percentage of men are being treated than women.
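The distinction between frequencies and relative frequencies can be sketched as follows. The counts are hypothetical, chosen only to mimic the situation described (nearly equal numbers treated, unequal group sizes); they are not the study's figures.

```python
# Hypothetical counts for illustration only (not the study's figures).
groups = {
    "men": {"treated": 610, "n": 1600},
    "women": {"treated": 600, "n": 1900},
}

# Relative frequency (%) = number on treatment / sample size for that sex
relative_frequency = {
    sex: 100 * counts["treated"] / counts["n"] for sex, counts in groups.items()
}

for sex, pct in relative_frequency.items():
    print(f"{sex}: {groups[sex]['treated']} treated of {groups[sex]['n']} ({pct:.1f}%)")
```

Dividing by each group's own sample size is what makes the two percentages comparable even though the groups differ in size.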
The frequencies, or numbers of participants in each response category, are shown in the middle column and the relative frequencies, as percentages, are shown in the rightmost column. On the other hand, the bar chart on the right visually exaggerates the difference, because the vertical scale is too restrictive, ranging from 30 - 40%. A histogram is similar to a bar chart, except that the adjacent bars abut one another in order to reinforce the idea that the categories have an inherent order.
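The three-column layout of a frequency distribution table (category, frequency, relative frequency) can be built with a short sketch like the one below. The variable and its responses are hypothetical and serve only to illustrate the layout.

```python
from collections import Counter

# Hypothetical responses to a categorical variable, for illustration only.
responses = ["Married"] * 6 + ["Single"] * 2 + ["Divorced"] + ["Widowed"]
categories = ["Married", "Single", "Divorced", "Widowed"]


def frequency_table(values, categories):
    """Return rows of (category, frequency, relative frequency in %)."""
    counts = Counter(values)
    total = len(values)
    return [(cat, counts[cat], 100 * counts[cat] / total) for cat in categories]


table = frequency_table(responses, categories)

print(f"{'Category':<10}{'Frequency':>10}{'Relative %':>12}")
for category, freq, rel in table:
    print(f"{category:<10}{freq:>10}{rel:>12.1f}")
```

Because the categories are mutually exclusive and exhaustive, the frequencies sum to the sample size and the relative frequencies sum to 100%.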
It is preferable to arrange the data so that the outcome of interest appears in the first column and the target group appears in the first row, and to use formatting to reorder the table if it does not. And they can display results in a way that makes sense and provides context and explanation for decision makers. In the past, any incident that increased tensions in oil-producing regions resulted in an immediate, substantial market reaction. Value is riding small sparks of occasional market volatility, and Value-at-Risk is closely tracking that dollar-per-barrel Mark to Market valuation. SAS has been working with customers to use analytics to draw optimization out of the data streaming off the oilfield to create more efficient use of assets in the low-price environment. The strip of options starts at 10,000 barrels per day and extends out for five years with a 20 percent average annual decline in underlying notional barrels, replicating a physical oil asset. When there are only a handful of stocks, cities, or patients, you can display multiple lines on the same plot and use labels, colors, or patterns to distinguish the individual units. You can see that the average trend is increasing, and that the slope looks greater for the low-income nations. For the World Bank data, you can use the BY statement in PROC SGPLOT to create full-sized plots of each Income level, or you can use the SGPANEL procedure to create five cells, each with 30 to 50 curves. It discusses statistical and computational algorithms, statistical graphics, simulation, efficiency, and data analysis. A population consists of all subjects of interest, in contrast to a sample, which is a subset of the population of interest. The frequency histogram below summarizes the blood pressure data that was presented in a tabular format in Table 4 on the previous page. Similarly, the OR is the ratio of the odds of CA abnormalities in the ASA arm compared to the GG arm, instead of GG compared to ASA.
If the data are not in this format, the odds ratio and relative risks are computed but the interpretation may be different than what is intended.
It’s an indication of how big the US shale oil boom was (the US Energy Information Administration reckons it was the largest expansion in American crude oil production in more than a century) – and how tenacious the year-long bust has turned out to be.
But in the present market, oil prices did not respond with a run-up – a sign not of indifference, but of glut.
The recent decision to restructure VirtualOil at a \$25 strike price means the portfolio more closely reflects many producers’ current level of operating costs.
You can see that some curves had negative slopes in the 1980s and 1990s, and that several had precarious dips. Again, you can use transparency and tool tips to help discern individual "noodles" among the five spaghetti plots. When the number of curves becomes large (typically more than 20 or 30), it becomes difficult to distinguish individual curves.
It is generally not possible to gather information on all members of a population of interest.
With inventories full to bursting, it's news of drawdowns – rather than events that could spark supply shortages – that provides short-term stimulation in the market.
The difference is that our fictitious oil portfolio can be nimble because it's structured around derivatives, so it can generate profit in the current price environment. Although using semi-transparent lines helps to reduce overplotting, the plot becomes virtually unreadable when you have 200 or more curves.
Instead, we select a sample from the population of interest, and generalizations about the population are based on the assumption that the sample is representative of the population from which it was drawn.
Monthly cash flow is generated when the daily average WTI price relative to the preceding month exceeds \$25 per barrel.
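The payoff rule can be sketched as a monthly average price call. This is a minimal illustration under stated assumptions, not VirtualOil's actual model: it assumes the payoff is the excess of the month's average daily WTI price over the \$25 strike, multiplied by the month's notional barrels, and it assumes a 30-day month. The function name, parameters, and prices are hypothetical.

```python
def monthly_cash_flow(daily_prices, strike=25.0, barrels_per_day=10_000,
                      days_in_month=30):
    """Payoff of a monthly average price call option (illustrative sketch).

    Assumptions (not from the source): the payoff is
    max(average daily price - strike, 0) times the month's notional barrels,
    and a month has `days_in_month` production days.
    """
    average_price = sum(daily_prices) / len(daily_prices)
    notional_barrels = barrels_per_day * days_in_month
    return max(average_price - strike, 0.0) * notional_barrels


# Hypothetical daily prices (USD per barrel), for illustration only
in_the_money = monthly_cash_flow([29.0, 30.0, 31.0])
out_of_the_money = monthly_cash_flow([19.0, 20.0, 21.0])

print(f"Average above strike: {in_the_money:,.0f} USD")
print(f"Average below strike: {out_of_the_money:,.0f} USD")
```

When the average price sits below the strike the option pays nothing; unlike a physical producer, the portfolio does not lose money on each barrel in that case.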
The following statements create a panel of spaghetti plots in which each plot is colored by a categorical variable (Region) that encodes the country's geographic region.
But many nonfictional producers’ results are saddled with additional overhead and capital expenditures, meaning they're losing money on each barrel produced.
Cash flow is reinvested monthly at 5% and the project winds up when the reserves are depleted.