Essential Statistical Concepts: Data Analysis and Modeling

Statistics: techniques for collecting, organizing, analysing, and interpreting data.
Data may be:
quantitative (values expressed numerically) or qualitative (characteristics being tabulated).
Descriptive statistics: techniques that summarize and describe numerical data for easier interpretation; they can be graphical or involve computational analysis.
Inferential statistics: techniques by which decisions about a statistical population or process are made based only on a sample being observed; they rely on probability concepts.
VARIABLES:
Discrete variable: has observed values only at isolated points along a scale of values; the data are generated by counting, so values are generally expressed as whole numbers.
Continuous variable: can assume a value at any point along a specific interval of values; the data are generated by measuring.



Discrete/continuous variables:
weight of the content (continuous), current ratio (continuous), number of defective computers produced (discrete), number of individuals (discrete), average number of cars sold per sales employee (continuous), dollar amount of revenue (strictly discrete, since money is counted in whole cents, but usually treated as continuous).


STATISTICAL ANALYSIS OF A VARIABLE:
Frequency distribution: a table in which the values of a variable are grouped into classes and the number of observed values falling in each class is recorded. Data organized this way are called grouped data.
Histogram: a bar chart of the frequency distribution, with the classes on the x axis and the frequencies (f) shown by the bar heights.
Frequency polygon: a line graph of a frequency distribution. The class midpoints are on the horizontal axis, and the number of observations in each class is plotted as a dot against the vertical axis; the dots are connected with line segments to form the polygon. Class midpoint = (lower class limit + upper class limit) / 2.
FREQUENCY CURVE: a smoothed frequency polygon; it may be negatively skewed, positively skewed, or symmetrical.
PARETO CHART: a frequency bar chart describing a qualitative variable; the bars are arranged in descending order from left to right.
TIME SERIES: a set of observed values for a sequentially ordered series of periods; bar charts and line graphs are popular ways to display it.
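The grouping step behind a frequency distribution, the class midpoints used by a frequency polygon, and the descending order of a Pareto chart can be sketched in a few lines of Python. This is only an illustration: the function names, the equal-width classes of the form [lower, upper), and the sample data are assumptions, not part of the notes.

```python
from collections import Counter


def frequency_distribution(values, lower, width, n_classes):
    """Group values into equal-width classes [lo, hi) and count each class."""
    classes = []
    for i in range(n_classes):
        lo = lower + i * width
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)
        # Class midpoint = (lower class limit + upper class limit) / 2
        midpoint = (lo + hi) / 2
        classes.append(
            {"lower": lo, "upper": hi, "midpoint": midpoint, "frequency": count}
        )
    return classes


def pareto_order(categories):
    """Count a qualitative variable and sort categories by descending frequency."""
    return Counter(categories).most_common()


# Hypothetical sample data, for illustration only.
weights = [12, 15, 17, 21, 22, 25, 28, 31, 33, 38]
table = frequency_distribution(weights, lower=10, width=10, n_classes=3)
for row in table:
    print(row)

defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]
print(pareto_order(defects))
```

Plotting the frequencies as bars over the classes gives the histogram; plotting them as dots over the midpoints and joining the dots gives the frequency polygon.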
 


MEASURES OF DISPERSION AND DISTRIBUTION: describe the variability among a group of values. The most important techniques:
Mean absolute deviation (MAD): based on the absolute value of the difference between each value in the data set and the mean of the group; it is the average distance between each data point and the mean (population MAD and sample MAD).
Variance: based on the squared difference between each value in the data set and the mean of the group. The square root of the variance is called the standard deviation.
Standard deviation: the most appropriate measure of variability when the population or sample is approximately symmetrical and normally distributed.
Average deviation (the MAD above): a better gauge of variability when there are distant outliers, that is, when the data are not normally distributed.
Coefficient of variation: indicates the relative magnitude of the standard deviation as compared with the mean of the distribution of measurements, given as a percentage (CV…).
Pearson's coefficient of skewness: helps identify whether a distribution is symmetrical; it is based on the difference between the mean and the median relative to the standard deviation.
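As a rough sketch, the measures above can be computed with Python's standard `statistics` module. The helper names and the data are hypothetical. The coefficient of variation is taken here as the sample standard deviation divided by the mean, times 100, and the skewness coefficient uses the common form 3 × (mean - median) / standard deviation; the factor of 3 is an assumption, since the notes do not give the formula.

```python
import statistics


def mean_absolute_deviation(values):
    """Average absolute distance between each data point and the mean."""
    m = statistics.mean(values)
    return sum(abs(v - m) for v in values) / len(values)


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (assumed form)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def pearson_skewness(values):
    """Assumed form: 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


# Hypothetical data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("MAD:", mean_absolute_deviation(data))
print("population variance:", statistics.pvariance(data))
print("population SD:", statistics.pstdev(data))
print("sample variance:", statistics.variance(data))
print("sample SD:", statistics.stdev(data))
print("CV (%):", coefficient_of_variation(data))
print("Pearson skewness:", pearson_skewness(data))
```

A positive skewness value means the mean lies above the median (positively skewed); a value near zero suggests a roughly symmetrical distribution.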


LINEAR REGRESSION + CORRELATION ANALYSIS:
Regression analysis: the objective is to estimate the relationship between one dependent variable and an independent variable.
Simple regression: the value of a dependent variable is estimated on the basis of one independent variable (general assumptions: the dependent variable is a random variable; check the relationship with a scatter plot).
Scatter plot: a graph in which each plotted point represents an observed pair of values for the independent and dependent variables.
Regression line: a positive slope indicates a direct linear relationship, a negative slope an inverse linear relationship, and a flat line (slope of zero) no relationship.
Method for fitting a regression line: the parameters β0 and β1 of the linear regression model are estimated by the values b0 and b1 computed from sample data. The best-fitting regression line is the one for which the sum of the squared deviations between the estimated and actual values of the dependent variable for the sample data is minimized.
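The least-squares estimates b0 and b1 have a closed form for simple regression, sketched below in plain Python. The function name `fit_line` and the data are hypothetical; the formulas are the standard ones, b1 = Sxy / Sxx and b0 = mean(y) - b1 × mean(x).

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing the sum of squared deviations of y from b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1


# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print("intercept b0:", b0)
print("slope b1:", b1)
```

Here the slope is positive, so the fitted line describes a direct linear relationship.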


RESIDUAL + RESIDUAL PLOTS:
Residual: the difference between the observed value of y and the fitted value. Find the fitted value and the residual (e) using b0 and b1.
Standard error of estimate: the conditional standard deviation of the dependent variable (y) given a value of the independent variable (x). (Sy.x…)
Inferences concerning the slope: in the absence of any relationship, β1 = 0, so the null hypothesis tested is H0: β1 = 0; a hypothesized value of the slope is tested by computing a t statistic (…).
Correlation analysis: measures the degree of relationship between the variables. Population assumptions: the relationship is linear, both variables are random variables, and the variances are equal.
Coefficient of determination: indicates the proportion of variance in the dependent variable that is statistically explained by the regression equation.
Coefficient of correlation: its sign indicates the direction of the relationship between the X and Y variables, and its absolute value indicates the extent of the relationship.
Significance testing for the correlation coefficient: the null hypothesis tested is ρ = 0; a hypothesized value of the correlation coefficient is tested by computing a t statistic with n - 2 degrees of freedom.
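The quantities in this section can be tied together in one small sketch. It assumes the usual textbook formulas for simple regression, which the notes do not spell out: standard error of estimate = sqrt(SSE / (n - 2)), t for the slope = b1 / (standard error of estimate / sqrt(Sxx)), and t for the correlation = r × sqrt((n - 2) / (1 - r²)). The function name and the data are hypothetical.

```python
import math


def regression_summary(x, y):
    """Residuals, standard error of estimate, r, r squared, and t statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e = observed - fitted

    sse = sum(e ** 2 for e in residuals)
    s_yx = math.sqrt(sse / (n - 2))        # standard error of estimate
    t_slope = b1 / (s_yx / math.sqrt(sxx))  # tests H0: beta1 = 0

    r = sxy / math.sqrt(sxx * syy)         # coefficient of correlation
    r_squared = r ** 2                     # coefficient of determination
    # Tests H0: rho = 0 with n - 2 degrees of freedom (undefined if r_squared == 1).
    t_r = r * math.sqrt((n - 2) / (1 - r_squared))

    return {
        "residuals": residuals,
        "s_yx": s_yx,
        "t_slope": t_slope,
        "r": r,
        "r_squared": r_squared,
        "t_r": t_r,
    }


# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
summary = regression_summary(x, y)
for name, value in summary.items():
    print(name, value)
```

In simple regression the two t statistics coincide, so testing the slope and testing the correlation lead to the same conclusion. Plotting the residuals against x gives the residual plot.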


Obtaining data for statistical purposes:
Direct observation: for example, statistical process control, in which samples of output are systematically assessed. If it is not possible to collect data directly, the information must be obtained from individual respondents.
SAMPLING methods:
Random sampling: every item in the target population has a known, and usually equal, chance of being chosen for inclusion in the sample. Types: simple random sample, systematic sample, stratified sample, cluster sampling.
Non-random sampling: an individual selects the items to be included in the sample based on judgement.
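Three of the random sampling methods can be sketched with Python's standard `random` module. The function names, the proportional allocation used for the stratified sample, and the populations are assumptions made for illustration; cluster sampling and judgement sampling are not shown.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Every item has an equal chance of selection; drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=None):
    """Take every k-th item after a random start within the first k items."""
    rng = random.Random(seed)
    k = len(population) // n       # sampling interval
    start = rng.randrange(k)       # random starting index in [0, k)
    return [population[start + i * k] for i in range(n)]


def stratified_sample(strata, fraction, seed=None):
    """Draw the same fraction at random from each stratum (assumed allocation)."""
    rng = random.Random(seed)
    sample = []
    for items in strata.values():
        size = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, size))
    return sample


# Hypothetical populations, for illustration only.
population = list(range(1, 101))
strata = {"plant_a": list(range(50)), "plant_b": list(range(100, 130))}

print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
print(stratified_sample(strata, 0.1, seed=1))
```

Passing a seed makes a draw repeatable, which helps when a sample has to be audited later.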

