3MT3: Midterm 1

Validity (predictor vs. criterion) Content-Related Evidence: education; construct underrepresentation (does it test all imp elements?) or construct-irrelevant variance (does it rely on the wrong ones)? How to know?: factor analysis (are there separate factors?) and expert judgement / Criterion-Related Evidence: make a prediction (predictive validity evidence): high school GPA predicts uni GPA? Concurrent validity evidence: 2 things measured at the same time; does a small sample represent the overall group? / Validity coefficient: tend to not be high (.3-.4; .6 is very high) why? each test has unreliability and may only be somewhat valid. (also depends on what you're testing for & type 1 (yes when no) vs type 2 (no when yes) errors) / Population: representative, attrition, restriction of range / Generalizability: test may not be valid for all groups / Construct-related evidence: define the construct, test it, and look at a range of relationships. Convergent evidence and discriminant evidence. / Maximum validity coefficient: maximum validity depends on the reliability of the criterion and the predictor test (an unreliable test can't be valid).
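The maximum validity coefficient can be sketched numerically (a minimal illustration; `max_validity` is a hypothetical helper name, but the formula r_max = sqrt(r_xx * r_yy) is the standard one):

```python
import math

def max_validity(r_xx: float, r_yy: float) -> float:
    """Upper bound on a validity coefficient given the
    reliability of the predictor (r_xx) and of the
    criterion (r_yy): r_max = sqrt(r_xx * r_yy)."""
    return math.sqrt(r_xx * r_yy)

# An unreliable test can't be valid: even against a perfectly
# reliable criterion, a predictor with reliability .49 can
# never correlate with the criterion above .70.
print(round(max_validity(0.49, 1.0), 2))   # 0.7
print(round(max_validity(0.81, 0.64), 2))  # 0.72
```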

Test Items Formats: dichotomous, polytomous (3 options is good bc 4 usually has 1 useless one) / Guessing correction: R – (W / (n-1)) / Likert: odd number of points has a neutral midpoint (forcing a choice with an even number gives diff data) – problems: only ordinal? how invested are takers? / other: category scale (1-10; why? based on discrimination needed) / context effects: depends on what's around you (anchoring vs adjustment) / Item Analysis: difficulty (optimal = (chance + 1) / 2) and discriminability (extreme group method: bottom vs top quartile; proportion each got right, subtract = discriminability index) of each item / Point biserial: different kind of correlation, where one variable is dichotomous (right or wrong?), correlated with total score; compare proportion getting item correct against total score: tests if it's a good item; BEST if difficulty is between .30-.70 and discriminability is above .3 / item response theory: ask ppl questions at the level they are performing at (computer titrates difficulty to the level that is best for you)
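The guessing correction and item-analysis formulas above can be sketched as follows (hypothetical helper names; assumes dichotomous right/wrong scoring):

```python
def corrected_score(right: int, wrong: int, n_options: int) -> float:
    """Correction for guessing: R - W/(n - 1),
    where n is the number of answer options."""
    return right - wrong / (n_options - 1)

def optimal_difficulty(n_options: int) -> float:
    """Optimal item difficulty: midpoint between the
    chance rate and 1.0, i.e. (chance + 1) / 2."""
    chance = 1 / n_options
    return (chance + 1) / 2

def discriminability_index(top_correct: float, bottom_correct: float) -> float:
    """Extreme-group method: proportion correct in the top
    quartile minus proportion correct in the bottom quartile."""
    return top_correct - bottom_correct

# 30 right, 12 wrong on a 4-option test:
print(corrected_score(30, 12, 4))                 # 26.0
print(optimal_difficulty(4))                      # 0.625
print(round(discriminability_index(0.8, 0.3), 2)) # 0.5
```

An item with difficulty near .625 (for 4 options) and a discriminability index above .3 would count as a good item by the rules of thumb above.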

Test Administration: the way you take a test is a big deal / ETS Major Field Test: generalized around the world / Examiner and Subject: familiarity (kids need rapport), demeanour (diff tests need diff demeanours), race, language (ESL) / Expectancy effects: Rosenthal (expect a certain result, see it, especially when ambiguous) / Reinforcing responses: matters to children, keep reinforcement consistent; stereotype threat (girls & math) / Test manual: exact rules (test is reliable and valid only to the extent the manual is followed) / Computer test: easy to standardize / self-disclosure may differ based on test type (anonymity) / Subject variables: anxiety, illness, hormones, fatigue, disabilities / Behavioural assessment: raters are not perfect; reactivity: raters rate better when they know their ratings are being checked; drift: ratings change over time; contrast effect: difficult to make an observation without comparison; expectancy: halo/horn effect

Chapter 1 Types of tests: achievement (previous learning), aptitude (potential), intelligence (problem solving) / controversy: racial, test bias, scrutiny of law / China (oral exam for work; Han dynasty – battery, Ming dynasty – 3 tests) -> British to French and German -> Darwin: individual diffs -> Galton: phys diffs -> Cattell: mental tests / Education: -> Herbart's model for education -> Weber's psych threshold -> Fechner's sense strength (log) / individual diffs and experim. psych. -> Binet-Simon (subnormal intellect, mental age) -> WW1: Army Alpha and Beta -> against tests: B-S required lang; Wechsler didn't, so better -> Woodworth (bad bc assumed items' validity) -> projective tests (Rorschach to TAT) -> MMPI (structured, no assumptions, factor analysis) -> more testing rather than therapy (led to less testing) -> new branches needed (+ testing)

Chapter 2 Scale types: nominal (no math; freq. distrib.), ordinal (magnitude; hard math), interval (no abs 0; diff math), ratio (all properties, incl. equal intervals and abs 0; any math) / percentile rank: n(below x) / total × 100% / percentile: specific score below which a defined percent falls (i.e. 75th) / McCall's T = 10Z + 50 (mean 50, SD 10; only standardizes, doesn't normalize) / stanine: scale from 1-9 (mean 5, SD 2) / norm-referenced: compare to standardization sample / tracking: stay at the same level relative to peers (if short will remain short; bad for education) / criterion-referenced: skills, tasks, or knowledge (no comparison)

Chapter 3 regression: best-fit line (use principle of least squares; y = bx + a) / correlation is standardized regression (slope from -1 to 1, a = 0; df = n - 2); use for criterion validity (if slope is 0 then use the norm) / other: Spearman's rho (for ranks), biserial: continuous vs. artificial dichotomy, point biserial: with a true dichotomy, phi: both dichotomous (1 true) / residual = observed – predicted (sum of all = 0) / SE of estimate: stdev of the residuals (predict best when small) / coeff. of determination = variation in Y scores as a function of X (coeff. of alienation = nonassociation) / shrinkage: amount of decrease when a regression from one pop is applied to another / cross validation: obtain SE of estimate for values predicted by equation A vs values observed / watch for: bidirectionality, restricted range, regression to the mean / multivariate anal: 3 or more variables / discriminant anal: find linear combo providing max discrimination between categories / factor anal: matrix with every variable (factor loading: correlation between items and factors)

Chapter 4 Test score: observed score = true score + error (error is dispersion around the true score; the mean of repeated observed scores estimates the true score) / domain sampling: reliability = var of observed scores on the short test / var of true scores on the long test / classical test theory requires the same test for each person (but ppl differ in abilities, so item response theory is better) / test reliability: test-retest (stable traits; carryover/practice effects; poor reliability can mean the trait changes) / parallel forms: diff items (only random error and diff items cause variation; difficult to construct) / internal consistency/split-half: how you split matters; all evaluate whether the items measure the same thing, use factor anal (Spearman-Brown: bad for unequal variances; Cronbach: for unequal variances; KR20: all split halves, dichotomous items (p = correct, q = incorrect), covariance; KR21: uses an approximation for p and q; coefficient alpha: not dichotomous & most general) / difference score reliability is always lower bc combined error (except when the correlation is 0; low reliability means gain scores can't be trusted) / interrater: kappa (-1 to 1, where >.75 is excellent, .4-.7 good) / sources of error: time sampling (test-retest), item sampling (parallel forms), internal consistency (split-half, KR20, alpha), observer (kappa)
/ Sm (standard error of measurement): 95% CI = score ± 1.96·Sm / good reliability: .7 to .8 (>.95 for clinical) / low reliability: increase the number of items and remove low-performing items (factor anal, best to be unidimensional; discrim anal; correct for attenuation)
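The Chapter 4 reliability formulas can be sketched as follows (a minimal sketch with hypothetical helper names; alpha is computed with population variances, and the items argument is assumed to be one score list per item):

```python
import math
from statistics import pvariance

def coefficient_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance).
    items[i][p] = person p's score on item i."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]  # total score per person
    return k / (k - 1) * (1 - sum(pvariance(it) for it in items) / pvariance(totals))

def spearman_brown(r: float, factor: float) -> float:
    """Projected reliability when test length is multiplied by `factor`."""
    return factor * r / (1 + (factor - 1) * r)

def sem_interval(score: float, sd: float, reliability: float):
    """95% CI around an observed score: score +/- 1.96 * Sm,
    where Sm = SD * sqrt(1 - reliability)."""
    sm = sd * math.sqrt(1 - reliability)
    return (score - 1.96 * sm, score + 1.96 * sm)

def correct_attenuation(r_xy, r_xx, r_yy):
    """Estimated true-score correlation: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Doubling a test with reliability .60 (increase items to raise reliability):
print(round(spearman_brown(0.6, 2), 2))  # 0.75
# IQ-style score 100, SD 15, reliability .84 -> Sm = 6:
print(tuple(round(x, 2) for x in sem_interval(100, 15, 0.84)))  # (88.24, 111.76)
# Observed r of .30 between two imperfect tests:
print(round(correct_attenuation(0.3, 0.64, 0.81), 2))  # 0.42
```

Note how adding items raises reliability (Spearman-Brown) and how correcting for attenuation recovers a larger estimated true-score correlation, both matching the notes above.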