URBS 4/513—Urban Program Evaluation


Notes on Evaluation Methodology Basics by E. Jane Davidson

I.                    What is Evaluation?

a.       Definitions

                                                               i.      …systematic determination of the quality or value of something

                                                              ii.      Merit—“intrinsic” value of something

                                                            iii.      Worth—value of something to an individual

b.      Fitting Approach to Purpose

                                                               i.      Accountability—should have independent evaluation

                                                             ii.      Organizational learning—should have stakeholder participation

c.       Key Evaluation Checklist (Scriven, 2003)

d.      Identifying the “evaluand” (the word is a hybrid of English & Latin—it means “that which is to be evaluated”)

                                                               i.      Describe as it really is (not as it is supposed to be)

                                                             ii.      Describe background and context (why did it come into existence in first place?)

e.       Advice for choosing first project

                                                               i.      Benefits people in some way

                                                             ii.      “live” program

                                                            iii.      No major political ramifications

                                                           iv.      Not current recipient or consumer

 

II.                 Purpose of Evaluation

a.       Preface of report

                                                               i.      Who asked for this evaluation and why

                                                              ii.      Main purposes of this evaluation

1.      overall quality or value

2.      areas of improvement

3.      both

                                                            iii.      big picture questions

1.      absolute merit/worth

2.      relative merit/worth

b.      Determining Overall Quality/Value

                                                               i.      Summative evaluation

1.      return on investment (ROI)

2.      go/no go decision

3.      allocation decision

4.      competitive analysis (benchmarking)

5.      marketing tool (documenting impact for external audiences)

                                                             ii.      Strategies

1.      product/service listing

2.       brand recognition

3.      selective testimonial

4.      documentation of actual results

5.      “bring it on!”

c.       Finding Areas for Improvement

                                                               i.      Formative evaluation

1.      help new evaluand find its feet

a.       if no errors or glitches, probably not pushing the envelope enough

2.      explore ways of improving “mature” product

a.       “boiled frog” phenomenon

b.      Pulse of customer needs

c.       Values negative results

d.      Big Picture Questions

                                                               i.      Absolute value/worth—“Grading”

1.      good enough to implement?

2.      dimension of merit (A-F)

                                                             ii.      Relative value/worth—“Ranking”

e.       Synthesis—2x2 table (summative/formative X grading/ranking)

                                                               i.      Summative grading—worth what it costs?

                                                             ii.      Summative ranking—most cost-effective?

                                                            iii.      Formative grading—how well does it match the real needs?

                                                           iv.      Formative ranking—how does it compare with results achieved elsewhere?

 

III.               Identifying Evaluative Criteria (“dimensions of merit”)

a.       Why not just use “goals”?

                                                               i.      Good start, but not sufficient

                                                              ii.      Problems:

1.      How do you handle overruns & shortfalls?

2.      How do you incorporate goal difficulty and goal importance?

3.      What about side effects?

4.      How do you synthesize mixed results?

5.      Were initial target levels reasonable?

6.      Can one ignore process (do ends justify the means)?

7.      Whose/which goals should you use?

                                                            iii.      Alternatives:

1.      Goals-free evaluation

2.      Needs-based evaluation

b.      Needs Assessment approach to evaluation criteria

                                                               i.      Key issue is impact on end users (“consumers” or “impactees”)

1.      data from needs assessment phase can provide baseline data for later impacts

                                                             ii.      Identifying “consumers”

1.      In general, people for whom something changes (or should change) as a result of a particular activity

2.      Sometimes, goal is to prevent change.

3.      Distinguish between “upstream stakeholders” and “downstream consumers”

                                                            iii.      Needs vs. wants

1.      Need is something without which unsatisfactory functioning occurs.

2.      Want is conscious desire without which dissatisfaction (but not necessarily unsatisfactory functioning) occurs.

3.      Needs are context-dependent (i.e., few needs are absolute)

4.      Types of needs

a.       Conscious vs. unconscious

b.      Met vs. unmet

c.       Performance needs vs. instrumental needs

                                                           iv.      Needs assessment method

1.      Identify & document performance needs (“Severity documentation”)

a.       Document extent (quantitative data)

b.      Investigate individuals in need (qualitative data)

c.       Consider additional performance needs

2.      Investigate the underlying causes of performance needs (“diagnostic phase”)

a.       Develop a “logic model” (cause-effect chain linking goals and inputs/actions to outputs and outcomes, showing how the desired results are to be produced)

b.      “If this action is taken, it will address this underlying need, which should solve this problem.”

c.       Makes assumptions specific, allows them to be challenged and tested

d.      Complete model should include all major needs identified

3.      Strategy for identifying performance needs (2x2 table)

a.       Met/Unmet needs X Conscious/Unconscious needs

                                                             v.      Identifying other relevant criteria

1.      Criteria of merit (from definitions/standard usage)

2.      Legal requirements

3.      Ethical requirements

4.      Fidelity to specifications

5.      Alignment with organizational strategy/goals

6.      Professional Standards

7.      Logical consistency

8.      Legislative

9.      Scientific/technical

10.  Market

11.  Expert Judgment

12.  Historical/Traditional/Cultural standards

 

IV.              Potential Sources of Evidence

a.       Multiple sources of evidence

                                                               i.      Whatever conclusions are drawn, they must be backed by solid evidence

                                                             ii.      Never draw a conclusion on a single piece of evidence

1.      Different types of data (qualitative, quantitative)

2.      Multiple sources of information (documentation, observation, different stakeholder groups)

                                                            iii.      Triangulation

                                                           iv.      Rolling design (open-ended continuous improvement—develop data collection in phases, so each phase can improve on the previous one)

b.      Process Evaluation

                                                               i.      Categories

1.      Content evaluation

2.      Implementation

3.      Other

                                                             ii.      Pare down initial list of criteria

c.       Outcome Evaluation

d.      Comparative Cost-Effectiveness

                                                               i.      Types of costs: money, time, effort, space, and opportunity

                                                             ii.      Timing of costs

                                                            iii.      Locus of costs

                                                           iv.      Assessment

1.      Absolute magnitude (e.g., excessive, high, reasonable, cheap)

2.      Relative magnitude (is there a more cost-effective alternative?). Find comparisons (see the sketch after this list):

a.       Rolls Royce

b.      Shoestring

c.       A Little More

d.      A Little Less
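
The short Python sketch below is a minimal illustration, not part of Davidson's notes, of the relative-magnitude comparison just listed: it computes a simple cost-effectiveness ratio (dollars per unit of outcome) for a program and several hypothetical alternatives in the "Rolls Royce / shoestring / a little more / a little less" pattern. All names and figures are invented.

# Minimal sketch (hypothetical data): cost per unit of outcome for each alternative.
alternatives = {
    # name: (total cost in dollars, outcome units achieved)
    "Current program": (120_000, 300),
    "Rolls Royce":     (400_000, 520),
    "Shoestring":      (40_000, 110),
    "A little more":   (150_000, 360),
    "A little less":   (95_000, 250),
}

def cost_per_outcome(cost, outcomes):
    # Cost-effectiveness ratio: dollars spent per unit of outcome achieved.
    return cost / outcomes

for name, (cost, outcomes) in sorted(alternatives.items(),
                                     key=lambda item: cost_per_outcome(*item[1])):
    print(f"{name:16} ${cost_per_outcome(cost, outcomes):,.0f} per outcome unit")

A single ratio ignores timing, locus, and non-monetary costs (time, effort, space, opportunity), so it supplements rather than replaces the cost analysis above.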

e.       Exportability

                                                               i.      Is this potentially valuable or a significant advance?

                                                              ii.      Even process can be an advance (e.g., GE’s “Six Sigma” method)

                                                            iii.      Skip this section in final report if nothing is found.

V.                 Causation

a.       Certainty about Causation

                                                               i.      Different decision-making contexts require different levels of certainty

                                                              ii.      Single-method research almost always provides weak evidence of causation

b.      Principles for Inferring Causation (see the sketch at the end of this subsection)

                                                               i.      Look for evidence for/against suspected cause

                                                             ii.      Look for evidence for/against important alternative causes

                                                            iii.      “Critical multiplism”—mix of methods with different strengths together provide sufficient evidence of cause

                                                           iv.      Strategies for inferring causation:

1.      Ask observers

a.       Ask those who experienced little change, some change, and substantial change

b.      Ask about participants’ experience (what happened as a result; did anything else affect you; did anything else happen as a result—to you or to others)

2.      Check whether content of evaluand matches the outcome (also look for counterexamples)

3.      Look for other telltale patterns that suggest one cause or another (“modus operandi”—for evaluands with distinctive patterns of effects)

4.      Check whether the timing of outcomes makes sense

a.       Did effect occur before evaluand was introduced?

b.      Is timing of appearance equally logical relative to other possible causes?

c.       Did outcomes further downstream occur out of sequence?

5.      Is “dose” logically related to “response”? (relationship might not be linear—“ceiling” and “overdose” effects)

6.      Compare to control group (fully randomized experimental design, or quasi-experimental design)

7.      Statistical control for external variables

8.      Identify & check underlying causal mechanisms

                                                             v.      Blend of strategies

1.      Identify potentially most threatening rival explanation and then choose types of evidence that will most quickly and cost-effectively confirm or dispel the rival explanation.

2.      Specifically hunt for evidence that would confirm the rival explanation

3.      Continue until no more reasonable alternatives remain
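
The sketch below, with invented data, illustrates two of the strategies above: comparing participants to a comparison group (strategy 6) and checking whether “dose” is logically related to “response” (strategy 5). It uses only Python’s standard statistics module (statistics.correlation requires Python 3.10+); none of the numbers come from the notes.

import statistics

# Hypothetical post-program outcome scores, invented for illustration only.
participants = [72, 68, 75, 80, 66, 78, 74]
comparison   = [65, 63, 70, 67, 61, 69, 66]

# Strategy 6: compare to a comparison/control group (a simple difference in means;
# a real design would also have to rule out selection and other rival explanations).
difference = statistics.mean(participants) - statistics.mean(comparison)
print(f"Mean difference vs. comparison group: {difference:.1f}")

# Strategy 5: is "dose" logically related to "response"?
dose = [2, 4, 5, 8, 1, 7, 6]          # e.g., sessions attended (hypothetical)
print(f"Dose-response correlation: {statistics.correlation(dose, participants):.2f}")
# The relationship need not be linear; watch for "ceiling" and "overdose" effects.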

c.       Types of Evidence Needed

d.      Blend of Evidence

VI.              Values

a.       Program evaluation presumes “arriving at explicitly evaluative conclusions.” This requires several tasks that are not part of descriptive research:

                                                               i.      Importance weighting

                                                             ii.      Merit determination

                                                            iii.      Synthesis

b.      “The” Controversy

                                                               i.      The “scientific paradigm” attempts to take a value-neutral position (see Guba & Lincoln, 1989)

1.      present the factual results, let the stakeholders make their own determinations, based on their several values

2.      Besides, there is no available methodology that can yield valid and defensible evaluative findings (since the stakeholders’ values can be expected to be different, competing, and possibly even conflicting).

                                                              ii.      The “constructivist/interpretivist” paradigm asserts that there is no such thing as a value-neutral position (“You can’t be neutral on a moving train,” as Howard Zinn puts it).

1.      The task of evaluation is to make sense of the various values that the multiple stakeholders bring to the process.

                                                            iii.      Both sides assume

1.      All evaluative claims (“claims of value”) are arrived at subjectively

2.      Stakeholders must be free to make up their own minds

c.       Subjectivity (“All evaluative claims are arrived at subjectively”)

                                                               i.      Inappropriate application of personal or cultural preferences/biases

                                                             ii.      Informed judgment (“expert” opinion)

1.      Needs triangulation to ensure that it is robust

                                                            iii.      Autobiography

                                                           iv.      Do not confuse “subjectivity” with “subjective/objective measures”

d.      Subjectivism (“All evaluative statements can be neither true nor false; they are simply statements about the feelings of the people making the statements”—see, for example, Wittgenstein’s Philosophical Investigations)

                                                               i.      In the real world (as opposed to the Academy), clients expect not only descriptive data but some defensible conclusions about the quality or value of what is being evaluated

                                                             ii.      Recall that the level of certainty must be appropriate to the decision-making context.  Careful identification and application of relevant values, while not indubitable, may provide sufficient evidence.

                                                            iii.      Not all sources of values are arbitrary or idiosyncratic.  It is possible to achieve substantial agreement about appropriate sources for values to be used in analysis.

1.      Usually, 100% certainty (indubitability) is not required

2.       Sources of values can be documented and justified

3.      Meta-evaluation (evaluating the evaluation) can be performed

VII.            Importance

a.       “Importance” refers to assigning labels to dimensions or components to indicate how heavily each should count in the overall evaluation.

b.      Determining importance

                                                               i.      Essential for being able to

1.      prioritize improvements

2.      identify whether strengths/weaknesses are major/minor

3.      assess overall performance when results are mixed

                                                             ii.      Three approaches

1.      dimensional:  multiple dimensions of merit.  Used when consumers experience an entity along multiple dimensions.

2.      component:  each part evaluated separately—often on several dimensions.  Common in evaluation of policies, programs, or interventions that are made of several distinct parts.

3.      holistic:  used when quality or value is experienced as an entire package.  Used in personnel, product, and service evaluation.

c.       Strategies (see the sketch after this list)

                                                               i.      Stakeholders “vote” on importance

                                                             ii.      Knowledge of key stakeholders (“setting the bar”)

                                                            iii.      Evidence from literature

                                                           iv.      Specialist judgment

                                                             v.      Evidence from needs and values assessment

1.      determine importance

2.      establish rubrics

                                                           vi.      Program theory and evidence of causal linkages (don’t strive for higher level of precision than what is really required)

                                                             vii.      Employ principles of “critical multiplism”
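
As a minimal sketch of the first strategy (stakeholders “vote” on importance), the Python fragment below tallies hypothetical importance labels per dimension and reports the most common one, keeping to a small number of coarse levels. Dimension names and votes are invented; in practice such votes would be triangulated with specialist judgment, the literature, and the needs assessment.

from collections import Counter

# Hypothetical importance "votes" from stakeholders, using coarse labels.
votes = {
    "Accessibility":   ["major", "major", "important", "major"],
    "Cost to clients": ["important", "important", "minor", "important"],
    "Staff turnover":  ["minor", "minor", "important", "minor"],
}

for dimension, labels in votes.items():
    label, count = Counter(labels).most_common(1)[0]   # most frequent label wins
    print(f"{dimension:16} -> {label} ({count} of {len(labels)} votes)")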

VIII.         Merit Determination

a.       Process of setting “standards”

b.      Determining merit

                                                               i.      Process

1.      Define what constitutes the range of quality

2.      Use definitions to convert empirical evidence into evaluative conclusions

                                                             ii.      Measurement issues

1.      Grading (single quantitative measure)

2.      Rating (qualitative or multiple measures)

                                                            iii.      Accuracy issues—“Just because we can measure something to 4 decimal places does not mean we can rate its quality or value to the same level of precision.”

c.       Rubrics (tool for providing evaluative description of what performance/quality “looks like” at two or more levels; see the sketch at the end of this subsection)

                                                               i.      Grading rubric (absolute value)

                                                             ii.      Ranking rubric (relative value)

1.      distinguish between statistical significance (deviation greater than chance) and practical significance (real impact)

2.      e.g., need useful comparison

a.       control group (in experimental or quasi-experimental design)

b.      benchmarking

                                                            iii.      Consult end-users in developing rubrics
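
The sketch below is a hypothetical grading rubric (absolute value) for a single made-up dimension, “reach of the target population,” showing how evaluative descriptions at several levels convert empirical evidence into a grade. The cutoffs and wording are illustrative only; the cutoff points are debatable and end-users should help develop the rubric.

# Hypothetical grading rubric for one dimension ("reach of the target population").
# The evaluative descriptions carry the meaning; the numeric cutoffs are debatable.
rubric = [
    (0.90, "Excellent", "Virtually all of the intended population is reached."),
    (0.70, "Good",      "Most of the intended population is reached."),
    (0.50, "Adequate",  "A bare majority is reached; notable gaps remain."),
    (0.00, "Poor",      "Much of the intended population is missed."),
]

def grade(proportion_reached):
    # Convert observed evidence (a proportion) into the rubric's evaluative label.
    for cutoff, label, description in rubric:
        if proportion_reached >= cutoff:
            return label, description
    return rubric[-1][1:]

label, description = grade(0.74)   # e.g., evidence shows 74% of the target reached
print(f"{label}: {description}")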

IX.              Synthesis Methodology

a.       Synthesis is the process of combining a set of ratings or performances on several components or dimensions into an overall rating.

                                                               i.      Opportunity costs—whenever resources are allocated to something, this is always at the expense of whatever else might have been done with the same resources.

                                                             ii.      Review of process so far:

1.      define main tasks

2.      determine importance weights (3-5 levels are sufficient)

3.      draw up rubrics for performance on each task

4.      synthesize tasks, weighted for importance and performance

b.      Synthesizing for “Grading” (see the sketch at the end of this subsection)

                                                               i.      Numerical weighting with “bars”—

1.      combine weighted average of performance ratings with a minimum acceptable level of performance for one or more tasks

2.      a “bar” is minimum level of performance on a specific dimension

3.      works well for simple cases, as long as bars are used to prevent inappropriate masking of poor performance

                                                             ii.      Nonnumerical weighting (with no “bars”)

1.      Remember to include guidelines for both positive & negative elements in the merit rubric

2.      Combine data sources by starting from strongest and then modifying based on next strongest, and so on.

                                                            iii.      Nonnumerical weighting with “hurdles”

1.      a “hard hurdle” is an overall passing requirement

2.      a “soft hurdle” is an overall requirement for entry into a higher rating category (it is a “nonfatal” criterion—it puts a limit on the maximum rating, but does not eliminate)

3.      use the median rather than the average rating

a.       ensures that extremes do not disproportionately affect results

b.      avoids assuming that rating categories are equal-interval scale

4.      Nothing magic about cutoffs used in rubric—might be room to debate exactly where those points should be
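
The sketch below illustrates, with invented dimensions, weights, scores, and cutoffs, two of the grading approaches above: a weighted average combined with “bars” (any dimension below its bar fails the evaluand) and a median rating with a “soft hurdle” that caps the maximum rating category rather than failing outright.

import statistics

# Hypothetical dimensions: importance weight, performance score (1-5), minimum "bar".
dimensions = {
    "Meets client needs": (5, 4, 3),
    "Staff competence":   (3, 5, 3),
    "Timeliness":         (2, 2, 3),   # below its bar in this example
}

def grade_with_bars(dims):
    # Weighted average of scores, except that scoring below any bar fails the evaluand.
    for name, (weight, score, bar) in dims.items():
        if score < bar:
            return None, f"fails: '{name}' is below its minimum acceptable level"
    total_weight = sum(weight for weight, _, _ in dims.values())
    average = sum(weight * score for weight, score, _ in dims.values()) / total_weight
    return average, "passes all bars"

result, note = grade_with_bars(dimensions)
print(note if result is None else f"Weighted average {result:.2f} ({note})")

# Median rating with a "soft hurdle": the hurdle caps the maximum rating category
# instead of failing the evaluand outright.
ratings = [4, 5, 2, 4, 3]             # ordinal ratings on the dimensions
overall = statistics.median(ratings)  # median resists extreme ratings
if min(ratings) < 3:                  # hypothetical soft hurdle
    overall = min(overall, 3)         # cap the overall grade at "adequate"
print("Overall rating (after soft hurdle):", overall)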

c.       Synthesizing for “Ranking”—comparison must be treated much more explicitly (see the sketch at the end of this subsection)

                                                               i.      NWS (“Numerical weight and sum”)

1.      Procedure:

a.       ascribe numerical importance weights and performance scores

b.      multiply weights by performance scores

c.       sum the products

2.      Adequate, provided

a.       Small number of criteria (otherwise minor criteria can “swamp” major ones)

b.      Mechanism for taking bars into account

c.       Defensible needs-based strategy for ascribing weights (since relative differences between criteria will be magnified by the multiplication process)

                                                             ii.      QWS (“Qualitative weights and sum”)

1.      Procedure

a.       Determine importance in terms of maximum possible value

b.      Set bars

c.       Create value determination rubrics

d.      Check equivalence of values across dimensions

                                                                                                                                       i.      validity depends on rough equivalence of value dimensions

                                                                                                                                     ii.      take diagonal pairs & consider trade-offs

                                                                                                                                    iii.      check maximum value settings (is it really better than nothing?)

e.       Rate value of actual performance on each dimension

f.        Tally the number of ratings at each level and look for a clear winner (in NWS, the numbers decide; in QWS, must think explicitly about tradeoffs)

g.       Refocus
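
The sketch below contrasts the two ranking procedures with invented options, weights, and ratings. The NWS part multiplies numerical importance weights by performance scores and sums them; the QWS part only tallies qualitative value ratings, since in QWS the evaluator, not a sum, reasons explicitly about trade-offs to find a clear winner.

from collections import Counter

# --- NWS ("numerical weight and sum"), hypothetical weights and scores ---
weights = {"Reach": 5, "Cost": 3, "Ease of use": 2}       # importance weights
scores = {                                                # performance scores (1-5)
    "Option A": {"Reach": 4, "Cost": 2, "Ease of use": 5},
    "Option B": {"Reach": 3, "Cost": 4, "Ease of use": 3},
}
for option, performance in scores.items():
    total = sum(weights[d] * performance[d] for d in weights)   # multiply, then sum
    print(f"NWS  {option}: {total}")

# --- QWS ("qualitative weights and sum"), hypothetical value ratings ---
qws_ratings = {
    "Option A": ["high", "high", "medium", "low"],
    "Option B": ["high", "medium", "medium", "medium"],
}
for option, ratings in qws_ratings.items():
    print(f"QWS  {option}: {dict(Counter(ratings))}")
# In QWS the tallies do not decide by themselves: the evaluator checks bars and
# reasons explicitly about trade-offs before declaring a clear winner.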

X.                 Putting It All Together (writing the report)

a.       Preliminary

                                                               i.      Executive summary

1.      very short description

2.      bottom line

3.      graphic profile of performance (7±2 dimensions)

4.      strengths & weaknesses (bullet points)

                                                             ii.      Preface

1.      who asked for this

2.      main big picture question

3.      formative or summative?

4.      main audiences

                                                            iii.      Methodology (make clear why you made choices)

b.      Foundation

                                                               i.      Background & Context

1.      basic rationale for program

2.      context that constrains or enables performance

                                                             ii.      Descriptions and Definitions

                                                            iii.      Consumers (make sure those listed also appear in Outcomes evaluation)

                                                           iv.      Resources

                                                             v.      Values (again, justify choices)

c.       Evaluation Subdivisions (draw explicit conclusions)

                                                               i.      Process

                                                             ii.      Outcomes

                                                            iii.      Comparative Cost-Effectiveness

                                                           iv.      Exportability (if any)

d.      Conclusions

                                                               i.      Overall Significance

                                                             ii.      Recommendations & Explanations

                                                            iii.      Responsibilities

                                                           iv.      Report & Support (link back to preface)

                                                             v.      Meta-evaluation

XI.              Meta-Evaluation (determination of quality/value of an evaluation)

a.       Validity

                                                               i.      Are conclusions justified?

                                                              ii.      Were the right questions asked?

                                                            iii.      Were appropriate standards employed?

b.      Utility

                                                               i.      Findings relevant to audiences?

                                                             ii.      Findings available when needed (timeliness)?

                                                            iii.      Findings communicated clearly?

c.       Conduct

                                                               i.      Legal, ethical & professional standards

                                                             ii.      Cultural appropriateness

                                                            iii.      Unobtrusiveness

d.      Credibility

                                                               i.      Familiarity with context

                                                             ii.      Independence/impartiality (& lack of conflict of interest)

                                                            iii.      Expertise in evaluation and in subject matter

e.       Costs

                                                               i.      Money

                                                             ii.      Opportunity

                                                            iii.      Time

 


MSU

© 2005 A.J.Filipovitch
Revised 2 November 05