Designing Questions to Be Good Measures

Chapter 5 from Fowler, F. J., Jr. (1993). Designing questions to be good measures.
In surveys, answers are of interest not intrinsically but because of their relationship to something they are supposed to measure. Good questions are reliable, providing consistent measures in comparable situations, and valid: answers correspond to what they are intended to measure. This chapter discusses theory and practical approaches to designing questions to be reliable and valid measures.

It is always important to remember that designing a question for a survey instrument is designing a measure, not a conversational inquiry. In general, an answer given to a survey question is of no intrinsic interest. Rather, the answer is valuable to the extent that it can be shown to have a predictable relationship to facts or subjective states that are of interest. Good questionnaires maximise the relationship between the answers recorded and what the researcher is trying to measure.

In one sense, survey answers are simply responses evoked in an artificial situation contrived by the researcher. What does an answer tell us about some reality in which we have an interest? Let us look at a few specific kinds of answers and their meanings:
Although many surveys are analysed and interpreted as if the researcher "knows" what the answer means, that, in fact, is very risky. Studies designed to evaluate the correspondence between respondents' answers and "true values" show that many respondents answer many questions very well. However, there is also a considerable lack of correspondence. To assume perfect correspondence between the answers people give and some other reality is naive. When it is true, it is usually the result of careful design. In the following sections, we discuss many specific ways researchers can improve the correspondence between respondents' answers and the "true" state of affairs.

One goal of a good measure is to increase question reliability. When two respondents are in the same situation, they should answer the question in the same way. To the extent that there is inconsistency across respondents, random error is introduced and the measurement is less precise. The first part of this chapter deals with how to increase the reliability of questions.

There is also the issue of what a given answer "means" in relation to what a researcher is trying to measure: How well does the answer correspond? The latter two sections of this chapter are devoted to validity - the correspondence between answers and "true values" - and ways to improve that correspondence (compare Cronbach & Meehl, 1955).

Designing a reliable instrument

One step toward ensuring consistent measurement is that each respondent in a sample is asked the same set of questions. Answers to these questions are recorded. The researcher would like to be able to make the assumption that differences in answers stem from differences among respondents rather than from differences in the stimuli to which respondents are exposed.

A survey data collection is an interaction between a researcher and a respondent. In a self-administered survey, the researcher speaks directly to the respondent through a written questionnaire. In other surveys, an interviewer reads the researcher's words to the respondent. In either case, the questionnaire is the protocol for one side of the interaction.

In order to provide a consistent data collection experience for all respondents, a good questionnaire has the following properties:
Inadequate wording

The simplest example of inadequate question wording is when, somehow, the researcher's words do not constitute a complete question.

Incomplete wording
Interviewers (or respondents) will have to add words or change words in order to make an answerable question. If the goal is to have respondents all answering the same questions, then it is best if the researcher writes the questions fully.

Sometimes optional wording is required to fit differing respondent circumstances. However, that does not mean that the researcher has to give up writing the questions. A common convention is to put optional wording in parentheses. These words will be used by the interviewer when they are appropriate to the situation and omitted when they are not needed.

Examples of optional wording

5.3 Were you (or anyone living here with you) attacked or beaten up by a stranger during the past?
5.4 Did (he/she) report the attack to the police?
5.5 How old was (EACH PERSON) on (his/her) last birthday?

In 5.3, the parenthetical phrase would be omitted if the interviewer already knew that the respondent lived alone. However, if more than one person lived in the household, the interviewer would include it.

The parenthetical choice offered in 5.4 may seem minor. However, the parentheses alert the interviewer to the fact that a choice must be made; the proper pronoun is used, and the principle is maintained that the interviewer need read only the questions exactly as written in order to present a satisfactory stimulus.

A variation that accomplishes the same thing is illustrated in 5.5. A format such as that might be used if the same question were to be asked for each person in a household. Rather than repeat the identical words endlessly, a single question is written instructing the interviewer to substitute an appropriate designation (your husband/your son/your oldest daughter).

The above examples permit the interviewer to ask questions that make sense and take advantage of knowledge previously gained in the interview to tailor the questions to the respondent's individual circumstances.
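For questionnaires administered with computer assistance, the parenthetical convention can be sketched in code. The following is a minimal illustration, not a description of any particular interviewing system: the function names `ask_reported` and `ask_age` and the `PRONOUNS` table are hypothetical, and the question texts are taken from 5.4 and 5.5. The point it shows is that the fill is chosen from facts already recorded, so the interviewer still reads a fully scripted question.

```python
# Hypothetical lookup of the pronoun fills used in questions 5.4 and 5.5.
PRONOUNS = {
    "male": ("he", "his"),
    "female": ("she", "her"),
}


def ask_reported(sex: str) -> str:
    """Render question 5.4 with the (he/she) choice already resolved."""
    subject, _ = PRONOUNS[sex]
    return f"Did {subject} report the attack to the police?"


def ask_age(designation: str, sex: str) -> str:
    """Render question 5.5, substituting a designation for (EACH PERSON)."""
    _, possessive = PRONOUNS[sex]
    return f"How old was {designation} on {possessive} last birthday?"


if __name__ == "__main__":
    print(ask_reported("female"))
    # One scripted question reused for each (hypothetical) household member.
    for designation, sex in [("your husband", "male"), ("your oldest daughter", "female")]:
        print(ask_age(designation, sex))
```

Because every variant is generated from the same template, any two respondents in the same circumstances hear identical words, which is the standardization the convention is meant to protect.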
There is another kind of optional wording, seen occasionally in questionnaires, that is not acceptable.

Example of unacceptable optional wording

5.6 What do you like best about this neighbourhood? (We're interested in anything like houses, the people, the parks, or whatever.)

Presumably, the parenthetical probe was thought to be helpful to respondents who were having difficulty in answering the question. However, from a measurement point of view, it undermines the principle of standardized interviewing. If interviewers use the parenthetical probe when a respondent does not readily come up with an answer, then a subset of respondents will have answered a different question. Such optional probes usually are introduced when the researcher does not think the initial question is a very good one. The proper approach is to write a good question in the first place. Interviewers should never be given any options about what questions to read or how to read them except, as in the examples above, to make the questions fit the circumstances of a particular respondent in a standardized way.

The following is a different example of incomplete question wording. There are three errors embedded in the example.

Poor example of standardized wording

5.7 I would like you to rate different features of your neighbourhood as very good, good, fair, or poor. Please think carefully about each item as I read it.
(a) Public schools
(b) Parks and services
(c) Other

The first problem with 5.7 is the order of the main stem. The response alternatives are read prior to an instruction to think carefully about the questions. The respondent probably will forget the question. The interviewer likely will have to do some explaining or rewording. Second, the words the interviewer needs to ask about the second item on the list, "parks," are not provided in 5.7. A much better question would be the following:

Better example

5.7a I am going to ask you to rate different features of your neighbourhood. I want you to think carefully about your answers. How would you rate (FEATURE) - would you say very good, good, fair, or poor?

This gives the interviewer the wording needed for asking the first and all subsequent items on the list.

The third problem with the example is the alternative "other". What is the interviewer to say? It is not uncommon to see "other" on a list of questions in a form similar to the example. Although occasionally there may be a worthwhile question objective involved, most often the questionnaire will benefit from dropping the item.

The above examples illustrate questions that could not be presented consistently to all respondents due to incomplete wording. Another step needed to increase consistency is to create a set of questions that flows smoothly and easily. It can be shown that if questions have awkward or confusing wording, if there are words that are difficult to pronounce, or combinations of words that sound awkward together, interviewers will change the words to make the questions sound better or to make them easier to read. It may be possible to train and supervise interviewers to keep such changes to a minimum. However, good design of the questionnaire will raise the odds of a standardized interview.

Ensuring consistent meaning to all respondents

If all respondents are asked exactly the same questions, one step has been taken to ensure that differences in answers can be attributed to differences in respondents. However, there is a further consideration. The questions should all mean the same thing to all respondents. If two respondents understand the question to mean different things, their answers may be different for that reason alone.

One potential problem is using words that are not understood universally. In general samples, it is important to remember that a range of educational experiences and cultural backgrounds will be represented.
Even with well-educated samples, using simple words that are short and widely understood is a sound approach to questionnaire design.

Undoubtedly, a much more common error than using unfamiliar words is the use of terms or concepts that can have multiple meanings. It is impossible to give an exhaustive list of ambiguous terms used in surveys, but the prevalence of misunderstanding of common terms has been well documented by those who have studied the problem (e.g., Belson, 1981).

Poorly defined terms

5.8 How many times in the past year have you seen or talked with a doctor about your health?

Problem. There are two ambiguous terms or concepts in this question. First, there is a basis for uncertainty about what constitutes a doctor. Are only people practicing medicine with M.D. degrees included? If so, then psychiatrists are included, but psychologists, chiropractors, osteopaths, and podiatrists are not included. What about physicians' assistants or nurses who work directly for doctors in doctors' offices? If a person goes to a doctor's office for an inoculation that is given by a nurse, does it count? Second, what constitutes seeing or talking with a doctor? Do telephone consultations count? Do visits to a doctor's office when the doctor is not seen count?

Solutions. Often the best approach is to provide respondents and interviewers with the definitions they need.

5.8a We are going to ask about visits to doctors and getting medical advice from doctors. In this case we are interested in all professional personnel who have M.D. degrees or work directly for an M.D. in the office such as a nurse or medical assistant.

When the definition of what is wanted is extremely complicated and would take a very long time to define, as may be the case in this question, an additional constructive approach may be to ask supplementary questions about desired events that are particularly likely to be omitted. For example, visits to psychiatrists, visits for inoculations, and telephone consultations often are underreported and may warrant specific follow-up questions.

Poorly defined terms

5.9 Did you eat breakfast yesterday?

Problem. The difficulty is that the definition of breakfast varies widely. Some people consider coffee and a donut anytime before noon to be "breakfast". Others do not consider that they have had breakfast unless it includes a major entree, such as bacon and eggs, and is consumed before 8:00 a.m.

Solutions. There are two approaches to the solution. On the one hand, one might choose to define breakfast:

5.9a For our purposes, let us consider breakfast to be a meal eaten before 10:00 in the morning, which includes some protein such as eggs, meat or milk, some grain such as toast or cereal, and some fruit or vegetable. Using that definition, did you have breakfast yesterday?

While that often is a very good approach, in this case it is very complicated. Instead of trying to communicate a common definition to respondents, the researcher may simply ask people to report what they consumed before 10:00 a.m. At the coding stage, the "quality" of what was eaten can be evaluated consistently without requiring each respondent to share the same definition.

Poorly defined terms

5.10 Do you favour or oppose gun control legislation?

Problem. Gun control legislation can mean banning the legal sale of certain kinds of guns, asking people to register their guns, limiting the number or kinds of guns that people may possess, or limiting which people may possess them. Answers cannot be interpreted without assumptions about what respondents think the question means. Respondents will undoubtedly interpret the question differently.

5.10a One proposal for the control of guns is that no person who ever had been convicted of a violent crime would be allowed to purchase or own a pistol, rifle, or shotgun. Would you oppose or support legislation like that?
One could argue that it is only one of a variety of proposals for gun control. That is exactly the point. If one wants to ask multiple questions about different possible responses to a gun control problem, one should ask separate specific questions that can be understood commonly by all respondents and interpreted by researchers. One does not solve the problem of a complex issue by leaving it to the respondents to decide what questions they want to answer.

The worst way to handle a complex definitional problem is to give interviewers instructions about how to define terms if they are asked. Only respondents who ask will receive the definition; interviewers will not give consistently worded definitions if they are not written in the questionnaire. Thus the researcher will never know what question any particular respondent answered. If a complex term that may require definition must be used, interviewers should be required to read a common definition to all respondents.

The "Don't Know" Option

When respondents are being asked questions about their own lives, feelings, or experiences, a "don't know" response is often a statement that they are unwilling to do the work required to give an answer. On the other hand, sometimes we ask respondents questions about things about which they legitimately do not know. The further the object of the questions gets from their immediate lives, the more plausible and reasonable it is that some respondents will not have adequate knowledge on which to base an answer or will not have formed an opinion or feeling.

There are two approaches to dealing with such a possibility. One simply can ask the questions of all respondents, relying on the respondent to volunteer a "don't know." The alternative is to ask respondents whether or not they feel familiar enough with a topic to have an opinion or feeling about it.

When a researcher is dealing with a topic about which familiarity is high, whether or not a screening question for knowledge is asked is probably not important. However, when there is reason to think that a notable number of respondents will not be familiar with whatever the question is dealing with, it probably is best to ask a screening question about familiarity with the topic. People differ in their willingness to volunteer a "don't know". A screening question for familiarity helps to produce a kind of standardisation; most people answering the question then will have at least some minimal familiarity with what they are responding to (Schuman & Presser, 1981).

Specialised Wording for Special Subgroups

Researchers have wrestled with the fact that the vocabularies in different subgroups of the population are not the same. One could argue that standardised measurement actually would require different questions for different subgroups.

Designing different forms of questionnaires for different subgroups is almost never done. Rather, methodologists tend to work very hard to find wording for questions that has consistent meaning across an entire population. Even though there are situations where a question wording is more typical of the speech of one segment of a community than another (most often the better-educated segment), finding exactly comparable words for some other group of the population and then giving interviewers reliable rules for deciding when to ask which version is so difficult that it is likely to produce more unreliability than it reduces.

Standardized expectations for type of response

Thus far we have said it is important to give interviewers a good script so that they can read the questions exactly as worded, and it is important to design questions that mean the same thing to all respondents.
The other component of a good question that sometimes is overlooked is that respondents should have the same perception of what constitutes an adequate answer for the question.

The simplest way to give respondents the same perceptions of what constitutes an adequate answer is to provide them with a list of acceptable answers. Such questions are called closed questions. The respondent has to choose one, or sometimes more than one, of a set of alternatives provided by the researcher.

Closed questions are not suitable in all instances. The range of possible answers may be more extensive than it is reasonable to provide. The researcher may not feel that all reasonable answers can be anticipated. For such reasons, the researcher may prefer not to provide a list of alternatives to the respondent. However, that does not free the researcher from structuring the focus of the question and the kind of response wanted as carefully as possible.

5.11 Why did you vote for Candidate A?

Problems. Almost all "why" questions have problems. The reason is that one's sense of causality or frame of reference can influence what one talks about. In the particular instance above, the respondent may choose to talk about the strengths of Candidate A, the weaknesses of Candidate B, or the reasons the respondent uses certain criteria ("My mother was a lifelong Democrat"). Hence respondents who see things exactly the same way may answer differently.

Solution. Specify the focus of the answer:

5.11a What characteristics of Candidate A led you to vote for (him/her) over Candidate B?

Such a question explains to respondents that we want them to talk about Candidate A, the person for whom they voted. If all respondents answer with that same frame of reference, we then will be able to compare responses from different respondents in a direct fashion.

5.12 What are some of the things about this neighbourhood that you like best?

Problems. In response to a question like that, some people will only make one or two points, while others will make many. It is possible that such differences reflect important differences in respondent perceptions or feelings. However, research has shown pretty clearly that education is highly related to the number of answers people give to questions. Interviewers also affect the number of such answers.

Solution. Specify the number of points to be made:

5.12a What is the feature of this neighbourhood that you would single out as the one you like most?
5.12b Tell me the three things about this neighbourhood that you like most about living here.

Although that may not be a satisfactory solution for all questions, for many such questions it is an effective way of reducing unwanted variation in answers across respondents.

The basic point is that answers can vary because respondents have a different understanding of the kind of responses that are appropriate. Better specification of the properties of the answer desired can remove a needless source of unreliability in the measurement process.

Types of measures / types of questions

Introduction

The above procedures are designed to maximise reliability - the extent to which people in comparable situations will answer questions in similar ways. However, one can measure with perfect reliability and still not be measuring what one wants to measure. The extent to which the answer given is a true measure and means what the researcher wants it to mean or expects it to mean is called validity. In this section, we discuss other aspects of the design of questionnaires, in addition to steps to maximise the reliability of questions, that can increase the validity of survey measures.

For this discussion, it is necessary to differentiate questions designed to measure facts or objectively measurable events from questions designed to measure subjective states such as attitudes, opinions, and feelings.
Even though there are questions that fall in a murky area on the borders of these two categories, the idea of validity is somewhat different for subjective and objective measures for several reasons.

If it is possible to check the accuracy of an answer by some independent observation, then the measure of validity becomes the similarity of the survey report to the value of some "true" measure. In theory one could obtain an independent, accurate count of the number of times that an individual obtained medical services from a physician during a year. Although in practice it may be very difficult to obtain such an independent measure (e.g., records also contain errors), the understanding of validity can be consistent for objective situations.

In contrast, when people are asked about subjective states, feelings, attitudes, and opinions, there is no objective way of validating the answers. Only the person has access to his or her feelings and opinions. Thus the only way of assessing the "validity" of reports of subjective states is the way in which they correlate either with other answers that a person gives or with other facts about the person's life that one thinks should be related to what is being measured. For such measures, there is no truly independent direct measure possible; the meaning of answers must be inferred from patterns of association.

This fundamental difference in the meaning of validity requires separate discussions regarding ways of maximising validity.

Levels of Measurement

There are four different ways in which measurement is carried out in social sciences. This produces four different kinds of tasks for respondents and four different kinds of data for analysis:

Most often in surveys, when one is collecting factual data, respondents are asked to fit themselves or their experiences into a category, creating nominal data, or they are asked for a number, most often ratio data. "Are you employed?", "Are you married?", and "Do you have arthritis?" are examples of questions that provide nominal data. "How many times have you seen a doctor?", "How old are you?", and "What is your income?" are examples of questions to which respondents are asked to provide real numbers for ratio data.

When gathering factual data, respondents may be asked for ordinal answers. For example, they may be asked to report their incomes in relatively large categories or to describe their behavior in nonnumerical terms ("usually, occasionally, seldom, or never"). When respondents are asked to report factual events in ordinal terms, it is because great precision is not required by the researcher or because the task of reporting an exact number was considered too difficult; ordinal classification seemed a more realistic task for a respondent. However, there usually is a real numerical basis underlying an ordinal answer to a factual question.

The situation is somewhat different with respect to reports of subjective data. Although there have been efforts over the years, first in the work of the psychophysical psychologists (e.g., Thurstone, 1929), to have people assign numbers to subjective states that met the assumptions of interval and ratio data, for the most part respondents are asked to provide nominal and ordinal data about subjective states. The nominal question is, "Into which category do your feelings, opinions, or perceptions fall?" The ordinal question is, "Where along this continuum do your feelings, opinions, or perceptions fall?"

When designing a questionnaire, a basic task of the researcher is to decide what kind of measurement is desired.
When that decision is made, there are some clear implications for the form in which the question will be asked.

Types of Questions

Survey questions can be classified roughly into two groups: those for which a list of acceptable responses is provided to the respondent (closed questions) and those for which the acceptable responses are not provided exactly to the respondent (open questions).

When the goal is to put people in unordered categories (nominal data), the researcher has a choice about whether to ask an open or closed question. Virtually identical questions can be designed in either form.

Examples of open and closed forms

5.13 What health conditions do you have? (Open)
5.13a Which of the following conditions do you currently have? (READ LIST) (Closed)
5.14 What do you consider to be the most important problem facing our country today? (Open)
5.14a Here is a list of problems that many people in the country are concerned about. Which do you consider to be the most important problem facing our country today? (Closed)

There are advantages to open questions. They permit the researcher to obtain answers that were unanticipated. They also may describe more closely the real views of the respondent. Third, and this is not a trivial point, respondents like the opportunity to answer some questions in their own words. To answer only by choosing a provided response and never to have an opportunity to say what is on one's mind can be a frustrating experience. Finally, open questions are appropriate when the list of possible answers is longer than it is feasible to present to respondents.

Having said all this, closed questions are usually a more satisfactory way of creating data. There are three reasons for this:
Finally, if the researcher wants ordinal data, the categories must be provided to the respondent. One cannot order responses reliably along a single continuum unless a set of permissible ordered answers is specified in the question.

A bit more about the task that is given to respondents when they are asked to perform an ordinal task is appropriate, since it is probably the most prevalent kind of measurement in survey research. Figure 5.1 shows a continuum. In this case we are talking about having respondents make a rating of some sort, but the general approach applies to all ordinal questions. There is a dimension that is assumed by the researcher that goes from the most negative feelings possible to the most positive feelings possible. The way survey researchers get respondents into ordered categories is to put designations or labels on such a continuum. Respondents then are asked to consider the labels, consider their own feelings or opinions, and place themselves in the proper category.

There are two points worth making about the kinds of data that result from such questions. First, respondents will differ from one another in their understanding of what the labels or categories mean. However, the only assumption that is necessary in order to make meaningful analyses is that, on the average, the people who rate their feelings as "good" feel more positively than those who rate their feelings as "fair." To the extent that people differ in their understanding of and criteria for "good" and "fair," there is unreliability in the measurement, but the measurement still may have meaning (i.e., correlate with the underlying feeling state that the researcher wants to measure).

Second, an ordinal scale measurement like this is relative. The distribution of people choosing a particular label or category depends on the particular scale that is presented.

Consider the rating scale in Figure 5.1 again and consider two approaches to creating ordinal scales. In one case, the researcher used a three-point scale, "good, fair, or poor". In the second case, the researcher used five descriptive words, "excellent, very good, good, fair, and poor". When one compares the two scales, one can see that adding "excellent" and "very good" in all probability does not simply break up the "good" category into three pieces. Rather, it changes the whole sense of the scale. People respond to the ordinal position of categories as well as to the descriptors. "Fair" almost certainly is further to the negative side of the continuum when it is the fourth point on the scale than when it is the second. Thus one would expect considerably more people to give a rating of "good" or better with the five-point scale than with the three-point scale.

Such scales are meaningful if used as they are supposed to be used: to order people. However, by itself a statement that some percentage of the population feels something is "good or better" is not appropriate because it implies that the population is being described in some absolute sense. The percentage would change if the question were different. Only comparative statements (or statements about relationships) are justifiable when one is using ordinal measures:

(a) Comparing answers to the same question across groups, e.g., 20 percent more of those in Group A than in Group B rated the candidate as "good" or better; or
(b) Comparing answers from comparable samples over time, e.g., 10 percent more rate the candidate "good" or better in January than did so in November.

The same general comments apply to data obtained by having respondents order items. ("Consider the schools, police services, and trash collection. Which is the most important city service to you?") The percentage giving any item top ranking, or the average ranking of an item, is completely dependent on the particular list provided. Comparisons between distributions when the alternatives have been changed at all are not meaningful.
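The comparative use of ordinal ratings in (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `share_at_or_above` is hypothetical and the ratings for both groups are invented for the example, not survey results. It computes the share of each group rating "good" or better on the same five-point scale and reports only the difference between groups.

```python
# Five-point scale from the text, ordered from most negative to most positive.
SCALE = ["poor", "fair", "good", "very good", "excellent"]


def share_at_or_above(ratings, threshold, scale=SCALE):
    """Return the fraction of ratings at or above `threshold` on `scale`."""
    cutoff = scale.index(threshold)
    hits = sum(1 for rating in ratings if scale.index(rating) >= cutoff)
    return hits / len(ratings)


# Invented ratings for two groups that were asked the identical question.
group_a = ["excellent", "good", "good", "fair", "very good"]
group_b = ["fair", "poor", "good", "fair", "good"]

share_a = share_at_or_above(group_a, "good")
share_b = share_at_or_above(group_b, "good")

# Only the between-group difference is interpreted, not either share alone.
difference = share_a - share_b

if __name__ == "__main__":
    print(f"Group A exceeds Group B by {difference:.0%} rating 'good' or better")
```

Neither share is reported as a description of the population in an absolute sense; both would shift if a different set of labels were offered, while the comparison remains interpretable because both groups answered the same question.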
Agree-Disagree Items: A Special Case

Agree-disagree items are very prevalent in the survey research field and therefore deserve special attention. One can see that the task that respondents are given in such items is different from that of placing themselves in an ordered category. The usual approach is to read a statement to respondents and ask them if they agree or disagree with that statement. The statement is located somewhere on a continuum such as that portrayed in Figure 5.1. Respondents' locations on that continuum are calculated by figuring out whether they say their feelings are very close to that statement (by agreeing) or very far from where that statement is located (by disagreeing).

The use of agree-disagree questions to order respondents has two main potential limits. First, a statement, in order to be interpretable, must be located at the end of a continuum. For example, if a statement was to be rated that said "The schools are fair," presumably a point in the middle of a continuum, a respondent could disagree either because he rated the schools as "good" or because he rated them as "poor".

The second limitation is that it is very common for the statements used as stimuli for agree-disagree questions to have more than one dimension (i.e., to be double-barrelled), in which case the answer cannot be interpreted. The two statements below provide examples of double-barrelled statements.

5.15 In the next five years, this country will probably be strong and prosperous.

Problems. It obviously is possible for someone to have the view that the country will be strong but not prosperous or vice versa. Since prosperity and strength do not necessarily go together, a respondent may have trouble knowing what to do.

5.16 With economic conditions the way they are these days, it really isn't fair to have more than one or two children.

Problems.
If a person does not happen to think that economic conditions are terrible (which the question imposes as an assumption), or if a person does not believe that economic conditions of whatever kind have any implications for family size, but if that person happens to think one or two children is a good target for a family, it is not easy to answer the question. The problem then is knowing what the respondent agreed to, if he or she agreed.

Asking two or three questions at once and having embedded assumptions in questions are very common problems with the agree-disagree format. The agree-disagree format appears to be a rather simple way to construct questionnaires. In fact, to use this form to provide reliable, useful measures is not easy and requires a great deal of care and attention. In many cases, researchers would have more reliable and interpretable measures if they used a different question form.

Figure 5.1 Subjective Continuum Scale
Feeling about something: extremely positive to extremely negative
Two-category scale: good, not good
Three-category scale: good, fair, poor
Four-category scale: very good, good, fair, poor
Five-category scale: excellent, very good, good, fair, poor

Increasing the validity of factual reporting

When a researcher asks a factual question of a respondent, the goal is to have the respondent report with perfect accuracy, that is, give the same answer that the researcher would have given if the researcher had access to the information needed to answer the question. There is a rich methodological literature on the reporting of factual material. Reporting has been compared against records in a variety of areas, in particular the reporting of economic and health events (see Cannell et al., 1977a, for a good summary).

Respondents answer many questions accurately. For example, over 90 percent of overnight hospital stays within six months of an interview are reported (Cannell & Fowler, 1965). However, how well people report depends on both what they are being asked and how the question is asked. There are four basic reasons why respondents report events with less than perfect accuracy:
There are several steps that the researcher can take to combat each of these potential problems. Let us review these.

Lack of Knowledge

Since the main point of doing a survey is to get information from respondents that is not available in other ways, most surveys deal with questions to which respondents know the answers. The main reason that a researcher would get inaccurate reporting due to lack of knowledge is that he or she is asking one household member for information that another household member has. In health surveys, for example, it is common to use a household informant to report on visits to the doctor, hospitalisations, and illnesses for all household members. Economic and housing surveys often ask for a household respondent to report information for the household as a whole.

If the information exists in the household, but simply not with the person that the researcher wants to be the main respondent, the solutions are either to eliminate proxy reporting or to provide an opportunity for respondents to consult with other family members. For example, the National Crime Survey conducted by the Census Bureau obtains reports of household crimes from a single household informant, but in addition asks each household adult directly about personal crimes such as robbery. If the basic interview is to be carried out in person, costs for interviews with other members of the household can be reduced if self-administered forms are left to be filled out by absent household members, or if secondary interviews are done by telephone. A variation is to ask the main respondent to report the desired information as fully as possible for all household members, and then to mail the respondent a summary for verification, permitting consultation with other family members (see Cannell & Fowler, 1965). Finally, it sometimes is worth asking household members to designate the best informed person to answer the questions.
The housewife is not always the most knowledgeable about health, and the husband is not always the most knowledgeable about finances. People themselves often can do a better job of choosing the best respondent for a particular topic than can the researcher.

Recall

Studies of the reporting of known hospital stays clearly show the significance of memory in the reporting of events. As the time between the interview and a hospitalisation event increases, the probability of it being reported in an interview decreases. In a like way, short hospitalisations are less likely to be reported than long ones. Memory decays in predictable ways; minor and distant events are more difficult to conjure up in a quick question-and-answer interview. There are several ways to reduce the impact of memory decay on the reporting of factual events. Five possible methods are as follows:
It should be noted that a trade-off with both the reinterview and diary strategies is that it is more difficult to convince people to keep a diary or be interviewed several times than it is to get them to agree to a one-time interview. Hence the value of improved reporting has to be weighed against the possible biases resulting from sample attrition.

Social Desirability

There are certain facts or events that respondents would rather not report accurately in an interview. Conditions that have some degree of social undesirability, such as mental illness and venereal disease, are underreported significantly more than other conditions (Madow, 1963; Densen et al., 1963). Hospitalisations associated with conditions that are particularly threatening, either because of the possible stigmas that may be attached to them or because of their life-threatening nature, are reported at a lower rate than average (Cannell et al., 1977a). Aggregate estimates of alcohol consumption strongly suggest underreporting, although the reporting problems may be a combination of recall difficulties and respondents' concerns about social norms regarding drinking. Arrest and bankruptcy are other events that have been found to be underreported consistently, but which seem unlikely to have been forgotten (Locander et al., 1976).

There are probably limits to what people will report in a standard interview setting. If a researcher realistically expects someone to admit something that is very embarrassing or illegal, extraordinary efforts are needed to convince respondents that the risks are minimal and the reasons for taking a risk are substantial. The following are some of the steps that a researcher might particularly consider when sensitive questions are being asked (also see Sudman & Bradburn, 1982).
Increasing validity of subjective questions

As discussed above, the validity of subjective questions has a different meaning than the validity of objective questions. There is no external criterion. One can estimate the validity of a subjective measure only by the extent to which answers are associated in expected ways with the answers to other questions or with other characteristics of the individual to which the measure should be related (see Turner & Martin, 1984, for an extensive discussion of issues affecting the validity of subjective measures). There basically are only three steps to the improvement of validity of subjective measures:
Problematic scales

5.16 How would you rate your job - very rewarding, rewarding but stressful, not very rewarding but not stressful, or not rewarding at all?

5.17 How would you rate your job - very rewarding, somewhat rewarding, rewarding, or not rewarding at all?

Question 5.16 has two scaled properties - rewardingness and stress - that need not be related. Not all the alternatives are played out. Question 5.16 should be made into two questions if rewardingness and stress of jobs are both to be measured. In 5.17, some would see "rewarding" as more positive than "somewhat rewarding" and be confused about how the categories were ordered. Both of these problems are common and should be avoided.
The most important point to remember about the meaning of subjective measures is their relativity. Distributions can be compared only when the stimulus situation is the same. Small changes in wording, changing the number of alternatives offered, and even changing the position of a question in a questionnaire can make a major difference in how people answer. (See Turner & Martin, 1984; Schuman & Presser, 1981; and Sudman & Bradburn, 1982 for numerous examples of factors that affect response distributions.) The distribution of answers to a subjective question cannot be interpreted directly; it only has meaning when differences between samples exposed to the same questions are compared or when patterns of association among answers are studied.

Error in perspective

A defining property of social surveys is that answers to questions are used as measures. The extent to which those answers are good measures is obviously a critical dimension of the quality of survey estimates. Questions can be poor measures because they are unreliable (producing erratic results) or because they are biased, producing estimates that consistently err in one direction from the true value (as when drunk driving arrests are underreported).

We know quite a bit about how to make questions reliable. The principles outlined in this chapter to increase reliability are probably sound. Although other points might be added to the list, creating unambiguous questions that provide consistent measures across respondents is always a constructive step for good measurement.

The validity issue is more complex. In a sense, each variable to be measured requires research to identify the best set of questions to measure it and to produce estimates of how valid the resulting measure is. Many of the suggestions to improve reporting in this chapter emerged from a twenty-year program to evaluate and improve the measurement of health-related variables (Cannell et al., 1977a, 1977b).
There are many areas in which a great deal more work on validation is needed.

A third issue is the credibility of a question (or series) as a measure. It always is legitimate to ask researchers for their evidence about how well a question (or series) measures what it is supposed to. Too often, researchers make little effort to evaluate their measures; they assume, and ask their readers to assume, that answers mean what they "look like" they mean and measure what the researcher thinks they are supposed to measure. To rely on so-called "face validity" of questions is not acceptable practice. Researchers should build explicit efforts to assess the validity of their key measures into their analyses. As standard practice, patterns of association related to validity can be calculated and presented in an appendix.

Reducing measurement error through better question design is one of the least costly ways to improve survey estimates. For any survey, it is reasonable to attend to careful questionnaire design and pretesting (which are discussed in Chapter 6) and to make use of the existing research literature about how to measure what is to be measured. Also, building a literature over time in which the validity of measures has been evaluated and reported is much needed. Such evaluations are now the exception; they should become routine. |
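One way such a pattern of association might be calculated is sketched below. It computes a rank correlation between a subjective job rating and a second answer that should be related to it if the rating is valid. The variable names, the numeric codings, and the data are all invented for illustration, and rank correlation is only one common choice for ordinal answers, not a method prescribed in this chapter.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical codings: job rating from 4 (very rewarding) to 1 (not rewarding
# at all); job search from 1 (not looking) to 5 (actively looking).
job_rating = [4, 4, 3, 3, 3, 2, 2, 1]
job_search = [1, 2, 1, 2, 3, 3, 4, 5]

rho = spearman(job_rating, job_search)
print(f"Rank correlation between job rating and job search: {rho:.2f}")
```

With these invented data the correlation is negative: respondents who rate their jobs as more rewarding report less job searching. That direction is the expected one, so it would count as supporting evidence; a coefficient near zero or with the opposite sign would raise doubts about what the rating measures.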