Designing Questions to Be Good Measures

Chapter 5 from Fowler, F. J., Jr. (1993). Designing questions to be good measures. In Survey research methods (2nd ed., pp. 69-93). Newbury Park, CA: Sage.

In surveys, answers are of interest not intrinsically but because of their relationship to something they are supposed to measure. Good questions are reliable, providing consistent measures in comparable situations, and valid: answers correspond to what they are intended to measure.

This chapter discusses theory and practical approaches to designing questions to be reliable and valid measures.

It is always important to remember that designing a question for a survey instrument is designing a measure, not a conversational inquiry. In general, an answer given to a survey question is of no intrinsic interest. Rather the answer is valuable to the extent that it can be shown to have a predictable relationship to facts or subjective states that are of interest. Good questionnaires maximise the relationship between the answers recorded and what the researcher is trying to measure.

In one sense, survey answers are simply responses evoked in an artificial situation contrived by the researcher. What does an answer tell us about some reality in which we have an interest?

Let us look at a few specific kinds of answers and their meanings:

  1. A respondent tells us that he voted for Nixon rather than McGovern for president in 1972. The reality in which we are interested is which lever, if any, he pulled in the voting booth. The answer given in the survey may differ from what happened in the voting booth for any number of reasons. The respondent may have pulled the wrong lever and, therefore, not know for whom he voted. The respondent could have forgotten for whom he voted. The respondent could have purposely altered his answer for some reason. The interviewer accidentally could have checked the wrong box even after an "accurate" answer was given.
  2. A respondent tells us how many times he went to the doctor for medical care during the past year. Is that the same number that the researcher would have come up with had he followed the respondent around 24 hours a day for 365 days during the past year? Problems of recall, problems of definition of what constitutes a visit to a doctor, and problems of willingness to report accurately may affect the correspondence between the number the respondent gives and the count the researcher would have arrived at independently.
  3. When a respondent rates her public school system as "good" rather than "fair" or "poor," the researcher will want to interpret that answer as reflecting evaluations and perceptions of that school system. If the respondent rated only one school rather than the whole school system, or tilted the answer to please the interviewer, or understood the question differently from others, her answer may not reflect the feelings the researcher tried to measure.

Although many surveys are analysed and interpreted as if the researcher "knows" what the answer means, that, in fact, is very risky. Studies designed to evaluate the correspondence between respondents' answers and "true values" show that many respondents answer many questions very well. However, there is also a considerable lack of correspondence. To assume perfect correspondence between the answers people give and some other reality is naive. When it is true, it is usually the result of careful design. In the following sections, we discuss many specific ways researchers can improve the correspondence between respondents' answers and the "true" state of affairs.

One goal of a good measure is to increase question reliability. When two respondents are in the same situation, they should answer the question in the same way. To the extent that there is inconsistency across respondents, random error is introduced and the measurement is less precise. The first part of this chapter deals with how to increase the reliability of questions.

There is also the issue of what a given answer "means" in relation to what a researcher is trying to measure: How well does the answer correspond? The latter two sections of this chapter are devoted to validity - the correspondence between answers and "true values" - and ways to improve that correspondence (compare Cronbach & Meehl, 1955).

Designing a reliable instrument

One step toward ensuring consistent measurement is that each respondent in a sample is asked the same set of questions. Answers to these questions are recorded. The researcher would like to be able to make the assumption that differences in answers stem from differences among respondents rather than from differences in the stimuli to which respondents are exposed.

A survey data collection is an interaction between a researcher and a respondent. In a self-administered survey, the researcher speaks directly to the respondent through a written questionnaire. In other surveys, an interviewer reads the researcher's words to the respondent. In either case, the questionnaire is the protocol for one side of the interaction. In order to provide a consistent data collection experience for all respondents, a good questionnaire has the following properties:

  1. The researcher's side of the question and answer process is fully scripted so that the questions as written fully prepare a respondent to answer questions.
  2. The question means the same thing to every respondent.
  3. The kinds of answers that constitute an appropriate response to the question are communicated consistently to all respondents.

Inadequate wording

The simplest example of inadequate question wording is when, somehow, the researcher's words do not constitute a complete question.

Incomplete wording

5.1 Bad: Age?
    Better: What was your age on your last birthday?

5.2 Bad: Reason last saw doctor?
    Better: What was the medical problem or reason for which you most recently went to a doctor?

Interviewers (or respondents) will have to add words or change words in order to make an answerable question. If the goal is to have respondents all answering the same questions, then it is best if the researcher writes the questions fully.

Sometimes optional wording is required to fit differing respondent circumstances. However, that does not mean that the researcher has to give up writing the questions. A common convention is to put optional wording in parentheses. These words will be used by the interviewer when they are appropriate to the situation and omitted when they are not needed.

Examples of optional wording

5.3 Were you (or anyone living here with you) attacked or beaten up by a stranger during the past year?

5.4 Did (he/she) report the attack to the police?

5.5 How old was (EACH PERSON) on (his/her) last birthday?

In 5.3, the parenthetical phrase would be omitted if the interviewer already knew that the respondent lived alone. However, if more than one person lived in the household, the interviewer would include it.

The parenthetical choice offered in 5.4 may seem minor. However, the parentheses alert the interviewer to the fact that a choice must be made; the proper pronoun is used, and the principle is maintained that the interviewer need only read the questions exactly as written in order to present a satisfactory stimulus.

A variation that accomplishes the same thing is illustrated in 5.5. A format such as that might be used if the same question were to be asked for each person in a household. Rather than repeat the identical words endlessly, a single question is written instructing the interviewer to substitute an appropriate designation (your husband/your son/your oldest daughter).

The above examples permit the interviewer to ask questions that make sense and take advantage of knowledge previously gained in the interview to tailor the questions to the respondent's individual circumstances. There is another kind of optional wording, seen occasionally in questionnaires, that is not acceptable.

Example of unacceptable optional wording

5.6 What do you like best about this neighbourhood? (We're interested in anything like houses, the people, the parks, or whatever.)

Presumably, the parenthetical probe was thought to be helpful to respondents who were having difficulty in answering the question. However, from a measurement point of view, it undermines the principle of standardized interviewing. If interviewers use the parenthetical probe when a respondent does not readily come up with an answer, then a subset of respondents will have answered a different question. Such optional probes usually are introduced when the researcher does not think the initial question is a very good one. The proper approach is to write a good question in the first place. Interviewers should never be given any options about what questions to read or how to read them except, as in the examples above, to make the questions fit the circumstances of a particular respondent in a standardized way. The following is a different example of incomplete question wording. There are three errors embedded in the example.

Poor example of standardized wording

5.7 I would like you to rate different features of your neighbourhood as very good, good, fair, or poor. Please think carefully about each item as I read it.

(a) Public schools

(b) Parks and services

(c) Other

The first problem with 5.7 is the order of the main stem. The response alternatives are read prior to an instruction to think carefully about the questions. The respondent probably will forget the question. The interviewer likely will have to do some explaining or rewording. Second, the words the interviewer needs to ask about the second item on the list, "parks," are not provided in 5.7. A much better question would be the following:

Better example

5.7a I am going to ask you to rate different features of your neighbourhood. I want you to think carefully about your answers. How would you rate (FEATURE)-would you say very good, good, fair, or poor?

This gives the interviewer the wording needed for asking the first and all subsequent items on the list.

The third problem with the example is the alternative "other". What is the interviewer to say? It is not uncommon to see "other" on a list of questions in a form similar to the example. Although occasionally there may be a worthwhile question objective involved, most often the questionnaire will benefit from dropping the item.

The above examples illustrate questions that could not be presented consistently to all respondents due to incomplete wording. Another step needed to increase consistency is to create a set of questions that flows smoothly and easily. It can be shown that if questions have awkward or confusing wording, if there are words that are difficult to pronounce, or combinations of words that sound awkward together, interviewers will change the words to make the questions sound better or to make them easier to read. It may be possible to train and supervise interviewers to keep such changes to a minimum. However, good design of the questionnaire will raise the odds of a standardized interview.

Ensuring consistent meaning to all respondents

If all respondents are asked exactly the same questions, one step has been taken to ensure that differences in answers can be attributed to differences in respondents. However, there is a further consideration. The questions should all mean the same thing to all respondents. If two respondents understand the question to mean different things, their answers may be different for that reason alone.

One potential problem is using words that are not understood universally. In general samples, it is important to remember that a range of educational experiences and cultural backgrounds will be represented. Even with well-educated samples, using simple words that are short and widely understood is a sound approach to questionnaire design.

Undoubtedly, a much more common error than using unfamiliar words is the use of terms or concepts that can have multiple meanings. It is impossible to give an exhaustive list of ambiguous terms used in surveys, but the prevalence of misunderstanding of common terms has been well documented by those who have studied the problem (e.g., Belson, 1981).

Poorly defined terms

5.8 How many times in the past year have you seen or talked with a doctor about your health?

Problem. There are two ambiguous terms or concepts in this question. First, there is basis for uncertainty about what constitutes a doctor. Are only people practicing medicine with M.D. degrees included? If so, then psychiatrists are included, but psychologists, chiropractors, osteopaths, and podiatrists are not. What about physicians' assistants or nurses who work directly for doctors in doctors' offices? If a person goes to a doctor's office for an inoculation that is given by a nurse, does it count?

Second, what constitutes seeing or talking with a doctor? Do telephone consultations count? Do visits to a doctor's office when the doctor is not seen count?

Solutions. Often the best approach is to provide respondents and interviewers with the definitions they need.

5.8a We are going to ask about visits to doctors and getting medical advice from doctors. In this case we are interested in all professional personnel who have M.D. degrees or work directly for an M.D. in the office such as a nurse or medical assistant.

When the definition of what is wanted is extremely complicated and would take a very long time to define, as may be the case in this question, an additional constructive approach may be to ask supplementary questions about desired events that particularly are likely to be omitted. For example, visits to psychiatrists, visits for inoculations, and telephone consultations often are underreported and may warrant specific follow-up questions.

Poorly defined terms

5.9 Did you eat breakfast yesterday?

Problem. The difficulty is that the definition of breakfast varies widely. Some people consider coffee and a donut anytime before noon to be "breakfast". Others do not consider that they have had breakfast unless it includes a major entree, such as bacon and eggs, and is consumed before 8:00 A.M.

Solutions. There are two approaches to the solution. On the one hand, one might choose to define breakfast:

5.9a For our purposes, let us consider breakfast to be a meal eaten before 10:00 in the morning, which includes some protein such as eggs, meat or milk, some grain such as toast or cereal, and some fruit or vegetable. Using that definition, did you have breakfast yesterday?

While that often is a very good approach, in this case it is very complicated. Instead of trying to communicate a common definition to respondents, the researcher may simply ask people to report what they consumed before 10:00 a.m. At the coding stage, the "quality" of what was eaten can be evaluated consistently without requiring each respondent to share the same definition.
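As a rough illustration of that coding-stage approach, the sketch below applies one hypothetical coding rule (loosely based on the definition in 5.9a, with invented food lists) to respondents' free reports of what they ate before 10:00 a.m.; the point is that the researcher, not each respondent, applies the definition, and applies it the same way every time.

```python
# Hypothetical coding rule applied at the analysis stage: classify each
# respondent's free report of foods eaten before 10:00 a.m. against the
# researcher's own definition of "breakfast".

PROTEIN = {"eggs", "meat", "milk", "bacon", "yogurt"}
GRAIN = {"toast", "cereal", "oatmeal", "bagel"}
FRUIT_VEG = {"orange juice", "banana", "apple", "tomato"}

def code_breakfast(reported_foods):
    """Return True if the report meets the (illustrative) definition:
    at least one protein, one grain, and one fruit or vegetable."""
    foods = {f.lower() for f in reported_foods}
    return bool(foods & PROTEIN) and bool(foods & GRAIN) and bool(foods & FRUIT_VEG)

if __name__ == "__main__":
    print(code_breakfast(["coffee", "donut"]))                # False
    print(code_breakfast(["eggs", "toast", "orange juice"]))  # True
```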

Poorly defined terms

5.10 Do you favour or oppose gun control legislation?

Problem. Gun control legislation can mean banning the legal sale of certain kinds of guns, asking people to register their guns, limiting the number or kinds of guns that people may possess, or restricting which people may possess them. Answers cannot be interpreted without assumptions about what respondents think the question means. Respondents will undoubtedly interpret the question differently.

5.10a One proposal for the control of guns is that no person who ever had been convicted of a violent crime would be allowed to purchase or own a pistol, rifle, or shotgun. Would you oppose or support legislation like that?

One could argue that it is only one of a variety of proposals for gun control. That is exactly the point. If one wants to ask multiple questions about different possible responses to a gun control problem, one should ask separate specific questions that can be understood commonly by all respondents and interpreted by researchers. One does not solve the problem of a complex issue by leaving it to the respondents to decide what questions they want to answer.

The worst way to handle a complex definitional problem is to give interviewers instructions about how to define terms if they are asked. Only respondents who ask will receive the definition; interviewers will not give consistently worded definitions if they are not written in the questionnaire. Thus the researcher will never know what question any particular respondent answered. If a complex term that may require definition must be used, interviewers should be required to read a common definition to all respondents.

The "Don't Know" Option

When respondents are being asked questions about their own lives, feelings, or experiences, a "don't know" response is often a statement that they are unwilling to do the work required to give an answer. On the other hand, sometimes we ask respondents questions about things about which they legitimately do not know. As the object of the questions gets further from their immediate lives, the more plausible and reasonable it is that some respondents will not have adequate knowledge on which to base an answer or will not have formed an opinion or feeling.

There are two approaches to dealing with such a possibility. One simply can ask the questions of all respondents, relying on the respondent to volunteer a "don't know." The alternative is to ask respondents whether or not they feel familiar enough with a topic to have an opinion or feeling about it.

When a researcher is dealing with a topic about which familiarity is high, whether or not a screening question for knowledge is asked is probably not important. However, when there is reason to think that a notable number of respondents will not be familiar with whatever the question is dealing with, it probably is best to ask a screening question about familiarity with the topic. People differ in their willingness to volunteer a "don't know". A screening question for familiarity helps to produce a kind of standardisation; most people answering the question then will have at least some minimal familiarity with what they are responding to (Schuman & Presser, 1981).

Specialised Wording for Special Subgroups

Researchers have wrestled with the fact that the vocabularies in different subgroups of the population are not the same. One could argue that standardised measurement actually would require different questions for different subgroups.

Designing different forms of questionnaires for different subgroups is almost never done. Rather, methodologists tend to work very hard to attempt to find wording for questions that has consistent meaning across an entire population. Even though there are situations where a question wording is more typical of the speech of one segment of a community than another (most often the better-educated segment), finding exactly comparable words for some other group of the population and then giving interviewers reliable rules for deciding when to ask which version is so difficult that it is likely to produce more unreliability than it reduces.

Standardized expectations for type of response

Thus far we have said it is important to give interviewers a good script so that they can read the questions exactly as worded, and it is important to design questions that mean the same thing to all respondents. The other component of a good question that sometimes is overlooked is that respondents should have the same perception of what constitutes an adequate answer for the question.

The simplest way to give respondents the same perceptions of what constitutes an adequate answer is to provide them with a list of acceptable answers. Such questions are called closed questions. The respondent has to choose one, or sometimes more than one, of a set of alternatives provided by the researcher.

Closed questions are not suitable in all instances. The range of possible answers may be more extensive than it is reasonable to provide. The researcher may not feel that all reasonable answers can be anticipated. For such reasons, the researcher may prefer not to provide a list of alternatives to the respondent. However, that does not free the researcher from structuring the focus of the question and the kind of response wanted as carefully as possible.

5.11 Why did you vote for Candidate A?

Problems. Almost all "why" questions have problems. The reason is that one's sense of causality or frame of reference can influence what one talks about. In the particular instance above, the respondent may choose to talk about the strengths of Candidate A, the weaknesses of Candidate B, or the reasons the respondent uses certain criteria ("My mother was a lifelong Democrat"). Hence respondents who see things exactly the same way may answer differently.

Solution. Specify the focus of the answer:

5.11a What characteristics of Candidate A led you to vote for (him/her) over Candidate B?

Such a question explains to respondents that we want them to talk about Candidate A, the person for whom they voted. If all respondents answer with that same frame of reference, we then will be able to compare responses from different respondents in a direct fashion.

5.12 What are some of the things about this neighbourhood that you like best?

Problems. In response to a question like that, some people will only make one or two points, while others will make many. It is possible that such differences reflect important differences in respondent perceptions or feelings. However, research has shown pretty clearly that education is related highly to the number of answers people give to questions. Interviewers also affect the number of such answers.

Solution. Specify the number of points to be made:

5.12a What is the feature of this neighbourhood that you would single out as the one you like most?

5.12b Tell me the three things about this neighbourhood that you like most about living here.

Although that may not be a satisfactory solution for all questions, for many such questions it is an effective way of reducing unwanted variation in answers across respondents.

The basic point is that answers can vary because respondents have a different understanding of the kind of responses that are appropriate. Better specification of the properties of the answer desired can remove a needless source of unreliability in the measurement process.

Types of measures / types of questions

Introduction

The above procedures are designed to maximise reliability - the extent to which people in comparable situations will answer questions in similar ways. However, one can measure with perfect reliability and still not be measuring what one wants to measure. The extent to which the answer given is a true measure and means what the researcher wants it to mean or expects it to mean is called validity. In this section, we discuss other aspects of the design of questionnaires, in addition to steps to maximise the reliability of questions, that can increase the validity of survey measures.

For this discussion, it is necessary to differentiate questions designed to measure facts or objectively measurable events from questions designed to measure subjective states such as attitudes, opinions, and feelings. Even though there are questions that fall in a murky area on the borders of these two categories, the idea of validity is somewhat different for subjective and objective measures for several reasons. If it is possible to check the accuracy of an answer by some independent observation, then the measure of validity becomes the similarity of the survey report to the value of some "true" measure. In theory one could obtain an independent, accurate count of the number of times that an individual obtained medical services from a physician during a year. Although in practice it may be very difficult to obtain such an independent measure (e.g., records also contain errors), the understanding of validity can be consistent for objective situations.

In contrast, when people are asked about subjective states, feelings, attitudes, and opinions, there is no objective way of validating the answers. Only the person has access to his or her feelings and opinions. Thus the only way of assessing the "validity" of reports of subjective states is the way in which they correlate either with other answers that a person gives or with other facts about the person's life that one thinks should be related to what is being measured. For such measures, there is no truly independent direct measure possible; the meaning of answers must be inferred from patterns of association. This fundamental difference in the meaning of validity requires separate discussions regarding ways of maximising validity.

Levels of Measurement

There are four different ways in which measurement is carried out in social sciences. This produces four different kinds of tasks for respondents and four different kinds of data for analysis:

  1. Nominal - people or events are sorted into unordered categories. ("Are you male or female?")
  2. Ordinal - people or events are ordered or placed in ordered categories along a single dimension. ("How would you rate your health - very good, good, fair, or poor?")
  3. Interval data - numbers are attached that provide meaningful information about the distance between ordered stimuli or classes. (In fact, interval data are very rare. Fahrenheit temperature readings are among the few common examples.)
  4. Ratio data - numbers are assigned that have absolute meaning such as a count or measurement by an objective, physical scale such as distance, weight, or pressure. ("How old were you on your last birthday?")

Most often in surveys, when one is collecting factual data, respondents are asked to fit themselves or their experiences into a category, creating nominal data, or they are asked for a number, most often ratio data. "Are you employed?", "Are you married?", and "Do you have arthritis?" are examples of questions that provide nominal data. "How many times have you seen a doctor?", "How old are you?", and "What is your income?" are examples of questions to which respondents are asked to provide real numbers for ratio data.

When gathering factual data, respondents may be asked for ordinal answers. For example, they may be asked to report their incomes in relatively large categories or to describe their behavior in nonnumerical terms ("usually, occasionally, seldom, or never"). When respondents are asked to report factual events in ordinal terms, it is because great precision is not required by the researcher or because the task of reporting an exact number was considered too difficult; ordinal classification seemed a more realistic task for a respondent. However, there usually is a real numerical basis underlying an ordinal answer to a factual question.
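To make that numerical basis concrete, here is a minimal sketch (with purely illustrative, hypothetical bracket boundaries) of how an exact, ratio-level fact such as annual income can be collapsed into the kind of ordered categories a respondent might instead be asked to choose among; the researcher trades precision for an easier reporting task.

```python
# Illustrative only: cut points and labels are invented, not from the text.
INCOME_BRACKETS = [
    (15_000, "under $15,000"),
    (30_000, "$15,000 to $29,999"),
    (50_000, "$30,000 to $49,999"),
    (float("inf"), "$50,000 or more"),
]

def to_ordinal(income: float) -> str:
    """Return the ordered category underlying an exact (ratio-level) income."""
    for upper, label in INCOME_BRACKETS:
        if income < upper:
            return label
    return INCOME_BRACKETS[-1][1]

if __name__ == "__main__":
    for exact in (12_500, 31_000, 72_000):
        print(exact, "->", to_ordinal(exact))
```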

The situation is somewhat different with respect to reports of subjective data. Although there have been efforts over the years, first in the work of psychophysical psychologists (e.g., Thurstone, 1929), to have people assign numbers to subjective states that met the assumptions of interval and ratio data, for the most part respondents are asked to provide nominal and ordinal data about subjective states. The nominal question is, "Into which category do your feelings, opinions, or perceptions fall?" The ordinal question is "Where along this continuum do your feelings, opinions, or perceptions fall?"

When designing a questionnaire, a basic task of the researcher is to decide what kind of measurement is desired. When that decision is made, there are some clear implications for the form in which the question will be asked.

Types of Questions

Survey questions can be classified roughly into two groups: those for which a list of acceptable responses is provided to the respondent (closed questions) and those for which the acceptable responses are not provided exactly to the respondent (open questions).

When the goal is to put people in unordered categories (nominal data), the researcher has a choice about whether to ask an open or closed question. Virtually identical questions can be designed in either form.

Examples of open and closed forms

5.13 What health conditions do you have? (Open)

5.13a Which of the following conditions do you currently have? (READ LIST) (Closed)

5.14 What do you consider to be the most important problem facing our country today? (Open)

5.14a Here is a list of problems that many people in the country are concerned about. Which do you consider to be the most important problem facing our country today? (Closed)

There are advantages to open questions. They permit the researcher to obtain answers that were unanticipated. They also may describe more closely the real views of the respondent. Third, and this is not a trivial point, respondents like the opportunity to answer some questions in their own words. To answer only by choosing a provided response and never to have an opportunity to say what is on one's mind can be a frustrating experience. Finally, open questions are appropriate when the list of possible answers is longer than it is feasible to present to respondents.

Having said all this, closed questions are usually a more satisfactory way of creating data. There are three reasons for this:

  1. The respondent can perform more reliably the task of answering the question when response alternatives are given.
  2. The researcher can perform more reliably the task of interpreting the meaning of answers when the alternatives are given to the respondent (Schuman & Presser, 1981).
  3. When a completely open question is asked, many people give relatively rare answers that are not analytically useful. Providing respondents with a constrained number of categories increases the likelihood that there will be enough people in any given category to be analytically interesting.

Finally, if the researcher wants ordinal data, the categories must be provided to the respondent. One cannot order responses reliably along a single continuum unless a set of permissible ordered answers is specified in the question. A bit more about the task that is given to respondents when they are asked to perform an ordinal task is appropriate, since it is probably the most prevalent kind of measurement in survey research.

Figure 5.1 shows a continuum. In this case we are talking about having respondents make a rating of some sort, but the general approach applies to all ordinal questions. There is a dimension that is assumed by the researcher that goes from the most negative feelings possible to the most positive feelings possible. The way survey researchers get respondents into ordered categories is to put designations or labels on such a continuum. Respondents then are asked to consider the labels, consider their own feelings or opinions, and place themselves in the proper category.

There are two points worth making about the kinds of data that result from such questions. First, respondents will differ one from the other in their understanding of what the labels or categories mean. However, the only assumption that is necessary in order to make meaningful analyses is that, on the average, the people who rate their feelings as "good" feel more positively than those who rate their feelings as "fair." To the extent that people differ some in their understanding of and criteria for "good" and "fair," there is unreliability in the measurement, but the measurement still may have meaning (i.e., correlate with the underlying feeling state that the researcher wants to measure).

Second, an ordinal scale measurement like this is relative. The distribution of people choosing a particular label or category depends on the particular scale that is presented. Consider the rating scale in Figure 5.1 again and consider two approaches to creating ordinal scales. In one case, the researcher used a three-point scale, "good, fair, or poor". In the second case, the researcher used five descriptive words, "excellent, very good, good, fair, and poor". When one compares the two scales, one can see that adding "excellent" and "very good" in all probability does not simply break up the "good" category into three pieces. Rather it changes the whole sense of the scale. People respond to the ordinal position of categories as well as to the descriptors. "Fair" almost certainly is further to the negative side of the continuum when it is the fourth point on the scale than when it is the second. Thus one would expect considerably more people to give a rating of "good" or better with the five-point scale than with the three-point scale.

Such scales are meaningful if used as they are supposed to be used: to order people. However, by itself a statement that some percentage of the population feels something is "good or better" is not appropriate because it implies that the population is being described in some absolute sense. The percentage would change if the question were different. Only comparative statements (or statements about relationships) are justifiable when one is using ordinal measures:

(a) Comparing answers to the same question across groups; e.g., 20 percent more of those in Group A than in Group B rated the candidate as "good" or better; or

(b) Comparing answers from comparable samples over time; e.g., 10 percent more rated the candidate "good" or better in January than did so in November.

The same general comments apply to data obtained by having respondents order items. ("Consider the schools, police services, and trash collection. Which is the most important city service to you?") The percentage giving any item top ranking, or the average ranking of an item, is completely dependent on the particular list provided. Comparisons between distributions when the alternatives have been changed at all are not meaningful.

Agree-Disagree Items: A Special Case

Agree-disagree items are very prevalent in the survey research field and therefore deserve special attention. One can see that the task that respondents are given in such items is different from that of placing themselves in an ordered category. The usual approach is to read a statement to respondents and ask them if they agree or disagree with that statement. The statement is located somewhere on a continuum such as that portrayed in Figure 5.1. Respondents' locations on that continuum are calculated by figuring out whether they say they are very close to that statement (by agreeing) or whether they say their feelings are very far from where that statement is located (by disagreeing).

The use of agree-disagree questions to order respondents has two main potential limits.

First, a statement, in order to be interpretable, must be located at the end of a continuum. For example, if a statement to be rated said "The schools are fair," presumably a point in the middle of a continuum, a respondent could disagree either because he rated the schools as "good" or because he rated them as "poor". The second limitation is that it is very common for the statements used as stimuli for agree-disagree questions to have more than one dimension (i.e., to be double-barrelled), in which case the answer cannot be interpreted. The two statements below provide examples of double-barrelled statements.

5.15 In the next five years, this country will probably be strong and prosperous.

Problems. It obviously is possible for someone to have the view that the country will be strong but not prosperous or vice-versa. Since prosperity and strength do not go together necessarily, a respondent may have trouble knowing what to do.

5.16 With economic conditions the way they are these days, it really isn't fair to have more than one or two children.

Problems. If a person does not happen to think that economic conditions are terrible (which the question imposes as an assumption), or if that person does not believe that economic conditions of whatever kind have any implications for family size, but happens to think one or two children is a good target for a family, it is not easy to answer the question.

[Figure 5.1. Subjective Continuum Scale: a continuum of feeling about something, running from extremely positive to extremely negative, divided in turn into a two-category scale (good, not good), a three-category scale (good, fair, poor), a four-category scale (very good, good, fair, poor), and a five-category scale (excellent, very good, good, fair, poor).]

The problem then is knowing what the respondent agreed to, if he or she agreed. Asking two or three questions at once and having embedded assumptions in questions are very common problems with the agree-disagree format. The agree-disagree format appears to be a rather simple way to construct questionnaires. In fact, to use this form to provide reliable, useful measures is not easy and requires a great deal of care and attention. In many cases, researchers would have more reliable and interpretable measures if they used a different question form.

Increasing the validity of factual reporting

When a researcher asks a factual question of a respondent, the goal is to have the respondent report with perfect accuracy, that is, give the same answer that the researcher would have given if the researcher had access to the information needed to answer the question. There is a rich methodological literature on the reporting of factual material. Reporting has been compared against records in a variety of areas, in particular the reporting of economic and health events (see Cannell et al., 1977a, for a good summary).

Respondents answer many questions accurately. For example, over 90 percent of overnight hospital stays within six months of an interview are reported (Cannell & Fowler, 1965). However, how well people report depends on both what they are being asked and how the question is asked. There are four basic reasons why respondents report events with less than perfect accuracy:

  1. They do not know the information.
  2. They cannot recall it, although they do know it.
  3. They do not understand the question.
  4. They do not want to report the answer in the interview context.

There are several steps that the researcher can take to combat each of these potential problems. Let us review these.

Lack of Knowledge

Since the main point of doing a survey is to get information from respondents that is not available in other ways, most surveys deal with questions to which respondents know the answers. The main reason that a researcher would get inaccurate reporting due to lack of knowledge is that he or she is asking one household member for information that another household member has. In health surveys, for example, it is common to use a household informant to report on visits to the doctor, hospitalisations, and illnesses for all household members. Economic and housing surveys often ask for a household respondent to report information for the household as a whole.

If the information exists in the household, but simply not with the person the researcher wants to be the main respondent, the solutions are either to eliminate proxy reporting or to provide an opportunity for respondents to consult with other family members. For example, the National Crime Survey conducted by the Census Bureau obtains reports of household crimes from a single household informant, but in addition asks each household adult directly about personal crimes such as robbery. If the basic interview is to be carried out in person, costs for interviews with other members of the household can be reduced if self-administered forms are left to be filled out by absent household members or if secondary interviews are done by telephone. A variation is to ask the main respondent to report the desired information as fully as possible for all household members and then mail the respondent a summary for verification, permitting consultation with other family members (see Cannell & Fowler, 1965).

Finally, it sometimes is worth asking household members to designate the best informed person to answer the questions. The housewife is not always the most knowledgeable about health, and the husband is not always the most knowledgeable about finances. People themselves often can do a better job of choosing the best respondent for a particular topic than can the researcher.

Recall

Studies of the reporting of known hospital stays clearly show the significance of memory in the reporting of events. As the time between the interview and a hospitalisation event increases, the probability of it being reported in an interview decreases. In a like way, short hospitalisations are less likely to be reported than long ones. Memory decays in predictable ways; the minor and distant events are more difficult to conjure up in a quick question and answer interview.

There are several ways to reduce the impact of memory decay on the reporting of factual events. Five possible methods are as follows:

  1. Reduce the period of time about which respondents are asked to report. There is great value in having respondents report for as long a period of time as possible, because there is more information obtained in that way. However, the longer the reporting period, the less accurate the reporting (Cannell & Fowler, 1965).
  2. Memory is improved by asking more questions. By asking more than one question about events, more time will elapse for the respondent to think. In addition, questions can be designed that will stimulate associations, thereby helping the recall process. Thus the number of health conditions reported is increased by asking about visits to doctors, taking medications, and missing work (Madow, 1963).
  3. A second chance to think about the answers given also can stimulate memory. The technique suggested above of sending the respondent a summary of answers for verification has been shown to improve the recall process as well; even asking for the same information twice in the same interview can help recall.
  4. A reinterview procedure, interviewing the same respondent twice or even more times, is another good way to deal with problems of recall. One key problem with recalling events over time is setting them in the proper time period. An initial interview can serve as an anchor point for people's recall. The previous interview serves as a boundary in their minds. In addition, the researcher can check to make sure that events reported in Interview 1 were not repeated in Interview 2. A final advantage of the panel approach is that respondents are sensitised to the kinds of events that will be asked about, thereby further improving their recall.
  5. Carrying that last point a step further, one way that researchers have dealt with the reporting of minor events that are hard to remember is by asking people to keep a diary. Consumption patterns, minor deviations from good health, and patterns of expenditure are all difficult for people to recall in detail over time unless they are taking notes. Even respondents who do not keep their diaries up to date conscientiously report considerably better than they would have had they not been keeping a diary (Sudman & Bradburn, 1974).

It should be noted that a trade-off with both the reinterview and diary strategies is that it is more difficult to convince people to keep a diary or be interviewed several times than it is to get them to agree to a one-time interview. Hence the values of improved reporting have to be weighed against the possible biases resulting from sample attrition.

Social Desirability

There are certain facts or events that respondents would rather not report accurately in an interview. Conditions that have some degree of social undesirability, such as mental illness and venereal disease, are underreported significantly more than other conditions (Madow, 1963; Densen et al., 1963). Hospitalisations associated with conditions that are particularly threatening, either because of the possible stigmas that may be attached to them or due to their life-threatening nature, are reported at a lower rate than average (Cannell et al., 1977a). Aggregate estimates of alcohol consumption strongly suggest underreporting, although the reporting problems may be a combination of recall difficulties and respondents' concerns about social norms regarding drinking. Arrest and bankruptcy are other events that have been found to be underreported consistently, but which seem unlikely to have been forgotten (Locander et al., 1976).

There are probably limits to what people will report in a standard interview setting. If a researcher realistically expects someone to admit something that is very embarrassing or illegal, extraordinary efforts are needed to convince respondents that the risks are minimal and the reasons for taking a risk are substantial. The following are some of the steps that a researcher might particularly consider when sensitive questions are being asked (also see Sudman & Bradburn, 1982).

  1. Minimise a sense of judgment; maximise the importance of accuracy. Careful attention to the introduction and vocabulary that might imply that the researcher would value negatively certain answers is important. Researchers always have to be aware of the fact that respondents are having a conversation with the researcher. The questionnaire, plus the behavior of the interviewer if there is one, constitutes all the information the respondent has about the kind of interpretation the researcher will give to the answers. Therefore, the researcher needs to be very careful about the kind of cues the respondent is receiving and about the type of context in which respondents feel their answers will be interpreted.
  2. Use self-administered questions. Although the data are not conclusive, there is evidence that telephone interviews are more subject to social-desirability bias than personal interviews (e.g., Hendon et al., 1977; Mangione et al., 1982). There is also evidence that having respondents answer questions in a self-administered form rather than having an interviewer ask the questions may produce less social-desirability bias for some items (e.g., Hochstim, 1967). Such a consideration might lead one to think of a mail survey or group administration. A personal interview survey also can be combined usefully with self-administration: a respondent simply is given a set of questions to answer in a booklet as part of the personal interview experience.
  3. Confidentiality and anonymity. Almost all surveys promise respondents that answers will be treated confidentially and that no one outside the research staff will ever be able to associate individual respondents with their answers. Respondents usually are reassured of such facts by interviewers in the introduction and in advance letters, if there are any; these may be reinforced by signed commitments from the researchers. For surveys on particularly sensitive or personal subjects, special steps to ensure that respondents cannot be linked to their answers (such as the random response techniques described by Greenberg et al., 1969) may be used; a brief sketch of that idea appears after this list. Again it is important to emphasise that the limit of survey research is what people are willing to tell researchers under the conditions of data collection designed by the researcher. There are some questions that probably cannot be asked of probability samples without extraordinary efforts (e.g., Kinsey et al., 1948). However, some of the procedures discussed in this section, such as trying to create a neutral context for answers and emphasising the importance of accuracy and the neutrality of the data collection process, are probably worthwhile even for the most innocuous of questions. Any question, no matter how innocent it may seem, may embarrass somebody in the sample. It is best to design all phases of a survey instrument with a sensitivity to reducing the effects of social desirability and embarrassment on any answers people may give.
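The randomized response idea can be made concrete with a small simulation. The sketch below is a simplified "forced-yes" variant rather than necessarily the exact design of Greenberg et al. (1969): with a known probability the respondent answers the sensitive question truthfully, and otherwise is instructed to say "yes" regardless, so no single answer is revealing, yet the prevalence can be recovered from the observed proportion of "yes" answers. All numbers are invented for illustration.

```python
import random

def simulate_randomized_response(true_prevalence, n, p_real=0.7, seed=1):
    """Simulate a forced-'yes' randomized response design.

    With probability p_real a respondent answers the sensitive question
    truthfully; otherwise the randomizing device instructs them to say
    'yes' regardless of the truth.
    """
    rng = random.Random(seed)
    yes_count = 0
    for _ in range(n):
        has_trait = rng.random() < true_prevalence
        if rng.random() < p_real:      # device says: answer truthfully
            yes_count += has_trait
        else:                          # device says: just answer "yes"
            yes_count += 1
    observed_yes = yes_count / n
    # P(yes) = p_real * prevalence + (1 - p_real), so invert:
    estimated = (observed_yes - (1 - p_real)) / p_real
    return observed_yes, estimated

if __name__ == "__main__":
    observed, estimate = simulate_randomized_response(0.20, n=10_000)
    print(f"observed 'yes' rate: {observed:.3f}")
    print(f"estimated prevalence: {estimate:.3f}")   # close to 0.20
```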

Increasing validity of subjective questions

As discussed above, the validity of subjective questions has a different meaning than the validity of objective questions. There is no external criterion. One only can estimate the validity of a subjective measure by the extent to which answers are associated in expected ways with the answers to other questions or other characteristics of the individual to which it should be related (see Turner & Martin, 1984, for an extensive discussion of issues affecting the validity of subjective measures).

There basically are only three steps to the improvement of validity of subjective measures:

  1. Make the questions as reliable as possible. Review the sections on the reliability of questions, dealing with ambiguity of wording, standardized presentation, and vagueness in response form, and do everything possible to get questions that will mean the same thing to all respondents. To the extent that subjective measures are unreliable, their validity will be reduced. A special issue is the reliability of ordinal scales, which are dominant as measures of subjective states. The response alternatives offered must be unidimensional (deal with only one issue) and monotonic (presented in order, without inversion).

Problematic scales

5.16 How would you rate your job-very rewarding, rewarding but stressful, not very rewarding but not stressful, or not rewarding at all?

5.17 How would you rate your job-very rewarding, somewhat rewarding, rewarding, or not rewarding at all?

Question 5.16 has two scaled properties - rewardingness and stress - that need not be related, and not all of the possible combinations are offered as alternatives. Question 5.16 should be made into two questions if rewardingness and stress of jobs are both to be measured. In 5.17, some would see "rewarding" as more positive than "somewhat rewarding" and be confused about how the categories were ordered. Both of these problems are common and should be avoided.

  2. When putting people into ordered classes along a continuum, it probably is better to have more categories than fewer. There is a limit, however, in the precision of discrimination that respondents can exercise in giving ordered ratings. When the number of categories exceeds the respondents' ability to discriminate their feelings, numerous categories simply produce unreliable "noise." However, the validity of a measure will be increased to the extent that real variation among respondents is measured.
  3. Ask multiple questions, with different question forms, that measure the same subjective state; combine the answers into a scale (a brief illustration of such a scale follows this list). The answers to all questions potentially are influenced both by the subjective state to be measured and by specific features of the respondent or of the questions. Some respondents avoid extreme categories; some tend to agree more than disagree; others do just the opposite. Multiple questions help even out response idiosyncrasies and improve the validity of the measurement process (Cronbach, 1951).
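As one illustration of combining items into a scale, the sketch below sums several hypothetical rating items into a single score per respondent and computes Cronbach's alpha, the internal-consistency coefficient associated with Cronbach (1951); the item data are invented and the variable names are assumptions for this example.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a set of items; each item is a list of
    respondents' scores in the same respondent order."""
    k = len(item_scores)                      # number of items
    n = len(item_scores[0])                   # number of respondents
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    item_var = sum(pvariance(item) for item in item_scores)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var / total_var)

if __name__ == "__main__":
    # Hypothetical answers from six respondents to three 1-5 rating items
    # intended to tap the same subjective state (e.g., neighbourhood satisfaction).
    items = [
        [4, 5, 2, 3, 4, 1],
        [4, 4, 2, 3, 5, 2],
        [5, 4, 1, 3, 4, 2],
    ]
    scale_scores = [sum(col) for col in zip(*items)]  # summed scale per respondent
    print("scale scores:", scale_scores)
    print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
```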

The most important point to remember about the meaning of subjective measures is their relativity. Distributions can be compared only when the stimulus situation is the same. Small changes in wording, changing the number of alternatives offered, and even changing the position of a question in a questionnaire can make a major difference in how people answer. (See Turner & Martin, 1984; Schuman & Presser, 1981; and Sudman & Bradburn, 1982 for numerous examples of factors that affect response distributions.) The distribution of answers to a subjective question cannot be interpreted directly; it only has meaning when differences between samples exposed to the same questions are compared or when patterns of association among answers are studied.

Error in perspective

A defining property of social surveys is that answers to questions are used as measures. The extent to which those answers are good measures is obviously a critical dimension of the quality of survey estimates.

Questions can be poor measures because they are unreliable (producing erratic results) or because they are biased, producing estimates that consistently err in one direction from the true value (as when drunk driving arrests are underreported).

We know quite a bit about how to make questions reliable. The principles outlined in this chapter to increase reliability are probably sound. Although other points might be added to the list, creating unambiguous questions that provide consistent measures across respondents is always a constructive step for good measurement.

The validity issue is more complex. In a sense, each variable to be measured requires research to identify the best set of questions to measure it and to produce estimates of how valid the resulting measure is. Many of the suggestions to improve reporting in this chapter emerged from a twenty-year program to evaluate and improve the measurement of health-related variables (Cannell et al., 1977a, 1977b). There are many areas in which a great deal more work on validation is needed.

A third issue is the credibility of a question (or series) as a measure. It always is legitimate to ask researchers for their evidence about how well a question (or series) measures what it is supposed to. Too often, researchers make little effort to evaluate their measures; they assume, and ask their readers to assume, that answers mean what they "look like" they mean and measure what the researcher thinks they are supposed to measure. To rely on so-called "face validity" of questions is not acceptable practice.

Researchers should build explicit efforts to assess the validity of their key measures into their analyses. As standard practice, patterns of association related to validity can be calculated and presented in an appendix.

Reducing measurement error through better question design is one of the least costly ways to improve survey estimates. For any survey, it is reasonable to attend to careful questionnaire design and pretesting (which are discussed in Chapter 6) and making use of the existing research literature about how to measure what is to be measured. Also, building a literature over time in which the validity of measures has been evaluated and reported is much needed. Such evaluations are now the exception; they should become routine.