September 7, 2017

2017 AMERICAN COMMUNITY SURVEY RESEARCH AND EVALUATION REPORT MEMORANDUM SERIES # ACS17-RER-05

MEMORANDUM FOR: Victoria Velkoff, Chief, American Community Survey Office
From: David Waddington, Chief, Social, Economic, and Housing Statistics Division (SEHSD)
Prepared by: Brian McKenzie, Social, Economic, and Housing Statistics Division (SEHSD)
Subject: 2016 American Community Survey Content Test Evaluation Report: Journey to Work – Travel Mode of Commute and Time of Departure for Work

Attached is the final report for the 2016 American Community Survey (ACS) Content Test for Journey to Work. This report describes the results of the test for revised versions of the Mode of Commute and Time of Departure for Work questions. If you have any questions about this report, please contact Alison Fields at 301-763-2456 or Brian McKenzie at 301-763-6532.

Attachment

cc: Kathryn Cheza (ACSO), Jennifer Ortman (ACSO), David Raglin (ACSO), Patrick Cantwell (DSSD), Elizabeth Poehler (DSSD), Michael Risley (DSSD), Anthony Tersine (DSSD), Alison Fields (SEHSD), Nicole Scanniello (SEHSD)

American Community Survey Research and Evaluation Program
September 7, 2017

2016 American Community Survey Content Test Evaluation Report: Journey to Work – Travel Mode of Commute and Time of Departure for Work

FINAL REPORT

Brian McKenzie and Alison Fields, Social, Economic, and Housing Statistics Division
Michael Risley, Decennial Statistical Studies Division
R. Chase Sawyer, American Community Survey Office

Table of Contents

EXECUTIVE SUMMARY
1. BACKGROUND
1.1. Justification for Inclusion of Journey to Work in the Content Test
1.2. Question Development
1.3. Question Content
1.4. Research Questions
2. METHODOLOGY
2.1. Sample Design
2.2. Data Collection
2.3. Content Follow-Up
2.4. Analysis Metrics
2.4.1. Unit Response Rates and Demographic Profile of Responding Households
2.4.2. Item Missing Data Rates
2.4.3. Response Distributions
2.4.4. Benchmarks
2.4.5. Response Error
2.4.6. Other Analysis Methodology Specific to Commuting Questions
2.4.7. Standard Error Calculations
3. DECISION CRITERIA
4. LIMITATIONS
5. RESEARCH QUESTIONS AND RESULTS
5.1. Unit Response Rates and Demographic Profile of Responding Households
5.1.1. Unit Response Rates for the Original Content Test Interview
5.1.2. Unit Response Rates for the Content Follow-Up Interview
5.1.3. Demographic and Socioeconomic Profile of Responding Households
5.2. Item Missing Data Rates
5.3. Response Distributions
5.4. Benchmarks
5.5. Response Error
5.6. Results for Analysis Specific to Journey to Work
6. CONCLUSIONS AND RECOMMENDATIONS
7. ACKNOWLEDGEMENTS
8. REFERENCES
APPENDIX A. Supplemental Table for Unit Response Rates

List of Tables

Table 1. Interview and Reinterview Counts for Each Response Category Used for Calculating the Gross Difference Rate and Index of Inconsistency
Table 2. Decision Criteria for Commute Mode
Table 3. Decision Criteria for Time of Departure
Table 4. Original Interview Unit Response Rates for Control and Test Treatments, Overall and by Mode
Table 5. Mail Response Rates by Designated High (HRA) and Low (LRA) Response Areas
Table 6. Content Follow-Up Interview Unit Response Rates for Control and Test Treatments, Overall and by Mode of Original Interview
Table 7. Response Distributions – Test versus Control Treatment
Table 8. Comparison of Average Household Size
Table 9. Comparison of Language of Response
Table 10. Item Missing Data Rates for Control and Test Treatments – Commute Mode and Time of Departure
Table 11. Commute Mode: Chi-Square Statistic Comparing Control and Test Treatment
Table 12. Response Distribution for Control and Test Treatment for Commute Mode
Table 13. Proportion of Three Rail-Related Commute Mode Categories Combined
Table 14. Proportion of Commute Mode for Test and Control Treatments – Internet Response Mode
Table 15. Response Distribution for Control and Test Treatment for Time of Departure
Table 16. Time of Departure Distribution for Internet Response Mode
Table 17. Difference in Gross Difference Rates (GDR) between Test Percent and Control Percent – Commute Mode
Table 18. Index of Inconsistency between Control and Test Treatments – Commute Mode
Table 19. Persons Reporting a Difference of Five Minutes or Less for Time of Departure
Table A-1. Unit Response Rates by Designated High (HRA) and Low (LRA) Response Areas

List of Figures

Figure 1. Control and Test Versions of Commute Mode Question
Figure 2. Control and Test Versions of Time of Departure Question
Figure 3. Commute Mode Categories (control version) Used in Analysis
Figure 4. Selected Cities Included in Targeted Rail Metro Analyses

EXECUTIVE SUMMARY

Overview

From February to June of 2016, the U.S. Census Bureau conducted the 2016 American Community Survey (ACS) Content Test, a field test of new and revised content. The primary objective was to test whether changes to question wording, response categories, and definitions of underlying constructs improve the quality of data collected. Both new and revised versions of existing questions were tested to determine if they could provide data of sufficient quality compared to a control version, as measured by a series of metrics including item missing data rates, response distributions, comparisons with benchmarks, and response error. The results of this test will be used to help determine future ACS content and to assess the expected data quality of revised questions and new questions added to the ACS.

The 2016 ACS Content Test consisted of a nationally representative sample of 70,000 residential addresses in the United States, independent of the production ACS sample. The sample universe did not include group quarters, nor did it include housing units in Alaska, Hawaii, or Puerto Rico. The test was a split-panel experiment with one half of the addresses assigned to the control treatment and the other half assigned to the test treatment. As in production ACS, data collection consisted of three main operations: 1) a six-week mailout period, during which the majority of internet and mailback self-responses were received; 2) a one-month Computer-Assisted Telephone Interview period for nonresponse follow-up; and 3) a one-month Computer-Assisted Personal Interview period for a sample of the remaining nonresponse. For housing units that completed the original Content Test interview, a Content Follow-Up telephone reinterview was conducted to measure response error.
Journey to Work (Commuting)

This report discusses two journey to work, or commuting, questions that appear on the ACS: Commute Mode and Time of Departure for Work. For the commuting questions on the ACS, this iteration of content testing is an attempt to clarify the meaning of existing question wording and maximize response. Both questions provide crucial information for transportation planning. Improving accuracy and decreasing missing data rates are important to maintaining their utility for the transportation planning community.

The proposed changes to Commute Mode are motivated by changes in public transportation infrastructure across the United States, particularly the increased prevalence of light rail systems and the need to update and clarify the terminology used to refer to commute modes that already appear as categories on the ACS. For this test, commute mode categories were modified to reflect the nation's actual public transportation options and the language used to describe them. For example, we added "light rail" as a commute mode category. See Section 1.3 for a comparison of the test and control questions.

The question about Time of Departure has historically raised concerns about privacy. An alternative treatment of this question was tested with the objective of phrasing the question in a less intrusive way. While the control version of the question asks people when they leave home to go to work, the test version asks what time the person's trip to work began and does not include the word "home."

Research Questions and Results

Missing Data: Some research questions apply to both ACS commuting questions, while others are specific to one of the two. For both Commute Mode and Time of Departure, we tested whether the rate of missing data is lower for the control treatment than the test treatment. This test resulted in a failure to conclude that the control version had a statistically lower missing data rate than the test version for each of the commuting questions.

Response Distributions: We compared the response distributions of the test and control treatments of each commuting question based on categories used in published tables. There were no significant differences between the test and control treatments for the modified travel categories tested for Commute Mode. We also combined the rail-related categories into one category, finding no significant difference between the test and control treatments for this experimental combined category. The Commute Mode distribution was also examined for a small subset of metropolitan areas with high rates of transit usage. For this subsample of workers, as with the overall national sample, the distributions showed no significant difference between test and control versions. For Time of Departure, responses were grouped into time intervals corresponding to those in published tables for easier comparison. Results show no statistically significant difference in the Time of Departure distributions.

Response Reliability: We are interested in understanding how reliable a respondent's answer to each commuting question is, as measured by asking respondents the same question at two points in time. That is, we are interested in the respondent's likelihood of giving the same response in both the original interview and a follow-up interview. Of particular interest is whether reliability for the test treatment is higher than for the control treatment. There was insufficient evidence to conclude that response reliability in the test treatment was higher than in the control treatment for Commute Mode or Time of Departure.

Other Question-Specific Analyses: For Commute Mode, we compared the extent to which respondents incorrectly chose multiple commute modes on the paper questionnaire, rather than choosing the one for which they traveled the longest distance. There were no significant differences in incidences of multiple commute modes between test and control treatments. For Time of Departure, respondents often round their reported departure times to numbers ending in 0 or 5. We tested the rate at which this occurred for test and control treatments and found no significant difference between them.

Conclusions

The final wording in the test versions of the commuting questions is the product of consultation with industry experts and extensive cognitive testing. This new version of each ACS question is viewed as preferable to the current version. Among the various metrics used to answer our research questions, none revealed statistically different results between the test version and control version of each question. For both commuting questions tested, response distributions did not differ between test and control versions, which is consistent with expectations. This suggests continuity in the meaning and interpretation of the control and test versions of each question.

We recommend moving forward with the implementation of the new "test" version of each question, Commute Mode and Time of Departure. This recommendation is motivated by the goal of completeness in the existing commute mode categories, particularly addressing the current absence of light rail, and by the goal of reducing respondent burden. Cognitive testing also produced positive feedback for both of the commuting questions tested. Finally, consultation with industry experts representing the field of transportation planning and research resulted in unequivocal support for these category changes.

1. BACKGROUND

From February to June of 2016, the Census Bureau conducted the 2016 American Community Survey (ACS) Content Test, a field test of new and revised content. The primary objective was to test whether changes to question wording, response categories, and definitions of underlying constructs improve the quality of data collected. Both revised versions of existing questions and new questions were tested to determine if they could provide data of sufficient quality compared to a control version, as measured by a series of metrics including item missing data rates, response distributions, comparisons with benchmarks, and response error. The results of this test will be used to help determine future ACS content and to assess the expected data quality of revised questions and new questions added to the ACS.

The 2016 ACS Content Test included the following topics:

- Relationship
- Race and Hispanic Origin
- Telephone Service
- Computer and Internet Use
- Health Insurance Coverage
- Health Insurance Premium and Subsidy (new questions)
- Journey to Work: Commute Mode
- Journey to Work: Time of Departure for Work
- Number of Weeks Worked
- Class of Worker
- Retirement, Survivor, and Disability Income
- Industry and Occupation

This report discusses Journey to Work: Commute Mode and Time of Departure for Work. For brevity, these commuting questions are referred to as Commute Mode and Time of Departure in this report.

1.1. Justification for Inclusion of Journey to Work in the Content Test
Commute Mode

A question collecting details of a person's mode of transportation to work was first introduced in the 1960 Census. The question wording and the transportation modes have changed over time to accommodate evolving transportation options and travel behavior. The 1960 version of Commute Mode included response options for automobile, bus, subway, walked, worked at home, and other means. Since then, several categories have been added or modified. For example, "Bicycle" was added as a separate category in 1980. "Streetcar" first appeared in 1970, and "Streetcar or trolley car" was presented for the first time in 1990. The current version of the question has been used since 1990.

Light rail is a transit mode that exists in over 30 metropolitan areas in the United States (American Public Transportation Association, 2014), and the transportation planning community has argued that the current ACS rail-related questions should explicitly include light rail in the list of options. The Federal Transit Administration (FTA) funds most of the nation's fixed rail projects, but currently cannot directly measure commuting rates for light rail projects using ACS data in cities where light rail competes with other rail modes (such as subway). The addition of light rail to the categories will provide a crucial metric by which local and federal transit agencies can assess ridership of light rail systems as distinct from other modes.

Among national surveys that ask about means of transportation to work, no ongoing national survey explicitly includes light rail as a category and provides estimates for small areas. The American Housing Survey (AHS) asks how people get to work and school, including the type of public transportation used. Light rail is listed among the public transportation options, but the AHS is not an annual survey and does not provide estimates for areas smaller than large metropolitan areas. The National Household Travel Survey (NHTS) is a survey conducted by the U.S. Department of Transportation that also collects a person's mode of transportation to work. The last survey was conducted in 2009 and offered 24 distinct categories of transportation modes for the journey to work, plus an "other" category (Federal Highway Administration, U.S. Department of Transportation, 2009). The rail-related categories available on that survey also did not include light rail; those categories were Amtrak/intercity, commuter train, subway/elevated, and streetcar/trolley.

In addition to adding light rail to one of the existing commute mode categories, five of the existing twelve categories were modified in order to more closely reflect today's commute modes and how people refer to them. Details of question development and content are discussed in subsequent sections.

Time of Departure

A question collecting details of a person's time of departure from home to go to work was first introduced in the 1990 Census. The question has not changed since then. The ACS Content Review conducted in 2014 reported Time of Departure from Home to be one of the top questions that caused respondents discomfort and reluctance to answer, according to interviewers (Chappell & Obenski, 2014). However, the question is crucial for transportation planning efforts. The initial content testing goal for this question was to shift the focus of the question away from when a person leaves their home toward when they arrive at work. This new version was expected to alleviate privacy concerns for some respondents while still providing transportation planners with essential information about when commuters are on the road.

There are surveys other than the ACS that ask about time of departure from home for work, but none of these surveys are nationally representative or occur regularly. For example, the NHTS, which was last conducted in 2009, asked how many minutes it took to get from home to work and what time a person usually arrived at work (Federal Highway Administration, U.S. Department of Transportation, 2009). The NHTS is only conducted about every seven years and does not provide estimates at geographic levels smaller than the metro area. The Survey of Income and Program Participation (SIPP), a national survey, also collects such information, but has a much smaller sample size and does not provide estimates at small geographies.

1.2. Question Development

Initial versions of the new and revised questions were proposed by federal agencies participating in the U.S. Office of Management and Budget (OMB) Interagency Committee for the ACS. The initial proposals contained a justification for each change and described previous testing of the question wording, the expected impact of revisions to the time series and the single-year as well as five-year estimates, and the estimated net impact on respondent burden for the proposed revision.1 For proposed new questions, the justification also described the need for the new data; whether federal law or regulation required the data for small areas or small population groups; whether other data sources were currently available to provide the information (and why any alternate sources were insufficient); how policy needs or emerging data needs would be addressed through the new question; an explanation of why the data were needed with the geographic precision and frequency provided by the ACS; and whether other testing or production surveys had evaluated the use of the proposed questions.

The Census Bureau and the OMB, as well as the Interagency Council on Statistical Policy Subcommittee, reviewed these proposals for the ACS. The OMB determined which proposals moved forward into cognitive testing. After OMB approval of the proposals, topical subcommittees were formed from the OMB Interagency Committee for the ACS, which included all interested federal agencies that use the data from the impacted questions. These subcommittees further refined the specific proposed wording that was cognitively tested.

The Census Bureau contracted with Westat to conduct three rounds of cognitive testing. The results of the first two rounds of cognitive testing informed decisions on specific revisions to the proposed content for the stateside Content Test (Stapleton and Steiger, 2015). In the first round, 208 cognitive interviews were conducted in English and Spanish and in two modes (self-administered on paper and interviewer-administered on paper). In the second round of testing, 120 cognitive interviews were conducted for one version of each of the tested questions, in English and Spanish, using the same modes as in the first round. A third round of cognitive testing involved only the Puerto Rico Community Survey (PRCS) and Group Quarters (GQ) versions of the questionnaire (Steiger, Anderson, Folz, Leonard, & Stapleton, 2015). Cognitive interviews in Puerto Rico were conducted in Spanish; GQ cognitive interviews were conducted in English.
The third round of cognitive testing was carried out to assess the revised versions of the questions in Spanish and to identify any issues with questionnaire wording unique to Puerto Rico and GQ populations.2 The proposed changes identified through cognitive testing for each question topic were reviewed by the Census Bureau, the corresponding topical subcommittee, and the Interagency Council on Statistical Policy Subcommittee for the ACS. The OMB then provided final overall approval of the proposed wording for field testing.3

Footnote 1. The ACS produces both single-year and five-year estimates annually. Single-year estimates are produced for geographies with populations of 65,000 or more, and five-year estimates are produced for all areas down to the block-group level, with no population restriction.
Footnote 2. Note that the field testing of the content was not conducted in Puerto Rico or in GQs. See the Methodology section for more information.
Footnote 3. A cohabitation question and a domestic partnership question were included in cognitive testing, but ultimately we decided not to move forward with field testing these questions.

Commute Mode

The initial proposal for Commute Mode included a request to capture multiple commute modes, not just the one for which the respondent traveled the longest distance. For example, if a commuter traveled to work by driving their car to a train station, then taking a commuter train for the remainder of the trip, they would be able to select more than one commute mode. The proposal to measure multiple commuting modes was rejected because it was viewed as an additional question, rather than a modification to an existing question. While the information would be valuable, execution of such an effort would present considerable operational burdens and conceptual challenges. For example, the addition of walking as a second mode presents challenges related to what defines a walking segment of a trip (across a parking lot, two blocks to a transit stop, etc.). Such departures from the straightforward "longest distance" question format could present ambiguities for respondents similar to those currently suspected to affect the set of rail options.

The final question proposal included a single set of modified commute mode categories, all related to public transportation modes. The category "Streetcar or trolley car" was changed to "Light rail, streetcar, or trolley"; "Subway or elevated" was changed to "Subway or elevated rail"; and "Railroad" was changed to "Long-distance train or commuter rail." These three rail-related categories were also slightly reordered so that "Subway or elevated rail," the most prevalent rail mode, is listed first. Finally, for these three rail-related categories, the subcommittee discussed including the word "Rail" at the beginning of each (see below). After considerable discussion, this idea was rejected.

__ Rail: light rail, streetcar, or trolley
__ Rail: subway or elevated
__ Rail: commuter or long-distance railroad

The subcommittee discussed moving the "Worked at home" category to the beginning of the list so that workers who work at home could immediately skip to the next set of questions rather than read the entire list of commute modes. The subcommittee eventually decided against this because it only affects about 4 percent of workers.

The first round of cognitive testing resulted in three additional changes to the commute mode categories (Stapleton & Steiger, 2015). The phrase "trolley bus" was dropped from the test version, and the phrase "worked at home" was changed to "worked from home." The category "Commuter or long distance railroad" was changed to "Commuter rail or long distance train" to add clarity. The subject matter group unanimously agreed to make these changes.

The second round of testing resulted in a minor change to one category: "Commuter rail or long-distance train" was changed to "Long-distance train or commuter rail." The subheading of instructions was also modified:

From: How did this person usually get to work LAST WEEK? If this person usually used more than one method of transportation during the trip, mark (X) the box of the one used for most of the distance.

To: How did this person usually get to work LAST WEEK? Mark ONE box for the method of transportation used for most of the distance.

This change was made to simplify the instructions and remove any ambiguity associated with the (X). It is crucial that respondents choose only one commute mode because choosing more than one results in the case being allocated.

Time of Departure

ACS respondents are currently asked, "What time did this person leave home to go to work last week?" For the first round of testing, the subcommittee modified the question to focus on the end of the commuter's trip, when they arrived at work, rather than the beginning, which presumes an initial departure from home. Respondents involved in cognitive testing reported no difference in sensitivity between the test and control versions of this question. However, responses to the test question were less accurate and highly rounded compared with the control version. Respondents had difficulty estimating the time of arrival at work for other members of the household: respondents had a reasonable approximation of when other workers within the household departed for work, but were more likely to give highly rounded arrival times. The time of arrival question also resulted in confusion among respondents about the exact point at which they "arrive" at work. For example, they wondered whether the question referred to when they enter the premises or when they are situated at their workstation.

For the second round of testing, the subcommittee revisited the approach of retaining a focus on the beginning of the commuter's trip, but removed the word "home" to generalize the question. The subcommittee finally decided on asking, "Last week, what time did this person's trip to work usually begin?" This took the emphasis off the sensitive word "home," while still gathering information on the beginning of the work trip. Cognitive testing found that respondents answered this version of the question more accurately than the one that focused on time of arrival at work, especially for other members of the household. The "heaping" (clustering around numbers ending in 0 and 5) associated with this version was comparable to the control version.

1.3. Question Content

Figure 1. Control and Test Versions of Commute Mode Question (control and test questionnaire facsimiles not reproduced here)

Figure 2. Control and Test Versions of Time of Departure Question (control and test questionnaire facsimiles not reproduced here)

1.4. Research Questions

The following research questions were formulated to guide the analyses of the Commute Mode and Time of Departure for Work questions.
The analyses assessed how the test versions of the questions performed compared to the control versions in the following ways: how often the respondents answered the questions, the consistency and accuracy of the responses, and how the responses affected the resulting estimates.

Commute Mode

1. Is the missing data rate the same or lower for the test treatment than for the control treatment?
2. How do the test and control response distributions compare at the national level? This will be compared first using all 12 categories on the ACS questionnaire. These 12 categories are included in American FactFinder (AFF) Table B08301. The distributions from the ten categories in AFF Table B08006 and the six categories in AFF Table S0801 will then be shown.
3. How does the proportion of respondents marking one of the three rail categories compare between test and control versions?
4. Are the measures of response reliability (gross difference rate and index of inconsistency) better for the test treatment than for the control treatment?
5. For the paper questionnaire, is the proportion of person records for which respondents incorrectly marked multiple modes of transportation comparable between control and test versions? When multiple modes are marked, if the sample size is large enough, which combinations are most common in each version? Note that respondents are instructed to mark only one commute mode.
6. How do the test and control response distributions compare in metro areas with high levels of light rail usage?
7. How do the test and control response distributions compare when the sample is restricted to only metro areas with high levels of overall rail usage?

Time of Departure

8. Is the missing data rate the same or lower for the test treatment than for the control treatment?
9. Using the categories defined in AFF Table B08302, are the distributions comparable between the test and control questionnaires?
10. Are the measures of response reliability (gross difference rate and index of inconsistency) better for the test treatment than for the control treatment?
11. Is the proportion of respondents who leave home at a time that ends in 0 or 5 comparable between test and control versions? (An illustrative computation of this rounding check follows this list.)
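To make research question 11 concrete, the minimal Python sketch below flags departure times "heaped" on minutes ending in 0 or 5 and compares weighted heaping rates between treatments. The data layout and field names are hypothetical, not those of the actual Content Test processing system, and significance testing is omitted (see Sections 2.4.6 and 2.4.7 for the methods actually used).

```python
# Illustrative only: flags departure times "heaped" on minutes ending in 0 or 5
# and compares weighted heaping rates between treatments. Field names are
# hypothetical; the actual test used replicate-weight standard errors.
from dataclasses import dataclass

@dataclass
class Response:
    minute: int      # minute component of reported departure time, 0-59
    weight: float    # final person weight
    treatment: str   # "control" or "test"

def is_heaped(minute: int) -> bool:
    """A reported time is 'heaped' if its minute value ends in 0 or 5."""
    return minute % 5 == 0

def weighted_heaping_rate(responses, treatment):
    cases = [r for r in responses if r.treatment == treatment]
    total = sum(r.weight for r in cases)
    heaped = sum(r.weight for r in cases if is_heaped(r.minute))
    return 100.0 * heaped / total if total else float("nan")

sample = [Response(30, 1.2, "control"), Response(42, 0.8, "control"),
          Response(15, 1.0, "test"), Response(55, 1.1, "test")]
diff = weighted_heaping_rate(sample, "test") - weighted_heaping_rate(sample, "control")
print(f"Test minus control heaping rate: {diff:.1f} percentage points")
```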
2. METHODOLOGY

2.1. Sample Design

The 2016 ACS Content Test consisted of a nationally representative sample of 70,000 residential addresses in the United States, independent of the production ACS sample. The Content Test sample universe did not include GQs, nor did it include housing units in Alaska, Hawaii, or Puerto Rico.4 The sample design for the Content Test was largely based on the ACS production sample design, with some modifications to better meet the test objectives.5 The modifications included adding an additional level of stratification by stratifying addresses into high and low self-response areas, oversampling addresses from low self-response areas to ensure equal response from both strata, and sampling units as pairs.6 The high and low self-response strata were defined based on ACS self-response rates at the tract level. Sampled pairs were formed by first systematically sampling an address within the defined sampling stratum and then pairing that address with the address listed next in the geographically sorted list. Note that the two members of a pair were not necessarily neighboring addresses.

One member of the pair was randomly assigned to receive the control version of the question and the other member was assigned to receive the test version, resulting in a sample of 35,000 control cases and 35,000 test cases. As in the production ACS, if efforts to obtain a response by mail or telephone were unsuccessful, attempts were made to interview in person a sample of the remaining nonresponding addresses (see Section 2.2, Data Collection, for more details). Addresses were sampled at a rate of 1-in-3, with some exceptions that were sampled at a higher rate.7

For the Content Test, the development of workload estimates for the Computer-Assisted Telephone Interviews (CATI) and Computer-Assisted Personal Interviews (CAPI) did not take into account the oversampling of low response areas. This oversampling resulted in a higher than expected workload for CATI and CAPI and therefore required more budget than was allocated. To address this issue, the CAPI sampling rate for the Content Test was adjusted to meet the budget constraint.

Footnote 4. Alaska and Hawaii were excluded for cost reasons. GQs and Puerto Rico were excluded because the sample sizes required to produce reliable estimates would be overly large and burdensome, as well as costly.
Footnote 5. The ACS production sample design is described in Chapter 4 of the ACS Design and Methodology report (U.S. Census Bureau, 2014).
Footnote 6. Tracts with the highest response rates based on data from the 2013 and 2014 ACS were assigned to the high response stratum in such a way that 75 percent of the housing units in the population (based on 2010 Census estimates) were in the high response areas; all other tracts were designated in the low response stratum. Self-response rates were used as a proxy for overall cooperation. Oversampling in low response areas helps to mitigate larger variances due to CAPI subsampling. This stratification at the tract level was successfully used in previous ACS Content Tests, as well as the ACS Voluntary Test in 2003.
Footnote 7. The ACS production sample design for CAPI follow-up is described in Chapter 4, Section 4.4 of the ACS Design and Methodology report (U.S. Census Bureau, 2014).
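The paired split-panel assignment described above can be illustrated with a short sketch. This is a simplification under stated assumptions (a geographically sorted address list, a fixed systematic step, and a fixed seed); it is not the production sampling code, which applies stratification and oversampling as described in the footnotes.

```python
# Illustrative sketch of the split-panel pairing in Section 2.1: sample an
# address systematically from a geographically sorted list, pair it with the
# next-listed address, then randomly assign one member of each pair to the
# control treatment and the other to the test treatment.
import random

def sample_pairs(sorted_addresses, step, seed=2016):
    rng = random.Random(seed)
    start = rng.randrange(step)
    assignments = {}  # address -> "control" or "test"
    for i in range(start, len(sorted_addresses) - 1, step):
        pair = [sorted_addresses[i], sorted_addresses[i + 1]]
        rng.shuffle(pair)                 # random treatment order within the pair
        assignments[pair[0]] = "control"
        assignments[pair[1]] = "test"
    return assignments

addresses = [f"addr_{k:05d}" for k in range(1000)]  # geographically sorted list
panel = sample_pairs(addresses, step=50)
print(sum(v == "control" for v in panel.values()), "control cases;",
      sum(v == "test" for v in panel.values()), "test cases")
```

By construction, every pair contributes exactly one control case and one test case, which is what yields the balanced 35,000/35,000 split described above.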
2.2. Data Collection

The field test occurred in parallel with the data collection activities for the March 2016 ACS production panel, using the same basic data collection protocol as production ACS, with a few differences noted below. The data collection protocol consisted of three main operations: 1) a six-week mailout period, during which the majority of internet and mailback responses were received; 2) a one-month CATI period for nonresponse follow-up; and 3) a one-month CAPI period for a sample of the remaining nonresponse. Internet and mailback responses were accepted until three days after the end of the CAPI month.

As indicated earlier, housing units included in the Content Test sample were randomly assigned to a control or test version of the questions. CATI interviewers were not assigned specific cases; rather, they worked the next available case to be called and therefore conducted interviews for both control and test cases. CAPI interviewers were assigned Content Test cases based on their geographic proximity to the cases and therefore could also conduct both control and test cases.

The ACS Content Test's data collection protocol differed from the production ACS in a few significant ways. The Content Test analysis did not include data collected via the Telephone Questionnaire Assistance (TQA) program, since those who responded via TQA used the ACS production TQA instrument. The Content Test excluded the telephone Failed Edit Follow-Up (FEFU) operation.8 Furthermore, the Content Test had an additional telephone reinterview operation used to measure response reliability. We refer to this telephone reinterview component as the Content Follow-Up, or CFU. The CFU is described in more detail in Section 2.3.

ACS production provides Spanish-language versions of the internet, CATI, and CAPI instruments, and callers to the TQA number can request to respond in Spanish, Russian, Vietnamese, Korean, or Chinese. The Content Test had Spanish-language automated instruments; however, there were no paper versions of the Content Test questionnaires in Spanish.9 Any case in the Content Test sample that completed a Spanish-language internet, CATI, or CAPI response was included in analysis. However, if a case sampled for the Content Test called TQA to complete an interview in Spanish or any other language, the production interview was conducted and the response was excluded from the Content Test analysis. This was due to the low volume of non-English language cases and the operational complexity of translating and implementing several language instruments for the Content Test. CFU interviews for the Content Test were conducted in either Spanish or English. The practical need to limit the language response options for Content Test respondents is a limitation of the research, as some respondents self-selected out of the test.

Footnote 8. In ACS production, paper questionnaires with an indication that there are more than five people in the household, or with questions about the number of people in the household, and self-response returns that are identified as being vacant or a business or lacking minimal data are included in FEFU. FEFU interviewers call these households to obtain any information the respondent did not provide.
Footnote 9. In the 2014 ACS, respondents requested 1,238 Spanish paper questionnaires, of which 769 were mailed back. From that information, we projected that fewer than 25 Spanish questionnaires would be requested in the Content Test.

2.3. Content Follow-Up

For housing units that completed the original interview, a CFU telephone reinterview was also conducted to measure response error.10 A comparison of the original interview responses and the CFU reinterview responses was used to answer research questions about response error and response reliability.

A CFU reinterview was attempted with every household that completed an original interview for which there was a telephone number. A reinterview was conducted no sooner than two weeks (14 calendar days) after the original interview. Once the case was sent to CFU, it was to be completed within three weeks. This timing balanced two competing interests: (1) conducting the reinterview as soon as possible after the original interview to minimize changes in truth between the two interviews, and (2) not placing the two interviews so close together that respondents were simply recalling their previous answers. Interviewers made two call attempts to interview the household member who originally responded, but if that was not possible, the CFU reinterview was conducted with any other eligible household member (15 years or older).

The CFU asked basic demographic questions and a subset of housing and detailed person questions that included all of the topics being tested, with the exception of Telephone Service, plus any questions necessary for context and interview flow to set up the questions being tested.11 All CFU questions were asked in the reinterview, regardless of whether or not a particular question was answered in the original interview. Because the CFU interview was conducted via telephone, the wording of the questions in CFU followed the same format as the CATI nonresponse interviews. Housing units assigned to the control version of the questions in the original interview were asked the control version of the questions in CFU; housing units assigned to the test version of the questions in the original interview were asked the test version of the questions in CFU. The only exception was for retirement, survivor, and disability income, for which a different set of questions was asked in CFU.12

Footnote 10. Throughout this report, the "original interview" refers to responses completed via paper questionnaire, internet, CATI, or CAPI.
Footnote 11. Because the CFU interview was conducted via telephone, the Telephone Service question was not asked. We assume that CFU respondents have telephone service.
Footnote 12. Refer to the 2016 ACS Content Test report on Retirement Income for a discussion of the CFU questions for survivor, disability, and retirement income.

2.4. Analysis Metrics

This section describes the metrics used to assess the revised versions of the questions. The metrics include item missing data rates, response distributions, comparisons to benchmarks, response error, and other metrics. This section also describes the methodology used to calculate unit response rates and standard errors for the test.

All Content Test data were analyzed without imputation due to our interest in how question changes or differences between versions of new questions affected "raw" responses, not the final edited variables. Some editing of responses was done for analysis purposes, such as collapsing response categories or modes together or calculating a person's age based on his or her date of birth.

All estimates from the ACS Content Test were weighted. Analysis involving data from the original interviews used the final weights, which take into account the initial probability of selection (the base weight) and CAPI subsampling. For analysis involving data from the CFU interviews, the final weights were adjusted for CFU nonresponse to create CFU final weights.

The significance level for all hypothesis tests is α = 0.1. Since we are conducting numerous comparisons between the control and test treatments, there is a concern about incorrectly rejecting a hypothesis that is actually true (a "false positive" or Type I error). The overall Type I error rate is called the familywise error rate and is the probability of making one or more Type I errors among all hypotheses tested simultaneously. When adjusting for multiple comparisons, the Holm-Bonferroni method was used (Holm, 1979).
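As a concrete illustration of the Holm-Bonferroni step-down procedure cited above, the sketch below adjusts a family of p-values at α = 0.1. The p-values shown are invented for the example, not results from this test.

```python
# Illustrative Holm-Bonferroni step-down procedure (Holm, 1979): sort the
# family of p-values ascending, compare the k-th smallest (1-based) against
# alpha / (m - k + 1), and stop rejecting at the first failure. This controls
# the familywise Type I error rate discussed in Section 2.4.
def holm_bonferroni(p_values, alpha=0.1):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):                 # k is the 0-based rank
        if p_values[i] <= alpha / (m - k):        # alpha / (m - k + 1) in 1-based terms
            reject[i] = True
        else:
            break                                 # all larger p-values also fail
    return reject

# Hypothetical family of four simultaneous control-vs-test comparisons:
pvals = [0.012, 0.090, 0.030, 0.210]
print(holm_bonferroni(pvals))  # [True, False, True, False] at alpha = 0.1
```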
2.4.1. Unit Response Rates and Demographic Profile of Responding Households

The unit response rate is generally defined as the proportion of sample addresses eligible to respond that provided a complete or sufficient partial response.13 Unit response rates from the original interview are an important measure to consider for the analyses in this report that compare responses between the control and test versions of the survey questionnaire. High unit response rates are important in mitigating potential nonresponse bias.

For both control and test treatments, we calculated the overall unit response rate (all modes of data collection combined) and unit response rates by mode: internet, mail, CATI, and CAPI. We also calculated the total self-response rate by combining the internet and mail modes. Some Content Test analyses focused on the different data collection modes for topic-specific evaluations, so we felt it was important to include each mode in the response rates section. In addition to those rates, we calculated the response rates for high and low response areas because analysis for some Content Test topics was done by high and low response areas. Using the Census Bureau's Planning Database (U.S. Census Bureau, 2016), we defined these areas at the tract level based on the low response score.

The universe for the overall unit response rates consists of all addresses in the initial sample (70,000 addresses) that were eligible to respond to the survey. Some examples of addresses ineligible for the survey were a demolished home, a home under construction, a house or trailer that was relocated, or an address determined to be a permanent business or storage facility. The universe for the self-response (internet and mail) rates consists of all mailable addresses that were eligible to respond to the survey. The universe for the CATI response rate consists of all nonrespondents at the end of the mailout month from the initial survey sample that were eligible to respond to the survey and for whom we possessed a telephone number. The universe for the CAPI response rates consists of a subsample of all remaining nonrespondents (after CATI) from the initial sample that were eligible to respond to the survey. Any nonresponding addresses that were sampled out of CAPI were not included in any of the response rate calculations.

We also calculated the CFU interview unit response rate, overall and by mode of data collection of the original interview, and compared the control and test treatments, because the response error analysis (discussed in Section 2.4.5) relies upon CFU interview data. Statistical differences between CFU response rates for control and test treatments will not be taken as evidence that one version is better than the other. For the CFU response rates, the universe for each mode consists of housing units that responded to the original questionnaire in the given mode (internet, mail, CATI, or CAPI) and were eligible for the CFU interview. We expected the response rates to be similar between treatments; however, we calculated the rates to verify that assumption.

Another important measure to examine in comparing experimental treatments is the demographic profile of the responding households in each treatment. The Content Test sample was designed with the intention of having respondents in both control and test treatments exhibit similar distributions of socioeconomic and demographic characteristics. Similar distributions allow us to compare the treatments and conclude that any differences are due to the experimental treatment instead of underlying demographic differences. Thus, we analyzed distributions for data from the following response categories: age, sex, educational attainment, and tenure. The topics of race, Hispanic origin, and relationship are also typically used for demographic analysis; however, those questions were modified as part of the Content Test, so we could not include them in the demographic profile. Additionally, we calculated average household size and the language of response for the original interview.14

For response distributions, we used chi-square tests of independence to determine statistical differences between control and test treatments. If the distributions were significantly different, we performed additional testing on the differences for each response category. To control the overall Type I error rate for a set of hypotheses tested simultaneously, we performed multiple-comparison procedures with the Holm-Bonferroni method (Holm, 1979). A family for our response distribution analysis was the set of p-values for the overall characteristic categories (age, sex, educational attainment, and tenure) and the set of p-values for a characteristic's response categories if the response distributions were found to have statistically significant differences. To determine statistical differences for average household size and the language of response of the original interview, we performed two-tailed hypothesis tests.

For all response-related calculations mentioned in this section, addresses that were either sampled out of the CAPI data collection operation or deemed ineligible for the survey were not included in any of the universes for calculations. Unmailable addresses were also excluded from the self-response universe. For all unit response rate estimates, differences, and demographic response analysis, we used replicate base weights adjusted for CAPI sampling (but not adjusted for CFU nonresponse).

Footnote 13. A response is deemed a "sufficient partial" when the respondent gets to the first question in the detailed person questions section for the first person in the household.
Footnote 14. Language of response analysis excludes paper questionnaire returns because there was only an English questionnaire.

2.4.2. Item Missing Data Rates

Respondents leave items blank for a variety of reasons, including not understanding the question (clarity), unwillingness to answer a question as presented (sensitivity), and lack of knowledge of the data needed to answer the question. The item missing data rate (for a given item) is the proportion of eligible units (housing units for household-level items, or persons for person-level items) for which a required response (based on skip patterns) is missing.

Commute Mode

The percent of eligible persons who did not provide a response to this question in the control treatment is compared to the corresponding percent from the test treatment. Statistical significance between versions is determined using a one-tailed t-test. Note that for the purposes of this analysis, we count mail mode responses where multiple (two or more) categories are selected (checked) as missing responses. A research question specifically addressing multiple-category responses in the mail mode is included in Section 2.4.6.

We expected that the test treatment would not have a higher missing data rate than the control treatment. While the categories that a respondent chooses may vary across test and control versions of the survey, each version includes an "Other" category that should serve as a last resort for a respondent who is confused about which commute mode category to choose.

Time of Departure

The percent of eligible persons who did not provide a response to this question in the control treatment is compared to the corresponding percent from the test treatment. Statistical significance between versions is determined using a one-tailed t-test. Note that for the purposes of this analysis, we count mail or internet mode responses as missing when any of the three parts (hour, minute, am/pm) is missing.
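The item missing data rate comparison can be sketched in a few lines. The example below uses a normal approximation for the standard error for brevity (a simplification: the test's actual standard errors came from replicate weights, per Section 2.4.7), and all counts are hypothetical.

```python
# Illustrative one-tailed comparison of weighted item missing data rates
# (control vs. test). Standard errors use a normal approximation here; the
# Content Test itself used Successive Differences Replication (Section 2.4.7).
import math

def missing_rate(weights, is_missing):
    """Weighted percent of eligible persons with a missing response."""
    total = sum(weights)
    missing = sum(w for w, m in zip(weights, is_missing) if m)
    return 100.0 * missing / total

def one_tailed_z(p_control, n_control, p_test, n_test):
    """z-statistic for H1: control missing rate < test missing rate (percents)."""
    pc, pt = p_control / 100.0, p_test / 100.0
    se = math.sqrt(pc * (1 - pc) / n_control + pt * (1 - pt) / n_test)
    return (pt - pc) / se if se else float("nan")

ctrl_w, ctrl_m = [1.0] * 500, [False] * 480 + [True] * 20   # 4.0 percent missing
test_w, test_m = [1.0] * 500, [False] * 475 + [True] * 25   # 5.0 percent missing
z = one_tailed_z(missing_rate(ctrl_w, ctrl_m), 500,
                 missing_rate(test_w, test_m), 500)
print(f"z = {z:.2f}")  # compare to the one-tailed critical value 1.28 at alpha = 0.1
```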
2.4.3. Response Distributions

Comparing the response distributions between the control version of a question and the test version allows us to assess whether the question change affects the resulting estimates. Comparisons were made using Rao-Scott chi-squared tests (Rao & Scott, 1987) for distributions, or t-tests for single categories when the corresponding distributions were found to be statistically different. Proportion estimates for Commute Mode were calculated as:

Category proportion = (weighted count of valid responses in the category) / (weighted count of all valid responses)

For the Time of Departure question, we defined ranges of valid responses, which are described below.

Commute Mode

For Commute Mode, distributions are compared first using all 12 categories on the questionnaire, then using a 10-category collapse from American FactFinder (AFF) Table B08006, and finally using six categories, as found in AFF Table S0801 (see Figure 1 in Section 1.3 for the categories). The most detailed category schema involves 12 commute mode categories; the control version of these categories is shown in Figure 3 below. One anticipated finding was a smaller proportion of cases in the "Other" category in the test version, because respondents might now find clarity for commute modes that were previously unclear. Individual rail-related commute modes may also show small differences due to increased clarity in category names. For example, a respondent who commuted by light rail may have previously chosen "Subway or elevated" in the absence of a category that specifically includes light rail. We compare each pair of distributions (control versus test) using a chi-squared test. Several combinations of collapsed commute mode categories are tested. A t-test is also used to test the proportion of the three rail-related commuting categories combined. Figure 3 shows the commute mode categories included in each distribution.

Figure 3. Commute Mode Categories (control version) Used in Analysis

12 Categories: Bicycle; Bus; Car, truck or van; Ferryboat; Motorcycle; Other Method; Railroad; Streetcar; Subway; Taxicab; Walked; Worked at Home
10 Categories: Bicycle; Bus; Car, truck or van; Ferryboat; Railroad; Streetcar; Subway; Taxi, Motorcycle or Other Method; Walked; Worked at Home
6 Categories: Bicycle; Car, truck or van; Public Transportation; Taxi, Motorcycle or Other Method; Walked; Worked at Home

Time of Departure

Since this question is presented in an open-ended write-in format, answers are grouped into intervals for easier comparison. We compare each pair of distributions (control versus test) using a chi-squared test. The time of departure intervals are:

12:00 a.m. to 4:59 a.m.
5:00 a.m. to 5:29 a.m.
5:30 a.m. to 5:59 a.m.
6:00 a.m. to 6:29 a.m.
6:30 a.m. to 6:59 a.m.
7:00 a.m. to 7:29 a.m.
7:30 a.m. to 7:59 a.m.
8:00 a.m. to 8:29 a.m.
8:30 a.m. to 8:59 a.m.
9:00 a.m. to 9:59 a.m.
10:00 a.m. to 10:59 a.m.
11:00 a.m. to 11:59 a.m.
12:00 p.m. to 3:59 p.m.
4:00 p.m. to 11:59 p.m.
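To make the distribution comparison concrete, the sketch below tabulates weighted category counts for each treatment and applies a standard chi-square test of independence via scipy. This is a simplification: the actual analysis used Rao-Scott adjusted chi-square tests that account for the complex sample design, and the counts below are invented for the example.

```python
# Illustrative comparison of control vs. test response distributions using a
# plain chi-square test; the Content Test itself used Rao-Scott adjusted tests
# (Rao & Scott, 1987). Counts are hypothetical weighted totals.
import numpy as np
from scipy.stats import chi2_contingency

categories = ["Car, truck or van", "Public Transportation", "Walked",
              "Bicycle", "Taxi, Motorcycle or Other Method", "Worked at Home"]

# Rows: control, test; columns: weighted counts per six-category collapse.
table = np.array([
    [8600, 520, 280, 60, 120, 420],   # control treatment
    [8550, 540, 290, 55, 125, 440],   # test treatment
])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")

# Category proportions as defined in Section 2.4.3: weighted valid responses
# in the category divided by weighted valid responses overall.
for name, ctrl, test in zip(categories, table[0], table[1]):
    print(f"{name}: control {100*ctrl/table[0].sum():.1f}%, "
          f"test {100*test/table[1].sum():.1f}%")
```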
2.4.4. Benchmarks

No other surveys collect directly comparable data to use as a benchmark in this analysis. The National Household Travel Survey (NHTS), a survey conducted by the U.S. Department of Transportation, uses a travel diary method to collect information about travel patterns in the United States. This survey generally serves as a useful comparison for ACS commuting estimates such as travel mode distribution and travel time. For the purpose of this content test, the most appropriate benchmark is to compare responses from the test treatment to the current production questions, as done in the comparisons provided throughout Section 5.

2.4.5. Response Error

Response error occurs for a variety of reasons, such as flaws in the survey design, misunderstanding of the questions, misreporting by respondents, or interviewer effects. There are two components of response error: response bias and simple response variance. Response bias is the degree to which respondents consistently answer a question incorrectly. Simple response variance is the degree to which respondents answer a question inconsistently. A question has good response reliability if respondents tend to answer the question consistently.

Re-asking the same question of the same respondent (or housing unit) allows us to measure response variance. We measured simple response variance by comparing valid responses to the CFU reinterview with valid responses to the corresponding original interview.15 The Census Bureau has frequently used content reinterview surveys to measure simple response variance for large demographic data collection efforts, including the 2010 ACS Content Test and the 1990, 2000, and 2010 decennial censuses (Dusch & Meier, 2012).

The following measures were used to evaluate consistency:

- Gross difference rate (GDR)
- Index of inconsistency (IOI)
- L-fold index of inconsistency (IOI_L)

The first two measures, GDR and IOI, were calculated for individual response categories. The L-fold index of inconsistency was calculated for questions that had three or more mutually exclusive response categories, as a measure of overall reliability for the question. The GDR, and subsequently the simple response variance, are calculated using the following table and formula.

Footnote 15. A majority of the CFU interviews were conducted with the same respondent as the original interview (see the Limitations section for more information).

Table 1. Interview and Reinterview Counts for Each Response Category Used for Calculating the Gross Difference Rate and Index of Inconsistency

                           | CFU Reinterview "Yes" | CFU Reinterview "No" | Original Interview Totals
Original Interview "Yes"   |           a           |          c           |          a + c
Original Interview "No"    |           b           |          d           |          b + d
Reinterview Totals         |         a + b         |        c + d         |            n

where a, b, c, d, and n are defined as follows:

a = weighted count of units in the category of interest for both the original interview and the reinterview
b = weighted count of units NOT in the category of interest for the original interview, but in the category for the reinterview
c = weighted count of units in the category of interest for the original interview, but NOT in the category for the reinterview
d = weighted count of units NOT in the category of interest for either the original interview or the reinterview
n = total units in the universe = a + b + c + d

The GDR for a specific response category is the percent of inconsistent answers between the original interview and the reinterview (CFU). We calculate the GDR for a response category as:

GDR = 100 × (b + c) / n

Statistical significance of the difference in the GDR for a specific response category between the control and test treatments is determined using a one-tailed t-test.
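A minimal sketch of the GDR computation from the Table 1 cell counts follows; the weighted counts are hypothetical and chosen only to illustrate the arithmetic.

```python
# Illustrative GDR computation from the Table 1 cross-tabulation.
# a, b, c, d are weighted counts as defined above; values are hypothetical.
def gross_difference_rate(a, b, c, d):
    """Percent of answers that changed between the original interview and CFU."""
    n = a + b + c + d
    return 100.0 * (b + c) / n

# Example: 900 consistently in the category, 30 + 45 inconsistent,
# 8025 consistently out of the category (n = 9000).
print(f"GDR = {gross_difference_rate(900, 30, 45, 8025):.2f}%")  # 0.83%
```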
If we are interested in the true proportion of a total population that is in a certain category, we can use the proportion of a survey sample in that category as an estimate. Under certain reasonable assumptions, it can be shown that the total variance of this proportion estimate is the sum of two components, sampling variance (SV) and simple response variance (SRV). It can also be shown that an unbiased estimate of SRV is half of the GDR for the category (Flanagan, 1996). SV is the part of total variance resulting from the differences among all the possible samples of size n one might have selected. SRV is the part of total variance resulting from the aggregation of response error across all sample units. If the responses for all sample units were perfectly consistent, then SRV would be zero, and the total variance would be due entirely to SV. As the name suggests, the IOI is a measure of how much of the total variance is due to inconsistency in responses, as measured by SRV, and is calculated as:

$$\text{IOI} = \frac{n(b + c)}{(a + c)(c + d) + (a + b)(b + d)} \times 100$$

Per the Census Bureau's general rule, index values of less than 20 percent indicate low inconsistency, 20 to 50 percent indicate moderate inconsistency, and over 50 percent indicate high inconsistency. An IOI is computed for each response category, and an overall index of inconsistency, called the L-fold index of inconsistency, is reported for the entire distribution. The L-fold index is a weighted average of the individual indexes computed for each response category.

When the sample size is small, the reliability estimates are unstable. Therefore, we do not report the IOI and GDR values for categories with a small sample size, as determined by the following formulas: 2a + b + c < 40 or 2d + b + c < 40, where a, b, c, and d are unweighted counts as shown in Table 1 above (see Flanagan 1996, p. 15).

The measures of response error assume that the characteristics in question did not change between the original interview and the CFU interview. To the extent that this assumption is incorrect, we assume that it is incorrect at similar rates between the control and test treatments.
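To make the formulas concrete, here is a minimal sketch of the GDR and IOI computations from the Table 1 counts, together with the small-sample suppression rule. The counts and helper names are hypothetical, not taken from the Content Test data.

```python
def gdr(a: float, b: float, c: float, d: float) -> float:
    """Gross difference rate: percent of units whose reported category
    differs between the original interview and the CFU reinterview."""
    n = a + b + c + d
    return (b + c) / n * 100

def ioi(a: float, b: float, c: float, d: float) -> float:
    """Index of inconsistency for a single response category,
    following the formula above."""
    n = a + b + c + d
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    return n * (b + c) / denom * 100

def reportable(a: int, b: int, c: int, d: int) -> bool:
    """Suppression rule: do not report GDR/IOI when 2a+b+c < 40 or
    2d+b+c < 40, using unweighted counts."""
    return 2 * a + b + c >= 40 and 2 * d + b + c >= 40

# Hypothetical weighted counts for one category of interest:
a, b, c, d = 900.0, 25.0, 35.0, 9040.0
print(f"GDR = {gdr(a, b, c, d):.1f}%")  # 0.6% (SRV estimate = GDR / 2)
print(f"IOI = {ioi(a, b, c, d):.1f}%")  # about 3.6%, low inconsistency
```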
2.4.6. Other Analysis Methodology Specific to Commuting Questions

Commute Mode

We are especially interested in analysis for cities with a diverse set of transit options, specifically rail options. It is in these places that we expect to see meaningful differences in the three rail categories. We looked at metro areas with high rates of rail usage, as defined by the American Public Transportation Association 2014 Transit Ridership Report (American Public Transportation Association, 2014). Test responses from these metro areas are combined and compared against the combined control responses from these areas.

Figure 4. Selected Cities Included in Targeted Rail Metro Analyses

Cities with High Levels of Overall Rail Ridership: New York, NY; Washington, DC; Chicago, IL; Boston, MA; San Francisco, CA; Philadelphia, PA; Atlanta, GA; Los Angeles, CA; Miami, FL; Baltimore, MD

Cities with High Levels of Light Rail Usage: Boston, MA; Los Angeles, CA; San Francisco, CA; San Diego, CA; Portland, OR; Philadelphia, PA; Dallas, TX; Denver, CO; Salt Lake City, UT; St. Louis, MO

An additional test for Commute Mode focused on the extent to which respondents who received a paper questionnaire incorrectly marked more than one commute mode. The question instructs respondents to choose the single commute mode for which the longest distance was traveled, but a small percentage of respondents invariably choose more than one. For standard ACS processing, these cases are ultimately allocated. The frequency of choosing multiple modes was compared for the test and control treatments using different travel mode category pairs.

Time of Departure

There is a tendency for respondents to answer this question with a time that is rounded to a time ending in 0 or 5 (Stapleton & Steiger, 2015). If the test version of this question produces fewer responses "heaped" on times ending in 0 or 5, the test version might be providing estimates that are more precise. This analysis is conducted using a two-tailed t-test.

2.4.7. Standard Error Calculations

We estimated the variances of the estimates using the Successive Differences Replication (SDR) method with replicate weights, the standard method used in the ACS (see U.S. Census Bureau, 2014, Chapter 12). We calculated the variance for each rate and difference using the formula below. The standard error of the estimate (X0) is the square root of the variance:

$$\text{Var}(X_0) = \frac{4}{80} \sum_{r=1}^{80} (X_r - X_0)^2$$

where:

X0 = the estimate calculated using the full sample,
Xr = the estimate calculated for replicate r.
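A minimal sketch of this variance formula, assuming the 80 replicate estimates have already been computed with the SDR replicate weights (the function names and numbers are illustrative):

```python
import math

def sdr_variance(x0: float, replicates: list) -> float:
    """SDR variance with 80 replicates: Var(X0) = (4/80) * sum of
    squared deviations of the replicate estimates from X0."""
    assert len(replicates) == 80
    return (4 / 80) * sum((xr - x0) ** 2 for xr in replicates)

def sdr_standard_error(x0: float, replicates: list) -> float:
    """Standard error is the square root of the SDR variance."""
    return math.sqrt(sdr_variance(x0, replicates))

# Hypothetical full-sample estimate and replicate estimates:
x0 = 18.4
replicates = [18.4 + 0.05 * ((-1) ** r) * (r % 5) for r in range(80)]
print(f"SE = {sdr_standard_error(x0, replicates):.3f}")
```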
3. DECISION CRITERIA

Before fielding the 2016 ACS Content Test, we identified which of the metrics would be given higher importance in determining which version of the question would be recommended for inclusion in the ACS moving forward. The following tables identify the research questions and associated metrics in priority order.

Table 2. Decision Criteria for Commute Mode

Research Questions   Decision Criteria, in order of priority
3                    The test version should have the same or higher rate of responses commuting by rail than the control version.
2, 6 and 7           Differences in the distribution of commute mode categories should be minimal between test and control versions.
1                    The item missing data rates for the test version should be the same or lower than the control version.
4                    The reliability for the test version should be higher than the control version.
5                    The proportion of person records that mark multiple modes should be comparable between the control and test versions.

Table 3. Decision Criteria for Time of Departure

Research Questions   Decision Criteria, in order of priority
8                    The item missing data rates for the test version should be the same or lower than the control version.
11                   The proportion of responses in the test version that appear to be rounded should be the same or lower than in the control version.
10                   The reliability for the test version should be higher than the control version.
9                    The distributions between the control and test versions should have minimal to no differences.

4. LIMITATIONS

CATI and CAPI interviewers were assigned control and test treatment cases, as well as production cases. The potential risk of this approach is the introduction of a cross-contamination or carry-over effect due to the same interviewer administering multiple versions of the same question item. Interviewers are trained to read the questions verbatim to minimize this risk, but there still exists the possibility that an interviewer may deviate from the scripted wording of one question version to another. This could potentially mask a treatment effect in the data collected.

Interviews were only conducted in English and Spanish. Respondents who needed language assistance in another language were not able to participate in the test. Additionally, the 2016 ACS Content Test was not conducted in Alaska, Hawaii, or Puerto Rico. Any conclusions drawn from this test may not apply to these areas or populations.

For statistical analysis specific to the mail mode, there may be bias in the results because of unexplained unit response rate differences between the control and test treatments.

We were not able to conduct demographic analysis by relationship status, race, or ethnicity because these topics were tested as part of the Content Test.

The CFU reinterview was not conducted in the same mode of data collection for households that responded by internet, by mail, or by CAPI in the original interview, since CFU interviews were only administered using a CATI mode of data collection. As a result, the data quality measures derived from the reinterview may include some bias due to the differences in mode of data collection.

To be eligible for a CFU reinterview, respondents needed to either provide a telephone number in the original interview or have a telephone number available to the Census Bureau through reverse address lookup. As a result, 2,284 of the responding households (11.8 percent, with a standard error of 0.2) from the original control interviews and 2,402 of the responding households (12.4 percent, with a standard error of 0.2) from the original test interviews were not eligible for the CFU reinterview. The difference between the control and test treatments is statistically significant (p-value=0.06).

Although we reinterviewed the same person who responded in the original interview when possible, we interviewed a different member of the household in the CFU for 7.5 percent (standard error of 0.4) of the CFU cases for the control treatment and 8.4 percent (standard error of 0.5) of the CFU cases for the test treatment.16 The difference between the test and control treatments is not statistically significant (p-value=0.26). This means that differences in results between the original interview and the CFU for these cases could be due in part to having different people answering the questions. However, because the rate of such cases did not differ significantly between the control and test treatments, this should not impact the conclusions drawn from the reinterview.

The Content Test does not include the production weighting adjustments for seasonal variations in ACS response patterns, nonresponse bias, and under-coverage bias. As a result, any estimates derived from the Content Test data do not provide the same level of inference as the production ACS and cannot be compared to production estimates.

In developing initial workload estimates for CATI and CAPI, we did not take into account the fact that we oversampled low response areas as part of the Content Test sample design. Therefore, workload and budget estimates were too low. In order to stay within budget, the CAPI workload was subsampled more than originally planned. This caused an increase in the variances for the analysis metrics used.

An error in addressing and assembling the materials for the 2016 ACS Content Test caused some Content Test cases to be mailed production ACS questionnaires instead of Content Test questionnaires. There were 49 of these cases that returned completed questionnaires, and they were all from the test treatment. These cases were excluded from the analysis.
Given the small number of cases affected by this error, there is very little effect on the results.

Questionnaire returns were expected to be processed and keyed within two weeks of receipt. Unfortunately, a check-in and keying backlog prevented this requirement from being met, thereby delaying eligible cases from being sent to CFU on a schedule similar to the other modes. Additionally, the control treatment questionnaires were processed more quickly in keying than the test treatment questionnaires, resulting in a longer delay for test mail cases to become eligible for CFU. On average, it took 18 days for control cases to become eligible for CFU; it took 20 days for test cases. The difference is statistically significant. This has the potential to impact the response reliability results.

For Commute Mode, testing categories involving rail is challenging because rail-related travel infrastructure only exists in a small percentage of U.S. cities. Testing a commuting category that only applies to a small share of the working population requires a relatively large sample size to obtain margins of error small enough to detect significant differences between test and control treatments. Small proportional differences also require relatively large samples in order to register as significantly different.

16 This is based on comparing the first name of the respondent between the original interview and the CFU interview. Due to a data issue, we were not able to use the full name to compare.

5. RESEARCH QUESTIONS AND RESULTS

This section presents the results from the analyses of the 2016 ACS Content Test data for the questions on Commute Mode and Time of Departure for work. An analysis of unit response rates is presented first, followed by topic-specific analyses. For the topic-specific analyses, each research question is restated, followed by corresponding data and a brief summary of the results.

5.1. Unit Response Rates and Demographic Profile of Responding Households

This section provides results for unit response rates for both control and test treatments for the original Content Test interview and for the CFU interview. It also provides results of a comparison of socioeconomic and demographic characteristics of respondents in both control and test treatments.

5.1.1. Unit Response Rates for the Original Content Test Interview

The unit response rate is generally defined as the proportion of sample addresses eligible to respond that provided a complete or sufficient partial response. We did not expect the unit response rates to differ between treatments. This is important because the number of unit responses also affects the number of item responses we receive for analyses done on specific questions on the survey. Similar item response universe sizes allow us to compare the treatments and conclude that any differences are due to the experimental treatment instead of differences in the populations sampled for each treatment.

Table 4 shows the unit response rates for the original interview for each mode of data collection (internet, mail, CATI, and CAPI), all modes combined, and both self-response modes (internet and mail combined) for the control and test treatments. When looking at the overall unit response rate (all modes combined), the difference between control (93.5 percent) and test (93.5 percent) is less than 0.1 percentage points and is not statistically significant.
Table 4. Original Interview Unit Response Rates for Control and Test Treatments, Overall and by Mode

Mode            Test Interviews   Test Percent   Control Interviews   Control Percent   Test minus Control   P-Value
All Modes       19,400            93.5 (0.3)     19,455               93.5 (0.3)        <0.1 (0.4)           0.98
Self-Response   13,131            52.9 (0.5)     13,284               53.7 (0.5)        -0.8 (0.6)           0.23
Internet        8,168             34.4 (0.4)     8,112                34.1 (0.4)        0.4 (0.6)            0.49
Mail            4,963             18.4 (0.3)     5,172                19.6 (0.3)        -1.2 (0.5)           0.01*
CATI            872               8.7 (0.4)      880                  9.2 (0.4)         -0.4 (0.6)           0.44
CAPI            5,397             83.5 (0.7)     5,291                83.6 (0.6)        <0.1 (0.9)           0.96

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. P-values with an asterisk (*) indicate a significant difference based on a two-tailed t-test at the α=0.1 level. The weighted response rates account for the initial sample design as well as CAPI subsampling.

When analyzing the unit response rates by mode of data collection, the only modal comparison that shows a statistically significant difference is the mail response rate. The control treatment had a higher mail response rate (19.6 percent) than the test treatment (18.4 percent) by 1.2 percentage points. As a result of this difference, we looked at how mail responses differed in the high and low response areas. Table 5 shows the mail response rates for both treatments in high and low response areas.17 The difference in mail response rates appears to be driven by the difference of rates in the high response areas.

17 Table A-1 (including all modes) can be found in Appendix A.

It is possible that the difference in the mail response rates between control and test is related to the content changes made to the test questions. There are some test questions that could be perceived as too sensitive by some respondents (such as the test question relating to same-sex relationships) and some test questions that could be perceived as too burdensome (such as the new race questions with added race categories). In the automated modes (internet, CATI, and CAPI) there is a higher likelihood of obtaining a sufficient partial response (obtaining enough information to be deemed a response for calculations before the respondent stops answering questions) than in the mail mode. If respondents are offended by the questionnaire or feel that the questions are too burdensome, they may simply throw the questionnaire away and not respond by mail. This could be a possible explanation for the unit response rate being lower for test than control in the mail mode.

We note that differences between the overall and total self-response response rates were not statistically significant. As most analysis was conducted at this level, we are confident the response rates were sufficient to conduct topic-specific comparisons between the control and test treatments and that there are no underlying response rate concerns that would impact those findings.

Table 5. Mail Response Rates by Designated High (HRA) and Low (LRA) Response Areas

             Test Interviews   Test Percent   Control Interviews   Control Percent   Test minus Control   P-Value
HRA          2,082             20.0 (0.4)     2,224                21.5 (0.4)        -1.5 (0.6)           0.02*
LRA          2,881             13.8 (0.3)     2,948                14.1 (0.3)        -0.3 (0.4)           0.43
Difference   -                 6.2 (0.5)      -                    7.4 (0.4)         -1.1 (0.7)           0.11

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Minor additive discrepancies are due to rounding. Standard errors are in parentheses. P-values with an asterisk (*) indicate a significant difference based on a two-tailed t-test at the α=0.1 level. The weighted response rates account for the initial sample design as well as CAPI subsampling.
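The two-tailed tests reported in Tables 4 and 5 reduce to a t statistic formed from the estimated difference and the standard error of that difference. Below is a sketch using a normal approximation to the reference distribution (reasonable given the replicate degrees of freedom) and the published mail-mode figures; the small gap from the published 0.01 reflects rounding in the inputs.

```python
import math

def two_tailed_p(diff: float, se_diff: float) -> float:
    """Two-tailed p-value for testing that a difference is zero,
    using a normal approximation to the t statistic."""
    z = abs(diff / se_diff)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# Mail response rates from Table 4: test 18.4, control 19.6,
# SE of the difference 0.5 as published.
print(round(two_tailed_p(-1.2, 0.5), 3))  # ~0.016, significant at alpha=0.1
```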
5.1.2. Unit Response Rates for the Content Follow-Up Interview

Table 6 shows the unit response rates for the CFU interview by mode of data collection of the original interview and for all modes combined, for control and test treatments. Overall, the differences in CFU response rates between the treatments are not statistically significant. The rate at which CAPI respondents from the original interview responded to the CFU interview is lower for test (34.8 percent) than for control (37.7 percent) by 2.9 percentage points. While the protocols for conducting CAPI and CFU were the same between the test and control treatments, we could not account for the personal interactions that occur in these modes between the respondent and interviewer. This can influence response rates. We do not believe that the difference suggests any underlying CFU response issues that would negatively affect topic-specific response reliability analysis for comparing the two treatments.

Table 6. Content Follow-Up Interview Unit Response Rates for Control and Test Treatments, Overall and by Mode of Original Interview

Original Interview Mode   Test Interviews   Test Percent   Control Interviews   Control Percent   Test minus Control   P-Value
All Modes                 7,867             44.8 (0.5)     7,903                45.7 (0.6)        -0.8 (0.8)           0.30
Internet                  4,078             51.9 (0.6)     4,045                52.5 (0.7)        -0.6 (0.8)           0.49
Mail                      2,202             46.4 (0.9)     2,197                44.2 (0.9)        2.1 (1.3)            0.11
CATI                      369               48.9 (1.9)     399                  51.5 (2.5)        -2.5 (2.9)           0.39
CAPI                      1,218             34.8 (1.2)     1,262                37.7 (1.1)        -2.9 (1.6)           0.07*

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. P-values with an asterisk (*) indicate a significant difference based on a two-tailed t-test at the α=0.1 level.

5.1.3. Demographic and Socioeconomic Profile of Responding Households

One of the underlying assumptions of our analyses in this report is that the sample for the Content Test was selected in such a way that responses from both treatments would be comparable. We did not expect the demographics of the responding households for the control and test treatments to differ. To test this assumption, we calculated distributions for respondent data for the following response categories: age, sex, educational attainment, and tenure.18 The response distribution calculations can be found in Table 7. Items with missing data were not included in the calculations. After adjusting for multiple comparisons, none of the differences in the categorical response distributions shown below is statistically significant.

18 We were not able to conduct demographic analysis by relationship status, race, or ethnicity because these topics were tested as part of the Content Test.
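The Holm-Bonferroni adjustment cited in the notes to Table 7 and later tables (Holm, 1979) is a standard step-down procedure; a compact sketch with hypothetical p-values:

```python
def holm_bonferroni(pvals):
    """Step-down Holm-Bonferroni adjustment (Holm, 1979): multiply the
    k-th smallest p-value by (m - k), enforce monotonicity, cap at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values for one family of category comparisons:
print(holm_bonferroni([0.004, 0.03, 0.20, 0.60]))
# -> approximately [0.016, 0.09, 0.40, 0.60]
```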
Table 7. Response Distributions – Test versus Control Treatment

Item                                 Test Percent   Control Percent   Adjusted P-Value
AGE                                  (n=43,236)     (n=43,325)        0.34
  Under 5 years old                  5.7 (0.2)      6.1 (0.2)         -
  5 to 17 years old                  17.8 (0.3)     17.6 (0.3)        -
  18 to 24 years old                 8.6 (0.3)      8.1 (0.3)         -
  25 to 44 years old                 25.1 (0.3)     26.2 (0.3)        -
  45 to 64 years old                 26.8 (0.4)     26.6 (0.4)        -
  65 years old or older              16.0 (0.3)     15.4 (0.3)        -
SEX                                  (n=43,374)     (n=43,456)        1.00
  Male                               48.8 (0.3)     49.1 (0.3)        -
  Female                             51.2 (0.3)     50.9 (0.3)        -
EDUCATIONAL ATTAINMENT#              (n=27,482)     (n=27,801)        1.00
  No schooling completed             1.3 (0.1)      1.2 (0.1)         -
  Nursery to 11th grade              8.1 (0.3)      8.0 (0.3)         -
  12th grade (no diploma)            1.7 (0.1)      1.6 (0.1)         -
  High school diploma                21.7 (0.4)     22.3 (0.4)        -
  GED† or alternative credential     3.5 (0.2)      3.6 (0.2)         -
  Some college                       21.0 (0.4)     20.2 (0.4)        -
  Associate's degree                 8.8 (0.3)      9.1 (0.3)         -
  Bachelor's degree                  20.9 (0.4)     20.3 (0.4)        -
  Advanced degree                    13.1 (0.3)     13.7 (0.3)        -
TENURE                               (n=17,190)     (n=17,236)        1.00
  Owned with a mortgage              43.1 (0.6)     43.2 (0.5)        -
  Owned free and clear               21.1 (0.4)     21.2 (0.4)        -
  Rented                             33.8 (0.6)     34.0 (0.5)        -
  Occupied without payment of rent   1.9 (0.2)      1.7 (0.1)         -

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
#For ages 25 and older
†General Educational Development
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance testing done at the α=0.1 level. P-values have been adjusted for multiple comparisons using the Holm-Bonferroni method.

We also analyzed two other demographic characteristics shown by the responses from the survey: average household size and language of response. The results for the remaining demographic analyses can be found in Table 8 and Table 9.

Table 8. Comparison of Average Household Size

Topic                                        Test (n=17,608)   Control (n=17,694)   Test minus Control   P-value
Average Household Size (Number of People)    2.51 (<0.1)       2.52 (<0.1)          >-0.01 (<0.1)        0.76

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Significance was tested based on a two-tailed t-test at the α=0.1 level.

Table 9. Comparison of Language of Response

Language of Response   Test Percent (n=17,608)   Control Percent (n=17,694)   Test minus Control   P-value
English                96.1 (0.2)                96.2 (0.2)                   <0.1 (0.3)           0.52
Spanish                2.7 (0.2)                 2.6 (0.2)                    <0.1 (0.2)           0.39
Undetermined           1.2 (0.1)                 1.2 (0.1)                    <0.1 (0.2)           0.62

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Significance was tested based on a two-tailed t-test at the α=0.1 level.

The Content Test was available in two languages, English and Spanish, for all modes except the mail mode. However, the language of response variable was missing for some responses, so we created a category called "undetermined" to account for those cases.

There are no detectable differences between control and test for average household size or language of response. There are also no detectable differences for any of the response distributions that we calculated. As a result of these analyses, it appears that respondents in both treatments exhibit comparable demographic characteristics, which supports our assumption of demographic similarity between treatments.

5.2. Item Missing Data Rates

This section addresses research question number 1: Is the missing data rate the same or lower for the test treatment than for the control treatment?
Table 10 shows the item missing data rates for the control and test versions of each of the two commuting questions. The universe for Commute Mode is all workers aged 16 and older who were in the workforce during the reference week. The universe for the time leaving home question is the same, except that it does not include workers who worked at home.

For Commute Mode, the p-value was not significant, indicating that there was insufficient evidence to conclude that the item missing data rate for the test treatment is higher than that of the control treatment. This is consistent with expectations. While the categories that a respondent chooses may vary across test and control versions of the survey, each version includes an "Other" category that should serve as a last resort for a respondent who is unsure about which commute mode category to choose.

The item missing data rate for the test treatment of Time of Departure is not statistically higher than that of the control version, at 10.8 and 10.9 percent, respectively. A lower missing data rate for the test version would have been a more favorable outcome, but evidence that the item missing data rate is not higher for test is consistent with expectations. For this question, the number of people who have expressed concern about privacy is small, so a larger sample may be needed to yield significant differences in response rates.

Table 10. Item Missing Data Rates for Control and Test Treatments – Commute Mode and Time of Departure

Item                Test Sample Size   Test Percent   Control Sample Size   Control Percent   Test minus Control   P-Value (one-tailed)
Commute Mode        17,739             1.4 (0.1)      17,951                1.5 (0.1)         >-0.1 (0.2)          0.69
Time Leaving Home   16,631             10.8 (0.4)     16,820                10.9 (0.4)        -0.1 (0.6)           0.59

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance was tested based on a one-tailed t-test (test ≤ control) at the α=0.1 level.

5.3. Response Distributions

This section addresses research questions number 2 and number 9: How do the test and control response distributions compare? Research question number 3 is also addressed: How does the proportion of respondents marking one of the three rail categories compare between test and control versions when all three categories are combined?

Commute Mode

For Commute Mode, we compared each pair of distributions (control versus test) using a chi-squared test. Table 11 shows the results of the Rao-Scott chi-squared statistic for each. The results revealed no statistically significant difference between the test and control treatments for any of the standard travel mode category distributions (12 categories, 10 categories, and 6 categories). This result was not surprising given that our goal was limited to refining and clarifying existing categories to prevent ambiguity and keep up with the changing public transportation landscape. Additionally, the number of commuters who are expected to choose a public transportation mode is relatively small. The lack of significant differences served as an indicator that our refined and more inclusive wording for public transportation categories would not undermine comparability across years.

Table 11. Commute Mode: Chi-Square Statistic Comparing Control and Test Treatment

Category                   Rao-Scott Chi-Square Statistic   P-value
12 Category Distribution   8.5                              0.67
10 Category Distribution   7.0                              0.64
6 Category Distribution    4.2                              0.52

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Significance testing was done at the α=0.1 level based on a chi-square test.
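The collapse from 12 to 6 categories follows Figure 3; a sketch of that mapping as it might be applied for S0801-style comparisons is below. The grouping of ferryboat under public transportation is inferred from Figure 3 and standard ACS tabulations, and the function names are ours.

```python
# Mapping from the 12 detailed control-version modes to the 6-category
# schema in Figure 3 (AFF Table S0801).
TWELVE_TO_SIX = {
    "Car, truck or van": "Car, truck or van",
    "Bus": "Public Transportation",
    "Ferryboat": "Public Transportation",
    "Railroad": "Public Transportation",
    "Streetcar": "Public Transportation",
    "Subway": "Public Transportation",
    "Taxicab": "Taxi, Motorcycle or Other Method",
    "Motorcycle": "Taxi, Motorcycle or Other Method",
    "Other Method": "Taxi, Motorcycle or Other Method",
    "Bicycle": "Bicycle",
    "Walked": "Walked",
    "Worked at Home": "Worked at Home",
}

def collapse_to_six(weighted_counts: dict) -> dict:
    """Aggregate weighted 12-category counts into the 6-category schema."""
    out = {}
    for mode, weight in weighted_counts.items():
        six = TWELVE_TO_SIX[mode]
        out[six] = out.get(six, 0.0) + weight
    return out

print(collapse_to_six({"Bus": 120.0, "Subway": 80.0, "Walked": 50.0}))
# -> {'Public Transportation': 200.0, 'Walked': 50.0}
```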
Table 12 shows the distribution of the 12-category version of Commute Mode. The control and test distributions were not significantly different from one another. While we did not anticipate differences in distributions, our expectation was that any potential difference between test and control versions would involve one of the public transportation categories or a reduction in the number of people who choose the "Other" category in the test version, due to increased clarity of the public transportation categories. For example, a respondent who commuted by light rail may have previously chosen "Subway" or "Elevated" in the absence of a category that specifically includes light rail. No such differences were detected.

Table 12. Response Distribution for Control and Test Treatment for Commute Mode

12 Category Distribution (simplified category names)   Test Percent (n=17,429)   Control Percent (n=17,604)
Bicycle                                                0.5 (0.1)                 0.6 (0.1)
Bus                                                    1.9 (0.1)                 2.0 (0.1)
Car, truck or van                                      86.5 (0.4)                86.2 (0.4)
Ferryboat                                              <0.1 (<0.1)               <0.1 (<0.1)
Motorcycle                                             0.1 (<0.1)                0.1 (<0.1)
Other Method                                           0.7 (0.1)                 1.0 (0.1)
Railroad                                               0.7 (0.1)                 0.7 (0.1)
Streetcar                                              0.1 (<0.1)                0.1 (<0.1)
Subway                                                 1.5 (0.1)                 1.7 (0.2)
Taxicab                                                0.2 (<0.1)                0.1 (<0.1)
Walked                                                 2.6 (0.2)                 2.3 (0.1)
Worked at Home                                         5.1 (0.3)                 5.2 (0.2)
Total                                                  100.0                     100.0

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: χ2 = 8.5, p-value=0.67. Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance testing was done at the α=0.1 level, based on a chi-square test.

To identify overall differences in the public transportation categories, we focus on the three rail-related categories, which were modified to improve clarity and reduce redundancy across all public transportation categories. For testing purposes only, we created a combined category of all three rail-related commute modes and then assessed its prevalence between the test and control treatments. The combined rail categories were not significantly different across treatments (Table 13). This is consistent with the expectation that there would be little to no difference between commute mode distributions across treatments.

Table 13. Proportion of Three Rail-Related Commute Mode Categories Combined

Category        Test Sample Size   Test Percent   Control Sample Size   Control Percent   Test minus Control   P-Value
Combined Rail   17,429             2.3 (0.2)      17,604                2.5 (0.2)         -0.3 (0.2)           0.28

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance was tested based on a two-tailed t-test at the α=0.1 level.

Responses for Commute Mode were also broken down by survey mode (internet, mail, or "interview-assisted" modes, which include CATI and CAPI). For each of the three survey modes, test and control distributions were compared for the six commute mode categories using a chi-square test. The collapsed six-category distribution is used in this test in order to obtain a sufficiently large sample for each survey mode. For the mail and interview-assisted modes, the distributions of the test and control treatments were not statistically different from one another, but the test and control distributions for the internet mode were statistically different, with a p-value of <0.10 for the overall distribution.
Table 14 shows the difference between control and test in the distribution of individual commute modes for respondents who responded by internet.

Table 14. Proportion of Commute Mode for Test and Control Treatments – Internet Response Mode

Commute Mode                       Test Percent   Control Percent   Test minus Control   Adjusted P-Value
Bicycle                            0.6 (0.1)      0.6 (0.1)         <0.01 (0.1)          0.96
Car, truck or van                  85.9 (0.4)     87.2 (0.4)        -1.4 (0.7)           0.23
Public Transportation              4.8 (0.3)      4.3 (0.3)         0.4 (0.4)            0.88
Taxi, Motorcycle or Other Method   1.1 (0.2)      0.7 (0.1)         0.3 (0.2)            0.37
Walked                             2.1 (0.2)      2.3 (0.2)         -0.2 (0.2)           0.89
Worked at Home                     5.6 (0.3)      4.8 (0.3)         0.8 (0.4)            0.37

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: χ2 = 9.4, p-value<0.10. Standard errors are shown in parentheses. Significance was tested based on a two-tailed t-test at the α=0.1 level. P-values have been adjusted for multiple comparisons using the Holm-Bonferroni method.

Time of Departure

Table 15 shows the distributions for the test and control versions of Time of Departure. The Rao-Scott chi-square test found no statistical difference between the two distributions of Time of Departure. Because the primary goal is to reduce the sensitivity of the question, increase response rates, and retain the current distribution, this is an acceptable outcome based on expectations and decision criteria. Note that since there was no significant difference in the distributions, tests were not done on the individual time of departure categories.

Table 15. Response Distribution for Control and Test Treatment for Time of Departure

Departure Time Categories   Test Percent (N=14,729)   Control Percent (N=14,973)
12:00 am to 4:59 am         4.4 (0.2)                 4.4 (0.3)
5:00 am to 5:29 am          3.7 (0.2)                 3.5 (0.2)
5:30 am to 5:59 am          4.6 (0.3)                 4.4 (0.2)
6:00 am to 6:29 am          9.0 (0.3)                 8.3 (0.4)
6:30 am to 6:59 am          10.0 (0.4)                10.0 (0.4)
7:00 am to 7:29 am          14.6 (0.3)                15.2 (0.4)
7:30 am to 7:59 am          12.7 (0.4)                12.4 (0.4)
8:00 am to 8:29 am          11.7 (0.4)                12.0 (0.4)
8:30 am to 8:59 am          5.6 (0.2)                 5.7 (0.3)
9:00 am to 9:59 am          6.2 (0.3)                 6.5 (0.2)
10:00 am to 10:59 am        2.8 (0.2)                 2.8 (0.2)
11:00 am to 11:59 am        1.4 (0.2)                 1.3 (0.1)
12:00 pm to 3:59 pm         6.9 (0.3)                 6.6 (0.3)
4:00 pm to 11:59 pm         6.5 (0.3)                 6.9 (0.3)

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: χ2 = 6.7, p-value=0.91. Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance testing was done at the α=0.1 level using a chi-square test.

Responses for Time of Departure were also broken down by survey mode (internet, mail, or "interviewed" modes, which include CATI and CAPI). For each of the three modes, test and control distributions were compared across the Time of Departure categories using a chi-square test. The distributions for the mail and interviewed modes were not statistically different, but the test and control distributions for the internet mode were statistically different from one another (Table 16). Among individual Time of Departure categories, the 4:00 p.m. to 11:59 p.m. category for the control version was 1.9 percentage points higher than that of the test version.

Table 16. Time of Departure Distribution for Internet Response Mode

Time of Departure      Test Percent   Control Percent   Test Minus Control   Adjusted P-Value
12:00 am to 4:59 am    3.0 (0.2)      2.7 (0.2)         0.3 (0.3)            1.00
5:00 am to 5:29 am     2.8 (0.2)      2.9 (0.2)         -0.1 (0.3)           1.00
5:30 am to 5:59 am     4.4 (0.3)      4.3 (0.3)         0.1 (0.4)            1.00
6:00 am to 6:29 am     7.8 (0.4)      7.3 (0.4)         0.5 (0.6)            1.00
6:30 am to 6:59 am     10.6 (0.4)     10.9 (0.5)        -0.3 (0.6)           1.00
7:00 am to 7:29 am     16.5 (0.5)     15.2 (0.5)        1.3 (0.8)            1.00
7:30 am to 7:59 am     15.2 (0.5)     14.8 (0.5)        0.5 (0.8)            1.00
8:00 am to 8:29 am     11.8 (0.4)     12.2 (0.4)        -0.4 (0.6)           1.00
8:30 am to 8:59 am     6.8 (0.3)      6.8 (0.3)         -0.1 (0.4)           1.00
9:00 am to 9:59 am     6.6 (0.4)      6.9 (0.3)         -0.4 (0.4)           1.00
10:00 am to 10:59 am   2.5 (0.2)      2.5 (0.2)         -0.1 (0.3)           1.00
11:00 am to 11:59 am   1.3 (0.2)      1.1 (0.1)         0.2 (0.2)            1.00
12:00 pm to 3:59 pm    6.1 (0.3)      5.8 (0.3)         0.2 (0.5)            1.00
4:00 pm to 11:59 pm    4.8 (0.3)      6.6 (0.3)         -1.9 (0.4)           <0.01*

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: χ2 = 23.6, p-value=0.03. Standard errors are shown in parentheses. P-values have been adjusted for multiple comparisons using the Holm-Bonferroni method. P-values with an asterisk (*) indicate a significant difference at the α=0.1 level.
5.4. Benchmarks

No other surveys collect directly comparable data to use as a benchmark in this analysis. The National Household Travel Survey (NHTS), a survey conducted by the U.S. Department of Transportation, includes information about travel mode, including public transportation. While differences in sample size and universe, collection years, methodology, and question wording preclude the possibility of a direct comparison, the 2009 NHTS, the most recent available, shows a public transportation rate of 5.1 percent among commuters, similar to that of the ACS in 2015 (5.1 percent of workers) and other recent years. No statistical testing was conducted for this comparison, but the NHTS provides a useful approximation for public transportation commuting rates and other ACS commuting estimates. For the purpose of this content test, the most appropriate benchmark is to compare responses from the test treatment to the current production questions, as done in the comparisons provided throughout Section 5.

5.5. Response Error

This section addresses research questions number 4 and number 10: Are the measures of response reliability (gross difference rate and index of inconsistency) better for the test treatment than for the control treatment? To test this, a portion of the original sample population was reinterviewed, and the responses from the first and second interviews were compared.

Commute Mode

The hypothesis is that the increased clarity of the rail categories will lead to more consistent responses over time. Statistical significance of differences in the GDR and IOI is determined using a one-tailed t-test. One limitation to assessing the reliability of this question is that the reference period of the question is "last week." The reference period for the original response will therefore always be different from the time frame for the CFU response. This could reasonably lead to a different answer between responses. We assume, however, that any inconsistency in responses due to this would occur at the same rate in the control version as in the test version.

The GDR test shown in Table 17 indicates that there is variation in the degree of reliability across commute modes, but this variation is generally consistent between test and control versions. No travel mode category in the test treatment was statistically more reliable than its control treatment counterpart. The category "Subway or Elevated Rail" shows a comparatively low adjusted p-value that rounds to 0.10, but is still greater than 0.10.

Table 17. Difference in Gross Difference Rates (GDR) between Test Percent and Control Percent – Commute Mode

Response Category                      Test GDR Percent   Control GDR Percent   Test Minus Control   Adjusted P-value
Car, truck, or van                     4.7 (0.4)          4.6 (0.4)             0.1 (0.5)            1.00
Bus                                    0.9 (0.1)          0.9 (0.1)             0.0 (0.2)            1.00
Subway or Elevated Rail                0.6 (0.1)          1.0 (0.2)             -0.4 (0.2)           0.10
Long-distance train or commuter rail   0.7 (0.1)          0.5 (0.1)             0.1 (0.2)            1.00
Light rail, streetcar, or trolley      -                  -                     -                    -
Ferryboat                              -                  -                     -                    -
Taxicab                                -                  -                     -                    -
Motorcycle                             -                  -                     -                    -
Bicycle                                0.2 (0.1)          0.3 (0.1)             -0.1 (0.1)           1.00
Walked                                 1.7 (0.3)          1.2 (0.2)             0.5 (0.3)            1.00
Worked from home                       2.8 (0.3)          3.1 (0.3)             -0.3 (0.4)           1.00
Other method                           1.1 (0.2)          1.0 (0.2)             0.1 (0.3)            1.00

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance was tested based on a one-tailed t-test (test ≥ control) at the α=0.1 level. P-values have been adjusted for multiple comparisons using the Holm-Bonferroni method. A '-' entry in a cell indicates that either no sample observations or too few sample observations were available to compute an estimate or standard error.
For Commute Mode, the IOI test results in Table 18 show a pattern similar to that of the GDR, in that the degree of consistency in responses between original interviews and reinterviews was similar for the test and control treatments. One rail-related category stands out as having comparatively low p-values, "Subway or Elevated Rail," but the test treatment was not statistically lower than the control. While the relative differences are instructive, the small sample sizes for these categories may limit the potential for statistically different results between treatments.

Table 18. Index of Inconsistency between Control and Test Treatments – Commute Mode

Response Category                      Test IOI Percent   Control IOI Percent   Test Minus Control   Adjusted P-value
Car, truck, or van                     19.4 (1.8)         19.6 (1.5)            -0.2 (2.2)           1.00
Bus                                    23.6 (3.7)         26.1 (3.8)            -2.5 (5.3)           1.00
Subway or Elevated Rail                22.2 (4.1)         35.2 (4.8)            -13.0 (5.9)          0.11
Long-distance train or commuter rail   38.6 (6.9)         39.4 (7.5)            -0.8 (10.5)          1.00
Light rail, streetcar, or trolley      -                  -                     -                    -
Ferryboat                              -                  -                     -                    -
Taxicab                                -                  -                     -                    -
Motorcycle                             -                  -                     -                    -
Bicycle                                18.6 (8.3)         21.6 (8.3)            -3.0 (11.8)          1.00
Walked                                 32.0 (4.4)         22.5 (4.5)            9.5 (6.4)            1.00
Worked from home                       27.0 (2.9)         30.0 (2.5)            -3.0 (3.6)           1.00
Other method                           78.0 (11.5)        84.9 (4.9)            -7.0 (11.6)          1.00

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Minor additive discrepancies are due to rounding. Significance was tested based on a one-tailed t-test (test ≥ control) at the α=0.1 level. P-values have been adjusted for multiple comparisons using the Holm-Bonferroni method. A '-' entry in a cell indicates that either no sample observations or too few sample observations were available to compute an estimate or standard error.

Time of Departure

A different approach was taken to test reliability for Time of Departure. We compared the proportion of CFU responses that fell within five minutes of the corresponding responses from the original interview (Table 19). For both the test and control treatments, about half of the response pairs (original and follow-up interviews) fell within five minutes of one another. While the control treatment showed a higher rate of response pairs within five minutes, the test rate was not significantly smaller.

Table 19. Persons Reporting a Difference of Five Minutes or Less for Time of Departure

Test Rate (%)   Control Rate (%)   Test Minus Control   P-Value
49.5 (1.2)      51.4 (1.0)         -1.9 (1.6)           0.12

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are shown in parentheses. Significance tested at the α=0.1 level based on a one-tailed t-test (test ≥ control).
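As an illustration, the five-minute agreement measure in Table 19 could be computed along the following lines, with departure times expressed as minutes after midnight (a simplification that ignores pairs straddling midnight; the names and data are hypothetical):

```python
def within_five_minutes(original, cfu, weights):
    """Weighted percent of original/CFU response pairs whose reported
    departure times differ by five minutes or less."""
    agree = sum(w for o, c, w in zip(original, cfu, weights)
                if abs(o - c) <= 5)
    return agree / sum(weights) * 100

# Hypothetical pairs: 7:30 vs 7:30, 7:30 vs 7:40, 6:15 vs 6:10.
print(within_five_minutes([450, 450, 375], [450, 460, 370], [1.0, 1.0, 1.0]))
# -> about 66.7: two of the three pairs agree within five minutes
```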
5.6. Results for Analysis Specific to Journey to Work

Commute Mode

Public transportation systems are geographically concentrated within large cities and metro areas. Research question number 7 asks: How do the test and control response distributions compare when the sample is restricted to only metro areas with high levels of overall rail usage? To improve our understanding of the commute mode distribution within areas where public transportation categories are most relevant, we combined the sample for metros with high rates of public transportation usage (see Section 2.4.6 for a list of metro areas). These are also metro areas with a diverse set of transportation options. For this group of metro areas, no significant difference was found between the commute mode distributions of the test and control treatments (Rao-Scott Chi-Square = 7.8, p-value=0.73).

Research question number 6 asks: How do the test and control response distributions compare in metro areas with high levels of light rail usage? To answer this, with a focus on areas where light rail is most relevant, a separate comparison combined the metro areas with the 10 largest light rail systems. This includes the set of metro areas listed in Section 2.4.6. The results show no statistical differences between test and control treatments for any commute mode category (Rao-Scott Chi-Square = 8.1).

Research question number 5 asks: For the paper questionnaire, is the proportion of person records for which respondents incorrectly marked multiple modes of transportation comparable between control and test versions? When multiple modes are marked, if the sample size is large enough, which combinations are most common in each version? A final analysis specific to Commute Mode assessed the prevalence of respondents incorrectly marking two or more travel modes. When this occurs, the commute mode is allocated. The control treatment showed 59 unweighted instances of respondents marking multiple commute modes, whereas the test treatment showed only 33 such instances. For both treatments, the most prevalent combination of modes was bus and long-distance rail or bus and subway. Still, with such a small sample, a two-tailed t-test comparison shows no statistically significant difference between the unweighted test and control treatments (p-value=0.15).

Time of Departure

Respondents tend to answer this question with a time that is rounded, particularly to numbers ending in "0" and "5" (Stapleton & Steiger, 2015). Research question number 11 explores the rate at which such rounding occurs between the test and control treatments of Time of Departure. We anticipated that the test version of Time of Departure would produce as many or fewer instances of this type of heaping. A two-tailed t-test was used to compare the percentage of responses that end in "0" or "5" for the test and control versions. For the test version, 98.7 percent of respondents heaped on a time ending in "0" or "5," compared with 98.0 percent for the control version, but these rates were not statistically different from one another (p-value=0.12). Still, this analysis was instructive in that it showed that a high percentage of respondents round their Time of Departure answer to a "0" or "5," regardless of how the question is asked.
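The heaping rate itself is straightforward to compute; a sketch, with hypothetical minute values and weights:

```python
def heaping_rate(minutes, weights):
    """Weighted percent of departure-time responses whose minute value
    ends in 0 or 5 (i.e., is divisible by 5), the measure used for
    research question number 11."""
    heaped = sum(w for m, w in zip(minutes, weights) if m % 5 == 0)
    return heaped / sum(weights) * 100

# Hypothetical reported minutes past the hour, with survey weights:
minutes = [30, 45, 12, 0, 55, 7]
weights = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]
print(f"{heaping_rate(minutes, weights):.1f}% of responses heaped")  # 70.0%
```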
6. CONCLUSIONS AND RECOMMENDATIONS

This report discusses findings from the 2016 ACS Content Test for two questions related to commuting: Commute Mode and Time of Departure for Work. The motivation for modifying each question differed. For Commute Mode, the original set of categories reflected the travel modes and terminology of the 1950s, when the question was developed. We modified the commute mode categories to more accurately reflect the nation's public transportation options and the current terminology used to describe them. Time of Departure has long been considered a sensitive question because it specifically asks respondents when they leave their home to go to work. Our aim is to develop a question that captures crucial information about when our nation's roads and transit systems are used throughout the day, while reducing respondents' sensitivity to the question. We tested a new version of the question asking people what time their trip to work began, with the aim of asking the question in a way that seems less intrusive and does not include the word "home."

Among the various metrics used to answer our research questions, none revealed statistically different results between the test version and control version of each question. For both commuting variables, the distributions of the test and control versions were not statistically different from one another. This is consistent with the expectation that the distribution of departure times would not differ between test and control versions. For Commute Mode, the distribution of rail-related categories did not differ between test and control treatments, which is a satisfactory outcome given that the goal was to ensure clarity among commute mode categories, not to change the distribution. This applies to the individual rail-related modes as well as a special combined category (including the three rail-related categories). For both commuting questions, item response rates for the test treatment were not lower than those of the control treatment. Reliability metrics for both Commute Mode and Time of Departure did not show that the test version performed better than the control.

The final wording in the test versions of the commuting questions is the product of consultation with industry experts and extensive cognitive testing. This new wording is preferred to the control version of the ACS questions. Overall, the results of the various comparisons between test and control versions of each question showed strong similarities between the two. The lack of significant differences between distributions suggests continuity in the meaning of the control and test versions of each question, which is an acceptable outcome. The overarching goal is to improve and clarify the wording of the questions, not to alter the distribution. While smaller item missing data rates for the test versions would have been a favorable outcome, the finding of no significant difference in item missing data rates is also acceptable. The test versions of the Commute Mode and Time of Departure questions are preferred over the current versions; therefore, we recommend implementing the new test version of each question.
While transportation technology and travel behavior have changed rapidly in recent years, this iteration of ACS question modification has taken a conservative approach, only refining and clarifying terminology for existing categories rather than adding new categories. We will strongly consider the possibility of testing additional transportation categories that correspond with emerging travel trends in future ACS content test iterations.

7. ACKNOWLEDGEMENTS

The 2016 ACS Content Test would not have been possible without the participation and assistance of many individuals from the Census Bureau and other agencies. Their contributions are sincerely appreciated and gratefully acknowledged.

• Census Bureau staff in the American Community Survey Office, Application Development and Services Division, Decennial Information Technology Division, Decennial Statistical Studies Division, Field Division, National Processing Center, Population Division, and Social, Economic, and Housing Statistics Division.

• Representatives from other agencies in the Federal statistical system serving on the Office of Management and Budget's Interagency Working Group for the ACS and the Topical Subcommittees formed by the Working Group for each topic tested on the 2016 ACS Content Test.

• Staff in the Office of Management and Budget's Statistical and Science Policy Office.

The authors would like to thank the following individuals for their contributions to the analysis and review of this report: Elizabeth Poehler, Nicole Scanniello, and Jennifer Ortman.

8. REFERENCES

American Public Transportation Association. (October 2014). Light Rail & Streetcar Systems: How They Differ; How They Overlap. Retrieved June 23, 2015 from American Public Transportation Association: http://www.apta.com/resources/reportsandpublications/Documents/APTA Light Rail-Streetcars-How They Differ-How They Overlap Oct 14.pdf

Chappell, G., & Obenski, S. (November 2014). ACS Fiscal Year 2014 Content Review Results. Washington, D.C.: U.S. Census Bureau. Retrieved June 24, 2015 from http://www.census.gov/acs/www/Downloads/operations_admin/2014_content_review/Methods%20and%20Results%20Report/2014_ACS_Content_Review_Final_Documentation.pdf

Dusch, G., & Meier, F. (2012). 2010 Census Content Reinterview Survey Evaluation Report. U.S. Census Bureau, June 13, 2012. Retrieved May 17, 2016 from http://www.census.gov/2010census/pdf/2010_Census_Content_Reinterview_Survey_Evaluation_Report.pdf

Federal Highway Administration, U.S. Department of Transportation. (2009). Introduction to the 2009 NHTS. Retrieved June 24, 2015 from National Household Travel Survey: Our Nation's Travel: http://nhts.ornl.gov/introduction.shtml

Flanagan, P. (1996). Survey Quality & Response Variance (Unpublished Internal Document). U.S. Census Bureau, Demographic Statistical Methods Division, Quality Assurance and Evaluation Branch.

Holm, S. (1979). "A Simple Sequentially Rejective Multiple Test Procedure," Scandinavian Journal of Statistics, Vol. 6, No. 2: 65-70. Retrieved on January 31, 2017 from https://www.jstor.org/stable/4615733?seq=1#page_scan_tab_contents

Rao, J. N. K., & Scott, A. J. (1987). "On Simple Adjustments to Chi-Square Tests with Sample Survey Data," The Annals of Statistics, Vol. 15, No. 1: 385-397. Retrieved on January 31, 2017 from http://projecteuclid.org/euclid.aos/1176350273

Stapleton, M., & Steiger, D. (2015).
Cognitive Testing of the 2016 American Community Survey Content Test Items: Summary Report for Round 1 and Round 2 Interviews. Westat, Rockville, Maryland, January 2015.

Steiger, D., Anderson, J., Folz, J., Leonard, M., & Stapleton, M. (2015). Cognitive Testing of the 2016 American Community Survey Content Test Items: Briefing Report for Round 3 Interviews. Westat, Rockville, Maryland, June 2015.

U.S. Census Bureau. (2014). American Community Survey Design and Methodology (January 2014). Retrieved February 1, 2017 from http://www.census.gov/programs-surveys/acs/methodology/design-and-methodology.html

U.S. Census Bureau. (2016). 2015 Planning Database Tract Data [Data file]. Retrieved on January 31, 2017 from http://www.census.gov/research/data/planning_database/2015/

APPENDIX A. Supplemental Table for Unit Response Rates

Table A-1. Unit Response Rates by Designated High (HRA) and Low (LRA) Response Areas

Mode             Test Interviews   Test Percent   Control Interviews   Control Percent   Test minus Control   P-Value
Total Response   19,400            -              19,455               -                 -                    -
  HRA            7,556             94.3 (0.4)     7,608                94.5 (0.3)        -0.2 (0.6)           0.72
  LRA            11,844            91.5 (0.3)     11,847               91.0 (0.3)        0.5 (0.5)            0.29
  Difference     -                 2.7 (0.5)      -                    3.5 (0.5)         -0.7 (0.7)           0.33
Self-Response    13,131            -              13,284               -                 -                    -
  HRA            6,201             59.7 (0.7)     6,272                60.6 (0.7)        -0.9 (0.9)           0.31
  LRA            6,930             33.2 (0.4)     7,012                33.6 (0.4)        -0.4 (0.6)           0.55
  Difference     -                 26.5 (0.8)     -                    27.0 (0.8)        -0.5 (1.2)           0.66
Internet         8,168             -              8,112                -                 -                    -
  HRA            4,119             39.6 (0.6)     4,048                39.1 (0.6)        0.5 (0.8)            0.51
  LRA            4,049             19.4 (0.3)     4,064                19.5 (0.3)        -0.1 (0.4)           0.87
  Difference     -                 20.2 (0.6)     -                    19.6 (0.7)        0.6 (0.9)            0.52
Mail             4,963             -              5,172                -                 -                    -
  HRA            2,082             20.0 (0.4)     2,224                21.5 (0.4)        -1.5 (0.6)           0.02*
  LRA            2,881             13.8 (0.3)     2,948                14.1 (0.3)        -0.3 (0.4)           0.43
  Difference     -                 6.2 (0.5)      -                    7.4 (0.4)         -1.1 (0.7)           0.11
CATI             872               -              880                  -                 -                    -
  HRA            296               9.0 (0.5)      301                  9.6 (0.6)         -0.6 (0.8)           0.44
  LRA            576               7.9 (0.4)      579                  8.0 (0.3)         -0.1 (0.5)           0.85
  Difference     -                 1.1 (0.6)      -                    1.6 (0.7)         -0.5 (0.9)           0.58
CAPI             5,397             -              5,291                -                 -                    -
  HRA            1,059             82.2 (1.0)     1,035                82.7 (0.9)        -0.5 (1.3)           0.69
  LRA            4,338             85.8 (0.5)     4,256                85.0 (0.4)        0.8 (0.7)            0.23
  Difference     -                 -3.7 (1.1)     -                    -2.3 (1.0)        -1.3 (1.5)           0.36

Source: U.S. Census Bureau, 2016 American Community Survey Content Test
Note: Standard errors are in parentheses. Minor additive discrepancies are due to rounding. P-values with an asterisk (*) indicate a significant difference based on a two-tailed t-test at the α=0.1 level. The weighted response rates account for the initial sample design as well as CAPI subsampling.