Opening Remarks

COPAFS Chair Maurine Haver noted that the meeting program focused on the topic of big data. She asked for comments on the topic-oriented format, and invited members to serve on a committee on programs for the quarterly meetings. Maurine also mentioned the formation of an advocacy committee that will work to have COPAFS heard on the Hill in support of federal statistics. Members also are invited to serve on that committee.

Executive Director Kitty Smith introduced the big data topic, noting that the term lacks a single definition, but reflects the reality that many companies are using very large data sets to better understand behavior. A key concern for COPAFS is the implications of big data for federal statistics.


Big Data Projects at the Census Bureau
William G. Bostic, Jr. U.S. Census Bureau

Bostic said the use of big data at the Census Bureau is not new, as they have long used IRS and other large administrative data sources. But now there is a new generation of big data, and the Census Bureau needs to learn how to make use of such sources to increase efficiency while at the same time maintaining the quality of data products.

The Census Bureau is researching big data solutions for a variety of operations – a departure from the past practice of targeting technologies to specific surveys. Big data can improve timeliness and explanatory power, but are raw and on a much larger scale than survey and even administrative data. Also of interest are big data techniques, such as pattern-discerning models and mashed-up data, that can improve logistical operations – such as determining the point at which it makes sense to discontinue survey data collection.

Bostic noted that 2020 census planning is underway, and includes plans for the use of administrative data in estimating values for nonresponding units, and verifying responses without follow-up visits. It is hoped that social media feeds will identify issues to be addressed in field operations, and that big data can contribute to adaptive design objectives. Bostic stressed that such efforts will require contributions from people with “data scientist” skills.

Big data on foreclosures, new construction, and building permits are seen as a way to improve census address lists. The hope is that big data can supplement the building permit and construction data the Census Bureau collects already, but there are issues related to access and standardization.

The Census Bureau also is exploring the possibility of acquiring retail and service statistics from private suppliers to fill gaps in business data products. However, Bostic cautioned that using such data requires further research to address concerns related to confidentiality, definitions, and data quality. Cost is another concern, as the data would have to be acquired through agreements with the private sector sources. Such efforts are moving ahead, as the opportunities are believed to outweigh the risks and challenges.

Bostic stressed the importance of making use of big data, but at the same time maintaining data quality. As he described it, big data always provide large quantities of data, but not always high quality. Because of their potential, big data are on the radar, and the Census Bureau is developing a process to coordinate and facilitate their contributions across the agency. The objective is to systematically explore data products and processes that might be improved, but the bureau is taking a careful approach, and must watch costs, as big data applications can be very expensive.

The discussion raised questions about where big data would supplement official statistics and where they might be used as replacements. Bostic acknowledged the need to be mindful of the differences between data from surveys and those generated by administrative and digital sources. Asked whether there will be a central place at the Census Bureau to oversee big data efforts, Bostic pointed to the Center for Adaptive Design as the function in a position to oversee activities across the directorates.

Big Data: A Perspective from the Bureau of Labor Statistics.
Michael Horrigan. Bureau of Labor Statistics

Noting that the subject is new for everyone, Horrigan said his objective was to consider the definition of big data, how such data are used now, and their future use by statistical agencies. He defined big data as non-sampled data, created for purposes other than statistical estimation. Examples are tweets and Google searches, and his definition would include the administrative data already used by BLS and other statistical agencies. As he put it, both big data and administrative data contrast with traditional survey data.

Uses of big data include webscraping – such as the Billion Prices Project, which tracks prices, and generates measures contrasting with official measures, such as the Consumer Price Index. But the CPI itself uses webscraping to track product characteristics used in quality adjustments for hedonic models for products such as televisions and cameras. Other examples include the use of Google searches to track flu outbreaks and tweets related to job loss as a predictor of unemployment claims. Horrigan also described applications involving big data from Intuit, ADP Payroll, Nielsen Homescan, JD Power, Medicare Part B, and stock exchange security trades. Horrigan described additional uses involving the drawing of samples, the estimation of prices, and imputation-based estimation. Many such applications involve linking and issues of data sharing across agencies.

Asked about opportunities for private data sources to contribute to the CPI, Horrigan described research on private sector data on register transactions for sampled outlets, but noted challenges, such as accounting for substitution, and the risk of losing access to private data. The use of private sources requires agreements with the firms that provide such data, and statistical agencies need to be careful in the kinds of deals they make with such companies.

Horrigan observed that BLS and other agencies evaluate the efficacy of their data programs using a variety of quality dimensions, and he suggested it is reasonable to ask how big data fare on these dimensions. He gave big data high marks for timeliness and relevance, noting that it is tough to beat the timeliness of Billion Prices, or the relevance of Google searches. However, when it comes to objectivity, accuracy, and lack of bias, there are questions about big data. Horrigan said big data often have an advantage on coverage, but others observed coverage issues, with the Billion Prices Project, for example, not covering all of what goes into the CPI. Big data applications also are challenged by the fact that many sources have their own classification systems that can be inconsistent with standard classifications (such as for occupations). Big data sources also can be lacking in metadata, transparency, interpretability, and accessibility.

Looking to the future, Horrigan said the use of big data (including administrative data) is here to stay, but that improvements in quality assessment are needed. He foresees questions about household cooperation, and suggested there may be more opportunity for big data applications related to business data than household data. The transparency obstacle is a big one, as is the problem of replicating the results of research based on big data from private companies – as replication would require a separate acquisition of comparable data from another company.

Big Data – The Consumer’s Perspective.
John Horrigan. Media and Technology Institute, Joint Center for Political and Economic Studies

John Horrigan started with the observation that he and the previous presenter, Mike Horrigan, had just met and are not related, so far as they know.

Horrigan described his previous work with the Pew Internet & American Life Project, which explored the use of broadband, and with the National Broadband Plan, designed to promote broadband Internet access. Early on, they started considering the role of personal data in broadband use, how such data (now called big data) could be used, and the related issues of confidentiality. Broadband and big data were seen as ways to unleash innovation, empower consumers, personalize health care, and inform policymakers. The sharing of personal information is key to many broadband innovations, and as Horrigan noted, that requires trust.

Describing big data as a “factor of production,” Horrigan said most people probably are unfamiliar with the term. Big data are generated by the activity of individuals, and already are being used in the operation of many businesses. For example, big data can enable consumers to gather information on consumer goods, and then provide opinions on the perceived value of products and services. Information of this type on products is a good in itself, and of course, applications depend on the willingness of consumers to provide information.

Horrigan also described “experience goods” – such as liking or not liking a certain type of music. For example, big data can enable a music provider to suggest to the buyer of a song a list of other songs the buyer might like. Noting that the music industry had just experienced its first increase in sales in 13 years, he suggested that such capabilities might at least have made a contribution. Summing up, Horrigan noted that big data are out there, and whether they know it or not, people are providing a lot of information relevant to consumer marketing.

Considering what could go wrong with big data applications, Horrigan noted the concerns about privacy, but said it is unclear just how big a concern it is for consumers. Surveys show some concern, but also that most Internet users know little about privacy provisions, or ways they are contributing data used for various purposes. And some confuse big data privacy issues with those of identity theft. Another question is whether consumer concerns affect behavior, as there is evidence that people will express concern about privacy, but go ahead and use the Internet as they want. There is some tendency for those most concerned with identity theft to show somewhat lower levels of online engagement, but as Horrigan put it, not enough to “throw a damper on the big data party.”

Horrigan noted that the Internet has taken strong hold over the past decade and a half, but said it is about to become even more impactful, due largely to big data. Big data are creating value for businesses and society as a whole, but there is a need to build in consumer protections. There is a need to elevate levels of digital literacy across the population, and to build consumer support for innovations that can deliver these benefits.

Kitty Smith noted that suspicions over how big data are used in the private sector could impair federal applications of the data. With respect to concern that governments might use personal data in ways similar to businesses, Horrigan said it suggests a great research question – whether government agencies or private businesses could be best expected to follow restrictions on the use of personal data.

For those interested, Horrigan recommended a new book, “Big Data: A Revolution That Will Transform How We Live, Work, and Think” by Viktor Mayer-Schönberger and Kenneth Cukier.

Big Data Meets Official Statistics.
Robert Groves. Georgetown University

Groves defined big data as data sets so large it is not practical to move them. Agencies do not acquire big data and move them into their systems, but rather gain access to them. Groves also drew the distinction between designed data (such as from surveys) and organic data that arise out of the information ecosystem, and are relatively unobtrusive to those being measured. Big data also reflect a new level of timeliness, as many sources are updated with near-real-time frequency.

Groves described three approaches one could take with respect to big data. One extreme would be to replace existing measures with big data indicators. The tools for this approach are becoming available, but there are issues of quality. For many applications, we need data that cover all parts of the population and that are representative – and big data sources often do poorly by those standards. Another problem with replacement is that big data do not allow much drill-down or the ability to cross-tabulate and look at information by subgroups. We design data to be multivariate, but big data tend to be univariate, and leave us with many unanswered questions.

The second extreme is to assume that the present system will endure and win out over big data. That view might be optimal now, but Groves said we have to ask what happens when big data become so prevalent and are used so widely in business that they cannot be ignored.

Groves’ view is that traditional survey data (although challenged) are not going away, and that big data are too powerful to ignore. Therefore, we have no choice but to pursue the third option, which is a blending of big data and survey data. Blending is a challenge, since there is no way to link big data to survey response records. Some spatial linking might be useful – or one could link by time, if surveys would note more closely the time that responses reflect. Groves also suggested linking could be helped if surveys asked if respondents participate in social media such as Facebook – although he wondered how proposals for such questions would fare at OMB.

Groves described access as a big problem, as big data owners have concerns about liability, and the consequences of confidentiality breaches beyond their control. Big data providers also are worried about being scooped – that a partner might invent something with their data that they will wish they had invented, or that a competitor might learn something about their business. Some might even want tax incentives in return for access to their data.

To promote blending, Groves suggested the need for a new institution – probably a public/private partnership, maybe with a privacy lobby having a governance role. Federal agencies and others would apply to this institution to arrange for access to big data resources. He also sees the need for technical staff, noting that big data sets are so big that their use requires skills that few have right now. Another problem with big data is a lack of control. Researchers like control over their data, but big data producers can always change what they are doing. The proposed institution could assist in promoting consistency in big data resources. Groves observed that we are moving to a world where there are many alternate estimates for much of what we are trying to measure. It is a world that is not necessarily liked by survey researchers, nor by the public that wants just one answer. There is a great need for education to help everyone adapt to this new world.

Groves concluded with the observation that blending might be the key to defending old-line surveys. Currently, we have to defend traditional surveys on their own, but if they are blended with big data, they will be part of something bigger.

Update on Recent Developments and COPAFS Activities
Kitty Smith

Smith noted the new COPAFS tagline, “Linking you with a thriving statistical system,” and remarked that it is what COPAFS and the quarterly meetings are all about. She also described her goal of achieving greater engagement with COPAFS members, and in particular engagement beyond the representatives of the member associations. As part of this objective, Smith will be looking for opportunities to present at conferences held by COPAFS member associations.

Dan Newlon took the opportunity to update us on his involvement with the ongoing efforts on data synchronization. Maurine Haver noted that these efforts go back many years, and reminded us that synchronization is the term now used for data sharing.

Issues from COPAFS Constituencies
In addition to data synchronization, Dan Newlon suggested that issues for COPAFS to focus on include the following:

Core economic data at risk in the sequester
Regulatory changes
New modes of data access
Research grants (in light of proposals for severe cuts)

With no more discussion, the meeting was adjourned.

March 1, 2013 Presentations