Gary King and Nathaniel Persily
A year ago, we developed a new model of industry-academic partnerships and began a collaboration with Facebook to embark on an unprecedented effort to enlist the world’s research community to study social media’s effect on elections and democracy. Today, we are announcing our approval of the first group of over sixty researchers from around the world to use Facebook data for this purpose in a safe, secure, and privacy-protected manner. We have more work to do and hurdles to overcome to achieve the shared ambition of the research community, but we are pleased that with this announcement we have reached an important milestone.
The urgency of this research cannot be overstated. Elections in India are already underway, the European Parliamentary elections will take place in short order, and the U.S. presidential primary campaigns have begun in earnest. Concerns about disinformation, polarization, political advertising, and the role of platforms in the information ecosystem have not diminished. If anything, they have heightened. We believe we can provide the fuel in the form of data access to the scholarly community to help solve some of the major issues in social media that affect elections and democracy across the world.
How Far We Have Come
A team of Facebook employees assigned to this effort has worked tirelessly with us to build a first-of-its-kind privacy-protecting research infrastructure, and to tackle the formidable legal, regulatory, security, computational, administrative, and logistical issues involved. The team at Facebook has been dedicated to our mission and has worked professionally and creatively -- and relentlessly -- to implement it. The challenges our project has faced do not arise from their level of investment in this effort. Unexpected delays emerged from the inherent complexity of the task, the difficult goal we set for ourselves of providing the first ever scholarly access to Facebook data that does not require pre-publication approval by the company, and the fact that few parts of this project have ever even been attempted before, inside or outside of Facebook.
Based on our original assumptions and expectations, we, along with our partners at the Social Science Research Council, issued a Request for Proposals in July 2018 for a database at the URL level. The database was to contain information about aggregate exposure to URLs by large numbers of different population subclasses and different characteristics of the URL (e.g., whether it was fact-checked and what the fact-check revealed). Because the resulting dataset would have some subclasses that contained sparse aggregations, there was concern that a researcher with bad intentions could, in theory, leverage the dataset to discover what some individual may have once seen on Facebook. As such, that dataset was deemed insufficiently protective without further modifications to our plans.
Since then, we have been working with Facebook to wrestle with the difficult privacy, policy, legal, and technical questions involved in developing the infrastructure for data access that would satisfy all relevant parties and regulatory bodies. Over the last six months, Facebook has built a research tool that allows data grantees to log in and query Facebook data for insights. We have also been helping Facebook deploy cutting-edge “differential privacy” systems that, by adding specialized types of noise to the data or analysis methods, prevent researchers from reidentifying individuals while not obscuring research findings about societal patterns when researchers perform appropriate analyses. We are still working with the Facebook team to finalize and test the research tool, validate the datasets within the tool, and ensure that the differential privacy algorithms are implemented in ways that provide both utility to researchers and privacy guarantees for the data. Differential privacy has the potential to solve the political problems of data sharing technologically, but it is a recent innovation. Over the coming months, we hope to finalize these reviews. We are proud of the enormous progress we and our partners have made on numerous fronts, and hope to bring considerable amounts of data sharing to the research community in due course.
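To make the idea of "adding specialized types of noise" concrete, here is a minimal Python sketch of one textbook differential privacy technique, the Laplace mechanism applied to a counting query. It is an illustration only and not a description of the system Facebook is building, whose mechanism and parameters are not specified here. The function names, the privacy budget `epsilon = 0.5`, and the two example counts are all assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Return a noisy count using the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
epsilon = 0.5            # hypothetical privacy budget (smaller = more private)

small_cell = 3           # hypothetical sparse subclass: few people saw the URL
large_cell = 1_200_000   # hypothetical broad aggregate

noisy_small = private_count(small_cell, epsilon, rng)
noisy_large = private_count(large_cell, epsilon, rng)

# The noise has the same typical size (about 1 / epsilon) in both cases,
# so it matters a lot for the sparse cell and very little for the large one.
relative_error_large = abs(noisy_large - large_cell) / large_cell

print(f"sparse cell: true={small_cell}, released={noisy_small:.1f}")
print(f"large cell:  true={large_cell}, released={noisy_large:.1f}")
print(f"relative error on large cell: {relative_error_large:.2e}")
```

The noise is about the same absolute size regardless of the true count. For a sparse cell it can be as large as the count itself, which is what frustrates attempts to infer what any one individual saw. For a large aggregate it is negligible, which is why societal patterns can remain visible.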
Where We Are Today
The researchers announced today will gain access to some data immediately, and to other datasets in stages once our testing indicates that they are both useful for scholarly research and compliant with appropriate privacy and legal standards. If the system we are building succeeds in making the URLs dataset available, we believe it will then become possible to make other highly informative datasets available to the research community. We are also about to begin giving data access to other researchers on a faster schedule but without funding, and will give details of this new process in another blog post shortly.
To receive data access, approved researchers and their universities must sign a research data agreement with Facebook. Since this is an unusual agreement for scholars, with some stringent provisions Facebook required, we wrote an explanation of how we think about it so researchers can more easily decide for themselves whether they should participate.
As of now, researchers who sign the agreement will receive (1) funding provided by eight charitable foundations (through the SSRC acting as a fiscal agent), (2) immediate access to the Crowdtangle API, and (3) immediate access to the Ad Library API, all described below. We encourage researchers whose projects can make use of these APIs to start with these datasets immediately since they are already available. We are also working toward releasing the URL dataset to researchers in several stages, each of which is contingent on successful completion of reviews to ensure the system is privacy-preserving and of utility to researchers. We give more details on the datasets and our process for providing access below.
The Crowdtangle API is a platform already used by many media institutions and journalism schools. It includes a subset of public pages on Facebook, Instagram, Twitter, and Reddit. Facebook has agreed to let Social Science One give access to this API for research on Facebook and Instagram. We hope that Twitter and Reddit will give permission for Crowdtangle to be made available to researchers for those platforms as well, given the importance of cross-platform research to study disinformation, political polarization, and other democracy-relevant phenomena. Nevertheless, the data we are now providing access to should be quite informative as it includes, from Facebook, 6.9 billion page posts, 1.2 billion group posts, and 11.2 million verified profile posts, as well as 1.6 billion Instagram posts.
Facebook is also granting public access to its Ad Library API for the purpose of analyzing political advertising data. We hope to help improve how the API is designed so that researchers can make better use of it. For example, we agree with the recommendations for developing an ads API contained in the letter recently issued by a group of scholars associated with Mozilla. Indeed, Social Science One commission members have made an array of additional recommendations, urging Facebook to provide page identifiers, boolean searches, and other functionality improvements tailored for research purposes, and to remove some instability and other flaws in the platform. The Ad Library is new and under active development, and we are hopeful Facebook will continually improve its capabilities so researchers can better analyze campaign spending, ad targeting, and ad content. It includes information about $539,861,997 spent on 3,259,272 ads in the US since May 2018; it also includes somewhat smaller numbers of ads in India, the UK, Ukraine, and Brazil.
We hope to release access to the URL shares dataset in several stages: (1) onboarding researchers onto the Facebook research tool for training, (2) releasing a “URLs-Light” dataset that will include aggregated information on all of the URLs shared on Facebook and aggregated data on interactions with those URLs (see the codebook), and (3) releasing a more informative full URLs dataset that will provide (privacy-protected) information to researchers about user interactions with the URLs. Each stage in this process will rely on the completion of additional testing to verify the security and privacy features and to assess their impact on research. Testing is underway; researchers will not be given access to a stage unless and until that stage passes these tests.
What the Future Holds
As discussed, our analysis of what Facebook is now offering to researchers aims to ensure that these new privacy tools are tuned to protect individual privacy without impairing researchers’ ability to answer the questions they are posing. We have only just begun to receive access to the Facebook data needed to conduct such an evaluation, and have not finished negotiating all the legal agreements, but we are optimistic that we will be able to enable research that both protects user privacy and preserves research integrity.
Analyzing the URLs datasets requires training on the new Facebook research infrastructure. We will offer this training to researchers beginning in June. We encourage researchers to take advantage of this training while being mindful of the risk to their time, given the remaining uncertainty about when they will receive data access and about the scope of data to be provided. If the computer systems Facebook is building and the data we hope to provide pass security, privacy, accuracy, and functionality testing, Facebook has promised to make available the full URLs database described above for research.
We hope that this project’s past pace in delivering data to researchers is not indicative of future results. We understand that many of the delays are due to the unprecedented nature of this project and heightened scrutiny by the public and regulators in numerous jurisdictions of anything related to user data. However, Facebook is a company that has achieved amazing things when it sets its collective mind to it; we believe Facebook leadership when they express their deep commitment to our goals, but faster results are necessary to the future of the project.
We are grateful to all of our partners in this effort. Facebook’s team has been dedicated to building an entirely new infrastructure for researchers and applying the highest possible standards of privacy. The Social Science Research Council, the fiscal agent for our effort, has put together a highly capable panel of peer reviewers and is administering the grants. The eight charitable foundations have given generously to our project and continue to provide helpful advice and guidance. Finally, all the academics involved in Social Science One take seriously our responsibility to ensure that researchers obtain data that is informative while privacy-protected. We remain hopeful but also realistic, and so transparency about how this project is progressing remains a top priority.