GDPR and the Importance of Data to AI Startups

James Bessen, Stephen Michael Impink, Lydia Reichensperger, and Robert Seamans

The European Union’s General Data Protection Regulation (GDPR) has been in effect for two years, yet its impact on the businesses that must comply with it continues to be debated. There is an apparent tradeoff between safeguarding consumers through increased privacy protections and preserving businesses’ access to the valuable consumer data they use to train algorithms. Furthermore, these policies disproportionately affect small businesses, particularly those that rely on large datasets to develop and train artificial intelligence (AI) products and services. Understanding the full impact of these regulations on the AI startup community will help identify unintended consequences and inform policymakers’ responses.

This study relies on unique survey data from AI startups in more than 25 countries and explores the relationship between GDPR and the development of AI-related products at these startups. GDPR is intended to protect consumers’ privacy by requiring companies that retain personal data to have adequate safeguards in place, a concern made more pressing by the growing number of massive data breaches, some of which have leaked the personal data of millions. More specifically, the survey asks new questions about the effects of data regulation on startups and their ability to develop their products and compete for customers. It also provides some initial evidence on the relationships among data access and the means of access, firm strategy and technology choices, and aspects of data privacy regulation.

AI firms of all sizes rely on data for their business. About half of the firms surveyed retain some secondary reuse rights to their customers’ data and therefore have some form of data retention policy. More than half (54%) responded that expertise in data science is more important than either training data or computing resources. That is not a surprise, given the value of and demand for highly skilled data scientists and engineers. About 42% ranked training data as most important, which is also not surprising given the data intensity of AI training.

Algorithm-based AI products depend on large amounts of data to train and tune algorithms. Some AI algorithms, such as neural networks and ensemble learning, support more complex tasks and require more training data. For example, training a chatbot to comprehend natural language requires far more data than training a simpler tool such as a recommendation system. The value of training data was underscored by the 60% of firms that said they would refresh their training data more often if given hypothetical access to unlimited data. More than two-thirds of firms (68%) reported that owning data provides a major advantage in their respective markets. This is particularly true for firms that develop neural network and ensemble learning algorithms as opposed to other, less sophisticated algorithms. At the same time, however, nearly three-quarters of firms report deleting some data because of GDPR. With so great a need for data, deleting data to comply with GDPR could seriously impair the ability of AI startups to innovate and could even dampen AI advancement more broadly.
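As a rough, hypothetical illustration of this point (not drawn from the study, and assuming scikit-learn and a synthetic classification dataset), the sketch below compares how a simple linear classifier and a small neural network improve as the amount of training data grows; the more flexible model typically needs more examples before its extra capacity pays off.

# Hypothetical sketch (not from the paper): accuracy of a simple linear model
# vs. a small neural network as the amount of training data increases.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a proprietary training set.
X, y = make_classification(n_samples=20_000, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

for n in (200, 1_000, 5_000, 15_000):  # increasing amounts of training data
    linear = LogisticRegression(max_iter=1_000).fit(X_train[:n], y_train[:n])
    neural = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                           random_state=0).fit(X_train[:n], y_train[:n])
    print(f"n={n:>6}  linear acc={linear.score(X_test, y_test):.3f}  "
          f"neural net acc={neural.score(X_test, y_test):.3f}")

In a run of this kind, the linear model usually plateaus early while the neural network keeps improving as more data is added, which is why deleting training data weighs most heavily on firms building more complex models.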

Larger, more established firms generally have an advantage over startups because they have additional resources, such as slack IT capacity, customer data, and a breadth of supplier and customer relationships. Smaller firms, and startups in particular, may not have ready access to sufficient data to train advanced AI algorithms and may find it difficult to compete. This data disparity was identified in a previous survey of AI startups as a hurdle startups face in developing AI products and services. Larger firms may also have an advantage because they can more easily create new positions specifically to handle GDPR issues or otherwise reallocate resources to deal with GDPR’s impact.

While GDPR applies only to firms that do business in the EU, it has become the de facto standard for data privacy management worldwide. For example, the California Consumer Privacy Act (CCPA) is based at least in part on GDPR. Startups aiming for swift growth and outside funding may want to convince potential funders that they comply with GDPR even before they are required to. Therefore, even the smallest companies often adopt procedures and safeguards similar to those adopted by firms that must comply with GDPR.

Regulations like GDPR, which protect customers’ personally identifiable information, must strike a balance between the needs of consumers and the larger societal benefits of improved AI technologies. GDPR’s requirements may place an additional burden on AI startups, forcing them not only to reallocate resources to ensure compliance but also to delete some data that could otherwise be used to develop and train sophisticated AI algorithms involving neural networks and ensemble learning. Especially at startups, where every dollar and every employee needs to be devoted to developing and selling products and services, reallocating resources and deleting data can hinder innovation and growth. These costs may lessen a startup’s ability to compete with established firms and could ultimately harm consumers.

SSRN working paper.