Understanding and avoiding risks associated with machine learning

Written by Imran Malek, Technology Law Clinic

Introduction

Nearly every aspect of our lives is either enhanced by or depends on computing technology, whether through the networks and platforms that we access using the devices on our desktops, under our TVs, or in our pockets. With the advent of machine learning, the technology that essentially allows computers to make informed decisions from up-to-date data with minimal user input, and with the “perfect storm” of ubiquitous connected computing devices and affordable data processing, we are now at a place where the more we use our technology, the better (or at least more tailored) our experiences can be. Machine learning plays a big part in most of our lives. Among many other things, technology that uses machine learning powers our spam filters, optimizes grocery store inventory to prevent food waste due to spoilage, and even curates our digital radio stations to the point where we never feel the need to skip a track.

With machine learning’s ubiquity, however, come pitfalls. Although often implemented with the best of intentions, machine learning has in some areas reinforced biases, stifled opportunities, or produced wholly negative outcomes.[1]

The benefits, and these specific drawbacks, of machine learning also point to another area of concern: personal data privacy. As more and more of our lives are lived online, we produce more data. This increase in production has led to a data “gold rush” in which businesses compete to collect as much data as possible so that they can make the most informed decisions. With such significant financial implications comes increased scrutiny of the data that we produce or that is produced on our behalf. In the context of innovative applications of machine learning, where personal data is often a critical part of the algorithmic decision-making process, legislation like the European Union’s General Data Protection Regulation (“GDPR”) provides rights to end users, and a startup that fails to respect those rights could face serious consequences. With these drawbacks also come legal challenges that a startup implementing machine learning may face, including challenges arising from the unauthorized use of data to “train” machine learning algorithms, algorithmic bias and anti-discrimination law, and collecting, processing, and transmitting data at the expense of user privacy.

With all that in mind, these posts will dive into machine learning and elaborate on the legal issues surrounding it through three major topics:

  1. An introduction to machine learning
  2. How startups can source the right data for a machine learning algorithm while maintaining user privacy, especially in the context of relevant laws like the GDPR
  3. Implementing machine learning models in production while ensuring transparency and promoting fairness in decision making

An Introduction to Machine Learning

While there are plenty of online tutorials (including one of the first and most well-known Massive Open Online Courses) that discuss the details, techniques, and nuances associated with machine learning (concepts like “supervised learning” or “naive Bayes classification,” for example), for this post and for the other posts in this series we will simplify machine learning with a definition paraphrased from Sujit Pal and Antonio Gulli:

Machine learning focuses on teaching computers how to learn from and make predictions based on data.

The data that feeds those predictions, especially today, usually encompasses what the business and technology worlds have labeled “big data.” Without going into too much detail, we can define big data and its application to machine learning through the “three Vs”:

  1. Volume: The amount of data out there is vast, and continues to expand every second.
  2. Variety: As more and more devices, applications, and services are brought online, data is generated from a wide variety of sources and in a wide variety of formats.
  3. Velocity: Data is now collected in real time, and real time collection, when paired with cheap computing power, means that data can instantly be used.

The scale and speed at which data is generated, classified, and stored make it impossible for humans alone to analyze, interpret, and execute on that data. Accordingly, machine learning has become one of the tools that businesses, governments, researchers, and even individuals use to influence or even make decisions.

The actual mechanism of turning data into a decision is usually referred to in terms of “models.” These models represent collections of patterns that, when combined, produce generalizable trends that empower decision making. To put it simply, models are generalized mathematical representations of real-world processes, such that when data is input into a model, the result mirrors the real-world output.[2] Since processes in the real world are often the result of multiple independent variables, models need to be “trained” in order to be accurate. It is through this training that the “learning” behind machine learning becomes evident.
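To make the “train, then predict” idea concrete, here is a minimal sketch in Python using scikit-learn. This is my own illustration rather than anything from the post: it assumes the real-world process being modeled is roughly linear, and the numbers are made up. The point is only that the model’s parameters are estimated from example data and then reused on new input.

```python
# Minimal illustration of the fit/predict pattern (toy numbers, purely illustrative).
from sklearn.linear_model import LinearRegression

# Observed examples of a real-world process we assume is roughly linear,
# e.g. some measured input and the outcome we recorded for it.
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [18.0, 21.0, 24.5, 27.0]

model = LinearRegression()
model.fit(X_train, y_train)        # "training": parameters are estimated from the data

print(model.predict([[5.0]]))      # the trained model now generalizes to an unseen input
```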

To understand training, and machine learning in general, let’s think through a simple example where I want to build a tool that helps me make a decision that many of us face every day: what to order at a coffee shop. In this simple example, I want my model to take in three variables and then use them to decide which drink I should buy (a rough sketch of how a program might represent these variables appears after the table):

| Variable | Possible Values |
| --- | --- |
| Time of Day | Morning, Afternoon, Evening |
| Temperature Outside | Hot, Cold |
| Day of Week | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
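Purely as an illustration (the post itself does not include any code), a program might describe one coffee run as a dictionary of these three variables and then convert it into numeric features that an algorithm can work with. The field names below (day, time, temp) are my own labels, and scikit-learn’s DictVectorizer is just one convenient way to do the conversion:

```python
# Illustration only: representing one coffee run as a dictionary of the three
# variables above, then converting it into numeric ("one-hot") features.
from sklearn.feature_extraction import DictVectorizer

sample_visit = {"day": "Monday", "time": "Morning", "temp": "Cold"}

vectorizer = DictVectorizer(sparse=False)
features = vectorizer.fit_transform([sample_visit])

print(vectorizer.get_feature_names_out())  # the named columns a model would "see"
print(features)                            # the numeric row representing this visit
```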

Right now, without any training, my model doesn’t know what to do, so I need to train it. To do so, I’ll start by keeping track of one week’s worth of my typical drink purchases:

| Day of Week | Time of Day | Temperature Outside | Drink I Ordered |
| --- | --- | --- | --- |
| Monday | Morning | Cold | Hot Coffee |
| Tuesday | Morning | Hot | Iced Coffee |
| Wednesday | Morning | Cold | Hot Coffee |
| Thursday | Evening | Cold | Iced Herbal Tea |
| Friday | Afternoon | Hot | Hot Coffee |
| Saturday | Evening | Cold | Hot Herbal Tea |
| Sunday | Morning | Hot | Iced Coffee |

Immediately, you can see that a pattern is forming: when it’s Hot outside, I prefer cold drinks, and when it’s Cold outside, I prefer hot drinks. You can also see that in the Evening, I prefer to drink tea instead of coffee.
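To sketch what “training” on this week of data might look like in code, here is an illustrative example in Python. The post does not prescribe any particular algorithm; a decision tree classifier from scikit-learn is used here simply because it handles a small set of categorical examples like this one in an easy-to-read way:

```python
# Illustrative training sketch: a decision tree is one plausible model choice,
# not the one the post prescribes.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# The week of purchases from the table above, as (variables, drink) pairs.
week = [
    ({"day": "Monday",    "time": "Morning",   "temp": "Cold"}, "Hot Coffee"),
    ({"day": "Tuesday",   "time": "Morning",   "temp": "Hot"},  "Iced Coffee"),
    ({"day": "Wednesday", "time": "Morning",   "temp": "Cold"}, "Hot Coffee"),
    ({"day": "Thursday",  "time": "Evening",   "temp": "Cold"}, "Iced Herbal Tea"),
    ({"day": "Friday",    "time": "Afternoon", "temp": "Hot"},  "Hot Coffee"),
    ({"day": "Saturday",  "time": "Evening",   "temp": "Cold"}, "Hot Herbal Tea"),
    ({"day": "Sunday",    "time": "Morning",   "temp": "Hot"},  "Iced Coffee"),
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([visit for visit, _ in week])  # one-hot encode the variables
y = [drink for _, drink in week]                            # what I actually ordered

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                                             # "train" on the week of data
```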

When the new week starts and it’s Monday, I load my now-trained model into an application on my phone, go to my local coffee shop, and try to make a decision. I go in the Morning, and it happens to be Cold outside, so naturally, my model suggests that I should get a Hot Coffee. This is great! I don’t have to worry about making a decision. I subsequently order my hot coffee and make sure to record that it was Morning, it was Cold outside, and I ordered a Hot Coffee on a Monday. I then input this data into my model. At this point, I’ve not only trained my model, I’ve also provided feedback that helps validate it.
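Continuing the illustrative sketch above (which defined week, vectorizer, and model), the Monday morning suggestion and the feedback step might look something like this:

```python
# Continuing the sketch above (week, vectorizer, and model already defined):
# ask for a suggestion, then feed the confirmed purchase back in.
monday = {"day": "Monday", "time": "Morning", "temp": "Cold"}
print(model.predict(vectorizer.transform([monday]))[0])  # expected: "Hot Coffee", matching last Monday

# Record the confirmed outcome and retrain, so the feedback reinforces the pattern.
week.append((monday, "Hot Coffee"))
X = vectorizer.fit_transform([visit for visit, _ in week])
y = [drink for _, drink in week]
model.fit(X, y)
```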

If we were making a “production ready” machine learning model (for example, one that might be built into a coffee chain’s mobile app), there would be many more variables factored into the algorithm, including: the weather on the current day, what people around me have ordered, what month it is, whether or not I ordered food, which coffee shop location I happen to be visiting, what genre of music is playing in the coffee shop, and many, many, many more!

Now, let’s take it one step further and talk about what might happen if there’s no training data available. On the next day, a Tuesday, I skip my morning cup of coffee and decide to walk into the shop on a Cold Afternoon. My model is left with a dilemma: there’s no training data for this exact combination of variables. Does it suggest an Iced Coffee, because that’s what I ordered last Tuesday? Or does it suggest a Hot Coffee, because that’s what I ordered the last time I went to my coffee shop in the Afternoon? While a human might intuitively think that temperature is a more important factor than day of the week when it comes to making a beverage decision, my model doesn’t know that, so it randomly picks from the two likely options and suggests an Iced Coffee. Scoffing at the prospect of drinking something cold on a cold day, I reject that suggestion, record my rejection, and order a Hot Coffee. My model now knows that on Cold Tuesday Afternoons, I drink Hot Coffee. It has also learned, based on my decision, that when I face a combination I haven’t seen before, I weigh the temperature as a more decisive factor than the day of the week.
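Continuing the same illustrative sketch, asking the model about this never-before-seen combination shows the dilemma in code. A library model will still return some suggestion, along with how confident it is; exactly how it breaks the tie depends on how the model was built, not on any human intuition about temperature:

```python
# Continuing the same sketch: a combination of variables the model has never seen.
unseen = {"day": "Tuesday", "time": "Afternoon", "temp": "Cold"}
X_unseen = vectorizer.transform([unseen])

print(model.predict(X_unseen)[0])                                   # the model still returns *some* suggestion
print(dict(zip(model.classes_, model.predict_proba(X_unseen)[0])))  # and its confidence for each drink

# After I reject the suggestion and order a Hot Coffee instead, that correction
# becomes one more training example, exactly like the Monday purchase above.
week.append((unseen, "Hot Coffee"))
```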

In the real world, the data used for training can come from a variety of sources, but it most often comes from historical data that was collected, analyzed, and processed by humans. Unfortunately, since that data is derived from human behavior, historical data may also reinforce preexisting biases (predictive policing algorithms are one frequently cited example).

Conclusion

In the next post of this series, we’ll cover the relevant laws that touch on algorithmic bias, and we’ll also provide recommendations to organizations using machine learning on how to diagnose and overcome bias in the tools and technologies that they build.

References

[1] For an example of what some international territories are doing to address this, see this article from Scientific American: https://www.scientificamerican.com/article/the-harm-that-data-do/

[2] Data scientists commonly use the terms ‘model’ and ‘algorithm’ interchangeably, as both represent mathematical processes that can be tuned with training. See https://www.datasciencecentral.com/profiles/blogs/a-tour-of-machine-learning-algorithms-1?overrideMobileRedirect=1 (articulating common machine learning models available to programmers)
