Intro to Privacy Preserving ML

Because privacy is power.

Ria Kulshrestha
9 min read · Jul 29, 2022

Note

This article contains my notes from my initial reading about privacy-preserving ML for a hackathon project some two years ago (so it doesn’t cover the latest state of the art), along with some added context on why the topic is relevant and what its impact could be.

Introduction

Remember the 90s (or was it the 00s?), when the Internet was the latest buzz? Everyone was super excited about building the next big thing on the Internet, and for the longest time nobody paid attention to security. It was a good invention even then, but once the security aspect was nailed down, or at least seriously attempted, the Internet truly flourished and became the magical, addictive, ubiquitous and larger-than-life entity we all know, love and sometimes hate! The Internet is now leveraged for services ranging from online banking and stock trading to retail, among many others that would have been unimaginable without security.

I believe we are in a similar phase with ML. Yes, I’m envious too when an AI draws or writes better content than I do, but there is plenty more that becomes possible once we can ensure security and privacy in its development process.

Security Vs Privacy — They are NOT the same!

Security is about ensuring that only authorised people have access to something, while privacy is about who gets to decide who is authorised. You cannot have privacy without security, but you can have security without privacy.

Think of it this way: a hotel room is secure — only people with a key can enter it. But it is not private — any member of the hotel staff with the master key can also enter it, without your knowledge or consent.

Meanwhile, your house is both secure and private, because you own the keys and control their distribution as well.

Let’s see what the current ML scenario looks like: are we secure but lacking privacy, are we both, or are we neither?

ML 2020 — An Analysis

Most ML models are trained on large, publicly available, unencrypted datasets. Publicly available, unencrypted data is accessible to almost anybody, so it is neither secure nor private.

The training is followed by fine-tuning on a downstream task. The dataset used here can be public or private, depending on the nature of the downstream application and the trainer’s discretion, but it has to be unencrypted. Again, due to the lack of encryption, this is potentially insecure: at the very least, the people training the model have eyes-on access to the data.

How will privacy benefit ML?

An ML model’s performance is largely driven by the data it is trained on, and the availability of datasets varies substantially with the quality and domain of the data one has access to. Most commonly available datasets contain trivial information. They are very useful for learning the basics of ML but can seldom be used directly to solve real problems afflicting real people. Cute dogs and cats can only save the world to an extent.

If you were to set out to solve a novel and impactful problem, the first hurdle you’ll most likely face is collecting relevant data. Not only will you need to find people/organisations in possession of such data, you will also need access to it in an unencrypted state. *legal stuff, better lawyer up!*

ML is, at its core, a way of pattern matching. Privacy is of utmost importance in many of the domains that could benefit from it. Examples: matching early or rare symptoms with the correct diagnosis, improving the quality of lifestyle and health research, or better personalised recommendations used not as a marketing strategy but as a way to tackle over-consumerism. Getting access to such data is a tedious job, and rightfully so: it is very personal, critical information that can easily be misused and abused. This creates a major bottleneck and deters many real-life applications from reaping the benefits of the advances in ML. There is little use for those state-of-the-art results if we can’t use them to solve real problems.

Doing it like the Internet <Ctrl + C && Ctrl + V>

It is evident that this is not the first time in history that concern for privacy has restricted development. Not so long ago, many naysayers believed the Internet would never take off, and now it is IMPOSSIBLE to imagine a world without it. It clearly did take off; you are reading this right now. So can we do the same for ML?

There are plenty of exchanges happening all over the Internet that deal with sensitive information: passwords, bank details, medical information, photographs, official documents; you name it and it is on the Internet, i.e. potentially available to prying eyes. Yet it is not the concern one might expect it to be: things can be available online and still be secure and private, so that only you and the people you authorise are able to access them. Why can’t we do the same for ML? Fewer eyes on the data would mean exposing private information only to trusted individuals. One can work with that, or can they?

Suppose only a few chosen individuals/groups get “eyes-on” access. This could work, but it would, intentionally or unintentionally, hamper progress in a totally different way by creating an uneven playing field. Who gets access to the data? How do we ensure an unbiased distribution? What would the screening process for making such a decision look like? It is too much power for a single authority, and the last thing we need is more prejudice and more barriers to entry in this industry (as if the existing educational/experience/financial/legal requirements aren’t discouraging enough).

Privacy Preserving ML

Privacy in ML is not completely unheard of. It is not the most lucrative area of research at the moment, but it is starting to grow. It is not in a place to compete with the state of the art yet, but some great progress has been made. It allows model trainers to develop models while keeping the privacy of the data owners in mind. Here are a few ways this is done:

Remote Execution/Federated Learning

There are two parties in an exchange: the model owner and the data owner. Currently, the data owner has to extend trust before sharing the data with the model owner. Once the data is with the model owner, you HAVE to trust them to do the right thing; you really can’t do much about it. What if there were a way to train a model without explicitly handing over the data?

This is what remote execution allows. Instead of you sharing your data, the model owner shares the model with various data owners. The model trains on the data, possibly on the data owner’s machine, and the updated model (think of the updated weights after an epoch) is returned to the model owner. This, in some way, distributes the trust between the data owner and the model owner.
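
Here is a minimal sketch of one such round, using a toy linear model and plain numpy. The function names and training details are mine for illustration, not from any specific framework (libraries such as PySyft build this kind of remote execution with far more machinery around serialisation and trust).

```python
import numpy as np

# A toy "model": linear-regression weights held by the model owner.
def make_model(n_features, seed=0):
    return np.random.default_rng(seed).normal(size=n_features)

# Runs on the DATA OWNER's machine: the raw data (X, y) never leaves it.
def local_train(weights, X, y, lr=0.05, epochs=5):
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w  # only the updated weights travel back

# --- Model owner's side ---
w_global = make_model(n_features=3)

# --- Data owner's side (private data, stays local) ---
rng = np.random.default_rng(42)
X_private = rng.normal(size=(100, 3))
y_private = X_private @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# The model owner ships the weights out; the data owner ships updated weights back.
w_updated = local_train(w_global, X_private, y_private)
print("weights before:", w_global)
print("weights after :", w_updated)
```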

The data never leaves the participants’ devices and the ML models of the world can still train on it; sounds like a win-win. The data didn’t leave, but some information it held did. This is alright and expected, because that is the point of an ML model — to extract information from data. The real question is: is your privacy preserved in this exchange?

The answer is no. Even if the data stays securely with you and the model owner’s eyes never actually see it, they can in some cases deduce aspects of the data, even if the data owner is anonymised. If that sounds far-fetched, read about how people have de-anonymised data by correlating information from two different datasets.
The main concern (ignoring the combining of datasets for now) is that if the model learns a new behaviour after training on the data of user X, then the model owner can conclude that user X’s data must have contained that information. Simplified example: if the model learns how to identify a new dog breed after training on user X’s data, the model owner can conclude that user X must own a dog of that breed. That may not seem like a big deal in this example, but the whole point of this exercise is to make sensitive and personal data usable. The user also still needs to trust the model owner to do the right thing, including not simply copying the data in the name of training. And the model owner needs to trust the data owner not to copy the model or tamper with it.

Federated learning is similar to remote execution. In federated learning, the computations are done on the data owners’ remote devices, and the results are uploaded to the cloud, where multiple such results are combined to update a global model. The minor difference I have noticed, based on what I have read, is that in federated learning the results are uploaded to the cloud whenever they are available, similar to a pub-sub model in some ways, whereas remote execution is more driven by the model owners: they reach out to data owners when they need something, rather than listening for updates all the time. This makes sense when the model owner is interested in niche data, e.g. data about cancer and anything and everything medical.
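
To make the “combined to update the global model” step concrete, here is a minimal sketch of federated averaging on the same toy linear model: several simulated data owners run local updates and the server averages the returned weights. Weighting each client by its dataset size follows the common FedAvg recipe and is my assumption, not something spelled out above.

```python
import numpy as np

def local_update(weights, X, y, lr=0.05, epochs=5):
    """Gradient steps run on a single data owner's device."""
    w = weights.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def federated_average(updates, sizes):
    """Server side: weight each client's result by how much data it had."""
    total = sum(sizes)
    return sum(u * (n / total) for u, n in zip(updates, sizes))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
w_global = np.zeros(3)

# Simulate a few data owners, each with a private shard that never leaves them.
clients = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 3))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

for _ in range(20):  # 20 federated rounds
    updates = [local_update(w_global, X, y) for X, y in clients]
    w_global = federated_average(updates, [len(y) for _, y in clients])

print("recovered weights:", w_global)  # should end up close to true_w
```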

Differential Privacy*

The above-mentioned con can be overcome by using differential privacy in conjunction with remote execution. In formal terms, differential privacy places an upper bound on how much any single person’s data can influence the output, known as the privacy budget. In layman’s terms, we add some good old noise. So the model owner can never be 100% sure, in cases such as the one mentioned above, that user X owns a dog of the newly learnt breed, because the behaviour could have been due to noise.

From what I understand, this budget is not only limited per query but also applies to the overall dataset. So you cannot query it multiple times in the hope of recreating the original set. Example: you query the same user 10 times, notice that 8 out of 10 times the results look similar, and deduce which responses are noise and which aren’t. This won’t work, because the overall budget is limited as well. How that is ensured is a topic for a different blog.
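
As a rough illustration of both ideas, noise on every answer plus a finite overall budget, here is a minimal sketch using the Laplace mechanism on a counting query. The class name, the epsilon values and the simple subtract-and-refuse accounting are illustrative assumptions, not how a production DP library does it.

```python
import numpy as np

rng = np.random.default_rng(0)

class PrivateCounter:
    """Answers counting queries with Laplace noise and a total privacy budget."""

    def __init__(self, data, total_epsilon=0.5):
        self.data = data                # e.g. one record per user
        self.remaining = total_epsilon  # overall budget for this dataset

    def count(self, predicate, epsilon=0.1):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted, no more queries")
        self.remaining -= epsilon
        true_count = sum(predicate(x) for x in self.data)
        # A count changes by at most 1 if one person joins or leaves
        # (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
        return true_count + rng.laplace(scale=1.0 / epsilon)

owners_of_rare_breed = {"userX"}
counter = PrivateCounter(["userA", "userB", "userC", "userX"])

# Each query spends part of the budget; repeating the query to average away
# the noise exhausts the budget long before the true answer becomes clear.
for i in range(7):
    try:
        noisy = counter.count(lambda u: u in owners_of_rare_breed)
        print(f"query {i}: noisy count = {noisy:.2f}")
    except RuntimeError as err:
        print(f"query {i}: {err}")
```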

If you are feeling a little sour about the added noise reducing the quality of your results, don’t. A little noise in a dataset is encouraged because it helps prevent the ML model from over-fitting. Better generalisation is more likely to give better results on test data, and when we actually use the model in real-life applications.

It is not always practical to be the sole custodian of your own data. For both the data owner and the model owner, reaching out to (or being reached out to by) large numbers of users is impractical and, honestly, annoying. Both parties need to check whether they can trust each other, and so on. Plus, how does the data stay anonymous if the model owner is reaching out to individuals directly?

Secure Multi-Party Computation

As the name suggests (that’s why you name things), it allows you to securely distribute data across multiple parties in an encrypted form (better anonymisation) while still allowing the training computation to occur.

You basically take the data, divide it into smaller chunks, and encrypt them. Each chunk is “owned” by a different party who, for obvious reasons, cannot decrypt it on their own. When a model owner needs to train their model, they ask all chunk holders to send their chunks to a remote machine, the model is sent to that same machine, and it trains on the data there. Alternatively, the owner sends the model out to train with all the share-holders of the chunks. No decryption is required at any step in the process. (This is a rough overview of what happens; the actual details of various implementations may vary.)
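
Here is a minimal sketch of additive secret sharing, the basic trick behind many SMPC protocols: a value is split into random-looking shares that are individually meaningless, yet the parties can still compute sums (and hence things like aggregated statistics or model updates) on their shares, reconstructing only the final result. The three-party setup and the modulus are arbitrary choices for illustration.

```python
import random

Q = 2**31 - 1  # public modulus; all shares live in the integers mod Q

def share(secret, n_parties=3):
    """Split a secret into random shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; nothing is ever decrypted."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]

# Two private values, e.g. statistics from two data owners.
x, y = 42, 100
x_shares, y_shares = share(x), share(y)

print("one party's view of x:", x_shares[0])  # looks like random noise
print("x + y via shares     :", reconstruct(add_shared(x_shares, y_shares)))  # 142
```

Multiplying shared values needs extra machinery (e.g. Beaver triples), which is part of why only the more basic operations are practical today, as the next paragraph notes.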

The individual parties cannot decrypt the data on their own. For additional security, the model owners can also encrypt their models, so both the data owners and the model owners can feel a lot more comfortable during collaborations. You cannot do every kind of computation in this encrypted state, but most of the basic ones work. So the fancy transformers and the like can’t simply be adapted to a privacy-preserving way of training.

Conclusion

Privacy-preserving ML is an exciting area of research and is in its early phase. The solutions we have might not yet give the results we want or expect from ML models, but the area has a lot more to offer and will improve tremendously in the years to come. (I’m sure it already has in the two years since I wrote this.)

Resources

Obviously there are more caveats and intricacies involved here, and this is merely a simplified overview of all the sub-topics mentioned. If you want a follow-up on any of them, or want to explore any other concept in this space, please feel free to reach out to me. :D

*Differential privacy is a very interesting topic on its own, and here I have only covered what it does at a high level, not how. Most of its claims have mathematical proofs; if you want me to explore it in depth, please reach out.


Ria Kulshrestha

AI enthusiast currently exploring SE @Google. Claps/Shares/Comments are appreciated 💖 https://twitter.com/Ree_____Ree