Encryption and open-source

Say you have a project that's some kind of model. It takes some input dataset, some configuration provided by users, and spits out some results when you run it. You'd love to open-source it so that others can use it, but the problem is the raw input data used by the model is confidential. The output from the model sufficiently obfuscates the confidential data, so you want people to be able to use the model but without having access to the input data.

In a closed-source world you'd simply provide a "black box", hiding your model code and data.

But you want users who don't have access to the confidential data to still be able to run your model, and you'd also like collaborators (who do have access to the confidential data) to get involved. And you want the model code itself to be publicly available.

How can you achieve this in an open-source context? Here's one way of doing it, and some pitfalls to watch out for. We'll use Python, but you'd be able to do the same thing in pretty much any language.

Disclaimer

I'm not a security expert. Below are some basic guidelines and examples. You may want to consult a security professional if you are doing this seriously.

Encrypt the data

It goes without saying, if you're using confidential data, it's your responsibility to safeguard it. Python has a handy encryption package, uncryptically called cryptography. We'll just use its Fernet recipe (symmetric encryption built on 128-bit AES), so all we need to do is generate a key:

from cryptography.fernet import Fernet

key = Fernet.generate_key()
print("ENCRYPTION_KEY=%s" % key.decode("utf-8"))

The same key is used to encrypt the data and, later, to decrypt it. Running the snippet prints a base64-encoded key, something like this:

ENCRYPTION_KEY=2b6sNDrSiuFMDmG58vliW8jzeHy5i91KQA5pIkLsRnw=
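
To see what "the same key encrypts and decrypts" means in practice, here is a minimal round trip. It's only a sketch: the `secret` bytes are made up and stand in for the real dataset, and the key is a throwaway one.

```python
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()        # 32 random bytes, url-safe base64 encoded
fernet = Fernet(key)

secret = b"id,salary\n1,50000\n"   # made-up stand-in for the confidential data
token = fernet.encrypt(secret)     # the ciphertext, also base64 encoded

# The same key recovers the plaintext exactly
recovered = fernet.decrypt(token)

# A different key is rejected outright rather than producing garbage
try:
    Fernet(Fernet.generate_key()).decrypt(token)
    wrong_key_rejected = False
except InvalidToken:
    wrong_key_rejected = True
```

Note that decrypting with the wrong key raises an exception, so a mistyped or stale key fails loudly instead of silently giving you nonsense.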

We can then use this key to encrypt our confidential dataset, delete the plaintext, then subsequently use the key to decrypt the dataset. So we need to keep this information safe and secret. The obvious solution is to put it in a file, say encryption_key. This is where the first pitfall lies - you definitely don't want this file ending up in source control, or you'll have exposed the key to everyone. Simple - make sure you add the filename to your .gitignore file.
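
If you do keep the key in a file, a couple of small guards help. The sketch below is an illustration, not a complete solution: `save_key` and `is_ignored` are hypothetical helper names, the permission bits only mean something on POSIX systems, and the .gitignore check is deliberately naive (it only spots a verbatim entry; `git check-ignore` is the authoritative test).

```python
import os
from pathlib import Path


def save_key(key: str, path: Path) -> None:
    """Write the key to a new file that only the current user can read."""
    # O_EXCL refuses to overwrite an existing key file;
    # 0o600 means owner read/write only (largely ignored on Windows)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(key)


def is_ignored(filename: str, gitignore: Path) -> bool:
    """Naive check: is the filename listed verbatim in .gitignore?"""
    if not gitignore.exists():
        return False
    entries = [line.strip() for line in gitignore.read_text().splitlines()]
    return filename in entries
```

You might call `is_ignored("encryption_key", Path(".gitignore"))` before writing the key and refuse to continue if it returns False.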

Sharing the key

When another researcher - one you trust - wants to run the software, you need to provide them with the key (and then trust that they also keep the key safe).

When someone else wants to run your code - someone you don't fully trust - there's a problem: they can't run your model on their computer without knowing the secret key, and you're not prepared to give it to them. But you'd like them to be able to run your model, so you need to somehow provide access to your model, but without access to the key. This is where the cloud comes in...

App Services

You can use a cloud-based app service to achieve this. Your model will need a browser interface or API (not covered here), but this means anyone can use it without being able to see the internals, especially the secret key. Or can they? Typically, you'd create a Docker container with your app running in it for easy cloud deployment. Sorted? Not necessarily. The problem is, your container needs the secret key to work, and if you bake the key into the image and store that image in a public repository, then you've exposed the secret key to everyone, just as if you'd put the key in your (public) GitHub repository, because anyone can pull your image and inspect the contents.

You could make your image private, but this goes against the grain of the open-source ethos. Or perhaps you could store your key in a (private) docker volume and mount it when you deploy your app. But there is a simpler solution: use an environment variable. This way, you can deploy your container in the cloud, and configure the app service's environment to have a variable containing the key. Only those with access to the app service configuration can see it, and you would never allow your app service configuration to be public, as you'd expose yourself to all sorts of potential hackery.
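
Since the app now depends entirely on that variable, it's worth checking it once at startup so a misconfigured deployment fails immediately. A minimal sketch, assuming the variable is called ENCRYPTION_KEY as elsewhere in this post; `load_key` is a hypothetical helper. It relies on the fact that a Fernet key is 32 bytes in url-safe base64, and its error messages never echo the key itself.

```python
import base64
import binascii
import os


def load_key(env=os.environ, name="ENCRYPTION_KEY"):
    """Return the key as bytes, or raise an error that does not reveal it."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    try:
        raw = base64.urlsafe_b64decode(value)
    except (binascii.Error, ValueError):
        # 'from None' keeps the offending value out of the traceback chain
        raise RuntimeError(f"{name} is not valid base64") from None
    if len(raw) != 32:
        raise RuntimeError(f"{name} must decode to 32 bytes, got {len(raw)}")
    return value.encode("ascii")
```

The returned bytes can be passed straight to `Fernet(...)`.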

When your model runs, you must ensure you don't save the decrypted dataset to the filesystem. Somebody might be able to inspect your container while it's running and get hold of the plaintext dataset, and locally, you again run the risk of accidentally committing the dataset to source control.

This way you can let people use your app service without ever exposing the key, or the plaintext dataset.

The code

So you need an environment variable containing a secret. You could add this to your global environment, but then every process you run knows the secret key, which is a potential security risk. As it's only used by this particular application, set it on a "need-to-know" basis and keep it self-contained within the package itself, in a file (which must of course be in both .gitignore and .dockerignore). Python has a handy package called python-dotenv which is designed for this purpose (other languages have similar). You call the function load_dotenv() at startup and it will "source" your local environment from a file (by default it searches for a file named .env, starting from the calling script's directory and working upwards). This file just needs to contain the text from above:

ENCRYPTION_KEY=2b6sNDrSiuFMDmG58vliW8jzeHy5i91KQA5pIkLsRnw=

Let's say the dataset is a csv file and you want to load it into a pandas dataframe. We can wrap the details in read and write functions. The first thing you need to do is load your plaintext dataset, encrypt it, save it, and delete the plaintext:

import os
from io import BytesIO
import pandas as pd
from cryptography.fernet import Fernet
from dotenv import load_dotenv

# get key from .env file, if present
load_dotenv()

def _get_key():
  key = os.getenv("ENCRYPTION_KEY")
  if key is None:
    raise EnvironmentError("ENCRYPTION_KEY not set")
  return key

def encrypt_csv(dataframe, filename):
  """ Encrypts a dataframe and saves to filesystem in csv format """
  data = BytesIO()
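  # index=False: don't write the pandas index, or it comes back as an extra "Unnamed: 0" column
  dataframe.to_csv(data, index=False)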

  fernet = Fernet(_get_key())
  encrypted = fernet.encrypt(data.getvalue())

  # Write the encrypted file
  with open(filename, 'wb') as fd:
    fd.write(encrypted)

plaintext = "secret-data.csv"
dataset = pd.read_csv(plaintext)
encrypt_csv(dataset, plaintext + ".enc")
os.remove(plaintext)

The dataframe is saved in csv format to a memory buffer (not a file) which is then encrypted and saved.

Then, when you want to load the data, just do this:

def decrypt_csv(data_file):
  """ Loads a dataframe from an encrypted csv file """
  with open(data_file, 'rb') as f:
    encrypted = f.read()

  fernet = Fernet(_get_key())
  data = fernet.decrypt(encrypted)

  df = pd.read_csv(BytesIO(data))
  return df

ciphertext = "secret-data.csv.enc"
dataset = decrypt_csv(ciphertext)
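
Before you delete any real plaintext, it's worth convincing yourself that the round trip works. The check below is a self-contained sketch of the same approach rather than a test of the functions above: it uses a throwaway key, a made-up dataframe and a temporary directory. It also shows that Fernet tokens are authenticated, so a corrupted or tampered file is rejected rather than decrypted into garbage.

```python
import tempfile
from io import BytesIO
from pathlib import Path

import pandas as pd
from cryptography.fernet import Fernet, InvalidToken

fernet = Fernet(Fernet.generate_key())   # throwaway key, for this check only
original = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "check.csv.enc"

    # Encrypt: serialise to csv in memory, write only the ciphertext
    path.write_bytes(fernet.encrypt(original.to_csv(index=False).encode("utf-8")))

    # Decrypt: read the ciphertext, decrypt in memory, parse
    restored = pd.read_csv(BytesIO(fernet.decrypt(path.read_bytes())))

    # Tamper with one character of the stored token
    damaged = bytearray(path.read_bytes())
    damaged[20] = ord("A") if damaged[20] != ord("A") else ord("B")
    try:
        fernet.decrypt(bytes(damaged))
        tamper_detected = False
    except InvalidToken:
        tamper_detected = True

# Raises AssertionError if the restored data differs from the original
pd.testing.assert_frame_equal(original, restored)
```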

Takeaways

  • whatever you do, don't leak secrets to GitHub, Docker Hub or any other public repository
  • ensure - without compromising security - that you don't lose your key
  • don't share your key with anyone you don't fully trust
  • use local environment variables to store the secret key, in combination with dotenv if appropriate
  • decrypt data in-memory, do not write plaintext data to the file system
  • if in doubt, talk to a security expert