Verifying PyPI and Conda Packages

Before we get started, I must confess something: I know very little about package verification, cryptography, or security. This is really a brain dump of everything I have learned from browsing the internet over the last few months.

The Question

The first thing to think about when considering package verification is not the technical implementation of anything; it is what we are actually trying to achieve by using cryptography at all.

Is the package I am installing the one the package maintainer uploaded?

This, I think, is the question. I am the maintainer of the sunpy package, and if you are installing SunPy and you are security conscious, you probably want to be able to check that the thing you are installing on your machine (potentially with root access) is indeed the code I uploaded. This places an inherent trust in me, the maintainer, that I am not deliberately going to mess with your computer, but this is probably as good as you are going to get.

So taking this a little further, what does trusting me actually mean? When using PGP (we will get on to this later) people often talk about a web of trust: you trust Bob, who trusts me, so you can trust me. This model has many flaws, mainly that not enough people use PGP to make it work. It also misses the point: what you are actually trusting is the git repo I am building the source tarball from. You are trusting that I trust the source I am releasing in the first place.

My GitHub Account is what you are Trusting.

This, I think, is the core point I am making here: what you are trusting is effectively my GitHub account. Someone with access to my GitHub account could make changes to the SunPy repo and do something malicious that I might not notice when I do a release. (You are also trusting all the other people with merge rights, but one thing at a time.)

My GitHub Account is Linked to my PGP key

My Keybase Profile

So you can know that the person who has control over the PGP key with the fingerprint 60BC5C03E6276769 has access to my GitHub account, and therefore can change the source code you are installing. I also sign my commits with the same PGP key, so you know that they all came from the person with control of that key, and that that person has control of the Cadair account.

My Trust Model

I think this can be summarised as: "You trust me to write your software, so you trust me to run it on your computer." In terms of PGP this equates to: "This key signed these commits and is linked to this GitHub account; I therefore trust a package made with it."

This is subtly but importantly different from the normal trust model of PGP, where you are trying to verify that the key belongs to, and is controlled by, me, the real person Stuart Mumford. In this model, knowing that it is the same person who has control of the git repo and GitHub account is sufficient.

Using the Trust Model

This is the second stage. We have determined that we are happy to trust a key that we can associate with a GitHub account and some of the commits in a repo, so how do we use this to verify that the same key is the one that signed the package we are about to install?

PyPI (pip install)

Currently when I upload the SunPy release to PyPI I can sign it with my PGP key and upload the signature alongside the source. The problem is that you would have to run the following commands:

curl https://keybase.io/Cadair/key.asc | gpg --import
pip download --no-deps sunpy
wget https://pypi.io/packages/source/s/sunpy/sunpy-0.7.2.tar.gz.asc
gpg --verify sunpy-0.7.2.tar.gz.asc sunpy-0.7.2.tar.gz
pip install sunpy-0.7.2.tar.gz

All of this just to verify that the package you downloaded was indeed signed by the correct PGP key, which is silly. There is a little work going on to make this easier, e.g. pypa/pip#1035, but various efforts like this seem to be stalling over worries about the trust model and about lulling users into a false sense of security.
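
As a stopgap you can wrap those manual steps in a small script. This is only a sketch of the idea (the package name, version and key URL are hard-coded assumptions here, and it is no substitute for proper tooling in pip itself):

import subprocess

package = "sunpy"   # assumed package name
version = "0.7.2"   # assumed release
sdist = "{0}-{1}.tar.gz".format(package, version)
sig_url = "https://pypi.io/packages/source/s/{0}/{1}.asc".format(package, sdist)

# Import the maintainer's public key from Keybase.
subprocess.check_call("curl -sL https://keybase.io/Cadair/key.asc | gpg --import", shell=True)

# Download the source tarball and its detached signature.
subprocess.check_call(["pip", "download", "--no-deps", "--no-binary", ":all:",
                       "{0}=={1}".format(package, version)])
subprocess.check_call(["wget", sig_url])

# gpg exits non-zero if the signature does not match, so check_call raises
# and we never reach the install step with a tampered package.
subprocess.check_call(["gpg", "--verify", sdist + ".asc", sdist])
subprocess.check_call(["pip", "install", sdist])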

Conda Package Builds

Recently I started using the excellent conda-forge project to build conda binaries for SunPy. This makes it much easier for people to install and use SunPy, but now you not only have to trust the source file, you also have to trust the built binary package.

What does trusting the binary look like? Do you trust that the build bot has been honest and built the package from source correctly? Should conda-forge have a PGP key with which the build bot signs the packages, so you know it was indeed built by the build bot? How can conda-forge verify the integrity of the source file, and how does it pass that trust on to the end user?

I don't know the answers to any of these questions, and the trust model gets much trickier when you start considering a build bot running on CI services, where even more people could interfere with the process. My current opinion is that conda-forge could do GPG verification on the source from PyPI and then sign the binaries, so that you know that conda-forge trusted the original source and that you have downloaded a binary built on the conda-forge build bots. Is that enough?
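
To make that opinion a little more concrete, the build-bot step I have in mind would be something like the sketch below. The file names and the "conda-forge-build-bot" key are purely hypothetical:

import subprocess

# Verify the upstream sdist against the maintainer's published key before building.
subprocess.check_call(["gpg", "--verify", "sunpy-0.7.2.tar.gz.asc", "sunpy-0.7.2.tar.gz"])

# ... build the conda package from the verified source ...

# Sign the resulting binary with a (hypothetical) conda-forge build bot key,
# producing a detached signature that users could check before installing.
subprocess.check_call(["gpg", "--detach-sign", "--armor",
                       "--local-user", "conda-forge-build-bot",
                       "sunpy-0.7.2-py35_0.tar.bz2"])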

Conclusions

TL;DR: Python package signing is hard, but I think possible from the pip perspective, once you quantify the minimum you want from a trust model. Conda binaries are a whole different problem, and I can't see an easy solution.

Please direct comments to @stuartmumford.

Jupyter Notebook and Conda

This post is going to be the first in a series of posts about the Jupyter Notebook and the supercomputing facility 'iceberg' at Sheffield University.

This post is about a plugin for the Jupyter Notebook I have written to make it easier to work with Jupyter and the conda Python package manager, specifically its fantastic environments feature, which allows you to have multiple versions of Python and different stacks of packages installed alongside one another.

When working with conda and the Jupyter Notebook you can create a different environment, install Jupyter into it, and then use the notebook from within that environment. This might look something like this:

conda create -n numpy python=3 numpy jupyter
source activate numpy
jupyter notebook

This approach works fine, but what happens if you want to switch to running your current notebook in the "numpy-1.9" environment instead, to test it with a previous version of NumPy? You would have to do this:

Stop the notebook server, then:

source deactivate
source activate numpy-1.9
jupyter notebook

Then reload the notebook you had open before.

What my Notebook plugin does is enable you to switch environments from within a running notebook server, by using the "kernel" feature of the Notebook.

Jupyter Kernel Switching

Each entry in the kernel list above that starts with 'Environment' is a conda environment that has Jupyter installed within it, and you can start a notebook using any of those environments.

The plugin that enables this is jupyter_environment_kernels (catchy name, I know). It looks in the directories you specify for installed environments which have Jupyter installed (i.e. the ipython executable is in the bin/ directory) and lists them as kernels for Jupyter to find. This makes it easy to run one notebook instance and seamlessly access kernels with different versions of Python or different sets of modules.

To solve our earlier problem of "live" switching the kernel we can use the Kernel > Change Kernel menu:

Jupyter Notebook Switch Running Kernel

Installing jupyter_environment_kernels

Installation of the package is easy: just run

pip install environment_kernels

from within the environment in which you want to run the notebook server.

Then run:

jupyter notebook --generate-config

to generate a Jupyter notebook config file (if you already have one then skip this step). Finally, edit the config file it has generated (by default this is ~/.jupyter/jupyter_notebook_config.py) and add the following two lines:

c.NotebookApp.kernel_spec_manager_class = 'environment_kernels.EnvironmentKernelSpecManager'
c.EnvironmentKernelSpecManager.conda_env_dirs = ['~/.conda/envs']

The first line tells the notebook to use the environment_kernels module to manage the kernels, and the second line lists all the directories in which to look for environments with ipython executables. By default (i.e. if you don't provide the second line) it will look in ~/.conda/envs and ~/.virtualenvs where the top level directory is assumed to be the name of the environment and then it looks inside the bin directory for ipython.
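
Under the hood that discovery is little more than a directory scan. A rough sketch of the idea (not the plugin's actual code, and the function name is mine):

import os
import glob

def find_env_kernels(env_dirs=('~/.conda/envs', '~/.virtualenvs')):
    """Return {environment name: path to its ipython} for environments with Jupyter installed."""
    kernels = {}
    for env_dir in env_dirs:
        for env_path in glob.glob(os.path.join(os.path.expanduser(env_dir), '*')):
            ipython = os.path.join(env_path, 'bin', 'ipython')
            if os.path.exists(ipython):
                # The top-level directory name is taken to be the environment name.
                kernels[os.path.basename(env_path)] = ipython
    return kernels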

It is also possible to configure the package to use the conda terminal command to find your environments. This will only work if conda is available from where you ran the notebook command (i.e. you installed the notebook using conda). To use this you just need the:

c.NotebookApp.kernel_spec_manager_class = 'environment_kernels.EnvironmentKernelSpecManager'

configuration line.
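
For reference, asking conda for its environments boils down to something like this sketch (not the plugin's internal code):

import json
import subprocess

# `conda env list --json` prints a JSON document whose "envs" key lists
# the environment prefixes known to conda.
output = subprocess.check_output(["conda", "env", "list", "--json"])
env_paths = json.loads(output.decode("utf-8"))["envs"]
print(env_paths)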

This module is still very new, so if you encounter any issues please raise them on GitHub.

The Fibonacci Sequence

This post compares a few implementations of calculating the first 10,000 numbers in the Fibonacci sequence.

The Fibonacci sequence is defined such that the next number in the sequence is the sum of the previous two:

i.e.

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377

This is some prep work I did to test numba on the HPC cluster at Sheffield University. I am hoping to get more fancy stuff running using numba on the cluster.

First off we start with a nice simple and short Python implementation:

In [1]:
def fib(n):
    fib = [0, 1]
    while len(fib) < n+1:
        fib.append(fib[-2] + fib[-1])
    return fib[1:]

This implementation creates a list containing the first two elements (some consider the leading 0 as part of the sequence) and then loops until the output list is the correct length (+1 because of the 0), appending the next element as the sum of the previous two.

In [2]:
%timeit fib(10000)
100 loops, best of 3: 7.39 ms per loop

This implementation is respectable, but not exactly fast.

Next up we are going to use a little more modern Python magic to see if we can make a faster pure Python implementation. This uses a generator, which stores its state when 'yield' is called; the next time the iterator is advanced it resumes from where it left off.

In [23]:
def fib_generator(n):
    previous, current = 0, 1
    yield current
    i = 1
    while i < n:
        previous, current = current, previous + current
        yield current
        i += 1
In [24]:
%timeit list(fib_generator(10000))
100 loops, best of 3: 5.88 ms per loop

This is a little quicker, which is nice.

Next up we are going to try and use numpy. In this example we define an array the length of the desired sequence and then fill it up in a for loop:

In [5]:
import numpy as np
In [106]:
def fib_numpy(n):
    fib = np.zeros(n)
    fib[1] = 1
    for i in range(2, n):
        fib[i] = fib[i-2] + fib[i-1]
    return fib[1:]
In [107]:
%timeit fib_numpy(10000)
100 loops, best of 3: 5.86 ms per loop
/home/cs1sjm/.conda/envs/numba/lib/python3.4/site-packages/ipykernel/__main__.py:5: RuntimeWarning: overflow encountered in double_scalars

This does not gain us much, if anything, over the generator example.

Now we are going to use numba, a just-in-time compilation library for Python, which makes things SUPER speedy!

In [8]:
from numba import jit
In [88]:
@jit
def loop(fib):
    for i in range(len(fib)):
        fib[i] = fib[i-2] + fib[i-1]
    return fib
    
def fib_numba(n):
    fib = np.zeros(n)
    fib = loop(fib)
    return fib
In [89]:
%timeit fib_numba(10000)
The slowest run took 2075.80 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 45 µs per loop

This makes a MASSIVE difference! In fact it is probably too much of a difference; something fishy is probably going on here.
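
One likely culprit: as written, loop starts at i = 0 on an array of zeros and fib_numba never sets fib[1] = 1, so every element stays zero and we are timing a loop that never actually produces the Fibonacci sequence (the "slowest run" message is just the cost of JIT compilation on the first call). A corrected version, which I would still expect to be much faster than the NumPy loop but is a fairer comparison, might look like this:

from numba import jit
import numpy as np

@jit
def loop(fib):
    # Start at 2 so the seeded values fib[0] = 0 and fib[1] = 1 are not overwritten.
    for i in range(2, len(fib)):
        fib[i] = fib[i-2] + fib[i-1]
    return fib

def fib_numba(n):
    fib = np.zeros(n)
    fib[1] = 1
    return loop(fib)[1:]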

Finally we shall use the Cython package for completeness:

In [14]:
%load_ext Cython
In [108]:
%%cython
import numpy as np
cimport numpy as np

def fib_cython(int n):
    cdef np.ndarray fib = np.zeros(n)
    cdef int i
    
    fib[1] = 1
    for i in range(2, n):
        fib[i] = fib[i-2] + fib[i-1]
    
    return fib[1:]
In [109]:
%timeit fib_cython(10000)
100 loops, best of 3: 2.28 ms per loop
/home/cs1sjm/.conda/envs/numba/lib/python3.4/site-packages/ipykernel/__main__.py:257: RuntimeWarning: overflow encountered in double_scalars

This isn't a bad speed-up, and there are more optimisations you can do with Cython.
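
For example, typing the array buffer and turning off bounds and wraparound checking usually buys a further speed-up. A sketch of that (untimed here, and the function name is just illustrative):

%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def fib_cython_typed(int n):
    # Declaring the element type lets Cython index the array without Python overhead.
    cdef np.ndarray[np.float64_t, ndim=1] fib = np.zeros(n)
    cdef int i

    fib[1] = 1
    for i in range(2, n):
        fib[i] = fib[i-2] + fib[i-1]

    return fib[1:]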

Welcome!

Welcome!

This is my horrifically underused blog. Everything posted here will probably be interesting to me.

This Blog

This website and blog are built using the Nikola project. It means that I can write the pages in Markdown and compile them into a static site, which I then FTP to my web host. I write this blog using Jupyter (IPython) Notebooks, which, when combined with a separate blog metadata file, are compiled into HTML and appear here for you to read!

This post for instance has the following meta data file:

In [1]:
!cat welcome.meta
.. title: Welcome!
.. slug: welcome
.. date: 2015-04-01 20:19:09 UTC
.. tags: 
.. link: 
.. description: 
.. type: text

For more information on setting up this Jupyter Notebook blogging platform see this excellent post by Damian Avila.
