This module presents some information on numerical analysis (random number generation and floating point precision), packaging, reproducibility, and automation.
Random numbers on a computer are actually (periodic) deterministic sequences that (hopefully) behave as if they were random.
The seed determines where in the cycle of the periodic sequence you are.
Setting the seed is critical for ensuring reproducibility when running code that uses random numbers.
```python
import numpy as np

np.random.seed(1)
np.random.normal(size = 5)
```
```
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])
```
```python
np.random.normal(size = 5)
```
```
array([-2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038])
```
```python
np.random.seed(1)
np.random.normal(size = 10)
```
```
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763,
       -2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038])
```
If you use the legacy interface shown above (np.random.seed and functions such as np.random.normal), numpy uses a generator called the Mersenne Twister (that's the default in R too). However, the "default" RNG used by numpy's newer Generator interface (np.random.default_rng) is a better algorithm called PCG64.
```python
## Mersenne twister
np.random.seed(1)
np.random.normal(size = 5)
```
```
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])
```
```python
## Mersenne twister, selected specifically
rng = np.random.Generator(np.random.MT19937(seed = 1))
rng.normal(size = 5)  ## Not clear why numbers differ from above. Seed setup might differ.
```
```
array([ 2.04676909, -1.03653009,  0.72997546,  0.41584996, -0.1922969 ])
```
```python
## 'Default' PCG64
rng = np.random.Generator(np.random.PCG64(seed = 1))
rng.normal(size = 5)
```
```
array([ 0.34558419,  0.82161814,  0.33043708, -1.30315723,  0.90535587])
```
```python
rng = np.random.default_rng(seed = 1)
rng.normal(size = 5)
```
```
array([ 0.34558419,  0.82161814,  0.33043708, -1.30315723,  0.90535587])
```
Some cautions about these pseudo-random number generators are detailed by Philip Stark and Kellie Ottoboni.
Next, let's consider floating point precision.

```python
0.3 - 0.2 == 0.1
```
```
False
```
```python
0.3 - 0.2
```
```
0.09999999999999998
```
```python
0.1   # Hmmm...
```
```
0.1
```
Numbers in most languages are stored by default as 8-byte (64-bit) double precision floating point numbers. (GPU libraries often use 4 bytes or even fewer.)
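For instance, one quick way to see the effect of precision is to compare single and double precision representations of the same value:

```python
import numpy as np

## 4-byte (single) precision carries only about 7 decimal digits,
## versus about 16 digits for 8-byte (double) precision.
print(format(float(np.float32(0.1)), '.20f'))
print(format(np.float64(0.1), '.20f'))
```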
```python
def dg(x, form = '.20f'):
    print(format(x, form))

a = 0.3
b = 0.2
dg(a)
dg(b)
dg(a-b)
dg(0.1)
```
```
0.29999999999999998890
0.20000000000000001110
0.09999999999999997780
0.10000000000000000555
```
How many digits of accuracy do we have? About 16 decimal digits, and that's the same regardless of the magnitude of the number:
```python
12345678.1234567812345678
```
```
12345678.123456782
```
```python
12345678123456781234.5678
```
```
1.234567812345678e+19
```
```python
dg(12345678123456781234.5678)
```
```
12345678123456780288.00000000000000000000
```
This occurs because of how the 64 bits in a double precision floating point number are allocated to give a large range of numbers in terms of magnitude and high precision in terms of digits of accuracy.
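For instance, numpy can report how those 64 bits are allocated (one sign bit, 11 exponent bits, and 52 bits for the significand) and the limits that result:

```python
import numpy as np

info = np.finfo(np.float64)
print(info.nmant, info.nexp)   # bits for the significand and the exponent (52 and 11)
print(info.eps)                # machine epsilon, about 2.2e-16 (roughly 16 digits)
print(info.max, info.tiny)     # largest and smallest (normalized) positive values
```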
Let's see what kinds of numbers we can safely compare for exact equality with ==.
```python
x = .5 - .2
x == 0.3

x = 1.5 - .2
x == 1.3

x = .5 - .5
x == 0

x = .5 - .2 - .3
x == 0

x = (.5 - .2) - (.51 - .21)
x == 0
np.isclose(x, 0)
```
Let’s consider accuracy of subtraction.
```python
1.1234 - 1.0
```
```
0.12339999999999995
```
```python
123456781234.1234 - 123456781234.0
```
```
0.1233978271484375
```
This is called catastrophic cancellation.
Do you think calling it “catastrophic” is too extreme? If so, consider this.
```python
81.0 - 80.0
```
```
1.0
```
```python
12345678123456781.0 - 12345678123456780.0
```
```
0.0
```
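To put a number on it: each stored value is accurate to about 16 digits, but after the cancellation the error can be large relative to the result. A quick check using the earlier example (0.1234 is the mathematically exact difference):

```python
a = 123456781234.1234
b = 123456781234.0

## The computed difference is off in roughly the 5th significant digit,
## i.e., a relative error around 1e-5, even though a and b are each
## accurate to about 16 digits.
print(abs((a - b) - 0.1234) / 0.1234)
```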
What's the derivative of $\sin(x)$? Let's see what kind of accuracy we can get.
```python
def deriv(f, x, eps=1e-8):
    return (f(x+eps) - f(x)) / eps

np.cos(0.2)
deriv(np.sin, 0.2)
deriv(np.sin, 0.2, 1e-12)
deriv(np.sin, 0.2, 1e-14)
deriv(np.sin, 0.2, 1e-15)
deriv(np.sin, 0.2, 1e-16)
deriv(np.sin, 0.2, 1e-17)
```
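To make the pattern easier to see, one could compare each estimate to the true derivative, np.cos(0.2) (a small sketch along the lines of the calls above):

```python
import numpy as np

true_value = np.cos(0.2)
for eps in [1e-4, 1e-8, 1e-12, 1e-14, 1e-15, 1e-16, 1e-17]:
    est = (np.sin(0.2 + eps) - np.sin(0.2)) / eps
    print(f"eps = {eps:.0e}: error = {abs(est - true_value):.2e}")
```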
The errors seen in doing calculations, such as catastrophic cancellation, cascade through the derivative estimation. The same thing happens when doing linear algebra operations.
Here’s an example where mathematically all the eigenvalues are real-valued and positive, but not on the computer.
```python
import scipy as sp

xs = np.arange(100)
dists = np.abs(xs[:, np.newaxis] - xs)
# This is a p.d. matrix (mathematically).
corr_matrix = np.exp(-(dists/10)**2)
# But not numerically...
sp.linalg.eigvals(corr_matrix)[80:99]
```
```
array([ 1.43024256e-16+1.28433951e-16j,  1.43024256e-16-1.28433951e-16j,
        1.49210347e-16+4.87070558e-17j,  1.49210347e-16-4.87070558e-17j,
       -1.54661687e-16+1.25601177e-16j, -1.54661687e-16-1.25601177e-16j,
       -2.07808543e-16+3.73312254e-17j, -2.07808543e-16-3.73312254e-17j,
        9.83783628e-17+7.20174529e-17j,  9.83783628e-17-7.20174529e-17j,
        2.37821218e-18+1.00101181e-16j,  2.37821218e-18-1.00101181e-16j,
        5.29254936e-17+0.00000000e+00j, -1.62444574e-16+0.00000000e+00j,
       -2.74621910e-17+7.55223038e-17j, -2.74621910e-17-7.55223038e-17j,
       -9.08016006e-17+3.79199791e-17j, -9.08016006e-17-3.79199791e-17j,
       -3.25243092e-17+0.00000000e+00j])
```
Having a finite number of bits to represent each number means there are minimum and maximum numbers that can be expressed.
```python
1.38e5000
```
```
inf
```
```python
1.38e-400
```
```
0.0
```
Q: Roughly how many observations would it take before we underflow in calculating a likelihood?
```python
import numpy as np
import scipy as sp

n = 10
x = np.random.normal(size=n)
np.prod(sp.stats.norm.pdf(x))

loglik = np.sum(sp.stats.norm.logpdf(x))
loglik
np.exp(loglik)
```
```
2.0739518337814167e-08
```
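One rough, back-of-the-envelope way to answer the question: the product of densities underflows once its log drops below the log of the smallest (normalized) double, about -708. For standard normal data the log density averages about -1.4 per observation, so underflow occurs at roughly 500 observations:

```python
import numpy as np
import scipy as sp

tiny = np.finfo(np.float64).tiny                  # smallest normalized double, ~2.2e-308
x = np.random.normal(size = 1000)
mean_logpdf = np.mean(sp.stats.norm.logpdf(x))    # roughly -1.4 for N(0,1) data
print(np.log(tiny) / mean_logpdf)                 # roughly 500 observations
```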
Look at your partner’s code in terms of how it handles the stopping criterion or (for the 1-d case) the finite difference estimate of the gradient and Hessian.
Fork your partner’s repository. In the local version of that repository on your DataHub, make a change to try to improve how the stopping criterion or finite difference estimate is handled. Then make a pull request back into your partner’s repository.
Here are the detailed steps of how to do all that:
1. Fork your partner's repository by going to https://github.com/<username>/newton-practice/fork. You can name the fork newton-practice-<partner_name> or something else.
2. Clone your fork: git clone https://github.com/<your_username>/newton-practice-<partner_name>.
3. cd newton-practice-<partner_name>.
4. Make your change and commit it, then push it: git push -u origin newton-practice-<partner_name>.
5. Go to https://github.com/<your_username>/newton-practice-<partner_name> and open a pull request ("Compare and pull request"). The request will automatically be made in the GitHub account of your partner.
6. Then review the pull request your partner makes in your own repository. Iterate with your partner until you are happy to merge the pull request into your repository. Remember to then run git pull to get the changes into the local copy of the repository.
What’s a package? In general it is software that is provided as a bundle that can be downloaded and made available on your computer. Some packages (e.g., R, Python themselves) are stand-alone packages. Others (such as R packages and Python packages) are add-on functionality that works with and extends the functionality of the stand-alone software.
What do we need to have a Python package?
A Python package is a directory containing one or more modules, along with a file named __init__.py that is run when the package is imported and serves to initialize the package.
Let’s create a basic package.
```bash
mkdir mypkg

cat << EOF > mypkg/__init__.py
## Make objects from mymod.py available as mypkg.foo
from mypkg.mymod import *
print("Welcome to my package.")
EOF

cat << EOF > mypkg/mymod.py
x = 7

def myfun(val):
    print("The arg is: ", str(val), ".", sep = '')
EOF
```
Note that if there were other modules, we could have imported from those as well.
Now we can use the objects from the module without having to know which module they were defined in (because of how __init__.py was set up).
```python
import mypkg
mypkg.x
mypkg.myfun(7)
```
Note that one can set __all__
in an __init__.py
to define what is imported, which makes clear what is publicly available and hides what is considered internal.
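For example, a minimal __init__.py along those lines might look like this (a sketch; the mypkg example above doesn't actually set __all__):

```python
## mypkg/__init__.py
from mypkg.mymod import x, myfun

## Only these names are exported by `from mypkg import *`,
## signaling that everything else is considered internal.
__all__ = ['x', 'myfun']
```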
Fernando has this toy example package.
Let’s clone the repository and see if we can install it.
```bash
git clone https://github.com/fperez/mytoy
cd mytoy
pip install --user .
cd
ls .local/lib/python3.11/site-packages
```
```python
import mytoy
mytoy.toy(7)
```
Let's look in the top directory of the repository. There we see various files related to building and installing the package, namely pyproject.toml, setup.py, setup.cfg, etc.
In fact, one can install the package with only either setup.py or pyproject.toml, but the other files listed here are recommended:
- pyproject.toml: a configuration file used by packaging tools. In the mytoy example it specifies to use setuptools to build and install the package.
- setup.py: this is run when the package is built and installed using setuptools. In the example, it simply runs setuptools.setup(). With recent versions of setuptools, you don't actually need this so long as you have the pyproject.toml file.
- setup.cfg: provides metadata about the package when using setuptools.
- environment.yml: provides information about the full environment in which your package should be used (including examples, documentation, etc.). For projects using setuptools, a minimal list of dependencies needed for installation and use of the package can instead be included in the install_requires option of setup.cfg.
- LICENSE: specifies the license for your package, giving the terms under which others can use it.
- postBuild: a completely optional file, only needed if you want to use the package with a MyBinder environment.
At the numpy GitHub repository, by looking in pyproject.toml, you can see that numpy is built and installed using a system called Meson, while at the Jupyter GitHub repository you can see that the jupyter package is built and installed using setuptools.
Building a package usually refers to compiling source code but for a Python package that just has Python code, nothing needs to be compiled. Installing a package means putting the built package into a location on your computer where packages are installed.
It would take more work to make the package available in a repository such as PyPI or via Conda.
Make a package out of your Newton method code, following the structure of mytoy. Include your tests as part of the package.
See if you can install your package, run the tests, and use the package, following the README.md of mytoy.
If a package is on PyPI or available through Conda but not on your system, you can install it easily (usually). You don’t need root permission on a machine to install a package, though you may need to use pip install --user
or set up a new Conda environment.
Packages often depend on other packages. If one package depends on another, pip or conda will generally install the dependency automatically.
One advantage of Conda is that it can also install non-Python packages on which a Python package depends, whereas with pip you sometimes need to install a system package to satisfy a dependency.
It’s not uncommon to run into a case where conda has trouble installing a package because of version inconsistencies amongst the dependencies. mamba
is a drop-in replacement for conda
and often does a better job of this “dependency resolution”. We use mamba
by default on the SCF.
With newer versions of Conda, you can use the libmamba
dependency “resolver” by running conda config --set solver libmamba
, which adds solver: libmamba
to your .condarc
file.
For reproducibility, it’s important to know the versions of the packages you use (and the version of Python). pip
and conda
make it easy to do this. You can create a requirements file that captures the packages you are currently using (and, critically, their versions) and then install exactly that set of packages (and versions) based on that requirements file.
```bash
## Reproducing using pip.
pip freeze > requirements.txt
pip install -r requirements.txt

## Reproducing a Conda environment.
conda env export > environment.yml
conda env create -f environment.yml
```
Conda is a general package manager. You can use it to manage Python packages but lots of other software as well, including R and Julia.
Conda environments provide an additional layer of modularity/reproducibility, allowing you to set up a fully reproducible environment for your computation. Here (by explicitly giving python=3.11) the Python 3.11 executable and all packages you install in the environment are fully independent of whatever Python executables are installed on the system. (That said, there are some caveats that can make it a bit of a headache to ensure full isolation.)
```bash
type python
conda create -n my_iso_env python=3.11
source activate my_iso_env
type python
conda install numpy
```
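Once the environment is activated, one way to confirm from within Python which installation you're actually using:

```python
import sys

print(sys.executable)   # path to the Python binary (should point into my_iso_env)
print(sys.prefix)       # root directory of the active environment
```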
If you use conda activate
rather than source activate
, Conda will prompt you to run conda init
, which will make changes to your ~/.bashrc
that, for one, activate the Conda base environment automatically when a shell is started. This may be fine, but it’s helpful to be aware.
Packages in Python (and in R, Julia, etc.) may be installed in various places on the filesystem, and it is sometimes helpful (e.g., if you end up with multiple versions of a package installed on your system) to be able to figure out where on the filesystem a package is being loaded from.
- pkgname.__file__ will show where the imported package is installed.
- pkgname.__version__ will show the version of the package (as will pip list or conda list, for all packages).
- sys.path shows where Python looks for packages on your system.
- Packages installed with pip install --user will generally be in ~/.local/lib/python3.X/site-packages.
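For example (using numpy just as an illustration; the paths will differ by system):

```python
import sys
import numpy

print(numpy.__version__)   # version of the imported package
print(numpy.__file__)      # where it was loaded from
print(sys.path)            # directories Python searches for packages
```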
What's the difference between a source package and a binary package? If you install a package from source, C/C++/Fortran code will be compiled on your system (if the package has such code). That should mean the compiled code will work on your system, but it requires you to have a compiler available and things properly configured. A binary package doesn't need to be compiled on your system, but in some cases (e.g., if you got the wrong binary version) the code may not run on your system because it was compiled in a way that is not compatible with your system.
Python wheels are a binary package format for Python packages. Wheels for some packages will vary by platform (i.e., operating system) so that the package will install correctly on the system where it is being installed.
Having reproducible environments (e.g., Conda environments or Docker containers) provides the ability to run (and automate) various actions to run on remote/cloud computational resources and know that they (should) work.
Docker containers share some similarities with Conda environments except that you provide a full specification for the Linux operating system you want to use and software you want installed.
You need to provide the recipe for the environment or container. The automation system can then set up the environment as a virtual environment/container on the remote machine and run the action/code you request. This could simply be giving you a terminal or starting a Jupyter notebook in the environment and giving you access via your browser.
In this section, we’ll show some examples of automation with reproducible environments.
There are a lot more out there. In fact, DataHub is based on Docker containers (and a container management service called Kubernetes). Let’s see what operating system is being run in the container that your DataHub server is using.
```bash
$ cat /etc/issue
Ubuntu 22.04.4 LTS \n \l
```
This is all quite powerful, and the use of reproducible environments generally helps with debugging because you can also set up the environment on your own machine. That said, running code remotely or in the cloud can also be hard to debug at times.
One example is using Binder to open a Jupyter notebook in an executable environment.
Based on a configuration (various options are possible for how to specify the configuration), Binder will create a Docker container with the needed software installed.
The most common configuration format is to provide a Conda environment file. Binder will then create a Conda environment inside the Docker (Linux) container. The Conda environment will contain the packages specified in the environment file.
Let's set up a Binder environment for the "Geospatial analytics in Python with Geopandas" repository, so that we can access the notebooks and materials provided in the repository in a computational environment designed to run the notebooks reliably and reproducibly.
In some cases an image for the container will already be available, and all that needs to happen is to start a virtual machine using that image as the computational environment. In other cases (e.g., with a configuration you’ve just created), the Docker container and Conda environment need to be set up.
If we’re familiar with what happens when Docker containers and Conda environments are set up, we can see in the Binder logs that a Docker container is set up (using Ubuntu Linux) and one of the steps of doing that is to create a Conda environment within the container.
Let's see the steps that are run if we use the mytoy repository to start a Binder environment at ovh.mybinder.org, specifying fperez/mytoy as the repository. (Note that in a number of cases the startup has failed, probably because mybinder has very limited resources for providing cloud compute. We are using ovh.mybinder.org rather than mybinder.org in hopes that it is more likely to succeed.)
A second example is using GitHub Actions (GHA) to automate activities such as testing. To set up a GitHub Actions workflow, one specifies, in a file in the repository, the steps to run and when to run them (e.g., on each push). A good example is running tests whenever you make a commit to a repository containing software.
With GHA, you specify the operating system and the steps to run in .github/workflows/some_action.yml. The initial steps customize the environment, and then additional step(s) run shell or other code to carry out your workflow. You can use pre-specified operations (called actions) to do common things (such as checking out a GitHub repository and installing commonly used software).
When triggered, GitHub will run the steps in a virtual machine, which is called the runner.
We’ll walk through the steps of setting up automated testing for the mytoy
example package. Why do this? It’s common to have testing pass locally on your own machine, but fail for various reasons when done elsewhere.
The workflow is specified using a YAML file placed in the .github/workflows
directory of the repository.
Here’s an example YAML file for testing the mytoy
package:
```yaml
on:
  push:
    branches:
      - main

jobs:
  CI:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install dependencies
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.12
      - name: Install mytoy
        run: |
          # There is a small bug here.
          cd mytoy
          pip install --user .
      - name: Run tests
        run: |
          # Another small bug here.
          cd mytoy
          pytest
```
We’ll first run this with the current tests for mytoy
, which should pass when run via GitHub Actions. We can take a look at the output.
Next, let’s introduce a failing test to see what the output looks like and get a sense for how to debug it.
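For instance, a hypothetical failing test one could drop into mytoy's test file (the actual test module and names in mytoy may differ) just to see how GitHub Actions reports a failure:

```python
def test_deliberate_failure():
    ## Fails unconditionally, so the "Run tests" step of the workflow
    ## will show up as a failing job in the Actions tab.
    assert False, "deliberate failure, to inspect the GitHub Actions output"
```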
We can use GitHub Actions with GitHub Pages to easily publish websites: we make changes to the repository, and the workflow causes the website to be updated.
Here’s the GitHub Actions workflow (a YAML file) that publishes the website for the spring 2023 version of Statistics 159/259. Any commits made in the main
branch cause the website to be updated.
To set up GitHub Pages for a repository, go to https://github.com/github_org/repository/settings/pages
and select Source -> Deploy from a branch
and then choose gh-pages
as the branch (first creating the gh-pages
branch if necessary). The website should appear at github_org.github.io/repository
.
In fact, the website for this workshop was made using Quarto, which uses GitHub Pages and GitHub Actions behind the scenes.
Chris runs quarto publish gh-pages to update the site. Quarto will (on the local machine) render the HTML (convert the Markdown to HTML), and then behind the scenes Quarto uses GitHub Actions and GitHub Pages (via the gh-pages branch) to publish the website.
This is the first time we’ve done this, so it’s particularly valuable now to get feedback on it. Please fill out this brief survey.