Basic Statistics and Timeseries with pandas

Robert Taylor December 08, 2015

This post is mostly just a refresher course for myself in how to perform basic statistics in Python. Also, I’m currently playing around with pandas, which is pretty much awesome for data analysis.

Basic Statistics

We’ve got to start somewhere, so why not start with the theoretical basics: random variables, means, variances, and correlation.

First, the multivariate random variable (also called a random vector):

\[\vec X=\begin{bmatrix} X_1\\ X_2\\ \vdots\\ X_n \end{bmatrix}\]

Every random variable can be characterized by a special function called a joint distribution function:

\[F(x_1, \dots, x_n)=P\{X_1\le x_1,\dots,X_n\le x_n\}\]

A closely related function is the joint probability density function. This function differs for continuous random variables (defined on a continuous sample space) and discrete random variables (defined on a countable sample space). For continuous random variables, the density function is:

\[f(x_1, \dots, x_n)=\frac{\partial^n F(x_1, \dots, x_n)}{\partial x_1 \dots \partial x_n}\]

The two functions are related by:

\[F(x_1, \dots, x_n)=\int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f(t_1, \dots, t_n)\ dt_1\dots dt_n\]

For discrete random variables, we get similar equations. The density function for discrete random variables is:

\[f(x_1,\dots,x_n)=P\{X_1=x_1,\dots,X_n=x_n\}\]

And it is related to the distribution function by:

\[F(x_1,\dots,x_n)=\sum_{t_1\le x_1}\dots\sum_{t_n\le x_n} f(t_1, \dots, t_n)\]

Distributions can have any number of what are called moments. The most relevant moments are the mean (first moment) and the variance (second central moment). They are defined in terms of an expectation operator \(E\). The expectation, or mean, of a continuous random variable is:

\[\mu=E[X]=\int_{-\infty}^{\infty}x\ f_X(x)\ dx\]

For a discrete random variable, the mean is:

\[\mu=E[X] = \sum_x x\ P\{X=x\}\]

The variance is then:

\[\sigma^2=Var[X]=E[(X-E[X])^2]=E[(X-\mu)^2]=E[X^2]-\mu^2\]

By extending these concepts to the realm of multivariate random variables we get the vector mean:

\[\vec\mu=E[\vec X]=\begin{bmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_n \end{bmatrix}\]

And the covariance matrix:

\[\mathbf\Sigma=\begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \dots & \sigma_{1,n} \\ \sigma_{2,1} & \sigma_2^2 & \dots & \sigma_{2,n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n,1} & \sigma_{n,2} & \dots & \sigma_n^2 \\ \end{bmatrix}\]

Where \(\sigma_{i,j}\) is the covariance of two random variables:

\[\sigma_{i,j}=Cov[X_i, X_j]=E[(X_i-E[X_i])(X_j-E[X_j])]=E[X_iX_j]-E[X_i]E[X_j]\]

Finally, we can define the correlation of two random variables:

\[\rho_{i,j}= \frac{Cov[X_i,X_j]}{\sqrt{Var[X_i]Var[X_j]}}=\frac{\sigma_{i,j}}{\sigma_i\sigma_j}\]

And the correlation matrix for \(\vec X\):

\[\mathbf R=\begin{bmatrix} 1 & \rho_{1,2} & \dots & \rho_{1,n} \\ \rho_{2,1} & 1 & \dots & \rho_{2,n} \\ \vdots & \vdots & & \vdots \\ \rho_{n,1} & \rho_{n,2} & \dots & 1 \\ \end{bmatrix}\]

Python Code

Doing statistical work in Python is very easy, assuming you know which tools to use. Of course, there are dozens of excellent Python packages to choose from. At a minimum you’ll want to have NumPy, SciPy, matplotlib, and pandas available.

Let’s do some basic mean and standard deviation calculations.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
s = pd.Series(0.7 * np.random.randn(1000) - 1.3)
s.hist(color='c')
mean = s.mean()
std = s.std()
plt.axvline(mean, color='b', linestyle='dashed', linewidth=2)
plt.axvline(mean-std, color='b', linestyle='dashed', linewidth=1)
plt.axvline(mean+std, color='b', linestyle='dashed', linewidth=1)
plt.title('Mean and standard deviation example')
plt.xlabel('Value')
plt.ylabel('Count');
output

To calculate covariance and correlation we can do:

df = pd.DataFrame(np.random.randn(50, 3), columns=['a', 'b', 'c'])
df.cov()
abc
a1.446393-0.0530090.178310
b-0.0530091.1105890.003758
c0.1783100.0037581.118477
df.corr()
abc
a1.000000-0.0418250.140191
b-0.0418251.0000000.003372
c0.1401910.0033721.000000

Many times it is necessary to generate random numbers according to a particular distribution function. SciPy and NumPy make this easy. The following demonstrates how to generate PDF’s, CDF’s, and random numbers on normal, chi-squared, Student’s t, and F distributions.

from scipy.stats import norm
fig, (ax1, ax2) = plt.subplots(2, 1)
x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
ax1.plot(x, norm.pdf(x), 'r-', lw=2, alpha=0.5)
ax2.plot(x, norm.cdf(x), 'r-', lw=2, alpha=0.5)
ax1.set_ylim((0,0.45))
ax1.set_title('PDF and CDF of Normal Distribution');
output
pd.DataFrame(np.random.normal(-2, 1.2, 1000)).hist(bins=16)
plt.title('Histogram of Random Samples from Normal Distribution');
output
from scipy.stats import chi2
df = 55
fig, (ax1, ax2) = plt.subplots(2, 1)
x = np.linspace(chi2.ppf(0.01, df), chi2.ppf(0.99, df), 100)
ax1.plot(x, chi2.pdf(x, df), 'r-', lw=2, alpha=0.5)
ax2.plot(x, chi2.cdf(x, df), 'r-', lw=2, alpha=0.5)
ax1.set_title('PDF and CDF of Chi-Squared Distribution (with {} DOF)'.format(df));
output
pd.DataFrame(np.random.chisquare(34, 1000)).hist(bins=16)
plt.title('Histogram of Random Samples from Chi-Squared Distribution');
output
from scipy.stats import t
df = 2.74
fig, (ax1, ax2) = plt.subplots(2, 1)
x = np.linspace(t.ppf(0.01, df), t.ppf(0.99, df), 100)
ax1.plot(x, t.pdf(x, df), 'r-', lw=2, alpha=0.5)
ax2.plot(x, t.cdf(x, df), 'r-', lw=2, alpha=0.5)
ax1.set_title('PDF and CDF of Student\'s t Distribution (with {} DOF)'.format(df));
output
pd.DataFrame(np.random.standard_t(2.8, 1000)).hist()
plt.title('Histogram of Random Samples from Student\'s t Distribution');
output
from scipy.stats import f
dfn, dfd = 29, 18
fig, (ax1, ax2) = plt.subplots(2, 1)
x = np.linspace(f.ppf(0.01, dfn, dfd), f.ppf(0.99, dfn, dfd), 100)
ax1.plot(x, f.pdf(x, dfn, dfd), 'r-', lw=2, alpha=0.5)
ax2.plot(x, f.cdf(x, dfn, dfd), 'r-', lw=2, alpha=0.5)
ax1.set_title('PDF and CDF of F Distribution (with [{}, {}] DOF)'.format(dfn,dfd));
output
pd.DataFrame(np.random.f(29, 18, 1000)).hist()
plt.title('Histogram of Random Samples from F Distribution');
output

Time Series

Now that we’ve covered the basics, let’s get started talking about time series. A time series is an ordered sequence of observations, sometimes called a signal. The ordering is normally through time, particularly in terms of equally spaced time intervals. Here are some examples (original source data found at vincentarelbundock.github.io/Rdatasets).

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df_cur = pd.read_csv('data/Garch.csv', parse_dates=['date'], index_col='date')
df_cur['bp'].plot()
plt.ylim(0.8, 2.6)
plt.title('Daily Exchange Rates (British Pound to US Dollar)')
plt.xlabel('Date')
plt.ylabel('GBP to USD');
output
df_temp = pd.read_csv('data/nhtemp.csv', parse_dates=['time'], index_col='time')
plt.plot(df_temp.index, df_temp['nhtemp'])
plt.ylim(46, 56)
plt.title('Average Yearly Temperatures in New Haven')
plt.xlabel('Year')
plt.ylabel('Avg. Temp.');
output
df_www = pd.read_csv('data/WWWusage.csv', index_col='time')
plt.plot(df_www.index, df_www['WWWusage'])
plt.ylim(80, 240)
plt.title('Internet Usage per Minute')
plt.xlabel('Minute')
plt.ylabel('Usage');
output

Estimate Distributions and Densities

Given a data set, how do you to which particular distribution function is belows? Well, that can be a tough one. Fortunately, SciPy can generate an estimate for you, which you can then plot and visualize.

First, I’ll show the pandas shortcut method (single line of code). Following that is the more robust, SciPy method (more control).

df_www['WWWusage'].plot(kind='kde')
plt.title('Kernel Density Estimate (via pandas)')
plt.xlabel('Usage');
output
from scipy.stats.kde import gaussian_kde
kde = gaussian_kde(df_www['WWWusage'])
xspan = np.linspace(0, 300, 100)
pdf = kde(xspan)
plt.plot(xspan, pdf)
plt.fill_between(xspan, 0, pdf, alpha=0.4)
plt.title('Kernel Density Estimate (via scipy)')
plt.xlabel('Usage')
plt.ylabel('Density');
output
cdf = np.cumsum(kde(xspan))
cdf = cdf / max(cdf)
plt.plot(xspan, cdf)
plt.fill_between(xspan, 0, cdf, alpha=0.4);
plt.title('Cumulative Distribution Function Estimate')
plt.xlabel('Usage')
plt.ylabel('Probability');
output

Conclusion

That’s it for now. I’ve covered: some very basic math theory; how to calculate mean, covariance, etc.; how to generate random numbers; how to visualize densities and distributions; and a little bit on time series.

References

  1. Madsen, Henrik. Time Series Analysis. Chapman & Hall/ CRC, 2007.

GET IN TOUCH

Have a project in mind?

Reach out directly to hello@humaticlabs.com or use the contact form.

HUMATIC LABS LLC

All rights reserved