The Datasets Package
statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
Using Datasets from Stata
webuse(data[, baseurl, as_df]) | Download and return an example dataset from Stata.
Using Datasets from R
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. The actual data is accessible by the data attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
In [4]: duncan_prestige.data.head(5)
R Datasets Function Reference
get_rdataset(dataname[, package, cache]) | Download and return R dataset.
get_data_home([data_home]) | Return the path of the statsmodels data dir.
clear_data_home([data_home]) | Delete all the content of the data home cache.
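As a rough illustration of the two cache helpers listed above, the sketch below points the data home at a temporary directory so that a real cache is left alone. It assumes that both functions can be imported from statsmodels.datasets and that get_data_home creates the directory when it is missing; the directory name sm_cache is made up for the example.

```python
import os
import tempfile

from statsmodels.datasets import clear_data_home, get_data_home

# Use a throwaway location so the user's real cache is not touched.
scratch = tempfile.mkdtemp()
cache_dir = os.path.join(scratch, "sm_cache")  # hypothetical directory name

# Assumption: get_data_home creates the directory if it does not exist
# and returns its path.
home = get_data_home(data_home=cache_dir)
created = os.path.isdir(home)

# clear_data_home deletes the data home and everything stored in it.
clear_data_home(data_home=cache_dir)
removed = not os.path.exists(cache_dir)

os.rmdir(scratch)  # clean up the now-empty temporary parent directory
print(created, removed)
```

Passing data_home explicitly keeps the example free of side effects; without it, the functions fall back to the default statsmodels data directory.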
Available Datasets
Usage
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the proposal. The full dataset is available in the data attribute.
In [7]: data.data
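For readers unfamiliar with the bunch pattern, the following is a minimal sketch of the idea: a dictionary whose keys can also be read as attributes. It is only an illustration and not the statsmodels Dataset class; the class name Bunch and the toy values are assumptions made for the example.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Point the instance namespace at the dict itself, so that
        # bunch.key and bunch["key"] always refer to the same value.
        self.__dict__ = self


# Hypothetical toy values, not taken from any statsmodels dataset.
toy = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], names=["y", "x"])
toy.endog_name = "y"  # attribute assignment also creates a dict key

print(toy.names, toy["endog_name"])
```

The appeal of the pattern is that a loaded dataset can carry an open-ended set of fields (data, names, and so on) without a fixed class definition for each one.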
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
In [9]: data.exog[:5,:]
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
In [11]: data.exog_name
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
In [13]: type(data.raw_data)
In [14]: data.names
Loading data as pandas objects
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
In [17]: data.endog
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
With pandas integration in the estimation classes, the metadata will be attached to model results.
Extra Information
If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information
- The idea for a datasets package was originally proposed by David Cournapeau, with later updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.