Tutorial¶

We now illustrate the basic capabilities of the grmpy package. We start by outlining some basic functional form assumptions before introducing to alternative models that can be used to estimate the marginal treatment effect (MTE). We then turn to some simple use cases.

Assumptions¶

The grmpy package implements the normal linear-in-parameters version of the generalized Roy model. Both potential outcomes and the choice \((Y_1, Y_0, D)\) are a linear function of the individual’s observables \((X, Z)\) and random components \((U_1, U_0, V)\).

\[\begin{split}Y_1 &= X \beta_1 + U_1 \\ Y_0 &= X \beta_0 + U_0 \\ D &= I[D^{*} > 0] \\ D^{*} &= Z \gamma -V\end{split}\]

Individuals decide to select into latent indicator variable \(D^{*}\) is positive. Depending on their decision, we either observe \(Y_1\) or \(Y_0\).

Parametric Normal Model¶

The parametric model imposes the assumption of joint normality of the unobservables \((U_1, U_0, V) \sim \mathcal{N}(0, \Sigma)\) with mean zero and covariance matrix \(\Sigma\).

Semiparametric Model¶

The semiparametric approach invokes no assumption on the distribution of the unobservables. It requires a weaker condition \((X,Z) \indep {U_1, U_0, V}\)

Under this assumption, the MTE is:

additively separable in \(X\) and \(U_D\), which means that the shape of the MTE is independent of \(X\), and
identified over the common support of \(P(Z)\), unconditional on \(X\).

The assumption of common support is crucial for the application of LIV and needs to be carefully evaluated every time. It is defined as the region where the support of \(P(Z)\) given \(D=1\) and the support of \(P(Z)\) given :math:`D=0 overlap.

Model Specification¶

You can specify the details of the model in an initialization file (example). This file contains several blocks:

SIMULATION

The SIMULATION block contains some basic information about the simulation request.

Key	Value	Interpretation
agents	int	number of individuals
seed	int	seed for the specific simulation
source	str	specified name for the simulation output files

ESTIMATION

Depending on the model, different input parameters are required.

PARAMETRIC MODEL

Key	Value	Interpretation
semipar	False	choose the parametric normal model
agents	int	number of individuals (for the comparison file)
file	str	name of the estimation specific init file
optimizer	str	optimizer used for the estimation process
start	str	flag for the start values
maxiter	int	maximum numbers of iterations
dependent	str	indicates the dependent variable
indicator	str	label of the treatment indicator variable
output_file	str	name for the estimation output file
comparison	int	flag for enabling the comparison file creation

SEMIPARAMETRIC MODEL

Key	Value	Interpretation
semipar	True	choose the semiparametric model
show_output	bool	If True, intermediate outputs of the estimation process are displayed
dependent	str	indicates the dependent variable
indicator	str	label of the treatment indicator variable
file	str	name of the estimation specific init file
logit	bool	If false: probit. Probability model for the choice equation
nbins	int	Number of histogram bins used to determine common support (default is 25)
bandwidth	float	Bandwidth for the locally quadratic regression
gridsize	int	Number of evaluation points for the locally quadratic regression (default is 400)
ps_range	list	Start and end point of the range of \(p = u_D\) over which the MTE shall be estimated
rbandwidth	int	Bandwidth for the double residual regression (default is 0.05)
trim_support	bool	Trim the data outside the common support, recommended (default is True)
reestimate_p	bool	Re-estimate \(P(Z)\) after trimming, not recommended (default is False)

In most empirical applications, bandwidth choices between 0.2 and 0.4 are appropriate. [11] find that a gridsize of 400 is a good default for graphical analysis. For data sets with less than 400 observations, we recommend a gridsize equivalent to the maximum number of observations that remain after trimming the common support. If the data set of size N is large enough, a gridsize of 400 should be considered as the minimal number of evaluation points. Since grmpy’s algorithm is fast enough, gridsize can be easily increased to N evaluation points.

The “rbandwidth”, which is 0.05 by default, specifies the bandwidth for the LOESS (Locally Estimated Scatterplot Smoothing) regression of \(X\), \(X \ \times \ p\), and \(Y\) on \(\widehat{P}(Z)\). If the sample size is small (N < 400), the user may need to increase “rbandwidth” to 0.1. Otherwise grmpy will throw an error.

Note that the MTE identified by LIV consists of wo components: \(\overline{x}(\beta_1 - \beta_0)\) (which does not depend on \(P(Z) = p\)) and \(k(p)\) (which does depend on \(p\)). The latter is estimated nonparametrically. The key “p_range” in the initialization file specifies the interval over which \(k(p)\) is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator uses the remaining data to predict \(k(p)\) over the entire “p_range” specified by the user. If “p_range” is larger than the common support, grmpy extrapolates the values for the MTE outside this region. Technically speaking, interpretations of the MTE are only valid within the common support. In our empirical applications, we set “p_range” to \([0.005,0.995]\).

The other parameters (“trim_support” and “reestimate_p”) are set by default and do not need to be specified by the user. In rare cases, the user might wish to change these parameters. In general, we do not recommend this.

TREATED

The TREATED block specifies the number and order of the covariates determining the potential outcome in the treated state and the values for the coefficients \(\beta_1\). Note that the length of the list which determines the parameters has to be equal to the number of variables that are included in the order list.

Key	Container	Values	Interpretation
params	list	float	Parameters
order	list	str	Variable labels

UNTREATED

The UNTREATED block specifies the covariates that a the potential outcome in the untreated state and the values for the coefficients \(\beta_0\).

Key	Container	Values	Interpretation
params	list	float	Parameters
order	list	str	Variable labels

CHOICE

The CHOICE block specifies the number and order of the covariates determining the selection process and the values for the coefficients \(\gamma\).

Key	Container	Values	Interpretation
params	list	float	Parameters
order	list	str	Variable labels

Further Specifications for the Parametric Model¶

DIST

The DIST block specifies the distribution of the unobservables.

Key	Container	Values	Interpretation
params	list	float	Upper triangular of the covariance matrix

VARTYPES

The VARTYPES section enables users to specify optional characteristics to specific variables in their simulated data. Currently there is only the option to determine binary variables. For this purpose the user have to specify a key which reflects the corresponding variable label and assign a list to this label which contains the type (binary) as a string as well as a float (<0.9) that determines the probability for which the variable is one.

Key	Container	Values	Interpretation
Variable label	list	string and float	Type of variable + additional information

SCIPY-BFGS

The SCIPY-BFGS block contains the specifications for the BFGS minimization algorithm. For more information see: SciPy documentation.

Key	Value	Interpretation
gtol	float	the value that has to be larger as the gradient norm before successful termination
eps	float	value of step size (if jac is approximated)

SCIPY-POWELL

The SCIPY-POWELL block contains the specifications for the POWELL minimization algorithm. For more information see: SciPy documentation.

Key	Value	Interpretation
xtol	float	relative error in solution values xopt that is acceptable for convergence
ftol	float	relative error in fun(xopt) that is acceptable for convergence

Examples¶

Parametric Normal Model¶

In the following chapter we explore the basic features of the grmpy package. The resources for the tutorial are also available online. So far the package provides the features to simulate a sample from the generalized Roy model and to estimate some parameters of interest for a provided sample as specified in your initialization file.

Simulation

First we will take a look on the simulation feature. For simulating a sample from the generalized Roy model you use the simulate() function provided by the package. For simulating a sample of your choice you have to provide the path of your initialization file as an input to the function.

import grmpy

grmpy.simulate('tutorial.grmpy.yml')

This creates a number of output files that contain information about the resulting simulated sample.

data.grmpy.info, basic information about the simulated sample
data.grmpy.txt, simulated sample in a simple text file
data.grmpy.pkl, simulated sample as a pandas data frame

Estimation

The other feature of the package is the estimation of the parameters of interest. By default, the parametric model is chosen, in which case the parameter semipar in the ESTIMATION section of the initialization file is set to False. The start values and optimizer options need to be specified in the ESTIMATION section.

grmpy.fit('tutorial.grmpy.yml', semipar=False)

As in the simulation process this creates an output files that contain information about the estimation results.

Local Instrumental Variables¶

If the user wishes to estimate the parameters of interest using the semiparametric LIV approach, semipar must be changed to True.

grmpy.fit('tutorial.semipar.yml', semipar=True)

If show_output is True, grmpy plots the common support of the propensity score and shows some intermediate outputs of the estimation process.