# Manual Statistics and Econometric Models, General Concepts, Estimation, Prediction and Algorithms

This class includes maximum likelihood estimation, but is much broader. The second viewpoint is from Bayesian statistics. In this case, the model specifies a family of conditional probability distributions, indexed by parameters. These parameters are considered random as well, and so a prior distribution needs to be specified for them. Once a prior is assumed, the joint distribution over the parameters and the observations is well defined.


The distribution of the model parameters is characterized by the posterior distribution given historical data, which is defined as the conditional distribution of the parameters given the data. The concept of a loss function is discussed in later sections. From a procedural point of view, Bayesian methods often take the form of numerical integration.

This is because the posterior distribution is a normalized probability measure, and computing the normalization factor requires integration. The numerical integration is generally carried out via some form of Monte Carlo sampling procedure or other type of numerical approximation. Thus, procedurally, methods studied in frequentist statistics often take the form of optimization procedures, and methods studied in Bayesian statistics often take the form of integration procedures. The relevance of these viewpoints to massive data analysis comes down in part to the scalability of optimization versus integration.
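As a toy illustration of this integration-based viewpoint, the following sketch estimates a posterior mean by simple Monte Carlo over draws from the prior; the data and uniform prior are invented, and the point is that the normalizing constant (the evidence) is itself an integral estimated by averaging:

```python
import math
import random

# Illustrative sketch only: a Bayesian posterior computed by Monte
# Carlo integration. The normalization factor is estimated by
# averaging the likelihood over draws from the prior.
random.seed(0)
data = [1, 1, 0, 1, 1, 0, 1, 1]  # toy coin flips: 6 heads, 2 tails

def likelihood(p, data):
    return math.prod(p if x == 1 else 1.0 - p for x in data)

prior_draws = [random.random() for _ in range(100_000)]  # Uniform(0,1) prior
weights = [likelihood(p, data) for p in prior_draws]
evidence = sum(weights)  # Monte Carlo estimate of the normalization factor
post_mean = sum(p * w for p, w in zip(prior_draws, weights)) / evidence

# The conjugate Beta(7, 3) posterior has mean 0.7, so the estimate
# should land close to that value.
print(round(post_mean, 2))
```

The same two ingredients, a weighted sum over samples and a Monte Carlo estimate of the normalizer, underlie far more sophisticated sampling schemes.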

That said, it is also possible to treat integration problems using the tools of optimization; this is the perspective of the variational approach to Bayesian inference (Wainwright and Jordan). It should also be noted that there are many links between the frequentist and Bayesian views at a conceptual level, most notably within the general framework of statistical decision theory. Many data analyses involve a blend of these perspectives, either procedurally or in terms of the analysis.

The difference is that in a parametric model the number of parameters is fixed once and for all, whereas in a nonparametric model the number of parameters can grow with the size of the data. Sometimes the growth is explicitly specified in the model, and sometimes it is implicit.


For example, a nonparametric model may involve a set of functions that satisfy various constraints. Moreover, the growth in the number of parameters may arise implicitly as part of the estimation procedure. Although parametric models will always have an important role to play in data analysis, particularly in situations in which the model is specified in part from an underlying scientific theory (such that the parameters have meaning in the theory), the committee finds that the nonparametric perspective is particularly well aligned with many of the goals of massive data analysis.

The nonparametric perspective copes naturally with the fact that new phenomena often emerge as data sets increase in size. The Bayesian approach to nonparametrics generally involves replacing classical prior distributions with stochastic processes, thereby supplying the model with an open-ended infinite number of random parameters. The frequentist approach, with its focus on analysis, shoulders the burden of showing that good estimates of parameters can be obtained even when the number of parameters is growing.

One rarely wishes to model all aspects of a data set in detail, particularly in the setting of massive data. Rather, there will be aspects of the model that are more important than others. Also, certain functions of parameters may be of particular interest, such as the output label in a classification problem. There it is common to measure error via the zero-one loss, which compares the predicted label with the true label: if these values disagree, then the loss is one; otherwise it is zero. In regression it is common to measure the error in a fitted function via the squared-error loss. Both frequentists and Bayesians make use of loss functions. For frequentists the loss is a fundamental ingredient in the evaluation of statistical procedures; one wishes to obtain small average loss over multiple draws of a data set.
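The two loss functions just mentioned can be stated in a few lines (a minimal sketch; the function names are ours):

```python
# Minimal sketch of the zero-one and squared-error losses.
def zero_one_loss(y_true, y_pred):
    # Loss is one when the labels disagree, zero otherwise.
    return 0 if y_true == y_pred else 1

def squared_error_loss(y_true, y_pred):
    # Squared difference between the observed and fitted values.
    return (y_true - y_pred) ** 2

print(zero_one_loss(1, 0))           # -> 1 (labels disagree)
print(squared_error_loss(3.0, 2.5))  # -> 0.25
```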

The probabilistic model of the data is used to define the distribution under which this average is taken. For Bayesians the loss function is used to specify the aspects of the posterior distribution that are of particular interest to the data analyst and thereby guide the design of posterior inference and decision-making procedures. The use of loss functions encourages the development of partially specified models. For example, in regression, where the goal is to predict Y from X , if the loss function only refers to Y , as is the case with the least-squares loss, then this encourages one to develop a model in which the distribution of X is left unspecified.

Similarly, in binary classification, where the goal is to label a vector X with a zero or one, one can forgo an attempt to model the class-conditional distributions of X and focus only on a separating surface that predicts the labels well. For example, if one considers the class of all hyperplanes, then one has a parametric model in which the parameters are those needed to define a hyperplane (note that the number of parameters is fixed in advance in this case).

Alternatively, one can consider flexible surfaces that grow in complexity as data accrue; this places one in the nonparametric modeling framework. On the other hand, the model has also become more concrete in that it is targeted to the inferential goal of the data analyst via the use of a loss function. In addition to the statistical modeling perspectives discussed above, which have been extensively studied in the statistical and machine-learning literatures, data analysis also draws on a range of more procedure-driven methods.

These methods also give meaningful descriptions of the data, but they are more procedure-driven than model-driven. Some of these procedures rely on optimization criteria that are not based on statistical models and may lack any statistical underpinning. For example, the k-means algorithm is a popular method for clustering, but it is not based on a statistical model of the data. However, the optimization criterion still characterizes what a data analyst wants to infer from the data: whether the data can be clustered into coherent groups.

This means that, instead of a statistical model, appropriately defined optimization formulations may be more generally regarded as models that capture useful descriptions of the data. In such a case, parameters in the optimization formulation determine a model, and the optimal parameter gives the desired description of the data. It should be noted that many statistical parameter estimation methods, such as maximum-likelihood estimation, can be regarded as optimization procedures.

Therefore, there is a strong relationship between the optimization approach, which is heavily used in machine learning, and the more traditional statistical models. Some other data-analysis procedures try to find meaningful characterizations of the data that satisfy some descriptions but are not necessarily based on optimization. These methods may be considered as algorithmic approaches rather than models. The specific statistical quantities these algorithms try to compute provide models of the data only in a loose sense.

Nevertheless, the algorithmic approaches are important in massive data analysis simply because the need for computational efficiency goes hand-in-hand with massive data analysis. In building a statistical model from any data source, one must often deal with the fact that data are imperfect. Real-world data are corrupted with noise. Such noise can be either systematic (i.e., biased in a consistent direction) or random. Measurement processes are inherently noisy, data can be recorded with error, and parts of the data may be missing. The data produced from simulations or agent-based models, no matter how complex, are also imperfect, given that they are built from intuition and initial data.

Data can also be contaminated or biased by malicious agents. The ability to detect false data is extremely weak, and just having a massive quantity of data is no guarantee against deliberate biasing. Even good data obtained by high-quality instrumentation or from well-designed sampling plans are rarely free of such imperfections. Noisy and biased data are thus unavoidable in all model building, and this can lead to poor predictions and to models that mislead.

Although random noise can be averaged out, loosely speaking, using multiple independent experiments—that is, the averaged noise effect approaches zero—this is not the case with systematic noise. In practice, both kinds of noise exist, and thus both kinds should be incorporated into models of the data. The goal is often to build statistical models that include one or more components of noise so that noise can be separated from signal, and thus relatively complex models can be used for the signal, while avoiding overly complex models that would find structure where there is none.
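A small simulation makes the averaging argument concrete (a sketch with made-up noise levels):

```python
import random

# Sketch: averaging many independent draws shrinks zero-mean random
# noise toward zero, but a systematic offset survives averaging intact.
random.seed(0)
n = 100_000
avg_random = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n  # zero-mean noise
avg_biased = sum(random.gauss(0.5, 1.0) for _ in range(n)) / n  # +0.5 systematic bias

print(abs(avg_random) < 0.05)        # True: random noise averages out
print(abs(avg_biased - 0.5) < 0.05)  # True: the bias remains near 0.5
```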

The modeling of the noise component impacts not only the parameter estimation procedure, but also the often informal process of cleaning the data and assessing whether the data are of high-enough quality to be used for the task at hand. This section focuses on this informal activity of data cleaning in the setting of massive data. The science and art of cleaning data is fairly well developed when data sets are of modest size, but new challenges arise when dealing with massive data. In small-scale applications, data cleaning often begins with simple sanity checking. Are there any obvious mistakes, omissions, mislabelings, and so on, that can be seen by sampling a small subset of the data?

Do any variables have obviously incorrect values? This kind of checking typically involves plotting the data in various ways, scanning through summaries, and producing particular snapshots that are designed to expose bad data. Often the result of this process is to return to the source to confirm suspicious entries, or to fill in omitted fields.

How does this approach change with massive data? The sanity checking and identification of potential problems can still be performed using samples and snapshots, although determining how to find representative samples can sometimes pose problems (see Chapter 8). However, the ability to react to issues will be constrained by time and size, and most human intervention becomes impossible.

There are at least two general approaches to overcoming this problem. Of these, the first approach is more attractive, but the second should also be considered. For example, features that are text-based lead to many synonyms or similar phrases for the same concept. Humans can curate these lists to reduce the number of concepts, but with massive data this probably needs to be done automatically. Such a step will likely involve natural-language-processing methodology, and it must be sufficiently robust to handle all cases reasonably well.
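As a cartoon of such automatic curation, one might collapse variant phrases onto a canonical concept with a lookup table (the table below is invented for illustration; a real system would build such mappings with natural-language-processing methods rather than by hand):

```python
# Hypothetical synonym table; entries here are illustrative only.
CANONICAL = {
    "nyc": "new york",
    "new york city": "new york",
    "n.y.": "new york",
}

def normalize(phrase):
    # Lower-case and strip the phrase, then map it to its canonical
    # concept if one is known; otherwise leave it as-is.
    key = phrase.strip().lower()
    return CANONICAL.get(key, key)

print(normalize("NYC"))            # -> new york
print(normalize("New York City"))  # -> new york
```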

With storage limitations, it may not be feasible to store all variables, so the method might have to be limited to a more valuable subset of them. Missing data is often an issue, and dealing with it can be viewed as a form of cleaning. Some statistical modeling procedures—such as trees, random forests, and boosted trees—have built-in methods for dealing with missing values.

However, many model-building approaches assume the data are complete, and so one is left to impute the missing data prior to modeling. There are many different approaches to data imputation. A simple method that is practical on a massive scale is to replace the missing entries for a variable by the mean of that variable. This implicitly assumes that the omissions are completely random. Other, more sophisticated methods treat missing-data imputation as a prediction problem—predicting the missing entries for a variable using that variable as the response and all the other variables as inputs.
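Mean imputation of the kind just described can be sketched in a few lines (with `None` standing in for a missing entry):

```python
import statistics

# Sketch of mean imputation: replace missing entries (None) with the
# mean of the observed values for that variable. This implicitly
# assumes the omissions are completely at random.
column = [4.0, None, 6.0, 8.0, None, 2.0]
observed_mean = statistics.mean(v for v in column if v is not None)
imputed = [observed_mean if v is None else v for v in column]
print(imputed)  # -> [4.0, 5.0, 6.0, 8.0, 5.0, 2.0]
```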

This might create a computational burden that is prohibitive with massive data, so good compromises are sought. Whatever approaches are used to clean and preprocess the data, the steps should be documented, and ideally the scripts or code that were used should accompany the data. Following these steps results in a process that is reproducible and self-explanatory.

Data analysts build models for two basic reasons: to understand the past and to predict the future. One would like to understand how the data were generated, the relationships between variables, and any special structure that may exist in the data. The process of creating this understanding is often referred to as unsupervised learning. One may also wish to predict a response of interest from other measured variables; this is often referred to as supervised learning.

Data-generating mechanisms typically defy simple characterization, and thus models rarely capture reality perfectly. However, the general hope is that carefully crafted models can capture enough detail to provide useful insights into the data-generating mechanism and produce valuable predictions.

This is the spirit behind the famous observation from the late George Box that "all models are wrong, but some are useful." While the model-building literature presents a vast array of approaches and spans many disciplines, model building with massive data is relatively uncharted territory. For example, most complex models are computationally intensive, and algorithms that work perfectly well with megabytes of data may become infeasible with terabytes or petabytes of data, regardless of the computational power that is available. Thus, in analyzing massive data, one must re-think the trade-offs between complexity and computational efficiency.

This section provides a summary of major techniques that have been used for data mining, statistical analysis, and machine learning in the context of large-scale data, but which need re-evaluation in the context of massive data. Some amplification is provided in Chapter 10, which discusses computational kernels for the techniques identified here.

Unsupervised learning, or data analysis, aims to find patterns and structures in the data. Standard tasks that data analysts address include the detection of outliers and the clustering of data into coherent groups. One general approach is to use a statistical model to characterize the data; an outlier is then a point that belongs to a set with a small probability (which can be measured by a properly defined p-value) under the model. A variety of approaches are used in practice to address many of these questions. They include the probabilistic modeling approach (with a well-defined statistical model), the non-probabilistic approach based on optimization, or simply a procedure that tries to find desired structures, which may or may not rely on optimization.

For example, a mixture model can be used as a statistical model for addressing the clustering problem. With a mixture model, in order to generate each data point, one first generates its mixture component, then generates the observation according to the probability distribution of the mixture component. Hence the statistical approach requires a probabilistic model that generates the data—a so-called generative model.

By comparison, the k-means algorithm assumes that the data are in k clusters, represented by their centroids. Each data point is then assigned to the cluster whose centroid is closest. This is iterated, and the algorithm converges to a local minimum of an appropriate distance-to-center criterion. This approach does not hinge on a statistical model, but instead on a sensible optimization criterion. There are also valid clustering procedures that are not based on optimization or statistical models.
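A minimal one-dimensional k-means sketch (toy data, k = 2) shows the assign-then-recompute iteration just described:

```python
import random

# Sketch of k-means on 1-D data: repeatedly assign each point to the
# nearest centroid, then move each centroid to the mean of its cluster.
def kmeans_1d(points, k=2, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

random.seed(1)
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.8]
print(kmeans_1d(data))  # one centroid near 1.0, the other near 9.1
```

As the text notes, nothing here is a statistical model of the data; the iteration simply descends a distance-to-center criterion.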

For example, in hierarchical agglomerative clustering, one starts with each single data point as a cluster, and then iteratively groups the two closest clusters to form a larger cluster; this process is repeated until all data are grouped into a single cluster.
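On one-dimensional toy data, the agglomerative procedure can be sketched directly (single linkage between adjacent clusters; the function name is ours):

```python
# Sketch of hierarchical agglomerative clustering on sorted 1-D points:
# repeatedly merge the two adjacent clusters separated by the smallest
# gap, until a single cluster remains.
def agglomerate(points):
    clusters = [[p] for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        # Adjacent-cluster gap: next cluster's minimum minus this one's maximum.
        i = min(range(len(clusters) - 1),
                key=lambda i: clusters[i + 1][0] - clusters[i][-1])
        merged = clusters[i] + clusters[i + 1]
        merges.append(merged)
        clusters[i:i + 2] = [merged]
    return merges

print(agglomerate([1.0, 1.1, 5.0, 5.2]))
# -> [[1.0, 1.1], [5.0, 5.2], [1.0, 1.1, 5.0, 5.2]]
```

The sequence of merges records the full hierarchy: nearby points are grouped first, and the final merge joins everything into one cluster.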

Hierarchical clustering does not depend on a statistical model of the data, nor does it attempt to optimize a criterion. Nevertheless, it achieves the basic goal of cluster analysis—to find partitions of the data so that points inside each cluster are close to one another but not close to points in other clusters. In a loose sense, it also builds a useful model of the data. However, the model is not detailed enough to generate the data in a probabilistic sense.

Statistical models in the unsupervised setting that focus on the underlying data-generation mechanism can naturally be studied under the Bayesian framework. In that case, one is especially interested in finding unobserved hidden information from the data, such as factors or clusters that reveal some underlying structure in the data. Bayesian methods are natural in this context because they work with a joint distribution, both on observed and unobserved variables, so that elementary probability calculations can be used for statistical inference.

Massive data may contain many variables that require complex probabilistic models, presenting both statistical and computational challenges. Statistically one often needs to understand how to design nonparametric Bayesian procedures that are more expressive than the more traditional parametric Bayesian models.


Moreover, in order to simplify the specification of the full joint probability distribution, it is natural to consider simplified relationships among the data such as with graphical models that impose constraints in the form of conditional independencies among variables. Computational efficiency is also a major challenge in Bayesian analysis, especially for massive data. Methods for efficient large-scale Monte Carlo simulation or approximate inference algorithms such as variational Bayesian methods become important for the success of the Bayesian approach.

Predictive modeling is referred to as supervised learning in the machine-learning literature. One has a response or output variable Y, and the goal is to build a function f(X) of the inputs X for predicting Y. Basic prediction problems involving simple outputs include classification (Y is a discrete categorical variable) and regression (Y is a real-valued variable).

Statistical approaches to predictive modeling can be generally divided into two families: generative models and discriminative models. In a generative model, one models the joint distribution of X and Y; in a discriminative model, the conditional probability P(Y | X) is directly modeled without assuming any specific probability model for X. An example of a generative model for classification is linear discriminant analysis. Here one assumes that in each class the conditional distribution is Gaussian (with a common covariance matrix used for all classes); hence, the joint distribution is the product of a Gaussian density with a class probability.

Logistic regression proposes a model for P(Y | X) and is not concerned with estimating the distribution of X. It turns out that both approaches result in the same parametric representation for P(Y | X), but they lead to different estimates for the parameters. In the traditional statistical literature, the standard parameter estimation method for developing either a generative or discriminative model is maximum likelihood estimation (MLE), which leads to an optimization problem.
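A toy discriminative fit illustrates MLE as an optimization problem: one-parameter logistic regression trained by gradient ascent on the log-likelihood (data and step size are invented for illustration):

```python
import math

# Sketch: fit P(Y=1 | x) = sigmoid(w * x) by gradient ascent on the
# logistic log-likelihood; this is maximum likelihood estimation
# posed as an optimization problem.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.0
for _ in range(200):
    # Gradient of the log-likelihood with respect to w.
    grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
    w += 0.1 * grad

print(w > 0)  # True: larger x pushes P(Y=1 | x) toward 1
```

Note that nothing here models the distribution of X; only the conditional P(Y | X) is estimated.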

One can also employ optimization in predictive modeling in a broader sense by defining a meaningful criterion to optimize. For example, one may consider a geometric concept such as a margin and use it to define an optimization criterion for classification that measures how well classes are separated by the underlying classifier. This leads to purely optimization-based machine-learning methods, such as the support vector machine, that are not based on statistical models of how the data were generated (although one can also develop a model-based perspective for such methods).

Another issue in modern data analysis is the prevalence of high-dimensional data, where a large number of variables are observed that are difficult to handle using traditional methods such as MLE. In order to deal with the large dimensionality, modern statistical methods focus on regularization approaches that impose constraints on the model parameters so that they can still be reliably estimated even when the number of parameters is large.

Examples of such methods include ridge regression and the Lasso method for least-squares fitting. In both cases one adds a penalty term that takes the form of a constraint on the norm of the coefficient vector (the L2 norm in the case of ridge regression, and the L1 norm in the case of the Lasso). In the Bayesian statistical setting, constraints in the parameter space can be regarded naturally as priors, and the associated optimization methods correspond to maximum a posteriori estimation.
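In one dimension (and with no intercept), the effect of the ridge penalty can be seen in closed form; the data below are invented:

```python
# Sketch of ridge regression in one dimension without an intercept:
# minimizing sum((y - w*x)^2) + lam * w^2 gives
#     w = sum(x*y) / (sum(x*x) + lam),
# so a larger penalty lam shrinks the coefficient toward zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def ridge_coef(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(ridge_coef(xs, ys, 0.0))   # ordinary least squares
print(ridge_coef(xs, ys, 10.0))  # shrunk toward zero by the L2 penalty
```

The Lasso's L1 penalty has no such simple closed form in general, but the shrinkage intuition is the same, with the added feature that coefficients can be driven exactly to zero.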

In many complex predictive modeling applications, nonlinear prediction methods can achieve better performance than linear methods. Therefore, an important research topic in massive data analysis is to investigate nonlinear prediction models that can perform efficiently in high dimensions. Classical examples of nonlinear methods include nearest neighbor classification and decision trees. Some recent developments include kernel methods, random forests, and boosting.

Some practical applications require one to predict an output Y with a rather complex structure. For example, in machine translation, the input X is observed as a sentence in a certain language, and a corresponding sentence translation Y needs to be generated in another language. These kinds of problems are referred to as structured prediction problems, an active research topic in machine learning. Many of these complex problems can be handled with the modeling approaches described above. Nevertheless, additional computational challenges arise. Efficient leverage of massive data is an important research topic currently in structured prediction, and this is likely to continue for the near future.

Another active research topic is online prediction, which can be regarded both as modeling for sequential prediction and as optimization over massive data. Online algorithms have a key advantage in handling massive data, in that they do not require all data to be stored in memory. Instead, each time they are invoked, they look at a single observation or a small batch of observations. One popular approach is stochastic gradient descent. Because of this advantage, these algorithms have received increasing attention in the machine-learning and optimization communities.
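A minimal stochastic-gradient-descent sketch for a streaming one-parameter regression (synthetic data generated on the fly; nothing beyond the current observation is retained):

```python
import random

# Sketch of online learning with stochastic gradient descent: fit
# y ≈ w * x one observation at a time, updating w after each point
# and then discarding it.
random.seed(0)
w, lr, true_w = 0.0, 0.01, 3.0
for _ in range(5000):
    x = random.uniform(-1.0, 1.0)
    y = true_w * x + random.gauss(0.0, 0.1)  # one observation from the stream
    grad = 2.0 * (w * x - y) * x             # gradient of the squared error
    w -= lr * grad                           # update, then move on
print(round(w, 2))  # close to the true slope 3.0
```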

Moreover, from a modeling perspective, sequential prediction is a natural setting for many real-world applications where data arrive sequentially over time. Agent-based models and system dynamic models are core modeling techniques for assessing and reasoning about complex socio-technical systems where massive data are inherent.

These models require the fusion of massive data, and the assessment of said data, to set initial conditions. In addition, these models produce massive data, potentially comparable in size and complexity to real-world data. One example, used in epidemiology and biological warfare analysis, is BioWar (Carley et al.); such models simulate entire cities, and thus are both users and producers of massive data. BioWar, for example, generates data on who interacts with whom and when, who has what disease, who is showing which symptom(s) and where, and what they are doing at a given time; it updates this picture across a city for all agents in 4-hour time blocks.

For these models, core challenges are identifying reduced-form solutions that are consistent with the full model, storing and processing the data generated, fusing massive amounts of data from diverse sources, and ensuring that results are due to actual behavior and not tail constraints on long chains of data. In data analysis, researchers are increasingly using relational or network models to assess complex systems. Models of social interaction, communication, and technology infrastructure are prominent examples. It is increasingly common for such models to have millions of nodes.

A core challenge includes generation of massive but realistic network data. A second core challenge centers on statistically assessing confidence in network metrics given different types and categories of network errors. Row-column dependencies in networks violate the assumptions of simple parametric models and have driven the development of nonparametric approaches. However, such approaches are often computationally intensive, and research on scalability is needed. Another core challenge is then how to estimate confidence without requiring the generation of samples from a full joint distribution on the network.

Models are, by their nature, imperfect. They may omit important features of the data in either the structural or noise components, make unwarranted assumptions such as linearity, or be otherwise mis-specified. On the other hand, models that are over-specified, in terms of being richer than the data can support, may fit the training data exceptionally well but generalize poorly to new data.

Thus, an important aspect of predictive modeling is performance evaluation. With modern methods, this often occurs in two stages: model tuning, and then the evaluation of the chosen model. Tuning is discussed first, followed by model evaluation. The models that are fit to particular data are often indexed by a tuning parameter; the penalty parameters of ridge regression and the Lasso, discussed earlier, are relevant examples. The process of deciding what model type to work with remains more art than science. In the massive data context, computational considerations frequently drive the choice.

For example, in sequence modeling, a conditional random field may lead to more accurate predictions, but a simple hidden Markov model may be all that is feasible. Similarly, a multilevel Bayesian model may provide an elegant inferential framework, but computational constraints may lead an analyst to a simpler linear model.


For many applications, a model-complexity ladder exists that provides the analyst with a range of choices. For example, in the high-dimensional classification context, the bottom rung contains simple linear classifiers such as naive Bayes. The next rung features traditional tools, such as logistic regression and discriminant analysis. The top rungs might feature boosting approaches and hierarchical nonparametric Bayesian methods. Similarly, in pharmaco-epidemiology, simple and widely used methods include disproportionality analyses based on two-by-two tables.

Case-control and case-crossover analyses provide a somewhat more complex alternative. High-dimensional propensity scoring methods and multivariate self-controlled case series are further up the ladder. Ultimately the appropriate rung on the ladder must depend on the signal-to-noise ratio.

Presumably there is little point in fitting a highly complex model to data with a low signal-to-noise ratio, although little practical guidance currently exists to inform the analyst in this regard. Ideally the family of models has been set up so that a tuning parameter orders the models in complexity. All the examples given above are of this kind.

The complexity is increased in an attempt to remove any systematic bias in the model. However, higher complexity also means that the model will fit the training data more closely, and there is a risk of over-fitting. The standard remedy is to fit a path of models of increasing complexity on training data and to evaluate each on held-back validation data; then one picks the position on the path with the best validation performance. Because the family is ordered according to complexity, this process determines the right complexity for the problem at hand. A substantial body of knowledge now exists about complexity trade-offs in modeling. As mentioned previously, more complex models can over-fit data and provide poor predictions.

These trade-offs, however, are poorly understood in the context of massive data, especially with non-stationary massive data streams. It is also important to note that statistical model complexity and computational complexity are distinct. Given a model with fixed statistical complexity and for a fixed out-of-sample accuracy target, additional data allow one to estimate a model with more computational efficiency. This is because in the worst case a computational algorithm can trivially subsample the data to reduce data size without hurting computation, but some algorithms can utilize the increased data size more efficiently than simple subsampling.

This observation has been discussed in some recent machine-learning papers. However, in nonparametric settings, one may use models whose complexity grows with increasing data. It will be important to study how to grow model complexity or shift models in the non-stationary setting in a computationally efficient manner.


There are other good reasons for dividing the model-building process into the two stages of fitting a hierarchical path of models, followed by performance evaluation to find the best model in the path. The first stage typically requires a criterion that is convenient to optimize; this requirement can be relaxed during the second stage, which can use whatever figure of merit has most meaning for the specific application, and it need not be smooth at all.

In most cases this tuning is performing some kind of trade-off between bias and variance; essentially, deciding between under- and over-fitting of the training data. Thus, once the best model has been chosen, its predictive performance should be evaluated on a different held-back test data set, because the selection step can introduce bias.

Ideally, then, three separate data sets will be identified for the overall task: a training set for fitting the models, a validation set for selecting the tuning parameter, and a test set for evaluating the chosen model. Model validation refers to using the validation data to evaluate a list of models. A plot of model prediction error versus tuning-parameter values can be revealing. To see this, assume that as the tuning parameter increases, the model complexity increases. Then two general scenarios tend to occur in practice.

For scenarios that are not data rich, or if there are many more variables than observations, one often resorts to K-fold cross-validation (Hastie et al.), in which the data are divided into K roughly equal chunks. One then trains on all but the kth chunk and evaluates the prediction error on the kth chunk. This is done K times, and the prediction-error curves are averaged.
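The mechanics of K-fold cross-validation can be sketched with a deliberately trivial "model" (predicting the training mean):

```python
# Sketch of K-fold cross-validation: hold out one chunk at a time,
# fit on the rest, and average the held-out prediction errors. The
# "model" here is simply the mean of the training data.
def kfold_cv(ys, K=3):
    errors = []
    for k in range(K):
        test = ys[k::K]                                  # the kth chunk
        train = [y for i, y in enumerate(ys) if i % K != k]
        pred = sum(train) / len(train)                   # fitted "model"
        errors.append(sum((y - pred) ** 2 for y in test) / len(test))
    return sum(errors) / K                               # averaged error

print(kfold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))  # -> 3.75
```

In practice the chunks are usually formed by random permutation rather than the simple striding used here, but the train-on-the-rest, score-on-the-held-out-chunk loop is the same.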

One could also use cross-validation for the final task of evaluating the test error for the chosen model. This calls for two nested layers of cross-validation. With limited-size data sets, cross-validation is a valuable tool. Does it lose relevance with massive data sources? Can one over-train?

Or is one largely testing whether one has trained enough? If both the number of variables and the number of observations are large (and growing), then one can still overfit, and here too regularization and validation continue to be important steps. If there are more than enough samples, then one can afford to set aside a subset of the data for validation.

As mentioned before, variance is typically not an issue here—one is determining the model complexity needed to accomplish the task at hand. This can also mean determining the size of the training set needed. If models have to be fit repeatedly as the structure of the data changes, it is important to know how many training data are needed. It is much easier to build models with smaller numbers of observations. Care must be taken to avoid sample bias during the validation step.

One important source of such bias arises when the population shifts or responds adversarially; spam filtering is an example, where the spammer constantly tries to figure out the algorithms that filter out spam. In situations like these, one may fit models to one population and end up testing them and making predictions on another. With massive data it is often essential to sample, in order to produce a manageable data set on which algorithms can run.

Here, it is reasonable to sample the positive and negative examples at different rates, particularly when one class is rare. A logistic regression or similar model can be fit on the stratified sample and then corrected post hoc for the differing sampling rates. This can be done on a larger scale, balancing for a variety of other factors. While it makes the modeling trickier, it allows one to work with more modest-sized data sets.

Stratified sampling of this kind is likely to play an important role in massive data applications. One has to take great care to correct the imbalance after fitting, and account for it in the validation. See Chapter 8 for further discussion of sampling issues. With massive data streams, it appears there may be room for novel approaches to the validation process. With online algorithms, one can validate before updating—that is, score each new observation with the current model before using that observation for training. If the complexity of the model is governed by the number of learning steps, this would make for an interesting adaptive learning algorithm.
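The validate-before-update idea can be sketched with a toy online model; here the "model" is just a running mean, and the function name is illustrative.

```python
def prequential_mean(stream):
    """Test-then-train: score each arriving value with the current
    model (the running mean) before updating on it."""
    total, count = 0.0, 0
    sq_errors = []
    for y in stream:
        pred = total / count if count else 0.0
        sq_errors.append((pred - y) ** 2)      # validate first...
        total, count = total + y, count + 1    # ...then update
    return sq_errors

errors = prequential_mean([1.0, 1.0, 1.0, 1.0])
```

Because every observation is scored before the model has seen it, the accumulated errors serve as an honest running estimate of out-of-sample performance.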

There is a large literature on this topic; indeed, online learning and early stopping were both important features in neural networks (see, for example, Ripley and Bishop). Once a model has been selected, one often wants to assess its statistical properties. Some of the issues of interest are standard errors of parameters and predictions, false discovery rates, predictive performance, and relevance of the chosen model, to name a few.

Standard error bars are often neglected in modern large-scale applications, but predictions are always more useful with standard errors. If the data are massive, these standard errors can be negligibly small and can safely be ignored. But if they are not small, they raise a flag and usually imply that one has made a prediction in a data-poor region. Standard error estimates usually accompany parameter and prediction estimates for traditional linear models.

However, as statistical models have grown in complexity, estimation of secondary measures such as standard errors has not kept pace with prediction performance. It is also more difficult to get a handle on standard errors for complex models. The bootstrap is a very general method for estimating standard errors, independent of the model complexity. It can be used to estimate standard errors of estimated parameters or of predictions at future evaluation points. With the bootstrap, the entire modeling procedure is applied to many random resamples of the data, and a prediction is produced from each resulting model. One then computes the standard deviation of those predictions.
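A minimal sketch of this procedure, assuming the statistic of interest is simply the sample mean:

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=1000, seed=0):
    """Standard deviation of `stat` recomputed on bootstrap resamples
    (each of size N, drawn with replacement from the data)."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]
    return statistics.stdev(reps)

se = bootstrap_se(list(range(1, 101)), statistics.mean)
```

The same wrapper works for any fitted-model prediction in place of `statistics.mean`, which is what makes the bootstrap so model-agnostic.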

The original bootstrap takes samples of size N (the original size) with replacement from the training data, which may be infeasible in the setting of massive data. The bootstrap is also the basis for model averaging in random forests, and in this context is somewhat similar to certain Bayesian averaging procedures. Massive data streams open the door to some interesting new modeling paradigms that need to be researched to determine their potential effectiveness and usefulness.

With parallel systems one could randomize the data stream and produce multiple models and, hence, predictions. These could be combined to form average predictions, prediction intervals, standard errors, and so on. This is a new area that has not been studied much in the literature. Building effective models for the analysis of massive data requires different considerations than building models for the kinds of small data sets that have been more common in traditional statistics.
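The parallel-prediction idea described above can be sketched as fitting a simple model on each random partition of the data and combining the resulting predictions; all names here are illustrative, and the "model" is again just a mean.

```python
import random
import statistics

def partition_average(data, fit_predict, n_parts=5, seed=0):
    """Randomize the data into disjoint partitions, produce one
    prediction per partition, and report their mean and spread."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    preds = [fit_predict([data[i] for i in idx[j::n_parts]])
             for j in range(n_parts)]
    return statistics.mean(preds), statistics.stdev(preds)

avg, spread = partition_average(list(range(1, 101)), statistics.mean)
```

Each partition could be processed on a separate machine; the spread across partitions gives a rough standard error for the averaged prediction.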

Each of the topics outlined above faces new challenges when the data become massive, although having access to much more data also opens the door to alternative approaches. For example, are missing data less of an issue, because we have so much data that we can afford to lose some measurements or observations? Can we simply discard observations with missing entries? Likewise, does dirty (mislabeled or incorrect) data hurt as much if we have a large amount of it? The committee notes that, in general, while sampling error decreases with increasing sample size, bias does not—big data does not help overcome bad bias.

Because massive amounts of observational data are exposed to many sources of contamination, sometimes through malicious intervention, can models be built that self-protect against these various sources? Can the scope of these models be enlarged to cover more and different sources of contamination?

Is the bias-variance trade-off still relevant for massive data—that is, is variance still an issue? If the models considered involve combinations of the p variables (interactions), then the number of such combinations grows rapidly, and many of them will be impacted by variance. In general, how will model selection change to reflect these issues? One suggestion is to find a least complex model that explains the data in a sufficient manner. Is cross-validation relevant for massive data? Cross-validation has to live with correlations between estimates from different subsets of the data, because of overlap.

This has an impact on, for example, standard error estimates for predictions (Markatou et al.). Is one better off using independent subsets of the data to fit the model sequences, or some hybrid approach? The bootstrap is a general tool for evaluating the statistical properties of a fitted model. Is the bootstrap relevant and feasible (e.g., computationally) for massive data?

### Bayesian packages for general model fitting

The arm package contains R functions for Bayesian inference using lm, glm, mer and polr objects. BACCO contains three sub-packages (emulator, calibrator, and approximator) that perform Bayesian emulation and calibration of computer programs.

The models include linear regression models, multinomial logit, multinomial probit, multivariate probit, multivariate mixture of normals (including clustering), density estimation using finite mixtures of normals as well as Dirichlet process priors, hierarchical linear models, hierarchical multinomial logit, hierarchical negative binomial regression models, and linear instrumental variable models.

LaplacesDemon seeks to provide a complete Bayesian environment, including numerous MCMC algorithms, Laplace Approximation with multiple optimization algorithms, scores of examples, dozens of additional probability distributions, numerous MCMC diagnostics, Bayes factors, posterior predictive checks, a variety of plots, elicitation, parameter and variable importance, and numerous additional utility functions.

It contains R functions to fit a number of regression models (linear regression, logit, ordinal probit, probit, Poisson regression, etc.). It also contains a generic Metropolis sampler that can be used to fit arbitrary models. The mcmc package consists of an R function for a random-walk Metropolis algorithm for a continuous random vector. Users can choose samplers and write new samplers.

The package also supports other methods, such as particle filtering, or whatever users write in its algorithm language.

### Bayesian packages for specific models or methods

The abc package implements several ABC algorithms for performing parameter estimation and model selection. Cross-validation tools are also available for measuring the accuracy of ABC estimates and for calculating the misclassification probabilities of different models.


It provides routines to help determine optimal Bayesian network models for a given data set, where these models are used to identify statistical dependencies in messy, complex data. AdMit provides functions to perform the fitting of an adaptive mixture of Student-t distributions to a target density through its kernel function. The mixture approximation can be used as the importance density in importance sampling or as the candidate density in the Metropolis-Hastings algorithm.

The distribution parameters may capture location, scale, shape, etc. The BART package provides flexible nonparametric modeling of covariates for continuous, binary, categorical, and time-to-event outcomes. BAS utilizes an efficient algorithm to sample models without replacement. BayesSummaryStatLM provides two functions: one that computes summary statistics of data and one that carries out the MCMC posterior sampling for Bayesian linear regression models where summary statistics are used as input.

BayesFactor provides a suite of functions for computing various Bayes factors for simple designs, including contingency tables, one- and two-sample designs, one-way designs, general ANOVA designs, and linear regression. BayesVarSel calculates Bayes factors in linear models to provide a formal Bayesian answer to testing and variable selection problems. BCE contains functions to estimate taxonomic compositions from biomarker data using a Bayesian approach. BCBCSF provides functions to predict a discrete response based on selected high-dimensional features, such as gene expression data.

It is also capable of computing Bayesian discrimination probabilities equivalent to the implemented Bayesian clustering. Spike-and-slab models are adopted in a way that produces an importance measure for clustering and discriminant variables. BDgraph provides statistical tools for Bayesian structure learning in undirected graphical models for multivariate continuous, discrete, and mixed data.

BLR provides R functions to fit parametric regression models using different types of shrinkage methods. The BMA package has functions for Bayesian model averaging for linear models, generalized linear models, and survival models. The complementary package ensembleBMA uses the BMA package to create probabilistic forecasts of ensembles using a mixture of normal distributions. Built-in priors include coefficient priors (fixed, flexible, and hyper-g priors) and five kinds of model priors.

Bmix is a bare-bones implementation of sampling algorithms for a variety of Bayesian stick-breaking (marginally DP) mixture models, including particle learning and Gibbs sampling for static DP mixtures, particle learning for dynamic BAR stick-breaking, and DP mixture regression. BNSP is a package for Bayesian non- and semi-parametric model fitting. It handles Dirichlet process mixtures and spike-and-slab priors for multivariate and univariate response analysis, with nonparametric models for the means, the variances, and the correlation matrix.

BoomSpikeSlab provides functions to do spike-and-slab regression via the stochastic search variable selection algorithm. It handles probit, logit, Poisson, and Student-t data. This package allows Bayesian estimation of multi-gene models via Laplace approximations and provides tools for interval mapping of genetic loci. The package also contains graphical tools for QTL analysis. Gaussian processes are represented with a Fourier series based on cosine basis functions.

BVS is a package for Bayesian variant selection and Bayesian model uncertainty techniques for genetic association studies. It includes the calculations of the Kalman filter and smoother, and the forward filtering backward sampling algorithm. EbayesThresh implements Bayesian estimation for thresholding methods. Although the original model is developed in the context of wavelets, this package is useful when researchers need to take advantage of possible sparsity in a parameter set.

FME provides functions to help in fitting models to data and to perform Monte Carlo, sensitivity, and identifiability analyses. It is intended to work with models written as a set of differential equations that are solved either by an integration routine from deSolve or a steady-state solver from rootSolve. The gbayes function in Hmisc derives the posterior (and optionally the predictive) distribution when both the prior and the likelihood are Gaussian, and when the statistic of interest comes from a two-sample problem. The hbsae package provides functions to compute small area estimates based on a basic area- or unit-level model.

The model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. The function krige.
