
## On two simple and effective procedures for high dimensional classification of general populations

### Statistical Papers (2016-04-01) 57: 381-405

In this paper, we generalize two criteria, the determinant-based and trace-based criteria proposed by Saranadasa (J Multivar Anal 46:154–174, 1993), to general populations for high dimensional classification. These two criteria compare some distances between a new observation and several different known groups. The determinant-based criterion performs well for correlated variables by integrating the covariance structure and is competitive with many other existing rules. The criterion, however, requires the measurement dimension to be smaller than the sample size. The trace-based criterion, in contrast, is an independence rule and is effective in the “large dimension-small sample size” scenario. An appealing property of these two criteria is that their implementation is straightforward and there is no need for preliminary variable selection or use of tuning parameters. Their asymptotic misclassification probabilities are derived using the theory of large dimensional random matrices. Their competitive performances are illustrated by intensive Monte Carlo experiments and a real data analysis.
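The abstract does not spell out the trace-based criterion, so the following is only a minimal sketch under a stated assumption: a new observation is assigned to the group whose within-group sum-of-squares matrix has the smallest increase in trace when the observation is added. The function names and the toy data are hypothetical. The sketch illustrates why such a rule needs no covariance inverse and therefore still works when the dimension exceeds the group sizes.

```python
def group_mean(samples):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def trace_increase(z, samples):
    """Growth of the trace of a group's sum-of-squares matrix when z joins.

    Adding z to a group of size n with mean m adds the rank-one matrix
    n / (n + 1) * (z - m)(z - m)^T, whose trace is n / (n + 1) * ||z - m||^2.
    No covariance matrix is inverted, so p > n causes no difficulty.
    """
    n = len(samples)
    m = group_mean(samples)
    return n / (n + 1) * sum((a - b) ** 2 for a, b in zip(z, m))


def classify(z, groups):
    """Return the label with the smallest trace increase (assumed rule)."""
    return min(groups, key=lambda label: trace_increase(z, groups[label]))


# Hypothetical toy data: p = 5 variables, only 3 observations per group.
groups = {
    "A": [[0, 0, 0, 0, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]],
    "B": [[4, 4, 4, 4, 4], [5, 4, 5, 4, 5], [4, 5, 4, 5, 4]],
}
print(classify([0.5, 0.5, 0.5, 0.5, 0.5], groups))
print(classify([4.2, 4.1, 4.3, 4.0, 4.4], groups))
```

Because only squared Euclidean distances to the group means enter, correlations between variables are ignored, which is what makes this kind of rule an independence rule.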

## Cluster analysis of census data using the symbolic data approach

### Advances in Data Analysis and Classification (2008-10-01) 2: 163-176

The aim of this paper is to investigate the economic specialization of the Italian local labor systems (sets of contiguous municipalities with a high degree of self-containment of daily commuter travel) by using the Symbolic Data approach, on the basis of data derived from the Census of Industrial and Service Activities. Specifically, the economic structure of a local labor system (LLS) is described by an interval-type variable, a special symbolic data type that allows for the fact that not all municipalities within the same LLS have the same economic structure.

## Model based clustering for mixed data: clustMD

### Advances in Data Analysis and Classification (2016-06-01) 10: 155-169

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.

## Marginal and simultaneous predictive classification using stratified graphical models

### Advances in Data Analysis and Classification (2016-09-01) 10: 305-326

An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based on solely conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.

## Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach

### Advances in Data Analysis and Classification (2009-06-01) 3: 39-61

In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of each input variable, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement of accuracy for classification tasks.

## On Hölder fields clustering

### TEST (2012-06-01) 21: 301-316

Based on *n* randomly drawn vectors in a Hilbert space, we study the *k*-means clustering scheme. Here, clustering is performed by computing the Voronoi partition associated with centers that minimize an empirical criterion, called distortion. The performance of the method is evaluated by comparing the theoretical distortion of empirical optimal centers to the theoretical optimal distortion. Our first result states that, provided that the underlying distribution satisfies an exponential moment condition, an upper bound for the above performance criterion is $O(1/\sqrt{n})$. Then, motivated by a broad range of applications, we focus on the case where the data are real-valued random fields. Assuming that they share a Hölder property in quadratic mean, we construct a numerically simple *k*-means algorithm based on a discretized version of the data. With a judicious choice of the discretization, we prove that the performance of this algorithm matches the performance of the classical algorithm.
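To make the empirical criterion concrete, here is a minimal sketch of *k*-means on a discretized version of functional data. It assumes curves on [0, 1] sampled at equally spaced midpoints, a Riemann-sum approximation of the squared L2 distance, and Lloyd's iterations, which reach only a local minimizer rather than the empirical optimal centers studied in the paper. All function names and the toy curves are hypothetical.

```python
import math
import random


def discretize(f, m):
    """Sample a function on [0, 1] at m equally spaced midpoints."""
    return [f((j + 0.5) / m) for j in range(m)]


def sq_dist(x, c):
    """Riemann-sum approximation of the squared L2 distance on [0, 1]."""
    return sum((a - b) ** 2 for a, b in zip(x, c)) / len(x)


def assign(data, centers):
    """Voronoi assignment: index of the nearest center for each curve."""
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in data]


def distortion(data, centers):
    """Empirical distortion: mean squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data) / len(data)


def lloyd(data, k, iters=50, seed=0):
    """Lloyd's algorithm; converges to a local minimum of the distortion."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(data, k)]
    for _ in range(iters):
        labels = assign(data, centers)
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(data, centers)


# Hypothetical toy fields: sine curves with positive or negative amplitude.
amplitudes = (0.9, 1.0, 1.1, -0.9, -1.0, -1.1)
curves = [lambda t, a=a: a * math.sin(2 * math.pi * t) for a in amplitudes]
data = [discretize(f, 20) for f in curves]
centers, labels = lloyd(data, 2)
print(labels, distortion(data, centers))
```

The grid size `m` plays the role of the discretization level: a finer grid approximates the L2 distance better at a higher computational cost.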

## Clustering of functional data in a low-dimensional subspace

### Advances in Data Analysis and Classification (2012-10-01) 6: 219-247

To find optimal clusters of functional objects in a lower-dimensional subspace of data, a sequential method called tandem analysis is often used, though such a method is problematic. A new procedure is developed to find optimal clusters of functional objects and, simultaneously, an optimal subspace for clustering. The method is based on the *k*-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.
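One way to see why the sequential approach can be problematic is the sketch below. It uses plain two-dimensional vectors instead of functional objects, and made-up data. It performs only the first step of a tandem analysis, assumed here to be a principal component reduction computed by power iteration, and shows that the leading component can follow a high-variance direction that carries no cluster information.

```python
import math


def centered(data):
    """Subtract the column means from every row."""
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance(data):
    """Sample covariance matrix as a list of lists."""
    x = centered(data)
    n, p = len(x), len(x[0])
    return [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(cov, iters=500):
    """Leading eigenvector of a covariance matrix by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def scores(data, direction):
    """Project the centered data onto a direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in centered(data)]


# Hypothetical data: the clusters differ only in the second coordinate,
# while the first coordinate is high-variance noise shared by both.
noise = (-10.0, -5.0, 0.0, 5.0, 10.0)
data = [[x, 1.0] for x in noise] + [[x, -1.0] for x in noise]

pc1 = leading_component(covariance(data))
s = scores(data, pc1)
gap_on_pc1 = abs(sum(s[:5]) / 5 - sum(s[5:]) / 5)
print(pc1, gap_on_pc1)
```

Clustering the first-component scores from this example cannot separate the two groups, which is the motivation for searching for the subspace and the clusters at the same time.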

## Model-based regression clustering for high-dimensional data: application to functional data

### Advances in Data Analysis and Classification (2016-03-14): 1-37

Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and high-dimensional response and propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which takes into account the sparsity and a maximum likelihood estimator penalized by the rank, to take into account the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models, varying those two parameters and we select a model among this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods both on simulated and real datasets, to understand how they work in practice.

## A multilevel finite mixture item response model to cluster examinees and schools

### Advances in Data Analysis and Classification (2016-03-01) 10: 53-70

Within the educational context, a key goal is to assess students’ acquired skills and to cluster students according to their ability level. In this regard, a relevant element to be accounted for is the possible effect of the school that students come from. To this end, we provide a methodological tool which takes into account the multilevel structure of the data (i.e., students in schools) and allows us to cluster both students and schools into homogeneous classes of ability and effectiveness, and to assess the effect of certain student and school characteristics on the probability of belonging to such classes. The proposed approach relies on an extended class of multidimensional latent class IRT models characterised by: (i) latent traits defined at student and school level, (ii) latent traits represented through random vectors with a discrete distribution, (iii) the inclusion of covariates at student and school level, and (iv) a two-parameter logistic parametrisation for the conditional probability of a correct response given the ability. The approach is applied to the analysis of data collected by two national tests administered in Italy to middle school students in June 2009: the INVALSI Language Test and the Mathematics Test.
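As a rough illustration of items (ii) and (iv), the sketch below evaluates a two-parameter logistic response probability and averages it over a discrete ability distribution. It assumes the common form with a discrimination and a difficulty parameter per item; the paper's exact parametrisation, the multilevel structure and the covariates are not reproduced, and all numbers are hypothetical.

```python
import math


def two_pl(theta, discrimination, difficulty):
    """2PL probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def marginal_probability(support, weights, discrimination, difficulty):
    """Average the 2PL curve over a discrete ability distribution.

    support: ability level of each latent class
    weights: class probabilities (assumed to sum to one)
    """
    return sum(w * two_pl(t, discrimination, difficulty)
               for t, w in zip(support, weights))


# Hypothetical three-class ability distribution and a single item.
support = [-1.0, 0.0, 1.5]
weights = [0.3, 0.5, 0.2]
print(marginal_probability(support, weights, discrimination=1.2, difficulty=0.0))
```

With a discrete latent distribution, each support point defines a latent class, so estimating the weights and support points amounts to clustering examinees by ability.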

## On the breakdown behavior of the TCLUST clustering procedure

### TEST (2013-09-01) 22: 466-487

Clustering procedures allowing for general covariance structures of the obtained clusters need some constraints on the solutions. With this in mind, several proposals have been introduced in the literature. The TCLUST procedure works with a restriction on the “eigenvalues-ratio” of the clusters’ scatter matrices. To achieve robustness with respect to outliers, the procedure allows a proportion *α* of the most outlying observations to be trimmed off. The resistance of TCLUST to infinitesimal contamination has already been studied. This paper looks at its resistance to a higher amount of contamination by studying its breakdown behavior. The rather new concept of restricted breakdown point demonstrates that the TCLUST procedure resists a proportion *α* of contamination as soon as the data set is sufficiently “well clustered”.
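The effect of trimming can be sketched with a deliberately simplified relative of TCLUST: a trimmed *k*-means iteration with spherical clusters, which omits the eigenvalues-ratio restriction and the general scatter matrices altogether. The function name, the starting centers and the data are hypothetical; the point is only to show how discarding a proportion *α* of the most outlying observations protects the centers from a gross outlier.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def trimmed_kmeans(data, centers, alpha, iters=100):
    """Trimmed k-means: discard the int(n * alpha) farthest points each step.

    Returns the final centers and a label per observation, with None
    marking trimmed observations.
    """
    n = len(data)
    keep = n - int(n * alpha)
    centers = [list(c) for c in centers]
    kept = []
    for _ in range(iters):
        nearest = []
        for i, x in enumerate(data):
            j = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            nearest.append((sq_dist(x, centers[j]), i, j))
        nearest.sort()            # closest observations first
        kept = nearest[:keep]     # the rest are trimmed in this step
        new_centers = []
        for j in range(len(centers)):
            members = [data[i] for _, i, lab in kept if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = [None] * n
    for _, i, j in kept:
        labels[i] = j
    return centers, labels


# Hypothetical one-dimensional data: two tight groups and one gross outlier.
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [100.0]]
start = [[0.0], [5.0]]
trimmed_centers, trimmed_labels = trimmed_kmeans(data, start, alpha=0.15)
plain_centers, plain_labels = trimmed_kmeans(data, start, alpha=0.0)
print(trimmed_centers, trimmed_labels)
print(plain_centers, plain_labels)
```

Comparing the two runs shows the breakdown idea in miniature: without trimming, a single outlier captures one center and the two genuine groups are merged, whereas trimming one observation leaves both groups intact.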