
## On two simple and effective procedures for high dimensional classification of general populations

### Statistical Papers (2016-04-01) 57: 381-405

In this paper, we generalize two criteria, the determinant-based and trace-based criteria proposed by Saranadasa (J Multivar Anal 46:154–174, 1993), to general populations for high dimensional classification. These two criteria compare distances between a new observation and several known groups. The determinant-based criterion performs well for correlated variables by integrating the covariance structure and is competitive with many existing rules. The criterion, however, requires that the measurement dimension be smaller than the sample size. The trace-based criterion, in contrast, is an independence rule and is effective in the “large dimension-small sample size” scenario. An appealing property of these two criteria is that their implementation is straightforward and there is no need for preliminary variable selection or use of tuning parameters. Their asymptotic misclassification probabilities are derived using the theory of large dimensional random matrices. Their competitive performance is illustrated by intensive Monte Carlo experiments and a real data analysis.
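The trace-based rule can be illustrated with a minimal sketch. It assumes one reading of the criterion that the abstract does not spell out: a new observation is assigned to the group whose within-group sum of squares (the trace of its scatter matrix) grows least when the observation is added. The function names are hypothetical, and this is not the paper's generalized statistic.

```python
def trace_increase(x, group):
    """Increase in a group's within-group sum of squares when x joins it.

    Uses the identity: adding x to n points with mean m raises the sum of
    squares by n / (n + 1) * ||x - m||^2.
    """
    n = len(group)
    p = len(x)
    mean = [sum(row[j] for row in group) / n for j in range(p)]
    sq_dist = sum((x[j] - mean[j]) ** 2 for j in range(p))
    return n / (n + 1) * sq_dist


def classify_by_trace(x, groups):
    """Return the index of the group with the smallest trace increase (assumed rule)."""
    scores = [trace_increase(x, g) for g in groups]
    return min(range(len(groups)), key=scores.__getitem__)


# Two small illustrative groups in two dimensions (made-up numbers).
groups = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[10.0, 10.0], [11.0, 11.0]],
]
label = classify_by_trace([0.4, 0.6], groups)
```

Because only squared distances to the group means enter the score, no covariance matrix has to be inverted, which is why a rule of this kind remains usable when the dimension exceeds the sample size.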

## Cluster analysis of census data using the symbolic data approach

### Advances in Data Analysis and Classification (2008-10-01) 2: 163-176

The aim of this paper is to investigate the economic specialization of the Italian local labor systems (sets of contiguous municipalities with a high degree of self-containment of daily commuter travel) by using the Symbolic Data approach, on the basis of data derived from the Census of Industrial and Service Activities. Specifically, the economic structure of a local labor system (LLS) is described by an interval-type variable, a special symbolic data type that allows for the fact that not all municipalities within the same LLS have the same economic structure.

## Model based clustering for mixed data: clustMD

### Advances in Data Analysis and Classification (2016-06-01) 10: 155-169

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
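The latent-variable idea can be sketched for a single ordinal variable: a mixture component is drawn, a Gaussian latent value is generated from it, and the observed level is the interval of cut points into which the latent value falls. This is only an illustration under assumed names and parameters (`thresholds`, `means`, `sds`, `weights` are hypothetical); it is not the clustMD parameterization or its estimation algorithm.

```python
import bisect
import random


def to_ordinal(z, thresholds):
    """Map a latent value z to a level 1..K using sorted cut points."""
    return bisect.bisect_right(thresholds, z) + 1


def simulate_ordinal(n, means, sds, weights, thresholds, seed=0):
    """Draw (component, ordinal level) pairs from a latent Gaussian mixture."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Pick a mixture component, then generate the latent Gaussian value.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        z = rng.gauss(means[k], sds[k])
        draws.append((k, to_ordinal(z, thresholds)))
    return draws


# Illustrative values: two well-separated components, four ordinal levels.
cuts = [-1.0, 0.0, 1.0]
sample = simulate_ordinal(200, means=[-2.0, 2.0], sds=[0.5, 0.5],
                          weights=[0.5, 0.5], thresholds=cuts)
```

With components this far apart, low levels come almost entirely from the first component and high levels from the second, which is the structure a clustering of the observed levels would try to recover.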

## Marginal and simultaneous predictive classification using stratified graphical models

### Advances in Data Analysis and Classification (2016-09-01) 10: 305-326

An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based solely on conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.

## Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach

### Advances in Data Analysis and Classification (2009-06-01) 3: 39-61

In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of each input variable, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is searched for by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement of accuracy for classification tasks.
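A small sketch can show why a joint partition detects interactions that univariate filters miss. It builds a data grid from two numerical variables and scores it with the empirical conditional entropy of the class given the grid cell. The entropy score is a stand-in chosen for illustration, not the paper's Bayesian model selection criterion, and all names and cut points are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from math import log2


def data_grid(points, labels, cuts1, cuts2):
    """Count class labels in each cell of the grid induced by two sets of cut points."""
    grid = defaultdict(Counter)
    for (x1, x2), y in zip(points, labels):
        cell = (bisect_right(cuts1, x1), bisect_right(cuts2, x2))
        grid[cell][y] += 1
    return grid


def conditional_entropy(grid):
    """Empirical H(class | cell) in bits; lower means the grid is more informative."""
    total = sum(sum(counts.values()) for counts in grid.values())
    h = 0.0
    for counts in grid.values():
        n_cell = sum(counts.values())
        for n in counts.values():
            h -= (n / total) * log2(n / n_cell)
    return h


# XOR-like data: neither variable is informative alone, but the pair is.
points = [(0.2, 0.2), (0.8, 0.8), (0.2, 0.8), (0.8, 0.2)]
labels = [0, 0, 1, 1]

joint = conditional_entropy(data_grid(points, labels, [0.5], [0.5]))
first_only = conditional_entropy(data_grid(points, labels, [0.5], []))
```

Partitioning only the first variable leaves a full bit of uncertainty about the class, while the two-by-two grid removes it entirely.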

## On Hölder fields clustering

### TEST (2012-06-01) 21: 301-316

Based on *n* randomly drawn vectors in a Hilbert space, we study the *k*-means clustering scheme. Here, clustering is performed by computing the Voronoi partition associated with centers that minimize an empirical criterion, called distortion. The performance of the method is evaluated by comparing the theoretical distortion of empirical optimal centers to the theoretical optimal distortion. Our first result states that, provided that the underlying distribution satisfies an exponential moment condition, an upper bound for the above performance criterion is $O(1/\sqrt{n})$. Then, motivated by a broad range of applications, we focus on the case where the data are real-valued random fields. Assuming that they share a Hölder property in quadratic mean, we construct a numerically simple *k*-means algorithm based on a discretized version of the data. With a judicious choice of the discretization, we prove that the performance of this algorithm matches the performance of the classical algorithm.
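The empirical distortion and the discretization step can be sketched as follows. Curves on [0, 1] are sampled on an equally spaced grid, and Lloyd iterations are run on the resulting vectors. Lloyd's algorithm only finds a local minimizer, so this is a heuristic stand-in for the empirical optimal centers studied in the paper. The grid size, the example curves and the function names are assumptions, and the plain Euclidean norm is used without the 1/m scaling that would approximate an L2 norm.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two discretized curves."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distortion(data, centers):
    """Empirical distortion: mean squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data) / len(data)


def lloyd(data, centers, n_iter=20):
    """Alternate Voronoi assignment and centroid updates (local search only)."""
    for _ in range(n_iter):
        cells = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            cells[j].append(x)
        # Keep the old center if its Voronoi cell is empty.
        centers = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else c
            for cell, c in zip(cells, centers)
        ]
    return centers


def discretize(f, m):
    """Sample a function on [0, 1] at m equally spaced points (m >= 2)."""
    return [f(i / (m - 1)) for i in range(m)]


# Four constant curves forming two obvious groups (illustrative data).
levels = [0.0, 0.1, 5.0, 5.1]
curves = [discretize(lambda t, a=a: a, 5) for a in levels]
centers = lloyd(curves, [curves[0], curves[2]])
value = distortion(curves, centers)
```

Here each curve ends up 0.05 away from its center at each of the 5 grid points, so the distortion is 5 × 0.05² = 0.0125.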

## Clustering of functional data in a low-dimensional subspace

### Advances in Data Analysis and Classification (2012-10-01) 6: 219-247

To find optimal clusters of functional objects in a lower-dimensional subspace of data, a sequential method called tandem analysis is often used, though such a method is problematic. A new procedure is developed to find optimal clusters of functional objects and, simultaneously, an optimal subspace for clustering. The method is based on the *k*-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.

## A multilevel finite mixture item response model to cluster examinees and schools

### Advances in Data Analysis and Classification (2016-03-01) 10: 53-70

Within the educational context, a key goal is to assess students’ acquired skills and to cluster students according to their ability level. In this regard, a relevant element to be accounted for is the possible effect of the school that students come from. For this aim, we provide a methodological tool which takes into account the multilevel structure of the data (i.e., students in schools) and allows us to cluster both students and schools into homogeneous classes of ability and effectiveness, and to assess the effect of certain student and school characteristics on the probability of belonging to such classes. The proposed approach relies on an extended class of multidimensional latent class IRT models characterised by: (i) latent traits defined at student and school level, (ii) latent traits represented through random vectors with a discrete distribution, (iii) the inclusion of covariates at student and school level, and (iv) a two-parameter logistic parametrisation for the conditional probability of a correct response given the ability. The approach is applied to the analysis of data collected by two national tests administered in Italy to middle school students in June 2009: the INVALSI Language Test and the Mathematics Test.
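Two of the ingredients listed above, the two-parameter logistic response function and a latent trait with a discrete distribution, can be sketched in a few lines. The standard 2PL form is used; the support points, class weights and item parameters are made-up illustrative values, and the multilevel structure and covariates of the paper are not modelled here.

```python
from math import exp


def two_pl(theta, discrimination, difficulty):
    """Standard 2PL probability of a correct response given ability theta."""
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))


def marginal_correct(support, weights, discrimination, difficulty):
    """Probability of a correct response averaged over a discrete ability distribution."""
    return sum(w * two_pl(t, discrimination, difficulty)
               for t, w in zip(support, weights))


# Hypothetical latent classes: low and high ability, equally likely.
support = [-1.0, 1.0]
weights = [0.5, 0.5]
p_item = marginal_correct(support, weights, discrimination=1.0, difficulty=0.0)
```

Each support point plays the role of a latent class, so assigning a student to the class with the highest posterior weight gives the clustering by ability level.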

## A divisive clustering method for functional data with special consideration of outliers

### Advances in Data Analysis and Classification (2017-08-11): 1-20

This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated. The number of clusters is estimated via the gap statistic. The functions are assigned to the new clusters by combining the *k*-means algorithm with the use of functional boxplots to identify functions that have been incorrectly classified because of their atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into the clusters and guidelines that may be used for interpreting the clusters. A simulation study using synthetic data and tests of the performance of the DivClusFD method on real data sets indicate that this method is able to classify functions accurately.

## On the breakdown behavior of the TCLUST clustering procedure

### TEST (2013-09-01) 22: 466-487

Clustering procedures allowing for general covariance structures of the obtained clusters need some constraints on the solutions. With this in mind, several proposals have been introduced in the literature. The TCLUST procedure works with a restriction on the “eigenvalues-ratio” of the clusters’ scatter matrices. To achieve robustness with respect to outliers, the procedure allows a proportion *α* of the most outlying observations to be trimmed off. The resistance of TCLUST to infinitesimal contamination has already been studied. This paper examines its resistance to a higher amount of contamination by studying its breakdown behavior. The rather new concept of restricted breakdown point will demonstrate that the TCLUST procedure resists a proportion *α* of contamination as soon as the data set is sufficiently “well clustered”.
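The two devices mentioned here, the eigenvalue-ratio restriction and trimming, can be sketched under simplifying assumptions. The check below computes the ratio of the largest to the smallest eigenvalue over all cluster scatter matrices, restricted to 2×2 symmetric matrices so that a closed form can be used. The trimming step discards the ⌊αn⌋ points farthest from their nearest center in squared Euclidean distance, which is a simplification and not the criterion TCLUST itself optimizes. The bound `c` and all function names are hypothetical.

```python
from math import sqrt


def eigenvalues_2x2(s):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = s
    mid = (a + d) / 2.0
    half_gap = sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mid - half_gap, mid + half_gap


def eigenvalue_ratio(scatters):
    """Largest over smallest eigenvalue across all cluster scatter matrices."""
    eigs = [e for s in scatters for e in eigenvalues_2x2(s)]
    smallest = min(eigs)
    return float("inf") if smallest <= 0 else max(eigs) / smallest


def satisfies_constraint(scatters, c):
    """True if the eigenvalue ratio does not exceed the assumed bound c >= 1."""
    return eigenvalue_ratio(scatters) <= c


def trim(data, centers, alpha):
    """Drop the floor(alpha * n) points farthest from their nearest center."""
    def nearest_sq_dist(x):
        return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)

    n_keep = len(data) - int(alpha * len(data))
    order = sorted(range(len(data)), key=lambda i: nearest_sq_dist(data[i]))
    return [data[i] for i in sorted(order[:n_keep])]


# Illustrative scatter matrices and data with one gross outlier.
scatters = [[[2.0, 0.0], [0.0, 1.0]], [[8.0, 0.0], [0.0, 4.0]]]
ratio = eigenvalue_ratio(scatters)
kept = trim([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (100.0, 100.0)],
            [(0.0, 0.0), (5.0, 5.0)], alpha=0.25)
```

A bound close to 1 forces nearly spherical clusters of similar size, while a large bound allows very different cluster shapes and scales.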