## Complexity and algorithms for copy-number evolution problems

### Algorithms for Molecular Biology 12: 1-11, May 16, 2017

### Background

Cancer is an evolutionary process characterized by the accumulation of somatic mutations in a population of cells that form a tumor. One frequent type of mutations is copy number aberrations, which alter the number of copies of genomic regions. The number of copies of each position along a chromosome constitutes the chromosome’s copy-number profile. Understanding how such profiles evolve in cancer can assist in both diagnosis and prognosis.

### Results

We model the evolution of a tumor by segmental deletions and amplifications, and gauge distance from profile $\mathbf{a}$ to $\mathbf{b}$ by the minimum number of events needed to transform $\mathbf{a}$ into $\mathbf{b}$. Given two profiles, our first problem aims to find a parental profile that minimizes the sum of distances to its children. Given *k* profiles, the second, more general problem, seeks a phylogenetic tree, whose *k* leaves are labeled by the *k* given profiles and whose internal vertices are labeled by ancestral profiles such that the sum of edge distances is minimum.
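As a rough illustration of the distance described above, the following sketch applies segmental events to a small profile and finds the minimum number of events by brute-force breadth-first search. It is not the authors' dynamic program or ILP. The function names, the rule that a position with zero copies stays at zero, and the `max_copies` cap that keeps the search finite are all assumptions made for this example.

```python
from collections import deque
from typing import Optional, Sequence, Tuple

Profile = Tuple[int, ...]


def apply_event(profile: Sequence[int], start: int, end: int, delta: int) -> Profile:
    """Apply one segmental event to positions start..end (inclusive).

    delta = +1 is an amplification, delta = -1 is a deletion.
    Assumption: a position with 0 copies stays at 0 (lost material
    cannot be amplified or deleted again).
    """
    if delta not in (1, -1):
        raise ValueError("delta must be +1 or -1")
    out = list(profile)
    for i in range(start, end + 1):
        if out[i] > 0:
            out[i] += delta
    return tuple(out)


def distance(a: Sequence[int], b: Sequence[int], max_copies: int = 4) -> Optional[int]:
    """Minimum number of events turning a into b, by exhaustive BFS.

    Only practical for very short profiles. max_copies is a hypothetical
    cap on intermediate copy numbers. Returns None if b is not reachable
    under that cap.
    """
    source, goal = tuple(a), tuple(b)
    if len(source) != len(goal):
        raise ValueError("profiles must have the same length")
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        profile, steps = queue.popleft()
        if profile == goal:
            return steps
        for i in range(n):
            for j in range(i, n):
                for delta in (1, -1):
                    nxt = apply_event(profile, i, j, delta)
                    if max(nxt) <= max_copies and nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, steps + 1))
    return None
```

Under these assumptions the distance is not symmetric: `distance([1, 1, 1], [0, 1, 1])` needs one deletion, while the reverse direction returns `None` because a lost position cannot be regained.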

### Conclusions

For the former problem we give a pseudo-polynomial dynamic programming algorithm that is linear in the profile length, and an integer linear program formulation. For the latter problem we show it is NP-hard and give an integer linear program formulation that scales to practical problem instance sizes. We assess the efficiency and quality of our algorithms on simulated instances.

### Availability

https://github.com/raphael-group/CNT-ILP

## Getting DNA copy numbers without control samples

### Algorithms for Molecular Biology 7: 1-18, August 16, 2012

### Background

The selection of the reference to scale the data in a copy number analysis has paramount importance to achieve accurate estimates. Usually this reference is generated using control samples included in the study. However, these control samples are not always available and in these cases, an artificial reference must be created. A proper generation of this signal is crucial in terms of both noise and bias.

We propose NSA (Normality Search Algorithm), a scaling method that works with and without control samples. It is based on the assumption that genomic regions enriched in SNPs with identical copy numbers in both alleles are likely to be normal. These normal regions are predicted for each sample individually and used to calculate the final reference signal. NSA can be applied to any CN data regardless of the microarray technology and preprocessing method. It also finds an optimal weighting of the samples that minimizes possible batch effects.

### Results

Five human datasets (a subset of HapMap samples, Glioblastoma Multiforme (GBM), Ovarian, Prostate and Lung Cancer experiments) have been analyzed. It is shown that, using only tumoral samples, NSA is able to remove the bias in the copy number estimation and to reduce the noise, and therefore to increase the ability to detect copy number aberrations (CNAs). These improvements allow NSA to also detect recurrent aberrations more accurately than other state-of-the-art methods.

### Conclusions

NSA provides a robust and accurate reference for scaling probe signal data to CN values without the need for control samples. It minimizes the problems of bias, noise and batch effects in the estimation of CNs. Therefore, the NSA scaling approach helps to detect recurrent CNAs better than current methods. The automatic selection of references makes it useful for bulk analysis of many GEO or ArrayExpress experiments without the need to develop a parser to find the normal samples or possible batches within the data. The method is available in the open-source R package NSA, which is an add-on to the aroma.cn framework: http://www.aroma-project.org/addons.

## Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data

### Algorithms for Molecular Biology 9: 1-11, March 10, 2014

### Background

Change point problems arise in many genomic analyses such as the detection of copy number variations or the detection of transcribed regions. The expanding Next Generation Sequencing technologies now make it possible to locate change points at nucleotide resolution.

### Results

We propose to use the Pruned Dynamic Programming (PDP) algorithm for Seq-experiment outputs, because its complexity is almost linear in the sequence length when the maximal number of segments is constant and because its performance has been acknowledged for microarrays. This requires the adaptation of the algorithm to the negative binomial distribution with which we model the data. We show that if the dispersion in the signal is known, the PDP algorithm can be used, and we provide an estimator for this dispersion. We describe a compression framework which reduces the time complexity without modifying the accuracy of the segmentation. We propose to estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed methodology on RNA-Seq data.
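To make the change point problem concrete, here is a minimal sketch of the classical, unpruned segmentation dynamic program. It is deliberately simpler than the method in the abstract: it uses a squared-error cost instead of a negative binomial likelihood, applies no pruning or compression, and takes the number of segments `k` as given rather than estimating it. The function name `segment` and its signature are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def segment(signal: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Split signal into exactly k contiguous segments minimizing the
    total within-segment squared error. Runs in O(k * n^2) time.

    Returns (best cost, sorted list of change points), where a change
    point is the index at which a new segment starts.
    """
    n = len(signal)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(signal)")

    # Prefix sums give the cost of any segment in constant time.
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i: int, j: int) -> float:
        """Squared error of signal[i:j] around its mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting signal[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment starts.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts[1:]  # drop the initial start at index 0
```

For example, `segment([1, 1, 1, 5, 5, 5, 9, 9], 3)` places the change points at indices 3 and 6. The quadratic inner loop is what pruning strategies aim to avoid.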

### Conclusions

We illustrate the results of our approach on a real dataset and show its good performance. Our algorithm is available as an *R* package on the CRAN repository.

## EUCALYPT: efficient tree reconciliation enumerator

### Algorithms for Molecular Biology 10: 1-11, January 23, 2015

### Background

Phylogenetic tree reconciliation is the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites. It consists of a mapping between the parasite tree and the host tree using event-based maximum parsimony. Given a cost model for the events, however, many optimal reconciliations are possible. Any further biological interpretation of them must therefore take this into account, making the capacity to enumerate all optimal solutions a crucial point. Only two algorithms currently exist that attempt such enumeration; in one case not all possible solutions are produced, while in the other not all cost vectors are currently handled. The objective of this paper is two-fold. The first is to fill this gap, and the second is to test whether the number of solutions generally observed can be an issue in terms of interpretation.

### Results

We present a polynomial-delay algorithm for enumerating all optimal reconciliations. We show that in general many solutions exist. We give an example where, for two pairs of host-parasite trees each having fewer than 41 leaves, the number of solutions is 5120, even when only time-feasible ones are kept. To facilitate their interpretation, those solutions are also classified in terms of how many of each event they contain. The number of different classes of solutions may thus be notably smaller than the number of solutions, yet it may remain high, in particular for the cases where losses have cost 0. In fact, depending on the cost vector, both the number of solutions and the number of classes may increase considerably. To further deal with this problem, we introduce and analyse a restricted version where host switches are allowed to happen only between species that are within some fixed distance along the host tree. This restriction allows us to reduce the number of time-feasible solutions while preserving the same optimal cost, as well as to find time-feasible solutions with a cost close to the optimal in the cases where no time-feasible solution is found.

### Conclusions

We present *Eucalypt*, a polynomial-delay algorithm for enumerating all optimal reconciliations which is freely available at http://eucalypt.gforge.inria.fr/.

## Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

### Algorithms for Molecular Biology 7: 1-12, May 02, 2012

### Background

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, CGR coordinates of substrings sharing an *L*-long suffix will be located within $2^{-L}$ distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
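The suffix property mentioned above can be checked with a short sketch of the usual CGR iteration for DNA, in which each new point is the midpoint between the previous point and the corner assigned to the current nucleotide. The corner assignment, the starting point at the centre of the unit square, and the function name `cgr` are assumptions of this example; conventions differ between implementations.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Assumed corner assignment on the unit square (other conventions exist).
CORNERS: Dict[str, Point] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr(seq: str, start: Point = (0.5, 0.5)) -> List[Point]:
    """Return the CGR coordinate reached after each symbol of seq."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        # Move halfway towards the corner of the current symbol.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


# Two sequences with different prefixes but the same 7-long suffix.
suffix = "GATTACA"
p1 = cgr("ACGTTGCA" + suffix)[-1]
p2 = cgr("TTTT" + suffix)[-1]

# Each shared symbol halves the gap, so each coordinate differs by at
# most 2^-L after L shared symbols.
gap = max(abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
print(gap <= 2.0 ** -len(suffix))
```

In this sketch the bound is checked per coordinate, which is the form that follows directly from the halving step.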

### Results

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

### Conclusions

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of a unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

## A tree-based method for the rapid screening of chemical fingerprints

### Algorithms for Molecular Biology 5: 1-10, January 04, 2010

### Background

The fingerprint of a molecule is a bitstring based on its structure, constructed such that structurally similar molecules will have similar fingerprints. Molecular fingerprints can be used in an initial phase of drug development for identifying novel drug candidates by screening large databases for molecules with fingerprints similar to a query fingerprint.

### Results

In this paper, we present a method which efficiently finds all fingerprints in a database with Tanimoto coefficient to the query fingerprint above a user-defined threshold. The method is based on two novel data structures for rapid screening of large databases: the *k*D grid and the Multibit tree. The *k*D grid is based on splitting the fingerprints into *k* shorter bitstrings and utilising these to compute bounds on the similarity of the complete bitstrings. The Multibit tree uses hierarchical clustering and similarity within each cluster to compute similar bounds. We have implemented our method and tested it on a large real-world data set. Our experiments show that our method yields approximately a three-fold speed-up over previous methods.
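The idea of pruning with similarity bounds can be illustrated with a much simpler bound than the two data structures described above. The Tanimoto coefficient of two bitstrings is the number of shared set bits divided by the number of bits set in either, so it can never exceed the ratio of the smaller bit count to the larger. The sketch below uses that bit-count bound to skip candidates during a linear scan. It is not the *k*D grid or the Multibit tree; fingerprints stored as Python integers, the function names, and the choice to keep hits at or above the threshold are assumptions of this example.

```python
from typing import Iterable, List, Tuple


def popcount(bits: int) -> int:
    """Number of set bits in a fingerprint stored as an integer."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient |a AND b| / |a OR b| of two fingerprints."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return popcount(a & b) / union


def bit_count_bound(a: int, b: int) -> float:
    """Upper bound on tanimoto(a, b) using only the two bit counts."""
    na, nb = popcount(a), popcount(b)
    if max(na, nb) == 0:
        return 1.0
    return min(na, nb) / max(na, nb)


def search(query: int, database: Iterable[int], threshold: float) -> List[Tuple[int, float]]:
    """Return (fingerprint, similarity) pairs at or above threshold."""
    hits = []
    for fp in database:
        # Cheap test first: skip fingerprints that cannot reach the threshold.
        if bit_count_bound(query, fp) < threshold:
            continue
        similarity = tanimoto(query, fp)
        if similarity >= threshold:
            hits.append((fp, similarity))
    return hits
```

For instance, `search(0b1111, [0b0011, 0b1110, 0b0001], 0.7)` rejects the first and last entries from their bit counts alone and computes the full coefficient only for `0b1110`.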

### Conclusions

Using the novel *k*D grid and Multibit tree significantly reduces the time needed for searching databases of fingerprints. This will allow researchers to (1) perform more searches than previously possible and (2) easily search large databases.

## Computing evolutionary distinctiveness indices in large scale analysis

### Algorithms for Molecular Biology 7: 1-7, April 13, 2012

We present optimal linear time algorithms for computing the Shapley values and 'heightened evolutionary distinctiveness' (HED) scores for the set of taxa in a phylogenetic tree. We demonstrate the efficiency of these new algorithms by applying them to a set of 10,000 reasonable 5139-species mammal trees. This is the first time these indices have been computed on such a large taxon, and we contrast our findings with an ad-hoc index for mammals, fair proportion (FP), used by the Zoological Society of London's EDGE programme. Our empirical results follow expectations. In particular, the Shapley values are very strongly correlated with the FP scores, but provide a higher weight to the few monotremes that comprise the sister to all other mammals. We also find that the HED score, which measures a species' unique contribution to future subsets as a function of the probability that close relatives will go extinct, is very sensitive to the estimated probabilities. When they are low, HED scores are less than FP scores, and approach the simple measure of a species' age. Deviations (like the *Solenodon* genus of the West Indies) occur when sister species are both at high risk of extinction and their clade roots deep in the tree. Conversely, when endangered species have higher probabilities of being lost, HED scores can be greater than FP scores and species like the African elephant *Loxodonta africana*, the two solenodons and the thumbless bat *Furipterus horrens* can move up the rankings. We suggest that conservation attention be applied to such species that carry genetic responsibility for imperiled close relatives. We also briefly discuss extensions of Shapley values and HED scores that are possible with the algorithms presented here.
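For orientation, the fair proportion (FP) index used as the comparison above can itself be computed in linear time. Under its usual definition, the length of each branch is shared equally among the leaves below it, and a species' score is the sum of its shares along the path to the root. The sketch below implements only FP, not the Shapley or HED algorithms of the paper. The tree encoding (a child list per node and a branch length per non-root node) and the function name are assumptions of this example.

```python
from typing import Dict, List


def fair_proportion(
    children: Dict[str, List[str]],
    length: Dict[str, float],
    root: str,
) -> Dict[str, float]:
    """Fair proportion score of every leaf of a rooted tree.

    children maps each internal node to its child nodes.
    length maps each non-root node to the length of the branch above it.
    """
    # Preorder traversal: every parent appears before its children.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Count leaves below each node, visiting children before parents.
    n_leaves = {}
    for node in reversed(order):
        kids = children.get(node, [])
        n_leaves[node] = 1 if not kids else sum(n_leaves[k] for k in kids)

    # Accumulate each branch's equal share from the root downwards.
    score = {root: 0.0}
    for node in order:
        for kid in children.get(node, []):
            score[kid] = score[node] + length[kid] / n_leaves[kid]

    return {node: s for node, s in score.items() if not children.get(node)}


# Toy tree ((A:1,B:1)X:2,C:3)R
toy_children = {"R": ["X", "C"], "X": ["A", "B"]}
toy_length = {"X": 2.0, "C": 3.0, "A": 1.0, "B": 1.0}
print(fair_proportion(toy_children, toy_length, "R"))
```

On the toy tree, A and B each receive their own branch plus half of the shared branch above X, and the scores of all leaves add up to the total branch length of the tree.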

## Gravitation field algorithm and its application in gene cluster

### Algorithms for Molecular Biology 5: 1-11, September 20, 2010

### Background

Searching for optima is one of the most challenging tasks in clustering genes from available experimental data or given functions. SA, GA, PSO and other similar efficient global optimization methods are used by biotechnologists. All these algorithms are based on the imitation of natural phenomena.

### Results

This paper proposes a novel search optimization algorithm called Gravitation Field Algorithm (GFA), which is derived from the Solar Nebular Disk Model (SNDM), a well-known astronomical theory of planetary formation. GFA simulates the gravitation field and outperforms GA and SA on some multimodal function optimization problems. GFA can also be applied to unimodal functions. GFA clusters the dataset from the Gene Expression Omnibus well.

### Conclusions

The mathematical proof demonstrates that GFA converges to the global optimum with probability 1 under three conditions for mass functions of one independent variable. In addition to these results, the fundamental optimization concept in this paper is used to analyze how SA and GA affect the global search and the inherent defects in SA and GA. Some results and source code (in Matlab) are publicly available at http://ccst.jlu.edu.cn/CSBG/GFA.

## Consistency of the Neighbor-Net Algorithm

### Algorithms for Molecular Biology 2: 1-11, June 28, 2007

### Background

Neighbor-Net is a novel method for phylogenetic analysis that is currently being widely used in areas such as virology, bacteriology, and plant evolution. Given an input distance matrix, Neighbor-Net produces a phylogenetic network, a generalization of an evolutionary or phylogenetic tree which allows the graphical representation of conflicting phylogenetic signals.

### Results

In general, any network construction method should not depict more conflict than is found in the data, and, when the data is fitted well by a tree, the method should return a network that is close to this tree. In this paper we provide a formal proof that Neighbor-Net satisfies both of these requirements so that, in particular, Neighbor-Net is statistically consistent on circular distances.

## A priori assessment of data quality in molecular phylogenetics

### Algorithms for Molecular Biology 9: 1-8, September 12, 2014

Sets of sequence data used in phylogenetic analysis are often plagued by both random noise and systematic biases. Since the commonly used methods of phylogenetic reconstruction are designed to produce trees, it is an important task to evaluate these trees *a posteriori*. Preferably, however, one would like to assess the suitability of the input data for phylogenetic analysis *a priori* and, if possible, obtain information on how to prune the data sets to improve the quality of phylogenetic reconstruction without introducing unwarranted biases. In the last few years several different approaches, algorithms, and software tools have been proposed for this purpose. Here we provide an overview of the state of the art and briefly discuss the most pressing open problems.