Friday, April 5, 2002
DNA sequences evolve in response to numerous factors, including
natural selection, demographic parameters, and random "neutral"
changes. Particularly when we focus on the evolution of protein-coding
genes, modeling the process of sequence evolution becomes challenging,
as heterogeneity in the nucleotide substitution process exists at many
levels: rates vary from site to site, from gene to gene, and from
species to species. I will discuss statistical approaches for
identifying the sources of such heterogeneity, and demonstrate their
application to studies of DNA sequence evolution in plant genomes.
When analyzing protein coding sequence data, it is highly desirable to
be able to discriminate between patterns of synonymous and
non-synonymous mutation rates, because synonymous and non-synonymous
mutations differ radically in their biological consequences. Current
evolutionary models that account for rate variability make some
undesirable a priori assumptions about the structure of rate patterns.
These assumptions may lead, for instance, to erroneous classification
of loci as being under selective pressure. I will present a new,
computationally feasible evolutionary model that allows us to study
synonymous and non-synonymous rates independently and avoids most of
the shortcomings of existing models.
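To make the synonymous versus non-synonymous distinction concrete, here is a
minimal sketch that classifies codon differences between two aligned coding
sequences using the standard genetic code. It is only a raw count, not the
rate model described in the abstract; the function name count_differences
and the choice to skip codons with more than one changed position are
assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of single-site codon changes.

    Hypothetical helper: both sequences must be aligned, in frame, and
    contain only the letters T, C, A, G.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            # Identical codons carry no change; codons with several changed
            # positions are skipped because the mutational path is ambiguous.
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# CTT -> CTC keeps leucine (synonymous); GAA -> GAT changes Glu to Asp.
print(count_differences("ATGCTTGAA", "ATGCTCGAT"))
```

A model-based analysis would replace these raw counts with estimated rates,
but the classification step is the same.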
The motor protein kinesin, powered by ATP hydrolysis, moves processively
along microtubules. This behavior has made it suitable for extensive
single molecule experiments. Recent studies measuring the position of a
bead tethered to a moving kinesin while trapped in an optical tweezer
have generated extensive data records. Past analyses have evaluated the
global properties of the data, particularly determining velocity and
trajectory variances. However, the details of the data potentially
contain much more information about the protein. We introduce hidden
Markov models whose parameters are estimated from the kinesin assay data
with the EM algorithm, along with a method for model selection. We
examine how the algorithms perform on simulated data, with an eye toward
applying them to the experimental data.
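As a rough illustration of the machinery behind hidden Markov model fitting,
the sketch below computes the log-likelihood of an observation sequence with
the scaled forward algorithm, the quantity that EM increases at each
iteration and that likelihood-based model selection compares. It assumes
discrete emissions for simplicity, whereas bead positions are continuous;
the function name and all parameter values are hypothetical.

```python
import math


def forward_log_likelihood(obs, init, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs   : list of observation symbol indices
    init  : init[s] = probability of starting in state s
    trans : trans[r][s] = probability of moving from state r to state s
    emit  : emit[s][o] = probability of emitting symbol o in state s
    """
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then weight
            # each state by how well it explains the current observation.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            raise ValueError("observation sequence has zero probability")
        # Rescale to avoid underflow on long records; the log of the scale
        # factors accumulates into the total log-likelihood.
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Toy two-state model with made-up parameters.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.3, 0.7]]
print(forward_log_likelihood([0, 1, 0, 0, 1], init, trans, emit))
```

Comparing this value across candidate models, with a penalty for the number
of parameters, is one common way to approach model selection.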
Population genetic studies of the human immunodeficiency viruses have
demonstrated substantial evolution of HIV during transmission and
chronic infection. Viral genotypes involved in initiating new infections
are typically minority forms relative to the viral population present in
the donor host. Changes in genotype frequencies are due, in part, to
sampling effects during transmission, when the viral population is forced
through a severe bottleneck. However, studies of cell tropism and
coreceptor usage by HIV-1 also suggest that transmission may exert
selective pressures that conflict with those experienced once the
host has been successfully colonized.
Here we consider the population genetic consequences of tradeoffs between
transmission and within-host adaptation using a single-locus model
incorporating genetic drift, transmission bottlenecks, mutation, and both
forms of selection. We obtain deterministic and stochastic limits (at the
level of a measure-valued process) as the size of the infected host
population is scaled to infinity and present the invariant measures and
transmission probabilities for the former. Combined with a detailed study
of the genetics of transmission of HIV-1, these considerations could prove
useful to ongoing efforts to develop an HIV vaccine.
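The tradeoff can be sketched with a toy two-allele simulation: one allele is
favored within the host but disfavored at transmission. This is not the
measure-valued model of the abstract, only a minimal illustration; the
function names, the symmetric mutation scheme, and every parameter value are
assumptions.

```python
import random


def within_host_step(p, s_within, mu):
    """One generation of selection, then symmetric mutation, on frequency p.

    p is the frequency of allele A, which has relative fitness 1 + s_within
    inside the host; mu is the per-generation mutation probability.
    """
    mean_fitness = p * (1.0 + s_within) + (1.0 - p)
    p_sel = p * (1.0 + s_within) / mean_fitness
    return p_sel * (1.0 - mu) + (1.0 - p_sel) * mu


def transmit(p, founders, s_trans, rng):
    """Sample a small founding population through a transmission bottleneck.

    Allele A is weighted by 1 + s_trans (s_trans > -1), so a negative value
    means A is less likely to be transmitted than its frequency suggests.
    """
    weight_a = p * (1.0 + s_trans)
    q = weight_a / (weight_a + (1.0 - p))
    carriers = sum(rng.random() < q for _ in range(founders))
    return carriers / founders


# Follow a chain of hosts: adaptation within each host, then a bottleneck.
rng = random.Random(1)
p = 0.1
for host in range(5):
    for _ in range(200):
        p = within_host_step(p, s_within=0.02, mu=1e-4)
    print(f"host {host}: frequency before transmission = {p:.3f}")
    p = transmit(p, founders=3, s_trans=-0.5, rng=rng)
```

With a small number of founders, drift at the bottleneck can dominate either
form of selection, which is the sampling effect the abstract describes.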
Microarray analysis measures the expression levels of thousands of
genes concurrently, creating data sets that are too large for humans
to assimilate. Numerous techniques have been employed to reduce and
display these data sets, including pattern recognition techniques,
statistical approaches, and learning algorithms. However, evaluating
the performance of each technique has proven difficult without
knowledge of the true classifications within the data. This seminar
will focus on the application of various hierarchical clustering
techniques to microarray data, including the data normalization and
centering processes that are usually performed in advance of
clustering. As part of the seminar, the results of a study that
evaluated the performance of ten hierarchical clustering methods using
simulated microarray data will be presented. To evaluate these
techniques, simulated microarray data sets with known classifications
were generated by computer. Performance was
evaluated based on the percentage of data points that were clustered
correctly, as defined by the groups assigned during dataset
generation. While all of the algorithms performed well at low levels
of variance, performance varied at higher levels of variance. In
particular, the Flexible-Beta and Ward's Minimum-Variance methods
proved most robust, performing well at all levels of variance, while
the Single Linkage algorithm performed poorly at higher levels of
variance due to its tendency to incorrectly chain observations
together. Lastly, a hierarchical clustering algorithm that utilizes
replicate measurements will be presented.
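For readers unfamiliar with agglomerative clustering, the following is a
minimal sketch of the general procedure with three simple linkage rules. It
does not implement the Flexible-Beta or Ward's methods named above and is
not the code used in the study; the function name agglomerate and the toy
expression profiles are assumptions for illustration.

```python
import math


def agglomerate(points, k, linkage="single"):
    """Merge the closest pair of clusters until only k clusters remain.

    points  : list of equal-length numeric tuples (e.g. expression profiles)
    linkage : "single" (nearest pair), "complete" (farthest pair), or
              "average" (mean of all pairwise distances)
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        if linkage == "average":
            return sum(dists) / len(dists)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Two well-separated groups of toy two-condition expression profiles.
profiles = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
print(agglomerate(profiles, k=2, linkage="single"))
```

Single linkage merges on the nearest pair of points, which is why a string
of noisy intermediate observations can chain otherwise distinct groups
together as variance grows.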
Spencer Muse, North Carolina State University
Heterogeneity in the Process of Nucleotide Substitution
Sergei Kosakovsky Pond, University of Arizona
Modeling Synonymous and Non-Synonymous Substitution Rate Patterns
Bruce Walsh, University of Arizona
What? I'm related to YOU?? -- Estimating the time to the
most recent common ancestor for a pair of individuals
David Brian Walton, University of Arizona
Hidden Markov Model Filtering of Single Molecule Kinesin Assay Data
Jay Taylor, University of Arizona
Tradeoffs between Transmission and Within-Host Adaptation
Kevin Greer, University of Arizona
Hierarchical Clustering of cDNA Microarray Data