Hao Helen Zhang - Research

Statistical Machine Learning

Image source: Robust brain MRI image classification with SIBOW-SVM

Statistical machine learning combines mathematical learning theory and principles, statistical models, and computing algorithms to learn from data and make decisions. We are particularly interested in supervised learning from complex-structured data with a focus on classification, including unbalanced data, tensor data, image classification, networks, and dynamic problems.

Selected publications in the area

Zhang, H. H., Liu, Y., Wu, Y. and Zhu, J. (2008) Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics, 2, 149-167.
Wu, Y. and Zhang, H. H. (2006) The analysis of rank data using the exponential scoring rule. Statistica Sinica, 16, 1021-1032.
Zhang, H. H. and Singer, B. (2010) Recursive Partitioning and Applications, Second Edition. Springer, New York.

Nonparametric Smoothing

Image source: The adaptive COSSO for nonparametric surface estimation and model selection

Nonparametric smoothing techniques are powerful modeling tools used to estimate complex, nonlinear relationships between variables without making strong assumptions about the underlying functional form. These data-driven methods allow the shape of the relationship to be determined by the data itself, offering more flexibility than parametric approaches. Our research focuses on smoothing splines for high-dimensional data analysis, aiming to improve their interpretability and parsimony while retaining flexibility.

Selected publications in the area

Wahba, G., Lin, Y. and Zhang, H. H. (2000) Generalized approximate cross validation for support vector machines. Advances in Large Margin Classifiers, MIT Press, 297-310.
Lin, Y. and Zhang, H. H. (2006) Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34, 2272-2297.
Zhang, H. H., Cheng, G. and Liu, Y. (2011) Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106, 1099-1112.
Zhu, H., Yao, F., and Zhang, H. H. (2014) Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society, Series B, 76, 581-603.
Shin, S. J., Wu, Y., Zhang, H. H., and Liu, Y. (2017) Principal weighted support vector machines for sufficient dimension reduction in binary classification. Biometrika, 104(1), 67-81.

High Dimensional Data

Image source: Variable selection for the multicategory SVM via adaptive sup-norm regularization

High-dimensional data involves a vast number of variables or features, often surpassing the number of observations. This presents theoretical, methodological, and practical challenges in uncovering hidden patterns, building interpretable models, and making valid inferences. To address these challenges, we have developed methods and theories for variable/feature selection, dimension reduction, sparse modeling, and statistical inference.

Selected publications in the area

Zhang, H. H. and Lu, W. (2007) Adaptive-LASSO for Cox's proportional hazards model. Biometrika, 93, 1-13.
Zou, H. and Zhang, H. H. (2009) On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733-1751.
Lu, W., Zhang, H. H. and Zeng, D. (2013) Variable selection for optimal treatment decision. Statistical Methods in Medical Research, 22, 492-503.
Hao, N. and Zhang, H. H. (2014) Interaction screening for ultra-high dimensional data. Journal of the American Statistical Association, 109, 1285-1301.
Kong, D., Xue, K., Yao, F., and Zhang, H. H. (2016) Partially functional linear regression in high dimensions. Biometrika, 103, 147-159.
Hao, N., Feng, Y. and Zhang, H. H. (2018) Model Selection for High Dimensional Quadratic Regression via Regularization. Journal of the American Statistical Association, 113(522), 615-625.

Data Science Applications

Image source: The karyometric signature is altered in fallopian tubes with serous tubal intraepithelial carcinoma

Our research is motivated by real-world challenges through interdisciplinary scientific collaborations across genomics, biological sciences, cancer research, medicine, and engineering. We have pioneered computational methods and analytical tools to extract actionable insights from massive, noisy data, bridging the gap between the theory and practice of mathematical statistics and data science.

Selected publications in the area

Ma, C., Zhang, H. H., and Wang, X. (2014) Machine learning for big data analytics in plants. Trends in Plant Science, 19, 798-808.
Wang, X., Fujimaki, K., Mitchell, G., Kwon, J., Croce, K., Langsdorf, C., Zhang, H. H., and Yao, G. (2017) Exit from quiescence displays a memory of cell growth and division. Nature Communications, 8(1), 321.
Sharma, Y., Zhang, H. H., and Xin, H. (2020) Machine Learning Techniques for Optimizing Design of Double T-Shaped Monopole Antenna. IEEE Transactions on Antennas and Propagation, 68, 5658-5663.
Ebrahimi, M., Chai, Y. and Zhang, H. H. (2022) Heterogeneous Domain Adaptation with Adversarial Neural Representation Learning: Experiments on E-Commerce and Cybersecurity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 1862-1875.
Ding, H., Wang, Z., Liu, Z., Fang, Y., Zhang, H. H., Hao, N. and Que, J. (2024+) Training Data Diversity Enhances the Basecalling of Novel RNA Modification-Induced Nanopore Sequencing Readouts. Nature Communications. Accepted.