|
Theory-informed Machine Learning
Data-efficient machine learning that respects theory and does not hallucinate.
Areas:
- Scientific Large Language Models
- Benchmarking
- Representing theoretical prior
- Multi-process learning
- Rapid adaptation
- Promptable structure generation
- Knowledge graphs
| Content: |
Overview:
Current leading data-driven learning systems such as ChatGPT, Gemini, and Sora
often hallucinate non-existent, sometimes dangerous artifacts and
generate scientifically implausible outcomes when applied to realistic
science and engineering settings. Lacking built-in mechanisms to
understand real-world phenomena, they fail to generalize beyond the
ranges covered by their training data. In contrast, theory-driven models
respect the underlying laws and extrapolate well, but they can be
simplistic, incomplete, or prohibitively expensive to solve, and thus
unable to accurately represent the dynamics and complexity of the real
world. For example, a detailed theory-driven model of weather would
fail to compute forecasts in real time because of the sheer complexity of
modeling the atmosphere, ocean, land, and the interactions among them.
Theory-informed Machine Learning (TiML) integrates the strengths of
both approaches, showing great promise for providing trustworthy
insights to help solve pressing global and local problems like
infectious diseases, energy security, and climate change.

We have pioneered the development of TiML to solve diverse real-world
problems. For example, our ontology-induced model of medical risk
discovered risk factors that were more certain and consistent with
established medicine than those found by purely data-driven methods.
Our epidemiology-guided neural network model of COVID-19 dynamics had
both its design and parameters informed by theory and past outbreaks.
This approach greatly outperformed standard data-driven machine
learning and mechanistic models, helping Ho Chi Minh City, home to over
10 million people, make critical decisions to mitigate the devastating
late-2021 COVID-19 crisis that took over 20,000 lives. Likewise, our
physics-informed graph neural network integrates the external potential
term found in density functional theory calculations to predict the
potential energy surface in materials science applications. Our model
is more accurate in predicting the total energy per atom of a defective
system, as well as the structural changes that result from the presence
of a defect in a material. In another TiML project, we developed a crystal
generative model that exploits the symmetry of crystal groups and
incorporates an expert-guided reward function. This yields much
faster and more stable crystal generation than competing
data-driven methods that do not respect these theoretical priors.

However,
TiML tends to be instance-specific and demands extensive domain and ML
expertise to implement in practice. Although ultimately successful, our
COVID-19 modeling effort showed that TiML models are very hard
to train, have an extremely complex loss landscape, suffer from
instability, and require major effort in data conditioning and
hyperparameter tuning. To realize TiML's potential to solve problems
across diverse real-world instances and domains, key
obstacles around model expressiveness and adaptability must be overcome.
Aims:
Our goal is to develop more generally applicable and trustworthy TiML
models. This project addresses critical gaps in current TiML, which
struggles to adapt, capture complexity, and generalize across diverse
problem instances, spatiotemporal domains, and datasets. We will create
new TiML algorithms and neural architectures that are:
- Well-validated: comprehensively benchmarked across diverse domains including, but not limited to, infectious diseases, road traffic, reaction-diffusion, materials science, and energy storage;
- Expressive: able to represent a wide range of problems and theoretical priors, including those hidden in scientific literature;
- Adaptive: enabling rapid adaptation to new problems with minimal expertise, data, compute, and effort; and
- Realistic: effectively capturing interacting, multi-scale, multi-physics processes.
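As a rough illustration of what "theory-informed" can mean in practice, the sketch below combines a data-fit term with a penalty on violations of a textbook SIR epidemic model. It is not the group's COVID-19 model: the SIR equations, the forward-Euler discretization, and all names (`simulate_sir`, `sir_residuals`, `theory_informed_loss`, `weight`) are assumptions chosen only to show the general idea of adding a theory residual to a training loss.

```python
def simulate_sir(s0, i0, r0, beta, gamma, dt, steps):
    """Forward-Euler simulation of a basic SIR model (illustrative only)."""
    s, i, r = [s0], [i0], [r0]
    for _ in range(steps):
        n = s[-1] + i[-1] + r[-1]
        infections = beta * s[-1] * i[-1] / n
        recoveries = gamma * i[-1]
        s.append(s[-1] - dt * infections)
        i.append(i[-1] + dt * (infections - recoveries))
        r.append(r[-1] + dt * recoveries)
    return s, i, r


def sir_residuals(s, i, r, beta, gamma, dt):
    """Finite-difference residuals of the SIR equations along a trajectory.

    Each residual is zero when the trajectory obeys the (discretized) theory.
    """
    residuals = []
    for t in range(len(s) - 1):
        n = s[t] + i[t] + r[t]
        infections = beta * s[t] * i[t] / n
        recoveries = gamma * i[t]
        residuals.append((s[t + 1] - s[t]) / dt + infections)
        residuals.append((i[t + 1] - i[t]) / dt - infections + recoveries)
        residuals.append((r[t + 1] - r[t]) / dt - recoveries)
    return residuals


def theory_informed_loss(s, i, r, observed_i, beta, gamma, dt, weight=1.0):
    """Mean squared data error on infections plus a weighted theory penalty."""
    data_term = sum((a - b) ** 2 for a, b in zip(i, observed_i)) / len(observed_i)
    res = sir_residuals(s, i, r, beta, gamma, dt)
    theory_term = sum(x * x for x in res) / len(res)
    return data_term + weight * theory_term


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for the demonstration.
    s, i, r = simulate_sir(990.0, 10.0, 0.0, beta=0.3, gamma=0.1, dt=0.1, steps=200)
    print(theory_informed_loss(s, i, r, observed_i=i, beta=0.3, gamma=0.1, dt=0.1))
```

In a real TiML model the trajectory would come from a neural network rather than a simulator, and `weight` would trade off fidelity to data against fidelity to theory; balancing the two terms is one source of the training difficulties described above.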
Talks/Tutorials
- Generative AI to accelerate discovery of materials, Keynote @PRICM11, Nov 2023.
- AI for automated materials discovery via learning to represent, predict, generate and explain, @Thuyloi University, May 2023.
- Machine learning and reasoning for drug discovery, Tutorial @ECML-PKDD, Sept 2021.
- Climate change: Challenges and AI-driven solutions, @Swinburne Vietnam, Hanoi, Vietnam, Dec 2019.
- Modern AI for drug discovery, VietAI Summit, Nov 2019.
- Lecture on Deep learning for biomedicine, Southeast Asia Machine Learning (SEA ML) School, Depok, Greater Jakarta, Indonesia, July 2019.
- Deep learning for genomics: Present and future, Genomic Medicine 2019, Hanoi, Vietnam, June 2019.
- AI for matters, Phenikaa University, Hanoi, Vietnam, Jan 2019.
- Deep learning for biomedicine: Genomics and Drug design, Institute of Big Data, Hanoi, Vietnam, Jan 2019.
Benchmarking Policy reports
Publications
- Enabling discovery of materials through enhanced generalisability of deep learning models, Tawfik, Sherif Abdulkader, Tri Minh Nguyen, Salvy P. Russo, Truyen Tran, Sunil Gupta, and Svetha Venkatesh. arXiv preprint arXiv:2402.10931.
- Towards understanding structure–property relations in materials with interpretable deep learning, Tien-Sinh Vu, Minh-Quyet Ha, Duong Nguyen Nguyen, Viet-Cuong Nguyen, Yukihiro Abe, Truyen Tran, Huan Tran, Hiori Kino, Takashi Miyake, Koji Tsuda, Hieu-Chi Dam, npj Computational Materials, 9(215), (2023).
- Hierarchical GFlowNet for crystal structure generation, Nguyen, Tri, Sherif Tawfik, Truyen Tran, Sunil Gupta, Santu Rana, and Svetha Venkatesh. In AI for Accelerated Materials Design-NeurIPS 2023 Workshop. 2023.
- Machine learning-aided exploration of ultrahard materials, Tawfik, Sherif Abdulkader, Phuoc Nguyen, Truyen Tran, Tiffany R. Walsh, and Svetha Venkatesh. The Journal of Physical Chemistry C 126, no. 37 (2022): 15952-15961.
- Learning to discover medicines, Nguyen, Minh-Tri, Thin Nguyen, and Truyen Tran. International Journal of Data Science and Analytics (2022): 1-16.
- Mitigating cold-start problems in drug-target affinity prediction with interaction knowledge transferring, Nguyen, Tri Minh, Thin Nguyen, and Truyen Tran. Briefings in Bioinformatics 23, no. 4 (2022): bbac269.
- Explaining black box drug target prediction through model agnostic counterfactual samples, Nguyen, Tri Minh, Thomas P. Quinn, Thin Nguyen, and Truyen Tran. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2022).
- GEFA: Early fusion approach in drug-target affinity prediction, Tri Minh Nguyen, Thin Nguyen, Thao Minh Le, Truyen Tran, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021.
- Personalized Annotation-based Networks (PAN) for the prediction of breast cancer relapse, T Nguyen, SC Lee, TP Quinn, B Truong, X Li, T Tran, S Venkatesh, TD Le, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021.
- Deep in the bowel: Highly interpretable neural encoder-decoder networks predict gut metabolites from gut microbiome, V Le, TP Quinn, T Tran, S Venkatesh, BMC Genomics (21), 07/2020.
- DeepTRIAGE: Interpretable and individualised biomarker scores using attention mechanism for the classification of breast cancer sub-types, A Beykikhoshk, TP Quinn, SC Lee, T Tran, S Venkatesh, BMC Medical Genomics, 2020.
- Incomplete conditional density estimation for fast materials discovery, Phuoc Nguyen, Truyen Tran, Sunil Gupta, Svetha Venkatesh. SDM'19.
- Committee machine that votes for similarity between materials, Duong-Nguyen Nguyen, Tien-Lam Pham, Viet-Cuong Nguyen, Tuan-Dung Ho, Truyen Tran, Keisuke Takahashi and Hieu-Chi Dam. IUCrJ, 2018 Nov 1; 5(Pt 6): 830–840.
- Graph transformation policy network for chemical reaction prediction, Kien Do, Truyen Tran, Svetha Venkatesh, KDD'19.
- Attentional multilabel learning over graphs: A message passing approach, K Do, T Tran, T Nguyen, S Venkatesh, Machine Learning, 2019.
- Knowledge Graph Embedding with Multiple Relation Projections, Kien Do, Truyen Tran, Svetha Venkatesh, ICPR'18.
- Graph memory networks for molecular activity prediction, Trang Pham, Truyen Tran, Svetha Venkatesh, ICPR'18.
- Column Networks for Collective Classification, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, AAAI'17.
- Graph classification via deep learning with virtual nodes, Trang Pham, Truyen Tran, Hoa Dam, Svetha Venkatesh, Third Representation Learning for Graphs Workshop (ReLiG 2017).
- Stabilizing Linear Prediction Models using Autoencoder, Shivapratap Gopakumar, Truyen Tran, Dinh Phung, Svetha Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
- Neural Choice by Elimination via Highway Networks, Truyen Tran, Dinh Phung and Svetha Venkatesh, PAKDD workshop on Biologically Inspired Techniques for Data Mining (BDM'16), April 19-22 2016, Auckland, NZ.
- Graph-induced restricted Boltzmann machines for document modeling, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, and Svetha Venkatesh, Information Sciences, 2016.
- Stabilizing Sparse Cox Model using Statistic and Semantic Structures in Electronic Medical Records, Shivapratap Gopakumar, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, and Svetha Venkatesh, PAKDD'15, HCM City, Vietnam, May 2015.
- Tensor-variate Restricted Boltzmann Machines, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, and Svetha Venkatesh, AAAI 2015.
- Stabilizing high-dimensional prediction models using feature graphs, Shivapratap Gopakumar, Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh, IEEE Journal of Biomedical and Health Informatics, 2014, DOI: 10.1109/JBHI.2014.2353031.
- Stabilized sparse ordinal regression for medical risk stratification, Truyen Tran, Dinh Phung, Wei Luo, and Svetha Venkatesh, Knowledge and Information Systems, 2014, DOI: 10.1007/s10115-014-0740-4.
- Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, Dinh Phung, and Svetha Venkatesh, in Proc. of 30th International Conference on Machine Learning (ICML’13), Atlanta, USA, June 2013.
- A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 25-th Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
Key references
- Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., & Yang, L. (2021). Physics-informed machine learning. Nature Reviews Physics, 3(6), 422-440.
- Nguyen, T. M., Tawfik, S. A., Tran, T., Gupta, S., Rana, S., & Venkatesh, S. (2023). Hierarchical GFlowNet for Crystal Structure Generation. In AI for Accelerated Materials Design-NeurIPS 2023 Workshop (pp. 1-15).
- Hao, Z., Liu, S., Zhang, Y., Ying, C., Feng, Y., Su, H., & Zhu, J. (2022). Physics-informed machine learning: A survey on problems, methods and applications. Preprint.
- Kovachki, N. B., Lanthaler, S., & Stuart, A. M. (2024). Operator Learning: Algorithms and Analysis. Preprint.
- Lu, L., Meng, X., Mao, Z., & Karniadakis, G. E. (2021). DeepXDE: A deep learning library for solving differential equations. SIAM Review, 63(1), 208-228.
- Takeishi, N., & Kalousis, A. (2021). Physics-integrated variational autoencoders for robust and interpretable generative modeling. Advances in Neural Information Processing Systems, 34, 14809-14821.
- Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686-707.
- Tran, T., Phung, D., & Venkatesh, S. (2013, May). Thurstonian Boltzmann machines: learning from multiple inequalities. In International Conference on Machine Learning (pp. 46-54). PMLR.
|