Transparent ML

Investigators 
Truyen Tran (Australia) 
Hieu-Chi Dam (Japan) 
Alejandro Franco (France) 

Members  
Kien Do 
Phuoc Nguyen 
Dung Nguyen  
Thang Nguyen  
Kha Pham  

Alumni
Adam Beykikhoshk 
Shivapratap Gopakumar 
Vuong Le  
Tu Nguyen  
Trang Pham  

Theory-informed Machine Learning

Data-efficient machine learning that respects theory and does not hallucinate.

Areas:

  • Scientific Large Language Models
  • Benchmarking
  • Representing theoretical priors
  • Multi-process learning
  • Rapid adaptation
  • Promptable structure generation
  • Knowledge graphs

Content:

Overview:

Current leading data-driven learning systems such as ChatGPT, Gemini, and Sora often hallucinate non-existent, sometimes dangerous artifacts and generate scientifically implausible outcomes when applied to realistic science and engineering settings. Lacking built-in mechanisms to understand real-world phenomena, they fail to generalize beyond their training ranges. In contrast, theory-driven models respect the underlying laws and extrapolate well, but they may be simplistic, incomplete, and prohibitively expensive to solve, and thus unable to accurately represent the dynamics and complexity of the real world. For example, a detailed theory-driven model of weather would fail to compute forecasts in real time due to the sheer complexity of modeling the atmosphere, ocean, land, and the interactions among them. Theory-informed Machine Learning (TiML) integrates the strengths of both approaches, showing great promise for providing trustworthy insights to help solve pressing global and local problems such as infectious diseases, energy security, and climate change.
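To make the idea concrete, the hybrid objective at the heart of TiML-style methods can be sketched in a few lines: a loss that combines a data-fit term with a penalty for violating a known governing equation, so that sparse, noisy data and theory jointly constrain the model. The decay equation, parameter names, and grid-search fit below are purely illustrative assumptions, not code from this project.

```python
import numpy as np

# Toy theory-informed fit: exponential decay du/dt = -k*u with k known
# from theory. The hybrid loss combines a data term (fit sparse noisy
# observations) with a physics term (penalize violations of the ODE at
# unlabeled collocation points). All names here are illustrative.

K_TRUE = 0.5  # decay rate assumed known from theory

def model(t, a, b):
    """Parametric surrogate u(t) = a * exp(-b * t)."""
    return a * np.exp(-b * t)

def hybrid_loss(params, t_data, u_data, t_phys, lam=1.0):
    a, b = params
    # Data term: mismatch with (possibly noisy, sparse) observations.
    data_term = np.mean((model(t_data, a, b) - u_data) ** 2)
    # Physics term: residual of du/dt + k*u = 0 on the collocation grid.
    du_dt = -a * b * np.exp(-b * t_phys)   # analytic derivative of the surrogate
    residual = du_dt + K_TRUE * model(t_phys, a, b)
    phys_term = np.mean(residual ** 2)
    return data_term + lam * phys_term

# Three noisy observations of the true process u(t) = exp(-0.5 t).
rng = np.random.default_rng(0)
t_data = np.array([0.0, 1.0, 2.0])
u_data = np.exp(-K_TRUE * t_data) + 0.01 * rng.standard_normal(3)
t_phys = np.linspace(0.0, 4.0, 50)  # collocation grid, no labels needed

# Crude grid search over (a, b); a real implementation would use gradients.
grid = np.linspace(0.1, 1.5, 60)
best = min(((a, b) for a in grid for b in grid),
           key=lambda p: hybrid_loss(p, t_data, u_data, t_phys))
print(best)  # (a, b) should land near the true (1.0, 0.5)
```

The physics term lets the fit use unlabeled collocation points far outside the observed times, which is exactly the extrapolation benefit that purely data-driven fits lack.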

We have pioneered the development of TiML to solve diverse real-world problems. For example, our ontology-induced model of medical risk discovered risk factors that were more certain and more consistent with established medicine than those found by purely data-driven methods. Our epidemiology-guided neural network model of COVID-19 dynamics had both its design and its parameters informed by theory and past outbreaks. This approach greatly outperformed standard data-driven machine learning and mechanistic models, helping Ho Chi Minh City, home to over 10 million people, make critical decisions to mitigate a devastating late-2021 COVID crisis that took over 20,000 lives. Likewise, our physics-informed graph neural network integrates the external potential term from density functional theory calculations to predict the potential energy surface in materials science applications. The model is more accurate in predicting the total energy per atom of a defective system, as well as the structural changes caused by the presence of a defect in a material. In another line of TiML work, we developed a crystal generative model that exploits the symmetry of crystal groups and incorporates an expert-guided reward function, yielding much faster and more stable crystal generation than competing data-driven methods that do not respect these theoretical priors.
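The epidemiology-guided approach above can be illustrated in miniature: a mechanistic SIR core supplies the dynamical structure (and guarantees population conservation by construction), while a data-driven correction, here stood in for by a simple time-varying factor, adjusts the transmission rate. All function names and parameter values below are hypothetical sketches, not the actual COVID-19 model.

```python
import numpy as np

def sir_step(s, i, r, beta, gamma, dt=1.0):
    """One forward-Euler step of the SIR equations:
    dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I.
    S, I, R are population fractions, so S + I + R stays 1 by construction."""
    new_inf = beta * s * i * dt
    new_rec = gamma * i * dt
    return s - new_inf, i + new_inf - new_rec, r + new_rec

def simulate(beta0, gamma, correction, n_steps):
    """Roll out SIR dynamics; correction[t] rescales beta at step t,
    standing in for a learned, data-driven adjustment (e.g. the effect
    of interventions that the bare mechanistic model cannot capture)."""
    s, i, r = 0.99, 0.01, 0.0
    traj = [(s, i, r)]
    for t in range(n_steps):
        beta_t = beta0 * correction[t]
        s, i, r = sir_step(s, i, r, beta_t, gamma)
        traj.append((s, i, r))
    return np.array(traj)

# A lockdown-like correction: transmission halves after step 20.
correction = np.where(np.arange(60) < 20, 1.0, 0.5)
traj = simulate(beta0=0.4, gamma=0.1, correction=correction, n_steps=60)

# The theoretical prior guarantees population conservation at every step.
assert np.allclose(traj.sum(axis=1), 1.0)
```

In a full TiML model the correction would be a trainable network fit to case data, but the conservation and compartmental structure would still come from the epidemiological theory, which is what keeps the learned model's outputs plausible.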

However, TiML tends to be instance-specific and demands extensive domain and ML expertise to implement in practice. Although our COVID-19 modeling ultimately succeeded, the experience showed that TiML models are very hard to train: they have an extremely complex loss landscape, suffer from instability, and require major effort in data conditioning and hyperparameter tuning. To realize TiML's potential to solve problems across diverse real-world instances and domains, key obstacles around model expressiveness and adaptability must be overcome.

Aims: Our goal is to develop more generally applicable and trustworthy TiML models. This project addresses critical gaps in current TiML, which struggles to adapt, capture complexity, and generalize across diverse problem instances, spatiotemporal domains, and datasets. We will create new TiML algorithms and neural architectures that are:

  • Well-validated: comprehensive benchmarking across diverse domains including, but not limited to, infectious diseases, road traffic, reaction-diffusion, materials science, and energy storage;
  • Expressive: able to represent a wide range of problems and theoretical priors, including those hidden in scientific literature;
  • Adaptive: enabling rapid adaptation to new problems with minimal expertise, data, compute, and effort; and
  • Realistic: effectively capturing interacting, multi-scale, multi-physics processes.

Talks/Tutorials

Benchmarking

Policy reports

Publications

Key references

  • Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., & Yang, L. (2021). Physics-informed machine learning. Nature Reviews Physics, 3(6), 422-440.
  • Nguyen, T. M., Tawfik, S. A., Tran, T., Gupta, S., Rana, S., & Venkatesh, S. (2023). Hierarchical GFlowNet for Crystal Structure Generation. In AI for Accelerated Materials Design - NeurIPS 2023 Workshop (pp. 1-15).
  • Hao, Z., Liu, S., Zhang, Y., Ying, C., Feng, Y., Su, H., & Zhu, J. (2022). Physics-informed machine learning: A survey on problems, methods and applications. Preprint.
  • Kovachki, N. B., Lanthaler, S., & Stuart, A. M. (2024). Operator Learning: Algorithms and Analysis. Preprint.
  • Lu, L., Meng, X., Mao, Z., & Karniadakis, G. E. (2021). DeepXDE: A deep learning library for solving differential equations. SIAM Review, 63(1), 208-228.
  • Takeishi, N., & Kalousis, A. (2021). Physics-integrated variational autoencoders for robust and interpretable generative modeling. Advances in Neural Information Processing Systems, 34, 14809-14821.
  • Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686-707.
  • Tran, T., Phung, D., & Venkatesh, S. (2013). Thurstonian Boltzmann machines: Learning from multiple inequalities. In International Conference on Machine Learning (pp. 46-54). PMLR.