Abstract:
Solubility curves of active pharmaceutical ingredients (APIs) in organic solvents and solvent mixtures are essential for pharmaceutical process engineering: they inform synthesis in flow reactors, scale-up, crystallization, and purification. Current models often depend on available experimental data or on empirical correlations, and they predict poorly for new molecules. Machine learning (ML) has emerged as a powerful tool for molecular property prediction and offers promising opportunities for solubility curve prediction. However, solubility data remain scarce, scattered, and biased toward common APIs and solvents, which limits model accuracy and generalizability. Addressing these challenges demands creative strategies that make effective use of these limited datasets.



