Classify the bitter or sweet taste of compounds

Datasets
Kaggle
Classification
A dataset for the classification of the bitter or sweet taste of compounds.
Author

Rohit Farmer

Published

October 15, 2022

Original Post

This post is an identical copy of “About Dataset” at Kaggle: https://www.kaggle.com/dsv/4234193

Context

Throughout human evolution, we have been drawn toward sweet-tasting foods and averted from bitter tastes - sweet is good or desirable, bitter is undesirable, ear wax or medicinal. Therefore, a better understanding of molecular features that determine the bitter-sweet taste of substances is crucial for identifying natural and synthetic compounds for various purposes.

Sources

This dataset is adapted from https://github.com/cosylabiiit/bittersweet, https://www.nature.com/articles/s41598-019-43664-y. In chemoinformatics, molecules are often represented as compact SMILES strings. In this dataset, SMILES structures, along with their names and targets (bitter, sweet, tasteless, and non-bitter), were obtained from the original study. Subsequently, SMILES were converted into canonical SMILES using RDKit, and the features (molecular descriptors, both 2D and 3D) were calculated using Mordred. Secondly, tasteless and non-bitter categories were merged into a single category of non-bitter-sweet. Finally, since many of the compounds were missing names, IUPAC names were fetched using PubChemPy for all the compounds, and for still missing names, a generic compound + incrementor name was assigned.

Inspiration

This is a classification dataset with the first three columns carrying names, SMILES, and canonical SMILES. Any of these columns can be used to refer to a molecule. The fourth column is the target (taste category). And all numeric features are from the 5th column until the end of the file. Many features have cells with string annotations due to errors produced by Mordred. Therefore, the following data science techniques can be learned while working on this dataset:

  1. Data cleanup
  2. Features selection (since the number of features is quite large in proportion to the data points)
  3. Feature scaling/transformation/normalization
  4. Dimensionality reduction
  5. Binomial classification (bitter vs. sweet) - utilize non-bitter-sweet as a negative class.
  6. Multinomial classification (bitter vs. sweet vs. non-bitter-sweet)
  7. Since SMILES can be converted into molecular graphs, graph-based modeling should also be possible.

Initial data preparation

A copy of the original dataset and the scripts and notebooks used to convert SMILES to canonical SMILES, generate features, fetch names, and export the final TSV file for Kaggle is loosely maintained at https://github.com/rohitfarmer/bittersweet.

Back to top

Citation

BibTeX citation:
@dataset{farmer2022,
  author = {Farmer, Rohit},
  publisher = {Kaggle},
  title = {Classify the Bitter or Sweet Taste of Compounds},
  date = {2022-10-15},
  url = {https://www.kaggle.com/dsv/4234193},
  doi = {10.34740/KAGGLE/DSV/4234193},
  langid = {en}
}
For attribution, please cite this work as:
Farmer, Rohit. 2022. “Classify the Bitter or Sweet Taste of Compounds.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4234193.