Transformation Product Prediction (TPP)
This tutorial explains an algorithm for Transformation Product Prediction. It is based on the TP_predict repository and the paper by Trostel, L. et al. Their code showcases how enviPath can be used to generate a suspect list of transformation products for an environmental pathway.
The question this tutorial addresses is: given a pathway such as the one shown below, which node should be explored first? One way to approach it is to ask: what is the conditional probability of reaching each of the child nodes D, E and F?

We can compute the probability of reaching a given node as the probability of the reaction that produces it multiplied by the probability of reaching its parent node, which is computed in the same way. The root node (the input compound) is assigned a probability of 1. We will call the result of this multiplication node_probability. The probability of reaching D is therefore the probability of the reaction that yields D multiplied by the node_probability of D's parent.
As will be seen in this tutorial, when a node that is already on the queue of compounds to explore is found again through another path, it is assigned the maximum of the computed probabilities. In this way, one can compute the node probabilities (as seen in the figure below) and determine that the next node to be explored is D.
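The arithmetic can be checked on a small made-up pathway. The sketch below assumes a root A with two routes to D (A -> B -> D and A -> C -> D); the reaction probabilities are invented for illustration and do not come from an enviPath model.

```python
from math import prod


def path_probability(reaction_probabilities):
    """Multiply the reaction probabilities along one route from the root.

    The root itself has probability 1, which is also the empty product.
    """
    return prod(reaction_probabilities)


# Hypothetical reaction probabilities, for illustration only.
route_via_b = path_probability([0.8, 0.5])  # A -> B (0.8), then B -> D (0.5)
route_via_c = path_probability([0.5, 0.9])  # A -> C (0.5), then C -> D (0.9)

# When D is reachable through several routes, keep the most probable one.
node_probability_d = max(route_via_b, route_via_c)
print(round(route_via_b, 2), round(route_via_c, 2), round(node_probability_d, 2))
```

With these numbers the route through C wins (0.45 against 0.4), so D would be assigned a node_probability of 0.45.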

Each node that has not yet been explored remains on a priority queue, where compounds with a higher node_probability rank higher. In each iteration of the algorithm, the node with the highest node_probability (i.e. the one at the top of the queue) is explored.
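The tutorial code keeps the queue as a dictionary that is re-sorted after every update. As an alternative illustration of the same idea, the following sketch uses Python's standard heapq module instead; the node names and probabilities are hypothetical.

```python
import heapq

# Hypothetical unexplored nodes with their node_probability.
candidates = {"D": 0.45, "E": 0.30, "F": 0.10}

# heapq implements a min-heap, so store the negated probability
# to make the most probable node come out first.
heap = [(-probability, name) for name, probability in candidates.items()]
heapq.heapify(heap)

exploration_order = []
while heap:
    negated_probability, name = heapq.heappop(heap)
    exploration_order.append((name, -negated_probability))

print(exploration_order)
```

A heap avoids re-sorting the whole queue, but it offers no direct way to raise the probability of an entry that is already queued, which the dictionary approach handles with a simple assignment.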
Once the theory is clear, let’s dig into the code!
from enviPath_python.enviPath import *
from enviPath_python.objects import *
import pandas as pd
import getpass
First, we define a set of global variables to be used throughout the script. Note that any other model available on enviPath can be used by providing a valid EP_MODEL_ID. If the model can only be accessed by specific users, uncomment the lines after eP = enviPath(INSTANCE_HOST) to log in. The compound_input parameter contains the SMILES and name of the compound whose transformation products should be predicted. The aforementioned repository generalizes this to a list of compounds, which allows batch processing. MAX_TP is the maximum number of transformation products to search for. PROBABILITY_THRESHOLD is the lower cut-off: a transformation product whose probability is equal to or lower than this value is considered unlikely and is excluded.
INSTANCE_HOST = 'https://envipath.org'
EP_MODEL_ID = 'https://envipath.org/package/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1/relative-reasoning/23e1b2ec-dcc0-4389-9b65-afd52bd72e27'
# data of parent compound
compound_input = {"smiles": "CCN1CCN(CC1)CC2=CN=C(C=C2)NC3=NC=C(C(=N3)C4=CC5=C(C(=C4)F)N=C(N5C(C)C)C)F", "name": "Abe"}
# Maximum number of TPs to predict
MAX_TP = 50
# Lower probability threshold
PROBABILITY_THRESHOLD = 0 # any value equal to or lower than the threshold will be excluded
eP = enviPath(INSTANCE_HOST)
# USERNAME = input("Please, provide your username: ")
# password = getpass.getpass()
# eP.login(USERNAME, password)
rr = RelativeReasoning(eP.requester, id=EP_MODEL_ID)
Next we define two functions that update queue entries; they are used later in the body of the TP prediction algorithm.
- update_queue: takes each SMILES from the list of transformation products and:
  - if it is already present on the queue or in the list of already predicted SMILES, updates the data in those lists
  - otherwise, adds the SMILES to the queue of transformation products to be predicted
  Finally, it sorts the queue in descending order of node probability.
- update_compound_entry: used in the first case above. If the node probability of obtaining the given SMILES is now higher than the one stored in the validated list of transformation products or in the current queue, it updates the stored entry with the current information.
def update_queue(queue, validated_TPs, TPs, parent_data):
    for smiles in TPs:
        data = TPs[smiles]
        # If the probability is at or below the threshold, we don't consider the TP further
        this_probability = data['probability']
        if this_probability <= PROBABILITY_THRESHOLD:
            continue
        data['node_probability'] = parent_data['node_probability'] * data['probability']
        data['generation'] = parent_data['generation'] + 1
        data['parent_smiles'] = [parent_data['smiles']]
        # first, check if compound already in validated. if yes, update
        if smiles in validated_TPs.keys():
            validated_TPs[smiles] = update_compound_entry(validated_TPs[smiles], data)
        # next, check if compound is already in queue. if yes, update
        elif smiles in queue:
            queue[smiles] = update_compound_entry(queue[smiles], data)
        # else, add new item to queue
        else:
            queue[smiles] = data
    # order dict by node probability
    queue = dict(sorted(queue.items(), key=lambda item: item[1]['node_probability'], reverse=True))
    return queue, validated_TPs
def update_compound_entry(reference_data, new_data):
    if reference_data['node_probability'] < new_data['node_probability']:
        reference_data['node_probability'] = new_data['node_probability']
        reference_data['rules'] += new_data["rules"]
        reference_data['rule_IDs'] += new_data["rule_IDs"]
        reference_data['generation'] = new_data['generation']
        reference_data['parent_smiles'] += new_data['parent_smiles']
    return reference_data
We begin the TP prediction algorithm by adding the root node to the queue. We then enter a loop that runs until either the maximum number of transformation products, MAX_TP, has been reached or no compounds are left on the queue. In each round of this loop:
1. The SMILES on top of the queue (the most probable) is retrieved.
2. The model is used to predict possible transformation products.
3. This information is ordered by probability.
4. Each predicted transformation product is used to generate a dictionary of transformation products, TP_dict, where their relevant information is stored.
5. TP_dict is passed to update_queue to update the queue, based on their probabilities.
6. Now that the most probable SMILES has been explored, its data is added to validated_TPs and the loop goes back to 1.
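The step that builds TP_dict can be tried in isolation, without contacting a model. The sketch below uses hand-written predictions with the keys the loop reads from each prediction ('name', 'id', 'probability' and 'products'); the rule names, IDs, SMILES and probabilities are all hypothetical, and best_rule_per_product is a helper name introduced here for illustration.

```python
# Hand-written predictions in the shape the loop reads from the model:
# one entry per rule. All values are hypothetical.
prediction_data = [
    {"name": "rule-a", "id": "id-a", "probability": 0.6, "products": ["CCO", "CC=O"]},
    {"name": "rule-b", "id": "id-b", "probability": 0.9, "products": ["CCO"]},
    {"name": "rule-c", "id": "id-c", "probability": 0.2, "products": ["CC=O"]},
]


def best_rule_per_product(predictions):
    """Keep, for every product SMILES, the rule with the highest probability."""
    tp_dict = {}
    for prediction in predictions:
        probability = float(prediction["probability"])
        for product_smiles in prediction["products"]:
            known = tp_dict.get(product_smiles)
            # Add a new product, or replace the entry if this rule is more probable.
            if known is None or probability > known["probability"]:
                tp_dict[product_smiles] = {
                    "rules": [prediction["name"]],
                    "rule_IDs": [prediction["id"]],
                    "probability": probability,
                    "smiles": product_smiles,
                }
    return tp_dict


TP_dict = best_rule_per_product(prediction_data)
for smiles, data in TP_dict.items():
    print(smiles, data["rules"], data["probability"])
```

Here CCO is produced by two rules, and only the more probable one (rule-b) is kept.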
num_TP = -1 # counter starts at -1, because source compound is also in the TP list
validated_TPs = {} # container for resulting predictions
queue = {compound_input['smiles']: {'probability': 1, 'node_probability': 1, 'smiles': compound_input['smiles'], 'generation': 0, 'parent_smiles': [''],
                                    'rules': [''], 'rule_IDs': [''], 'name': compound_input['name']}}
while num_TP < MAX_TP:
    if len(queue) == 0:
        print('\nEmpty queue - The exploration has converged at {} predicted TPs'.format(num_TP))
        break
    smiles = list(queue.keys())[0] # get top item in queue
    parent_data = queue.pop(smiles) # remove data from queued items
    # Perform a prediction based on the previously defined relative reasoning object
    prediction_data = rr.classify_smiles(smiles)
    # sort by probability
    prediction_data.sort(reverse=True, key=lambda x: x['probability'])
    TP_dict = {}
    for prediction in prediction_data:
        probability = float(prediction['probability'])
        for product_smiles in prediction['products']:
            if product_smiles not in TP_dict.keys():
                TP_dict[product_smiles] = {'rules' : [prediction['name']], 'rule_IDs': [prediction['id']], 'probability': probability, 'smiles': product_smiles}
            else:
                # check if there's a rule with better probability
                if probability > TP_dict[product_smiles]['probability']:
                    # update probability and rules associated to this probability
                    TP_dict[product_smiles]['probability'] = probability
                    TP_dict[product_smiles]['rules'] = [prediction['name']]
                    TP_dict[product_smiles]['rule_IDs'] = [prediction['id']]
    queue, validated_TPs = update_queue(queue, validated_TPs, TP_dict, parent_data)
    validated_TPs[smiles] = parent_data
    num_TP += 1
pd.DataFrame.from_dict(validated_TPs, orient='index').head()
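Running the loop above requires a reachable enviPath instance. To follow the queue logic offline, the sketch below swaps rr.classify_smiles for a hypothetical stub (fake_classify_smiles) backed by an invented rule table, and tracks only the node probability of each compound. It is a simplified illustration under those assumptions, not the tutorial's implementation.

```python
# Invented rule table standing in for the model; keys are placeholder compound names.
FAKE_PREDICTIONS = {
    "A": [
        {"name": "rule-1", "id": "r1", "probability": 0.8, "products": ["B"]},
        {"name": "rule-2", "id": "r2", "probability": 0.5, "products": ["C"]},
    ],
    "B": [{"name": "rule-3", "id": "r3", "probability": 0.5, "products": ["D"]}],
    "C": [{"name": "rule-4", "id": "r4", "probability": 0.9, "products": ["D"]}],
}


def fake_classify_smiles(smiles):
    """Hypothetical replacement for rr.classify_smiles; no network access."""
    return FAKE_PREDICTIONS.get(smiles, [])


def explore(root, max_tp=50, threshold=0.0):
    """Best-first exploration that returns {compound: node_probability}."""
    queue = {root: 1.0}   # unexplored compounds
    validated = {}        # explored compounds, in exploration order
    # The root also lands in `validated`, hence the `<=`.
    while queue and len(validated) <= max_tp:
        smiles = max(queue, key=queue.get)  # most probable unexplored node
        node_probability = queue.pop(smiles)
        validated[smiles] = node_probability
        for prediction in fake_classify_smiles(smiles):
            probability = float(prediction["probability"])
            if probability <= threshold:
                continue
            candidate = node_probability * probability
            for product in prediction["products"]:
                if product in validated:
                    validated[product] = max(validated[product], candidate)
                else:
                    # Keep the best probability seen so far for a queued node.
                    queue[product] = max(queue.get(product, 0.0), candidate)
    return validated


result = explore("A")
for compound, node_probability in result.items():
    print(compound, round(node_probability, 2))
```

In this toy pathway D is first queued through B and later reached again through C with a higher probability, so the queued value is raised before D is explored.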
Finally, we can use the resulting dictionary to build a network graph with NetworkX and display it with Plotly. The graph visualization was inspired by Stack Overflow questions on generating a tree-based network graph and on displaying images as nodes.
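As a starting point for the graph, each entry in validated_TPs lists its parents under parent_smiles, which is enough to derive the edges. The sketch below uses a small hand-written dictionary with the same keys as the entries built above; the SMILES and probabilities are hypothetical, and pathway_edges is a helper name introduced here. The resulting pairs could, for instance, be passed to networkx.DiGraph.add_edges_from.

```python
# Hypothetical excerpt of validated_TPs; only the keys used here are shown.
validated_TPs = {
    "CCO": {"smiles": "CCO", "parent_smiles": [""],
            "node_probability": 1, "generation": 0},
    "CC=O": {"smiles": "CC=O", "parent_smiles": ["CCO"],
             "node_probability": 0.7, "generation": 1},
    "CC(=O)O": {"smiles": "CC(=O)O", "parent_smiles": ["CC=O", "CCO"],
                "node_probability": 0.49, "generation": 2},
}


def pathway_edges(validated):
    """Return (parent, child) pairs, skipping the empty parent of the root."""
    edges = []
    for child, data in validated.items():
        for parent in data["parent_smiles"]:
            if parent:  # the root stores '' as a placeholder parent
                edges.append((parent, child))
    return edges


edges = pathway_edges(validated_TPs)
print(edges)
```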