Download Additional Information data

In this tutorial, we show how to access the additional information (metadata) stored in the scenarios of a given package, using the Atenolol pathway as an example. We first access one scenario associated with the compound Atenolol and extract all the additional information it contains. We then extend that code to collect the metadata of every scenario in the package. Finally, we analyze trends in the metadata, using the experimental location as an example.

We first import the relevant enviPath objects for this tutorial:

from enviPath_python.enviPath import enviPath
from enviPath_python.objects import *

import pandas as pd

As in other tutorials, we instantiate the host and the package we want to work with:

INSTANCE_HOST = "https://envipath.org/"
EAWAG_SLUDGE_DATA_PACKAGE = "https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a"

eP = enviPath(INSTANCE_HOST)
pkg = Package(eP.requester, id=EAWAG_SLUDGE_DATA_PACKAGE)

As discussed, we will access the metadata contained in the Atenolol pathway and display it.

First, we search for the pathway:

atenolol_pathway = eP.search("Atenolol", pkg)["pathway"][0]
print(f"Pathway name: {atenolol_pathway.get_name()}")
# We're interested in Atenolol itself, so we fetch the root node at position 1
node = atenolol_pathway.get_nodes()[1]
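
The indexing `["pathway"][0]` assumes that the search returned at least one pathway and that the first hit is the one we want. A minimal sketch of a more defensive selection follows. `pick_by_name` is a hypothetical helper, not part of enviPath-python, and `FakePathway` is a stand-in used only so the example runs without a network connection. The sketch assumes only that the search result is a dictionary of lists and that the objects expose `get_name()`, as used above.

```python
class FakePathway:
    """Stand-in for an enviPath pathway; only get_name() is mimicked."""

    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name


def pick_by_name(search_result, kind, name):
    """Return the first object of type `kind` whose name equals `name`.

    `search_result` is assumed to map object types (e.g. "pathway")
    to lists of objects, like the dictionary indexed above.
    """
    candidates = search_result.get(kind, [])
    for obj in candidates:
        if obj.get_name() == name:
            return obj
    raise LookupError(f"No {kind} named {name!r} among {len(candidates)} results")


# Made-up search result, for illustration only
fake_result = {"pathway": [FakePathway("Atrazine"), FakePathway("Atenolol")]}
selected = pick_by_name(fake_result, "pathway", "Atenolol")
print(selected.get_name())
```

With a real search result, the same call would raise a `LookupError` with a readable message instead of an `IndexError` when nothing matches.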

We then access one scenario from the list of scenarios attached to Atenolol:

scenarios = node.get_scenarios()
print(f"We have {len(scenarios)} scenarios for Atenolol")
scenario = scenarios[0]
print(f"In this first example, we will explore the metadata contained in {scenario.get_id()} , ")
print(f"with name {scenario.get_name()}")
print(f"Description: {scenario.get_description()}")
We have 19 scenarios for Atenolol
In this first example, we will explore the metadata contained in https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a/scenario/10dccbcb-4ab6-4a3a-b653-77b909bc6675 , 
with name Helbling et al., 2012 (DOM3) (Related Scenario) - (00000)
Description: no description

Lastly, we extract all the additional information objects in that scenario:

additional_information_list = scenario.get_additional_information()
for ai in additional_information_list:
    print(f"\n{ai.name}")
    for param in ai.params.keys():
        print(f"\t{param}: {ai.params[param]}")
acidity
	lowPh: 7.5
	highPh: 7.5
	acidityType: 
	unit: pH

biologicaltreatmenttechnology
	biologicaltreatmenttechnology: nitrification & denitrification
	unit: 

bioreactor
	bioreactortype: amber glass Schott bottles (loosely capped)
	bioreactorsize: 100.0
	unit: mL

finalcompoundconcentration
	finalcompoundconcentration: 100
	unit: &#956g/L

inoculumsource
	inoculumsource: activated sludge from biological aeration basin
	unit: 

location
	location: Switzerland (DOM3)
	unit: 

nitrogencontent
	nitrogencontentType: NH&#8324-N
	nitrogencontentInfluent: 24.9
	unit: mg/L

originalsludgeamount
	originalsludgeamount: 70
	unit: mL

oxygendemand
	oxygendemandType: Biological Oxygen Demand (BOD5)
	oxygendemandInfluent: 320.0
	oxygendemandEffluent: 
	unit: mg/L

phosphoruscontent
	phosphoruscontentInfluent: 9.0
	phosphoruscontentEffluent: 
	unit: mg/L

purposeofwwtp
	purposeofwwtp: municipal WW
	unit: 

rateconstant
	rateconstantorder: First order
	rateconstantcorrected: sorption corrected & abiotic degradation corrected
	rateconstantlower: 15.62
	rateconstantupper: NaN
	rateconstantcomment: r2 = 0.9934
	unit: 1 / day

redox
	redoxType: aerob
	unit: 

sludgeretentiontime
	sludgeretentiontimeType: sludge retention time
	sludgeretentiontime: 9.8
	unit: d

solventforcompoundsolution
	solventforcompoundsolution1: MeOH
	solventforcompoundsolution2: None
	solventforcompoundsolution3: None
	unit: 

sourceofliquidmatrix
	sourceofliquidmatrix: none (sludge only)
	unit: 

temperature
	temperatureMin: 20.0
	temperatureMax: 20.0
	unit: °C

tts
	ttsStart: 12.4
	ttsEnd: 12.4
	unit: g/L

typeofaddition
	typeofaddition: plating
	unit: 

typeofaeration
	typeofaeration: shaking
	unit: 

In the following lines of code, we generalize this process to extract the metadata of every scenario in a package. Some lines are commented out to reduce the number of requests and the computation time; the data is instead read from a previously saved file. Readers can download this tutorial from the upper-right corner of the page and run those lines themselves if desired. The underlying logic is as follows:

  1. Declare a data list in which we store all the information retrieved

  2. Loop over each node of each pathway in the package

    1. Extract all the scenarios

    2. For each scenario, get all the experimental data (additional information) and append it to the data list together with the SMILES, the node, scenario and pathway IDs, and the scenario description

  3. Create a pandas DataFrame and use it to generate a tab-separated .csv file with all the extracted data

# data = []

# for path in pkg.get_pathways():
#     for node in path.get_nodes():
#         scenarios = node.get_scenarios()
#         for scenario in scenarios:
#             temp_data = {"smiles": node.get_smiles(), "node_id": node.get_id(), 
#                          "scenario_id": scenario.get_id(), "scenario_description": scenario.get_description(),
#                          "pathway_id": path.get_id()}
#             temp_add_info = scenario.get_additional_information()
#             for ai in temp_add_info:
#                 add_info = {ai.name + "_" + key: value for (key,value) in ai.params.items()}
#                 temp_data.update(add_info)
#             data.append(temp_data)
            
# # save data
# raw_data = pd.DataFrame(data)
# raw_data.to_csv("../assets/additional_information_data.csv", sep='\t', index=False)
raw_data = pd.read_csv("../assets/additional_information_data.csv", sep="\t")
raw_data.head()
smiles node_id scenario_id scenario_description pathway_id acidity_lowPh acidity_highPh acidity_acidityType acidity_unit biologicaltreatmenttechnology_biologicaltreatmenttechnology ... oxygendemand_oxygendemandType oxygendemand_oxygendemandInfluent oxygendemand_oxygendemandEffluent oxygendemand_unit dissolvedorganiccarbon_dissolvedorganiccarbonStart dissolvedorganiccarbon_dissolvedorganiccarbonEnd dissolvedorganiccarbon_unit volatiletts_volatilettsStart volatiletts_volatilettsEnd volatiletts_unit
0 C1=CC(=C(C=C1)N2CCNCC2)Cl https://envipath.org/package/7932e576-03c7-410... https://envipath.org/package/7932e576-03c7-410... no description https://envipath.org/package/7932e576-03c7-410... 8.1 8.1 NaN pH nitrification & denitrification & biological p... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 C1=CC(=C(C=C1)N2CCNCC2)Cl https://envipath.org/package/7932e576-03c7-410... https://envipath.org/package/7932e576-03c7-410... no description https://envipath.org/package/7932e576-03c7-410... 6.3 6.3 NaN pH nitrification & denitrification & biological p... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 C1=CC(=C(C=C1)N2CCNCC2)Cl https://envipath.org/package/7932e576-03c7-410... https://envipath.org/package/7932e576-03c7-410... no description https://envipath.org/package/7932e576-03c7-410... 7.1 7.1 NaN pH nitrification & denitrification & biological p... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 CC12CCC3C4=CC=C(C=C4CCC3C2CCC1=O)O https://envipath.org/package/7932e576-03c7-410... https://envipath.org/package/7932e576-03c7-410... https://doi.org/10.1023/A:1014117329403 https://envipath.org/package/7932e576-03c7-410... NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 CC12CCC3C4=CC=C(C=C4CCC3C2CCC1=O)O https://envipath.org/package/7932e576-03c7-410... https://envipath.org/package/7932e576-03c7-410... https://doi.org/10.1023/A:1014117329403 https://envipath.org/package/7932e576-03c7-410... NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 93 columns
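
With 93 columns and many NaN entries, it helps to check how often each type of additional information is actually reported. The sketch below groups the columns by the text before the first underscore and computes the fraction of rows that have at least one value per type. `coverage_by_type` is a hypothetical helper, the toy DataFrame is made up for illustration, and the approach assumes that type names themselves contain no underscore, as is the case for the names printed earlier.

```python
import pandas as pd

# Identifier columns that do not belong to any additional information type
ID_COLUMNS = ["smiles", "node_id", "scenario_id", "scenario_description", "pathway_id"]

# Toy data, made up for illustration; the real table has 93 columns
toy = pd.DataFrame({
    "smiles": ["C", "CC", "CCC"],
    "node_id": ["n1", "n2", "n3"],
    "scenario_id": ["s1", "s2", "s3"],
    "scenario_description": ["no description"] * 3,
    "pathway_id": ["p1", "p1", "p2"],
    "acidity_lowPh": [8.1, 6.3, None],
    "acidity_highPh": [8.1, 6.3, None],
    "acidity_unit": ["pH", "pH", None],
    "temperature_temperatureMin": [20.0, None, None],
})


def coverage_by_type(df, id_columns=ID_COLUMNS):
    """Fraction of rows with at least one value, per additional information type."""
    metadata = df.drop(columns=id_columns)
    columns_by_type = {}
    for col in metadata.columns:
        # Column names follow "<type>_<parameter>"
        columns_by_type.setdefault(col.split("_", 1)[0], []).append(col)
    coverage = pd.Series({
        ai_type: metadata[cols].notna().any(axis=1).mean()
        for ai_type, cols in columns_by_type.items()
    })
    return coverage.sort_values(ascending=False)


coverage = coverage_by_type(toy)
print(coverage)
```

Applied to `raw_data`, the same function would show which types are reported often enough to support an analysis.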

Finally, we use the extracted data to analyze the location of each experiment in EAWAG-SLUDGE. To do this, we map different spellings of the same location to a common name, e.g. (Dübendorf, WWTP Duebendorf (ARA Neugut), Switzerland, …) -> Dübendorf, Switzerland

import plotly.express as px

def process_location(location):
    """Map the different spellings of a location to a common name."""
    if pd.notna(location):
        if "Duebendorf" in location or "Dübendorf" in location:
            return "Dübendorf, Switzerland"
        elif "IND" in location or "DOM" in location:
            return "Switzerland (IND, DOM)"
        elif "4 parallel" in location:
            # Keep only the part before the "-->" separator
            return location.split("-->")[0]
    return location

# Work on a copy so that raw_data keeps the original location strings
plot_df = raw_data.copy()
plot_df["location_location"] = plot_df["location_location"].apply(process_location)
# Count the rows per (scenario, location) pair, then sum the counts per location
plot_df = (
    plot_df[["smiles", "scenario_id", "location_location"]]
    .groupby(["scenario_id", "location_location"]).count().reset_index()[["location_location", "smiles"]]
    .groupby("location_location").sum().reset_index()
)
plot_df = plot_df.rename(columns={"location_location": "location", "smiles": "count"})
px.pie(plot_df, names="location", values="count", title="Location of experiments in EAWAG-SLUDGE",
       width=900, height=500)

We see that Dübendorf, Switzerland is the predominant location in our dataset. In the same way, one could analyze other relevant features, such as temperature, pH or half-lives.
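
As an illustration of such a follow-up analysis, the sketch below summarizes the experimental temperature per scenario. The column names `temperature_temperatureMin` and `temperature_temperatureMax` are assumptions derived from the `<type>_<parameter>` naming scheme and the parameter names printed earlier; the toy rows are made up, and `mean_temperature_per_scenario` is a hypothetical helper.

```python
import pandas as pd

# Toy rows, made up for illustration; column names are assumed (see above)
toy = pd.DataFrame({
    "scenario_id": ["s1", "s1", "s2", "s3"],
    "temperature_temperatureMin": [20.0, 20.0, 18.0, None],
    "temperature_temperatureMax": [20.0, 20.0, 22.0, None],
})


def mean_temperature_per_scenario(df):
    """Midpoint of the reported temperature range, one value per scenario."""
    # Keep one row per scenario so that scenarios attached to several
    # nodes are not counted repeatedly
    per_scenario = df.drop_duplicates(subset="scenario_id").set_index("scenario_id")
    temperature_columns = ["temperature_temperatureMin", "temperature_temperatureMax"]
    # Row-wise mean of min and max; stays NaN where no temperature was recorded
    midpoint = per_scenario[temperature_columns].mean(axis=1)
    return midpoint.dropna()


temps = mean_temperature_per_scenario(toy)
print(temps.describe())
```

Replacing `toy` with `raw_data` would give the corresponding summary for the package, provided the assumed columns are present in the exported file.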