Download Additional Information data
In this tutorial, we will show how to access the additional information data stored in scenarios within a given package. We will do this for the Atenolol Pathway as an example. We will first access one scenario associated with the compound Atenolol within the Atenolol pathway, and extract all the additional information (metadata) within the scenario. Afterwards we will show how to access the metadata for multiple scenarios within a package by extending the code in the provided example. Finally, we will explore how to analyze trends in the metadata using experimental location as an example.
We first import the relevant enviPath objects for this tutorial:
from enviPath_python.enviPath import enviPath
from enviPath_python.objects import *
import pandas as pd
As in other tutorials, we instantiate the host and the package we want to work with:
INSTANCE_HOST = "https://envipath.org/"
EAWAG_SLUDGE_DATA_PACKAGE = "https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a"
eP = enviPath(INSTANCE_HOST)
pkg = Package(eP.requester, id=EAWAG_SLUDGE_DATA_PACKAGE)
As discussed, we will access the metadata contained in the Atenolol pathway and display it.
First, we search the pathway:
atenolol_pathway = eP.search("Atenolol", pkg)["pathway"][0]
print(f"Pathway name: {atenolol_pathway.get_name()}")
# We're interested in Atenolol itself, so we fetch the root node at position 1
node = atenolol_pathway.get_nodes()[1]
Pathway name: Atenolol (ATE)
We access a scenario from the list of scenarios that are attached to Atenolol
scenarios = node.get_scenarios()
print(f"We have {len(scenarios)} scenarios for Atenolol")
scenario = scenarios[0]
print(f"In this first example, we will explore the metadata contained in {scenario.get_id()} , ")
print(f"with name {scenario.get_name()}")
print(f"Description: {scenario.get_description()}")
We have 19 scenarios for Atenolol
In this first example, we will explore the metadata contained in https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a/scenario/10dccbcb-4ab6-4a3a-b653-77b909bc6675 ,
with name Helbling et al., 2012 (DOM3) (Related Scenario) - (00000)
Description: no description
Lastly, we extract all the additional information objects in that scenario
additional_information_list = scenario.get_additional_information()
for ai in additional_information_list:
    print(f"\n{ai.name}")
    for param in ai.params.keys():
        print(f"\t{param}: {ai.params[param]}")
acidity
lowPh: 7.5
highPh: 7.5
acidityType:
unit: pH
biologicaltreatmenttechnology
biologicaltreatmenttechnology: nitrification & denitrification
unit:
bioreactor
bioreactortype: amber glass Schott bottles (loosely capped)
bioreactorsize: 100.0
unit: mL
finalcompoundconcentration
finalcompoundconcentration: 100.0
unit: μg/L
inoculumsource
inoculumsource: activated sludge from biological aeration basin
unit:
location
location: Switzerland (DOM3)
unit:
nitrogencontent
nitrogencontentType: NH₄-N
nitrogencontentInfluent: 24.9
unit: mg/L
originalsludgeamount
originalsludgeamount: 70.0
unit: mL
oxygendemand
oxygendemandType: Biological Oxygen Demand (BOD5)
oxygendemandInfluent: 320.0
unit: mg/L
phosphoruscontent
phosphoruscontentInfluent: 9.0
unit: mg/L
purposeofwwtp
purposeofwwtp: municipal WW
unit:
rateconstant
rateconstantorder: First order
rateconstantcorrected: sorption corrected & abiotic degradation corrected
rateconstantlower: 15.62
rateconstantupper: nan
rateconstantcomment: r2 = 0.9934
unit: 1 / day
redox
redoxType: aerob
unit:
sludgeretentiontime
sludgeretentiontimeType: sludge retention time
sludgeretentiontime: 9.8
unit: d
solventforcompoundsolution
solventforcompoundsolution1: MeOH
unit:
sourceofliquidmatrix
sourceofliquidmatrix: none (sludge only)
unit:
temperature
temperatureMin: 20.0
temperatureMax: 20.0
unit: °C
tts
ttsStart: 12.4
ttsEnd: 12.4
unit: g/L
typeofaddition
typeofaddition: plating
unit:
typeofaeration
typeofaeration: shaking
unit:
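The loop above only prints each value. For further processing it can be convenient to collect the same values into one flat dictionary. The following is a minimal sketch and not part of enviPath-python: `FakeAdditionalInformation` is a hypothetical stand-in that mimics only the `name` and `params` attributes used above, so the snippet runs without contacting the server, and `flatten_additional_information` is an illustrative helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAdditionalInformation:
    """Hypothetical stand-in exposing only `name` and `params`."""
    name: str
    params: dict = field(default_factory=dict)


def flatten_additional_information(ai_list):
    """Merge all parameters into one dict keyed by '<name>_<param>'."""
    flat = {}
    for ai in ai_list:
        for param, value in ai.params.items():
            flat[f"{ai.name}_{param}"] = value
    return flat


# A subset of the values printed for the Atenolol scenario above.
example = [
    FakeAdditionalInformation(
        "acidity", {"lowPh": 7.5, "highPh": 7.5, "unit": "pH"}
    ),
    FakeAdditionalInformation(
        "temperature",
        {"temperatureMin": 20.0, "temperatureMax": 20.0, "unit": "°C"},
    ),
]

flat = flatten_additional_information(example)
print(flat["acidity_lowPh"], flat["temperature_unit"])
```

With a real scenario, the list returned by `scenario.get_additional_information()` could presumably be passed in place of `example`, since the helper relies only on `name` and `params`.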
In the following lines of code, we generalize this process to extract all the metadata of a package. Some lines are commented out to reduce the number of requests and the computation time. Readers can download this tutorial and uncomment those lines to run the full extraction themselves. The underlying logic is as follows:
Declare a data list where we will store all the information retrieved
Loop over each pathway in the package and over each node in that pathway
Extract all the scenarios attached to the node
For each scenario, get all the experimental data (additional information) and store it in the data list together with its SMILES, node, scenario and pathway IDs and the scenario description
Create a pandas DataFrame and use it to generate a .csv file with all the extracted data
# data = []
# for path in pkg.get_pathways():
#     for node in path.get_nodes():
#         scenarios = node.get_scenarios()
#         for scenario in scenarios:
#             temp_data = {"smiles": node.get_smiles(), "node_id": node.get_id(),
#                          "scenario_id": scenario.get_id(), "scenario_description": scenario.get_description(),
#                          "pathway_id": path.get_id()}
#             temp_add_info = scenario.get_additional_information()
#             for ai in temp_add_info:
#                 add_info = {ai.name + "_" + key: value for (key, value) in ai.params.items()}
#                 temp_data.update(add_info)
#             data.append(temp_data)
# # save data
# raw_data = pd.DataFrame(data)
# raw_data.to_csv("../assets/additional_information_data.csv", sep='\t', index=False)
raw_data = pd.read_csv("../assets/additional_information_data.csv", sep="\t")
raw_data.head()
| | smiles | node_id | scenario_id | scenario_description | pathway_id | acidity_lowPh | acidity_highPh | acidity_acidityType | acidity_unit | biologicaltreatmenttechnology_biologicaltreatmenttechnology | ... | oxygendemand_oxygendemandType | oxygendemand_oxygendemandInfluent | oxygendemand_oxygendemandEffluent | oxygendemand_unit | dissolvedorganiccarbon_dissolvedorganiccarbonStart | dissolvedorganiccarbon_dissolvedorganiccarbonEnd | dissolvedorganiccarbon_unit | volatiletts_volatilettsStart | volatiletts_volatilettsEnd | volatiletts_unit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1=CC(=C(C=C1)N2CCNCC2)Cl | https://envipath.org/package/7932e576-03c7-410... | https://envipath.org/package/7932e576-03c7-410... | no description | https://envipath.org/package/7932e576-03c7-410... | 8.1 | 8.1 | NaN | pH | nitrification & denitrification & biological p... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | C1=CC(=C(C=C1)N2CCNCC2)Cl | https://envipath.org/package/7932e576-03c7-410... | https://envipath.org/package/7932e576-03c7-410... | no description | https://envipath.org/package/7932e576-03c7-410... | 6.3 | 6.3 | NaN | pH | nitrification & denitrification & biological p... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | C1=CC(=C(C=C1)N2CCNCC2)Cl | https://envipath.org/package/7932e576-03c7-410... | https://envipath.org/package/7932e576-03c7-410... | no description | https://envipath.org/package/7932e576-03c7-410... | 7.1 | 7.1 | NaN | pH | nitrification & denitrification & biological p... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | CC12CCC3C4=CC=C(C=C4CCC3C2CCC1=O)O | https://envipath.org/package/7932e576-03c7-410... | https://envipath.org/package/7932e576-03c7-410... | https://doi.org/10.1023/A:1014117329403 | https://envipath.org/package/7932e576-03c7-410... | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | CC12CCC3C4=CC=C(C=C4CCC3C2CCC1=O)O | https://envipath.org/package/7932e576-03c7-410... | https://envipath.org/package/7932e576-03c7-410... | https://doi.org/10.1023/A:1014117329403 | https://envipath.org/package/7932e576-03c7-410... | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 93 columns
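Instead of commenting the crawl in and out by hand, one option is to guard it behind a check for the cached file. The sketch below illustrates that pattern under stated assumptions: `load_or_build` and `fake_crawl` are hypothetical names, the rows returned by `fake_crawl` are made up, and a temporary directory replaces the `../assets/` path so the example has no side effects.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_rows):
    """Read the cached table if it exists; otherwise build and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, sep="\t")
    frame = pd.DataFrame(build_rows())
    frame.to_csv(csv_path, sep="\t", index=False)
    return frame


def fake_crawl():
    # Hypothetical placeholder for the package crawl; the row is made up.
    return [{"smiles": "CCO", "acidity_lowPh": 7.0}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "additional_information_data.csv")
    first = load_or_build(path, fake_crawl)   # builds the table, writes the file
    second = load_or_build(path, fake_crawl)  # reads the cached file instead

print(list(second.columns))
```

In the tutorial setting, `build_rows` would wrap the nested loops over pathways, nodes and scenarios, so the server is only queried when no cached file is available.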
Finally, we use the extracted data to analyze the locations of each experiment in EAWAG-SLUDGE. To do this, we map similar locations to a common name, e.g. (Dübendorf, WWTP Duebendorf (ARA Neugut), Switzerland, …) -> Dübendorf, Switzerland.
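One way to implement such a mapping is a small keyword lookup applied to the location column. This is a sketch rather than the tutorial's own mapping. The column name `location_location` is assumed from the `<name>_<param>` naming scheme used when building the table, `LOCATION_RULES` and `normalize_location` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; in the tutorial this column would come from raw_data.
demo = pd.DataFrame({"location_location": [
    "Dübendorf",
    "WWTP Duebendorf (ARA Neugut)",
    "Some other site",
    None,
]})

# Hypothetical keyword -> common name rules; extend as needed.
LOCATION_RULES = {
    "dübendorf": "Dübendorf, Switzerland",
    "duebendorf": "Dübendorf, Switzerland",
}


def normalize_location(raw):
    """Map a raw location string to a common name where a rule matches."""
    if not isinstance(raw, str):
        return "unknown"  # missing values (None / NaN)
    lowered = raw.lower()
    for keyword, common_name in LOCATION_RULES.items():
        if keyword in lowered:
            return common_name
    return raw  # no rule matched: keep the original spelling


demo["location_clean"] = demo["location_location"].map(normalize_location)
counts = demo["location_clean"].value_counts()
print(counts)
```

Applied to the full table, `counts` gives the number of scenarios per normalized location, which can then be inspected or plotted.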
We see that Dübendorf, Switzerland is the predominant location in our dataset. In the same way, one could analyze other relevant features, such as temperature, pH or half-lives.
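For numeric features, a quick first look is a summary table. The sketch below assumes the columns `acidity_lowPh` (visible in the table above) and `temperature_temperatureMin` (inferred from the `<name>_<param>` naming scheme, so treat it as an assumption). The pH values are taken from the first rows of the table, while the temperature values are made up.

```python
import pandas as pd

# Small illustrative frame; replace with raw_data when working on the full table.
demo = pd.DataFrame({
    "acidity_lowPh": [8.1, 6.3, 7.1, None],
    "temperature_temperatureMin": [20.0, 20.0, None, 15.0],  # made-up values
})

# describe() skips missing values, so sparse columns are still summarized.
summary = demo[["acidity_lowPh", "temperature_temperatureMin"]].describe()
print(summary.loc[["count", "mean", "min", "max"]])
```

Because many scenarios report only a subset of parameters, the `count` row is worth checking before drawing conclusions from the other statistics.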