Phenotype Curation (UNDER CONSTRUCTION)
This page provides a few examples of how the raw health data in the TRE have been curated to produce useful output for a range of analyses. The files used in the examples below are all located in genesandhealth/library-red; please see the page describing TRE file structures for further information. An overview of the different data files is available in the raw phenotype data page.
Tools
The philosophy of the Genes and Health project is to provide researchers with the tools to generate their own phenotypes as well as providing a curated set of phenotypes.
The curated phenotypes are generated using the tre-tools package, which is a custom package developed by the Data Team. A curated phenotype is a phenotype that has been generated using a set of codelists that have been checked and validated by researchers in Genes and Health.
The code is available at tre-tools. We strongly recommend that you use this package to generate your own phenotypes, because it was written by the Data Team following a software engineering approach and has been thoroughly tested.
Codelists
There are a number of codelists available in the TRE that can be used to generate phenotypes. These are stored in the library-red folder, in a folder called codelists, and are named according to the phenotype they are used to generate.
Codelists are simple CSV files that contain the codes used to generate the phenotype. If you wish to use your own codelist, you can use the provided clipboard functionality, which lets you transfer text data, such as CSV files, from your local machine to the TRE.
To make sure that analyses are only carried out with codelists that have the expected structure (e.g. the correct format for ICD10), the Data Team have created a set of tools in tre-tools that can be used to check the codelists.
The tool checks that all codes in the codelist conform to the expected format, and that there are no duplicate codes. For example, it checks that a SNOMED codelist contains only SNOMED codes, which are numerical values between 6 and 18 digits in length.
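The two checks described above can be sketched in plain Python. This is an illustration only and not the tre-tools implementation: the function name check_snomed_codes and the wording of the messages are hypothetical, and the only rules applied are the ones stated above (6 to 18 digits, no duplicates).

```python
import re

# Rule stated above: a SNOMED code is a numerical value of 6 to 18 digits.
SNOMED_PATTERN = re.compile(r"[0-9]{6,18}")


def check_snomed_codes(codes):
    """Return a list of problems found in `codes` (empty if the list is clean).

    Hypothetical helper for illustration; tre-tools may report problems differently.
    """
    problems = []
    seen = set()
    for code in codes:
        # fullmatch requires the whole string to be 6 to 18 digits
        if not SNOMED_PATTERN.fullmatch(code):
            problems.append(f"invalid format: {code!r}")
        # a code that has already been seen is reported as a duplicate
        if code in seen:
            problems.append(f"duplicate code: {code!r}")
        seen.add(code)
    return problems


# Example usage with made-up codes
clean = check_snomed_codes(["100000001", "100000002"])
dirty = check_snomed_codes(["100000001", "E11.9", "100000001"])
print(clean)
print(dirty)
```

Returning a list of problems rather than stopping at the first one makes it easier to fix a codelist in a single pass; a stricter version could raise an error instead.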
Example code
The following code loads a codelist called Diabetes.
diabetes.csv
code,term
"100000001","Type 2 Diabetes"
"100000002","Diabetic Review"
from tretools.codelists.codelist_types import CodelistType
from tretools.codelists.codelist import Codelist
# Load the codelist
diabetes = Codelist("diabetes.csv", CodelistType.SNOMED.value)
In the example above, we have loaded a codelist called diabetes.csv which contains two SNOMED codes. The Codelist class is used to load the codelist, and the CodelistType enum is used to specify the type of codelist. In this case, the codelist is a SNOMED codelist. If any of the codes in the codelist do not conform to the expected format, an error will be raised.
For more information on how to use the tre-tools package, please see the tre-tools documentation, and the test files in the tre-tools repository for examples of how to use the package. We welcome any feedback on the package, and are happy to help with any issues you may encounter.
Binary Traits
Binary traits are those that have two possible outcomes, such as disease status (e.g. diabetes, hypertension). A person either has the disease or they do not.
The raw health data in the TRE is curated to produce binary traits for a range of analyses. The files used in the examples below are all located in genesandhealth/library-red. There is a lengthy README file in the folder that describes the data in detail, along with timestamps for when the data was last updated. We recommend that you read this file before using the data.
The binary traits data was generated using the tre-tools custom package. The code is available at tre-tools. The data includes all the primary care and hospital data for the Genes and Health cohort, and has been cleaned, processed and saved in a format that is easy to use for different types of analyses. The cleaning processes are described in the README file in the folder, but in summary:
- Each data file has been loaded and de-duplicated on the basis of the unique identifier for each patient, date and code.
- All datasets of the same coding system (for example, ICD10) have been merged into a single dataset for each "time cut" of data received.
- All datasets have been merged into a single dataset for each coding system.
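As a rough illustration of the steps above, the sketch below de-duplicates records on patient identifier, date and code, and then merges several "time cuts" of one coding system into a single dataset. The record layout, the field names and the function names are assumptions made for this example; the actual processing is the one described in the README.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """One coded event. Field names are assumed for illustration."""
    patient_id: str
    date: str  # ISO date string, e.g. "2020-01-31"
    code: str


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each (patient_id, date, code) triple."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def merge_time_cuts(time_cuts: Iterable[Iterable[Record]]) -> List[Record]:
    """Concatenate all time cuts of one coding system, then de-duplicate
    again in case the same event appears in more than one cut."""
    merged = [record for cut in time_cuts for record in cut]
    return deduplicate(merged)


# Example usage with made-up ICD10 records from two time cuts
cut_a = [Record("P1", "2020-01-31", "E11"), Record("P1", "2020-01-31", "E11")]
cut_b = [Record("P1", "2020-01-31", "E11"), Record("P2", "2021-06-01", "I10")]
icd10 = merge_time_cuts([cut_a, cut_b])
print(icd10)
```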
Binary trait reports are generated using the tre-tools package and use the ProcessedDataset and Codelist classes to generate a PhenotypeReport object. The PhenotypeReport object can then be used to generate a range of reports for the binary traits, including Regenie input files.
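The general idea behind a binary trait can be sketched without tre-tools: a participant is a case if any of their records carries a code from the codelist, and a control otherwise. The sketch below is not the PhenotypeReport API; the function names, the input layout and the trait name are assumptions. The output follows the whitespace-delimited phenotype layout described in the Regenie documentation (a header of FID and IID followed by trait names, with binary traits coded 0 for controls and 1 for cases); please check the README and the tre-tools documentation for the exact files produced in the TRE.

```python
from typing import Dict, Iterable, List, Set, Tuple


def binary_trait(
    events: Iterable[Tuple[str, str]],
    codelist: Set[str],
    participants: Iterable[str],
) -> Dict[str, int]:
    """Map each participant to 1 (case) or 0 (control).

    `events` is an iterable of (participant_id, code) pairs (assumed layout).
    """
    # A participant is a case if at least one of their codes is in the codelist
    cases = {pid for pid, code in events if code in codelist}
    return {pid: int(pid in cases) for pid in participants}


def to_regenie_lines(trait_name: str, status: Dict[str, int]) -> List[str]:
    """Format the trait as the lines of a Regenie-style phenotype file.

    Using the participant ID for both FID and IID is an assumption.
    """
    lines = [f"FID IID {trait_name}"]
    for pid in sorted(status):
        lines.append(f"{pid} {pid} {status[pid]}")
    return lines


# Example usage with made-up participants and the codes from diabetes.csv
events = [("P1", "100000001"), ("P2", "999999999"), ("P1", "100000002")]
status = binary_trait(events, {"100000001", "100000002"}, ["P1", "P2", "P3"])
lines = to_regenie_lines("diabetes", status)
print("\n".join(lines))
```

Note that participants with no records at all (P3 here) are treated as controls in this sketch; whether that is appropriate depends on the analysis.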