Build DNA methylation database schema and tables

Build DNA methylation database schema and tables using results generated from <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165>. According to @ewafula, the results in <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165> are still under preparation.

The following is a brief outline of how to build the DNA methylation database schema and tables. Implementing these procedures may require further discussion and modification, to coordinate with the development of the DNA methylation API plot and table endpoints.

Add a new DNA methylation database schema to store all DNA methylation tables, so that table names do not collide with existing tables in the `bulk_expression` schema. The DNA methylation schema can have any appropriate name; the following description uses `dna_methylation` as an example schema name. The steps to add a new `dna_methylation` schema are:

- Add `DNA_METHYLATION_SCHEMA=dna_methylation` in [`../OpenPedCan-api-secrets/common_db.env`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/README.md#32-prepare-docker-environment-files).
- Add `ENV DNA_METHYLATION_SCHEMA="dna_methylation"` in [`Dockerfile`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/4696555392bfe9d69cec6f65998bca4aa4a294df/Dockerfile#L41).
- Add the following code in [`db/init_db.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/db/init_db.sh#L69-L78):

  ```bash
  CREATE SCHEMA ${DNA_METHYLATION_SCHEMA};
  GRANT SELECT ON ALL TABLES IN SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
  ALTER DEFAULT PRIVILEGES IN SCHEMA ${DNA_METHYLATION_SCHEMA} GRANT SELECT ON TABLES TO ${DB_USERNAME};
  ```

- Add `DNA_METHYLATION_SCHEMA = "DNA_METHYLATION_SCHEMA"` in [`db/r_interfaces/db_env_vars.R`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/4696555392bfe9d69cec6f65998bca4aa4a294df/db/r_interfaces/db_env_vars.R#L37-L38).
- Add the following code in [`db/load_db.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/db/load_db.sh#L89-L95):

  ```bash
  GRANT USAGE ON SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
  GRANT SELECT ON ALL TABLES IN SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
  ALTER DEFAULT PRIVILEGES IN SCHEMA ${DNA_METHYLATION_SCHEMA} GRANT SELECT ON TABLES TO ${DB_USERNAME};
  ```

- Add `--schema="$DNA_METHYLATION_SCHEMA"` in [`db/build_tools/build_db_docker_cmd.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/58c024468bcd284ccf85091436ff64c7f9a9c397/db/build_tools/build_db_docker_cmd.sh#L84-L87) to include `DNA_METHYLATION_SCHEMA` in the database dump.
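
The schema name set above is used unquoted in SQL, lowercased with `tolower()` in `db/build_tools/build_db.R`, and interpolated as-is into shell paths. PostgreSQL folds unquoted identifiers to lowercase, so a value that is already a plain lowercase identifier keeps all three views of the name identical. The following is a minimal sketch of a guard that could run before the build; `require_lower_ident` is a hypothetical helper and is not part of the OpenPedCan-api repository.

```bash
# Hypothetical helper, not part of the OpenPedCan-api repository.
# Usage: require_lower_ident NAME VALUE
# Returns 0 only if VALUE is a non-empty identifier made of lowercase ASCII
# letters, digits, and underscores that does not start with a digit.
# The function body runs in a subshell so the LC_ALL change stays local.
require_lower_ident() (
  LC_ALL=C  # make the [a-z] ranges below byte-based
  case "$2" in
    '' | [!a-z_]* | *[!a-z0-9_]*)
      echo "Error: $1='$2' is empty or not a lowercase identifier" >&2
      return 1
      ;;
  esac
)

# Example call (assumes the variable comes from common_db.env):
#   require_lower_ident DNA_METHYLATION_SCHEMA "${DNA_METHYLATION_SCHEMA:-}" || exit 1
```

The same check could be applied to table name variables such as `DNA_METHYLATION_SUMMARY_TBL`.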

Add new DNA methylation tables to store DNA methylation results that are required to generate plots and tables for MTP. The DNA methylation tables can have any appropriate names, and the following description uses `dna_methylation_summary` as an example table name. Following are the steps to add a new `dna_methylation_summary` table:

- Add `DNA_METHYLATION_SUMMARY_TBL=dna_methylation_summary` in [`../OpenPedCan-api-secrets/common_db.env`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/README.md#32-prepare-docker-environment-files).
- Add `ENV DNA_METHYLATION_SUMMARY_TBL="dna_methylation_summary"` in [`Dockerfile`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/4696555392bfe9d69cec6f65998bca4aa4a294df/Dockerfile#L43).
- Add `DNA_METHYLATION_SUMMARY_TBL = "DNA_METHYLATION_SUMMARY_TBL"` in [`db/r_interfaces/db_env_vars.R`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/4696555392bfe9d69cec6f65998bca4aa4a294df/db/r_interfaces/db_env_vars.R#L37-L38).
- Create an empty table in [`db/build_tools/build_db.R`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/db/build_tools/build_db.R#L660):

  ```R
  # dna_methylation_summary_tbl is assumed to be the table to load into the database.
  #
  # Only the column names and types are used here, so dna_methylation_summary_tbl
  # need not be the full table; it may have zero or only a few rows.
  db_write_table(
    dplyr::slice(dna_methylation_summary_tbl, 0), conn,
    tolower(db_env_vars$DNA_METHYLATION_SCHEMA),
    tolower(db_env_vars$DNA_METHYLATION_SUMMARY_TBL))
  ```

- Prepare a CSV file for loading into the database. The CSV file can be prepared either in <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165> or in `db/build_tools/build_db.R`.
  - If prepared in <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165>:
    - Upload the prepared CSV file to an S3 bucket as a new OpenPedCan-analysis data release.
    - The prepared CSV file will be downloaded to `OpenPedCan-analysis/data/` by [`db/build_db.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/58c024468bcd284ccf85091436ff64c7f9a9c397/db/build_db.sh#L49-L55).
  - If prepared in `db/build_tools/build_db.R`:
    - Upload a result file to an S3 bucket as a new OpenPedCan-analysis data release. The result file can be in any format, because it will be processed further before loading into the database.
    - The result file will be downloaded to `OpenPedCan-analysis/data/` by [`db/build_db.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/58c024468bcd284ccf85091436ff64c7f9a9c397/db/build_db.sh#L49-L55).
    - Load the result file in `db/build_tools/build_db.R`.
    - Prepare the table for database loading, for example by filtering rows, renaming columns, and reordering columns.
    - Write the prepared table to a CSV file, following the procedure in [`db/build_tools/build_db.R`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/0a7046b1dedc7a7b954400edae7a45b4d60a8a98/db/build_tools/build_db.R#L620-L627).
- Add the following code in [`db/build_tools/build_db_docker_cmd.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/58c024468bcd284ccf85091436ff64c7f9a9c397/db/build_tools/build_db_docker_cmd.sh#L64-L68) to load the prepared CSV file to the database:

  ```bash
  # Assuming the prepared CSV file is at ${BUILD_OUTPUT_DIR_PATH}/${DNA_METHYLATION_SCHEMA}_${DNA_METHYLATION_SUMMARY_TBL}.csv
  psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$DB_NAME" <<EOSQL
  COPY ${DNA_METHYLATION_SCHEMA}.${DNA_METHYLATION_SUMMARY_TBL}
  FROM '${BUILD_OUTPUT_DIR_PATH}/${DNA_METHYLATION_SCHEMA}_${DNA_METHYLATION_SUMMARY_TBL}.csv'
  WITH (FORMAT csv, HEADER);
  EOSQL
  ```

- Add the following code in [`db/build_tools/build_db_docker_cmd.sh`](https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/58c024468bcd284ccf85091436ff64c7f9a9c397/db/build_tools/build_db_docker_cmd.sh#L92-L93) to add a table index on the column that will be queried:

  ```bash
  echo "CREATE INDEX ensg_id_idx ON ${DNA_METHYLATION_SCHEMA}.${DNA_METHYLATION_SUMMARY_TBL} (\"Gene_Ensembl_ID\");" \
  | gzip --no-name -c >> "$db_dump_out_path"
  ```
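
Because `COPY ... WITH (FORMAT csv, HEADER)` skips the header line without matching it against the table's column names, and the index above is created on `"Gene_Ensembl_ID"`, it may help to confirm that the prepared CSV actually has that column before loading. The following is a minimal sketch; `csv_header_has_column` is a hypothetical helper, not part of the repository, and it assumes that header fields contain no embedded commas.

```bash
# Hypothetical helper, not part of the OpenPedCan-api repository.
# Usage: csv_header_has_column CSV_PATH COLUMN_NAME
# Returns 0 if the first line of the CSV lists COLUMN_NAME as a field.
# Assumes header fields contain no embedded commas.
csv_header_has_column() {
  head -n 1 "$1" \
    | tr -d '"\r' \
    | tr ',' '\n' \
    | grep -Fxq -- "$2"
}

# Example call (the path follows the naming assumed in the COPY step above):
#   csv_header_has_column \
#     "${BUILD_OUTPUT_DIR_PATH}/${DNA_METHYLATION_SCHEMA}_${DNA_METHYLATION_SUMMARY_TBL}.csv" \
#     Gene_Ensembl_ID || { echo "Gene_Ensembl_ID column not found" >&2; exit 1; }
```

The helper strips double quotes and carriage returns from the header, splits it on commas, and looks for an exact, case-sensitive match. It does not check column order, which still has to match the table created in `db/build_tools/build_db.R`.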

The updated database building procedure can be tested using `./db/build_db.sh`.

cc @afarrel @chinwallaa
