Running ehrQL
You can run ehrQL in two places:
- on your own computer, where you can try out ehrQL against dummy tables, and test that your analysis code runs correctly
- on an OpenSAFELY backend database, to use ehrQL with real tables
Running ehrQL on your own computer against dummy tables🔗
There are three ways to run ehrQL on your own computer against dummy tables:
- using the OpenSAFELY VS Code extension, to explore the contents of ehrQL elements and datasets
- as a standalone action, to test your dataset definition, via
opensafely exec
- as the first step in an OpenSAFELY pipeline, to test the whole pipeline, via
opensafely run
1. Exploring ehrQL using the OpenSAFELY VS Code extension🔗
The OpenSAFELY VS Code extension uses dummy tables to allow you to inspect the contents of ehrQL tables, columns, datasets and queries as you are writing ehrQL in VS Code.
2. Running ehrQL as a standalone action via opensafely exec
🔗
To actually run your ehrQL queries against real tables, you need to write a dataset definition and save it in a file.
But first, while you are developing an ehrQL query, you can run your dataset definition against dummy tables to produce an output file that you can inspect.
Copy and paste the following dataset definition
into a new file called dataset_definition.py
and save it in your learning-ehrql
directory:
from ehrql import create_dataset
from ehrql.tables.core import patients, medications
dataset = create_dataset()
dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31"))
asthma_codes = ["39113311000001107", "39113611000001102"]
latest_asthma_med = (
medications.where(medications.dmd_code.is_in(asthma_codes))
.sort_by(medications.date)
.last_for_patient()
)
dataset.med_date = latest_asthma_med.date
dataset.med_code = latest_asthma_med.dmd_code
Can you work out what the dataset definition will generate?
Make sure you save the file!
From the learning-ehrql
directory,
use the command below to run your dataset definition with ehrQL.
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv
ehrQL dataset definitions are written in Python. But, unlike typical Python code, we instead run the dataset definition via the OpenSAFELY CLI. The OpenSAFELY CLI internally uses a correctly configured version of Python to run the dataset definition.
What each part of this command does🔗
opensafely exec ehrql:v1
uses the OpenSAFELY CLI to run ehrQL. Thev1
after the:
refers to the version of ehrQL being used.generate-dataset
instructs ehrQL to generate a dataset from the dataset definition.dataset_definition.py
specifies the filename of the dataset definition to use.- The dataset definition file is in the directory that we are running
opensafely exec
so we do not need to specify the full path to the file in this case.
- The dataset definition file is in the directory that we are running
--dummy-tables example-data
specifies that the dummy CSV input data is in theexample-data
directory.- If the
--dummy-tables
option is omitted, randomly generated data will be used instead.
- If the
--output output/dataset.csv
specifies the path to the output CSV file.- If the
--output
option is omitted, the output will be not be saved to a file, but displayed on screen.
- If the
What you should see when you run the command🔗
You should see output displayed similar to this:
2023-04-19 08:53:41 [info ] Compiling dataset definition from dataset_definition.py [ehrql.main]
2023-04-19 08:53:41 [info ] Generating dummy dataset [ehrql.main]
2023-04-19 08:53:41 [info ] Reading CSV data from example-data [ehrql.main]
2023-04-19 08:53:41 [info ] Building dataset and writing results [ehrql.main]
The date and time you see will differ from that here.
The output file🔗
The output will be stored in a file called dataset.csv
in the output
directory.
The file will contain the following CSV data:
patient_id,med_date,med_code
0,2014-01-11,39113611000001102
1,2018-09-21,39113311000001107
4,2017-05-11,39113611000001102
5,2019-07-06,39113611000001102
6,,
7,,
8,,
9,,
The bottom 4 rows in the generated dataset show that there are 4 patients in the defined population that do not have any record for the medications specified in the dataset definition.
Try running the ehrQL dataset definition again,
without the --dummy-tables
and --output
options:
opensafely exec ehrql:v1 generate-dataset dataset_definition.py
By not specifying the dummy tables to use, random data is used as the data source to generate the dataset, instead of using the sample dummy tables that you downloaded previously.
By not specifying the output file, the output is displayed on screen. This can be useful for quickly seeing output while working on a dataset definition, instead of saving to a file, and then opening the file.
When things go wrong🔗
If your dataset definition contains some invalid ehrQL, an error message will be displayed on the screen.
This is one example:
$ opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv
2023-04-21 17:53:42 [info ] Compiling dataset definition from dataset_definition.py [ehrql.main]
Error loading file 'dataset_definition.py':
Traceback (most recent call last):
File "/workspace/dataset_definition.py", line 10, in <module>
dataset.med_date = latest_asthma_med.dat
^^^^^^^^^^^^^^^^^^^
AttributeError: 'medications' object has no attribute 'dat'
Refer to the catalogue of errors for help with interpreting error messages.
3. Running ehrQL in an OpenSAFELY pipeline via opensafely run
🔗
To run your ehrQL queries as part of an OpenSAFELY pipeline with opensafely run
,
you need to have a file called project.yaml
.
There is considerably more technical detail on the project pipeline in the OpenSAFELY documentation.
Copy the following into a file called
project.yaml
in your learning-ehrql
directory:
version: '4.0'
actions:
generate_dataset:
run: ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv.gz
outputs:
highly_sensitive:
dataset: output/dataset.csv.gz
summarise_dataset:
run: python:latest summarise_dataset.py
needs: [generate_dataset]
outputs:
moderately_sensitive:
dataset: output/summary.txt
Users already familiar with the OpenSAFELY research template may notice that the research template already includes a basic project.yaml
file that can be edited.
Here, for the purposes of this tutorial,
to skip setting up the template,
we create this file entirely by hand.
The project.yaml
file defines two actions: generate_dataset
and summarise_dataset
.
Each of these actions specifies one or more outputs
.
The definitions of "highly sensitive" and "moderately sensitive" are explained in the project.yaml
documentation.
The generate_dataset
action's run:
command should look familiar from the previous section.
However, note that the --output
path is now to a compressed CSV file (dataset.csv.gz
).
We recommend the use of compressed CSV files when generating a dataset definition as part of an OpenSAFELY pipeline.
summarise_dataset
uses a Python script called summarise_dataset.py
.
Copy the following into a file called summarise_dataset.py
in your learning-ehrql
directory.
import pandas as pd
dataframe = pd.read_csv("output/dataset.csv.gz")
num_rows = len(dataframe)
with open("output/summary.txt", "w") as f:
f.write(f"There are {num_rows} patients in the population\n")
Even if you don't know how to use pandas, can you guess at what this code might do before you run the OpenSAFELY project?
From the learning-ehrql
directory,
use the command below to run all of the actions
in project.yaml
:
opensafely run run_all
If is this is the first time you have used opensafely exec
,
the OpenSAFELY CLI may fetch some other Docker images (python
and busybox
) needed to run the action.
What you should see when you run the command🔗
You should see in the logs output displayed similar to this:
$ opensafely run run_all
Running actions: generate_dataset, summarise_dataset
jobrunner.run loop started
generate_dataset: Preparing your code and workspace files
...
summarise_dataset: Extracting output file: output/summary.txt
summarise_dataset: Finished recording results
summarise_dataset: Completed successfully
summarise_dataset: Cleaning up container and volume
=> generate_dataset
Completed successfully
log file: metadata/generate_dataset.log
outputs:
output/dataset.csv.gz - highly_sensitive
=> summarise_dataset
Completed successfully
log file: metadata/summarise_dataset.log
outputs:
output/summary.txt - moderately_sensitive
Some of the middle lines of this log have been omitted.
The output files🔗
The generate_dataset
action will generate a compressed CSV file called dataset.csv.gz
in the output
directory.
If you unzip this, you should see the same output as the previous example.
The summarise_dataset
action will generate a small text file called summary.txt
in the output
directory.
This will tell you how many patients are in your population.
Running ehrQL on an OpenSAFELY backend database🔗
Once you are happy with your ehrQL queries and any analysis code, you can submit your project to run against real data in an OpenSAFELY backend database.
To submit your project to run against real data, refer to the existing documentation on using the OpenSAFELY jobs site.
You will require approval for an OpenSAFELY project, before you can submit your project to the jobs site.
Questions🔗
- Why would you use the ehrQL sandbox?
- Which
opensafely
command would you use to run just a dataset definition as a single action? - Which
opensafely
command would you use to run an entire OpenSAFELY project consisting of multiple actions?