The FHIR® specification (Release 4) has been published, and the scope of FHIR is growing along with it. Ancillary specifications are under development that build on the core FHIR specification to meet specific needs. These specifications take common use cases and describe them in detail, along with architectural options and supporting FHIR artifacts. This helps prevent the same problem from being solved in multiple ways by different implementers.

The FHIR Bulk Data Export Specification
One of these is FHIR Bulk Data Extract, which is designed for use cases where information about multiple patients needs to be extracted from a data source, such as an Electronic Health Record (EHR) or FHIR server, to support requirements such as:

  • Obtaining data: from an EHR for population-based research, training data for a machine learning algorithm, or quality metrics.
  • Moving data: in bulk between servers – perhaps when medical practices and EHRs merge, or when an organisation changes its EHR. Another example would be moving directory data between servers as described in the Validated Healthcare Directory (VHDir) Implementation Guide.
  • Extracting data: entire data sets on patient groups for analytics – for example, patients on a clinical trial or under Case Management.

The data that is returned might include Personal Health Information (PHI), or it may have been de-identified where returning PHI is not appropriate. It’s important to appreciate that this specification is in the early stages of development and, like FHIR itself, is being tested at connectathons as it is developed. Many major EHR vendors, as well as plenty of others, are involved in the FHIR bulk data extract development. Here are the high-level results from the connectathon held in January at the 2019 Working Group Meeting in San Antonio.

The client submits a request for data, and the server will create a file containing the data for the client to consume. The actual request can be specified in a number of ways:

  • Firstly, the extract could be across all the patients in the server, or just a subset of patients – as defined by a Group resource.
  • You can also specify which resource types you are interested in, and a ‘resource modified’ date so that only resources changed since then are returned.
  • There’s also a new ‘type filter’ being discussed that allows you to get more granular (e.g., include only blood pressure and pulse observations from the last six months).
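
To make those options a little more concrete, here is a minimal Python sketch that assembles a kickoff URL from them. The helper name `build_export_url`, the `example.org` server, and the `trial-42` group id are all made up for illustration. The `Group/[id]/$export` path and the `_since` parameter name reflect my reading of the draft specification, so check them against the current version before relying on them.

```python
from urllib.parse import urlencode


def build_export_url(base, group_id=None, types=None, since=None):
    """Build a bulk data kickoff URL (hypothetical helper, not part of any library)."""
    # Export across the whole server, or only the patients in one Group.
    path = f"{base}/Group/{group_id}/$export" if group_id else f"{base}/$export"
    params = {}
    if types:
        # Resource types go in a single comma-separated list with no spaces.
        params["_type"] = ",".join(types)
    if since:
        # The 'resource modified' date: only resources changed after this instant.
        params["_since"] = since
    # Leave "," and ":" unescaped so the URL stays readable.
    query = urlencode(params, safe=",:")
    return f"{path}?{query}" if query else path


print(build_export_url("https://example.org/fhir", group_id="trial-42",
                       types=["Patient", "Observation"],
                       since="2019-01-01T00:00:00Z"))
```

The ‘type filter’ is left out on purpose, as it was still under discussion at the time of writing.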

The Asynchronous Approach

Because these searches can produce very large amounts of data, the specification describes an ‘asynchronous’ approach – in other words, the client doesn’t wait for the server to create the extract, as this could take a long time. The details are in the specification, but in brief:

  1. The client makes a request of the server, called the ‘kickoff’ request. If the request is accepted, the server returns a location that the client can query to determine when the extraction has been completed.
  2. The client periodically polls the server, using the address it has been given, to monitor the progress of the extraction. When the extraction is complete, the server returns an ‘outcome’ response that contains the location of the file/s containing the extracted data, any errors that may have occurred, security details for actually downloading the data, and other information.
  3. The client then downloads the files containing the extracted data (in ndjson format – which describes how to have multiple JSON objects in a single file) from the specified location, observing whatever security measures were specified. For example, it may have to provide a specific access token that was previously negotiated with the server.
  4. Finally, the client informs the server that the data has been downloaded and the files can be deleted. (Of course, nothing is stopping the server from doing this automatically after a time interval.)
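
The four steps above can be sketched in Python roughly as follows. This is an illustration under a number of assumptions, not a reference client. The HTTP call is passed in as a `send(method, url, headers)` function returning `(status, headers, body)`, so the flow can be tried without a real server. The status codes (202 while in progress, 200 when done), the `Prefer: respond-async` and `Content-Location` headers, and the `output` and `requiresAccessToken` fields are how I read the draft specification and may change. The names `run_export` and `parse_ndjson` are my own.

```python
import json
import time


def parse_ndjson(text):
    """ndjson holds one JSON object per line; blank lines are skipped."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def run_export(send, kickoff_url, token=None, poll_seconds=5.0, sleep=time.sleep):
    """Walk through kickoff, polling, download and clean-up; returns the resources."""
    auth = {"Authorization": f"Bearer {token}"} if token else {}

    # 1. Kickoff: ask for asynchronous processing and expect a status location back.
    status, headers, _ = send("GET", kickoff_url,
                              {"Accept": "application/fhir+json",
                               "Prefer": "respond-async", **auth})
    if status != 202:
        raise RuntimeError(f"kickoff rejected with HTTP {status}")
    status_url = headers["Content-Location"]

    # 2. Poll: 202 means still working, 200 carries the 'outcome' document.
    while True:
        status, headers, body = send("GET", status_url,
                                     {"Accept": "application/json", **auth})
        if status == 200:
            outcome = json.loads(body)
            break
        if status != 202:
            raise RuntimeError(f"export failed with HTTP {status}")
        # Assumes Retry-After, when present, is a number of seconds.
        sleep(float(headers.get("Retry-After", poll_seconds)))

    # 3. Download each file, sending the token only if the outcome asks for it.
    file_headers = {"Accept": "application/fhir+ndjson"}
    if outcome.get("requiresAccessToken"):
        file_headers.update(auth)
    resources = []
    for entry in outcome.get("output", []):
        status, _, body = send("GET", entry["url"], file_headers)
        if status != 200:
            raise RuntimeError(f"download of {entry['url']} failed with HTTP {status}")
        resources.extend(parse_ndjson(body))

    # 4. Tell the server the files are no longer needed.
    send("DELETE", status_url, auth)
    return resources
```

Loading every resource into one list defeats the purpose for a genuinely large extract; a real client would more likely stream each file to disk and process it line by line.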

So here’s an example of a kickoff request that might be used to pull directory related resources from a registry into some other server:

GET [base]/$export?_type=Organization,Location,Practitioner,PractitionerRole,HealthcareService

Or how about moving terminology resources between terminology servers: 

GET [base]/$export?_type=ValueSet,CodeSystem,ConceptMap,NamingSystem

The Importance of Security 

Security is front and centre in the specification. It notes that functionality such as this is a ‘holy grail’ for attackers – ‘give me all your data’ is an attractive operation to subvert!

The details are in the specification, and discussions are ongoing. So far, the specification states that:

  • All transport must be secure (obviously).
  • Authentication (identifying the client) is paramount, and there is work on specifying, which was tested at the recent connectathon based on the SMART
  • The server is obligated to preserve patient privacy. The exact mechanism depends on the implementation; for example, a server could support ‘opt-out’ consent, or ‘opt-in’ consent like Sync for Science.
  • Only data that the client is able to access should be returned. Again, the exact mechanism will vary.
  • The output may be encrypted in a format that only the client can decrypt.

In conclusion, as FHIR matures and is increasingly accepted as the way healthcare information is exchanged (and at times stored), attention is turning to how best to use FHIR for specific scenarios. Sometimes this advice takes the form of Implementation Guides, but where the scenario is more general it is published as a separate specification.

Also, thanks to Brian Postlethwaite for helping out with this post.

To learn more about FHIR, read my white paper, Enabling the Ecosystem with FHIR, which discusses how healthcare information can be made available where and when it is needed.