Five minutes with VP of Engineering Reece Robinson

In our earlier chats we discussed why it’s important to have well-organized data, and what it takes to get it perfectly unified and centralized. That matters because even in a warehouse or lakehouse, the data is still siloed. But when it’s unified, standardized and centralized into a data supermodel, the data environment becomes almost supercharged, enabling providers to predict trends, identify potential risks, and address business and patient issues before they escalate.

Additionally, out-of-the-box data lake and lakehouse solutions will teach you how to use a product but not how to adapt it to your specific use cases. You’re stuck with it as is, not perfectly tuned to your organization, which means more work for your analysts, integration engineers and data engineers. With a supermodel, that work has already been done. Analysts have specified the data model and its behaviour – the logic – and the engineers have built the use-case implementation, where, for example, information about a person is linked to any other existing information about them inside the system.
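
To make that linking idea concrete, here’s a minimal sketch in Python. The `PersonRecord` shape and the MRN-only match rule are assumptions for illustration – a real use-case implementation would match on far richer context.

```python
# A minimal sketch of person-linking, assuming a simple MRN-based match.
# PersonRecord and PersonIndex are invented for this illustration.
from dataclasses import dataclass, field

@dataclass
class PersonRecord:
    mrn: str                    # medical record number from the source system
    name: str
    birth_date: str             # ISO 8601, e.g. "1984-07-02"
    facts: list = field(default_factory=list)  # diagnoses, encounters, coverage, ...

class PersonIndex:
    """Links incoming data about a person to everything already known about them."""

    def __init__(self):
        self._by_mrn = {}

    def ingest(self, record: PersonRecord) -> PersonRecord:
        existing = self._by_mrn.get(record.mrn)
        if existing is None:
            # First time we see this person: store the record as-is.
            self._by_mrn[record.mrn] = record
            return record
        # Already known: merge the new facts into the existing context.
        existing.facts.extend(record.facts)
        return existing

index = PersonIndex()
index.ingest(PersonRecord("123", "DOE^JANE", "1984-07-02", ["encounter:A01"]))
merged = index.ingest(PersonRecord("123", "DOE^JANE", "1984-07-02", ["dx:E11.9"]))
print(merged.facts)  # ['encounter:A01', 'dx:E11.9']
```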

This is in part because of the pipelines, i.e. how the data gets in and out. Quality outcomes rely on quality data, and too many solutions only do half the job. You might think this stuff is only relevant for data nerds like me, but it’s important to understand if you are in any way tasked with solving an organization’s health data challenges, because many products sound very similar, yet each provider is uniquely complex and until now there has been no one-size-fits-all solution. That’s how organizations end up spending billions of dollars on tech infrastructure and still have non-interoperable tech stacks and data that isn’t utilized to its full potential. Then later they spend more money on add-ons and extensions just to keep up. It’s totally unnecessary!

That’s the grand prize from the enterprise perspective – the ability to make all of your data available for all possible uses across all of an organization’s health and health-adjacent touchpoints. You want your data infrastructure to provide extensibility – the ability to apply it at scale – and also flexibility – the ability to adjust to any customer site, depending on how the data is going to be consumed. You might want to pull millions of data points to train predictive models for a population, or you might want to analyse a single individual on a clinical screen. Because unifying data is extremely difficult, these two tasks are usually handled by separate systems, each with its own tech stack. But a data supermodel makes it possible to do both, using the same data points. In basic terms that’s a huge time and cost saver.

“You might want to pull millions of data points to train predictive models for a population, or you might want to analyse a single individual on a clinical screen. A data supermodel makes it possible to do both, using the same data points”
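
As a rough illustration of that “same data points, two very different reads” idea, here’s a toy sketch. The `SuperModelStore` name, the record shape and the feature choices are all assumptions, not a real product API.

```python
# Illustrative only: a toy "supermodel" store queried two ways from the same data.

class SuperModelStore:
    def __init__(self, records):
        self._records = records                      # one unified set of data points

    def bulk_features(self):
        """Population-scale path: stream every record as model-training features."""
        for rec in self._records.values():
            yield {"age": rec["age"], "dx_count": len(rec["diagnoses"])}

    def patient_view(self, patient_id):
        """Point-of-care path: everything about one person for a clinical screen."""
        return self._records[patient_id]

store = SuperModelStore({
    "p1": {"age": 62, "diagnoses": ["E11.9", "I10"]},
    "p2": {"age": 35, "diagnoses": ["J45.909"]},
})
training_rows = list(store.bulk_features())   # feed a predictive model
chart = store.patient_view("p1")              # render a single clinical screen
```

The point is simply that the population-scale path and the point-of-care path read from the same unified records, rather than from two separately maintained stacks.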

Back to those pipelines! They’re vital for making that data useful as it comes in and out. In a data model they are pre-built, then extended and customised for each customer as necessary. At design time, you might ask the customer to share a large data sample – say 2 million Admit, Discharge, Transfer (ADT) messages – to get an idea of the statistical variation. As that sample data comes in through the pipeline, it goes through a series of processes to be ‘cleaned’. Analysts work to understand the fingerprint of the sample, and then define the parameters for what they’ll consider valid or invalid. They determine useful statistics and important fields to monitor, and what level of preparation the data needs before it can be processed, or standardized. Some providers might send insurance information, some might not, others might send resolved diagnosis codes. There’s massive variation even though they’re all using approved health data standards.

“Analysts should make sure they understand the unique fingerprint of a customer’s data, then design and adjust the pipelines to accommodate it. That way, when the solution is fully deployed, its pipelines are perfectly tuned to that provider’s data.”
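
Here’s a small sketch of what that design-time fingerprinting and validity check might look like in practice. The ADT field names, the required-field rule and the sample messages are illustrative assumptions only.

```python
# A rough sketch of the design-time "fingerprinting" step: profile a sample of
# ADT messages, then apply simple validity rules derived from that profile.
from collections import Counter

REQUIRED_FIELDS = ["patient_id", "event_type", "event_time"]

def profile(sample):
    """Count how often each field is populated across the sample."""
    fill = Counter()
    for msg in sample:
        for key, value in msg.items():
            if value not in (None, ""):
                fill[key] += 1
    return {k: v / len(sample) for k, v in fill.items()}

def is_valid(msg):
    """Validity rule chosen after reviewing the profile: required fields must be present."""
    return all(msg.get(f) not in (None, "") for f in REQUIRED_FIELDS)

sample = [
    {"patient_id": "123", "event_type": "A01", "event_time": "2024-05-01T10:00", "insurance": "Acme"},
    {"patient_id": "456", "event_type": "A03", "event_time": "", "insurance": None},
]
print(profile(sample))                      # e.g. insurance populated 50% of the time
clean = [m for m in sample if is_valid(m)]  # only fully populated messages pass
```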

Then consider output! Let’s say a researcher is looking at a number of use cases for analytics and population health. When they want to pull that data from the supermodel, it’s now nicely organized, health-data-model-shaped data that’s usable by any and all third-party platforms, or downstream applications, for any applicable use cases. Clinical viewer, patient portal – you name it. Different users might want to consume it in different ways – just HL7, or CCDAs, or custom CSV data sets, or some format that doesn’t even exist yet. So on the way out, you need pipelines that feed from the model and produce each of these formats. That not only saves you an incredible amount of time and effort, it means you can now predict trends, identify potential risks, and address business and patient issues before they escalate. You’ve future-proofed your organization.
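
As a sketch of that output side: one model-shaped record, several renderers, and a consumer picking whichever format it needs. Real HL7 v2 or C-CDA generation would use proper libraries; CSV and JSON stand in here, and the record shape is an assumption.

```python
# Sketch of the output side: one model-shaped record, several output pipelines.
import csv, io, json

record = {"patient_id": "123", "name": "DOE^JANE", "event": "A01"}

def to_csv(rec):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rec.keys())
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()

def to_json(rec):
    return json.dumps(rec)

OUTPUT_PIPELINES = {"csv": to_csv, "json": to_json}

def export(rec, fmt):
    """Pick whichever output pipeline a downstream consumer asked for."""
    return OUTPUT_PIPELINES[fmt](rec)

print(export(record, "csv"))   # same model record, rendered for a CSV consumer
print(export(record, "json"))  # ...and for a JSON consumer
```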

So that’s why you’d jump from a data lake to a data model. It’s all the value added by those processing pipelines, and it’s because your analysts and researchers can be productive from day one. With a model, your team can leverage all the data now. Compare that with researchers and analysts working only on raw data in the data lake: every time they dip in, they have to perform the work that the pipelines would do, and the chance of re-use next time is low. Data feeds and upstream systems can change, but with a data model there’s complete isolation of impact for the people and systems consuming data upstream or downstream. Even with a lakehouse, they’re still working on a use-case by use-case basis. That’s how powerful a data supermodel is, and why organizations like Gartner are so excited by them.
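
To close with one concrete picture of that isolation of impact, here’s a toy sketch: when an upstream feed renames its fields, only the ingestion mapping changes, and the downstream consumer of the model never notices. The mappings and field names are invented for illustration.

```python
# Sketch of "isolation of impact": an upstream field rename is absorbed in the
# ingestion mapping, so downstream code built on the model keeps working.

FEED_V1_MAPPING = {"pid": "patient_id", "adm_dt": "admitted_at"}
FEED_V2_MAPPING = {"patientIdentifier": "patient_id", "admissionDateTime": "admitted_at"}

def to_model(raw, mapping):
    """Normalize an upstream message into the stable model shape."""
    return {model_field: raw[source_field] for source_field, model_field in mapping.items()}

def downstream_report(model_record):
    # Downstream code only ever sees model field names, so it survives feed changes.
    return f"{model_record['patient_id']} admitted {model_record['admitted_at']}"

old_msg = {"pid": "123", "adm_dt": "2024-05-01"}
new_msg = {"patientIdentifier": "123", "admissionDateTime": "2024-05-01"}
assert downstream_report(to_model(old_msg, FEED_V1_MAPPING)) == \
       downstream_report(to_model(new_msg, FEED_V2_MAPPING))
```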