Summary Solution: To get accurate data into analysts’ hands, organizations should pair data lakes with data warehousing, balancing analysts’ need for freedom with the benefits of a structured data environment.

The potential of big data is becoming undeniable. The breadth of data collected has exploded, and the tools and techniques to effectively leverage it have never been more sophisticated. From tech companies born in the cloud to fast food chains, technology-forward, data-rich organizations are winning by harnessing this power.

Within the financial services industry, data streaming and machine learning are supercharging marketing and social media efforts, yielding insight into potential new product lines and market segments. But the story often stops there. While some financial institutions are more nimble, or have an innate comfort with data-driven decision-making (though this can vary by business unit), on balance they seem to be missing the boat in a way other industries are not. Why?

There are certainly several contributing factors at play, including unwieldy legacy tech, batch processing and the high burden of regulation. Even so, a driving factor is that, despite sitting on a treasure trove of data, these institutions struggle to deliver it effectively to the analysts themselves.

The reason is that the majority of big data found inside financial service companies tends to be obscured from frontline analysts – like your business analysts working in credit decisioning, who might use it to drive business outcomes. Sometimes it’s raw data hidden behind a curtain of pre-chewed, pre-aggregated, pre-trimmed reports in a clunky repository. Other times, it’s presented as an intimidating torrent of unstructured and unmanaged data that requires a computer science degree to use.

In order to truly leverage the power of big data, financial institutions must begin asking themselves: How can we give analysts the power of big data in a way that makes it accessible enough for them to derive insights from it?

The solution is to leverage the strengths of data lakes, warehouses and robust data management, exploration and visualization tools in tandem to provide analytical and technical communities the right balance of freedom and guardrails to do their jobs effectively while constantly upskilling to keep pace with this changing industry.

Some Things to Consider

While the goals of each data ecosystem vary, the attributes of a robust system remain constant. An ideal system balances freedom and guardrails, keeps data discoverable and well governed, and remains usable for analysts of varying technical depth.


About Data Lakes and Warehouses

At a high level, the difference between a data lake and a warehouse comes down to structure. Data lakes are meant to pool as much data as possible and make it available for any use case that comes along. To be clear: True data lakes pool all data to make it universally available in new and innovative ways. Simply saying “all of our data lives on AWS S3” does not make it a data lake. Data warehouses, on the other hand, are carefully curated places with specified schemas and usage patterns.

Data warehouses, whether on or off premises, are typically a step above prepackaged reports in something like Tableau. While views may be prescribed for analytical, production or other use cases, data is organized to support maximum flexibility for the identified use cases. The speed of querying warehouses with a language like SQL also makes them ideal for routine ad hoc work.
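As a minimal sketch of that routine ad hoc work, the snippet below uses Python’s built-in sqlite3 module as a stand-in for a warehouse connection; in practice this would be a connection to the enterprise warehouse, and the `transactions` table with its columns is an illustrative assumption, not a real schema:

```python
import sqlite3

# Stand-in warehouse: in practice this would be a connection to the
# enterprise warehouse. Table and column names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (segment TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("retail", 120.0), ("retail", 80.0), ("commercial", 500.0)],
)

# A typical ad hoc question: average transaction size by segment.
rows = conn.execute(
    "SELECT segment, AVG(amount) FROM transactions "
    "GROUP BY segment ORDER BY segment"
).fetchall()
print(rows)  # [('commercial', 500.0), ('retail', 100.0)]
```

Because the schema is predefined, the analyst only needs a few lines of SQL to get an answer – no parsing, no data wrangling.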

But for large-scale projects or specialized work, a data lake with entirely new types and subject areas of data is ideal. The challenge is in making the lake navigable, regardless of the user’s experience level.

Managing Data Effectively

A metadata registry – the central repository where metadata definitions are stored and maintained – is a vital tool for discoverability and navigability within any data ecosystem. The better analysts can find and understand what they’re looking at – its origin, relationships and use – the better the insights. Definitions on the quality of the data are also critical to ensuring users exercise the appropriate level of skepticism when dealing with a given dataset.
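To make the idea concrete, here is a hedged sketch of what a registry entry might capture – origin, quality and relationships. This is illustrative only: a real registry lives in a shared catalog service, and every field name and dataset name below is an assumption.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a real metadata registry would live in a
# shared catalog service, not in application code.
@dataclass
class DatasetEntry:
    name: str
    origin: str       # producing system or pipeline
    quality: str      # e.g. "raw", "validated", "certified"
    related: list = field(default_factory=list)  # linked datasets

registry = {}

def register(entry: DatasetEntry):
    registry[entry.name] = entry

register(DatasetEntry(
    name="credit_applications",
    origin="loan_origination_system",
    quality="validated",
    related=["credit_bureau_pulls"],
))

# An analyst can check provenance and quality before trusting a dataset.
entry = registry["credit_applications"]
print(entry.origin, entry.quality)  # loan_origination_system validated
```

The value is less in the code than in the discipline: every dataset an analyst can discover carries enough context to judge whether it should be trusted.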

When it comes to giving analysts insight into data, warehouses tend to excel, given the predefined schemas. The problem, however, is that they lack change control, which brings added risk when attempting to run a compliant, well-governed production process that must be reliable and repeatable.

Change control is easier to maintain in flat files like those found in data lakes – especially if they are well organized with descriptive attributes like write access and versioning, which tend to be granular and methodical. Timestamps on every file, and the increased granularity of formats like Parquet, leave a more obvious trail, should something change unexpectedly. Good change control also serves risk management and governance aims – either for inputs to a process or the memorialization of outputs at a point in time.
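A minimal sketch of that write-once, timestamped pattern follows. The path layout, file naming and JSON format are assumptions for illustration – a production lake would typically use a columnar format like Parquet and a unique run identifier rather than a second-granularity timestamp:

```python
import json
import time
from pathlib import Path

# Illustrative lake path; a real layout would be agreed with consumers.
LAKE_ROOT = Path("lake/credit/decisions")

def write_versioned(records, root=LAKE_ROOT):
    """Write a new immutable, timestamped version; never overwrite.

    Note: second-granularity timestamps can collide; a production
    system would include a unique run ID in the file name.
    """
    root.mkdir(parents=True, exist_ok=True)
    version = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    path = root / f"decisions_{version}.json"
    path.write_text(json.dumps(records))
    return path  # the timestamped file itself is the audit trail

p = write_versioned([{"id": 1, "approved": True}])
print(p.name)  # e.g. decisions_20240101T000000.json
```

Because prior versions are never modified, an unexpected change shows up as a new file rather than a silent overwrite – exactly the trail that governance and memorialization require.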

Data management is a large undertaking that requires discipline and a clearly communicated partnership between producers and consumers of data. Still, the benefit of a well-defined system can have significant impact on the insights generated, far exceeding initial or ongoing investments. And given the rate at which the best players are outpacing late adopters in this space, it’s becoming less of an option and more of a mandate.


Putting It Together

When designing a data ecosystem that delivers depth, breadth, power, control and usability, the choice to pair data lakes with a warehousing approach is clear – and frameworks that achieve this cohesively at the enterprise level are far superior to those that let business units implement their own.

The openness of data lakes makes them ideal for storing unstructured or uncurated data that has yet to be mined. Their relative control, especially in robust cloud-based applications, is also ideal for production inputs and outputs of critical datasets. Lastly, they tend to be far more affordable for storing big data – even considering the cost of input/output, which is often a hidden cost of cloud storage.

Then, by feeding data from the lake into warehouses, analysts have access to a critical mass of impactful, granular, explorable data, providing an analytical layer with predefined schemas and easy query tools like SQL. This extracts analysts from the considerable technology and data noise behind the simplified query screen they’re interacting with, accommodating less technical analysts while bolstering their talent and business intuition.
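A hedged sketch of that lake-to-warehouse feed is below, again using sqlite3 as a stand-in for the warehouse; the records, table name and schema are illustrative assumptions:

```python
import json
import sqlite3

# Stand-in for reading a raw extract from the lake.
lake_records = json.loads(
    '[{"customer": "A-100", "balance": 2500.0},'
    ' {"customer": "B-200", "balance": 900.0}]'
)

# Stand-in for the warehouse: promote the extract into a table
# with a predefined schema that analysts can query directly.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE balances (customer TEXT PRIMARY KEY, balance REAL)")
wh.executemany(
    "INSERT INTO balances VALUES (:customer, :balance)", lake_records
)

# Analysts then work in plain SQL, never touching the raw files.
total = wh.execute("SELECT SUM(balance) FROM balances").fetchone()[0]
print(total)  # 3400.0
```

The design point is the boundary: the messy mechanics of extraction live on the lake side, while the analyst sees only a stable, curated table.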

Impactful data ecosystems take this a step further, including powerful tools for data movement – some exposed to key users and others reliably executing behind the scenes. This additional element requires organizational alignment around the role and importance of metadata and registration of data usages to ensure clarity for all involved. But the benefit of leveraging analysis-ready datasets, with clear and stable business logic applied to data in a well-controlled way, can give performance gains to even the largest of repeated use cases and sidestep the dependence on data experts to explain how transformations and derived variables were implemented.


Lastly, make sure to build data quality monitoring into the platform, with flexible reporting available to all users, which ensures that data is managed, at the source, in a consistent and robust way. This not only reduces duplicative effort across departments, but also provides users with a fair idea of the quality and reliability of the data they are about to use, while still affording maximum freedom in exploration and insight generation.
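As a minimal illustration of source-level quality monitoring, the checks below compute a row count and missing-value rates for required fields; the field names, thresholds and report shape are assumptions, and a real platform would run such checks automatically and publish the results to all users:

```python
# Minimal data-quality sketch: checks and field names are illustrative.
def quality_report(rows, required_fields):
    report = {"row_count": len(rows)}
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) in (None, ""))
        report[f"{field}_missing_rate"] = (
            missing / len(rows) if rows else 0.0
        )
    return report

rows = [
    {"id": 1, "score": 720},
    {"id": 2, "score": None},
    {"id": 3, "score": 680},
]
report = quality_report(rows, ["id", "score"])
print(report)
# {'row_count': 3, 'id_missing_rate': 0.0,
#  'score_missing_rate': 0.3333333333333333}
```

Publishing a report like this alongside each dataset gives users that “fair idea of quality and reliability” before they invest time in analysis.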

Setting up a thoughtful data ecosystem is a cornerstone of deriving meaningful insights from your data. Most organizations know this. But, when done well, these powerful, far-reaching systems can be a competitive advantage – one that financial services firms need now more than ever.

Click here to contact us and learn more about how we might approach big data solutions for your organization.