Modern businesses have vast, diverse data that they want to make use of in as many ways as possible, including for analytics. A data lake can serve as a single repository for multiple data-driven projects.
Understanding data lakes
A data lake is a centralized repository for hosting raw, unprocessed enterprise data. Data lakes can encompass hundreds of terabytes or even petabytes, storing replicated data from operational sources, including databases and SaaS platforms. They make unedited and unsummarized data available to any authorized stakeholder. Thanks to their potentially large (and growing) size and the need for global accessibility, they are often implemented in cloud-based, distributed storage.
How does the data get into a data lake? Stakeholders, who may be business managers or data analytics professionals, begin by identifying important or interesting data sources. They then replicate the data from these sources to the data lake with few if any structural, organizational, or formatting transformations. Replicating the raw data allows businesses to simplify the data ingestion process while creating an integrated source of truth for uses such as data analytics or machine learning.
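As a rough illustration of replication with few if any transformations, the sketch below writes source records unchanged into a date-partitioned folder. It is a minimal sketch, not a description of any particular tool: the local filesystem stands in for cloud object storage, and the `ingest_raw` function, the `raw/<source>/<date>/` layout, and the `part-0000.jsonl` file name are all assumptions made for this example.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, records: list[dict], day: date) -> Path:
    """Write records unchanged as JSON Lines under a source/date partition."""
    # Hypothetical layout: <lake_root>/raw/<source>/<YYYY-MM-DD>/part-0000.jsonl
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as fh:
        for record in records:
            # No renaming, casting, or filtering: each record is stored as received.
            fh.write(json.dumps(record) + "\n")
    return target


if __name__ == "__main__":
    # A temporary directory plays the role of the lake for this demo.
    with tempfile.TemporaryDirectory() as tmp:
        path = ingest_raw(Path(tmp), "crm", [{"id": 1, "plan": "pro"}], date(2024, 1, 15))
        print(path.read_text(encoding="utf-8"))
```

Because nothing is reshaped on the way in, records with different fields or nesting can share the same file, and any structure is left for consumers to impose later.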
Data stored in a lake can be anything, from completely unstructured data like text documents or images, to semistructured data such as hierarchical web content, to the rigidly structured rows and columns of relational databases. This flexibility means that enterprises can upload anything from raw data to fully aggregated analytical results.
The important point is that a data lake provides a single place to save and access valuable enterprise data. Without a good data lake, stakeholders who would benefit from that data have to work harder to find and use it.
Agnostic, scalable data lake architecture
Data lake architecture satisfies the need for massive, fast, secure, and accessible storage. At the core of this architecture lies a storage layer designed for durability (protecting data from corruption or loss and keeping it continuously available) and scalability (accommodating data whose volume is large and changes unpredictably).
This storage layer must be agnostic to data types and structures, capable of keeping any kind of object in a single repository. This implies that data lake architecture is independent of data models, so that diverse schemas may be applied when the data is consumed, rather than when it’s stored.
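Applying a schema at consumption time is often called schema-on-read. The sketch below shows the idea under simple assumptions: the raw events, the two schemas, and the `read_with_schema` helper are hypothetical, and a schema here is just a mapping from field name to a casting function.

```python
import json

# Raw events as they might sit in the lake: every value is still an untyped string.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "country": "BR", "ts": "2024-01-15T10:00:00"}',
    '{"user": "ben", "amount": "5.00", "country": "US", "ts": "2024-01-15T11:30:00"}',
]

# Two hypothetical read-time schemas: field name -> function that casts the raw value.
FINANCE_SCHEMA = {"user": str, "amount": float}
GEO_SCHEMA = {"user": str, "country": str}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and keep only the fields in `schema`, cast on read."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


finance_view = read_with_schema(RAW_EVENTS, FINANCE_SCHEMA)  # amounts become floats
geo_view = read_with_schema(RAW_EVENTS, GEO_SCHEMA)          # amounts are ignored
```

The stored data never changes; two consumers simply project and type the same raw records differently when they read them.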
A critical component of data lake architecture is its separation of storage from computation. Data lakes are highly abstracted repositories: their architectural requirements concern only the provisioning of, and access to, storage space. Processing and analytics layers are built on top.
Cloud platforms, with their intrinsic scalability and highly modular services, make the best hosts for data lakes. Storage services like Amazon S3 are engineered with the characteristics that make a good data lake, with abstracted, durable, flexible, and data-agnostic architectures.
Important data lake characteristics
Beyond its core architecture, a data lake must also include some key features:
- Diverse interfaces, APIs, and endpoints for uploading, accessing, and moving data. These are important because they support the data lake’s extreme variety of possible use cases.
- Sophisticated access control mechanisms. Data owners must be able to set permissions for keeping data secure and private when and where it needs to be. Access control, encryption, and network security features are critical for data governance.
- Search and cataloguing features. Without generic methods for organizing and locating huge amounts of diverse data, data lakes fail to be maximally available and useful. These features might include optimized key-value storage, metadata, tagging, or tools for collecting and classifying subsets of all objects.
- Support for the construction of or connection to processing and analytics layers. Analysts, data scientists, machine learning engineers, and decision-makers all derive the greatest benefit from centralized, fully available data, so the lake must support their various processing, transformation, aggregation, and analytical needs.
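To make the search and cataloguing feature more concrete, here is a minimal in-memory sketch of a catalog that records a source and a set of tags for each stored object and lets users filter on them. The `Catalog` and `CatalogEntry` classes and the object keys are hypothetical; a real lake would typically rely on the metadata and tagging features of its storage platform or on a dedicated catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    key: str      # object key in the lake
    source: str   # originating system
    tags: set[str] = field(default_factory=set)


class Catalog:
    """Toy metadata index: maps object keys to descriptive metadata."""

    def __init__(self):
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, key, source, tags=()):
        # Registering the same key again overwrites its metadata.
        self._entries[key] = CatalogEntry(key, source, set(tags))

    def find(self, *, source=None, tag=None):
        """Return keys matching all given filters, sorted for stable output."""
        return sorted(
            entry.key
            for entry in self._entries.values()
            if (source is None or entry.source == source)
            and (tag is None or tag in entry.tags)
        )


catalog = Catalog()
catalog.register("raw/crm/2024-01-15/part-0000.jsonl", "crm", ["customers", "pii"])
catalog.register("raw/web/2024-01-15/clicks.jsonl", "web", ["events"])
pii_objects = catalog.find(tag="pii")
```

Even a simple index like this lets a stakeholder ask "which objects contain personal data?" without scanning the lake itself.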
Data lakes vs. data warehouses: What a data lake is not
Many businesses already use another kind of centralized repository: a data warehouse. They might wonder whether they need a data lake at all, or whether implementing one would replace existing analytics data storage. But data warehouses and data lakes are distinct kinds of repositories: They have different features and serve separate purposes, though they can be used together.
| Characteristics | Data warehouse | Data lake |
| --- | --- | --- |
| Data type | Data is processed before integration | Data is integrated in its raw and unstructured form |
| Use case | Data has a predetermined use case | Data does not have a predetermined use case |
| Users | Business users | Data scientists |
| Data quality | Data is curated and adheres to data governance practices | Data is more agile and does not necessarily comply with governance guidelines |
Contrasting designs and data structure
At a high level, data lakes and warehouses fulfill different goals and are based on contrasting philosophies. Data warehouses are intended as stable platforms for complex analytical queries. They are structured by default, so they can power technologies like online analytical processing (OLAP), with a focus on resolving queries efficiently. This all means that data is modeled first, then integrated into the data warehouse.
The data lake flips this paradigm: modeling and schemas are applied when users consume the stored, raw data. This allows data to be uploaded more easily, and provides users with the flexibility to run different types of analytics to uncover a range of insights. The efficiency and speed of a data lake’s analytics depend on the technologies used rather than on the lake’s architecture or design.
Undetermined use cases vs. particular purpose
The term “data lake” is used to describe centralized but flexible and unstructured cloud storage. A data lake can act as a reservoir for backed-up or archival data, but more importantly, it can be a platform for self-service analytics. A data lake allows information to be loaded into storage without a predetermined purpose.
Meanwhile, data warehouses answer a specific business requirement or user need. They are designed from the ground up to solve this particular issue, with little room for adaptability or analytical improvisation later.
Different users and accessibility
Data lakes contain raw data and cater to users across the entire enterprise, though often more technically specialized users will garner the most value. Meanwhile, data warehouses contain more processed data, anticipating a business-focused user base and business intelligence applications.
Data scientists, with expert knowledge in working with large volumes of unstructured data, are the primary users of data lakes. However, less specialized users can also interact with unstructured data thanks to the emergence of self-service data preparation tools. A data lake empowers both advanced users working on data discovery or asking hypothetical questions, and anyone needing a source of truth and access to unprocessed data for reference or validation.
Business analysts and less technically proficient decision-makers, by contrast, can more readily use preprocessed data, such as that present in data warehouses. Data from warehouses is accessed by BI tools and becomes daily or weekly reporting, charts in presentations, or simple aggregations in spreadsheets presented to executives.
Agility and analytics vs. data quality
Both data lakes and data warehouses facilitate analytics; the difference is that in the warehouse processed data has a predetermined use case, whereas in data lakes its purpose might be pending.
While the raw data in data lakes is malleable, which is ideal for agile analysis and machine learning, its unstructured nature means less strict adherence to data governance practices. In a data warehouse, the business processes used to assemble and manage the system ensure high-quality data and compliance with data governance standards.
Data lake benefits
Data lakes are best for businesses that need to make large amounts of data available to stakeholders with varied skills and needs. Within this context, they provide many benefits.
- Resource reduction: Being able to store any kind of data means resource savings at no loss of value. In traditional systems, engineers and designers put effort into fitting everything together under one model. Data going unused represents time wasted on unnecessary processing. In a data lake, resources are only expended if and when information is consumed.
- Organization-wide accessibility: Data lakes provide a way around rigid silos and bureaucratic boundaries between business processes. Every stakeholder is empowered to access any and all enterprise data if they have the proper privileges.
- Performance efficiency: Data lakes do not require data to be defined by schemas. As a result, use of a data lake leads to simpler data pipelines and faster design and planning processes.
Data lake challenges
The main danger when building a data lake is that bad planning or management can transform the repository into a data swamp instead. A data swamp is a data lake with degraded value, whether due to design mistakes, stale data, or uninformed users and lack of regular access. Businesses implementing a data lake should anticipate several important challenges if they wish to avoid being left with a data swamp.
- Set business priorities: Assuming that any kind of data will eventually provide value and throwing everything into storage is not good practice. Organizations should assess their priorities, then get a general sense for what data is useful to store, and finally anticipate how the business might evolve and what that means for the contents of a potential data lake.
- Designate use cases and end users: Data should be accurate and fit for a purpose, but also suited to the people manipulating it. Data inconsistent with the tools and skills available to its consumers serves little purpose.
- Commit to good communication: A data lake cannot be opaque storage. Before implementation, businesses must commit to good communication in order to maintain focus and ensure important stakeholders are aware of how and why to use the data in a data lake. Though data lakes generally benefit from ingestion without modeling, that doesn’t mean they shouldn’t be documented. Users who know where to look for details regarding the provenance and contents of stored data are better prepared to act on that data.
- Establish a robust data ingestion process: Focus on analytics can lead to deemphasizing ingestion. Data lakes require fast, accurate ingestion, as getting uncorrupted raw data into storage is a primary focus. This step might seem easy where data lakes are concerned, but without a robust data ingestion step, the lake will fail.
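One common safeguard for getting uncorrupted raw data into storage is to compare a checksum computed before the write with one computed from what was actually stored. The sketch below assumes a local file as a stand-in for a lake object; the `ingest_verified` function and the example path are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def ingest_verified(payload: bytes, target: Path) -> str:
    """Write payload to target, read it back, and fail loudly on any mismatch."""
    expected = sha256_of(payload)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    actual = sha256_of(target.read_bytes())
    if actual != expected:
        # Remove the bad copy so consumers never see corrupted data.
        target.unlink()
        raise OSError(f"checksum mismatch for {target}: {actual} != {expected}")
    return expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        digest = ingest_verified(b'{"id": 1}\n', Path(tmp) / "raw" / "crm" / "part-0000.jsonl")
        print(digest)
```

Storing the returned digest alongside the object also lets later readers confirm that the data has not changed since ingestion.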
Taking your data lake skyward with Stitch
Data lakes are agile, multipurpose, and contain unstructured data for often undetermined use cases. Distributed storage in the cloud is the ideal platform for such a system, since cloud storage shares many of the architectural traits that characterize a data lake. To save on on-premises hardware and in-house resources, businesses building centralized online storage should consider cloud platforms first.
Stitch can replicate data to your Amazon S3 data lake. With reliable, scalable key-based object storage and a deep feature set, S3 is well suited for deploying vast online storage.
A data lake may serve as a foundational step when seeking more advanced and agile analytics. Try Stitch for free today and access dozens of data connectors that make it easy to load diverse enterprise data into a data lake.