Modern businesses hold vast, diverse data and want to use it in as many ways as possible, including analytics. A data lake can serve as a single repository for multiple data-driven projects.
A data lake is a centralized repository for hosting raw, unprocessed enterprise data. Data lakes can encompass hundreds of terabytes or even petabytes, storing replicated data from operational sources, including databases and SaaS platforms. They make unedited and unsummarized data available to any authorized stakeholder. Thanks to their potentially large (and growing) size and the need for global accessibility, they are often implemented in cloud-based, distributed storage.
How does the data get into a data lake? Stakeholders, who may be business managers or data analytics professionals, begin by identifying important or interesting data sources. They then replicate the data from these sources to the data lake with few if any structural, organizational, or formatting transformations. Replicating the raw data allows businesses to simplify the data ingestion process while creating an integrated source of truth for uses such as data analytics or machine learning.
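As a minimal sketch of this kind of untransformed replication, the following Python copies a source extract into a lake directory byte for byte. The local directory standing in for cloud storage, the `raw/<source>/<date>/<name>` layout, and the names `replicate_raw` and `billing_db` are all assumptions made for illustration, not part of any particular product.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def replicate_raw(lake_root: Path, source: str, name: str,
                  payload: bytes, day: date) -> Path:
    """Copy a source extract into the lake unchanged (no parsing, no reshaping)."""
    # Hypothetical layout: raw/<source>/<YYYY-MM-DD>/<name>
    target = lake_root / "raw" / source / day.isoformat() / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def checksum(data: bytes) -> str:
    """SHA-256 digest used to confirm the stored copy matches the source."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    extract = b"id,amount\n1,9.99\n2,14.50\n"
    with tempfile.TemporaryDirectory() as tmp:
        stored = replicate_raw(Path(tmp), "billing_db", "invoices.csv",
                               extract, date(2024, 1, 15))
        # The lake copy should be identical to what the source produced.
        assert checksum(stored.read_bytes()) == checksum(extract)
```

Because nothing is parsed or reshaped on the way in, the ingestion step stays the same regardless of what the source contains.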
Data stored in a lake can be anything, from completely unstructured data like text documents or images, to semistructured data such as hierarchical web content, to the rigidly structured rows and columns of relational databases. This flexibility means that enterprises can upload anything from raw data to fully aggregated analytical results.
The important point is that a data lake provides a single place to save and access valuable enterprise data. Without a good data lake, stakeholders who would benefit from data have to spend more effort finding and accessing it.
Data lake architecture satisfies the need for massive, fast, secure, and accessible storage. At the core of this architecture lies a storage layer designed for durability (protecting data from corruption or loss and keeping it available) and scalability (accommodating large volumes of data that grow and change unpredictably).
This storage layer must be agnostic to data types and structures, capable of keeping any kind of object in a single repository. This implies that data lake architecture is independent of data models, so that diverse schemas may be applied when the data is consumed, rather than when it's stored.
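Applying the schema at consumption time can be illustrated with a short Python sketch. It assumes the lake holds raw JSON lines; the sample records, field names, and the `read_with_schema` helper are hypothetical and exist only to show two consumers reading the same stored data through different schemas.

```python
import json

# Raw events as they might sit in the lake: one JSON document per line, untouched.
RAW_EVENTS = "\n".join([
    '{"user": "ana", "amount": "19.90", "country": "PT", "ts": "2024-01-03T10:00:00"}',
    '{"user": "raj", "amount": "5.00", "country": "IN", "ts": "2024-01-03T11:30:00"}',
    '{"user": "ana", "amount": "7.10", "country": "PT", "ts": "2024-01-04T09:15:00"}',
])


def read_with_schema(raw: str, schema: dict) -> list:
    """Project each raw record onto the schema's fields, casting values on read."""
    rows = []
    for line in raw.splitlines():
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two consumers, two schemas, the same stored data.
finance_schema = {"user": str, "amount": float}
geo_schema = {"country": str, "ts": str}

finance_rows = read_with_schema(RAW_EVENTS, finance_schema)
geo_rows = read_with_schema(RAW_EVENTS, geo_schema)

# A finance-style aggregation that only makes sense under the finance schema.
revenue_by_user = {}
for row in finance_rows:
    revenue_by_user[row["user"]] = revenue_by_user.get(row["user"], 0.0) + row["amount"]
```

Neither schema was known when the events were stored, and adding a third consumer would not require touching the stored data.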
A critical component of data lake architecture is its separation of storage from computation. Data lakes are the most highly abstracted repositories available, and their architectural requirements purely concern the provisioning and access of storage space. Processing and analytics layers are built on top.
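One way to picture this separation is a storage interface that only knows how to put, get, and list objects, with processing written as ordinary functions against that interface. The sketch below is illustrative: `ObjectStore`, `InMemoryStore`, and `count_lines` are invented names, and the in-memory dictionary merely stands in for a real storage service.

```python
from typing import Dict, List, Protocol


class ObjectStore(Protocol):
    """Storage layer: provisioning and access only, no processing logic."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list_keys(self, prefix: str) -> List[str]: ...


class InMemoryStore:
    """Stand-in for a cloud object store, keyed by string."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def count_lines(store: ObjectStore, prefix: str) -> int:
    """Compute layer: a job built on top that depends only on the storage interface."""
    return sum(len(store.get(key).splitlines()) for key in store.list_keys(prefix))
```

Swapping `InMemoryStore` for a different backend would leave `count_lines` unchanged, which is the practical payoff of keeping the two layers apart.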
Cloud platforms, with their intrinsic scalability and highly modular services, make the best hosts for data lakes. Storage services like Amazon S3 are engineered with the characteristics that make a good data lake, with abstracted, durable, flexible, and data-agnostic architectures.
Beyond its core architecture, a data lake must also include some key features:
Many businesses already use another kind of centralized repository: a data warehouse. They might wonder whether they need a data lake at all, or whether implementing one would replace existing analytics data storage. But data warehouses and data lakes are distinct kinds of repositories: They have different features and serve separate purposes, though they can be used together.
|Data warehouse|Data lake|
|---|---|
|Data is processed before integration|Data is integrated in its raw and unstructured form|
|Data has a predetermined use case|Data does not have a predetermined use case|
|Data is curated and adheres to data governance practices|Data is more agile and does not necessarily comply with governance guidelines|
At a high level, data lakes and warehouses fulfill different goals and are based on contrasting philosophies. Data warehouses are intended as stable platforms for complex analytical queries. They are structured by default, so they can power technologies like online analytical processing (OLAP), with a focus on resolving queries efficiently. This means that data is modeled first, then integrated into the data warehouse.
The data lake flips this paradigm: modeling and schemas are applied when users consume the stored, raw data. This allows data to be uploaded more easily, and provides users with the flexibility to run different types of analytics to uncover a range of insights. The efficiency and speed of a data lake's analytics depend more on the technologies used than on the data lake's architecture or design.
The term "data lake" is used to describe centralized but flexible and unstructured cloud storage. A data lake can act as a reservoir for backed-up or archival data, but more importantly, it can be a platform for self-service analytics. A data lake allows information to be loaded into storage without a predetermined purpose.
Meanwhile, data warehouses answer a specific business requirement or user need. They are designed from the ground up to solve this particular issue, with little room for adaptability or analytical improvisation later.
Data lakes contain raw data and cater to users across the entire enterprise, though often more technically specialized users will garner the most value. Meanwhile, data warehouses contain more processed data, anticipating a business-focused user base and business intelligence applications.
Data scientists, with expert knowledge in working with large volumes of unstructured data, are the primary users of data lakes. However, less specialized users can also interact with unstructured data thanks to the emergence of self-service data preparation tools. A data lake empowers both advanced users working on data discovery or asking hypothetical questions, and anyone needing a source of truth and access to unprocessed data for reference or validation.
Meanwhile, business analysts and less technically proficient decision-makers can more readily use preprocessed data, such as that present in data warehouses. Data from warehouses is accessed by BI tools and becomes daily or weekly reporting, charts in presentations, or simple aggregations in spreadsheets presented to executives.
Both data lakes and data warehouses facilitate analytics; the difference is that processed data in a warehouse has a predetermined use case, whereas the purpose of data in a lake may still be undecided.
While the raw data in data lakes is malleable, which is ideal for agile analysis and machine learning, its unstructured nature means less strict adherence to data governance practices. In a data warehouse, the business processes used to assemble and manage the system ensure high-quality data and compliance with data governance standards.
Data lakes are best for businesses that need to make large amounts of data available to stakeholders with varied skills and needs. Within this context, they provide many benefits.
The main danger when building a data lake is that bad planning or management can transform the repository into a data swamp instead. A data swamp is a data lake with degraded value, whether due to design mistakes, stale data, or uninformed users and lack of regular access. Businesses implementing a data lake should anticipate several important challenges if they wish to avoid being left with a data swamp.
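Stale data and lack of regular access are the kind of warning signs that can be checked mechanically. The following is a rough sketch, assuming the business keeps a catalog that records when each object was last accessed; the `CatalogEntry` structure, the `find_stale` helper, and the 180-day threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CatalogEntry:
    key: str                 # where the object lives in the lake
    last_accessed: datetime  # most recent read recorded in the catalog


def find_stale(catalog: List[CatalogEntry], now: datetime,
               max_idle: timedelta = timedelta(days=180)) -> List[str]:
    """Return keys of objects nobody has read within the allowed idle period."""
    return [entry.key for entry in catalog if now - entry.last_accessed > max_idle]


if __name__ == "__main__":
    catalog = [
        CatalogEntry("raw/crm/contacts.json", datetime(2024, 5, 20)),
        CatalogEntry("raw/legacy/export.csv", datetime(2022, 1, 1)),
    ]
    # Only the legacy export has gone unread for longer than the threshold.
    print(find_stale(catalog, now=datetime(2024, 6, 1)))
```

A report like this does not fix a swamp on its own, but it gives owners a concrete list of data to review, refresh, or retire.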
Data lakes are agile, multipurpose, and contain unstructured data for often undetermined use cases. Distributed storage in the cloud is the ideal platform for such a system, since cloud storage shares many characteristic architectural traits of a data lake. For savings on on-premises hardware and in-house resources, businesses building centralized online storage should consider cloud platforms first.
Stitch can replicate data to your Amazon S3 data lake. With reliable, scalable key-based object storage and a deep feature set, S3 is well-suited for deploying vast online storage.
A data lake may serve as a foundational step when seeking more advanced and agile analytics. Try Stitch for free today and access dozens of data connectors that make it easy to load diverse enterprise data into a data lake.