“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed” (KDnuggets).

“A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video)” (Wikipedia).

“You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning — to guide better decisions” (AWS).

How does a data lake differ from a data warehouse?

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, where the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.

A data lake … stores relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data or schema is not defined when data is captured. This means you can store all of your data without careful design or the need to know what questions you might need answers for in the future. Different types of analytics on your data like SQL queries, big data analytics, full text search, real-time analytics, and machine learning can be used to uncover insights. — AWS
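The contrast above is often summarized as schema-on-write (warehouse) versus schema-on-read (lake). The following is a minimal sketch of that idea, not how any particular product works: an in-memory SQLite table stands in for the warehouse, a plain Python list of raw JSON strings stands in for the lake, and the event fields and function names (`load_warehouse`, `load_lake`, `query_lake`) are hypothetical, chosen only for illustration.

```python
import json
import sqlite3

# Hypothetical raw events from two sources; field names are illustrative only.
raw_events = [
    '{"source": "pos", "order_id": 1, "amount": 19.99}',
    '{"source": "mobile", "user": "a17", "event": "tap", "screen": "cart"}',
    '{"source": "pos", "order_id": 2, "amount": 5.00}',
]


def load_warehouse(events):
    """Schema-on-write: the table is defined first; only records that fit are loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    rejected = []
    for line in events:
        record = json.loads(line)
        if "order_id" in record and "amount" in record:
            conn.execute(
                "INSERT INTO orders VALUES (?, ?)",
                (record["order_id"], record["amount"]),
            )
        else:
            # The record does not match the predefined schema, so it is left out.
            rejected.append(record)
    return conn, rejected


def load_lake(events):
    """Schema-on-read: every record is kept as-is, as raw text."""
    return list(events)


def query_lake(lake, source):
    """Structure is applied only when a question is asked of the data."""
    return [r for r in map(json.loads, lake) if r.get("source") == source]


warehouse, rejected = load_warehouse(raw_events)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

lake = load_lake(raw_events)
taps = query_lake(lake, "mobile")
```

In this sketch the warehouse answers the order-total question quickly because the schema was fixed in advance, but the mobile event ends up in `rejected`. The lake keeps all three records, and the mobile event is still available when a new question comes up later.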
