The image above gives a bigger picture of what a data lake is. A data lake is a storage repository. How is it different from a database or a data warehouse? A data lake stores data in its raw format, meaning the same format in which it was acquired from the source. In a data warehouse or relational database, the data is structured according to the schema of the database.
James Dixon, the founder and CTO of Pentaho, has been credited with coming up with the term. This is how he describes a data lake:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
A data lake also processes data differently from a data warehouse. A data warehouse follows the concept of schema-on-write: before loading data into the warehouse, we need to give it shape and structure (ETL/modeling). A data lake, by contrast, follows the concept of schema-on-read: you store the data in raw form, then shape and structure it when you are ready to use it.
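To make the contrast concrete, here is a minimal sketch in plain Python. The sample events, the field names (`user`, `amount`, `coupon`), and the helper functions are all hypothetical and exist only for illustration; real warehouses and lakes sit on dedicated storage and query engines rather than in-memory lists.

```python
import json

# Hypothetical raw events, exactly as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "ts": "2017-09-01T10:00:00"}',
    '{"user": "bob", "amount": "5", "ts": "2017-09-01T10:05:00", "coupon": "X1"}',
]


def to_row(record):
    """Shape a parsed record into a fixed (user, amount) schema."""
    return (str(record["user"]), float(record["amount"]))


def warehouse_load(raw_lines):
    """Schema-on-write: parse and shape every record before storing it."""
    return [to_row(json.loads(line)) for line in raw_lines]


def lake_load(raw_lines):
    """Schema-on-read, step 1: store the raw lines untouched."""
    return list(raw_lines)


def lake_query(stored_lines, shape):
    """Schema-on-read, step 2: apply a schema only at query time."""
    return [shape(json.loads(line)) for line in stored_lines]


warehouse = warehouse_load(RAW_EVENTS)  # only (user, amount) survives
lake = lake_load(RAW_EVENTS)            # everything survives, unshaped

# Two consumers read the same raw data through different schemas.
totals = lake_query(lake, to_row)
coupons = lake_query(lake, lambda r: r.get("coupon"))
```

In this sketch the warehouse has already discarded the `coupon` field, because the schema chosen at load time did not include it. The lake can still answer a question about coupons later, at the cost of parsing and shaping the data on every read.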
The data lake's popularity stems from the belief that consolidating data eliminates the information silos created by independently managed collections of data, thereby increasing information use and sharing. Other cited benefits include:
- Coping with the 3Vs of big data generation – velocity, variety, and volume
- Storage of data in its native format with metadata tags; schemas and transformations are applied only when queries are made by other users or systems (“schema-on-read”)
- Users and apps can interpret the data as they choose
- Lower costs through server and license reduction, cheap scalability, flexibility for use with future systems, and the ability to keep the data until you have a use for it
Data lakes also have drawbacks. For a start, they lack semantic consistency and governed metadata, which raises the level of skill users need to find the data they want for manipulation and analysis. Other risks include:
- Indiscriminate data hoarding, leading to stale data
- Different user/app interpretations of data may conflict
- Without initial checks, corrupt data may be ingested and used, before the problem is recognized
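The last risk can be reduced with a lightweight check at ingestion time. The following is a minimal sketch, not a prescribed practice: the function name `split_on_ingest`, the required field names, and the quarantine approach are all assumptions made for illustration. Records are still stored raw; the check only separates lines that cannot be parsed or that lack expected fields.

```python
import json


def split_on_ingest(raw_lines, required_fields=("user", "amount")):
    """Separate parseable, complete records from suspect ones.

    Accepted lines are kept exactly as received (still raw).
    Suspect lines go to a quarantine list for later inspection.
    """
    accepted, quarantined = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not valid JSON at all: likely truncated or corrupt.
            quarantined.append(line)
            continue
        if isinstance(record, dict) and all(f in record for f in required_fields):
            accepted.append(line)
        else:
            # Valid JSON, but missing fields the consumers rely on.
            quarantined.append(line)
    return accepted, quarantined


# Hypothetical incoming batch with one truncated and one incomplete record.
incoming = [
    '{"user": "alice", "amount": "19.99"}',
    '{"user": "bob", "amount": ',
    '{"user": "carol"}',
]
accepted, quarantined = split_on_ingest(incoming)
```

Quarantining rather than dropping suspect lines keeps the lake's "store everything" property while making the problem visible before downstream users rely on the data.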
The entire point of a data lake is that it pulls in any data with no governance. With no restrictions on the cleanliness of the data, there is a real possibility that it will eventually turn into a data swamp.
Source: Gartner (September 2017)
From the above diagram, we can see that data lakes are on the border of the ‘Trough of Disillusionment’ (interest wanes as experiments and implementations fail to deliver; producers of the technology shake out or fail; investments continue only if the surviving providers improve their products to the satisfaction of early adopters).
Is the Hype Real?