Data Lake & the hype

[Image: data lake (DL)]

The image above gives the bigger picture of what a data lake is. A data lake is a storage repository. How is it different from a database or a data warehouse? A data lake stores data in its raw format, that is, the same format in which it was acquired from the source. In a data warehouse or relational database, the data is structured according to the schema of the database.

James Dixon, the founder and CTO of Pentaho, has been credited with coming up with the term. This is how he describes a data lake:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”

When it comes to processing, a data lake handles data differently from a data warehouse. A data warehouse follows the concept of schema-on-write: before loading the data into the warehouse, we need to give it shape and structure (ETL/modeling). A data lake follows the concept of schema-on-read: you store the data in raw form and shape and structure it only when you are ready to use it.
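To make the difference concrete, here is a minimal sketch in Python. The sample events, field names, and function names are all made up for illustration; real warehouses and lakes do this with far more machinery.

```python
import json

# Raw events exactly as they arrived from a hypothetical source system.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "ts": "2017-09-01"}',
    '{"user": "bo", "amount": "7", "note": "gift"}',
    'not valid json',
]

def load_schema_on_write(raw_lines):
    """Warehouse style: validate and shape each record before storing it."""
    table, rejected = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
            row = {
                "user": str(rec["user"]),
                "amount": float(rec["amount"]),
                "ts": str(rec["ts"]),
            }
        except (ValueError, KeyError, TypeError):
            rejected.append(line)  # does not fit the schema, never stored
            continue
        table.append(row)
    return table, rejected

def load_schema_on_read(raw_lines):
    """Lake style: keep every line untouched."""
    return list(raw_lines)

def read_amounts(lake):
    """Apply a schema only at query time, skipping lines that do not fit."""
    amounts = []
    for line in lake:
        try:
            amounts.append(float(json.loads(line)["amount"]))
        except (ValueError, KeyError, TypeError):
            continue
    return amounts

table, rejected = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)
print(len(table), len(rejected))  # 1 row stored, 2 lines rejected up front
print(read_amounts(lake))         # [12.5, 7.0]
```

Notice that the lake keeps the second event (which has no timestamp) and can still answer a question about amounts, while the warehouse-style loader rejected it at load time.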

The Good

The data lake's popularity is down to the belief that consolidating data gets rid of the information silos created by independently managed collections of data, thereby increasing information use and sharing. Other benefits include:

  • Coping with the 3Vs of big data generation – velocity, variety, and volume
  • Storage of data in its native format with metadata tags; schemas and transformations are applied only when queries are made by users or systems (“schema-on-read”)
  • Users and apps can interpret the data as they choose
  • Lower costs through server and license reduction, cheap scalability, flexibility for use with future systems, and the ability to keep the data until you have a use for it

 

The Bad

For a start, data lakes lack semantic consistency and governed metadata, which raises the level of skill users need to find the data they want for manipulation and analysis. Other risks include:

  • Indiscriminate data hoarding, leading to stale data
  • Different user/app interpretations of data may conflict
  • Without initial checks, corrupt data may be ingested and used, before the problem is recognized
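As a rough illustration of what an "initial check" could look like, the sketch below records a little metadata for every incoming payload and sets aside anything that cannot be parsed. It assumes payloads are meant to be UTF-8 JSON; the `ingest` function, the catalog fields, and the source names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest(payload: bytes, source: str, catalog: list, quarantine: list) -> bool:
    """Record catalog metadata for a raw payload, or quarantine it if unparseable."""
    entry = {
        "source": source,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
    }
    try:
        # Assumption: a payload is "clean" if it decodes as UTF-8 JSON.
        json.loads(payload.decode("utf-8"))
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        quarantine.append(entry)
        return False
    catalog.append(entry)
    return True

catalog, quarantine = [], []
ingest(b'{"id": 1}', "crm", catalog, quarantine)
ingest(b'\xff\xfe not json', "legacy", catalog, quarantine)
print(len(catalog), len(quarantine))  # 1 1
```

Even a check this small gives later users a source, a timestamp, and a checksum to search on, and it stops the obviously corrupt payload from being used.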

The entire point of a data lake is that it pulls in any data with no governance. With no restrictions on the cleanliness of the data, there is a real risk that it will eventually turn into a data swamp.

Expectations

[Image: Gartner Hype Cycle]

Source: Gartner (September 2017)

From the diagram above, we can see that data lakes are on the border of the ‘Trough of Disillusionment’ (interest wanes as experiments and implementations fail to deliver; producers of the technology shake out or fail; investments continue only if the surviving providers improve their products to the satisfaction of early adopters).

Is the Hype Real?

NoSQL: Inception of a new SQL?

Call it Not-only-SQL or Non-SQL; it doesn’t matter. What matters is that the industry giants are using it (because NoSQL offers horizontal scalability), and if you want to get into these companies, you have to know NoSQL (say Yes-to-NoSQL). NoSQL can be scaled to the needs of modern applications. It differs from traditional SQL databases in the data models used and in how the data is stored. We can store graphs (e.g. Neo4j), key-value pairs (e.g. Oracle NoSQL), wide columns (as opposed to the row-wise data of traditional SQL, e.g. Cassandra), and documents (e.g. MongoDB). In short, we can store anything from structured to semi-structured or unstructured data, and we do not have to organize it in a tabular format.
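The four data models can be pictured with plain Python structures. This is only a sketch of the shapes involved, with invented sample data; it does not use or imitate the API of any of the products named above.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ana", "city": "Pune"}'}

# Document: a nested, self-describing record.
document = {
    "_id": 42,
    "name": "Ana",
    "address": {"city": "Pune"},
    "skills": ["sql", "python"],
}

# Wide column: row key -> column family -> columns.
# Rows in the same table may carry different columns.
wide_column = {
    "42": {"profile": {"name": "Ana", "city": "Pune"}},
    "43": {"profile": {"name": "Bo"}, "activity": {"last_login": "2017-09-01"}},
}

# Graph: nodes plus labeled edges between them.
graph = {
    "nodes": {42: {"name": "Ana"}, 43: {"name": "Bo"}},
    "edges": [(42, "FOLLOWS", 43)],
}

def followers_of(g, node_id):
    """Return the ids of nodes that have a FOLLOWS edge pointing at node_id."""
    return [src for src, label, dst in g["edges"]
            if label == "FOLLOWS" and dst == node_id]

print(followers_of(graph, 43))  # [42]
```

None of these shapes needs a fixed set of columns declared up front, which is what "not necessarily tabular" means in practice.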

Industry giants like LinkedIn, Facebook, and Google have data servers across the globe. The most important common characteristic of the data on these servers is that it is distributed (spread across servers around the world). It is difficult for a traditional SQL database to capture and process data across these distributed servers, because ad hoc joins across the network slow processing down. This is where the need for a distributed, non-relational database kicks in, and it is why the term NoSQL is so widely used these days.

Every coin has a flip side, and NoSQL has a downside as well. The horizontal scalability (the ability to add new servers easily) that NoSQL offers is gained by compromising on some, but not all, of the ACID properties: Atomicity, Consistency, Isolation, and Durability. The compromise depends on factors such as the type of application and the type of data stored.

A distributed system can have three major characteristics: Consistency (C), Availability (A), and Partition tolerance (P). Ideally, a system would have all three, but in practice we can have only two of the three (this is known as the CAP theorem). In a distributed system, P is mandatory: the nodes talk to each other over a network, and the system has to keep working when that network drops or delays messages. So, we have to make a choice between C and A. Consistency means that every read receives the most recent write, and Availability means that every request receives a response, even when some nodes are unreachable.

Many NoSQL databases use the AP model, where they compromise on consistency. They rely on a mechanism called eventual consistency, where data changes are propagated to all the nodes eventually but not immediately. In this case, a read may return not the latest data but previously stored (stale) data.
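Here is a toy simulation of eventual consistency, to show where the stale read comes from. The class names, the replica names, and the single global version counter are simplifying assumptions for illustration; real databases use more elaborate replication and conflict resolution.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

class EventuallyConsistentStore:
    """Writes land on one replica and reach the others only when sync() runs."""

    def __init__(self, replica_names):
        self.replicas = {n: Replica(n) for n in replica_names}
        self.pending = []  # queued updates: (target replica, key, value, version)
        self.version = 0   # assumption: one global counter orders all writes

    def write(self, replica_name, key, value):
        self.version += 1
        self.replicas[replica_name].data[key] = (self.version, value)
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, value, self.version))

    def read(self, replica_name, key):
        entry = self.replicas[replica_name].data.get(key)
        return None if entry is None else entry[1]

    def sync(self):
        """Deliver queued updates; the highest version wins on each replica."""
        for name, key, value, version in self.pending:
            current = self.replicas[name].data.get(key)
            if current is None or current[0] < version:
                self.replicas[name].data[key] = (version, value)
        self.pending.clear()

store = EventuallyConsistentStore(["eu", "us"])
store.write("eu", "price", 10)
store.sync()
store.write("eu", "price", 12)
print(store.read("eu", "price"))  # 12
print(store.read("us", "price"))  # 10, a stale read: the update has not arrived yet
store.sync()
print(store.read("us", "price"))  # 12, the replicas have converged
```

Both replicas answer every read (availability), but between the second write and the next `sync()` they disagree (no consistency). That window is the price an AP system pays.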

NoSQL databases differ from each other in the kind of data they are designed to store, and each has its own query syntax to match. They are primarily used for applications that need a flexible and scalable architecture, and they tend to offer lightweight, fast querying for the access patterns they are built around. NoSQL is preferred in applications that do not rely heavily on ACID properties.