A data lake is a file-based storage system that can hold both structured and unstructured data. Raw data from different sources can be stored in a data lake without defining a schema for it in advance.
Data lakes are flexible storage systems: users decide how they want to organize and store the data. The stored data is also straightforward to process, because many different processing engines can access it and can use parallel computing to work through it.
Azure Data Lake Storage is a repository for structured, semi-structured and unstructured data stored in its native format. The structure of the data is not defined when it is captured, so you can store everything without first defining a schema or the questions you want the data to answer.
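To make the "no structure upfront" idea concrete, here is a minimal sketch of applying a schema only at read time. It uses plain Python and an in-memory list in place of files in a lake; the record fields (`device`, `temp_c`, and so on) and the function name `read_temperatures` are hypothetical and chosen only for illustration.

```python
import json

# Raw events are stored exactly as the sources emit them; nothing is validated on write.
# (Hypothetical sample data standing in for files in the lake.)
raw_lines = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temp_c": "19.0", "site": "Berlin"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(lines):
    """Apply a schema at read time: keep records that have a device and a temperature."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "device" not in record or "temp_c" not in record:
            continue  # this record answers a different question; skip it
        # Normalize types only now, when a concrete question is being asked.
        rows.append({"device": str(record["device"]), "temp_c": float(record["temp_c"])})
    return rows


temperatures = read_temperatures(raw_lines)
```

The login event stays in storage untouched; a later reader with a different question can apply a different schema to the same raw lines.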
Azure Data Lake Storage Gen2 (ADLS Gen2) is built on top of Azure Blob Storage and combines the best features of Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Blob Storage. It is storage optimized for big data analytics and lets you manage vast amounts of data at low cost. We recommend Gen2 over Gen1 for any analytical use case.
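One visible consequence of Gen2 being layered on Blob Storage is that a storage account exposes both a Blob endpoint and a Data Lake (`dfs`) endpoint. The sketch below only builds the two address forms as strings; the account, container and path names are hypothetical, and the helper functions are illustrative rather than part of any Azure SDK.

```python
def blob_url(account: str, container: str, path: str) -> str:
    """HTTPS address of an object on the Blob Storage endpoint."""
    return f"https://{account}.blob.core.windows.net/{container}/{path.lstrip('/')}"


def abfss_uri(account: str, container: str, path: str) -> str:
    """ABFS (secure) URI used by analytics engines against the Data Lake endpoint."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
account, container, path = "examplelake", "raw", "sales/2024/orders.parquet"

blob_address = blob_url(account, container, path)
lake_address = abfss_uri(account, container, path)
```

Here `blob_address` is an ordinary HTTPS URL, while `lake_address` is the form typically passed to engines such as Spark when reading from the lake.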
Azure Blob Storage uses a flat namespace, although users can simulate a hierarchy (virtual directories) inside a container by using slashes in blob names. Azure Data Lake Storage Gen2 takes Azure Blob Storage as its base and extends it with a true hierarchical namespace. As a result, instead of listing all the objects in a container to find the ones you want to operate on (for example, to delete them), you get efficient data access and can perform a single operation.
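The difference can be illustrated with a small in-memory model. This is a conceptual sketch only, not how either service is implemented: a dictionary keyed by full blob name stands in for the flat namespace, and nested dictionaries stand in for real directories. All names and paths are hypothetical.

```python
# Flat namespace: each blob is one key; "directories" exist only as name prefixes.
flat_store = {
    "raw/2024/01/a.csv": b"a",
    "raw/2024/01/b.csv": b"b",
    "raw/2024/02/c.csv": b"c",
    "curated/summary.csv": b"s",
}


def delete_prefix_flat(store, prefix):
    """Delete a virtual directory: scan every key, then delete each match individually."""
    matches = [key for key in store if key.startswith(prefix)]
    for key in matches:
        del store[key]
    return len(matches)  # number of per-object delete operations


# Hierarchical namespace: directories are real nodes in a tree.
tree = {
    "raw": {"2024": {"01": {"a.csv": b"a", "b.csv": b"b"}, "02": {"c.csv": b"c"}}},
    "curated": {"summary.csv": b"s"},
}


def delete_dir_hierarchical(root, parent_path, name):
    """Delete a directory: unlink one entry in its parent, however many files it holds."""
    parent = root
    for part in parent_path:
        parent = parent[part]
    del parent[name]
    return 1  # a single directory-level operation


flat_ops = delete_prefix_flat(flat_store, "raw/2024/")
tree_ops = delete_dir_hierarchical(tree, ["raw"], "2024")
```

In this model the flat delete touches three objects after scanning all four keys, while the hierarchical delete is one operation. The gap grows with the number of files under the directory.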
Data Lake Storage Gen2 is recommended for big data analytics workloads. If your workload does not use the hierarchical namespace (HNS) and you only need general-purpose storage, plain Blob Storage without HNS is the better choice: transaction costs are higher, though still economical, when HNS is enabled.
Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) is storage optimized for big data analytics workloads, built as a hierarchical file system compatible with Apache Hadoop. Data can be stored in its native format and analyzed with Hadoop analytical frameworks such as MapReduce and Hive.
For all new workloads, we recommend starting with Azure Data Lake Storage Gen2, because it combines the best of Azure Data Lake Storage Gen1 and Azure Blob Storage.
Tek-Analytics has worked with many customers, across different industries and sizes, to map out their big data challenges and define solid, scalable solutions for processing and analysing their data. We have built up extensive knowledge in designing and setting up best-practice data lakes.