Mahesh's Knowledgebase

Big Data Overview: Concepts and Technologies

1. What is Big Data?

Definition: Data that is too large, too complex, or too fast-moving to be stored and processed with traditional single-machine tools.
The 3 Vs (Volume, Velocity, Variety), often extended to 5:
- Volume: Size of data (terabytes to petabytes).
- Velocity: Speed at which data is generated (real-time streams).
- Variety: Structured, semi-structured, and unstructured data.
- Veracity: Trustworthiness and quality of the data.
- Value: Business insight that can be extracted.

2. Hadoop Ecosystem

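HDFS (Hadoop Distributed File System): Storage layer. Files are split into blocks stored on DataNodes; the NameNode holds the metadata that maps files to blocks.
MapReduce: Processing layer. Phases: Split, Map, Shuffle, Reduce.
YARN (Yet Another Resource Negotiator): Cluster resource management and job scheduling.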
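
The four MapReduce phases can be sketched as a single-process word count. This is only an illustration in plain Python, not Hadoop's API: the function names (`split_input`, `map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this note, and a real job would run the map and reduce steps on different nodes.

```python
from collections import defaultdict
from itertools import chain


def split_input(text, n_splits):
    """Split: divide the input lines into n_splits chunks."""
    lines = text.splitlines()
    return [lines[i::n_splits] for i in range(n_splits)]


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, n_splits=2):
    """Chain the four phases; each split is mapped independently."""
    splits = split_input(text, n_splits)
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count("big data\nbig cluster\ndata lake"))
```

The shuffle step is the only one that needs to see output from every split, which is why it is the network-heavy phase on a real cluster.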

3. Spark Framework

Faster than MapReduce for many workloads: Intermediate results are kept in memory instead of being written to disk between stages.
RDDs (Resilient Distributed Datasets): Core low-level abstraction.
DataFrames & Datasets: Optimized structured API.
Spark SQL: Querying data with SQL.
Spark Streaming: Stream processing as a sequence of small batches (micro-batches).
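
The micro-batch idea can be shown without Spark at all. The sketch below is plain Python, not the Spark Streaming API; `micro_batches` and `process_stream` are hypothetical names, and the per-batch job is assumed to be a simple sum.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp, value) events into fixed-width batches.

    Batch k holds events with k * interval_s <= timestamp < (k + 1) * interval_s.
    Intervals that received no events produce no batch in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval_s)].append(value)
    return [batches[k] for k in sorted(batches)]


def process_stream(events, interval_s=1.0):
    """Run one small batch job (here: a sum) per interval."""
    return [sum(batch) for batch in micro_batches(events, interval_s)]


if __name__ == "__main__":
    events = [(0.1, 1), (0.7, 2), (1.2, 3), (2.5, 4), (2.9, 5)]
    print(process_stream(events, interval_s=1.0))
```

The batch interval sets the latency floor: a result for an event cannot be produced before the batch containing it closes.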

4. Modern Data Stack

Data Warehousing: Snowflake, Google BigQuery, Amazon Redshift (OLAP).
Data Lakes: S3, ADLS, GCS (Raw storage).
Lakehouse: Databricks (combining Warehouse and Lake).
File Formats: Parquet and ORC (columnar), Avro (row-based).
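
The difference between row-based and columnar layouts can be illustrated with in-memory Python structures. This is a toy model only: the sample records and the `to_columnar` helper are invented for this note, and real Parquet, ORC, and Avro files add encoding, compression, and schema metadata on top.

```python
# Row-based layout: each record is stored together (the Avro style).
rows = [
    {"id": 1, "country": "IN", "amount": 120.0},
    {"id": 2, "country": "US", "amount": 80.0},
    {"id": 3, "country": "IN", "amount": 50.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


# Columnar layout: each field is stored together (the Parquet/ORC style).
columns = to_columnar(rows)

# Row layout: every whole record is visited to total a single field.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
```

Analytical queries usually touch a few columns across many rows, which is why warehouses and lakes tend to favor columnar files, while row-based formats suit record-at-a-time writes such as event streams.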

5. Components Summary

5.1 Ingestion

Kafka / Kinesis: Event streaming.
Sqoop: RDBMS to Hadoop import/export.
Flume: Log aggregation.

5.2 Processing

Spark / Flink: Stream & Batch.
Hive / Presto / Trino: SQL engines over data in Hadoop and data lakes. Presto and Trino can also query other sources through connectors.

5.3 Workflows / Orchestration

Airflow: DAG-based scheduling.
Oozie: Legacy workflow scheduler for Hadoop jobs.
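
DAG-based scheduling comes down to running each task only after everything it depends on has finished. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) rather than Airflow itself; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
    "report": {"load"},
}


def execution_order(graph):
    """Return one valid run order in which dependencies come first."""
    return list(TopologicalSorter(graph).static_order())


def is_valid_dag(graph):
    """A workflow with a dependency cycle can never be scheduled."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(dag))
```

`validate` and `transform` have no dependency on each other, so a scheduler is free to run them in parallel; only their position relative to `extract` and `load` is fixed.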