Big Data Overview: Concepts and Technologies

1. What is Big Data?

  • Definition: Data that is too large, complex, or fast-moving to be stored and processed by traditional single-machine tools (for example, one relational database server).
  • The 3 Vs (and more):
    • Volume: Size of data (Terabytes, Petabytes).
    • Velocity: Speed of generation (Real-time streams).
    • Variety: Structured, Semi-structured, Unstructured.
    • Veracity: Trustworthiness and quality of the data.
    • Value: Business insight.

2. Hadoop Ecosystem

  • HDFS (Hadoop Distributed File System): Storage layer. Files are split into blocks; the NameNode tracks metadata and the DataNodes store the blocks.
  • MapReduce: Processing layer. Phases: Split, Map, Shuffle, Reduce.
  • YARN (Yet Another Resource Negotiator): Resource management and job scheduling.
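
The four MapReduce phases can be sketched in plain Python as a word count. This is a single-process illustration only, not the Hadoop API; the function names (split_input, map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real job would run mappers and reducers on different nodes.

```python
from collections import defaultdict
from itertools import chain

def split_input(text, num_splits):
    """Split: divide the input lines into chunks, one per (simulated) mapper."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in this split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key so each key's values end up together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(text, num_splits=2):
    splits = split_input(text, num_splits)
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return reduce_phase(shuffle(mapped))

counts = word_count("big data\nbig cluster\ndata data")
print(counts)
```

The shuffle step is the only one that needs to see output from every mapper, which is why it is typically the network-heavy part of a real job.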

3. Spark Framework

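  • Often faster than MapReduce: Keeps intermediate results in memory instead of writing them to disk between stages.
  • RDDs (Resilient Distributed Datasets): Core abstraction; an immutable, partitioned collection processed in parallel.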
  • DataFrames & Datasets: Optimized structured API.
  • Spark SQL: Querying data with SQL.
  • Spark Streaming: Micro-batch processing.
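
Micro-batch processing means cutting a continuous stream into small time windows and running an ordinary batch computation on each one. The sketch below shows the idea in plain Python; it does not use Spark, and micro_batches, process, and the sample events are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width time windows.

    Windows that receive no events are simply skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches[batch_id].append(value)
    return [batches[b] for b in sorted(batches)]

def process(batch):
    """The per-batch job; here just a sum over the batch's values."""
    return sum(batch)

# Timestamps are in seconds; with a 1-second interval this yields three batches.
events = [(0.2, 5), (0.9, 1), (1.1, 7), (2.5, 3), (2.7, 4)]
totals = [process(b) for b in micro_batches(events, batch_interval=1.0)]
print(totals)
```

A shorter interval lowers latency but produces more, smaller batches, so the per-batch overhead grows.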

4. Modern Data Stack

  • Data Warehousing: Snowflake, Google BigQuery, Amazon Redshift (OLAP).
  • Data Lakes: S3, ADLS, GCS (Raw storage).
  • Lakehouse: Databricks (combining Warehouse and Lake).
  • File Formats: Parquet and ORC (columnar), Avro (row-based).
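
The difference between row-based and columnar layouts can be shown with ordinary Python containers. This is a conceptual sketch only: it does not read or write Parquet, ORC, or Avro, and the sample records and helper names are invented for the example.

```python
records = [
    {"user": "a", "country": "DE", "amount": 10.0},
    {"user": "b", "country": "US", "amount": 25.5},
    {"user": "c", "country": "DE", "amount": 4.5},
]

def to_row_layout(records):
    """Row-based: all fields of one record are stored together."""
    return [tuple(r.values()) for r in records]

def to_columnar_layout(records):
    """Columnar: all values of one field are stored together."""
    return {name: [r[name] for r in records] for name in records[0]}

rows = to_row_layout(records)
columns = to_columnar_layout(records)

# An aggregate over one field reads a single list in the columnar layout...
total_columnar = sum(columns["amount"])
# ...but has to walk every full record in the row layout.
total_row = sum(row[2] for row in rows)

print(total_columnar, total_row)
```

This is why columnar formats tend to suit analytical (OLAP) queries that scan a few columns, while row-based formats suit writing or reading whole records at a time.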

5. Components Summary

5.1 Ingestion

  • Kafka / Kinesis: Event streaming.
  • Sqoop: RDBMS to Hadoop import/export.
  • Flume: Log aggregation.
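
Event streaming systems are commonly described as an append-only log that several consumers read at their own pace. The toy class below sketches that idea in memory; EventLog and its methods are hypothetical and are not the Kafka or Kinesis client API, and real systems add partitioning, persistence, and replication on top.

```python
class EventLog:
    """Toy append-only log with one read offset per consumer (in-memory only)."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> index of the next event to read

    def append(self, event):
        """Producer side: add an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer, max_events=10):
        """Consumer side: return the next events and advance this consumer's offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

log = EventLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

first = log.poll("analytics", max_events=2)   # analytics reads two events
second = log.poll("analytics")                # then continues where it stopped
audit = log.poll("audit")                     # a new consumer starts from offset 0
print(first, second, audit)
```

Because events are never removed when read, each consumer keeps an independent position and a new consumer can replay the log from the start.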

5.2 Processing

  • Spark / Flink: Stream and batch processing.
  • Hive / Presto / Trino: SQL engines for querying data in HDFS or object storage.

5.3 Workflows / Orchestration

  • Airflow: DAG-based scheduling.
  • Oozie: Legacy Hadoop workflow scheduler.
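
DAG-based scheduling means each task runs only after the tasks it depends on have finished. A minimal sketch using Python's standard-library graphlib (Python 3.9+) is shown here; it is not Airflow code, and the task names and the run_order helper are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "report": {"join"},
}

def run_order(dag):
    """Return one valid execution order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(dag).static_order())

order = run_order(dag)
print(order)
```

The two extract tasks have no dependency on each other, so a scheduler is free to run them in parallel; only their order relative to "join" is fixed.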