Big Data Overview: Concepts and Technologies
1. What is Big Data?
- Definition: Data that is too large, complex, or fast-moving to be processed by traditional methods.
- The 3 Vs (commonly extended to 5):
  - Volume: Size of data (Terabytes, Petabytes).
  - Velocity: Speed of generation (Real-time streams).
  - Variety: Structured, Semi-structured, Unstructured.
  - Veracity: Trustworthiness and quality of the data.
  - Value: Business insight gained from the data.
2. Hadoop Ecosystem
- HDFS (Hadoop Distributed File System): Storage layer. Blocks, NameNode, DataNode.
- MapReduce: Processing layer. Split, Map, Shuffle, Reduce.
- YARN: Resource management (Yet Another Resource Negotiator).
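
The Split, Map, Shuffle, Reduce stages listed above can be sketched as a word count in plain Python. This is only an illustration of the data flow on a single machine, not Hadoop's API; the function names (split_input, map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def split_input(text, num_splits):
    """Split: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_splits):
    """Shuffle: group all values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, num_splits=2):
    splits = split_input(text, num_splits)
    mapped = [map_phase(chunk) for chunk in splits]  # one mapper per split
    return reduce_phase(shuffle(mapped))


counts = word_count("big data is big\ndata moves fast\nbig wins")
print(counts)
```

In a real cluster each split would be processed on a different node and the shuffle would move data over the network; here everything runs in one process so the stages are easy to follow.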
3. Spark Framework
- Often faster than MapReduce: Keeps intermediate data in memory instead of writing it to disk between stages.
- RDDs (Resilient Distributed Datasets): Core abstraction.
- DataFrames & Datasets: Optimized structured API.
- Spark SQL: Querying data with SQL.
- Spark Streaming: Micro-batch processing.
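
Micro-batch processing, as mentioned for Spark Streaming, treats a stream as a sequence of small batches, one per time interval. The sketch below shows the idea in plain Python and does not use Spark; micro_batches, the sample events, and the 1-second interval are assumptions made for illustration, and grouping by event timestamp is a simplification.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time intervals."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Start of the interval this event falls into.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical events: (seconds since start, measured value).
events = [(0.2, 5), (0.9, 3), (1.1, 7), (2.5, 1), (2.7, 4)]

# Each batch is then processed as an ordinary small batch job, here a sum.
batch_sums = [(start, sum(values)) for start, values in micro_batches(events, 1)]
print(batch_sums)
```

A smaller interval lowers latency but adds per-batch overhead, which is the main trade-off of the micro-batch model.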
4. Modern Data Stack
- Data Warehousing: Snowflake, Google BigQuery, Amazon Redshift (OLAP).
- Data Lakes: S3, ADLS, GCS (Raw storage).
- Lakehouse: Databricks (combining Warehouse and Lake).
- File Formats: Parquet and ORC (Columnar), Avro (Row-based).
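
The difference between row-based and columnar layouts can be shown with plain Python structures. This is a conceptual sketch only and does not read or write Parquet, ORC, or Avro; the sample records and the to_columns helper are hypothetical.

```python
# Row-based layout: each record is stored together.
rows = [
    {"user": "ana", "country": "BR", "amount": 12.5},
    {"user": "bo", "country": "US", "amount": 7.0},
    {"user": "cy", "country": "US", "amount": 3.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


# Columnar layout: all values of one field are stored together.
columns = to_columns(rows)

# Row layout: every record is visited in full to read a single field.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Analytical queries that aggregate a few fields over many records tend to favor the columnar layout, while workloads that write or read whole records at a time tend to favor the row layout.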
5. Components Summary
5.1 Ingestion
- Kafka / Kinesis: Event streaming.
- Sqoop: Bulk import/export between relational databases (RDBMS) and Hadoop.
- Flume: Log aggregation.
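
Event streaming systems of the kind listed above are commonly modeled as a partitioned, append-only log in which each record gets an offset within its partition. The sketch below illustrates that model in plain Python; it is not the Kafka or Kinesis client API, PartitionedLog is a hypothetical class, and crc32 is only a stand-in for whatever hash a real system uses.

```python
import zlib


class PartitionedLog:
    """A toy in-memory log: records are appended, never modified."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record and return (partition, offset)."""
        # A stable hash sends every record with the same key to the
        # same partition, which preserves per-key ordering.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1

    def read(self, partition, offset):
        """Return all records in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
first_partition, first_offset = log.append("user-1", "login")
second_partition, second_offset = log.append("user-1", "click")
replayed = log.read(first_partition, first_offset)
print(replayed)
```

Because consumers track their own offsets, the same records can be re-read later or by several independent consumers.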
5.2 Processing
- Spark / Flink: Stream & Batch.
- Hive / Presto / Trino: SQL query engines over Hadoop and data lake storage.
5.3 Workflows / Orchestration
- Airflow: DAG-based scheduling.
- Oozie: Legacy Hadoop workflow scheduler.
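
DAG-based scheduling means each task runs only after the tasks it depends on have finished. The sketch below uses Python's standard graphlib module (Python 3.9+) to derive a valid run order; it is not Airflow's API, and the task names and the run_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "archive": {"clean"},
}


def run_order(dependencies):
    """Return one task order in which every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(dag)
print(order)
```

If the dependencies contain a cycle, TopologicalSorter raises graphlib.CycleError, which is why workflow definitions must be acyclic.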