Apache Spark is an open source parallel processing framework designed to run data analytics across clustered computers.

The project, a general-purpose engine for large-scale data processing, is maintained by the Apache Software Foundation.

Spark is designed to provide programmers with an application programming interface (API) centred on the resilient distributed dataset (RDD), a multiset of data items distributed over a cluster of machines.
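As a minimal sketch of the idea in Scala (the local master setting and the numbers are illustrative, not from the original text), an RDD is built from a collection, transformed lazily, and only computed when an action is called:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Illustrative local setup; on a real cluster the master URL would differ
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize distributes a local collection as an RDD across workers
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // map is a lazy transformation; the reduce action triggers computation
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares") // prints 55

    sc.stop()
  }
}
```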

Spark can run on Hadoop, on Mesos, standalone, or in the cloud, and can access numerous data sources such as HDFS, Cassandra, HBase, and S3.
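For file-based sources, switching storage systems is largely a matter of the URI scheme. A rough illustration in the Spark shell, where a SparkContext named sc is predefined (the hosts, buckets, and paths below are placeholders, and S3 access additionally requires the appropriate Hadoop connector on the classpath):

```scala
// Same textFile API, different storage backends (placeholder paths)
val fromHdfs  = sc.textFile("hdfs://namenode:9000/data/events.log")
val fromS3    = sc.textFile("s3a://my-bucket/data/events.log")
val fromLocal = sc.textFile("/tmp/events.log")
```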

Why has Spark become so popular?

The functionalities of Spark, combined with its speed and relative ease of use, have seen it become one of the hottest technologies in recent years. Companies such as IBM have aligned their analytics portfolios around the technology.

Spark Core is the foundation of the project; it provides distributed task dispatching, scheduling, and basic I/O functionality. Applications can be written against it in Java, Python, Scala, and R.
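To give a flavour of what this looks like in practice, here is the canonical word count in Scala; Spark Core schedules the stages and handles the I/O across the cluster, so the programmer manages neither (the input and output paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Spark Core dispatches and schedules these stages across the cluster
    val counts = sc.textFile("hdfs://namenode:9000/input/articles.txt") // placeholder path
      .flatMap(line => line.split("\\s+")) // split lines into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.saveAsTextFile("hdfs://namenode:9000/output/word-counts") // placeholder path
    sc.stop()
  }
}
```

The same program can be expressed in any of the supported languages; the Scala version is shown here because Spark itself is written in Scala.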