Skip to content

Latest commit

 

History

History
193 lines (162 loc) · 15.2 KB

README.md

File metadata and controls

193 lines (162 loc) · 15.2 KB

Apache Spark Data Connector

Spice can read data from a Spark instance. This guide will start Spark isntance with sample data, create an app, load and query a dataset.

Prerequisites

This recipe requires Docker and Docker Compose to be installed.

Also ensure that you have the spice CLI installed. You can find instructions on how to install it here.

How to run

Clone this cookbook repo locally and navigate to the spark directory:

git clone https://github.com/spiceai/cookbook.git
cd cookbook/spark
  1. Start the Docker Compose stack, which includes a Spark instance and init notebook to load the NYC taxi trip parquet data:
docker compose up -d

It will take about about 30 seconds to start the Spark instance and load the sample dataset. Check service logs to see when the Spark instance is ready:

docker compose logs -f spark
spark  | 25/01/14 14:11:12 INFO CodeGenerator: Code generated in 14.890809 ms
spark  | +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
spark  | |VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
spark  | +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
spark  | |       1| 2022-03-01 00:13:08|  2022-03-01 00:24:35|            1.0|          2.4|       1.0|                 N|          90|         209|           2|       10.0|  3.0|    0.5|       0.0|         0.0|                  0.3|        13.8|                 2.5|        0.0|
spark  | |       1| 2022-03-01 00:47:52|  2022-03-01 01:00:08|            1.0|          2.2|       1.0|                 N|         148|         234|           2|       10.5|  3.0|    0.5|       0.0|         0.0|                  0.3|        14.3|                 2.5|        0.0|
spark  | |       2| 2022-03-01 00:02:46|  2022-03-01 00:46:43|            1.0|        19.78|       2.0|                 N|         132|         249|           1|       52.0|  0.0|    0.5|     11.06|         0.0|                  0.3|       67.61|                 2.5|       1.25|
spark  | |       2| 2022-03-01 00:52:43|  2022-03-01 01:03:40|            2.0|         2.94|       1.0|                 N|         211|          66|           1|       11.0|  0.5|    0.5|      4.44|         0.0|                  0.3|       19.24|                 2.5|        0.0|
spark  | |       2| 2022-03-01 00:15:35|  2022-03-01 00:34:13|            1.0|         8.57|       1.0|                 N|         138|         197|           1|       25.0|  0.5|    0.5|      5.51|         0.0|                  0.3|       33.06|                 0.0|       1.25|
spark  | |       1| 2022-03-01 00:11:57|  2022-03-01 00:53:05|            2.0|         14.0|       1.0|                 N|         132|          33|           1|       43.5| 1.75|    0.5|       9.2|         0.0|                  0.3|       55.25|                 0.0|       1.25|
spark  | |       2| 2022-03-01 00:05:11|  2022-03-01 00:08:22|            1.0|         0.61|       1.0|                 N|         166|         151|           1|        4.5|  0.5|    0.5|       1.0|         0.0|                  0.3|         6.8|                 0.0|        0.0|
spark  | |       2| 2022-03-01 00:30:56|  2022-03-01 00:46:21|            1.0|         2.83|       1.0|                 N|          74|         238|           1|       13.0|  0.5|    0.5|       3.7|         0.0|                  0.3|        18.0|                 0.0|        0.0|
spark  | |       2| 2022-03-01 00:30:28|  2022-03-01 00:30:36|            1.0|          0.1|       1.0|                 N|         145|         145|           3|       -2.5| -0.5|   -0.5|       0.0|         0.0|                 -0.3|        -3.8|                 0.0|        0.0|
spark  | |       2| 2022-03-01 00:30:28|  2022-03-01 00:30:36|            1.0|          0.1|       1.0|                 N|         145|         145|           2|        2.5|  0.5|    0.5|       0.0|         0.0|                  0.3|         3.8|                 0.0|        0.0|
spark  | +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
spark  |
spark  | 25/01/14 14:11:12 INFO SparkContext: Invoking stop() from shutdown hook
spark  | 25/01/14 14:11:12 INFO SparkContext: SparkContext is stopping with exitCode 0.
spark  | 25/01/14 14:11:12 INFO SparkUI: Stopped Spark web UI at http://2556af46dcb0:4040
spark  | 25/01/14 14:11:12 INFO StandaloneSchedulerBackend: Shutting down all executors
spark  | 25/01/14 14:11:12 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
spark  | 25/01/14 14:11:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
spark  | 25/01/14 14:11:12 INFO MemoryStore: MemoryStore cleared
spark  | 25/01/14 14:11:12 INFO BlockManager: BlockManager stopped
spark  | 25/01/14 14:11:12 INFO BlockManagerMaster: BlockManagerMaster stopped
spark  | 25/01/14 14:11:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
spark  | 25/01/14 14:11:12 INFO SparkContext: Successfully stopped SparkContext
spark  | 25/01/14 14:11:12 INFO ShutdownHookManager: Shutdown hook called
spark  | 25/01/14 14:11:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-e026e8e3-f302-4adf-bdf3-de083138fa40
spark  | 25/01/14 14:11:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-314a37cb-843b-4e6a-a4a2-8e71944f85bf
spark  | 25/01/14 14:11:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-314a37cb-843b-4e6a-a4a2-8e71944f85bf/pyspark-ffa34bfd-aa50-4a7a-80f5-4b6dc6e0b0fb
spark  | Spark services started!
  1. Initialise a Spice app
spice init spark_demo
cd spark_demo
  1. Start the Spice runtime
spice run
2025/01/14 02:52:58 INFO Checking for latest Spice runtime release...
2025/01/14 02:52:59 INFO Spice.ai runtime starting...
2025-01-13T17:53:00.171638Z  INFO runtime::init::dataset: No datasets were configured. If this is unexpected, check the Spicepod configuration.
2025-01-13T17:53:00.181202Z  INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2025-01-13T17:53:00.181420Z  INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2025-01-13T17:53:00.185632Z  INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2025-01-13T17:53:00.185712Z  INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
2025-01-13T17:53:00.196585Z  INFO runtime::init::results_cache: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
  1. Configure a Spark dataset into the spicepod. Copy and paste the following spicepod.yaml configuration into your Spicepod.
version: v1
kind: Spicepod
name: spark_demo

datasets:
  - from: spark:nyc_taxis
    name: nyc_taxis
    params:
        spark_remote: sc://localhost:15002
  1. Confirm that the runtime has loaded the new table (in the original terminal)
2025-01-13T17:54:33.672161Z  INFO runtime::init::dataset: Dataset nyc_taxis registered (spark:nyc_taxis), results cache enabled.
  1. Check the table exists from the Spice REPL
spice sql
Welcome to the Spice.ai SQL REPL! Type 'help' for help.

show tables; -- list available tables
sql> show tables;
+---------------+--------------+--------------+------------+
| table_catalog | table_schema | table_name   | table_type |
+---------------+--------------+--------------+------------+
| spice         | runtime      | task_history | BASE TABLE |
| spice         | runtime      | metrics      | BASE TABLE |
| spice         | public       | nyc_taxis    | BASE TABLE |
+---------------+--------------+--------------+------------+

Time: 0.031211 seconds. 3 rows.
  1. Check the table structure of nyc_taxis.
describe nyc_taxis;
+-----------------------+-----------------------------------------+-------------+
| column_name           | data_type                               | is_nullable |
+-----------------------+-----------------------------------------+-------------+
| VendorID              | Int64                                   | YES         |
| tpep_pickup_datetime  | Timestamp(Microsecond, Some("Etc/UTC")) | YES         |
| tpep_dropoff_datetime | Timestamp(Microsecond, Some("Etc/UTC")) | YES         |
| passenger_count       | Float64                                 | YES         |
| trip_distance         | Float64                                 | YES         |
| RatecodeID            | Float64                                 | YES         |
| store_and_fwd_flag    | Utf8                                    | YES         |
| PULocationID          | Int64                                   | YES         |
| DOLocationID          | Int64                                   | YES         |
| payment_type          | Int64                                   | YES         |
| fare_amount           | Float64                                 | YES         |
| extra                 | Float64                                 | YES         |
| mta_tax               | Float64                                 | YES         |
| tip_amount            | Float64                                 | YES         |
| tolls_amount          | Float64                                 | YES         |
| improvement_surcharge | Float64                                 | YES         |
| total_amount          | Float64                                 | YES         |
| congestion_surcharge  | Float64                                 | YES         |
| airport_fee           | Float64                                 | YES         |
+-----------------------+-----------------------------------------+-------------+

Time: 0.006024208 seconds. 19 rows.
  1. Query against the Spark table. The spice runtime will make a network call to the Spark instance.
SELECT * FROM nyc_taxis LIMIT 10;
+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+
| 1        | 2022-03-01T00:13:08Z | 2022-03-01T00:24:35Z  | 1.0             | 2.4           | 1.0        | N                  | 90           | 209          | 2            | 10.0        | 3.0   | 0.5     | 0.0        | 0.0          | 0.3                   | 13.8         | 2.5                  | 0.0         |
| 1        | 2022-03-01T00:47:52Z | 2022-03-01T01:00:08Z  | 1.0             | 2.2           | 1.0        | N                  | 148          | 234          | 2            | 10.5        | 3.0   | 0.5     | 0.0        | 0.0          | 0.3                   | 14.3         | 2.5                  | 0.0         |
| 2        | 2022-03-01T00:02:46Z | 2022-03-01T00:46:43Z  | 1.0             | 19.78         | 2.0        | N                  | 132          | 249          | 1            | 52.0        | 0.0   | 0.5     | 11.06      | 0.0          | 0.3                   | 67.61        | 2.5                  | 1.25        |
| 2        | 2022-03-01T00:52:43Z | 2022-03-01T01:03:40Z  | 2.0             | 2.94          | 1.0        | N                  | 211          | 66           | 1            | 11.0        | 0.5   | 0.5     | 4.44       | 0.0          | 0.3                   | 19.24        | 2.5                  | 0.0         |
| 2        | 2022-03-01T00:15:35Z | 2022-03-01T00:34:13Z  | 1.0             | 8.57          | 1.0        | N                  | 138          | 197          | 1            | 25.0        | 0.5   | 0.5     | 5.51       | 0.0          | 0.3                   | 33.06        | 0.0                  | 1.25        |
| 1        | 2022-03-01T00:11:57Z | 2022-03-01T00:53:05Z  | 2.0             | 14.0          | 1.0        | N                  | 132          | 33           | 1            | 43.5        | 1.75  | 0.5     | 9.2        | 0.0          | 0.3                   | 55.25        | 0.0                  | 1.25        |
| 2        | 2022-03-01T00:05:11Z | 2022-03-01T00:08:22Z  | 1.0             | 0.61          | 1.0        | N                  | 166          | 151          | 1            | 4.5         | 0.5   | 0.5     | 1.0        | 0.0          | 0.3                   | 6.8          | 0.0                  | 0.0         |
| 2        | 2022-03-01T00:30:56Z | 2022-03-01T00:46:21Z  | 1.0             | 2.83          | 1.0        | N                  | 74           | 238          | 1            | 13.0        | 0.5   | 0.5     | 3.7        | 0.0          | 0.3                   | 18.0         | 0.0                  | 0.0         |
| 2        | 2022-03-01T00:30:28Z | 2022-03-01T00:30:36Z  | 1.0             | 0.1           | 1.0        | N                  | 145          | 145          | 3            | -2.5        | -0.5  | -0.5    | 0.0        | 0.0          | -0.3                  | -3.8         | 0.0                  | 0.0         |
| 2        | 2022-03-01T00:30:28Z | 2022-03-01T00:30:36Z  | 1.0             | 0.1           | 1.0        | N                  | 145          | 145          | 2            | 2.5         | 0.5   | 0.5     | 0.0        | 0.0          | 0.3                   | 3.8          | 0.0                  | 0.0         |
+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+

Time: 1.139626792 seconds. 10 rows.
  1. Stop the Spark instance and cleanup
docker compose down --volumes --rmi local