diff --git a/examples/starburst_trino_NYCTaxi.ipynb b/examples/starburst_trino_NYCTaxi.ipynb new file mode 100644 index 0000000..4bad32f --- /dev/null +++ b/examples/starburst_trino_NYCTaxi.ipynb @@ -0,0 +1,517 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ffc5aa5d", + "metadata": {}, + "source": [ + "This notebook example uses a managed version of Trino (Starburst). It also works without Starburst, provided you can import data into a Trino cluster connected to a data lake. We will use one month of Yellow Taxi data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and the zone lookup file provided on the same page. Please download both files and register them in your Starburst or Trino cluster before proceeding." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c9d60a2-5fd2-4c95-ac50-53258be19a2e", + "metadata": {}, + "outputs": [], + "source": [ + "import ibis\n", + "import pandas as pd\n", + "ibis.options.interactive = True\n", + "\n", + "#from trino.auth import OAuth2Authentication" + ] + }, + { + "cell_type": "markdown", + "id": "ad9927bc", + "metadata": {}, + "source": [ + "IMPORTANT: Change the user, host, port, database, schema, and roles to match your Starburst Galaxy setup. If you are using OAuth2, uncomment the `roles` and `auth` keyword lines and the `OAuth2Authentication` import above, then comment out the `password` line to proceed. See https://ibis-project.org/backends/trino#connecting-to-starburst-managed-trino-instances for more information." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e828353", + "metadata": {}, + "outputs": [], + "source": [ + "import os" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9432297e-5138-4e14-8c66-e5f30a704f7e", + "metadata": {}, + "outputs": [], + "source": [ + "con = ibis.trino.connect(\n", + " user=os.environ['user'],\n", + " host=os.environ['host'],\n", + " password=os.environ['password'],\n", + " port=443,\n", + " database=os.environ['database'],\n", + " schema=os.environ['schema'],\n", + " #roles=\"accountadmin\",\n", + " #auth=OAuth2Authentication(),\n", + " http_scheme=\"https\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a2072b36-83f3-4ea5-b377-b5e0fe17ba00", + "metadata": {}, + "source": [ + "In Ibis, `list_tables()` lists all the tables.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83beb960-ab1d-4fa3-a812-5664c0c65854", + "metadata": {}, + "outputs": [], + "source": [ + "con.list_tables()" + ] + }, + { + "cell_type": "markdown", + "id": "e9bd2195-381a-4b20-aa12-f1ee07dafd6c", + "metadata": {}, + "source": [ + "An Ibis table expression for a Trino table is created with `con.table`. 
We're going to create two Ibis tables from our Trino tables below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9226dc3a-d43e-4352-aa93-ada91c0e4875", + "metadata": {}, + "outputs": [], + "source": [ + "nycjantrips = con.table(\"taxizonenyc\")\n", + "zonelookup = con.table(\"zonelookup\")" + ] + }, + { + "cell_type": "markdown", + "id": "1181f385-f943-4859-9060-da62221ae5b9", + "metadata": {}, + "source": [ + "In Ibis, we can check the schema of the tables we just created with `.schema()`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8cd5251-7021-4f25-b22f-d0b560157071", + "metadata": {}, + "outputs": [], + "source": [ + "nycjantrips.schema()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9c3ca08-c2c7-4313-b367-f297ca642c6d", + "metadata": {}, + "outputs": [], + "source": [ + "zonelookup.schema()" + ] + }, + { + "cell_type": "markdown", + "id": "96c2ab6a-dd46-4b23-abc4-b257188ab5f1", + "metadata": {}, + "source": [ + "We're going to preview the dataset with Ibis slicing, which shows the first 10 rows here. We also set `ibis.options.interactive = True`\n", + " at the start of our notebook, which displays Ibis tables in a prettified way." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e09b94d-f8b3-4207-81a4-f8b2344f64da", + "metadata": {}, + "outputs": [], + "source": [ + "nycjantrips[0:10]" + ] + }, + { + "cell_type": "markdown", + "id": "15137806-3afd-49ac-a53b-a478dbc48788", + "metadata": {}, + "source": [ + "To understand the dataset a little more, we can try an order by. It looks like some rows have an undefined passenger count. In this case we're going to want\n", + "to curate the dataset and clean it up a bit to ensure more accurate data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51491505-9c8c-4056-a107-051eb9dadff1", + "metadata": {}, + "outputs": [], + "source": [ + "nycjantrips.order_by(nycjantrips.trip_distance.desc())" + ] + }, + { + "cell_type": "markdown", + "id": "a5556551-dac6-4f66-9488-9c2e2420de36", + "metadata": {}, + "source": [ + "We can combine conditions with filter - similar to a WHERE clause in SQL. Some values are NaN (not a number), and Ibis has built-in support for detecting them with `isnan()`. Note that Ibis expressions are negated with `~` rather than Python's `not`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "296be778-f088-452b-801d-e0127c19c29b", + "metadata": {}, + "outputs": [], + "source": [ + "nyc_filtered = nycjantrips.filter((nycjantrips.passenger_count != 0) & (~nycjantrips.passenger_count.isnan()))\n", + "nyc_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "8d5f878c-3e85-42d7-b29d-96999e6244ce", + "metadata": {}, + "source": [ + "You can see with the command below that the NaN values have been filtered out! " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "effeca8d-adb6-4c6d-8917-988d06201a0e", + "metadata": {}, + "outputs": [], + "source": [ + "nyc_filtered.order_by(nyc_filtered.trip_distance.desc())" + ] + }, + { + "cell_type": "markdown", + "id": "4cdabb64-7db8-462e-b7bd-18999661fd89", + "metadata": {}, + "source": [ + "Let's add a column to our dataset. I want to add a column to help calculate the average ride duration. 
We are going to use the Ibis `delta` method for this.\n", + "Ibis can also create a column expression in isolation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07742d5c-53c9-4604-9f31-96e426c29c6d", + "metadata": {}, + "outputs": [], + "source": [ + "ride_duration = nyc_filtered.tpep_dropoff_datetime.delta(nyc_filtered.tpep_pickup_datetime, \"minute\").name(\"rideminutes\")\n", + "ride_duration" + ] + }, + { + "cell_type": "markdown", + "id": "541601a4-8908-40f6-9b74-3204b3797a32", + "metadata": {}, + "source": [ + "We can also combine the column with our original table using the `mutate` method shown here. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "619ca82d-2963-4586-bfab-d346e1ef0c24", + "metadata": {}, + "outputs": [], + "source": [ + "nycjanduration = nyc_filtered.mutate(rideminutes=nyc_filtered.tpep_dropoff_datetime.delta(nyc_filtered.tpep_pickup_datetime, \"minute\"))\n", + "nycjanduration[\"vendorid\",\"rideminutes\",\"trip_distance\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75b1628a-d00d-4c68-adef-511a492866c9", + "metadata": {}, + "outputs": [], + "source": [ + "nycjanduration[\"vendorid\",\"rideminutes\",\"trip_distance\"].head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "780d6f0e-e3ac-4c98-a8db-709bd6bdad5a", + "metadata": {}, + "source": [ + "Next up are some basic analytics and aggregations in Ibis - let's get total revenue with `sum()`, longest trip with `max()`, and average trip duration with `mean()`. \n", + "Ibis can chain expressions, similar to pandas. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce3c154a-b367-41f8-962c-473433b07b86", + "metadata": {}, + "outputs": [], + "source": [ + "#some basic analytics - let's get total revenue, longest trip. 
\n", + "insights = nycjanduration.agg(\n", + " [\n", + " ibis._.count().name(\"total_trips\"),\n", + " ibis._.total_amount.sum().name(\"total_revenue\"),\n", + " ibis._.trip_distance.sum().name(\"total_distance_all\"),\n", + " ibis._.rideminutes.max().name(\"longest\"), \n", + " ibis._.rideminutes.mean().round(2).name(\"average_ride\")\n", + " ]\n", + ")\n", + "insights" + ] + }, + { + "cell_type": "markdown", + "id": "957102d5-2ee7-485c-ae44-7c8e3817bfb9", + "metadata": {}, + "source": [ + "Wait, the longest trip seems a bit... lengthy... Note: we added `.round(2)` to display the average ride more nicely. Let's check out the ride itself. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9086a984-f484-43eb-a774-e1db784ea3e4", + "metadata": {}, + "outputs": [], + "source": [ + "nycjanduration.filter(ibis._.rideminutes == 10030)" + ] + }, + { + "cell_type": "markdown", + "id": "5c4e3e47-d180-412e-b3d4-46cabfbf8ea3", + "metadata": {}, + "source": [ + "A 7-day trip? It looks like the trip distance is zero, so we can remove the row from future calculations of the average.\n", + "Let's remove the outliers and join with a lookup table to get more information about the \"where\" of our analytical datasets - zones." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99a27e7f-f626-4388-8613-75001f5a9f38", + "metadata": {}, + "outputs": [], + "source": [ + "nycjanduration_new = (\n", + " nycjanduration.filter(nycjanduration.trip_distance != 0.0)\n", + ")\n", + "nycjanduration_new\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "139c7837-16ca-415c-a830-406c816f407c", + "metadata": {}, + "source": [ + "Let's create a cleaner set similar to before." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc2056f1-8d25-4c25-8aac-c9059e5cc4dc", + "metadata": {}, + "outputs": [], + "source": [ + "insights_new = nycjanduration_new.agg(\n", + " [\n", + " ibis._.count().name(\"total_trips\"),\n", + " ibis._[\"total_amount\"].sum().name(\"total_revenue\"),\n", + " ibis._[\"trip_distance\"].sum().name(\"total_distance_all\"),\n", + " ibis._[\"rideminutes\"].max().name(\"longest\"), \n", + " ibis._[\"rideminutes\"].mean().round(2).name(\"average_ride\")\n", + " ]\n", + ")\n", + "insights_new" + ] + }, + { + "cell_type": "markdown", + "id": "db1727ab-e253-48a6-a57a-a6ef59a89c55", + "metadata": {}, + "source": [ + "You can already see a cleaner dataset - the longest trip is shorter, average_ride has changed, and the total number of trips has gone down by almost 40k." + ] + }, + { + "cell_type": "markdown", + "id": "e47e10dc-399a-49c4-bb49-dab1dc7013c9", + "metadata": {}, + "source": [ + "Next up we want to do something more powerful - join with related datasets to get more insight into the geographical behaviour of taxi trips around NYC. Let's look over the zonelookup table again." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6523d156-6df0-4fc4-b4d0-88b30d772529", + "metadata": {}, + "outputs": [], + "source": [ + "zonelookup" + ] + }, + { + "cell_type": "markdown", + "id": "0c503c18-61c0-4287-8058-656a9e5d060a", + "metadata": {}, + "source": [ + "We can see pulocationid is int64, so we must cast it to match the type of locationid in the lookup table before joining. Ibis supports casting data types as well. In the line below, we use .cast(\"str\") to ensure the two tables can be joined together. You can try without the cast and see what happens :). 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6dfd32ae-583a-4168-88eb-fc98997f0aec", + "metadata": {}, + "outputs": [], + "source": [ + "joineddata = nycjanduration_new.inner_join(zonelookup, nycjanduration_new.pulocationid.cast(\"str\") == zonelookup.locationid)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a85caeaa-901b-4d88-b8fa-8c9a379c7274", + "metadata": {}, + "outputs": [], + "source": [ + "joineddata" + ] + }, + { + "cell_type": "markdown", + "id": "dc9b9427-3ebd-44a0-8f32-f98c9461044f", + "metadata": {}, + "source": [ + "Now we can do more cool things in Ibis, such as grouping and aggregating by zone or borough!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "027ddcca-c387-427d-8c64-5a2bd94945df", + "metadata": {}, + "outputs": [], + "source": [ + "groupbyboroughtrips = (\n", + " joineddata\n", + " .group_by(\"zone\")\n", + " .aggregate(\n", + " trips=joineddata.vendorid.count(),\n", + " totalrev=joineddata.fare_amount.sum(),\n", + " totalpassengers=joineddata.passenger_count.sum(),\n", + " averageride=joineddata.rideminutes.mean().round(2)\n", + " \n", + " )\n", + " .order_by(ibis.desc(\"totalrev\"))\n", + " .limit(10)\n", + ")\n", + "groupbyboroughtrips\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "7cef329c-0cbb-4eed-89fd-3596eb195393", + "metadata": {}, + "source": [ + "If you want to see the SQL that Ibis generates, you can use the `ibis.to_sql()` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "200492d2-3766-49bd-8a94-bc3b90b23044", + "metadata": {}, + "outputs": [], + "source": [ + "ibis.to_sql(groupbyboroughtrips)" + ] + }, + { + "cell_type": "markdown", + "id": "46d8bc4d-4b9d-4e2d-aa5b-c059ec9a6ad2", + "metadata": {}, + "source": [ + "Airport rides bring in the most revenue for taxi companies, which makes a lot of sense." 
+ ] + }, + { + "cell_type": "markdown", + "id": "21047d72-ca6f-421f-91ef-1a0efd203b50", + "metadata": {}, + "source": [ + "Let's write our result table back to Trino (to show some write functionality, of course).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4f80f81-12aa-44ca-b197-6f7dacedb983", + "metadata": {}, + "outputs": [], + "source": [ + "con.create_table(\"groupbyboroughtrips\", groupbyboroughtrips)" + ] + }, + { + "cell_type": "markdown", + "id": "dc13b75d-0e1a-46c0-a628-02ce35ae558f", + "metadata": {}, + "source": [ + "There you have it, a quick tutorial with Ibis and Starburst Galaxy! " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a68443f-eb34-4c88-a0ff-439d76b5054a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}