{"id":192,"date":"2018-10-22T00:00:00","date_gmt":"2013-01-01T00:00:00","guid":{"rendered":"https:\/\/www.sqlhammer.com\/nyc-yellow-cab-data-in-azure-sql-data-warehouse"},"modified":"2026-03-18T21:53:17","modified_gmt":"2026-03-18T21:53:17","slug":"nyc-yellow-cab-data-in-azure-sql-data-warehouse","status":"publish","type":"post","link":"https:\/\/sqlhammer.com\/index.php\/2018\/10\/22\/nyc-yellow-cab-data-in-azure-sql-data-warehouse\/","title":{"rendered":"NYC Yellow Cab Data in Azure SQL Data Warehouse &#8211; SQL Hammer"},"content":{"rendered":"<p>Microsoft Azure provides a convenient feature when creating a new Azure SQL Data Warehouse database. You can create the database from a backup of the Microsoft sample database, AdventureWorksDW. This is super fast and easy. It also has huge tables like dbo.FactInternetSalesReason with\u00a064,515 rows!<\/p>\n<p>Oh, hold on, I think I typo\u2019ed. No, actually that is all. For the scale and performance of SQL Data Warehouse, 64,515 rows in a fact table is quite small.<\/p>\n<h2>Early attempts<\/h2>\n<p>On four separate occasions, I decided to build a larger SQL Data Warehouse for testing and demonstrations. My first attempt used the copy of StackOverflow\u2019s database which Brent Ozar maintains as a BitTorrent download, <a href=\"https:\/\/www.brentozar.com\/archive\/2015\/10\/how-to-download-the-stack-overflow-database-via-bittorrent\/\">here<\/a>. After spending the time to download the database and export it into a format that is easy to upload and load into SQL Data Warehouse, I realized that I still needed to increase the scale of the tables. Also, it is not a data warehouse database; I wanted a star schema and data that better matched the scenarios I wanted to demonstrate.<\/p>\n<p>I then pivoted to using the Azure-provided copy of AdventureWorksDW and went about extending the table sizes. 
I made it work, but I was disappointed by the amount of effort required to make the data look real and to grow all of the dimensions proportionally with the facts.<\/p>\n<h2>Calling a cab<\/h2>\n<p>The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/sql-database\/sql-database-public-data-sets\">here<\/a>.<\/p>\n<p>I decided to go with the <a href=\"http:\/\/www.nyc.gov\/html\/tlc\/html\/about\/trip_record_data.shtml\">NYC Taxi and Limousine Commission (TLC) Trip Record Data<\/a>. Data is available for most taxi and limousine fares, with pickup\/drop-off and distance information, between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for-hire vehicles. Just the Yellow cab data from 01\/2016 \u2013 06\/2018 is over 112,000,000 records (24 GB), and it downloads as easy-to-import comma-separated values (CSV) files.<\/p>\n<h2>Procedure for data import<\/h2>\n<h3>Download<\/h3>\n<p>Downloading the files is a matter of calling a URL for each file, one per month. I did this manually because queueing up the file downloads did not take very long. 
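If you would rather queue the month-by-month downloads with a script, a small shell loop works. This is a sketch of one way to do it, not part of the original procedure; it assumes curl is installed and uses the public TLC S3 URL pattern (yellow_tripdata_YYYY-MM.csv) for the 2016-01 through 2018-06 range covered here:

```shell
# Build the list of monthly file URLs for 2016-01 through 2018-06.
# Assumption: the TLC bucket layout is
# https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_YYYY-MM.csv
urls=$(for year in 2016 2017 2018; do
  for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    # The Yellow cab range used in this post stops at 2018-06.
    [ "$year" = "2018" ] && [ "$month" -gt "06" ] && continue
    echo "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_${year}-${month}.csv"
  done
done)

echo "$urls" | wc -l   # 30 monthly files in total

# Uncomment to actually download (roughly 24 GB in total):
# echo "$urls" | xargs -n 1 curl -f -O
```

Swapping "yellow" for "green" in the file name would fetch the Green cab data instead.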
However, if you wish to do something more scripted, you can call the below URL with the year and month incremented.<\/p>\n<pre><code class=\"language-\">https:\/\/s3.amazonaws.com\/nyc-tlc\/trip+data\/yellow_tripdata_2017-01.csv<\/code><\/pre>\n<p>The complete data dictionary is also available on the TLC website in <a href=\"http:\/\/www.nyc.gov\/html\/tlc\/downloads\/pdf\/data_dictionary_trip_records_yellow.pdf\">PDF format<\/a>.<\/p>\n<h3>Upload<\/h3>\n<p>Create an Azure Blob Storage account and a container within it to hold your CSV files.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.sqlhammer.com\/wp-content\/uploads\/2018\/10\/blob-storage-1.jpg?resize=574%2C322&#038;ssl=1\" alt=\"\" \/><\/p>\n<p>Download and install <a href=\"http:\/\/aka.ms\/azcopy\">Microsoft\u2019s AzCopy<\/a> utility. You can run it from Command Prompt or PowerShell, but the installer does not append to your PATH environment variable, which makes launching the utility directly (for example, from its Start Menu shortcut) easiest.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i1.wp.com\/www.sqlhammer.com\/wp-content\/uploads\/2018\/10\/azure-storage-azcopy-start-menu.jpg?resize=579%2C248&#038;ssl=1\" alt=\"\" \/><\/p>\n<p>Run a command, similar to this one, to upload your entire directory of files to blob storage.<\/p>\n<pre><code class=\"language-\">AzCopy \/Source:C:\\temp\\active-upload \/Dest:https:\/\/sqlhammersandbox.blob.core.windows.net\/inbox \/DestKey:N6xsupersecretprimarystorageaccountkeyqhk9Bmgdw== \/Pattern:\"*.csv\" \/NC:1<\/code><\/pre>\n<p>Remove the \u2018\/NC:1\u2019 flag if you have good or great upload speeds. I limited my upload to one file at a time because my upload speed is terrible and the default parallel copy was timing out.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i2.wp.com\/www.sqlhammer.com\/wp-content\/uploads\/2018\/10\/azcopy.jpg?resize=580%2C104&#038;ssl=1\" alt=\"\" \/><\/p>\n<h3>SQL Data Warehouse Setup<\/h3>\n<p>With at least one file in blob storage, this T-SQL script will create the necessary objects for loading the data. Just update the CREDENTIAL SECRET, master key password, and LOCATION.<\/p>\n<pre><code class=\"language-\">--\/*\nCREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123secretword';\n\nCREATE DATABASE SCOPED CREDENTIAL AzureBlobsqlhammersandbox\nWITH IDENTITY = 'credential', Secret = 'N6xsupersecretprimarystorageaccountkeyqhk9Bmgdw==';\n\nCREATE EXTERNAL DATA SOURCE dsinbox WITH\n(\n\tTYPE = HADOOP,\n\tLOCATION = N'wasbs:\/\/inbox@sqlhammersandbox.blob.core.windows.net\/',\n\tCREDENTIAL = AzureBlobsqlhammersandbox\n);\n\nCREATE EXTERNAL FILE FORMAT csv\nWITH (\n    FORMAT_TYPE = DELIMITEDTEXT,\n    FORMAT_OPTIONS\n\t\t(\n\t\tFIELD_TERMINATOR =','\n\t\t,FIRST_ROW = 3\n\t\t,DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss'\n\t\t)\n);\n--*\/\n\n--\/*\nCREATE EXTERNAL TABLE ext_Yellow_Cab_NYC (\n    vendorid tinyint null,\n\ttpep_pickup_datetime datetime2(3) null,\n\ttpep_dropoff_datetime datetime2(3) null,\n\tpassenger_count tinyint null,\n\ttrip_distance_miles decimal(19,2) null,\n\tratecodeid tinyint null,\n\tstore_and_fwd_flag char null,\n\tpulocationid int null,\n\tdolocationid int null,\n\tpayment_type tinyint null,\n\tfare_amount decimal(19,2) null,\n\textra decimal(19,2) null,\n\tmta_tax decimal(19,2) null,\n\ttip_amount decimal(19,2) null,\n\ttolls_amount decimal(19,2) null,\n\timprovement_surcharge decimal(19,2) null,\n\ttotal_amount decimal(19,2) null\n)\nWITH (\n        LOCATION='\/',\n        DATA_SOURCE = dsinbox,\n        FILE_FORMAT = csv\n    )\n;\n\nselect TOP 10 *\nfrom ext_Yellow_Cab_NYC;\n--*\/\n\n--\/*\nCREATE TABLE [dbo].[Yellow_Cab_NYC]\n(\n\tvendorid tinyint null,\n\ttpep_pickup_datetime datetime2(3) null,\n\ttpep_dropoff_datetime datetime2(3) null,\n\tpassenger_count tinyint null,\n\ttrip_distance_miles decimal(19,2) null,\n\tratecodeid tinyint null,\n\tstore_and_fwd_flag char null,\n\tpulocationid int null,\n\tdolocationid int null,\n\tpayment_type tinyint null,\n\tfare_amount decimal(19,2) null,\n\textra decimal(19,2) null,\n\tmta_tax decimal(19,2) null,\n\ttip_amount decimal(19,2) null,\n\ttolls_amount decimal(19,2) null,\n\timprovement_surcharge decimal(19,2) null,\n\ttotal_amount decimal(19,2) null\n)\nWITH\n(\n\tDISTRIBUTION = HASH ( vendorid ),\n\tCLUSTERED COLUMNSTORE INDEX\n)\n--*\/<\/code><\/pre>\n<h3>Data load<\/h3>\n<p>With all of your files in Azure blob storage, simply INSERT INTO your table from the EXTERNAL table.<\/p>\n<pre><code class=\"language-\">begin tran\n\ninsert into Yellow_Cab_NYC\nselect * from ext_Yellow_Cab_NYC\n\ncommit<\/code><\/pre>\n<h2>Wrap-up<\/h2>\n<p>This data set is not a star schema, since it is only one table. However, it is large, can be sized based on how much data you feel like loading, and is realistic without needing to be massaged.<\/p>\n<p>The only thing left to do is enjoy analyzing the taxi fares of NYC.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft Azure provides a convenient feature when creating a new Azure SQL Data Warehouse database. You can create the database from a backup of the Microsoft sample database, AdventureWorksDW. This is super fast and easy. It also has huge tables like dbo.FactInternetSalesReason with\u00a064,515 rows! Oh, hold on I think I typo\u2019ed. 
No, actually that is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-192","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/posts\/192","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/comments?post=192"}],"version-history":[{"count":1,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/posts\/192\/revisions"}],"predecessor-version":[{"id":454,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/posts\/192\/revisions\/454"}],"wp:attachment":[{"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/media?parent=192"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/categories?post=192"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sqlhammer.com\/index.php\/wp-json\/wp\/v2\/tags?post=192"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}