February 1, 2021

In this tutorial we’ll be working with a dataset from the bike-sharing service Hubway, which includes data on over 1.5 million trips made with the service. We’ll start by looking a little bit at databases, what they are and why we use them, before starting to write some queries of our own in SQL. If you’d like to follow along you can download the

SQL Basics: Relational Databases

A relational database is a database that stores related information across multiple tables and allows you to query information in more than one table at the same time. It’s easier to understand how this works by thinking through an example. Imagine you’re a business and you want to keep track of your sales information. You could set up a spreadsheet in Excel with all of the information you want to keep track of as separate columns: order number, date, amount due, shipment tracking number, customer name, customer address, and customer phone number.

This setup would work fine for tracking the information you need to begin with, but as you start to get repeat orders from the same customer you’ll find that their name, address and phone number get stored in multiple rows of your spreadsheet. As your business grows and the number of orders you’re tracking increases, this redundant data will take up unnecessary space and generally decrease the efficiency of your sales tracking system. You might also run into issues with data integrity. There’s no guarantee, for example, that every field will be populated with the correct data type or that the name and address will be entered exactly the same way every time.

With a relational database, like the one in the above diagram, you avoid all of these issues. You could set up two tables, one for orders and one for customers. The ‘customers’ table would include a unique ID number for each customer, along with the name, address and phone number we were already tracking.
The ‘orders’ table would include your order number, date, amount due, tracking number and, instead of a separate field for each item of customer data, it would have a column for the customer ID. This enables us to pull up all of the customer info for any given order, but we only have to store it once in our database rather than listing it out again for every single order.

Our Data Set

Let’s start by taking a look at our database. The database has two tables,
Our Analysis

With this information and the SQL commands we’ll learn shortly, here are some questions that we’ll try to answer over the course of this post:
The SQL commands we’ll use to answer these questions are:
Installation and Setup

For the purposes of this tutorial, we will be using a database system called SQLite3. SQLite has come as part of Python from version 2.5 onwards, so if you have Python installed you’ll almost certainly have SQLite as well. Python and the SQLite3 library can easily be installed and set up with Anaconda if you don’t already have them.

Using Python to run our SQL code allows us to import the results into a Pandas dataframe, which makes them easier to display in a readable format. It also means we can perform further analysis and visualization on the data we pull from the database, although that is beyond the scope of this tutorial.

Alternatively, if we don’t want to use or install Python, we can run SQLite3 from the command line. Simply download the “precompiled binaries” from the SQLite3 web page and use the following code to open the database:
From here we can just type in the query we want to run and we will see the data returned in our terminal window.

An alternative to using the terminal is to connect to the SQLite database via Python. This would allow us to use a Jupyter notebook, so that we could see the results of our queries in a neatly formatted table. To do this, we’ll define a function that takes our query (stored as a string) as an input and shows the result as a formatted dataframe:
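A helper along these lines might look like the minimal sketch below. It is not the tutorial’s original function: it swaps the Pandas dataframe for a plain-text table so that it runs with the standard library alone, and the names `run_query` and `show`, as well as the tiny in-memory `trips` table, are assumptions made for illustration. With Pandas installed, `pandas.read_sql_query(query, db)` would return a dataframe instead.

```python
import sqlite3


def run_query(db, query):
    """Run a SQL query and return (column_names, rows)."""
    cursor = db.execute(query)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


def show(columns, rows):
    """Format a query result as a simple plain-text table."""
    lines = [" | ".join(columns)]
    lines += [" | ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


# Hypothetical stand-in for the real database file: a tiny in-memory table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (id INTEGER, duration INTEGER);")
db.executemany("INSERT INTO trips VALUES (?, ?);", [(1, 452), (2, 1275)])

columns, rows = run_query(db, "SELECT * FROM trips ORDER BY id;")
print(show(columns, rows))
```

To point the same helper at a database file on disk, `sqlite3.connect` would take the path to that file in place of `":memory:"`.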
Of course, we don’t have to use Python with SQL. If you’re an R programmer already, our SQL Fundamentals for R Users course would be a great place to start.

SELECT

The first command we’ll work with is SELECT.

In addition to the columns we want to retrieve, we also have to tell the database which table to get them from. To do this we use the keyword FROM.
In this example, we started with the

One important thing to be aware of when writing SQL queries is that we’ll want to end every query with a semicolon (;).

LIMIT

The next command we need to know before we start to run queries on our Hubway database is LIMIT. The
We simply added the

We will use

Let’s run our first query on the Hubway database. First we will store our query as a string and then use the function we defined earlier to run it on the database. Take a look at the following example:
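A first query of this kind can be sketched as follows. The table name `trips`, its columns, and the generated rows are assumptions standing in for the real Hubway data, and the query is run directly through `sqlite3` rather than through a dataframe helper.

```python
import sqlite3

# Hypothetical stand-in for the real database: a tiny in-memory `trips` table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER, bike_number TEXT);"
)
db.executemany(
    "INSERT INTO trips (duration, bike_number) VALUES (?, ?);",
    [(300 + 60 * i, f"B{i:05d}") for i in range(8)],
)

# Store the query as a string, then run it.
# SELECT * asks for every column; LIMIT 5 caps the result at five rows.
query = "SELECT * FROM trips LIMIT 5;"
cursor = db.execute(query)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Even though the toy table holds eight rows, only five come back, which is what makes LIMIT handy for peeking at a large table.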
This query uses

You will often see that people capitalize the command keywords in their queries (a convention that we’ll follow throughout this tutorial), but this is mostly a matter of preference. Capitalization makes the code easier to read, but it doesn’t affect how the code runs. If you prefer to write your queries with lowercase commands, they will still execute correctly.

Our previous example returned every column in the
ORDER BY

The final command we need to know before we can answer the first of our questions is ORDER BY.

To use it, we simply specify the name of the column we would like to sort on. By default, ORDER BY sorts in ascending order.

For example, if we wanted to sort the
With the

To answer this question, it’s helpful to break it down into sections and identify which commands we will need to address each part. First we need to pull the information from the
Using these commands in this way will return the single row with the longest duration, which gives us the answer to our question.

One more thing to note: as your queries add more commands and get more complicated, you may find them easier to read if you separate them onto multiple lines. This, like capitalization, is a matter of personal preference. It doesn’t affect how the code runs (the system just reads the code from the beginning until it reaches the semicolon), but it can make your queries clearer and easier to follow. In Python, we can separate a string onto multiple lines by using triple quote marks.

Let’s go ahead and run this query and find out how long the longest trip lasted.
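The query described above (sort by duration, largest first, and keep one row) can be sketched like this. The `trips` table and its `duration` column are assumed names, and the four rows are toy data rather than real Hubway records.

```python
import sqlite3

# Hypothetical toy data standing in for the real trips table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER);")
db.executemany(
    "INSERT INTO trips (duration) VALUES (?);",
    [(452,), (9999,), (1275,), (61,)],
)

# Triple quotes let the query span several lines, one clause per line.
query = """
SELECT duration
FROM trips
ORDER BY duration DESC
LIMIT 1;
"""

# DESC puts the largest duration first, and LIMIT 1 keeps only that row.
longest = db.execute(query).fetchone()[0]
print(longest)
```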
Now we know that the longest trip lasted 9999 seconds, or a little over 166 minutes. With a maximum value of 9999, however, we don’t know whether this is really the length of the longest trip or if the database was only set up to allow a four-digit number.

If it’s true that particularly long trips are being cut short by the database, then we might expect to see a lot of trips at 9999 seconds where they reach the limit. Let’s try running the same query as before, but adjust the
What we see here is that there aren’t many trips at 9999 seconds, so it doesn’t look like we’re cutting off the top end of our durations, but it’s still difficult to tell whether that’s the real length of the trip or just the maximum allowed value.

Hubway charges additional fees for rides over 30 minutes (somebody keeping a bike for 9999 seconds would have to pay an extra $25 in fees), so it’s plausible that they decided four digits would be sufficient to track the majority of rides.

WHERE

The previous commands are great for pulling out sorted information for particular columns, but what if there is a specific subset of the data we want to look at? That’s where WHERE comes in.
You’ll also notice that we use quote marks in this query. That’s because the

Let’s write a query that uses
As we can see, this query returned 14 different trips, each with a duration of 9990 seconds or more. Something that stands out about this query is that all but one of the results have a

We can already see how even a beginner-level command of SQL can help us answer business questions and find insights in our data.

Returning to

Here’s another personal preference recommendation: use parentheses to separate each logical test, as demonstrated in the code block below. This isn’t strictly required for the code to function, but parentheses make your queries easier to understand as you increase the complexity.

Let’s run that query now. We already know it should only return one result, so it should be easy to check that we’ve got it right:
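A query combining two logical tests, each wrapped in parentheses, might be sketched as follows. The column name `sub_type`, its values `'Registered'` and `'Casual'`, and the five rows of toy data are assumptions for illustration, not details confirmed by the text here.

```python
import sqlite3

# Hypothetical toy data: durations near the 9999-second ceiling plus one short trip.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER, sub_type TEXT);"
)
db.executemany(
    "INSERT INTO trips (duration, sub_type) VALUES (?, ?);",
    [
        (9999, "Casual"),
        (9994, "Casual"),
        (9990, "Registered"),
        (9985, "Registered"),
        (600, "Registered"),
    ],
)

# Each logical test sits in its own parentheses; AND requires both to hold.
# Text values such as 'Registered' need quote marks, numbers do not.
query = """
SELECT *
FROM trips
WHERE (duration >= 9990) AND (sub_type = 'Registered')
ORDER BY duration DESC;
"""
rows = db.execute(query).fetchall()
print(rows)
```

In this toy table only one row passes both tests, mirroring the single result the text expects from the real data.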
The next question we set out at the beginning of the post is “How many trips were taken by ‘registered’ users?” To answer it, we could run the same query as above and modify the

However, SQL actually has a built-in command to do that counting for us, COUNT.
In this instance, it doesn’t matter which column we choose to count because every column should have data for each row in our query. But sometimes a query might have missing (or “null”) values for some rows. If we’re not sure whether a column contains null values we can run our

We can also use

Let’s take a look at a query to answer our question. We can use
This query worked, and has returned the answer to our question. But the column heading isn’t particularly descriptive. If someone else were to look at this table, they wouldn’t be able to understand what it meant.
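One way to get a more descriptive heading is to give the counted column an alias with AS. The sketch below assumes a `trips` table with a `sub_type` column and uses made-up rows; the alias text is likewise only an example.

```python
import sqlite3

# Hypothetical toy data: three registered trips and two casual ones.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, sub_type TEXT);")
db.executemany(
    "INSERT INTO trips (sub_type) VALUES (?);",
    [("Registered",), ("Casual",), ("Registered",), ("Registered",), ("Casual",)],
)

# COUNT(*) counts the rows that pass the WHERE test; AS renames the result column.
query = """
SELECT COUNT(*) AS "Total Trips by Registered Users"
FROM trips
WHERE sub_type = 'Registered';
"""
cursor = db.execute(query)
heading = cursor.description[0][0]
count = cursor.fetchone()[0]
print(heading, count)
```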
Aggregate Functions
So to answer our third question, “What was the average trip duration?”, we can use the AVG function:
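An average-duration query might be sketched as below, assuming a `trips` table with a `duration` column measured in seconds. The three rows are toy values chosen so the arithmetic is easy to check by hand.

```python
import sqlite3

# Hypothetical toy data: three trips lasting 600, 900 and 1200 seconds.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER);")
db.executemany("INSERT INTO trips (duration) VALUES (?);", [(600,), (900,), (1200,)])

# AVG is an aggregate function: it collapses the whole column into one value.
query = """
SELECT AVG(duration) AS "Average Duration"
FROM trips;
"""
average = db.execute(query).fetchone()[0]
print(average)
```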
It turns out that the average trip duration is 912 seconds, which is about 15 minutes. This makes some sense, since we know that Hubway charges extra fees for trips over 30 minutes. The service is designed for riders to take short, one-way trips.

What about our next question: do registered or casual users take longer trips? We already know one way to answer this question: we could run two

Let’s do it a different way, though. SQL also includes a way to answer this question in a single query, using GROUP BY.

GROUP BY
To get a better idea of how this works, let’s take a look at the

When we use

Once we have our two separate piles, the database will perform any aggregate functions in our query on each of them in turn. If we used

Let’s walk through exactly how to write a query to answer our question of whether registered or casual users take longer trips.
Here’s what the code looks like when we put it all together:
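A grouped version of the average query can be sketched like this. The `sub_type` column, its two values, and the four rows are assumptions made for illustration; the real figures come from the full Hubway data.

```python
import sqlite3

# Hypothetical toy data: two registered trips and two casual trips.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER, sub_type TEXT);"
)
db.executemany(
    "INSERT INTO trips (duration, sub_type) VALUES (?, ?);",
    [(600, "Registered"), (720, "Registered"), (1400, "Casual"), (1600, "Casual")],
)

# GROUP BY splits the rows into one pile per sub_type,
# then AVG is computed separately for each pile.
query = """
SELECT sub_type, AVG(duration) AS "Average Duration"
FROM trips
GROUP BY sub_type;
"""
averages = dict(db.execute(query).fetchall())
print(averages)
```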
That’s quite a difference! On average, registered users take trips that last around 11 minutes, whereas casual users spend almost 25 minutes per ride. Registered users are likely taking shorter, more frequent trips, possibly as part of their commute to work. Casual users, on the other hand, are spending around twice as long per trip.

It’s possible that casual users tend to come from demographics (tourists, for example) that are more inclined to take longer trips to make sure they get around and see all the sights. Once we’ve discovered this difference in the data, there are many ways the company might be able to investigate it to better understand what’s causing it.

For the purposes of this tutorial, however, let’s move on. Our next question was: which bike was used for the most trips? We can answer this using a very similar query. Take a look at the following example and see if you can figure out what each line is doing. We’ll go through it step by step afterwards so you can check you got it right:
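A query for the most-used bike might be sketched as follows. The column name `bike_number`, the bike identifiers, and the aliases are assumptions made for this example.

```python
import sqlite3

# Hypothetical toy data: bike B00001 is used three times, B00003 twice, B00002 once.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, bike_number TEXT);")
db.executemany(
    "INSERT INTO trips (bike_number) VALUES (?);",
    [("B00001",), ("B00002",), ("B00001",), ("B00003",), ("B00001",), ("B00003",)],
)

# Group the trips by bike, count each group, sort the counts from
# largest to smallest, and keep only the top row.
query = """
SELECT bike_number AS "Bike Number", COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 1;
"""
top_bike = db.execute(query).fetchone()
print(top_bike)
```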
As you can see from the output, bike
Arithmetic Operators

Our final question is a little more tricky than the others. We want to know the average duration of trips by registered members over the age of 30. We could just figure out in our heads the year in which 30-year-olds were born and then plug it in, but a more elegant solution is to use arithmetic operations directly within our query. SQL allows us to use
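Arithmetic inside a WHERE clause can be sketched as below. Several things here are assumptions: the `birth_date` column is taken to hold a birth year as an integer, `REFERENCE_YEAR` is a hypothetical constant standing in for the current year, and the rows are toy data.

```python
import sqlite3

# Hypothetical reference year used to turn a birth year into an age.
REFERENCE_YEAR = 2021

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trips ("
    "id INTEGER PRIMARY KEY, duration INTEGER, sub_type TEXT, birth_date INTEGER);"
)
db.executemany(
    "INSERT INTO trips (duration, sub_type, birth_date) VALUES (?, ?, ?);",
    [
        (600, "Registered", 1970),
        (1200, "Registered", 1980),
        (3000, "Registered", 1995),  # 26 in the reference year, so excluded
        (5000, "Casual", 1960),      # not a registered member, so excluded
    ],
)

# The subtraction runs inside the database for every row;
# the ? placeholder is filled with REFERENCE_YEAR when the query executes.
query = """
SELECT AVG(duration)
FROM trips
WHERE (sub_type = 'Registered') AND ((? - birth_date) > 30);
"""
average_over_30 = db.execute(query, (REFERENCE_YEAR,)).fetchone()[0]
print(average_over_30)
```

Passing the year as a parameter, rather than typing it into the query text, keeps the query reusable when the reference year changes.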
JOIN

So far we’ve been looking at queries that only pull data from the

Our bike-sharing database contains a second table,

Before we start to work through some real examples from this database, though, let’s look back at the hypothetical order tracking database from earlier. In that database we had two tables, ‘orders’ and ‘customers’.

Let’s say we wanted to write a query that returned the
Unfortunately

To answer the first two of these questions, we can include the table names for each column in our

To tell the database how the

We’re going to use an inner join, which means that rows will only be returned where there is a match in the columns specified in ON.

As we discussed earlier, these tables are connected on the
Once again we use the
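An inner join between the two hypothetical order-tracking tables can be sketched like this. The column names (`order_number`, `customer_id`, `id`, `name`) and all of the rows are assumptions invented for the example.

```python
import sqlite3

# Hypothetical order-tracking database with two related tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);")
db.execute(
    "CREATE TABLE orders (order_number INTEGER PRIMARY KEY, customer_id INTEGER);"
)
db.executemany(
    "INSERT INTO customers VALUES (?, ?);", [(1, "Ada Lopez"), (2, "Ben Ito")]
)
db.executemany(
    "INSERT INTO orders VALUES (?, ?);", [(1001, 1), (1002, 2), (1003, 1)]
)

# Prefixing each column with its table name removes any ambiguity.
# ON states which columns must match for two rows to be paired up.
query = """
SELECT orders.order_number, customers.name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.id
ORDER BY orders.order_number;
"""
rows = db.execute(query).fetchall()
print(rows)
```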
This query will return the order number of every order in the database along with the customer name that is associated with each.

Returning to our Hubway database, we can now write some queries to see

Before we get started, we should take a look at the rest of the columns in the
Like before, we’ll try to answer some questions in the data, starting with: which station is the most frequent starting point? Let’s work through it step by step:
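Put together, a query for the most frequent starting stations might look like the sketch below. The table and column names (`stations.station`, `stations.id`, `trips.start_station`) are assumptions, and the generic station names and trips are toy data rather than real Hubway records.

```python
import sqlite3

# Hypothetical toy versions of the two tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT);")
db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, start_station INTEGER);")
db.executemany(
    "INSERT INTO stations VALUES (?, ?);",
    [(1, "Station A"), (2, "Station B"), (3, "Station C")],
)
db.executemany(
    "INSERT INTO trips (start_station) VALUES (?);",
    [(1,), (1,), (1,), (2,), (2,), (3,)],
)

# Join each trip to its starting station, count trips per station,
# and list the busiest stations first.
query = """
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips
INNER JOIN stations
ON trips.start_station = stations.id
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
"""
rows = db.execute(query).fetchall()
print(rows)
```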
If you’re familiar with Boston, you’ll understand why these are the most popular stations. South Station is one of the main commuter rail stations in the city, Charles Street runs along the river close to some nice scenic routes, and Boylston and Beacon streets are right downtown near a number of office buildings.

The next question we’ll look at is: which stations are most frequently used for round trips? We can use much the same query as before. We will
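A round-trip version of the station query could be sketched as follows: the same join and grouping, plus a WHERE test that keeps only trips ending where they started. The column names `start_station` and `end_station` and all of the rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy versions of the two tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT);")
db.execute(
    "CREATE TABLE trips ("
    "id INTEGER PRIMARY KEY, start_station INTEGER, end_station INTEGER);"
)
db.executemany(
    "INSERT INTO stations VALUES (?, ?);",
    [(1, "Station A"), (2, "Station B"), (3, "Station C")],
)
db.executemany(
    "INSERT INTO trips (start_station, end_station) VALUES (?, ?);",
    [(1, 1), (1, 2), (2, 2), (2, 2), (2, 2), (3, 3), (3, 3), (3, 1)],
)

# A round trip is one whose start and end stations are the same.
query = """
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips
INNER JOIN stations
ON trips.start_station = stations.id
WHERE trips.start_station = trips.end_station
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
"""
rows = db.execute(query).fetchall()
print(rows)
```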
As we can see, a number of these stations are the same as in the previous question, but the counts are much lower. The busiest stations are still the busiest stations, but the lower numbers overall suggest that people are typically using Hubway bikes to get from point A to point B rather than cycling around for a while before returning to where they started.

There is one significant difference here: the Esplanade, which was not one of the overall busiest stations from our first query, appears to be the busiest for round trips. Why? Well, a picture is worth a thousand words. This certainly looks like a nice spot for a bike ride:

On to the next question: how many trips start and end in different municipalities? This question takes things a step further. We want to know how many trips start and end in a different municipality.

In order to do this, we have to create an alias for the

For example we can use the following code to
Here’s what the final query will look like when we run it. Note that we’ve used
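Joining the stations table twice under two different aliases might be sketched as below. The alias names `start_st` and `end_st`, the `municipality` column, and the toy rows are all assumptions made for this example.

```python
import sqlite3

# Hypothetical toy versions of the two tables.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT, municipality TEXT);"
)
db.execute(
    "CREATE TABLE trips ("
    "id INTEGER PRIMARY KEY, start_station INTEGER, end_station INTEGER);"
)
db.executemany(
    "INSERT INTO stations VALUES (?, ?, ?);",
    [
        (1, "Station A", "Boston"),
        (2, "Station B", "Boston"),
        (3, "Station C", "Cambridge"),
    ],
)
db.executemany(
    "INSERT INTO trips (start_station, end_station) VALUES (?, ?);",
    [(1, 2), (1, 3), (3, 1), (2, 2), (3, 3)],
)

# The stations table is joined twice: once for the start of each trip and
# once for the end. The aliases keep the two copies apart, and <> means
# "not equal to".
query = """
SELECT COUNT(trips.id) AS "Count"
FROM trips
INNER JOIN stations AS start_st
ON trips.start_station = start_st.id
INNER JOIN stations AS end_st
ON trips.end_station = end_st.id
WHERE start_st.municipality <> end_st.municipality;
"""
different = db.execute(query).fetchone()[0]
print(different)
```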
This shows that about 300,000 out of 1.5 million trips (or 20%) ended in a different municipality than they started in, which is further evidence that people mostly use Hubway bicycles for relatively short journeys rather than longer trips between towns.

If you’ve made it this far, congratulations! You’ve begun to master the basics of SQL. We have covered a number of important commands,

You’ve mastered the SQL basics. Now what?

After finishing this beginner SQL tutorial, you should be able to pick up a database you find interesting and write queries to pull out information. A good first step might be to continue working with the Hubway database to see what else you can find out. Here are some other questions you might want to try and answer:
If you would like to take things a step further, check out our interactive SQL courses, which cover everything you’ll need to know from beginning to advanced-level SQL for data analyst and data scientist jobs. You also might want to read our post about exporting the data from your SQL queries into Pandas or check out our SQL Cheat Sheet and our article on SQL certification.