Mastering the SQL Subqueries With Multiple Tables
Mastering the tools to use subqueries involving multiple tables
The complexity of business processes and vast amounts of data in organizations often mean that a single table does not contain all the information needed to make informed decisions or answer critical questions. As a result, multiple tables are often required to gain a more complete understanding of the context and to enrich data analysis. So as a data analyst, it is essential to master SQL joins and subqueries.
Subqueries allow us to use the results of one query as the input for another query and are a powerful tool for filtering and manipulating data. But subqueries can be tricky to master, especially when dealing with multiple tables.
That’s why we’re here to help. In today’s tutorial, we’ll teach you how to use IN, NOT IN, EXISTS, and NOT EXISTS, which are common subqueries that allow you to filter data based on specific criteria.
To help you understand everything we cover here, here’s also a Youtube video from our channel.
IN vs NOT IN
‘IN’ and ‘NOT IN’ are used to filter data based on specific values.
The WHERE clause with an equals operator is commonly used to filter data. However, when filtering by a longer list of values, using the OR operator can be tedious. In such cases, we can use the IN operator to filter data. The IN operator allows us to specify a list of values to match against a given column. It is a more concise and readable way of filtering data compared to using the OR operator repeatedly.
Let’s take an example question from Amazon:
Link to the question: https://platform.stratascratch.com/coding/9838-find-details-of-workers-excluding-those-with-the-first-name-vipul-or-satish
Table: worker
For example, using this worker dataset from Amazon, we can find the workers of the departments ‘Admin’ and ‘HR’ by:
SELECT worker_id,
first_name,
last_name
FROM worker
WHERE department='Admin'
OR department='HR'
But if given a larger dataset with more values to filter on, it is a lot more readable to use the IN operator.
SELECT worker_id, first_name, last_name FROM worker WHERE department IN ('Admin', 'HR')
Sounds simple enough? Try to solve this question. Aside from IN, you’ll also have to use the NOT IN operator!
Using IN/NOT IN with subqueries
IN and NOT IN are not only useful in filtering based on a specified list of values. By using IN and NOT IN, we can also filter the main table based on a set of values from another table or subquery without having to join the tables together!
Let’s try solving this problem.
Link to the question: https://platform.stratascratch.com/coding/10147-find-countries-that-are-in-winemag_p1-dataset-but-not-in-winemag_p2
Table: winemag_p1
Table: winemag_p2
To filter out duplicate countries from the winemag_p1 dataset, we can use the NOT IN operator and a subquery that selects all the countries listed in winemag_p2. This is a more efficient alternative to using the OR operator to filter by each value. The final query becomes:
SELECT DISTINCT country
FROM winemag_p1
WHERE country NOT IN (SELECT DISTINCT country
FROM winemag_p2)
So, when it comes to using IN or NOT IN in SQL, it’s all about filtering data based on a specific column and a list of allowed or excluded values. To make it happen, you need to follow a few simple steps.
- First, you’ll want to select the columns you want to work with from a main table.
- Then, you can use a WHERE clause to filter where a specific column is either IN or NOT IN a list of values. This list of values can either be
• specified directly, in a tuple-like format; or
• indirectly, using a subquery that returns a list of eligible values for the column
Info box to explain the options:
SELECT col1, col2 FROM table WHERE col1 IN
- (‘allowed_value1’, ‘allowed_value2’, ‘allowed_value3’)
- (SELECT col1 FROM table2 GROUP BY col1 HAVING count(*)>2)
But here’s the catch: if you decide to use a subquery, it needs to return a list of values for the specific column you’re filtering on. For instance, if you’re filtering based on col1, the subquery should only return col1 values.
EXISTS and NOT EXISTS
Let’s discuss another SQL function that enables us to filter data based on the results from another table, which is called EXISTS and NOT EXISTS. But before we delve into some examples, let’s first understand the difference between IN/NOT IN and EXISTS/NOT EXISTS, as they can be easily mixed up.
Infobox:
Let me explain it to you in simple terms — both IN/NOT IN and EXISTS/NOT EXISTS are used in the WHERE clause, but their functionalities are different. IN and NOT IN filter rows by checking if a particular value is present or not, while EXISTS and NOT EXISTS filter based on the existence of rows in a subquery. So, keep this difference in mind while working with these functions.
Let’s move on and explore the EXISTS and NOT EXISTS functions with this question from Meta.
Link to the question: https://platform.stratascratch.com/coding/2081-recommendation-system
Table: users_friends
Table: users_pages
Discovery is the aim of recommendation systems, so the pages recommended should be new. In other words, we should only return pages once (i.e., not duplicated), and these should be pages that the user does not yet follow.
First, we need to join the user_friends table to the user_pages table to identify the pages that their friends follow.
SELECT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
But since friends may follow the same pages, we need to return only distinct page recommendations for each user. Easy peasy, we just add the DISTINCT keyword to the SELECT statement.
SELECT DISTINCT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
But now comes the real magic — using the NOT EXISTS operator to filter out pages that the user is already following. We create a subquery that checks for matching user_id and page_id combinations in the user_pages table and join it to the outer query. If this subquery returns any row, the user-page combination is removed. Otherwise, it is retained.
SELECT *
FROM users_pages pg
WHERE pg.user_id = f.user_id
AND pg.page_id = p.page_id)
And just like that, we have our solution! Our final query looks like this:
SELECT DISTINCT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
WHERE NOT EXISTS
(SELECT *
FROM users_pages pg
WHERE pg.user_id = f.user_id
AND pg.page_id = p.page_id)
Conclusion
With our tutorial today, you’ll be better equipped to make more complex SQL queries using multiple tables through IN, NOT IN, EXISTS, and NOT EXISTS. In summary, the IN operator is great for when you have a specific list of values to match against, while EXISTS is more useful for complex subquery filtering.
Check out “SQL Interview Questions” to practice more such questions.
Originally published at https://www.stratascratch.com.