Phase 1- SQL
Prerequisites
What is SQL?
SQL (Structured Query Language) is a declarative language for storing, manipulating and retreiving data in databases and many other data processing systems.
We write statement like queries in SQL to perform actions on the data in a data processing system.
Example: Query retreives all the records from a database table
SELECT * FROM <database_name>.<table_name>
SQL offers a simple interface to interact with data without having to write lengthy programs in any programming language. This has served wide veriety of data roles from various backgrounds well over four decades.
Why Data Engineers Need SQL?
Data engineering is a data aware role. Every data engineer functions as a data analyst first to understand the intricacies of the data. Data skills and domain knowledge about it is very crucial to this role.
Data engineering tools and processes have evolved quite a lot over the years. One tool kept its relevance through all these times i.e., SQL.
Since SQL offered a simple interface to deal with data, many data warehousing tools like Hive, Snowflake, BigQuery, and query engines like Apache Spark adopted SQL as one of their main interfaces.
SQL is used for DE tasks like
- Building data models
- Light-weight data cleaning
- Data transformations
- Data analysis and testing
- Building reports
Of course, SQL is not be-all and end-all solution for all data engineering tasks but it is arguably a very important skill in modern data engineering.
Important SQL Topics
SQL Topics |
-
Table/View Operations
CREATE
ALTER
TRUNCATE
DROP
-
Data Retrieval
SELECT
,DISTINCT
WHERE
clause withIN
,AND
,OR
,NOT
,LIKE
andBETWEEN
COUNT
,SUM
,MIN
,MAX
,AVG
-
Data Manipulation
INSERT
UPDATE
DELETE
MERGE
-
Aggregations
GROUP BY
ORDER BY
HAVING
-
Joins
INNER
LEFT
RIGHT
FULL OUTER
CROSS JOIN
ANTI JOIN
-
Windows/Analytics functions
SUM
,COUNT
,AVG
ROW_NUMBER
RANK
DENSE_RANK
LEAD
LAG
NTILE
-
Set Operations
UNION
MINUS
-
Others
CTE
UDF
- Stored Procedures
Note: This is not the complete list of all the topics in SQL for data engineers. It covers only necessary topics.
SQL Flavours
There are many flavours of SQL available for relational databases and query engines. ANSI standard is the universal standard for writing SQL queries.
Almost every database, query engine supports the ANSI SQL standard loosely. All the standard operations like SELECT
, INSERT
, UPDATE
and DELETE
should have the same syntax.
Database vendors like Oracle(PLSQL), Microsoft(T-SQL) and opensource databases like MySQL and PostgreSQL add a lot of features on top of the ANSI standard.
Resources
The below mentioned SQL playlist is a fantastic resource for folks with various experience levels alike. It uses Microsoft SQL Server and T-SQL for the tutorials.
For practice:
-
Hackerrank provides a good amount of problems with from easy to hard levels. This is good for beginners Check the below link.
Objective of learning SQL
SQL is a tool that is used across many levels and roles in the broader data domain. The problem statements can often come in plain language from technical and non-techincal folks alike.
As data engineers, we need to understand those problem statements in plain english and use SQL to quickly analyze the data and solve the problems. Ability to communicate back the results in simple language after the analysis is a huge plus.