Phase 1- SQL

By Mani Nekkalapudi GitHub

Prerequisites

What is SQL?

SQL (Structured Query Language) is a declarative language for storing, manipulating and retreiving data in databases and many other data processing systems.

We write statement like queries in SQL to perform actions on the data in a data processing system.

Example: Query retreives all the records from a database table

SELECT * FROM <database_name>.<table_name> 

SQL offers a simple interface to interact with data without having to write lengthy programs in any programming language. This has served wide veriety of data roles from various backgrounds well over four decades.

Why Data Engineers Need SQL?

Data engineering is a data aware role. Every data engineer functions as a data analyst first to understand the intricacies of the data. Data skills and domain knowledge about it is very crucial to this role.

Data engineering tools and processes have evolved quite a lot over the years. One tool kept its relevance through all these times i.e., SQL.

Since SQL offered a simple interface to deal with data, many data warehousing tools like Hive, Snowflake, BigQuery, and query engines like Apache Spark adopted SQL as one of their main interfaces.

SQL is used for DE tasks like

  1. Building data models
  2. Light-weight data cleaning
  3. Data transformations
  4. Data analysis and testing
  5. Building reports

Of course, SQL is not be-all and end-all solution for all data engineering tasks but it is arguably a very important skill in modern data engineering.

Important SQL Topics

Data Engineering Lifecycle
SQL Topics
  1. Table/View Operations

    • CREATE
    • ALTER
    • TRUNCATE
    • DROP
  2. Data Retrieval

    • SELECT, DISTINCT
    • WHERE clause with IN, AND, OR, NOT, LIKE and BETWEEN
    • COUNT, SUM, MIN, MAX, AVG
  3. Data Manipulation

    • INSERT
    • UPDATE
    • DELETE
    • MERGE
  4. Aggregations

    • GROUP BY
    • ORDER BY
    • HAVING
  5. Joins

    • INNER
    • LEFT
    • RIGHT
    • FULL OUTER
    • CROSS JOIN
    • ANTI JOIN
  6. Windows/Analytics functions

    • SUM, COUNT, AVG
    • ROW_NUMBER
    • RANK
    • DENSE_RANK
    • LEAD
    • LAG
    • NTILE
  7. Set Operations

    • UNION
    • MINUS
  8. Others

    • CTE
    • UDF
    • Stored Procedures

Note: This is not the complete list of all the topics in SQL for data engineers. It covers only necessary topics.

SQL Flavours

There are many flavours of SQL available for relational databases and query engines. ANSI standard is the universal standard for writing SQL queries.

Almost every database, query engine supports the ANSI SQL standard loosely. All the standard operations like SELECT, INSERT, UPDATE and DELETE should have the same syntax.

Database vendors like Oracle(PLSQL), Microsoft(T-SQL) and opensource databases like MySQL and PostgreSQL add a lot of features on top of the ANSI standard.

Resources

The below mentioned SQL playlist is a fantastic resource for folks with various experience levels alike. It uses Microsoft SQL Server and T-SQL for the tutorials.

For practice:

  • Hackerrank provides a good amount of problems with from easy to hard levels. This is good for beginners Check the below link.

    Hackerrank SQL problems

Objective of learning SQL

SQL is a tool that is used across many levels and roles in the broader data domain. The problem statements can often come in plain language from technical and non-techincal folks alike.

As data engineers, we need to understand those problem statements in plain english and use SQL to quickly analyze the data and solve the problems. Ability to communicate back the results in simple language after the analysis is a huge plus.