We all know that analytics is a game changer, and every data analyst has used SQL extensively for data extraction. Almost every analyst spends time understanding the data and then extracting it as required before performing the needed analysis. SQL deals with structured data, which explains its widespread use in data analysis. Let us now see how SQL helps in analysis.
- SQL is a simple language. It can be called a natural language for analyzing data sets because it is so easy to understand. Within SQL, the common steps are: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are the core operators of the language, and most people find these fundamental query constructs to be a straightforward, readable representation of everyday data analysis operations.
- SQL is a productive language for writing queries. Some think SQL is a bit of a beast. Remember that it is a declarative language and, as such, behaves entirely differently from object-oriented or even functional languages. What do I mean by a declarative language? It is one where you "just" declare the nature of the results you would like to get, not how your computer should compute those results. Isn't that wonderful?
SELECT first_name, last_name FROM employees WHERE salary > 100000
Easy to understand. You don't care where the employee records physically come from; you just want the ones with a decent salary.
This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same “query”, the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.
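To make the declarative-versus-procedural contrast concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The table, names, and salaries are made up for illustration; the point is that the imperative version spells out *how* to walk the data, while the SQL version states only *what* is wanted.

```python
import sqlite3

# Hypothetical employee records (names and salaries are illustrative).
rows = [("Ada", "Lovelace", 120000), ("Alan", "Turing", 95000),
        ("Grace", "Hopper", 130000)]

# Imperative style: we describe how to scan the data and build the result.
imperative = [(first, last) for first, last, salary in rows if salary > 100000]

# Declarative style: we state the result we want; the engine decides the how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
declarative = conn.execute(
    "SELECT first_name, last_name FROM employees WHERE salary > 100000").fetchall()

print(imperative == declarative)  # both approaches yield the same pairs
```

For a three-row list the difference is cosmetic, but as the "query" grows to joins and aggregations, the imperative code grows with it while the SQL stays a single statement.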
Now let us look at the set of SQL commands important for data analysis.
Every data set contains a set of columns and rows. When interrogating a data set, the first step is to determine which columns are of interest. When selecting columns, it is important to understand what is needed as output, since the further analysis is done on it.
SELECT first_name, last_name, Age, Address FROM employees
The next stage within the query framework is to specify the rows that are of interest. Since these cannot be identified in the same way as columns, using names, a different approach is used for row-based projections. This approach requires the use of filters – i.e. describing the attributes associated with row sets that are of interest. Within the SQL language row filtering is part of the WHERE clause syntax.
SELECT first_name, last_name, Age, Address FROM employees WHERE Salary > 100000
For business-driven queries this process of limiting rows is essential. A query can be structured to return a specified number or percentage of rows, starting with the first row after a given offset.
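A common way to express this is a LIMIT/OFFSET clause (the exact syntax varies by dialect; Oracle, for example, uses FETCH FIRST ... ROWS ONLY). Below is a runnable sketch with `sqlite3` and a made-up ten-row employees table.

```python
import sqlite3

# Illustrative table: ten employees with sequential ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (empno INT, sal INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, 50000 + i * 1000) for i in range(1, 11)])

# Return 3 rows, starting with the first row after an offset of 2.
page = conn.execute(
    "SELECT empno FROM employees ORDER BY empno LIMIT 3 OFFSET 2").fetchall()
print(page)  # → [(3,), (4,), (5,)]
```

Note the ORDER BY: without a defined ordering, "the first row after the offset" is not deterministic, so paging queries should always sort on a stable key.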
Most query operations require the use of at least two data sets and this necessitates a “join” operation. At a simplistic level, a join is used to combine the fields within two or more data sets in a single query, based on common logical relationships between those various data sets. There are a number of ways that data sets can be joined:
- Inner – returns all rows where there is at least one match in both tables
- Full Outer – returns all rows from both tables, with matching rows from both sides where available; where there is no match, the missing side contains nulls
- Outer left – returns all rows from the left table, and the matched rows from the right table
- Outer right – returns all rows from the right table, and the matched rows from the left table
- Cross – returns the Cartesian product of the source sets, i.e. every row from the left table paired with each row in the right table
The process of defining and executing a SQL join is simple.
SELECT d.deptno, COUNT(e.empno) FROM dept d LEFT OUTER JOIN emp e ON (e.deptno = d.deptno) GROUP BY d.deptno ORDER BY d.deptno;
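Here is the same dept/emp left outer join run end to end with `sqlite3`, using small made-up tables (column names follow the classic SCOTT schema, but the rows are invented for illustration). It shows why the join type matters: the left outer join keeps the department with no employees, which an inner join would drop.

```python
import sqlite3

# Illustrative dept and emp tables in the classic SCOTT shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (deptno INT, dname TEXT)")
conn.execute("CREATE TABLE emp (empno INT, deptno INT)")
conn.executemany("INSERT INTO dept VALUES (?, ?)",
                 [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "OPERATIONS")])
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

# LEFT OUTER JOIN keeps dept 30 even though it has no employees;
# COUNT(e.empno) ignores the NULL produced for the unmatched row.
result = conn.execute("""
    SELECT d.deptno, COUNT(e.empno)
    FROM dept d LEFT OUTER JOIN emp e ON e.deptno = d.deptno
    GROUP BY d.deptno ORDER BY d.deptno
""").fetchall()
print(result)  # → [(10, 2), (20, 1), (30, 0)]
```

Swapping LEFT OUTER JOIN for an inner JOIN in the same query would return only departments 10 and 20.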
Aggregation is an important step in the process of analyzing data sets. Most operational, strategic and discovery-led queries rely on summarizing detail-level data. Therefore, the ability to simply and efficiently aggregate data is a key requirement when selecting a language. If the aggregation process is correctly implemented, it can generate significant performance benefits, which creates new opportunities for organizations to boost their overall analysis and reporting capabilities.
The types of aggregation applied to a data set can range from simple counts and sums to moving averages and statistical measures such as standard deviations.
SELECT COUNT(empno) AS no_of_employees, SUM(sal) AS total_salary, AVG(sal) AS average_salary FROM emp;
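The aggregate query above can be run against a tiny illustrative emp table (the three salaries below are invented) to see all three functions at work in one statement:

```python
import sqlite3

# Illustrative emp table with three made-up salaries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INT, sal INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [(1, 90000), (2, 110000), (3, 100000)])

# COUNT, SUM and AVG collapse the whole table into a single summary row.
count, total, average = conn.execute(
    "SELECT COUNT(empno), SUM(sal), AVG(sal) FROM emp").fetchone()
print(count, total, average)  # → 3 300000 100000.0
```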
So we can summarize that SQL is used to merge and retrieve data and to perform grouped and nested queries. Following are some examples of analytics-specific uses of SQL:
- In SAS, PROC SQL lets us write SQL queries to query, update and manipulate data.
- In R, the sqldf package can be used for running SQL queries on data frames.
- In Python, the pandasql library allows you to query pandas DataFrames using SQL syntax.
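In the same spirit as sqldf and pandasql, Python's standard-library `sqlite3` lets you load in-memory records and query them with plain SQL, with no extra packages required. The sales records below are made up for illustration:

```python
import sqlite3

# Illustrative in-memory records: (region, sale amount).
records = [("North", 250), ("South", 180), ("North", 120)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", records)

# An ordinary GROUP BY aggregation over data that started life as a Python list.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(by_region)  # → [('North', 370), ('South', 180)]
```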
The primacy of SQL for big data is not simply a default choice, but a conscious recognition that SQL is the language best suited to basic analysis.
Share your views/opinions/comments with me in the comments section below.