Why SQL is a Backbone for Data Analysis?

We all know analytics is a game changer, and every data analyst has known or used SQL extensively for data extraction. Almost every data analyst spends time understanding the data and then extracting it, based on the requirement, to perform the needed analysis. SQL deals with structured data, and that is why it is so widespread for data analysis. Let us now see how SQL helps in analysis.

  • SQL is a simple language. It can be called a natural language for analyzing datasets because it is so easy to understand. Within SQL these are the common steps: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are the core operators of SQL. Most people find the fundamental SQL query constructs to be a straightforward and readable representation of everyday data analysis operations.
  • SQL is a productive language for writing queries. Some think that SQL is a bit of a beast. Remember that it is a DECLARATIVE language and, as such, behaves in an entirely different way from object-oriented or even functional languages. What do I mean by a declarative language? It is one where you “just” declare the nature of the results that you would like to get, not how your computer should compute those results. Isn’t that wonderful?
SELECT first_name, last_name FROM employees WHERE salary > 100000

Easy to understand. You don’t care where employee records physically come from. You just want those that have a decent salary.
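
As a small sketch of how the three core operators combine (assuming a department column on the employees table, used here purely for illustration):

-- Projection (SELECT), filter (WHERE) and aggregation (GROUP BY) in one query:
-- headcount and average salary of well-paid employees, per department.
-- The department column is an assumption for this example.
SELECT department,
       COUNT(*)    AS headcount,
       AVG(salary) AS average_salary
FROM employees
WHERE salary > 100000
GROUP BY department;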

This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same “query”, the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.

Now let us look at the SQL constructs that are important for data analysis.

  • Projections

Every data set contains a set of columns and rows. When interrogating a data set, the first step is to determine which columns are of interest. When selecting a set of columns it is important to understand what is needed as output, since the further analysis is done on it.

SELECT first_name, last_name, age, address FROM employees;
  • Filter 

The next stage within the query framework is to specify the rows that are of interest. Since these cannot be identified in the same way as columns, using names, a different approach is used for row-based projections. This approach requires the use of filters – i.e. describing the attributes associated with row sets that are of interest. Within the SQL language row filtering is part of the WHERE clause syntax.

SELECT first_name, last_name, age, address FROM employees WHERE salary > 100000;

For business-driven queries this process of limiting rows is essential. A query can also be structured to return a specified number or percentage of rows, starting with the first row after a given offset.
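
As a minimal sketch of such row limiting, assuming a database that supports the standard OFFSET ... FETCH syntax (other engines use LIMIT and OFFSET instead):

-- Skip the first 10 qualifying rows, then return the next 10, ordered by salary.
SELECT first_name, last_name, salary
FROM employees
WHERE salary > 100000
ORDER BY salary DESC
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY;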

  • Joins

Most query operations require the use of at least two data sets and this necessitates a “join” operation. At a simplistic level, a join is used to combine the fields within two or more data sets in a single query, based on common logical relationships between those various data sets. There are a number of ways that data sets can be joined:

  • Inner – returns only the rows where there is at least one match in both tables
  • Full outer – returns all rows from both tables, with matching rows from both sides where available; if there is no match, the missing side contains NULL
  • Left outer – returns all rows from the left table, and the matched rows from the right table
  • Right outer – returns all rows from the right table, and the matched rows from the left table
  • Cross – returns the Cartesian product of the source sets, i.e. every row from the left table paired with every row from the right table

The process of defining and executing a SQL join is simple.

SELECT
  d.deptno,
  COUNT(e.empno) AS no_of_employees
FROM dept d
LEFT OUTER JOIN emp e ON (e.deptno = d.deptno)
GROUP BY d.deptno
ORDER BY d.deptno;
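
For contrast, here is a sketch of the same query written as an inner join: departments with no matching employees drop out of the result entirely, whereas the left outer join above keeps them with a count of zero.

-- Inner join variant: only departments with at least one matching employee are returned.
SELECT
  d.deptno,
  COUNT(e.empno) AS no_of_employees
FROM dept d
INNER JOIN emp e ON (e.deptno = d.deptno)
GROUP BY d.deptno
ORDER BY d.deptno;
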
  • Aggregate 

Aggregation is an important step in the process of analyzing data sets. Most operational, strategic and discovery-led queries rely on summarizing detail-level data. If the aggregation process is correctly implemented, it can generate significant performance benefits, which creates new opportunities for organizations to boost their overall analysis and reporting capabilities.

The types of aggregation applied to a data set can vary from simple counts and sums to moving averages and statistical measures such as standard deviations. Therefore, the ability to simply and efficiently aggregate data is a key requirement for any analytical language.

SELECT
  COUNT(empno) AS no_of_employees,
  SUM(sal) AS total_salary,
  AVG(sal) AS average_salary
FROM emp;
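
The moving averages and standard deviations mentioned above are typically written with window functions. Here is a rough sketch, assuming a hiredate column on emp and an engine that supports standard window functions (AVG ... OVER, STDDEV ... OVER):

-- 3-row moving average of salary in hire-date order,
-- plus a per-department standard deviation of salary.
-- The hiredate column is assumed for this example.
SELECT
  deptno,
  empno,
  sal,
  AVG(sal) OVER (ORDER BY hiredate
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_salary,
  STDDEV(sal) OVER (PARTITION BY deptno) AS dept_salary_stddev
FROM emp;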

Thus we can summarize that SQL is used to merge and retrieve data and to perform grouped and nested queries. Following are some examples of analytics-specific uses of SQL:

  1. In SAS, using PROC SQL we can write SQL queries to query, update and manipulate data.
  2. In R, one can use the sqldf package to run SQL queries on data frames.
  3. In Python, the pandasql library allows you to query pandas DataFrames using SQL syntax.

The primacy of SQL for big data is not simply a default choice, but a conscious realization that SQL is the best-suited language for basic analysis.

Share your views/opinions/comments with me in the comments section below.

 

In Seconds: Google BigQuery

We all know that every second, every minute and every day we are generating enormous amounts of data through various activities. This data is BIG! To process such massive datasets we need technologies that are able to give results in seconds. To handle these datasets, Google introduced BigQuery, a fully managed, cloud-based interactive query service. It is used to analyze data containing billions of rows using a SQL-like syntax. Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.

Google BigQuery is based on Dremel, a technology pioneered by Google. Google calls BigQuery an “externalized version” of its home-brewed Dremel query service software. Dremel and BigQuery rely on columnar storage and a tree architecture for processing queries and aggregating results across clusters.

Columnar storage means that a record is separated into its column values and each value is stored on a different storage volume, whereas traditional databases normally store the whole record on one volume.

[Figure: Columnar storage of Dremel]

The tree architecture was designed to solve one of the challenges Google faced in designing Dremel: how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. The tree architecture enables massively parallel distribution of queries and fast aggregation of results from the leaves.

[Figure: Tree structure of Dremel]

Google released BigQuery so that its in-house product Dremel could be made publicly available to any business or developer. This release made it possible for those outside Google to utilize the power of Dremel for their big data processing requirements. BigQuery is thus a service for interactive analysis of massive datasets: we can query millions of rows with queries that take seconds to write and seconds to return. The service is accessed using a REST API. BigQuery is ‘real’ fast and can run ad-hoc queries on multi-terabyte data sets in seconds. BigQuery and Dremel share the same underlying architecture and performance characteristics, so by using BigQuery users can fully utilize the power of Dremel and take advantage of Google’s massive computational infrastructure. Thus we can say that Dremel is the backbone that provides the processing and runs across massive numbers of computers.
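
As an illustrative sketch, here is what a BigQuery query looks like, assuming access to the bigquery-public-data.samples.shakespeare sample table that Google makes publicly available (the table and column names are assumptions taken from that sample dataset, not from this post):

-- Total occurrences of each word across the Shakespeare sample table.
SELECT word, SUM(word_count) AS total_occurrences
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total_occurrences DESC
LIMIT 10;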

We can also use third-party tools with BigQuery for loading data, visualization and BI. For loading data we can use ETL tools like SQLStream, Informatica, Talend, etc., and for visualization and BI, Qlikview, Tableau, BIME, etc.

BigQuery has many advantages. The most important is its query processing speed. The next is cost: prior to the release of BigQuery, companies were spending hundreds of thousands of dollars or more to effectively query this amount of data, so in comparison BigQuery’s cost is drastically lower, and there are huge cost savings versus traditional data warehouse solutions. You can refer to the BigQuery pricing model (https://developers.google.com/bigquery/docs/pricing) for detailed price information.

To conclude, we can say that BigQuery is a query service that allows you to run SQL-like queries interactively against multiple terabytes of data in a matter of seconds.

Do let me know your thoughts in the comments below. If you like what you just read & want to continue your analytics learning, subscribe through email.

The Elephant in the Room: Hadoop’s Parallel World

Here I am today talking about the elephant: Hadoop. We are in a world today where exabytes of data are being generated every day. Consider the following statistics.

Every minute:

  • Facebook users share nearly 2.5 million pieces of content.
  • Twitter users tweet nearly 400,000 times.
  • Instagram users post nearly 220,000 new photos.
  • YouTube users upload 72 hours of new video content.
  • Apple users download nearly 50,000 apps.
  • Email users send over 200 million messages.
  • Amazon generates over $80,000 in online sales.

Isn’t it just too vast? And thus, to handle this amount of data, there must be technologies in place. In order to cope, Google invented a new style of data processing known as MapReduce. A year after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts in an open-source software framework supporting distribution for the Nutch search engine project. Apache Hadoop is one technology that has been the darling of big data talk: an open-source platform for the storage and processing of diverse data types that enables data-driven enterprises to derive complete value from all their data.

To understand Hadoop, we must understand two fundamental things about it: how Hadoop stores files, and how it processes data. Imagine we have a file larger than our PC’s capacity; we could not store that file, right? Hadoop lets us store files bigger than what can be stored on one particular node or server, so we can store very, very large files. It also lets us store many, many files.

The two critical components of Hadoop are:

1. The Hadoop Distributed File System (HDFS). HDFS is the storage system for a Hadoop cluster. When data lands in the cluster, HDFS breaks it into pieces and distributes those pieces among the different servers participating in the cluster.

[Figure source: hadoop.apache.org]

2. MapReduce. Because Hadoop stores the entire dataset in small pieces across a collection of servers, analytical jobs can be distributed, in parallel, to each of the servers storing part of the data. Each server evaluates the question against its local fragment simultaneously and reports its results back for collation into a comprehensive answer. MapReduce is the agent that distributes the work and collects the results: Map and Reduce are two functions, with a shuffle phase in between that is handled by the system.

Both HDFS and MapReduce are designed to continue working in the face of system failure. HDFS continually monitors the data stored on the cluster. If a server becomes unavailable, a disk drive fails, or data is damaged, whether due to hardware or software problems, HDFS automatically restores the data from one of the known good replicas stored elsewhere on the cluster. Likewise, when an analysis job is running, MapReduce monitors the progress of each of the servers participating in the job. If one of them fails before completing its work, MapReduce automatically starts another instance of that task on another server that has a copy of the data. Thus Hadoop provides scalable, reliable and fault-tolerant services for data storage and analysis at very low cost.

Do let me know your thoughts in comments below.

If you like what you just read & want to continue your analytics learning, subscribe through email.

Data Mining

Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find where the value resides.

What Can Data Mining Do?

Although data mining is still in its infancy, companies in a wide range of industries – including retail, finance, health care, manufacturing, transportation, and aerospace – are already using data mining tools and techniques to take advantage of historical data. By using pattern recognition technologies and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed.

For businesses, data mining is used to discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data mining include:

  • Market segmentation – Identify the common characteristics of customers who buy the same products from your company.
  • Customer churn – Predict which customers are likely to leave your company and go to a competitor.
  • Fraud detection – Identify which transactions are most likely to be fraudulent.
  • Direct marketing – Identify which prospects should be included in a mailing list to obtain the highest response rate.
  • Interactive marketing – Predict what each individual accessing a Web site is most likely interested in seeing.
  • Market basket analysis – Understand what products or services are commonly purchased together; e.g., beer and diapers.
  • Trend analysis – Reveal the difference between a typical customer this month and last.

Data Mining Technologies

The analytical techniques used in data mining are often well-known mathematical algorithms and techniques. What is new is the application of those techniques to general business problems made possible by the increased availability of data and inexpensive storage and processing power. Also, the use of graphical interfaces has led to tools becoming available that business experts can easily use.

Some of the tools used for data mining are:

Artificial neural networks – Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Decision trees – Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.

Rule induction – The extraction of useful if-then rules from data based on statistical significance.

Genetic algorithms – Optimization techniques based on the concepts of genetic combination, mutation, and natural selection.

Nearest neighbor – A classification technique that classifies each record based on the records most similar to it in an historical database.

How Data Mining Works

How is data mining able to tell you important things that you didn’t know, or what is going to happen next? The technique used to perform these feats is called modeling. Modeling is simply the act of building a model (a set of examples or a mathematical relationship) based on data from situations where the answer is known, and then applying the model to other situations where the answers aren’t known. Modeling techniques have been around for centuries, of course, but it is only recently that the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, have become available.

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on the segments of the population most likely to become big users of long distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From his existing database of customers, which contains information such as age, sex, credit history, income, zip code, occupation, etc., he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make lots of long distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 34 and 42 who make in excess of $60,000 per year. This, then, is his model for high-value customers, and he would budget his marketing efforts accordingly.

Do let me know your thoughts in comments below.

If you like what you just read & want to continue your analytics learning, subscribe through email.

 

Popular Analytics Tools

Business analytics is a fast-growing field and there are many tools available in the market to serve the needs of organizations. The range of analytical software goes from relatively simple statistical tools in spreadsheets (e.g. MS Excel) to statistical software packages (e.g. KXEN, Statistica) to sophisticated business intelligence suites (e.g. SAS, Oracle, SAP and IBM among the big players). Open source tools like R and Weka are also gaining popularity. Besides these, companies develop in-house tools designed for specific purposes.

Here is a list of the 10 most popular analytics tools.

MS Excel

MS Excel: Almost every business user has access to the MS Office suite and Excel. Excel is an excellent reporting and dashboarding tool. For most business projects, even if you run the heavy statistical analysis in different software, you will still end up using Excel for the reporting and presentation of results.

While most people are aware of its excellent reporting and graphing abilities, Excel can be a powerful analytic tool in the hands of an experienced user. The latest versions of Excel can handle tables with up to 1 million rows, making it a powerful yet versatile tool.

SAS

SAS: SAS is the 5000 pound gorilla of the analytics world and claims to be the largest independent vendor in the business intelligence market. It is the most commonly used software in the Indian analytics market despite its monopolistic pricing. SAS software has wide ranging capabilities from data management to advanced analytics.

SPSS Modeler

SPSS Modeler (Clementine): SPSS Modeler is a data mining software tool by SPSS Inc., an IBM company. It was originally named SPSS Clementine. This tool has an intuitive GUI and its point-and-click modelling capabilities are very comprehensive.

Statistica

Statistica: is a statistics and analytics software package developed by StatSoft. It provides data analysis, data management, data mining, and data visualization procedures. Statistica supports a wide variety of analytic techniques and is capable of meeting most needs of the business users. The GUI is not the most user-friendly and it may take a little more time to learn than some tools but it is a competitively priced product that is value for money.

Salford Systems

Salford systems: provides a host of predictive analytics and data mining tools for businesses. The company specialises in classification and regression tree algorithms. Its MARS algorithm was originally developed by world-renowned Stanford statistician and physicist, Jerome Friedman. The software is easy to use and learn.

KXEN

KXEN: is one of the few companies that is driving automated analytics. Their products, largely based on algorithms developed by the Russian mathematician Vladimir Vapnik, are easy to use, fast and can work with large amounts of data. Some users may not like the fact that KXEN works like a ‘black box’ and in most cases, it is difficult to understand and explain the results.

Angoss

Angoss: Like Salford systems, Angoss has developed its products around classification and regression decision tree algorithms. The advantage of this is that the tools are easy to learn and use, and the results easy to understand and explain. The GUI is very user friendly and a lot of features have been added over the years to make this a powerful tool.

MATLAB

MATLAB: is a numerical computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms and creation of user interfaces. There are many add-on toolboxes that extend MATLAB to specific areas of functionality, such as statistics, finance, image processing, bioinformatics, etc. MATLAB is not free software; however, there are clones like Octave and Scilab which are free and have similar functionality.

Open Source Software

R

R: R is a programming language and software environment for statistical computing and graphics. The R language is an open source tool and is widely used in academia. For business users, the programming language does represent a hurdle. However, there are many GUIs available that can sit on top of R and enhance its user-friendliness.

WEKA

Weka: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software, developed at the University of Waikato, New Zealand. Weka, along with R, is amongst the most popular open source software used by the business community. The software is written in the Java language and contains a GUI for interacting with data files and producing visual results and graphs.

Do let me know your thoughts in comments below.

If you like what you just read & want to continue your analytics learning, subscribe through email.

Application of Business Analytics

Business Analytics is a data-driven field. Data is an asset for companies, and thus they need to gain as many insights from it as possible. Successful business analytics depends on data quality, skilled analysts who understand the technologies and the business, and an organizational commitment to data-driven decision making. Analytics is not based on your gut but on deriving possible relationships within the available data.

Examples of BA uses include:

  • Exploring data to find new patterns and relationships (data mining)
  • Explaining why a certain result occurred (statistical analysis, quantitative analysis)
  • Experimenting to test previous decisions (A/B testing, multivariate testing)
  • Forecasting future results (predictive modeling, predictive analytics)

I shall talk about data mining, statistical techniques and predictive modelling in my further blogs. Let us primarily focus on the various applications of BA in this post.

Below is the list of industry applications of Business Analytics (Source: IBM):

  1. Automotive
  2. HealthCare
  3. Banking
  4. Insurance
  5. Retail
  6. Government
  7. Travel & Transportation
  8. Telecommunication

Thus BA can add a lot of value to a business in various spheres, from healthcare to telecommunications. It therefore becomes very important to harness the data available in these fields and gain the best possible insights.

To put it in other words, business analytics as a tool can be applied in any industry where data is captured and accessible. This data can be used for a variety of reasons, ranging from improving customer service and improving the organisation’s capability to predict fraud, to offering valuable insights on online and digital information.

Do let me know your thoughts in comments below.

If you like what you just read & want to continue your analytics learning, subscribe through email.

What is Big Data and Analytics?

Today, Big Data and Analytics are the words going around everywhere. Data is being poured onto the internet in huge amounts. Every day, we create 2.5 quintillion bytes of data: so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data!

Now let us understand big data and analytics.

Big data analytics refers to the process of collecting, organizing and analyzing large sets of data to discover patterns and other useful information. Not only will big data analytics help you to understand the information contained within the data, but it will also help identify the data that is most important to the business and future business decisions. Big data analysts basically want the knowledge that comes from analyzing the data.

Stay tuned. Will get back with some more interesting posts on analytics and techniques.

Leave a comment.

My Experience with SAS

So here is a post after a long time! Currently I am working on SAS 9.2. SAS is a software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS programming language. A SAS program has a DATA step, which retrieves and manipulates data, usually creating a SAS data set, and a PROC step, which analyzes the data.

So let us start with the basics of SAS:

Temporary and Permanent SAS Data Sets

SAS data sets can have a one-level name or a two-level name. Typically, names of temporary SAS data sets have only one level and are stored in the WORK data library. The WORK data library is defined automatically at the beginning of the SAS session and is automatically deleted at the end of the SAS session. Procedures assume that SAS data sets that are specified with a one-level name are to be read from or written to the WORK data library, unless you specify a USER data library. For example, the following PROC PRINT steps are equivalent. The second PROC PRINT step assumes that the TRIGGERS data set is in the WORK data library:

proc print data=work.triggers;
run;

proc print data=triggers; 
run;

The SAS system options WORK=, WORKINIT, and WORKTERM affect how you work with temporary and permanent libraries.

Typically, two-level names represent permanent SAS data sets. A two-level name takes the form libref.SAS-data-set. The libref identifies an external storage location that stores SAS data sets in your operating environment. A LIBNAME statement associates a libref with an external storage location. In the following PROC PRINT step, PROCLIB is the libref and EMP is the SAS data set within the library:

libname proclib 'SAS-data-library';
proc print data=proclib.emp;
run;

Let us further look at SAS procedures, which are an important part of SAS and help carry out statistical analysis. A procedure is invoked in a “PROC step”, which starts with the keyword PROC, such as:

proc sort data=people;
   by height weight;
run;

The above procedure tells SAS to sort the dataset PEOPLE by the variables height and weight.

Descriptive statistics

PROC CORR
Correlations among a set of variables.

PROC CORR DATA=SASdataset <options>;   /* options: NOMISS ALPHA */
   VAR variable(s);
   WITH variable(s);
RUN;
PROC FREQ
Frequency tables.

PROC FREQ DATA=SASdataset;
   TABLES variable(s) / options;   /* options: NOCOL NOROW NOPERCENT */
   OUTPUT OUT=SASdataset;
RUN;
PROC MEANS
Means, standard deviations, and a host of other univariate statistics for a set of variables.

PROC MEANS DATA=SASdataset <options>;   /* options: N MEAN STD MIN MAX SUM VAR CSS USS */
   VAR variable(s);
   BY variable(s);
   OUTPUT OUT=SASdataset keyword=variablename ... ;
RUN;
Further on, there are many procedures available in SAS which help with univariate analysis, linear models, utility procedures and so on.