Data Analysis Expressions (DAX)

Data Modeling with DAX – Overview

Decision makers in all organizations have recognized the need to analyze the historical data of their organization in particular, and of the industry in general. In today's competitive world, this is becoming crucial day by day in order to meet ever-changing business challenges.

Big Data and Business Intelligence have become buzzwords in the business world. Data sources have become huge and data formats have become varied. The need of the hour is to have simple-to-use tools that can handle the ever-flowing vast data in less time, to gain insights and make relevant decisions at the appropriate time.

Data analysts can no longer wait for the required data to be processed by the IT department. They require a handy tool that enables them to quickly comprehend the required data and make it available in a format that helps decision makers take the required action at the right time.

Microsoft Excel has a powerful tool called Power Pivot, which was available as an add-in in prior versions of Excel and is a built-in feature in Excel 2016. The database of Power Pivot, called the data model, and the formula language that works on the data model, called DAX (Data Analysis Expressions), enable an Excel user to perform tasks such as data modeling and analysis in no time.

In this tutorial, you will learn data modeling and analysis using DAX, based on the Power Pivot data model. A sample Profit and Loss database is used for the illustrations throughout this tutorial.

Data Modeling and Analysis Concepts

The data that you obtain from a variety of sources, termed raw data, needs processing before you can utilize it for analysis purposes. You will learn about this in the chapter − Data Modeling and Analysis Concepts.

Data Modeling and Analysis with Excel Power Pivot

Since the tool that you will be mastering in this tutorial is Excel Power Pivot, you need to know how the data modeling and analysis steps are carried out in Power Pivot. You will learn these at a broader level in the chapter – Data Modeling and Analysis with Excel Power Pivot.

As you proceed with the subsequent chapters, you will learn about the different facets of Power Pivot, DAX and DAX functions in data modeling and analysis.

By the end of the tutorial, you will be able to perform data modeling and analysis with DAX for any context at hand.

Data Modeling with DAX – Concepts

Business Intelligence (BI) is gaining importance in several fields and organizations. Decision making and forecasting based on historical data have become crucial in an ever-growing competitive world. A huge amount of data is available, both internally and externally, from diversified sources for any type of data analysis.

However, the challenge is to extract the relevant data from the available big data as per the current requirements, and to store it in a way that is amenable to projecting different insights from the data. A data model thus obtained, built with the key business terms, is a valuable communication tool. The data model also needs to provide a quick way of generating reports on an as-needed basis.

Data modeling for BI systems enables you to meet many of the data challenges.

Prerequisites for a Data Model for BI

A data model for BI should meet the requirements of the business for which data analysis is being done. Following are the minimum basics that any data model has to meet −

The data model needs to be Business Specific

A data model that is suitable for one line of business might not be suitable for a different line of business. Hence, the data model must be developed based on the specific business, the business terms used, the data types, and their relationships. It should be based on the objectives and the type of decisions made in the organization.

The data model needs to have built-in Intelligence

The data model should include built-in intelligence through metadata, hierarchies, and inheritances that facilitate an efficient and effective Business Intelligence process. With this, you will be able to provide a common platform for different users, eliminating repetition of the process.

The data model needs to be Robust

The data model should precisely present the data specific to the business. It should enable effective disk and memory storage so as to facilitate quick processing and reporting.

The data model needs to be Scalable

The data model should be able to accommodate the changing business scenarios in a quick and efficient way. New data or new data types might have to be included. Data refreshes might have to be handled effectively.

Data Modeling for BI

Data modeling for BI consists of the following steps −

  • Shaping the data
  • Loading the data
  • Defining the relationships between the tables
  • Defining data types
  • Creating new data insights

Shaping the Data

The data required to build a data model can be from various sources and can be in different formats. You need to determine which portion of the data from each of these data sources is required for specific data analysis. This is called Shaping the Data.

For example, if you are retrieving the data of all the employees in an organization, you need to decide what details of each employee are relevant to the current context. In other words, you need to determine which columns of the Employee table need to be imported. This is because the fewer the columns in a table in the data model, the faster the calculations on that table will be.

Loading the Data

You need to load the identified data – the data tables with the chosen columns in each of the tables.

Defining the Relationships Between Tables

Next, you need to define the logical relationships between the various tables that facilitate combining data from those tables. For example, if you have a table − Products − containing data about the products, and a table − Sales − with the various sales transactions of the products, then by defining a relationship between the two tables you can summarize the sales, product-wise.
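For instance, here is a minimal sketch of how such a relationship is used. The table and column names (Products[Product ID], Sales[Product ID], Sales[Sales Amount], Products[Product Name]) are assumed for illustration only and are not part of this tutorial's data model. With the two tables related on their Product ID columns, a measure such as −

Total Sales:=SUM(Sales[Sales Amount])

summarized in a PivotTable with Products[Product Name] on Rows gives the sales of each product, because the relationship propagates the product filter from the Products table to the Sales table.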

Defining Data Types

Identifying the appropriate data types for the data in the data model is crucial for the accuracy of calculations. For each column in each table that you have imported, you need to define the data type. For example, text data type, real number data type, integer data type, etc.

Creating New Data Insights

This is a crucial step in data modeling for BI. The data model that is built might have to be shared with several people who need to understand data trends and make the required decisions in a very short time. Hence, creating new data insights from the source data will be effective, avoiding rework on the analysis.

The new data insights can be in the form of metadata that can be easily understood and used by specific business people.

Data Analysis

Once the data model is ready, the data can be analyzed as per the requirement. Presenting the analysis results is also an important step because the decisions will be made based on the reports.

Data Modeling Using Excel Power Pivot

Microsoft Excel Power Pivot is an excellent tool for data modeling and analysis.

  • Data model is the Power Pivot database.
  • DAX is the formula language that can be used to create metadata with the data in the data model by means of DAX formulas (an example follows this list).
  • Power PivotTables in Excel, created with the data and metadata in the data model, enable you to analyze the data and present the results.
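For example, a measure is one such piece of metadata. It is defined in the data model with a DAX formula, like the following measure that is created later in this tutorial −

Actual Sum:=SUM('Finance Data'[Actual])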

In this tutorial, you will learn data modeling with Power Pivot data model and DAX and data analysis with Power Pivot. If you are new to Power Pivot, please refer to the Excel Power Pivot tutorial.

You have learnt the data modeling process steps in the previous chapter – Data Modeling and Analysis Concepts. In this chapter, you will learn how to execute each of those steps with Power Pivot data model and DAX.

In the following sections, you will learn each of these process steps as applied to Power Pivot data model and how DAX is used.

Shaping the Data

In Excel Power Pivot, you can import data from various types of data sources and while importing, you can view and choose the tables and columns that you want to import.

  • Identify the data sources.
  • Find the data source types. For example, database or data service or any other data source.
  • Decide on what data is relevant in the current context.
  • Decide on the appropriate data types for the data. In Power Pivot data model, you can have only one data type for the entire column in a table.
  • Identify which of the tables are the fact tables and which are the dimensional tables.
  • Decide on the relevant logical relationships between the tables.

Loading Data into the Data Model

You can load data into the data model with several options provided in the Power Pivot window on the Ribbon. You can find these options in the group, Get External Data.

Loading Data into Data Model

You will learn how to load data from an Access database into the data model in the chapter – Loading Data into the Data Model.

For illustration purposes, an Access database with Profit and Loss data is used.

Defining Data Types in the Data Model

The next step in the data modeling process in Power Pivot is defining data types of the columns in the tables that are loaded into the data model.

You will learn how to define data types of the columns in the tables in the chapter – Defining Data Types in the Data Model.

Creating Relationships Between the Tables

The next step in the data modeling process in Power Pivot is creating relationships between the tables in the data model.

You will learn how to create relationships between the tables in the chapter – Extending the Data Model.

Creating New Data Insights

In the data model, you can create metadata necessary for creating new data insights by −

  • Creating Calculated Columns
  • Creating Date Table
  • Creating Measures

You can then analyze the data by creating dynamic Power PivotTables that are based on the columns in the tables and measures that appear as fields in the PivotTable Fields list.

Adding Calculated Columns

Calculated columns in a table are the columns that you add to a table by using DAX formulas.

You will learn how to add calculated columns in a table in the data model in the chapter – Extending the Data Model.

Creating Date Table

To use Time Intelligence Functions in DAX formulas to create metadata, you require a Date table. If you are new to Date tables, please refer to the chapter – Understanding Date Tables.

You will learn how to create a Date table in the data model in the chapter – Extending the Data Model.

Creating Measures

You can create various measures in the Data table by using the DAX functions and DAX formulas for different calculations as required for data analysis in the current context.

This is the crucial step of data modeling with DAX.

You will learn how to create the measures for various purposes of profit and loss analysis in the subsequent chapters.

Analyzing Data with Power PivotTables

You can create Power PivotTables for each of the facets of profit and loss analysis. As you learn how to create measures using DAX in the subsequent chapters, you will also learn how to analyze data with these measures using Power PivotTables.

Loading Data into the Data Model

You can load data from different types of data sources into the data model. For this, you can find various options in the Get External Data group on the Ribbon in the Power Pivot window.

Various Options

As you can observe, you can load data from databases, or from data services or several other types of data sources.

When you load data from a data source into the data model, a connection will be established with the data source. This enables data refresh when the source data changes.

Initiating with a New Data Model

In this section, you will learn how to model the data for profit and loss analysis. The data for analysis is in a Microsoft Access database.

You can initiate a new data model as follows −

  • Open a new Excel workbook
  • Click the PowerPivot tab on the Ribbon
  • Click Manage in the Data Model group
Manage Power Pivot

The Power Pivot window appears. The window will be blank as you have not yet loaded any data.

Loading Data from Access Database into the Data Model

To load the data from the Access database, carry out the following steps −

  • Click From Database in the Get External Data group on the Ribbon.
  • Click From Access in the dropdown list.
Access

Table Import Wizard dialog box appears.

  • Browse to the Access file.
  • Give a friendly name for the connection.
Table Import Wizard

  • Click the Next button. The next part of the Table Import Wizard appears.
  • In the Table Import Wizard, select the option – Select from a list of tables and views to choose the data to import.
Choose how to import the data

  • Click the Next button. The next part of the Table Import Wizard appears as shown in the following screenshot.
  • Select all the tables.
  • Give friendly names to the tables. This is necessary because these names appear in the Power PivotTables and hence should be understood by everyone.
Finance Data

Choosing the Columns in the Tables

You might not require all the columns in the selected tables for the current analysis. Hence, you need to select only the columns that you identified while shaping the data.

  • Click the Preview & Filter button. The next part of the Table Import Wizard − Preview of the selected table − appears.
Preview Selected Table

  • As seen in the above screenshot, the column headers have check boxes. Select the columns you want to import in the selected table.
  • Click OK. Repeat the same for the other tables.

Importing Data into the Data Model

You are at the last stage of loading data into the data model. Click the Finish button in the Table Import Wizard. The next part of the Table Import Wizard appears.

The importing status will be displayed. The status finally displays Success when data loading is complete.

Importing

Viewing the Data in the Data Model

The imported tables appear in the Power Pivot window. This is the view of the data model.

Calculation Area, Tabs and Data

You can observe the following −

  • Each of the tables appears in a separate tab.
  • The tab names are the respective table names.
  • The area below the data is for the calculations.

Viewing the Connection Name

Click Existing Connections in the Get External Data group. The Existing Connections dialog box appears as shown in the following screenshot.

Existing Connections

As seen in the above screenshot, the connection name given appears under PowerPivot Data Connections.

Defining Data Types in the Data Model

In the Power Pivot data model, the entire data in a column must be of the same data type. To accomplish accurate calculations, you need to ensure that the data type of each column in each table in the data model is as per requirement.

Tables in the Data Model

In the data model created in the previous chapter, there are 3 tables −

  • Accounts
  • Geography Locn
  • Finance Data

Ensuring Appropriate Data Types

To ensure that the columns in the tables are as required, you need to check their data types in the Power Pivot window.

  • Click a column in a table.
  • Note the data type of the column as displayed on the Ribbon in the Formatting group.
Appropriate Data Types

If the data type of the selected column is not appropriate, change the data type as follows.

  • Click the down arrow next to the data type in the Formatting group.
  • Click the appropriate data type in the dropdown list.
  • Repeat for every column in all the tables in the data model.
Tables in Data Model

Columns in the Accounts Table

In the Accounts table, you have the following columns −

  • Account − Contains one account number for each row. The column has unique values and is used in defining the relationship with the Finance Data table.
  • Class − The class associated with each account. Example − Expenses, Net Revenue, etc.
  • Sub Class − Describes the type of expense or revenue. Example − People.

All the columns in the Accounts table are descriptive in nature and hence are of the Text data type.

Columns in the Geography Locn Table

The Geography Locn table contains data about each Profit Center.

The column Profit Center contains one profit center identity for each row. This column has unique values and is used in defining the relationship with the Finance Data table.

Columns in the Finance Data Table

In the Finance Data table, you have the following columns −

  • Fiscal Month − Month and Year (Text)
  • Profit Center − Profit Center identity (Text)
  • Account − Account number. Each account can have multiple Profit Centers. (Text)
  • Budget − Monthly budget amounts for each Profit Center (Currency)
  • Actual − Monthly actual amounts for each Profit Center (Currency)
  • Forecast − Monthly forecast amounts for each Profit Center (Currency)
  • Actual People − Month-end actual number of employees for each Profit Center of each People account (Whole Number)
  • Budget People − Month-end budget number of employees for each Profit Center of each People account (Whole Number)
  • Forecast People − Month-end forecast number of employees for each Profit Center of each People account (Whole Number)

Types of Tables in the Data Model

Both the Accounts and Geography Locn tables are dimensional tables, also called lookup tables.

The Finance Data table is the fact table, also known as the data table. The Finance Data table contains the data required for the profit and loss analysis calculations. You will also create metadata in the form of measures and calculated columns in this Finance Data table, so as to model the data for various types of profit and loss calculations, as you proceed with this tutorial.

Understanding Date Tables

Data Analysis involves browsing data over time and making calculations across time periods. For example, you might have to compare the current year’s profits with the previous year’s profits. Similarly, you might have to forecast the growth and profits in the coming years. For these, you need to use grouping and aggregations over a period of time.

DAX provides several Time Intelligence functions that help you perform most of such calculations. However, these DAX functions require a Date table for usage with the other tables in the data model.

You can either import a Date table along with other data from a data source or you can create a Date table by yourself in the data model.

In this chapter, you will understand different aspects of Date tables. If you are conversant with Date tables in the Power Pivot data model, you can skip this chapter and proceed with the subsequent chapters. Otherwise, this chapter introduces Date tables in the Power Pivot data model.

What is a Date Table?

A Date Table is a table in a data model, with at least one column of contiguous dates across a required duration. It can have additional columns representing different time periods. However, what is necessary is the column of contiguous dates, as required by the DAX Time Intelligence functions.

For example,

  • A Date table can have columns such as Date, Fiscal Month, Fiscal Quarter, and Fiscal Year.
  • A Date table can have columns such as Date, Month, Quarter, and Year.

Date Table with Contiguous Dates

Suppose you are required to make calculations in the range of a calendar year. Then, the Date table must have at least one column with a contiguous set of dates, including all the dates in that specific calendar year.

For example, suppose the data you want to browse has dates from April 1st, 2014 through November 30th, 2016.

  • If you have to report on a calendar year, you need a Date table with a column – Date, which contains all the dates from January 1st, 2014 to December 31st, 2016 in a sequence.
  • If you have to report on a fiscal year, and your fiscal year end is 30th June, you need a Date table with a column – Date, which contains all the dates from July 1st, 2013 to June 30th, 2017 in a sequence.
  • If you have to report on both calendar and fiscal years, then you can have a single Date table spanning the required range of dates.

Your Date table must contain all of the days for the range of every year in the given duration. Thus, you will get contiguous dates within that period of time.

If you refresh your data regularly, it is advisable to extend the end date by a year or two, so that you do not have to update your Date table often.

A Date table looks like the following screenshot.

Date Table

Adding a Date Table to the Data Model

You can add a Date table to the data model in any of the following ways −

  • Importing from a relational database, or any other data source.
  • Creating a Date table in Excel and then copying or linking to a new table in Power Pivot.
  • Importing from Microsoft Azure Marketplace.

Creating a Date Table in Excel and Copying to the Data Model

Creating a Date table in Excel and copying it to the data model is the easiest and most flexible way of creating a Date table in the data model.

  • Open a new worksheet in Excel.
  • Type Date in the first row of a column.
  • Type the first date in the date range that you want to create in the second row in the same column.
  • Select the cell, click the fill handle and drag it down to create a column of contiguous dates in the required date range.

For example, type 1/1/2014, click the fill handle and drag down to fill the contiguous dates up to 31/12/2016.

  • Click the Date column.
  • Click the INSERT tab on the Ribbon.
  • Click Table.
  • Verify the table range.
  • Click OK.

The table of a single column of dates is ready in Excel.

Single Column of Dates

  • Select the table.
  • Click Copy on the Ribbon.
  • Click the Power Pivot window.
  • Click Paste on the Ribbon.
Paste

Clicking Paste adds the contents of the clipboard to a new table in the data model. Hence, you can use the same method to create a Date table in an existing data model also.

Paste preview dialog box appears as shown in the following screenshot.

  • Type Date in the Table Name box.
  • Preview the data.
Paste Preview

  • Check the box – Use first row as column headers.
  • Click OK.

This copies the contents of the clipboard to a new table in the data model.

Now, you have a Date table in the data model with a single column of contiguous dates. The header of the column is Date as you had given in the Excel table.

Adding New Date Columns to the Date Table

Next, you can add calculated columns to the Date table as per the requirement for your calculations.

For example, you can add columns – Day, Month, Year, and Quarter as follows −

  • Day=DAY('Date'[Date])
  • Month=MONTH('Date'[Date])
  • Year=YEAR('Date'[Date])
  • Quarter=CONCATENATE("QTR ", INT(('Date'[Month]+2)/3))
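For example, for a date in May, 'Date'[Month] is 5, and INT((5 + 2)/3) = INT(2.33) = 2, so the Quarter column contains "QTR 2".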

The resulting Date table in the data model looks like the following screenshot.

Resulting Date table

Thus, you can add any number of calculated columns to the Date table. What is important and is required is that the Date table must have a column of contiguous dates that spans the duration of time over which you perform calculations.

Creating a Date Table for a Calendar Year

A calendar year typically includes the dates from 1st January to 31st December of a year and also includes the holidays marked for that particular year. When you perform calculations, you might have to take into account only the working days, excluding weekends and holidays.

Suppose, you want to create a Date table for the calendar year 2017.

  • Create an Excel table with a column Date, consisting of contiguous dates from 1st January 2017 to 31st December 2017. (Refer to the previous section to know how to do this.)
  • Copy the Excel table and paste it into a new table in the data model. (Refer to the previous section to know how to do this.)
  • Name the table as Calendar.
  • Add the following calculated columns −
    • Day =DAY('Calendar'[Date])
    • Month =MONTH('Calendar'[Date])
    • Year =YEAR('Calendar'[Date])
    • Day of Week =FORMAT('Calendar'[Date], "DDD")
    • Month Name =FORMAT('Calendar'[Date], "MMM")
Adding Calculated Columns

Adding Holidays to the Calendar Table

Add holidays to the Calendar Table as follows −

  • Get the list of declared holidays for the year.
  • For example, for the US, you can get the list of holidays for any required year from the following link http://www.calendar-365.com/.
  • Copy and paste them into an Excel worksheet.
  • Copy the Excel table and paste it into a new table in the data model.
  • Name the table as Holidays.
Name Table Holidays

  • Next, you can add a calculated column of holidays to the Calendar table using the DAX LOOKUPVALUE function.

=LOOKUPVALUE(Holidays[Holiday], Holidays[Date], 'Calendar'[Date])

The DAX LOOKUPVALUE function searches for the third parameter, i.e. 'Calendar'[Date], in the second parameter, i.e. Holidays[Date], and returns the first parameter, i.e. Holidays[Holiday], if there is a match. The result will look like what is shown in the following screenshot.
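For example (hypothetical values), if the Holidays table has a row with Date = 1/1/2017 and Holiday = "New Year's Day", the calculated column returns "New Year's Day" for the Calendar row with Date = 1/1/2017, and a blank for the dates that do not appear in the Holidays table.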

Lookupvalue

Adding Columns to a Fiscal Year

A fiscal year typically includes the dates from the 1st of the month following the fiscal year end up to the next fiscal year end. For example, if the fiscal year end is 31st March, then the fiscal year ranges from 1st April to 31st March of the following year.

You can include the fiscal time periods in the calendar table using the DAX formulas −

  • Add a measure for FYE

    FYE:=3

  • Add the following calculated columns −
    • Fiscal Year

      =IF('Calendar'[Month]<='Calendar'[FYE],'Calendar'[Year],'Calendar'[Year]+1)

    • Fiscal Month

      =IF('Calendar'[Month]<='Calendar'[FYE],12-'Calendar'[FYE]+'Calendar'[Month],'Calendar'[Month]-'Calendar'[FYE])

    • Fiscal Quarter

      =INT(('Calendar'[Fiscal Month]+2)/3)

Adding Columns to Fiscal Year

Setting the Date Table Property

When you use DAX Time Intelligence functions such as TOTALYTD, PREVIOUSMONTH, and DATESBETWEEN, they require metadata to work correctly. Date Table Property sets such metadata.

To set the Date Table property −

  • Select Calendar table in the Power Pivot window.
  • Click the Design tab on the Ribbon.
  • Click Mark as Date Table in the Calendars group.
  • Click Mark as Date Table in the dropdown list.
Setting the Date Table Property

Mark as Date Table dialog box appears. Select the Date column in the Calendar table. This has to be the column of Date data type and has to have unique values. Click OK.

Mark as Date Table

Extending the Data Model

In this chapter, you will learn how to extend the data model created in the previous chapters. Extending a data model includes −

  • Addition of tables
  • Addition of calculated columns in an existing table
  • Creation of measures in an existing table

Of these, creating the measures is crucial, as it provides new data insights in the data model that enable those using the data model to avoid rework and to save time while analyzing the data and making decisions.

As Profit and Loss Analysis involves working with time periods and you will be using DAX Time Intelligence functions, you require a Date table in the data model.

If you are new to Date tables, go through the chapter – Understanding Date Tables.

You can extend the data model as follows −

  • To create a relationship between the data table, i.e. Finance Data table and the Date table, you need to create a calculated column Date in the Finance Data table.
  • To perform different types of calculations, you need to create relationships between the data table – Finance Data and the lookup tables – Accounts and Geography Locn.
  • You need to create various measures that help you perform several calculations and carry out the required analysis.

These steps essentially constitute the data modeling steps for Profit and Loss Analysis using the data model. The same sequence of steps applies to any type of data analysis that you want to perform with the Power Pivot data model.

Further, you will learn how to create the measures and how to use them in the Power PivotTables in the subsequent chapters. This will give you sufficient understanding of data modeling with DAX and data analysis with Power PivotTables.

Adding a Date Table to the Data Model

Create a Date table for the time periods spanning the fiscal years as follows −

  • Create a table with a single column with header – Date and contiguous dates ranging from 7/1/2011 to 6/30/2018 in a new Excel worksheet.
  • Copy the table from Excel and paste it into the Power Pivot window. This will create a new table in the Power Pivot data model.
  • Name the table as Date.
  • Ensure that the Date column in the Date table is of data type – Date (DateTime).

Next, you need to add the calculated columns – Fiscal Year, Fiscal Quarter, Fiscal Month and Month to the Date table as follows −

Fiscal Year

Suppose the fiscal year end is June 30th. Then, a fiscal year spans from 1st July to 30th June. For example, the period July 1st, 2011 (7/1/2011) to June 30th, 2012 (6/30/2012) will be the fiscal year 2012.

In the Date table, suppose you want to represent the same as FY2012.

  • You need to first extract the financial year part of the Date and append it with FY.
    • For the dates in the months July 2011 to December 2011, the financial year is 1+2011.
    • For the dates in the months January 2012 to June 2012, the financial year is 0+2012.
    • To generalize, if the month of the financial year end is FYE (here FYE = 6, for a June 30 year end), do the following −

      Integer Part of ((Month – 1)/FYE) + Year

    • Next, take the rightmost 4 characters to obtain the Financial Year.
  • In DAX, you can represent the same as −

    RIGHT(INT((MONTH('Date'[Date])-1)/'Date'[FYE])+YEAR('Date'[Date]),4)

  • Add the calculated column Fiscal Year in the Date table with the following DAX formula. (This assumes an FYE measure − FYE:=6 − defined in the Date table, analogous to the FYE measure added to the Calendar table earlier.)

    ="FY"&RIGHT(INT((MONTH('Date'[Date])-1)/'Date'[FYE])+YEAR('Date'[Date]),4)
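For example, for the date 10/15/2011: MONTH returns 10, INT((10-1)/6) = 1, 1 + 2011 = 2012, and RIGHT(2012, 4) gives "2012", so the Fiscal Year column contains "FY2012". This is correct, because October 2011 falls in the fiscal year July 2011 to June 2012.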

Fiscal Quarter

If FYE represents the month of financial year end, the financial quarter is obtained as

Integer Part of ((Remainder of ((Month+FYE-1)/12) + 3)/3)

  • In DAX, you can represent the same as −

    INT((MOD(MONTH('Date'[Date])+'Date'[FYE]-1,12)+3)/3)

  • Add the calculated column Fiscal Quarter in the Date table with the DAX formula −

    ='Date'[Fiscal Year]&"-Q"&FORMAT(INT((MOD(MONTH('Date'[Date])+'Date'[FYE]-1,12)+3)/3), "0")
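For example, for 10/15/2011: MOD(10 + 6 - 1, 12) = 3 and INT((3 + 3)/3) = 2, so the Fiscal Quarter column contains "FY2012-Q2" (October to December is the second quarter of a July-to-June fiscal year).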

Fiscal Month

If FYE represents the financial year end, the financial month period is obtained as

(Remainder of (Month+FYE-1)/12) + 1

  • In DAX, you can represent the same as −

    MOD(MONTH('Date'[Date])+'Date'[FYE]-1,12)+1

  • Add the calculated column Fiscal Month in the Date table with the DAX formula −

    ='Date'[Fiscal Year]&"-P"&FORMAT(MOD(MONTH([Date])+[FYE]-1,12)+1,"00")

Month

Finally, add the calculated column Month that represents the month number in a financial year as follows −

=FORMAT(MOD(MONTH([Date])+[FYE]-1,12)+1,"00") & "-" & FORMAT([Date],"mmm")
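For example, for 10/15/2011: MOD(10 + 6 - 1, 12) + 1 = 4, so the Fiscal Month column contains "FY2012-P04" and the Month column contains "04-Oct".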

The resulting Date table looks like the following screenshot.

Month

Mark the table – Date as Date Table with the column – Date as the column with unique values as shown in the following screenshot.

Mark as the table

Adding Calculated Columns

To create a relationship between the Finance Data table and the Date table, you require a column of Date values in the Finance Data table.

  • Add a calculated column Date in the Finance Data table with the DAX formula −

    =DATEVALUE('Finance Data'[Fiscal Month])
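For example, the Fiscal Month column is text containing the month and year. Assuming it stores values such as "July 2011" (the exact text format is not shown in this tutorial), DATEVALUE("July 2011") returns the date 7/1/2011, so every row of the Finance Data table gets the first date of its month, which can then be related to the Date column of the Date table.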

Defining Relationships Between Tables in the Data Model

You have the following tables in the data model −

  • Data table – Finance Data
  • Lookup tables – Accounts and Geography Locn
  • Date table – Date

To define relationships between the tables in the data model, follow these steps −

  • View the tables in the Diagram View of the Power Pivot window.
  • Create the following relationships between the tables −
    • Relationship between Finance Data table and Accounts table with the column Account.
    • Relationship between Finance Data table and Geography Locn table with the column Profit Center.
    • Relationship between Finance Data table and Date table with the column Date.
Relationships

Hiding Columns from Client Tools

If there are any columns in a data table that you won’t be using as fields in any PivotTable, you can hide them in the data model. Then, they will not be visible in the PivotTable Fields list.

In the Finance Data table, you have 4 columns – Fiscal Month, Date, Account and Profit Center that you won’t be using as fields in any PivotTable. Hence, you can hide them so that they do not appear in the PivotTable Fields list.

  • Select the columns – Fiscal Month, Date, Account, and Profit Center in the Finance Data table.
  • Right-click and select Hide from Client Tools in the dropdown list.
Hiding Columns from Client Tools

Creating Measures in the Tables

You are all set for data modeling and analysis with DAX using the data model and Power PivotTables.

In the subsequent chapters, you will learn how to create measures and how to use them in Power PivotTables. You will create all the measures in the data table, i.e. Finance Data table.

You will create measures using DAX formulas in the data table – Finance Data, which you can use in any number of PivotTables for the data analysis. The measures are essentially the metadata. Creating measures in the data table is part of data modeling and summarizing them in the Power PivotTables is part of data analysis.

Base Finance Measures and Analysis

You can create various measures in the data model to be used in any number of Power PivotTables. This forms the data modeling and analysis process with the data model using DAX.

As you have learnt earlier in the previous sections, data modeling and analysis is dependent on specific business and context. In this chapter, you will learn data modeling and analysis based on a sample Profit and Loss database to understand how to create the required measures and use them in various Power PivotTables.

You can apply the same method for data modeling and analysis for any business and context.

Creating Measures Based on Finance Data

To create any financial report, you need to make calculations of amounts for a particular time period, organization, account, or geographical location. You also need to perform the headcount and cost per headcount calculations. In the data model, you can create base measures that can be reused in creating other measures. This is an effective way of data modeling with DAX.

In order to perform calculations for profit and loss data analysis, you can create measures such as sum, year-over-year, year-to-date, quarter-to-date, variance, headcount, cost per headcount, etc. You can use these measures in the Power PivotTables to analyze the data and report the analysis results.

In the following sections, you will learn how to create the base finance measures and analyze data with those measures. The measures are termed as base measures as they can be used in creating other financial measures. You will also learn how to create measures for the previous time periods and use them in the analysis.

Creating Base Finance Measures

In the finance data analysis, budget and forecast play a major role.

Budget

A budget is an estimate of a company’s revenues and expenses for a financial year. The budget is calculated at the beginning of a financial year keeping in view the company’s goals and targets. Budget measures need to be analyzed from time to time during the financial year, as the market conditions may change and the company may have to align its goals and targets to the current trends in the industry.

Forecast

A financial forecast is an estimate of a company’s future financial outcomes by examining the company’s historical data of revenues and expenses. You can use financial forecasting for the following −

  • To determine how to allocate budget for a future period.
  • To track the expected performance of the company.
  • To take timely decisions to address shortfalls against the targets, or to maximize an emerging opportunity.

Actuals

To perform the budgeting and forecasting calculations, you require the actual revenue and expenses at any point in time.

You can create the following 3 base finance measures that can be used in creating other financial measures in the data model −

  • Budget Sum
  • Actual Sum
  • Forecast Sum

These measures are the aggregation sums over the columns – Budget, Actual, and Forecast in the Finance Data table.

Create the base finance measures as follows −

Budget Sum

Budget Sum:=SUM('Finance Data'[Budget])

Actual Sum

Actual Sum:=SUM('Finance Data'[Actual])

Forecast Sum

Forecast Sum:=SUM('Finance Data'[Forecast])

Analyzing Data with Base Finance Measures

With the base finance measures and the Date table, you can perform your analysis as follows −

  • Create a Power PivotTable.
  • Add the field Fiscal Year from the Date table to Rows.
  • Add the measures Budget Sum, Actual Sum, and Forecast Sum (that appear as fields in the PivotTable Fields list) to Values.
PivotTable Fields

Creating Finance Measures for Previous Periods

With the three base finance measures and the Date table, you can create other finance measures.

Suppose you want to compare the Actual Sum of a Quarter with the Actual Sum of previous Quarter. You can create the measure – Prior Quarter Actual Sum.

Prior Quarter Actual Sum:=CALCULATE([Actual Sum], DATEADD('Date'[Date],-1,QUARTER))

Similarly, you can create the measure – Prior Year Actual Sum.

Prior Year Actual Sum:=CALCULATE([Actual Sum], DATEADD('Date'[Date],-1,YEAR))

Analyzing Data with Finance Measures for Previous Periods

With the base measures, measures for previous periods and the Date table, you can perform your analysis as follows −

  • Create a Power PivotTable.
  • Add the field Fiscal Quarter from the Date table to Rows.
  • Add the measures Actual Sum and Prior Quarter Actual Sum to Values.
  • Create another Power PivotTable.
  • Add the field Fiscal Year from the Date table to Rows.
  • Add the measures Actual Sum and Prior Year Actual Sum to Values.
Actual Sum

YoY Finance Measures and Analysis

Year-over-Year (YoY) is a measure of growth. It is obtained by subtracting the actual sum of the previous year from the actual sum.

If the result is positive, it reflects an increase in actual, and if it is negative, it reflects a decrease in actual, i.e. if we calculate year-over-year as −

year-over-year = (actual sum – prior year actual sum)

  • If the actual sum > the prior year actual sum, year-over-year will be positive.
  • If the actual sum < the prior year actual sum, year-over-year will be negative.

In the financial data, accounts such as the expense accounts will have debit (positive) amounts and the revenue accounts will have credit (negative) amounts. Hence, for the expense accounts, the above formula works fine.

However, for the revenue accounts, it should be the reverse, i.e.

  • If the actual sum > the prior year actual sum, year-over-year should be negative.
  • If the actual sum < the prior year actual sum, year-over-year should be positive.

Hence for the revenue accounts, you have to calculate year-over-year as −

year-over-year = -(actual sum – prior year actual sum)
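For example (hypothetical amounts), for an expense account with an actual sum of 120 and a prior year actual sum of 100, year-over-year = 120 - 100 = 20, reflecting an increase in expenses. For a revenue account the amounts are credits, say an actual sum of -120 and a prior year actual sum of -100; then year-over-year = -(-120 - (-100)) = 20, correctly showing the increase in revenue as a positive value.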

Creating Year-over-Year Measure

You can create Year-over-Year measure with the following DAX formula −

YoY:=IF(CONTAINS(Accounts, Accounts[Class],"Net Revenue"),-([Actual Sum]-[Prior Year Actual Sum]), [Actual Sum]-[Prior Year Actual Sum])

In the above DAX formula −

  • The DAX CONTAINS function returns TRUE if a row has "Net Revenue" in the column Class in the Accounts table.
  • The DAX IF function then returns -([Actual Sum]-[Prior Year Actual Sum]).
  • Otherwise, the DAX IF function returns [Actual Sum]-[Prior Year Actual Sum].

Creating Year-over-Year Percentage Measure

You can represent Year-over-Year as a percentage with the ratio −

(YoY) / (Prior Year Actual Sum)

You can create the Year-over-Year Percentage measure with the following DAX formula −

YoY %:=IF([Prior Year Actual Sum], [YoY] / ABS([Prior Year Actual Sum]),BLANK())

The DAX IF function is used in the above formula to ensure that there is no division by zero.

Analyzing Data with Year-over-Year Measures

Create a Power PivotTable as follows −

  • Add the fields Class and Sub Class from the Accounts table to Rows.
  • Add the measures – Actual Sum, Prior Year Actual Sum, YoY and YoY % to Values.
  • Insert a Slicer on the field Fiscal Year from the Date table.
  • Select FY2016 in the Slicer.
Row Labels

Creating Budget Year-over-Year Measure

You can create Budget Year-over-Year measure as follows −

Budget YoY:=IF(CONTAINS(Accounts,Accounts[Class],"Net Revenue"), -([Budget Sum] - [Prior Year Actual Sum]), [Budget Sum] - [Prior Year Actual Sum])

Creating Budget Year-over-Year Percentage Measure

You can create Budget Year-over-Year Percentage measure as follows −

Budget YoY %:=IF([Prior Year Actual Sum], [Budget YoY] / ABS([Prior Year Actual Sum]), BLANK())

Analyzing Data with Budget Year-over-Year Measures

Create a Power PivotTable as follows −

  • Add the fields Class and Sub Class from the Accounts table to Rows.
  • Add the measures – Budget Sum, Prior Year Actual Sum, Budget YoY and Budget YoY % to Values.
  • Insert a Slicer on the field Fiscal Year from the Date table.
  • Select FY2016 in the Slicer.
FY2016

Creating Forecast Year-over-Year Measure

You can create Forecast Year-over-Year measure as follows −

Forecast YoY:=IF(CONTAINS(Accounts,Accounts[Class],"Net Revenue"), -([Forecast Sum] - [Prior Year Actual Sum]), [Forecast Sum] - [Prior Year Actual Sum])

Creating Forecast Year-over-Year Percentage Measure

You can create Forecast Year-over-Year Percentage measure as follows −

Forecast YoY %:=IF([Prior Year Actual Sum],[Forecast YoY]/ABS([Prior Year Actual Sum]),BLANK())

Analyzing Data with Forecast Year-over-Year Measures

Create a Power PivotTable as follows −

  • Add the fields Class and Sub Class from the Accounts table to Rows.
  • Add the measures – Forecast Sum, Prior Year Actual Sum, Forecast YoY and Forecast YoY % to Values.
  • Insert a Slicer on the field Fiscal Year from the Date table.
  • Select FY2016 in the Slicer.
Year-over-Year Measures

Variance Measures and Analysis

You can create variance measures such as variance to budget, variance to forecast, and forecast variance to budget. You can also analyze the finance data based on these measures.

Creating Variance to Budget Sum Measure

Create Variance to Budget Sum measure (VTB Sum) as follows −

VTB Sum:=[Budget Sum]-[Actual Sum]

Creating Variance to Budget Percentage Measure

Create Variance to Budget Percentage measure (VTB %) as follows −

VTB %:=IF([Budget Sum],[VTB Sum]/ABS([Budget Sum]),BLANK())

Analyzing Data with Variance to Budget Measures

Create a Power PivotTable as follows −

  • Add Fiscal Year from the Date table to Rows.
  • Add the measures Actual Sum, Budget Sum, VTB Sum, VTB % from the Finance Data table to Values.
Variance to Budget Measures

Creating Variance to Forecast Sum Measure

Create Variance to Forecast Sum (VTF Sum) measure as follows −

VTF Sum:=[Forecast Sum]-[Actual Sum]

Creating Variance to Forecast Percentage Measure

Create Variance to Forecast Percentage measure (VTF %) as follows −

VTF %:=IF([Forecast Sum],[VTF Sum]/ABS([Forecast Sum]),BLANK())

Analyzing Data with Variance to Forecast Measures

Create a Power PivotTable as follows −

  • Add Fiscal Year from the Date table to Rows.
  • Add the measures Actual Sum, Forecast Sum, VTF Sum, VTF % from the Finance Data table to Values.

Creating Forecast Variance to Budget Sum Measure

Create Forecast Variance to Budget Sum (Forecast VTB Sum) measure as follows −

Forecast VTB Sum:=[Budget Sum]-[Forecast Sum]

Creating Forecast Variance to Budget Percentage Measure

Create Forecast Variance to Budget Percentage (Forecast VTB Percentage) measure as follows −

Forecast VTB %:=IF([Budget Sum],[Forecast VTB Sum]/ABS([Budget Sum]),BLANK())

Analyzing Data with Forecast Variance to Budget Measures

Create a Power PivotTable as follows −

  • Add Fiscal Year from the Date table to Rows.
  • Add the measures Budget Sum, Forecast Sum, Forecast VTB Sum, Forecast VTB % from the Finance Data table to Values.
Forecast Variance to Budget Measures

Year-to-Date Measures and Analysis

To calculate a result that includes a starting balance from the beginning of a period, such as a fiscal year, up to a specific period in time, you can use DAX Time Intelligence functions. This will enable you to analyze data on a month level.

In this chapter, you will learn how to create Year-to-Date measures and how to carry out data analysis with the same.

Creating Year-to-Date Actual Sum Measure

Create Year-to-Date Actual Sum measure as follows −

YTD Actual Sum:=TOTALYTD([Actual Sum], 'Date'[Date], ALL('Date'), "6/30")

Creating Year-to-Date Budget Sum Measure

Create Year-to-Date Budget Sum measure as follows −

YTD Budget Sum:=TOTALYTD([Budget Sum], 'Date'[Date], ALL('Date'), "6/30")

Creating Year-to-Date Forecast Sum Measure

Create Year-to-Date Forecast Sum measure as follows −

YTD Forecast Sum:=TOTALYTD([Forecast Sum], 'Date'[Date], ALL('Date'), "6/30")

Creating Prior Year-to-Date Actual Sum Measure

Create Prior Year-to-Date Actual Sum measure as follows −

Prior YTD Actual Sum:=TOTALYTD([Prior Year Actual Sum], 'Date'[Date], ALL('Date'), "6/30")

Analyzing Data with Year-to-Date Measures

Create a Power PivotTable as follows −

  • Add Month from Date table to Rows.
  • Add the measures Actual Sum, YTD Actual Sum, YTD Budget Sum, and YTD Forecast Sum from the Finance Data table to Values.
  • Insert a Slicer on the Fiscal Year from the Date table.
  • Select FY2016 in the Slicer.
Fiscal Year

Create a Power PivotTable as follows −

  • Add Month from Date table to Rows.
  • Add the measures Actual Sum, YTD Actual Sum, Prior Year Actual Sum, and Prior YTD Actual Sum from the Finance Data table to Values.
  • Insert a Slicer on the Fiscal Year from the Date table.
  • Select FY2016 in the Slicer.
Prior Year Actual Sum

Quarter-to-Date Measures and Analysis

To calculate a result that includes a starting balance from the beginning of a period, such as a fiscal quarter, up to a specific period in time, you can use DAX Time Intelligence functions. This will enable you to analyze data on a month level.

In this chapter, you will learn how to create Quarter-to-Date measures and how to carry out data analysis with the same.

Creating Quarter-to-Date Actual Sum Measure

Create Quarter-to-Date Actual Sum measure as follows −

QTD Actual Sum:=TOTALQTD([Actual Sum], 'Date'[Date], ALL('Date'))

Creating Quarter-to-Date Budget Sum Measure

Create Quarter-to-Date Budget Sum measure as follows −

QTD Budget Sum:=TOTALQTD([Budget Sum], 'Date'[Date], ALL('Date'))

Creating Quarter-to-Date Forecast Sum Measure

Create Quarter-to-Date Forecast Sum measure as follows −

QTD Forecast Sum:=TOTALQTD([Forecast Sum], 'Date'[Date], ALL('Date'))

Creating Prior Quarter-to-Date Actual Sum Measure

Create Prior Quarter-to-Date Actual Sum measure as follows −

Prior QTD Actual Sum:=TOTALQTD([Prior Quarter Actual Sum], 'Date'[Date], ALL('Date'))

Analyzing Data with Quarter-to-Date Measures

Create a Power PivotTable as follows −

  • Add Fiscal Month from Date table to Rows.
  • Add the measures Actual Sum, QTD Actual Sum, QTD Budget Sum, and QTD Forecast Sum from Finance Data table to Values.
  • Insert a Slicer on the Fiscal Quarter from the Date table.
  • Select FY2016-Q2 in the Slicer.
FY2016-Q2

Create a Power PivotTable as follows −

  • Add Fiscal Month from Date table to Rows.
  • Add the measures Actual Sum, QTD Actual Sum, Prior Quarter Actual Sum, and Prior QTD Actual Sum from Finance Data table to Values.
  • Insert a Slicer on the Fiscal Quarter from Date table.
  • Select FY2016-Q1 in the Slicer.
FY2016-Q1

Budget Measures and Analysis

Budgeting involves estimating the cash flows of a company over a financial year. The financial position of the company, its goals, expected revenues, and expenses are taken into account in budgeting.

However, the market conditions may change during the financial year and the company may have to reset its goals. This requires analyzing the financial data with the budget estimated at the beginning of the financial year (Budget Sum) and the actual expended sum from the beginning of the financial year to date (YTD Actual Sum).

At any time during a financial year, you can calculate the following −

Unexpended Balance

Unexpended Balance is the budget remaining after the actual expenses, i.e.

Unexpended Balance = YTD Budget Sum – YTD Actual Sum

Budget Attainment %

Budget Attainment % is the percentage of the budget that you have spent to date, i.e.

Budget Attainment % = YTD Actual Sum/YTD Budget Sum
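For example (hypothetical amounts), if the YTD Budget Sum is 100,000 and the YTD Actual Sum is 65,000, then the Unexpended Balance is 100,000 - 65,000 = 35,000 and the Budget Attainment % is 65,000 / 100,000 = 65%.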

These calculations help those companies that use budgeting to make decisions.

Creating Unexpended Balance Measure

You can create Unexpended Balance measure as follows −

Unexpended Balance:=CALCULATE([YTD Budget Sum], ALL('Finance Data'[Date])) - [YTD Actual Sum]

Creating Budget Attainment Percentage Measure

You can create Budget Attainment Percentage measure as follows −

Budget Attainment %:=IF([YTD Budget Sum], [YTD Actual Sum]/CALCULATE([YTD Budget Sum], ALL('Finance Data'[Date])), BLANK())

Analyzing Data with Budget Measures

Create a Power PivotTable as follows −

  • Add Month from the Date table to Rows.
  • Add the measures Budget Sum, YTD Budget Sum, YTD Actual Sum, Budget Attainment % and Unexpended Balance from Finance Data table to Values.
  • Insert a Slicer on the Fiscal Year field.
  • Select FY2016 in the Slicer.
Budget Measures

Forecast Measures and Analysis

You can use Forecast measures to analyze the finance data and help an organization make necessary adjustments in its goals and targets for the year, to align the company’s performance to the changing business requirements.

You need to update the forecasts regularly to keep up with the changes. You can then compare the most recent forecast to the budget for the rest of the period in the financial year so that the company can make the required adjustments to meet the business changes.

At any time during a financial year, you can calculate the following −

Forecast Attainment %

Forecast Attainment % is the percentage of the forecast sum that you have spent to date, i.e.

Forecast Attainment % = YTD Actual Sum/YTD Forecast Sum

Forecast Unexpended Balance

Forecast Unexpended Balance is the Forecast Sum remaining after the actual expenses, i.e.

Forecast Unexpended Balance = YTD Forecast Sum – YTD Actual Sum

Budget Adjustment

Budget Adjustment is the adjustment in the budget sum an organization needs to make (an increase or decrease) based on the forecast.

Budget Adjustment = Forecast Unexpended Balance – Unexpended Balance

The budget needs to be increased if the resulting value is positive. Otherwise, it can be adjusted for some other purpose.
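For example (hypothetical amounts), if the Forecast Unexpended Balance is 40,000 and the Unexpended Balance is 35,000, then the Budget Adjustment is 40,000 - 35,000 = 5,000, i.e. the budget needs to be increased by 5,000 to cover the forecast.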

Creating Forecast Attainment Percentage Measure

You can create Forecast Attainment Percentage measure as follows −

Forecast Attainment Percentage:= IF([YTD Forecast Sum], [YTD Actual Sum]/[YTD Forecast Sum], BLANK())

Creating Forecast Unexpended Balance Measure

You can create Forecast Unexpended Balance measure as follows −

Forecast Unexpended Balance:=[YTD Forecast Sum]-[YTD Actual Sum]

Creating Budget Adjustment Measure

You can create Budget Adjustment measure as follows −

Budget Adjustment:=[Forecast Unexpended Balance]-[Unexpended Balance]

Analyzing Data with Forecast Measures

Create a Power PivotTable as follows −

  • Add Month from Date table to Rows.
  • Add the measures Forecast Sum, YTD Forecast Sum, YTD Actual Sum, Forecast Attainment Percentage, Forecast Unexpended Balance, and Budget Adjustment from the Finance Data table to Values.
  • Insert a Slicer on Fiscal Year.
  • Select FY2016 in the Slicer.
Data with Forecast Measures

Count of Months Measures

You can create Count of Months measures that can be used in creating the Headcount measures and Cost per Head measures. These measures count the distinct values of the Fiscal Month column for which the Actual column / Budget column / Forecast column has non-zero values in the Finance Data table. This is required because the Finance Data table contains zero values in the Actual column, and those rows are to be excluded while calculating Headcount and Cost per Head.

Creating Count of Actual Months Measure

You can create Count of Actual Months measure as follows −

CountOfActualMonths:=CALCULATE(DISTINCTCOUNT('Finance Data'[Fiscal Month]), 'Finance Data'[Actual]<>0)

Creating Count of Budget Months Measure

You can create Count of Budget Months measure as follows −

CountOfBudgetMonths:=CALCULATE(DISTINCTCOUNT('Finance Data'[Fiscal Month]), 'Finance Data'[Budget]<>0)

Creating Count of Forecast Months Measure

You can create Count of Forecast Months measure as follows −

CountOfForecastMonths:=CALCULATE(DISTINCTCOUNT('Finance Data'[Fiscal Month]), 'Finance Data'[Forecast]<>0)

Ending Headcount Measures

You can create Ending Headcount measures for a specific period of time. The Ending Headcount is the sum of the people as of the last date in the specified period for which there is a non-blank sum of people.

The Ending Headcount is obtained as follows −

  • For a Month − Sum of People at the end of the specific Month.
  • For a Quarter − Sum of People at the end of the last Month of the specific Quarter.
  • For a Year − Sum of People at the end of the last Month of the specific Year.
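For example (hypothetical numbers), if the month-end sums of Actual People for the three months of a quarter are 110, 115, and 118, the Ending Headcount for that quarter is 118, the value at the end of the last month of the quarter.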

Creating Actual Ending Headcount Measure

You can create Actual Ending Headcount measure as follows −

Actual Ending Head Count:=CALCULATE(SUM('Finance Data'[Actual People]), LASTNONBLANK('Finance Data'[Date], IF(CALCULATE(SUM('Finance Data'[Actual People]), ALL(Accounts))=0, BLANK(), CALCULATE(SUM('Finance Data'[Actual People]), ALL(Accounts)))), ALL(Accounts))

The DAX LASTNONBLANK function, as used above, returns the last date for which there is a non-blank sum of people, so that you can calculate the sum of people on that date.

Creating Budget Ending Headcount Measure

You can create Budget Ending Headcount measure as follows −

Budget Ending Head Count:=CALCULATE(SUM('Finance Data'[Budget People]), LASTNONBLANK('Finance Data'[Date], IF(CALCULATE(SUM('Finance Data'[Budget People]), ALL(Accounts))=0, BLANK(), CALCULATE(SUM('Finance Data'[Budget People]), ALL(Accounts)))), ALL(Accounts))

Creating Forecast Ending Headcount Measure

You can create Forecast Ending Headcount measure as follows −

Forecast Ending Head Count:=CALCULATE(SUM('Finance Data'[Forecast People]), LASTNONBLANK('Finance Data'[Date], IF(CALCULATE(SUM('Finance Data'[Forecast People]), ALL(Accounts))=0, BLANK(), CALCULATE(SUM('Finance Data'[Forecast People]), ALL(Accounts)))), ALL(Accounts))

Creating Prior Year Actual Ending Headcount Measure

You can create Prior Year Actual Ending Headcount measure as follows −

Prior Year Actual Ending Headcount:=CALCULATE('Finance Data'[Actual Ending Head Count], DATEADD('Date'[Date],-1,YEAR))

Analyzing Data with Ending Headcount Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Year and Month from the Date table to Rows.
  • Add the measures Actual Ending Head Count, Budget Ending Head Count, Forecast Ending Head Count, and Prior Year Actual Ending Headcount from the Finance Data table to Values.
  • Insert a Slicer on the Fiscal Year field.
  • Select FY2016 in the Slicer.
Ending Headcount Measures

Average Headcount Measures

In the previous chapter, you have learnt how to calculate ending headcounts for a specific period. Likewise, you can create the average monthly headcount for any given selection of months.

The Average Monthly Headcount is the sum of the monthly headcounts divided by the number of months in the selection.
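For example (hypothetical numbers), if the monthly ending headcounts for a selected quarter are 100, 110, and 120, the Average Monthly Headcount is (100 + 110 + 120)/3 = 110.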

You can create these measures using the DAX AVERAGEX function.

Creating Actual Average Headcount Measure

You can create Actual Average Headcount measure as follows −

Actual Average Headcount:=AVERAGEX(VALUES('Finance Data'[Fiscal Month]), [Actual Ending Head Count])

Creating Budget Average Headcount Measure

You can create Budget Average Headcount measure as follows −

Budget Average Headcount:=AVERAGEX(VALUES('Finance Data'[Fiscal Month]), [Budget Ending Head Count])

Creating Forecast Average Headcount Measure

You can create Forecast Average Headcount measure as follows −

Forecast Average Headcount:=AVERAGEX(VALUES('Finance Data'[Fiscal Month]), [Forecast Ending Head Count])

Creating Prior Year Actual Average Headcount Measure

You can create Prior Year Actual Average Headcount measure as follows −

Prior Year Actual Average Headcount:=CALCULATE('Finance Data'[Actual Average Headcount], DATEADD('Date'[Date], -1, YEAR))

Analyzing Data with Average Headcount Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Year and Month from the Date table to Rows.
  • Add the measures Actual Average Headcount, Budget Average Headcount, Forecast Average Headcount, Prior Year Actual Average Headcount from Finance Data table to Values.
  • Insert a Slicer on the Fiscal Year field.
  • Select FY2016 in the Slicer.
Average Headcount Measures

Total Headcount Measures

In the previous chapters, you have learnt how to create Count of Months measures and Average Headcount measures. You can use these measures to calculate the base Headcount Measures −

  • Actual Total Headcount
  • Budget Total Headcount
  • Forecast Total Headcount

In the subsequent chapters, you will learn how to use these base Headcount measures in other calculations such as YoY Headcount and Variance measures.

Creating Actual Total Headcount Measure

You can create Actual Total Headcount Measure as follows −

Actual Total Head Count:= 'Finance Data'[Actual Average Headcount]*'Finance Data'[CountOfActualMonths]

Creating Budget Total Headcount Measure

You can create Budget Total Headcount Measure as follows −

Budget Total Headcount:= 'Finance Data'[Budget Average Headcount]*'Finance Data'[CountOfBudgetMonths]

Creating Forecast Total Headcount Measure

You can create Forecast Total Headcount Measure as follows −

Forecast Total Headcount:= 'Finance Data'[Forecast Average Headcount]*'Finance Data'[CountOfForecastMonths]

YoY Headcount Measures and Analysis

In the previous chapter, you have learnt how to create base Headcount measures – i.e. Actual Total Headcount, Budget Total Headcount, and Forecast Total Headcount.

In this chapter, you will learn how to create Year-Over-Year Headcount measures and how you can analyze the data with these measures.

Creating Year-over-Year Actual Ending Headcount Measure

You can create Year-over-Year Actual Ending Headcount Measure as follows −

YoY Actual Ending Headcount:=[Actual Ending Head Count]-[Prior Year Actual Ending Headcount]

Creating Year-over-Year Actual Average Headcount Measure

You can create Year-over-Year Actual Average Headcount Measure as follows −

YoY Actual Average Headcount:= [Actual Average Headcount]-[Prior Year Actual Average Headcount]

Creating Year-over-Year Actual Total Headcount Measure

You can create Year-over-Year Actual Total Headcount Measure as follows −

YoY Actual Total Headcount:=[Actual Total Head Count]-[Prior Year Actual Total Headcount]

Analyzing Data with Year-over-Year Actual Headcount Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Quarter and Month from the Date table to Rows.
  • Add the measures Actual Ending Head Count, Prior Year Actual Ending Headcount, and YoY Actual Ending Headcount to Values.
  • Insert a Slicer on the field Fiscal Year.
  • Select FY2016 in the Slicer.

Create another Power PivotTable on the same worksheet as follows −

  • Add the fields Fiscal Quarter and Month from the Date table to Rows.
  • Add the measures Actual Average Headcount, Prior Year Actual Average Headcount, and YoY Actual Average Headcount to Values.

Connect the Slicer to this PivotTable as follows −

  • Click the Slicer.
  • Click the Options tab under Slicer Tools on the Ribbon.
  • Click Report Connections.

Report Connections dialog box appears.

  • Select the above two PivotTables.
  • Click OK.
Year-Over-Year Headcount Measures

Creating Year-over-Year Budget Ending Headcount Measure

You can create Year-over-Year Budget Ending Headcount Measure as follows −

YoY Budget Ending Headcount:= [Budget Ending Head Count]-[Prior Year Actual Ending Headcount]

Creating Year-over-Year Budget Average Headcount Measure

You can create Year-over-Year Budget Average Headcount Measure as follows −

YoY Budget Average Headcount:= [Budget Average Headcount]-[Prior Year Actual Average Headcount]

Creating Year-over-Year Budget Total Headcount Measure

You can create Year-over-Year Budget Total Headcount Measure as follows −

YoY Budget Total Headcount:=[Budget Total Headcount]-[Prior Year Actual Total Headcount]

Creating Year-over-Year Forecast Ending Headcount Measure

You can create Year-over-Year Forecast Ending Headcount Measure as follows −

YoY Forecast Ending Headcount:= [Forecast Ending Head Count]-[Prior Year Actual Ending Headcount]

Creating Year-over-Year Forecast Average Headcount Measure

You can create Year-over-Year Forecast Average Headcount Measure as follows −

YoY Forecast Average Headcount:= [Forecast Average Headcount]-[Prior Year Actual Average Headcount]

Creating Year-over-Year Forecast Total Headcount Measure

You can create Year-over-Year Forecast Total Headcount Measure as follows −

YoY Forecast Total Headcount:=[Forecast Total Headcount]-[Prior Year Actual Total Headcount]

Variance Headcount Measures

You can create the Variance Headcount measures based on the Headcount measures that you have created so far.

Creating Variance to Budget Ending Headcount Measure

You can create Variance to Budget Ending Headcount Measure as follows −

VTB Ending Head Count:= 'Finance Data'[Budget Ending Head Count]-'Finance Data'[Actual Ending Head Count]

Creating Variance to Budget Average Headcount Measure

You can create Variance to Budget Average Headcount Measure as follows −

VTB Average Head Count:= 'Finance Data'[Budget Average Headcount]-'Finance Data'[Actual Average Headcount]

Creating Variance to Budget Total Headcount Measure

You can create Variance to Budget Total Headcount Measure as follows −

VTB Total Head Count:= 'Finance Data'[Budget Total Headcount]-'Finance Data'[Actual Total Head Count]

Creating Variance to Forecast Ending Headcount Measure

You can create Variance to Forecast Ending Headcount Measure as follows −

VTF Ending Head Count:= 'Finance Data'[Forecast Ending Head Count]-'Finance Data'[Actual Ending Head Count]

Creating Variance to Forecast Average Headcount Measure

You can create Variance to Forecast Average Headcount Measure as follows −

VTF Average Head Count:= 'Finance Data'[Forecast Average Headcount]-'Finance Data'[Actual Average Headcount]

Creating Variance to Forecast Total Headcount Measure

You can create Variance to Forecast Total Headcount Measure as follows −

VTF Total Head Count:= 'Finance Data'[Forecast Total Headcount]-'Finance Data'[Actual Total Head Count]

Creating Forecast Variance to Budget Ending Headcount Measure

You can create Forecast Variance to Budget Ending Headcount Measure as follows −

Forecast VTB Ending Head Count:= 'Finance Data'[Budget Ending Head Count]-'Finance Data'[Forecast Ending Head Count]

Creating Forecast Variance to Budget Average Headcount Measure

You can create Forecast Variance to Budget Average Headcount Measure as follows −

Forecast VTB Average Head Count:='Finance Data'[Budget Average Headcount]-'Finance Data'[Forecast Average Headcount]

Creating Forecast Variance to Budget Total Headcount Measure

You can create Forecast Variance to Budget Total Headcount Measure as follows −

Forecast VTB Total Head Count:= 'Finance Data'[Budget Total Headcount]-'Finance Data'[Forecast Total Headcount]

Cost Per Headcount Measures and Analysis

You have learnt about the two major categories of Measures −

  • Finance Measures.
  • Headcount Measures.

The third major category of measures that you will learn is People Cost Measures. Any organization is interested in knowing the annualized cost per head. Annualized cost per head represents the cost to the company of having one employee on a full-year basis.

To create Cost Per Head measures, you need to first create certain preliminary People Cost measures. The Accounts table has a column, Sub Class, that contains People as one of its values. Hence, you can apply a filter on the Sub Class column of the Accounts table to propagate the filter context to the Finance Data table and obtain People Cost.

You can then use the People Cost measures and the Count of Months measures thus obtained to create Annualized People Cost measures. Finally, you can create Annualized Cost Per Head measures from the Annualized People Cost measures and the Average Headcount measures.

Creating Actual People Cost Measure

You can create Actual People Cost measure as follows −

Actual People Cost:=CALCULATE('Finance Data'[Actual Sum], FILTER('Finance Data', RELATED(Accounts[Sub Class])="People"))

Creating Budget People Cost Measure

You can create Budget People Cost measure as follows −

Budget People Cost:=CALCULATE('Finance Data'[Budget Sum], FILTER('Finance Data', RELATED(Accounts[Sub Class])="People"))

Creating Forecast People Cost Measure

You can create Forecast People Cost measure as follows −

Forecast People Cost:=CALCULATE('Finance Data'[Forecast Sum], FILTER('Finance Data', RELATED(Accounts[Sub Class])="People"))

Creating Annualized Actual People Cost Measure

You can create Annualized Actual People Cost measure as follows −

Annualized Actual People Cost:=IF([CountOfActualMonths],[Actual People Cost]*12/[CountOfActualMonths],BLANK())

Creating Annualized Budget People Cost Measure

You can create Annualized Budget People Cost measure as follows −

Annualized Budget People Cost:=IF([CountOfBudgetMonths], [Budget People Cost]*12/[CountOfBudgetMonths],BLANK())

Creating Annualized Forecast People Cost Measure

You can create Annualized Forecast People Cost measure as follows −

Annualized Forecast People Cost:=IF([CountOfForecastMonths],[Forecast People Cost]*12/[CountOfForecastMonths],BLANK())

Creating Actual Annualized Cost Per Head Measure

You can create Actual Annualized Cost Per Head (CPH) measure as follows −

Actual Annualized CPH:=IF([Actual Average Headcount], [Annualized Actual People Cost]/[Actual Average Headcount],BLANK() )

Creating Budget Annualized Cost Per Head Measure

You can create Budget Annualized Cost Per Head (CPH) measure as follows −

Budget Annualized CPH:=IF([Budget Average Headcount],[Annualized Budget People Cost]/[Budget Average Headcount],BLANK())

Creating Forecast Annualized Cost Per Head Measure

You can create Forecast Annualized Cost Per Head (CPH) measure as follows −

Forecast Annualized CPH:=IF([Forecast Average Headcount],[Annualized Forecast People Cost]/[Forecast Average Headcount], BLANK())

Creating Prior Year Actual Annualized Cost Per Head Measure

You can create Prior Year Actual Annualized Cost Per Head (CPH) measure as follows −

Prior Year Actual Annualized CPH:=CALCULATE([Actual Annualized CPH], DATEADD('Date'[Date],-1,YEAR))

Analyzing Data with Cost Per Head Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Quarter and Fiscal Month from Date table to Rows.
  • Add the measures Actual Annualized CPH, Budget Annualized CPH, and Forecast Annualized CPH to Columns.
  • Add the field Fiscal Year from Date table to Filters.
  • Select FY2016 in the Filter.
Cost per Head Measures

Create another Power PivotTable as follows −

  • Add the field Fiscal Quarter from Date table to Rows.
  • Add the measures Actual Annualized CPH, and Prior Year Actual Annualized CPH to Columns.
  • Insert a Slicer on the field Fiscal Year from Date table.
  • Select FY2015 and FY2016 on the Slicer.
Actual Annualized CPH

Rate Variance and Volume Variance

You have learnt how to create measures for Annualized Cost Per Head and Total Headcount. You can use these measures to create Rate Variance and Volume Variance measures.

  • Rate Variance measures calculate what portion of a Currency Variance is caused by differences in Cost Per Head.
  • Volume Variance measures calculate how much of the Currency Variance is driven by fluctuation in Headcount.

Creating Variance to Budget Rate Measure

You can create Variance to Budget Rate measure as follows −

VTB Rate:=([Budget Annualized CPH]/12-[Actual Annualized CPH]/12)*[Actual Total Head Count]

Creating Variance to Budget Volume Measure

You can create Variance to Budget Volume measure as follows −

VTB Volume:=[VTB Total Head Count]*[Budget Annualized CPH]/12

Analyzing Data with Variance to Budget Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Quarter and Fiscal Month from Date table to Rows.
  • Add the measures Actual Annualized CPH, Budget Annualized CPH, VTB Rate, VTB Volume, VTB Sum to Values.
  • Add the fields Fiscal Year from Date table and Sub Class from Accounts table to Filters.
  • Select FY2016 in the Fiscal Year Filter.
  • Select People in the Sub Class Filter.
  • Filter Row Labels for Fiscal Quarter values FY2016-Q1 and FY2016-Q2.
Data with Variance to Budget Measures

You can observe the following in the above PivotTable −

  • VTB Sum value shown is only for Sub Class – People.
  • For Fiscal Quarter FY2016-Q1, VTB Sum is $4,705,568, VTB Rate is $970,506,297, and VTB Volume is $-965,800,727.
  • VTB Rate measure calculates that $970,506,297 of the Variance to Budget (VTB Sum) is caused by the difference in Cost per Head, and $-965,800,727 is caused by the difference in Headcount.
  • If you add VTB Rate and VTB Volume, you will get $4,705,568, the same value as returned by VTB Sum for Sub Class People.
  • Similarly, for Fiscal Quarter FY2016-Q2, VTB Rate is $1,281,467,662, and VTB Volume is $-1,210,710,978. If you add VTB Rate and VTB Volume, you will get $70,756,678, which is the VTB Sum value shown in the PivotTable.

Creating Year-Over-Year Rate Measure

You can create Year-Over-Year Rate measure as follows −

YoY Rate:=([Actual Annualized CPH]/12-[Prior Year Actual Annualized CPH]/12)*[Actual Total Head Count]

Creating Year-Over-Year Volume Measure

You can create Year-Over-Year Volume measure as follows −

YoY Volume:=[YoY Actual Total Headcount]*[Prior Year Actual Annualized CPH]/12

Creating Variance to Forecast Rate Measure

You can create Variance to Forecast Rate measure as follows −

VTF Rate:=([Forecast Annualized CPH]/12-[Actual Annualized CPH]/12)*[Actual Total Head Count]

Creating Variance to Forecast Volume Measure

You can create Variance to Forecast Volume measure as follows −

VTF Volume:=[VTF Total Head Count]*[Forecast Annualized CPH]/12

Analyzing Data with Variance to Forecast Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Quarter and Fiscal Month from Date table to Rows.
  • Add the measures Actual Annualized CPH, Forecast Annualized CPH, VTF Rate, VTF Volume, VTF Sum to Values.
  • Add the fields Fiscal Year from Date table and Sub Class from Accounts table to Filters.
  • Select FY2016 in the Fiscal Year Filter.
  • Select People in the Sub Class Filter.
  • Filter Row Labels for Fiscal Quarter values FY2016-Q1 and FY2016-Q2.
Data with Variance to Forecast Measures

Creating Forecast Variance to Budget Rate Measure

You can create Forecast Variance to Budget Rate measure as follows −

Forecast VTB Rate:=([Budget Annualized CPH]/12-[Forecast Annualized CPH]/12)*[Forecast Total Headcount]

Creating Forecast Variance to Budget Volume Measure

You can create Forecast Variance to Budget Volume measure as follows −

Forecast VTB Volume:=[Forecast VTB Total Head Count]*[Budget Annualized CPH]/12

Analyzing Data with Forecast Variance to Budget Measures

Create a Power PivotTable as follows −

  • Add the fields Fiscal Quarter and Fiscal Month from Date table to Rows.
  • Add the measures Budget Annualized CPH, Forecast Annualized CPH, Forecast VTB Rate, Forecast VTB Volume, Forecast VTB Sum to Values.
  • Add the fields Fiscal Year from Date table and Sub Class from Accounts table to Filters.
  • Select FY2016 in the Fiscal Year Filter.
  • Select People in the Sub Class Filter.
  • Filter Row Labels for Fiscal Quarter values FY2016-Q1 and FY2016-Q2.
Data with Forecast Variance Budget Measures


QlikView – Overview

QlikView

QlikView is a leading Business Discovery Platform. It is unique in many ways as compared to the traditional BI platforms. As a data analysis tool, it always maintains the relationship between the data, and this relationship can be seen visually using colors. It also shows the data that is not related. It provides both direct and indirect searches by using individual searches in the list boxes.

QlikView’s core and patented technology has the feature of in-memory data processing, which gives superfast results to the users. It calculates aggregations on the fly and compresses data to 10% of its original size. Neither users nor developers of QlikView applications manage the relationship between data; it is managed automatically.

Features of QlikView

QlikView has patented technology, which enables it to have many features that are useful in creating advanced reports from multiple data sources quickly. Following is a list of features that make QlikView unique.

  • Data Association is maintained automatically − QlikView automatically recognizes the relationship between each piece of data that is present in a dataset. Users need not preconfigure the relationship between different data entities.
  • Data is held in memory for multiple users, for a super-fast user experience − The structure, data and calculations of a report are all held in the memory (RAM) of the server.
  • Aggregations are calculated on the fly as needed − As the data is held in memory, calculations are done on the fly. No need of storing pre-calculated aggregate values.
  • Data is compressed to 10% of its original size − QlikView heavily uses a data dictionary. Only the essential bits of data are required in memory for any analysis. Hence, it compresses the original data to a very small size.
  • Visual relationship using colors − The relationship between data is not shown by arrow or lines but by colors. Selecting a piece of data gives specific colors to the related data and another color to unrelated data.
  • Direct and Indirect searches − Instead of giving the direct value a user is looking for, they can input some related data and get the exact result because of the data association. Of course, they can also search for a value directly.

QlikView – Installation

Download QlikView

The Free Personal Edition of QlikView can be downloaded from QlikView Personal Edition. You need to register with your details to be able to download.

After downloading, the installation is a very straightforward process in which you need to accept the license agreement and provide the target folder for installation. The screen shots given below describe the entire setup process.

Start the installation wizard

Double clicking the QlikViewDesktop_x64Setup.exe will present a screen to select the language of your choice. On selecting English, the following screen is displayed. Then click Next.

Installation step-1

Accept the License Agreement

Read the license agreement and if you agree, choose the “I accept the terms in the license agreement” option. Then click “Next”.

Installation step-2

Provide your Details

Provide your name and organization details. Then Click “Next”.

Installation step-3

Choose the Destination Folder

You may accept the default destination for installation or alter it. Then click “Next”.

Installation step-4

Choose the Setup Type

Choose the setup type as “Complete”. Then click “Next”.

Installation step-5

Begin Installation

In this screen, you finally decide to really start the installation. You can still go back and change some options if needed. Assuming you are fine with everything so far, click “Install”.

Installation step-6

Installation Complete

The installation completion screen appears after successful installation. Click “Finish”.

Installation step-7

Verify the Installation

You can verify the installation by going to the Windows Start menu and clicking on the QlikView icon. The screen appears as shown below.

Verify the Install

You are now ready to learn QlikView.

QlikView – Architecture

As a leading Business Discovery Platform, QlikView is built with a very different approach to data discovery than other traditional platforms. QlikView does not first build a query and then fetch the result based on the query. Rather, it forms associations between different data objects as soon as the data is loaded and prompts the user to explore the data in any way. There are no predefined drill-down paths. The drill-down can happen in any direction as long as the data is available and associated.

Of course, a user can also play a role in creating the associations between data elements using data modeling approach available in QlikView.

Architectural Overview

QlikView’s architecture consists of a front end to visualize the processed data and a back end to provide the security and publication mechanism for QlikView user documents. The diagram given below depicts the internal working of QlikView. The architecture is discussed in detail below the picture.

qlikview_architecture

Front End

The Front end in QlikView is a browser-based access point for viewing the QlikView documents. It contains the QlikView Server, which is mainly used by the Business users to access the already created BI reports through an internet or intranet URL.

Business users explore and interact with data using this front end and derive conclusions about the data. They also collaborate with other users on a given set of reports by sharing insights and exploring data together, in real time or offline. These user documents are in the .qvw format, which can also be stored in the Windows OS as standalone documents.

The QlikView server in the front end manages the client server communication between the user and QlikView backend system.

Back End

The QlikView backend consists of QlikView desktop and QlikView publisher.

The QlikView desktop is a wizard-driven Windows environment, which has the features to load and transform data from its source. Its drag-and-drop feature is used to create the GUI layout of the reports that become visible in the front end. The files created by QlikView desktop are stored with the extension .qvw. These are the files that are passed on to the QlikView server in the front end, which serves the users with these files. Also, .qvw files can be modified to store data-only files, which are known as .qvd files. They are binary files, which contain only the data and not the GUI components.

The QlikView publisher is used as distribution service to distribute the .qvw documents among various QlikView servers and users. It handles the authorization and access privileges. It also does the direct loading of data from data sources by using the connection strings defined in the .qvw files.

QlikView – Navigation

In this chapter, we will get acquainted with the screens available to a user for doing various operations. We will learn the basic navigation and know the different functions performed by the icons available in QlikView.

Getting Started Screen

This screen is a gentle introduction to navigating around QlikView. This screen comes up when you start QlikView with the Show start page when launching QlikView option checked. If you scroll down the Examples section on the left, you can click any of the examples, such as Movies Database or Data Visualization, to invoke it and see how QlikView works. Feel free to click around! On moving to the right, you notice the Recent and Favourites links, which show all the recently visited QlikView documents and the documents you want to visit frequently.

Verify the Install

Menu Commands

On closing the ‘Getting Started’ window, we get the main interface with all the available Menu commands. They represent the entire set of features available in QlikView. Given below is an overview of each section of the Menu Commands.

All menu

File Menu

This menu is used to create a new QlikView file and open the existing files from both local system and QlikView server. The important features in this menu are −

  • Mail as attachment to email the currently open document as an attachment.
  • Reduce Data to view only the report layout and database structure without any data.
  • Table viewer option is used to see the structure of the tables, fields and their association in a graphical view.

All menu

Edit Menu

This menu is used to carry out the editing options like copy, paste, cut and using format painter. The important features in this menu are −

  • Activate All option activates all the sheet objects present in the opened sheet.
  • Remove option removes a sheet from the active window.
  • Advanced search option is used to do a search with advanced search expressions using the multi box.

All menu

View Menu

This menu is used to view the standard toolbars and zoom in/ zoom out features. It also displays all the active sheets as a cascade menu. The important features in this menu are −

  • Turn on/off WebView mode toggles the WebView mode and local view mode.
  • Current Selections displays the field names and field values of the selected sheet objects.
  • Design Grid is used to toggle the sheet object placeholders for active object(s) and snap-to-grid for sizing and moving objects in the layout.

View menu

Selection Menu

This menu is used to select and clear the selection of values in the sheet objects. It also provides the feature of going back and forward through different logical selections of the sheet you are working on. The important features in this menu are −

  • Lock locks all the values in current selection.
  • Unlock unlocks all the locked values in the current selection.

View menu

Layout Menu

Layout Menu is used to add tabbed sheets, select different sheets and rearrange sheet objects. The important features in this menu are −

  • Promote sheet moves the current sheet or tab one step forward.
  • Demote sheet moves the current sheet or tab one step backward.
  • Delete sheet deletes the active sheet and everything in it.

Layout menu

Settings Menu

Settings menu is used to set the user preferences, document properties, and sheet properties. The important features in this menu are −

  • Variable Overview shows all the non-hidden variables and their values in a single list.
  • Expression Overview shows expressions from the document, sheet and sheet objects as a single list.

Settings menu

Bookmark Menu

This menu is used to create bookmarks to different documents for faster retrieval.

View menu

Reports Menu

Reports menu is used to create new reports and edit the existing reports. You can edit the layout, add pages to the report, and also delete reports.

Reports menu

Tools Menu

Tools menu is a very prominent menu, frequently used for creating charts and opening the QlikView management console. The important features in this menu are −

  • Quick Chart Wizard creates simple charts without using the great number of different settings and options available.
  • Time Chart Wizard creates time series charts.
  • Statistics Chart Wizard is used to apply common statistical tests on data.

Tools menu

Object Menu

This menu is used to create new sheet objects and modify the existing ones. The sheet properties option opens the page to set the parameters defining the sheet. The important features in this menu are −

  • Copy Sheet − creates a copy of the sheet along with all the sheet objects.
  • Copy Image to Clipboard − Copies a bitmap picture of the sheet area to Clipboard.
  • Remove − completely removes the sheet along with the sheet objects.

object menu

The Window and Help menus are used to organize the different windows of QlikView application and provide help documentation.

QlikView – Excel Files

QlikView accepts Excel spreadsheets for data analysis through a simple drag-and-drop action. You need to open the QlikView main window and drag and drop the Excel file into the interface. It will automatically create a sheet showing the Excel data.

Select the Excel file

Keep the main window of QlikView open and browse for the excel file you want to use.

1_excel_load_data_drag

Select a Data Source

On dropping the excel file into the main window, the File wizard appears. The File Type is already chosen as Excel. Under Labels, choose Embedded Labels. Click “Next step” to proceed.

2_excel_file_wizard

Load Script

The Load script appears which shows the command that loads the data into the QlikView document. This command can be edited.

3_excel_load_script
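
The exact statement depends on the workbook; as a rough sketch (the file path, file name Sales.xlsx and sheet name Sheet1 are assumptions, not taken from this tutorial), a generated load statement for an .xlsx file usually has the following shape −

// Hypothetical sketch − load one worksheet from an .xlsx workbook with a header row.
SalesData:
LOAD *
FROM [C:\Qlikview\data\Sales.xlsx]
(ooxml, embedded labels, table is Sheet1);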

Now, the Excel wizard prompts you to save the file with the *.qvw file extension. It asks you to select a location where you need to save the file. Click “Next step” to proceed. Now it is time to see the data that is loaded from the Excel file. We use a Table Box sheet object to display this data.

Create Table Box

The Table Box is a sheet object to display the available data as a table. It is invoked from the menu Layout → New Sheet Object → Table Box.

4_table_box_option

On clicking Next, we get the option to choose the fields from the Table Box. You can use the Promote or Demote buttons to rearrange the fields.

5_Table_box_fields

Table Box Data

On completing the above step, the Table Box Sheet Object appears which shows the data that is read from the Excel file.

6_table_box_result

QlikView – Delimited File

QlikView can use data from plain text files where the fields are separated by characters like comma, tab, semicolon, etc. Here, we will take CSV as an example. A file in which each column of data is separated by a comma is known as a CSV file. It is a very widely used file format to store plain text data organized as columns and rows.

QlikView loads csv files using the Data from files options available in the script editor under the File Menu. Alternatively, you can also open a new QlikView document and press control+E to get the script editor window as shown below. Choose the file Product_details.csv from the appropriate path.

1_dilimeted_insert_script

Select the File Headers

On opening the selected CSV file, a window as shown below comes up. Under Labels choose Embedded Labels, as our file has a header row as its first row. Click Finish.

csv_file_header

Load Script

The loading of the file into QlikView is done through the load script, which can be seen in the screen shot below. Hence, when we use any delimited file, we can tweak the below script as per the file format.

3_dilimeted_load_script
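
For reference, a statement of this kind, patterned on the CSV load syntax used later in this tutorial (the path to Product_details.csv is an assumption), generally looks like this −

// Hypothetical sketch − load a comma-delimited file that has a header row.
Products:
LOAD *
FROM [E:\Qlikview\data\Product_details.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);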

Now the script wizard prompts you to save the file with the *.qvw file extension. It asks you to select a location where you need to save the file. Click “Next step” to proceed. Now, it is time to see the data that is loaded from the delimited file. We use a Table Box sheet object to display this data.

Create Table Box

The Table Box is a sheet object to display the available data as a table. It is invoked from the menu Layout → New Sheet Object → Table Box.

4_table_box_option

On clicking Next, we get the option to choose the fields from the Table Box. You can use the Promote or Demote buttons to rearrange the fields.

5_Table_box_fields

Table Box Data

On completing the above step, the Table Box Sheet Object appears, which shows the data that is read from the CSV file.

6_table_box_result

QlikView – XML File

XML stands for Extensible Markup Language. It is a file format which shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text. Similar to HTML, it contains markup tags. However, unlike HTML, where the markup tags describe the structure of the page, in XML the markup tags describe the meaning of the data contained in the file. QlikView can use the data from XML files.

The process to load the data from XML files is similar to the loading of delimited files we have seen earlier. Open the script editor. Click on the menu Insert → Load Statement → Load from File. Browse for the XML file you wish to load. In this example, we are choosing the employee_dat.xml file.

1_file_insert_script

Select the XML File Structure

On opening the selected XML file, a window comes up as shown below. Under the File Type section in the left, choose XML. The content of the XML file now appears as a table along with the header column. Click Finish.

2_xml_file_type

File Loader Script

The loading of the XML file into QlikView is done through the load script, which can be seen below. So when we use any XML file, we can tweak the below given script to rename the columns or change the file location etc.
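
As a rough sketch (the element path employees/employee is only an assumption about the structure of employee_dat.xml), the generated statement usually takes this form −

// Hypothetical sketch − load records from an XML file.
Employees:
LOAD *
FROM [E:\Qlikview\data\employee_dat.xml]
(XmlSimple, table is [employees/employee]);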

3_xml_load_script

Now the script wizard prompts you to save the file with the *.qvw file extension. It asks you to select a location where you need to save the file. Click “Next step” to proceed. Now it is time to see the data that is loaded from the XML file. We use a Table Box sheet object to display this data.

Create Table Box

The Table Box is a sheet object to display the available data as a table. It is invoked from the menu Layout → New Sheet Object → Table Box.

table_box_option

On clicking Next, we get the option to choose the fields from the Table Box. You can use the Promote or Demote buttons to rearrange the fields.

5_xml_select_fileds

Table Box Data

On completing the above step, the Table Box Sheet Object appears, which shows the data that is read from the XML file.

6_xml_final_data

QlikView – Web File

QlikView can process files from the web, which are in the HTML format. It can extract data from HTML tables. The URL of the web file to be processed is given as an input and QlikView fetches both, the structure and content of the file. Then it analyzes the structure of the page extracting the relevant data from the HTML tables present in the page. We choose the Web files option from the Data from files section under the Data tab of script Editor.

1_webfile_load

Give the URL as Input

On selecting the Web files option, we get a new window to give the URL as input. In this example, we are choosing the List of sovereign states and dependent territories in Asia as the input page from Wikipedia. Mention the URL and click Next.

web_file_wizard

Select the Table from the Web File

On opening the selected Web file, the window shown below comes up. Here we can see the various tables present in the webpage labeled as @1, @2, @3 and so on. Choose the first table and click Next twice.

3_webfile_file_type

Select the Columns of the Table

From the above table, we can choose only the columns we need by removing the unwanted columns using the cross sign.

4_webfile_choose_fields

Load Script

The loading of the file into QlikView is done through the load script, which can be seen in the screen shot given below. Hence, when we use any web file, we can tweak the below given script as per the page structure.
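
A sketch of such a statement is given below; the table position @1 corresponds to the first HTML table selected in the wizard, and the Wikipedia URL is an assumption −

// Hypothetical sketch − load the first HTML table from a web page.
AsiaStates:
LOAD *
FROM [https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Asia]
(html, codepage is 1252, embedded labels, table is @1);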

5_webfile_load_script

Now the script wizard prompts you to save the file with the *.qvw file extension. It asks you to select a location where you need to save the file. Click “Next step” to proceed. Now it is time to see the data that is loaded from the web file. We use a Table Box sheet object to display this data.

Create Table Box

The Table Box is a sheet object to display the available data as a table. It is invoked from the menu Layout → New Sheet Object → Table Box.

table_box_option

On clicking Next, we get the option to choose the fields from the Table Box. You can use the Promote or Demote buttons to rearrange the fields.

6_webfile_table_box_fields

Table Box Data

On completing the above step, the Table Box Sheet Object appears, which shows the data that is read from the Web file. Note the non-English characters.

7_webfile_final_data

QlikView – Database Connection

QlikView can connect to most of the popular databases like MySQL, SQL Server, Oracle, PostgreSQL, etc. It can fetch data and table structures into the QlikView environment and store the results in its memory for further analysis. The steps to connect to any of these databases involve creating an ODBC connection using a DSN and then using this DSN to fetch the data.

For this tutorial, we will be connecting to a MySQL database. This tutorial assumes you have a MySQL environment available. Create an ODBC DSN (Data Source Name) for MySQL. Name the DSN mysqluserdsn, or use an existing one if you have already created a DSN for MySQL.

Connecting to the Database

For this chapter, we will use the MySQL sample database named sakila. We create a new QlikView document and open the script editor (by pressing Control+E). Under the tab Data, we locate the section named Database. Choose ODBC from the drop down list and click Connect. The following window opens. Choose the DSN named mysqluserdsn and click Test Connection. The message Connection Test succeeded should appear.

 database_file_connect

Verifying Connection to Database

On successful connection, the screen given below appears showing the connection to the DB in the main window of the script editor.

database_connected

Select Database Table

Click Select in the above window to get the list of tables and columns. Here, as we have created the DSN with sakila as the default database, we get the list of tables and columns from this database. We can choose another database from the database drop down list as shown in the screenshot given below. We will continue using the sakila database for this chapter.

database_select_table

Table Loader Script

On Clicking OK in the above window, we get back to the main script editor showing the script for using the table named actor.

database_table_script
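
The generated script follows the usual pattern of an ODBC connection followed by a SQL SELECT pass-through; a minimal sketch, assuming the column list of the sakila actor table, is −

// Hypothetical sketch − connect through the DSN and read the actor table.
ODBC CONNECT TO mysqluserdsn;

actor:
SQL SELECT actor_id,
   first_name,
   last_name,
   last_update
FROM sakila.actor;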

Showing the Result in qvd File

Now the data loaded into QlikView document needs to be stored permanently to be analyzed further. For this, we will edit the script to store the data in the form of a qvd file. Press Control+E to open the edit script window and write the following code.

In the code, we give appropriate names to the columns and mention the table name above the load statement. In addition, we give a path where the generated qvd file will be stored. Save this file as QV_mysql.qvw.

datbase_to_qvd
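
The script in the screenshot is not reproduced here; a minimal sketch of such a script, assuming the actor table of sakila, hypothetical column aliases and a local path of your choice, could be −

// Hypothetical sketch − rename the columns in a preceding LOAD, then store the table as a qvd file.
// The ODBC CONNECT statement from the previous step is assumed to precede this.
actor:
LOAD actor_id   as ActorId,
     first_name as FirstName,
     last_name  as LastName;
SQL SELECT actor_id, first_name, last_name
FROM sakila.actor;

STORE actor INTO [E:\Qlikview\data\actor.qvd] (qvd);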

Using the qvd File

The qvd file can be loaded into the main document and used to create graphs and tables for further analysis. Press Control+R to reload the QV_mysql.qvw file and click Next in the chart wizard. Choose the straight table to be created with actor_id, first_name, last_name as the dimensions and count of actor_id as the expression. A chart appears as given below.

database_chart

QlikView – Inline Data

Data can be entered into a QlikView document by directly typing or pasting it. This feature is a quick method to get the data from the clipboard into the QlikView. The script editor provides this feature under the Insert tab.

Script Editor

To open the Inline data load option, we open the script editor and go to Insert → Load Statement → Load Inline.

inline_script_editor

Inserting Data

On opening the above screen, we get a spreadsheet-like document where we can type the values. We can also paste the values already available in the clipboard. Please note the column headers are created automatically. Click Finish.

inline_scripting

Load Script

The command, which loads the data, is created in the background which can be seen in the script editor.

inline_script_editor
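
The generated command is a LOAD ... INLINE block; a small sketch with assumed sample values is shown below −

// Hypothetical sketch − data typed or pasted in the wizard becomes an inline table.
SalesRegions:
LOAD * INLINE [
Region, Sales
East, 1200
West, 950
North, 700
];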

Table Box Data

On creating a Table Box Sheet Object, we see the data that is read from the Inline data load option.

inline_data_chart

QlikView – Resident Load

QlikView can load data from tables that already exist in its RAM, having already been processed by a script. This requirement arises when you want to create a table deriving data from an already existing table in the same script. Please note that both the new table and the existing table should be in the same script.

Creating the Load Script

Open the script editor (or use Control+E) and mention the following script. Here we create an inline table named Regions with sales data for different regions. Then we create another table named Total to calculate the total sales by Region Names. Finally we drop the table Regions, as in this .qvw file we only need the table named Total for data analysis.

resident_load_edit_script
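
The script in the screenshot follows this general pattern; a sketch with assumed region names and sales figures is −

// Hypothetical sketch − build a table, derive a second table from it with Resident, then drop the source.
Regions:
LOAD * INLINE [
Region, Sales
East, 100
East, 250
West, 300
West, 150
];

Total:
LOAD Region,
     sum(Sales) as TotalSales
Resident Regions
Group by Region;

Drop Table Regions;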

Table Box Data

On creating a Table Box Sheet Object, we see the data that is read from the resident data load option.

inline_script_editor

QlikView – Preceding Load

QlikView Preceding load is a load type in which we use a load statement that takes the columns of another load statement present in the same script. The data is read by the first Load statement, which is at the bottom of the script editor window, and is then used by the load statements above it.

Load Script

The below given screen shot shows the script for data, which is loaded as Inline data and then the max function is applied to one of the columns. The load statement at the bottom makes the data available in QlikView’s memory, which is used by the second load statement above the first load statement. The second load statement applies the max function with group by clause.

1_preceeding_load_script
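
A sketch of such a script, with assumed sample values, is given below; the bottom LOAD reads the inline data and the LOAD above it aggregates that output −

// Hypothetical sketch − the upper LOAD takes its input from the LOAD below it (preceding load).
MaxSales:
LOAD Region,
     max(Sales) as MaxSale
Group by Region;
LOAD * INLINE [
Region, Sales
East, 100
East, 250
West, 300
West, 150
];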

Table Box Data

On creating a Table Box Sheet Object, we see the data that is read from the Inline data load option.

2_preceeding_load_table_box

QlikView – Incremental Load

As the volume of data in the data source of a QlikView document increases, the time taken to load the file also increases which slows down the process of analysis. One approach to minimize this time taken to load data is to load only the records that are new in the source or the updated ones. This concept of loading only the new or changed records from the source into the QlikView document is called Incremental Load.

To identify the new records from the source, we use either a sequential unique key or a date time stamp for each row. These values of the unique key or date time field have to flow from the source file to the QlikView document.

Let us consider the following source file containing product details in a retail store. Save this as a .csv file in the local system where it is accessible by QlikView. Over a period of time some more products are added and the description of some product changes.

Product_Id,Product_Line,Product_category,Product_Subcategory
1,Sporting Goods,Outdoor Recreation,Winter Sports & Activities
2,"Food, Beverages & Tobacco",Food Items,Fruits & Vegetables
3,Apparel & Accessories,Clothing,Uniforms
4,Sporting Goods,Athletics,Rugby
5,Health & Beauty,Personal Care
6,Arts & Entertainment,Hobbies & Creative Arts,Musical Instruments
7,Arts & Entertainment,Hobbies & Creative Arts,Orchestra Accessories
8,Arts & Entertainment,Hobbies & Creative Arts,Crafting Materials
9,Hardware,Tool Accessories,Power Tool Batteries
10,Home & Garden,Bathroom Accessories,Bath Caddies
11,"Food, Beverages & Tobacco",Food Items,Frozen Vegetables
12,Home & Garden,Lawn & Garden,Power Equipment

Loading the Data into QlikView

We will load the above CSV file using the script editor (Control+E) by choosing the Table Files option as shown below. Here we also save the data into a QVD file in the local system. Save the QlikView document as a .qvw file.

 incr_laod_create_qvd

Verifying the Data Loaded.

We can check the data loaded to QlikView document by creating a sheet object called Table Box. This is available in the Layout menu and New Sheet Objects sub-menu.

table_box_option

Creating the Table Layout

On selecting the Table Box sheet object, we get to the next screen, which is used to select the columns and their positions in the table to be created. We choose the following columns and their positions and click Finish.

Incr_load_product_details

Viewing the Existing Data

The following chart showing the data as laid out in the previous step appears.

incremental_load_data

Updating the Source Data

Let us add the following three more records to the source data. Here, the Product IDs are the unique numbers, which represent new records.

13,Office Supplies,Presentation Supplies,Display
14,Hardware,Tool Accessories,Jigs
15,Baby & Toddler,Diapering,Baby Wipes

Incremental load script

Now, we write the script to pull only the new records from the source.

// Load the data from the stored qvd.
Stored_Products:
LOAD Product_Id, 
     Product_Line, 
     Product_category, 
     Product_Subcategory
FROM
[E:\Qlikview\data\products.qvd]
(qvd);

//Select the maximum value of Product ID.
Max_Product_ID:
Load max(Product_Id) as MaxId
resident Stored_Products;

//Store the Maximum value of product Id in a variable.
Let MaxId = peek('MaxId',-1);

drop table Stored_Products;


//Pull the rows that are new.	 
NewProducts:
LOAD Product_Id,Product_Line, Product_category,Product_Subcategory
	 from [E:\Qlikview\data\product_categories.csv]
	 (txt, codepage is 1252, embedded labels, delimiter is ',', msq)
	 where Product_Id > $(MaxId);
	 
//Concatenate the new values with existing qvd.
Concatenate
LOAD Product_Id,Product_Line, Product_category, 
     Product_Subcategory
FROM [E:\Qlikview\data\products.qvd](qvd);

//Store the values in qvd.
store NewProducts into [E:\Qlikview\data\products.qvd](qvd);

The above script fetches only the new records, which are loaded and stored into the qvd file. We can see that the records with the new Product IDs 13, 14 and 15 are now included.

incremental_load_final_data

QlikView – Data files (QVD)

One of the important features of QlikView, which makes it so distinguished, is the ability to store a very large amount of data in a very compressed form and store it along with the QlikView documents. Therefore, once the document is created, we need not connect to the data source, as the data is already stored along with the layout of the document. This is achieved through the QVD file, which is a flat file stored with the .qvd extension. A QVD file stores data for one QlikView document and it is created using the script editor available in the QlikView document.

Advantages of Using QVD Files

The advantages of using QVD files in QlikView are as follows −

  • Faster Loading of Data
  • Gracefully support scaling up as the data volume grows
  • Used in incremental load
  • Data from multiple sources can be clubbed to one data set
  • Extract data in parallel

Creating QVD Files

QVD files are created using the STORE statement during the loading of QlikView files. This statement creates a single qvd file, which gets stored in the specified location as a file separate from the QVW file through which it is created.

Given below is an example of storing the qvd file after the data is loaded into the QlikView document by reading a source file.

csv_to_qvd.JPG
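
A sketch of this pattern, reusing the CSV load syntax from the earlier chapters (the file paths are assumptions), is −

// Hypothetical sketch − load a source file and store the resulting table as a qvd file.
Products:
LOAD Product_Id,
     Product_Line,
     Product_category,
     Product_Subcategory
FROM [E:\Qlikview\data\product_categories.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

STORE Products INTO [E:\Qlikview\data\products.qvd] (qvd);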

Using QVD Files

A QVD file is loaded to a QlikView document in a similar way as other files like CSV, Excel and delimited files. We use the Open option available under the File menu and browse for the QVD file we created before. On opening it, we get a window to see the data, select the column headers and do any data transformation required.

open_qvd_file

QVD File Loader Script

On clicking Finish, the edit script window appears, which shows the code used to load the QVD file. We can edit this code further, for example, to display only a few of the columns or to apply any inbuilt function. Click Finish to load the file into the current QlikView document. Save the QlikView document as use_qvd.qvw.

qvd_file_load_script
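
The generated code is essentially a LOAD from the qvd file; a sketch, assuming the path used in the earlier chapters, is −

// Hypothetical sketch − read a qvd file back into the document.
Products:
LOAD *
FROM [E:\Qlikview\data\products.qvd]
(qvd);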

Displaying Data from QVD File

Reload the document by using Control+R and choose the menu option Layout → New Sheet Objects → Table Box. A window appears showing all the columns from the table present in the QVD file. Select “Add All” to add all the columns to the display table. Use the “Promote/Demote” option to change the order of the columns. Click “Finish”. The following screen appears showing the content of the QVD file.

qvd_file_data_table

QlikView – Sheet and Objects

Every QlikView document is made of at least one worksheet called Main. We can add more sheets, which are like many pages of the same QlikView document. Sheets help us display multiple data formats like – multiple charts or multiple tables. Each sheet can contain various sheet objects. In addition, sheets can be rearranged using Promote Sheet/Demote Sheet option and can be removed from the QlikView document using Remove Sheet option.

Sheets_add_remove

Sheet Properties

Sheets have various properties, which can be set to customize the sheets. For example, we can set the name of a sheet and its colors. Right click anywhere in the sheet and choose the Properties option. Then choose the following properties.

  • Sheet Settings → Color. − This will set the background colour of the Sheet.
  • Tab Settings → Custom Colors. − This will set the color for the Tab where the Sheet name appears.
  • Title. − This will set the name of the Sheet.

Sheets_properties

Creating Sheet Objects

Sheet Objects are the QlikView data elements that are embedded in the sheet. They display the data that is loaded into the QlikView’s memory. Each sheet object is tied to a data source and one or more of its columns. Sheet Objects are created from the layout menu as shown below.

sheet_objects_menu

Using Sheet Objects

Sheet Objects display the data from a data source, and all the objects in a sheet are associated with each other. Let’s create a List Box and a Multi Box and see this association in action.

Creating List Box

The List box displays data from a column of a table available in QlikView memory. Choose the option List Box from the Add Sheet Objects option and set the properties as given below.

sheet_list_box

Creating Multi Box

A Multi Box represents data from multiple columns from a table. Choose the option Multi Box from the Add Sheet Objects option and set the properties as shown below.

sheet_multi_box

On completing the above given steps, the following window appears, which shows both the objects.

sheet_list_multi_boxes

Association between Sheet Objects

We can see how the sheet objects are linked to each other by choosing one option from the Multi Box, which highlights the associated row in the List Box. Let us choose “Diapering” under the Product Category drop down list in the Multi Box. The window shown below appears.

sheet_objects_association

QlikView – Scripting

Scripting is a very powerful feature in QlikView, which enables the control of the data load options and data transformations. It enables the use of many inbuilt functions available in QlikView and creates subroutines to be used across multiple scripts within a QlikView document.

Scripting is done using the Script Editor. It is accessed from the File menu using Script Editor. We can also press Control+E to open the script editor window. It prepopulates some data formats that are set as default formats for the data to be processed. For example, the thousand separator is a comma and the date is in month-day-year format. These can be changed to suit the incoming data as per the need.

 script Editor
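
The prepopulated section consists of SET statements for these format variables; the exact values depend on your locale, but they typically look like the following −

// Typical default format variables inserted by the script editor (locale-dependent values).
SET ThousandSep=',';
SET DecimalSep='.';
SET MoneyThousandSep=',';
SET MoneyDecimalSep='.';
SET MoneyFormat='$#,##0.00;($#,##0.00)';
SET TimeFormat='h:mm:ss TT';
SET DateFormat='M/D/YYYY';
SET TimestampFormat='M/D/YYYY h:mm:ss[.fff] TT';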

Script Editor Features

Script editor has many features, which are accessed from the menu in the script editor window, which is a different menu from the main menu. Given below is a list of important features.

File Menu

  • Reload − Reloads the script and fetches the new data.
  • Upper/Lower Case − Converts the case of words as QlikView is case sensitive.
  • Comment − Used to comment blocks of code.

Edit Menu

  • Clear Entire Script − Clears the active script tab.
  • Open Script File − Opens the saved script files.
  • Table Viewer − Used to see the tabular view of the data being loaded.

Insert Menu

  • Environment Variables − Inserts a standard list of Environment variables.
  • Script Files − Allows to browse for script files and insert them.
  • Connect/Disconnect Statement − Used to connect or disconnect from external databases.

Tab Menu

  • Insert Tab − Inserts a Tab at the current cursor position.
  • Promote/Demote Tab − Allows to move the tabs from left to right and vice versa.
  • Merge with Previous − Used to merge the content of the active tab with the previous tab.

Tools Menu

  • ODBC Administrator 64 bit/ODBC Administrator 32 bit − Allows to set the correct DSN information for data sources.
  • Editor Preferences − Allows you to configure the text font and size, help features, shortcuts, default-scripting engine etc.
  • Syntax Check − Used to validate the syntax of the script code.

QlikView – Functions

QlikView has many built-in functions, which are available to be applied to data that is already available in memory. These functions are organized into many categories and the syntax of the function appears as soon as it is selected. We can click on the Paste button to get the expression into the editor and supply the arguments.

Create Table Box

Create a Table Box by following the menu as shown in the screen shot given below.

table_box_option

On completing the above given step, we get a window to show the Calculation condition at the bottom left.

3_functions_calculation_fields.

List of Functions

Click on the button next to calculation condition and go to the Function tab. It shows the list of functions available.

4_functions_all_functions

On choosing String from the functions category, we can see only a few functions, which take a string as an argument.

5_functions_string

In the next chapters, we will see the use of many important functions.

QlikView – IntervalMatch

QlikView IntervalMatch is a powerful function used to match distinct numeric values to numeric intervals. It is useful in analyzing how the events actually happened versus the planned events. An example scenario is the assembly line of a production house, where belts are planned to run at certain times and for certain durations. However, the actual run can happen at different points in time because of breakdowns, etc.

Example

Consider an assembly line where there are three belts named A, B and C. They are planned to start and stop at specific times of the day. On a given day, we study the actual start and stop events and analyze everything that happened on that day. For this, we consider two sets of observations as shown below.

# Data Set for AssemblyLine.
StartTime,EndTime, BeltNo
00:05,4:20, A
1:50,2:45,B
3:15,10:30,C
	
# Data set for the events happened.
ActualTime,Product
1:10,Start Belt A
2:24,Stop Belt A
3:25,Restart Belt A
4:35,Stop Belt A
2:20,Start Belt B
3:11, Stop Belt B
3:15,Start Belt C
11:20, Stop Belt C

Creating the Script

We open the script editor in a new QlikView document using Control+E. The following code creates the required tables as inline data. After creating this script, press control+R to reload the data into the QlikView document.

intervalmatch_create_table
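
The script in the screenshot sets up the two tables and the IntervalMatch bridge; a sketch using the data sets listed above (the table names are assumptions) is −

// Inline table of the planned belt runs.
AssemblyLine:
LOAD * INLINE [
StartTime, EndTime, BeltNo
00:05, 4:20, A
1:50, 2:45, B
3:15, 10:30, C
];

// Inline table of the events that actually happened.
Events:
LOAD * INLINE [
ActualTime, Product
1:10, Start Belt A
2:24, Stop Belt A
3:25, Restart Belt A
4:35, Stop Belt A
2:20, Start Belt B
3:11, Stop Belt B
3:15, Start Belt C
11:20, Stop Belt C
];

// Bridge table matching each ActualTime to the interval it falls in.
BridgeTable:
IntervalMatch(ActualTime)
LOAD DISTINCT StartTime, EndTime
Resident AssemblyLine;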

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the IntervalMatch function. Go to the menu item Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and select the required fields to be displayed.

 intervalmatch_create_table

Showing the Table Box

On clicking OK in the above window, a table appears showing the field ActualTime matched to the intervals StartTime and EndTime.

intervalmatch_table_data

QlikView – Aggregate Functions

QlikView Aggregate functions are used to produce aggregate data from the rows of the table. The functions are applied to the columns when creating the load script. Given below is a sample list of Aggregate functions. We also need to apply the Group by clause appropriately when applying the aggregate functions.

  • SUM gives the sum of the numeric values of the column.
  • AVG gives the average of the numeric values of the column.
  • MAX gives the maximum of the numeric values of the column.
  • MIN gives the minimum of the numeric values of the column.

Example

Consider the following data stored as product_sales.csv in the local system. It represents the sales figures for different product lines and product category in a store.

Product_Line,Product_category,Quantity,Value
Sporting Goods,Outdoor Recreation,12,5642
Food, Beverages & Tobacco,38,2514
Apparel & Accessories,Clothing,54,2365
Apparel & Accessories,Costumes & Accessories,29,4487
Sporting Goods,Athletics,11,812
Health & Beauty,Personal Care,21,6912
Arts & Entertainment,Hobbies & Creative Arts,58,5201
Arts & Entertainment,Paintings,73,8451
Arts & Entertainment,Musical Instruments,41,1245
Hardware,Tool Accessories,2,456
Home & Garden,Bathroom Accessories,36,241
Food,Drinks,54,1247
Home & Garden,Lawn & Garden,29,5462
Office Supplies,Presentation Supplies,22,577
Hardware,Blocks,53,548
Baby & Toddler,Diapering,19,1247

Creating the Load Script

We open the script editor in a new QlikView document using Control+E. The following code loads the required data. After creating this script, press Control+R to reload the data into the QlikView document.

Aggregate_create_script

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the Aggregate function. Go to the menu Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and then select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

Aggregate_data

Applying SUM() function

Given below is the load script to find the sum of the sales quantity and sales value across the Product Lines and product categories.
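The script itself appears only as a screenshot below; a sketch of what it would typically look like is given here. The field aliases are illustrative, and the file path follows the one used in the other examples of this chapter.

// Sum of sales Quantity and Value for each Product Line and Product Category.
LOAD Product_Line,
     Product_category,
     sum(Quantity) as TotalQuantity,
     sum(Value) as TotalValue
FROM
[E:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Group by Product_Line, Product_category;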

Aggregate_sum_script

Click OK and press Control+R to reload the data into the QlikView document. Now follow the same steps as given above in − Creating Sheet Object − to create a QlikView Table Box for displaying the result of the script as shown below.

Aggregate_sum_data

Applying AVG() function

Given below is the load script to create the average of the sales quantity and sales value across each Product Line.

# Average sales of Quantity and value in each Product Line.
LOAD Product_Line, 
     avg(Quantity) as AvgQuantity,
     avg(Value) as AvgValue
FROM
[E:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Group by Product_Line;

Click OK and press Control+R to reload the data into QlikView document. Now follow the same steps as given above in − Creating Sheet Objects to create a QlikView Table Box for displaying the result of the script as shown below.

Aggregate_average_data

Applying MAX() & MIN() function

Given below is the load script to create the maximum and minimum of the sales quantity across each Product Line.

# Maximum and Minimum sales in each product Line.
LOAD Product_Line,
     max(Quantity) as MaxQuantity,
     min(Quantity) as MinQuantity
FROM
[E:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Group by Product_Line;

Click OK and press Control+R to reload the data into the QlikView document. Now follow the same steps as above in − Creating Sheet Object − to create a QlikView Table Box for displaying the result of the script as shown below.

Aggregate_max_min_data

QlikView – Match Function

The Match() function in QlikView is used to compare the value of a string expression with the data values present in a column. It is similar to the IN operator in SQL. It is useful for fetching rows containing specific strings, and it also has an extension in the form of the wildmatch() function.

Let us consider the following data as input file for the examples illustrated below.

Product_Id,Product_Line,Product_category,Product_Subcategory
1,Sporting Goods,Outdoor Recreation,Winter Sports & Activities
2,Food, Beverages & Tobacco,Food Items,Fruits & Vegetables
3,Apparel & Accessories,Clothing,Uniforms
4,Sporting Goods,Athletics,Rugby
5,Health & Beauty,Personal Care
6,Arts & Entertainment,Hobbies & Creative Arts,Musical Instruments
7,Arts & Entertainment,Hobbies & Creative Arts,Orchestra Accessories
8,Arts & Entertainment,Hobbies & Creative Arts,Crafting Materials
9,Hardware,Tool Accessories,Power Tool Batteries
10,Home & Garden,Bathroom Accessories,Bath Caddies
11,Food, Beverages & Tobacco,Food Items,Frozen Vegetables
12,Home & Garden,Lawn & Garden,Power Equipment
13,Office Supplies,Presentation Supplies,Display
14,Hardware,Tool Accessories,Jigs
15,Baby & Toddler,Diapering,Baby Wipes

Load Script with Match() Function

The following load script reads the file named product_categories.csv. We search the field Product_Line for values matching the strings ‘Food’ and ‘Sporting Goods’.

Matched_func_script
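The screenshot is not reproduced here; a sketch of an equivalent load script is given below. The file path is an assumption made for illustration.

// Keep only rows whose Product_Line exactly matches one of the given strings.
LOAD Product_Id,
     Product_Line,
     Product_category,
     Product_Subcategory
FROM
[E:\Qlikview\data\product_categories.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Where match(Product_Line, 'Food', 'Sporting Goods');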

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the match function. Go to the menu Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and then select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

Matched_func_data

Load Script with Wildmatch() Function

The wildmatch() function is an extension of the match() function, in which we can use wildcards as part of the strings used to match the values in the fields being searched. We search for the strings ‘Off*’ and ‘*ome*’.

wildmatch_script
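Again, the screenshot is not reproduced here; an equivalent load script might look like the following sketch (file path assumed).

// Keep rows whose Product_Line matches either wildcard pattern.
LOAD Product_Id,
     Product_Line,
     Product_category,
     Product_Subcategory
FROM
[E:\Qlikview\data\product_categories.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Where wildmatch(Product_Line, 'Off*', '*ome*');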

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the wildmatch function. Go to the menu item Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and then select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

wildmatch_data

QlikView – Rank Function

The Rank() function in QlikView is used to display the rank of the values in a field, and it can also be used to return rows with a specific rank value. So it is used in two scenarios. The first scenario is in QlikView charts, to display the rank of the values in a field; the second is with an Aggregate function, to display only the rows that have a specific rank value.

Input Data

The data used in the examples describing Rank function is given below. You can save this as a .csv file in a path in your system where it is accessible by QlikView.

Product_Id,Product_Line,Product_category,Quantity,Value
1,Sporting Goods,Outdoor Recreation,12,5642
2,Food, Beverages & Tobacco,38,2514
3,Apparel & Accessories,Clothing,54,2365
4,Apparel & Accessories,Costumes & Accessories,29,4487
5,Sporting Goods,Athletics,11,812
6,Health & Beauty,Personal Care,21,6912
7,Arts & Entertainment,Hobbies & Creative Arts,58,5201
8,Arts & Entertainment,Paintings,73,8451
9,Arts & Entertainment,Musical Instruments,41,1245
10,Hardware,Tool Accessories,2,456
11,Home & Garden,Bathroom Accessories,36,241
12,Food,Drinks,54,1247
13,Home & Garden,Lawn & Garden,29,5462
14,Office Supplies,Presentation Supplies,22,577
15,Hardware,Blocks,53,548
16,Baby & Toddler,Diapering,19,1247
17,Baby & Toddler,Toys,9,257
18,Home & Garden,Pipes,81,1241
19,Office Supplies,Display Board,29,2177

Load Script

The above data is loaded to the QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Click OK and press Control+R to load the data into QlikView’s memory.

rank_laod_script

Creating Chart with Rank() Function

Next, we follow the steps given below to create a chart, which shows the rank of the field Value with respect to the dimension Product_Line.

Select the Chart Type

Click on the Chart wizard and choose the option straight table as the chart type. Click Next.

rank_chart_type

Select the Chart Dimension

From the First Dimension drop down list, choose Product_Line as dimension. Click Next.

rank_select_diemnsions

Define the Chart Expression

In the custom expression field, mention the rank expression as shown below. Here we are considering the numeric field named Value, which represents the Sales value for each category under each Product Line. Click Next.

rank_chart_expression
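The expression in the screenshot is not reproduced here; a typical expression for this step is sketched below, ranking each Product_Line by its total sales Value.

// Rank of each Product_Line by its total sales Value.
rank(sum(Value))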

Displaying the Chart

On clicking Finish in the above step, the following chart appears which shows the rank of the sales value of each Product Line.

rank_chart_data

Using Aggregate Function with Rank

The aggregate functions like max, min, etc. can take rank as an argument to return rows satisfying certain rank values. We consider the following expression to be put in the script editor, which will give the rows containing the highest sales under each Product Line.

# Load the records with highest sales value for each product line.
LOAD Product_Line, 
     max(Value,1) as MaxValue
FROM
[E:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
group by Product_Line;

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the above given script. Go to the menu Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

rank_aggr_func

QlikView – Peek Function

The peek() function in QlikView is used to fetch the value of a field from a previous record and use it in calculations.

Input Data

Let us consider the monthly sales figure as shown below. Save the data with file name monthly_sales.csv.

Month,Sales Volume
March,2145
April,2458
May,1245
June,5124
July,7421
August,2584
September,5314
October,7846
November,6532
December,4625
January,8547
February,3265

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Edit the load script to add the following code. Click OK and press Control+R to load the data into QlikView’s memory.

LOAD Month, 
     [Sales Volume],
     peek('Sales Volume') as Prevmonth
FROM
[C:\Qlikview\data\monthly_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the above script. Go to the menu item Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below. Also, set the sort order as shown below to get the result in the same order of the field Month as it is in the source.

Monthly_sales_load_order

On completing the above steps and clicking Finish, we get the Table Box showing the data as given below.

Monthly_sales_prevmonth

Using peek() Function in Calculations

The peek() function can be used in calculations involving other columns. Let us display the percentage change in sales volume for each month. The following script achieves this result.

LOAD Month, 
     [Sales Volume],
     peek('Sales Volume') as Prevvolume,
     ([Sales Volume] - peek('Sales Volume')) / peek('Sales Volume') * 100 as Difference
FROM
[C:\Qlikview\data\monthly_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the above script. Go to the menu item Layout → New Sheet Object → Table Box. The following window appears in which we mention the Title of the table and select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

Monthly_sales_prevmonth_diff

QlikView – RangeSum Function

The RangeSum() function in QlikView is used to do a selective sum on chosen fields which is not easily achieved by the sum function. It can take expressions containing other functions as its arguments and return the sum of those expressions.

Input Data

Let us consider the monthly sales figure as shown below. Save the data with file name monthly_sales.csv.

Month,Sales Volume
March,2145
April,2458
May,1245
June,5124
July,7421
August,2584
September,5314
October,7846
November,6532
December,4625
January,8547
February,3265

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Edit the load script to add the following code. Click OK and press Control+R to load the data into QlikView’s memory.

LOAD 
Month, [Sales Volume]
FROM
[C:\Qlikview\data\monthly_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Applying RangeSum() Function

With the above data loaded into QlikView’s memory, we edit the script to add a new column, which will give a rolling sum of the month wise sales volume. For this, we also take the help of the peek function discussed in the earlier chapter to hold the value of the previous record and add it to the sales volume of the current record. The following script achieves the result.

LOAD Month, 
     [Sales Volume],
     rangesum([Sales Volume], peek('Rolling')) as Rolling
FROM
[C:\Qlikview\data\monthly_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Sheet Object

Let us create a Table Box sheet object to show the data generated by the above given script. Go to the menu Layout → New Sheet Object → Table Box.

The following window appears in which we mention the Title of the table and select the required fields to be displayed. Clicking OK displays the data from the CSV file in the QlikView Table Box as shown below.

RangeSum

QlikView – Documents

QlikView documents are the files that contain all the objects used for data presentation and analysis. They contain the sheets, variables, data model, source-data connection details, and even the data that is loaded after pulling it from the source.

Document Properties

We can quickly find out the basic information of a QlikView document. Click on Help → Document Support Info. Given below is a sample output.

document_details

Sheet Objects

The QlikView document contains various Sheet objects, which can be moved around by dragging them and placed anywhere in the document. Let us create two sheet objects, a Table box and a Statistics Box. You can follow the earlier chapters where we have already learnt to create sheet objects. In addition, we are using the file Product_sales.csv, which is mentioned here.

Scheduling a Document

A QlikView document can be scheduled to refresh at some desired intervals. This is done using the Schedule tab available under the Document properties window.

document_scheduling

QlikView – List Box

A List Box represents the list of all the values of a specific field. Selecting a value in the List Box highlights the related values in other sheet objects. This helps in faster visual analysis. It is also very useful to follow a drill-down path among various sheet objects. It also has a search feature, which allows us to search for specific values in the List Box; this is very helpful for a very long list of values.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Edit the load script to add the following code. Click OK and press Control+R to load the data into QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating List Box

Creation of List Box involves navigating through menu Layout → New Sheet Object → List Box. The following screen shows these steps.

list_box_creation

Next, we choose Product_category as the field on which we build the List Box.

list_box_properies

List Box Data

Finishing the above steps brings the following screen, which shows the values of Product category as a list box.

list_box_data

Search Option in List Box

When the List Box contains a very large number of values, it is difficult to scroll down and look for a specific value. So the search box at the top of the List Box can be used to type the search string. The relevant values appear as soon as the first letter is typed.

list_box_search

List Box and Association

Other sheet objects automatically get associated with the List Box, and the association is easily observed by selecting values from the List Box.

list_box_association

QlikView – Multi Box

Multi Box represents the list of all the values from multiple fields as drop down values. Similar to list box, the selection of a value in Multi Box highlights the related values in other sheet objects. This helps in faster visual analysis. It is also very useful to follow a drill down path among various sheet objects.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Edit the load script to add the following code. Click OK and press Control+R to load the data into QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Multi Box

Creation of Multi Box involves navigating through menu Layout → New Sheet Object → Multi Box. The following screen shows these steps.

Multi_box_create

Next, we choose the fields of the Product Sales table to build the Multi Box.

Multi_box_fields

Multi Box Data

Finishing the above steps brings the following screen, which shows the values of Product category as a Multi box.

Multi_box_data

Multi Box and Association

Other Sheet Objects automatically get associated with the Multi Box and the association is easily observed by selecting values from the Multi Box.

Multi_box_association

QlikView – Text Object

QlikView text Object is used to show some descriptive information about the QlikView report being displayed. It can also show calculations based on certain expressions. It is mainly used for displaying nicely formatted information using colors and different font types in a box separately from the other Sheet Objects.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. Edit the load script to add the following code. Click “OK” and press “Control+R” to load the data into the QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Table Box

For the above data, let us create a Table Box, which will show the data in a tabular form. Go to the menu Layout → New Sheet Object → Table Box and choose the columns as shown below.

2_text_object_create_table_box

Click Apply and then OK to finish creating the Table Box. The following screen appears.

3_text_object_create_table_box

Creating Text Object

For the above data, let us create a Text Object. Go to the menu Layout → New Sheet Object → Text Object as shown below.

4_text_object_create

Creating Text Object Expression

On the text box created above, right click and choose properties. Then enter the content to be displayed on the Text Object in the Text box under the General tab as shown below.

5_text_object_edit

Choosing the Background Color

The background color of the Text Object can be set using the background option in the General tab.

6_text_object_edit

The Text Object

The final Text Object is shown below. If we click on some Product Line to filter it, then the content in the Text Object changes accordingly to reflect the new values.

7_text_object_final_data

QlikView – Bar Chart

Bar charts are a very widely used charting method to study the relation between two dimensions in the form of bars. The height of a bar in the graph represents the value of one dimension. The number of bars represents the sequence of values, or the grouped values, of the other dimension.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to the QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. Edit the load script to add the following code. Click “OK” and press “Control+R” to load the data into QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Creating Table Box(Sheet Object)

For the above data, let us create a Table Box, which will show the data in a tabular form. Go to the menu Layout → New Sheet Object → Table Box and choose the columns as shown below.

2_text_object_create_table_box

Click Apply and then OK to finish creating the Table Box. The following screen appears.

3_text_object_create_table_box

Using the Quick Chart Wizard

To start creating a bar chart, we will use the quick chart wizard. On clicking it, the following screen appears which prompts for selecting the chart type. Choose bar Chart and click Next.

5_bar_chart_chart_type

Choose the Chart Dimension

Choose Product Line as the First Dimension.

6_bar_chart_chart_diemnsion

Choose the Chart Expression

The chart expression is used to apply functions like Sum, Average, or Count on the fields with numeric values. We will apply the Sum function on the field named Value. Click Next.

7_bar_chart_chart_expression

Choose the Chart Format

The Chart format defines the style and orientation of the chart. We choose the first option in each category. Click Next.

8_bar_chart_chart_format

The Bar Chart

The Bar chart appears as shown below. It shows the height of the field value for different product lines.

9_bar_chart_display

QlikView – Pie Chart

A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the numbers corresponding to each slice are also represented in the chart. QlikView creates pie charts using the chart wizard or the Chart sheet object.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. Edit the load script to add the following code. Click “OK” and press “Control+R” to load the data into the QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Using the Quick Chart Wizard

To start creating a Pie chart, we will use the quick chart wizard. On clicking it, the following screen appears which prompts for selecting the chart type. Choose Pie Chart and click Next.

2_pie_chart_chart_type

Choose the Chart Dimension

Choose Product Line as the First Dimension.

3_pie_chart_dimension

Choose the Chart Expression

The chart expression is used to apply functions like Sum, Average or Count on the fields with numeric values. We will apply the Sum function on the field named Value. Click Next.

4_pie_chart_expression

Choose the Chart Format

The Chart format defines the style and orientation of the chart. We choose the third option. Click Next.

5_pie_chart_format

The Pie Chart

The Pie chart appears as shown below. It shows the share of the field Value for the different product lines.

6_pie_chart_display

QlikView – Dashboard

A Dashboard is a powerful feature to display values from many fields simultaneously. QlikView’s feature of data association in memory can display the dynamic values in all the sheet objects.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. Edit the load script to add the following code. Click “OK” and press “Control+R” to load the data into the QlikView’s memory.

LOAD Product_Line, 
     Product_category, 
     Value
FROM
[C:\Qlikview\data\product_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Choose Metrics

We choose the fields from the above input data as metrics to be displayed in the dashboard. For this, we follow the steps in the menu Layout → Select Fields.

3_dahboard_matrices_fileds

In the next screen, choose the available fields to be displayed in the dashboard. Click “OK”.

4_dahboard_listbox_fields

The following screen appears, displaying all the fields.

5_dahboard_listbox_display

Adding Chart to Dashboard

Now we add a chart to the dashboard by right-clicking anywhere in the sheet and choosing New Sheet Object → Chart.

6_dahboard_add_chart

Choose the Chart Type

Let us choose the chart type as a bar chart to display the sales values for various product Lines.

7_dahboard_add_bar_chart

Chart Dimension

Let us select the Product Line as the Chart Dimension.

8_dahboard_chart_dimension

Chart Expression

The expression to display the sales value for the Product Line dimension is written in the expression editor.

9_dahboard_chart_expression
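The exact formula in the screenshot is not reproduced here; for this chart it would typically be a simple sum over the measure, as sketched below.

// Total sales value for each Product Line.
sum(Value)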

The Dashboard

Given below is the dashboard displayed after finishing the above steps.

10_dahboard_data

Using the Dashboard

The values in the above Dashboard can be selected for filtering specific products and the chart changes accordingly. In addition, the associated values are highlighted.

11_dahboard_data-associated

QlikView – Data Transformation

Data Transformation is the process of modifying existing data into a new data format. It can also involve filtering out or adding specific values to the existing data set. QlikView can carry out data transformation after reading the data into its memory, using many built-in functions.

Input Data

Let us consider the following input data, which represents the sales figures of each month. This is stored as a CSV file with the name quarterly_sales.csv.

Month,SalesVolume
March,2145
April,2458
May,1245
Sales Values in Q2
June,5124
July,7421
August,2584
Sales Values in Q3
September,5314
October,7846
November,6532
December,4625
January,8547
February,3265

Loading the Data

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file quarterly_sales.csv. Click Next.

1_dt_file_wizard

The Transformation Wizard

The next screen prompts us to choose some data transformation. Click on the button Enable Transformation Step.

2_dt_edit_transformations

Transformation Type

In this step, we will select the transformation to eliminate the rows, which describe the quarter. We select Garbage → delete marked and select the two rows, which are not required. Click Next.

3_dt_garbage_rows

File Wizard Options

After selecting the type of Transformation and the rows to be removed, the next screen prompts us for any further transformation like selecting a where clause or adding any Prefixes. We will ignore this step and click Finish.

4_dt_next_garbage_rows

Load Script for Transformed Data

The Load script for the above data after all the transformation steps are complete is given below.

5_dt_load_script
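The wizard-generated script appears only as a screenshot. As a hand-written sketch, the same garbage rows could also be removed with a simple where clause (file path assumed for illustration).

// Drop the quarter-label rows that are mixed into the month data.
LOAD Month,
     SalesVolume
FROM
[C:\Qlikview\data\quarterly_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq)
Where not match(Month, 'Sales Values in Q2', 'Sales Values in Q3');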

Displaying Transformed Data

The transformed data can be displayed by using a Table Box sheet object. The steps to create it are given below.

6_dt_create_table_box

Next, we choose the fields for the Table Box.

7_dt_table_box_fields

The Table Box now displays the data in the sheet.

8_dt_fina_data

QlikView – Fill Function

The Fill function in QlikView is used to fill values from existing fields into a new field.

Input Data

Let us consider the following input data, which represents the actual and forecasted sales figures.

Month,Forecast,Actual
March,2145,2247
April,2458,
May,1245,
June,5124,3652
July,7421,7514
August,2584,
September,5314,4251
October,7846,6354
November,6532,7451
December,4625,1424
January,8547,7852
February,3265,

Load Script

The above data is loaded to the QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data.

1_fill_choose_file

After clicking Next in the above step, we choose the Enable Transformation Step button to carry out the required data transformation.

2_fill_file_wizard

Selecting the Data Transformation

As we are going to use the Fill function, let us choose the Fill tab, which displays the empty values under the Actual field.

3_fill_transformation

Create the Fill Condition

On clicking the Fill button, the options to choose the target column and the cell condition appear. We choose column three, as we want to fill the empty values of this column with values from the same row in column two. Also, choose the Cell Value as empty so that only the empty cells will be overwritten with new values.

4_fill_cell_condition

Transformed Data

On completing the above steps, we get the transformed data as shown below.

5_fill_filled_cells

Load Script for Transformed Data

The load script for the transformed data can be seen using the script editor. The script shows the expression, which replaces the empty cell values.

6_fill_laod_script
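The generated script is shown only as a screenshot. As an alternative hand-written sketch, the same result can be produced directly in a load statement, replacing empty Actual cells with the Forecast value from the same row (file name and path assumed).

// Fill empty Actual values with the corresponding Forecast value.
LOAD Month,
     Forecast,
     if(len(trim(Actual)) = 0, Forecast, Actual) as Actual
FROM
[C:\Qlikview\data\sales_forecast.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);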

Display Transformed Data

The transformed data can be seen by creating a Table Box using the option in the menu Layout → New Sheet Object.

7_fill_table_box

QlikView – Column Manipulation

Column Manipulation is a type of Data Transformation in which a new column is populated with values from an existing column, which meets certain criteria. The criteria can be an expression, which is created as part of the Data Transformation step.

Input Data

Let us consider the following input data, which represents the actual and forecasted sales figures.

Month,Forecast,Actual
March,2145,2247
April,2458,2125
May,1245,2320
June,5124,3652
July,7421,7514
August,2584,3110
September,5314,4251
October,7846,6354
November,6532,7451
December,4625,1424
January,8547,7852
February,3265,2916

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. After clicking Next, we choose the Enable Transformation Step button to carry out the required data transformation.

1_cm_file_wizard

Selecting the Data Transformation

Choose the Column tab and then click the New button. It asks us to specify the new column and the row condition. We specify column 3 as the source column and pick the values that start with 2 as the Row Condition.

2_cm_add_column

Transformed Data

On completing the above steps, we get the transformed data as shown below.

3_cm_added_column

Load Script for Transformed Data

The load script for the Transformed data can be seen using the script editor. The script shows the expression, which creates the new column with required values.

4_cm_file_wizard_script
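The generated script is shown only as a screenshot. A hand-written sketch with the same effect is given below; the new column name and the file path are assumptions.

// Populate a new column from Actual only for rows where Actual starts with 2.
LOAD Month,
     Forecast,
     Actual,
     if(left(Actual, 1) = '2', Actual) as NewColumn
FROM
[C:\Qlikview\data\sales_forecast.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);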

Display Transformed Data

The transformed data can be seen by creating a Table Box using the option in the menu Layout → New Sheet Object.

5_cm_completed

QlikView – Rotating Tables

The Rotating table in QlikView is similar to the column and row transpose feature in Microsoft Excel but with some additional options. We can transpose columns in multiple directions and they give different results. In this chapter, we will be seeing the normal transpose option of converting rows to columns.

Input Data

Let us consider the following input data, which represents the actual and forecasted sales figures.

Month,Forecast,Actual
March,2145,2247
April,2458,
May,1245,
June,5124,3652
July,7421,7514
August,2584,
September,5314,4251
October,7846,6354
November,6532,7451
December,4625,1424
January,8547,7852
February,3265,

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data.

1_file_wizard

After clicking Next, we choose the Enable Transformation Step button to carry out the required data transformation.

Selecting the Data Transformation

As we are going to use the Rotate function, let us choose the Rotate tab which displays the values of all the fields.

2_rotate_option

Apply Rotate

We click the Transpose button to transpose the above data. The transposed data appears as shown below.

3_rotate_rotated

Load Script for Transformed Data

The load script for the transformed data can be seen using the script editor. The script shows the commands that carry out the rotate transformation.

4_rotate_load_script

Display Transformed Data

The transformed data can be seen by creating a Table Box using the option in the menu Layout → New Sheet Object.

5_rotate_loaded_data

QlikView – Dimensions and Measures

Dimensions and Measures are fundamental entities, which are always used in data analysis. For example, consider the result of the analysis, “what is the percentage change in volume of sales for each quarter?” In this case, each quarter represents a Dimension, which is the name of the quarter. The percentage change in volume represents a Measure, which is a calculation with respect to each value in the dimension. Below are some widely accepted definitions of these two terms.

  • Dimension − It is a descriptive field in the data set which represents a few distinct values. Examples − Month, Year, Product ID, etc.
  • Measures − It is a numeric field on which some calculations are performed for each distinct value of dimension.

Input Data

Let us consider the following input data, which represents the sales volume and Revenue of different product lines and product categories in different regions. Save the data into a .csv file.

ProductID,ProductCategory,Region,SalesVolume, Revenue
1,Outdoor Recreation,Europe,457,25841
2,Clothing,Europe,125,54281
3,Costumes & Accessories,South Asia,781,54872
4,Athletics,South Asia,839,87361
5,Personal Care,Australia,473,15425
6,Arts & Entertainment,North America,625,84151
7,Hardware,South America,772,45812

Load Script

The above data is loaded to the QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Click OK and press Control+R to load the data into QlikView’s memory.
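The resulting load script is not shown in this chapter; it would typically look like the following sketch, in which the file name and path are assumptions.

// Load the product, region, sales volume and revenue data.
LOAD ProductID,
     ProductCategory,
     Region,
     SalesVolume,
     Revenue
FROM
[C:\Qlikview\data\product_region_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);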

Table Structure

We can see the structure of the table by following the menu File → Table Viewer or pressing Control+T. The following screen comes up in which we have marked the dimensions inside a green box and the measures inside a red box.

1_view_table

Using Dimensions and Measures

Let us create a straight table chart showing the calculation using above dimensions and measures. Click on the Quick Chart Wizard as shown below.

Next, click on the Straight Table option. Click Next.

2_dm_straight_table

Choose Dimensions

In this screen, we choose Region as the dimension as we want to select the total revenue for each region.

3_dm_table_diemsnion

Choose Measure

The Next screen prompts for applying the calculation on a measure field. We choose to apply Sum on the field Revenue.

4_dm_table_measure

Final chart

On completing the above steps, we get the final chart, which shows the total revenue (Measure) for each region (Dimension).

5_dm_final_chart

QlikView – Star Schema

A star schema model is a type of data model in which multiple dimensions are linked to a single fact table. Of course, in bigger models there can be multiple fact tables linked to multiple dimensions and to other fact tables. The usefulness of this model lies in performing fast queries with minimal joins among the various tables. The fact table contains the data, which are measures with numeric values. Calculations are applied on the fields in the fact table. The unique keys of the dimension tables are used to link them to the fact table, which also has the corresponding keys, usually with the same field names. Therefore, the fact table contains the keys from all the dimension tables and forms a concatenated primary key used in various queries.

Input Data

Given below is a list of tables, which contain the data for different products from various suppliers and regions. Also the supply happens at different time intervals, which are captured in the Time dimension table.

Product Dimension

It contains the Product Category and Product Names. The Product ID field is the unique Key.

ProductID,ProductCategory,ProductName
1,Outdoor Recreation,Winter Sports & Activities
2,Clothing,Uniforms
3,Lawn & Garden,Power Equipment
4,Athletics,Rugby
5,Personal Care,Shaver
6,Arts & Entertainment,Crafting Materials
7,Hardware,Power Tool Batteries

Region Dimension

It contains the Region Names where the suppliers are based. The RegionID field is the unique Key.

RegionID,Continent,Country
3,North America, USA
7,South America, Brazil
12,Asia,China
2,Asia,Japan
5,Europe,Belgium

Supplier Dimension

It contains the Supplier Names, which supply the above products. The SupplierID field is the unique Key.

SupplierID,SupplierName
3S12,Supre Suppliers
4A15,ABC Suppliers
4S66,Max Sports
5F244,Nice Foods
8A45,Artistic angle

Time Dimension

It contains the Time periods when the supply of the above products occur. The TimeID field is the unique Key.

TimeID,Year,Month
1,2012,Feb
2,2012,May
3,2012,Sep
4,2013,Aug
5,2014,Jan
6,2014,Nov

Supplier Quantity Fact

It contains the values for the quantities supplied and the percentage of defects in them. It joins to each of the above dimensions through keys with the same names.

ProductID,RegionID,TimeID,SupplierID,Quantity, DefectPercentage
1,3,3,5F244,8452,12
2,3,1,4S66,5124,8.25
3,7,1,8A45,5841,7.66
4,12,2,4A15,5123,1.25
5,5,3,4S66,7452,8.11
6,2,5,4A15,5142,3.66
7,2,1,4S66,452,2.06

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Click OK and press Control+R to load the data into QlikView’s memory. Below is the script which appears after each of the above file is read.

LOAD ProductID, 
     ProductCategory, 
     ProductName
FROM
[C:\Qlikview\images\StarSchema\Product_dimension.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

LOAD TimeID, 
     Year, 
     Month
FROM
[C:\Qlikview\images\StarSchema\Time.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

LOAD SupplierID, 
     SupplierName
FROM
[C:\Qlikview\images\StarSchema\Suppliers.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

LOAD RegionID, 
     Continent, 
     Country
FROM
[C:\Qlikview\images\StarSchema\Regions.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

LOAD ProductID, 
     RegionID, 
     TimeID, 
     SupplierID, 
     Quantity, 
     DefectPercentage
FROM
[C:\Qlikview\images\StarSchema\Supplier_quantity.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Star Schema Data Model

After reading the above data into QlikView memory, we can look at the data model, which shows all the tables, fields, and relationships in the form of a star schema.

starschema_model

QlikView – Synthetic Key

Synthetic Key is QlikView’s solution to create an artificial key when there is ambiguity about which key to use between two tables. This situation arises when two tables have two or more fields in common. QlikView’s feature of creating association in memory automatically detects this scenario and creates an additional table, which will hold the value of the new key created.

Input Data

Let us consider the following two CSV data files, which are used as input for further illustrations.

Sales:
ProductID,ProductCategory,Country,SaleAmount
1,Outdoor Recreation,Italy,4579
2,Clothing,USA,4125
3,Costumes & Accessories,South Korea,6521

Product:
ProductID, Country
3,Brazil
3,China
2,Korea
1,USA

Load Script

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file.

synthetic_key_load_script

Data Model

Next, we look at the data model by using the menu command for table viewer, Control+T. The following screen comes up, which shows the creation of a third table that supplies the value of the synthetic key as both the tables have ProductID and Country as matching keys.

synthetic_key_data_model

Impact of Synthetic key

Synthetic keys indicate a flaw in the data model being used. They do not cause any issue in the correctness of the data or the performance of the report. Things will work fine if a big data model has one or two instances of synthetic keys. However, if we have too many of them, it is an indication that the data model needs to be redesigned.
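One common way to redesign the model (a sketch only, not part of this chapter's example) is to keep a single common field between the two tables, for instance by renaming one of them; the renamed field and the file paths below are assumptions.

Sales:
LOAD ProductID,
     ProductCategory,
     Country,
     SaleAmount
FROM
[C:\Qlikview\data\Sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Product:
LOAD ProductID,
     Country as SupplierCountry   // renamed, so only ProductID links the two tables
FROM
[C:\Qlikview\data\Product.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);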

QlikView – Generating Data

Many times, we need some data to be generated programmatically by the software being used, which does not come from a source. For example, 100 random numbers, or just the dates of the 23rd week of a year. A data analyst may need such data to be created to perform some analysis on data that does not contain these values as it arrived. QlikView provides a function called Autogenerate, which can be used for such requirements.

Data Requirement

Consider a scenario where we need to find only the dates, which are a Thursday or a Sunday. We need to find it for the range starting today until the end of the year. We create the following script, which will achieve this.

generate_data_load_script
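The screenshot script is not reproduced here; a sketch of an equivalent script is given below. The variable names are illustrative, and the weekday comparison assumes QlikView's default day names.

// Two variables: the first day of the current month and the last day of the year.
LET vStart = num(monthstart(today()));
LET vEnd = num(makedate(year(today()), 12, 31));

// One record per day in the range; keep only Thursdays and Sundays.
Calendar:
LOAD Date($(vStart) + RecNo() - 1) as CalendarDate,
     WeekDay($(vStart) + RecNo() - 1) as Day
AUTOGENERATE ($(vEnd) - $(vStart) + 1)
Where match(WeekDay($(vStart) + RecNo() - 1), 'Thu', 'Sun');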

Explanation

We declare two variables to capture the first day of the current month and the end of the year. Next, we apply various functions and a filter condition to generate the required values. The recno() function creates one record for each of these dates. We add the Autogenerate function, giving the variables as the range.

Generated data

On loading the above script to QlikView’s memory and creating a Table Box using the menu Layout → New Sheet Objects → Table Box, we get the data created as shown below.

generate_data_calendar

QlikView – Cross Tables

While analyzing data, we come across situations where we want columns to become rows and vice-versa. It is not just about transposing; it also involves rolling up many columns together, or repeating many values in a row many times, to achieve the desired column and row layout in the table.

Input data

Consider the following input data, which shows region wise sales of a certain product for each quarter. We create a delimited file (CSV) with the below given data.

Quarter,Region1,Region2,Region 3
Q1,124,421,471
Q2,415,214,584
Q3,417,321,582
Q4,751,256,95

Loading Input Data

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file. After choosing the options as shown below, click Next.

1_cross_tab_file_wizard

Crosstable Options

In the next window (File Wizard → Options), click on the Crosstable button. It highlights the columns in different colors. The pink color shows the qualifier field, which is going to be repeated across many rows, once for each value in the attribute fields. The cell values under the attribute fields are taken as the data. Click OK.

2_cross_tab_choose_fields

Crosstable Transformation

The transformed data appears, in which all the Region fields are combined into one column, with the values repeated for each quarter.

3_cross_tab_result1

Load Script

The Load script for the crosstable transformations shows the commands given below.

4_cross_tab_load_script
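The generated script is shown only as a screenshot; it would typically look like the sketch below, where the attribute and data field names (Region, Sales) and the file path are assumptions.

// Unpivot the three region columns into Region/Sales pairs, keeping Quarter as the qualifier.
CrossTable(Region, Sales)
LOAD Quarter,
     Region1,
     Region2,
     [Region 3]
FROM
[C:\Qlikview\data\quarterly_region_sales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);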

Crosstable Data

On creating a Table Box sheet object using the menu Layout → New Sheet Objects → Table Box, we get the following output.

5_cross_tab_table_box

QlikView – Straight Tables

Straight Tables are the most widely used sheet object to display data in QlikView. They are very simple yet powerful, with features like column rearrangement, sorting, and coloring of the background.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to the QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. The following screen appears.

1_File_wizard

Click “OK” and press “Control+R” to load the data into QlikView’s memory.

Create Table Box

Next, we create a new sheet Object of type Table Box. We follow the menu as shown below.

3_straight_table_menu

Select the Columns

QlikView prompts for the columns to be chosen, which will be displayed in the final Table Box. We choose all the columns and use the Promote or Demote option to set the order of the columns.

4_straight_table_choose_fields

Select Display Style

Next, we choose the style tab to give specific background colors to the display data. The current style option lists many pre-built styles. We choose Pyjama Red with Stripes every two rows.

6_straight_table_pyjama

Column Re-ordering

We can reorder the positions of the columns by pressing and holding the mouse button at the column headers and then dragging it to the desired position.

7_straight_table_pyjama_reorder

QlikView – Pivot Tables

Pivot Tables are widely used in data analysis to present sum of values across many dimensions available in the data. QlikView’s Chart option has the feature to create a Pivot Table by choosing the appropriate chart type.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Value
Sporting Goods,Outdoor Recreation,5642
Food, Beverages & Tobacco,2514
Apparel & Accessories,Clothing,2365
Apparel & Accessories,Costumes & Accessories,4487
Sporting Goods,Athletics,812
Health & Beauty,Personal Care,6912
Arts & Entertainment,Hobbies & Creative Arts,5201
Arts & Entertainment,Paintings,8451
Arts & Entertainment,Musical Instruments,1245
Hardware,Tool Accessories,456
Home & Garden,Bathroom Accessories,241
Food,Drinks,1247
Home & Garden,Lawn & Garden,5462
Office Supplies,Presentation Supplies,577
Hardware,Blocks,548
Baby & Toddler,Diapering,1247
Baby & Toddler,Toys,257
Home & Garden,Pipes,1241
Office Supplies,Display Board,2177

Load Script

The above data is loaded to the QlikView’s memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. The following screen appears.

1_File_wizard

Click “OK” and press “Control+R” to load the data into QlikView’s memory.

Select Chart Type

Next, we use the chart wizard to select the Pivot Table option. Click Next.

2_pivot_select_chart

Select Chart Dimension

In the next screen, we choose Product_Line as the first dimension for the chart.

3_pivot_choose_dimension

Select Chart Expression

The next screen prompts us for selecting the chart expression, where we choose the sum of Value.

4_pivot_choose_sum

Select the Chart Format

On clicking next, we get the screen to choose chart format in which we select Pyjama Green as the style and the default mode.

5_pivot_chart_style

Pivot Chart Data

Completing the above steps gives us the final chart as below.

6_pivot_pivot_final

QlikView – Set Analysis

QlikView’s Set Analysis feature is used to segregate the data in different sheet objects into sets and keep the values unchanged in some of them. In simpler terms, it creates an option to not associate some sheet objects with others, while the default behavior is that all sheet objects get associated with each other. This helps in filtering the data in one sheet object and seeing the corresponding result in others, while the sheet object chosen as a different set displays values as per its own filters.

Input Data

Let us consider the following input data, which represents the sales figure of different product lines and product categories.

Product_Line,Product_category,Month,Value
Arts & Entertainment,Hobbies & Creative Arts,Jan,5201
Arts & Entertainment,Paintings,Feb,8451
Arts & Entertainment,Musical Instruments,Jan,1245
Baby & Toddler,Diapering,Mar,1247
Baby & Toddler,Toys,Dec,257
Apparel & Accessories,Clothing,Feb,574
Apparel & Accessories,Costumes & Accessories,Apr,1204
Arts & Entertainment,Musical Instruments,Apr,3625
Baby & Toddler,Diapering,Apr,1281
Apparel & Accessories,Clothing,Jul,2594
Arts & Entertainment,Paintings,Sep,6531
Baby & Toddler,Toys,May,7421
Apparel & Accessories,Clothing,Aug,2541
Arts & Entertainment,Paintings,Oct,2658
Arts & Entertainment,Musical Instruments,Mar,1185
Baby & Toddler,Diapering,Jun,1209

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the Script editor from the File menu or press Control+E. Choose the “Table Files” option from the “Data from Files” tab and browse for the file containing the above data. A screen appears as shown below.

1_set_ana_load_script

Create Table Box

Choose all the fields available to create a Table Box using the menu option Layout → New Sheet Objects → Table Box, and a List Box containing the Month field using the menu option Layout → New Sheet Objects → List Box. Also, create a straight table chart showing the total sales under each product category.

4_set_ana_sales_table

Data Association

Now we can observe the association between these three sheet objects by selecting some values in one of them. Let us select the months Apr and Jan from the Month List Box. We can see the change in values in the Table Box and the chart showing the related values.

5_set_ana_association

Clone Object

Next, we clone the sales sum chart to produce a new set of data not associated with other sheet objects. Right click on the chart Sales Sum and click on the option Clone as shown below. Another copy of the same chart appears in the QlikView document.

6_set_ana_clone

Set Expression

Next, we choose the second copy of the chart Sales Sum and right-click it to get the chart properties. We create an expression called Sales values, writing the formula under the Definition tab as shown below.

7_set_ana_setexpression
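The formula in the screenshot is not reproduced here. A typical set expression for this scenario is sketched below; it fixes the month to Apr so that the cloned chart ignores the month selected in the List Box (the exact field value used is an assumption).

// Sum of Value with the Month selection overridden to Apr.
sum({$<Month = {'Apr'}>} Value)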

Applying Set Analysis

On completing the above steps, we find that when we select the month Jun, we get the associated values in the Table Box and the Sales Sum chart. However, the April sales do not change, as they are based on the data from the set expression.

8_set_ana_final

QlikView – Joins

Joins in QlikView are used to combine data from two data sets into one. Joins in QlikView mean the same as joins in SQL. Only the column and row values that match the join conditions are shown in the output. In case you are completely new to joins, you may like to learn about them first.

Input Data

Let us consider the following two CSV data files, which are used as input for further illustrations.

Product List:
ProductID,ProductCategory
1,Outdoor Recreation
2,Clothing
3,Costumes & Accessories
4,Athletics
5,Personal Care
6,Hobbies & Creative Arts

ProductSales:
ProductID,ProductCategory,SaleAmount
4,Athletics,1212
5,Personal Care,5211
6,Hobbies & Creative Arts,1021
7,Display Board,2177
8,Game,1145
9,soap,1012
10,Beverages & Tobacco,2514

Inner Join

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file. Then we edit the commands in the script to create an inner join between the tables.

load_script

Inner join fetches only those rows, which are present in both the tables. In this case, the rows available in both Product List and Product Sales table are fetched. We create a Table Box using the menu Layout → New Sheet Objects → Table Box where we choose all the three fields – ProductID, ProductCategory and SaleAmount to be displayed.

inner_join

Left Join

Left join involves fetching all the rows from the table on the left and the matching rows from the table on the right.

Load Script

Sales:
LOAD ProductID, 
     ProductCategory, 
     SaleAmount
FROM
[C:\Qlikview\data\Productsales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

LEFT JOIN(Sales)

LOAD ProductID, 
     ProductCategory
FROM
[C:\Qlikview\data\product_lists.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

We create a Table Box using the menu Layout → New Sheet Objects → Table Box, where we choose all the three fields − ProductID, ProductCategory and SaleAmount to be displayed.

left_join

Right Join

Right join involves fetching all the rows from the table on the right and the matching rows from the table on the left.

Load Script

Sales:
LOAD ProductID, 
     ProductCategory, 
     SaleAmount
FROM
[C:\Qlikview\data\Productsales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

RIGHT JOIN(Sales)

LOAD ProductID, 
     ProductCategory
FROM
[C:\Qlikview\data\product_lists.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

We create a Table Box using the menu Layout → New Sheet Objects → Table Box, where we choose all the three fields – ProductID, ProductCategory and SaleAmount to be displayed.

right_join

Outer Join

Outer join involves fetching all the rows from the table on the right as well as from the table on the left.

Load Script

Sales:
LOAD ProductID, 
     ProductCategory, 
     SaleAmount
FROM
[C:\Qlikview\data\Productsales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

OUTER JOIN(Sales)

LOAD ProductID, 
     ProductCategory
FROM
[C:\Qlikview\data\product_lists.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

We create a Table Box using the menu Layout → New Sheet Objects → Table Box where we choose all the three fields – ProductID, ProductCategory and SaleAmount to be displayed.

outer_join

QlikView – Keeps

The keep command in QlikView is used to combine data from two data sets while keeping both the data sets available in memory. It is very similar to the joins covered in the previous chapter, except for two major differences. The first difference is that, in case of keep, both the data sets remain available in QlikView’s memory, whereas a join produces only one data set from which you have to choose the columns. The second difference is that there is no concept of an outer keep, whereas an outer join is available in case of joins.

Input Data

Let us consider the following two CSV data files, which are used as input for further illustrations.

Product List:
ProductID,ProductCategory
1,Outdoor Recreation
2,Clothing
3,Costumes & Accessories
4,Athletics
5,Personal Care
6,Hobbies & Creative Arts

Product Sales:
ProductID,ProductCategory,SaleAmount
4,Athletics,1212
5,Personal Care,5211
6,Hobbies & Creative Arts,1021
7,Display Board,2177
8,Game,1145
9,soap,1012
10,Beverages & Tobacco,2514

Inner Keep

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file. Then we edit the commands in the script to create an inner keep between the tables.

keep_script

Inner keep fetches only those rows that are present in both the tables. In this case, the rows available in both the Product List and Product Sales tables are fetched. We create Table Boxes using the menu Layout → New Sheet Objects → Table Box.

First, we choose only the productSales table, which gives us the fields – ProductID, ProductCategory and SaleAmount to be displayed.

keep_product_sales

Next, we choose the ProductList data set, which gives us the fields ProductID and ProductCategory.

keep_product_list

Finally, we choose the All Tables option and get all the available fields from all the tables.

keep_both_tables

The following report shows all the Table Boxes from the above steps.

keep_chart_innerkeep

Left Keep

Left keep is similar to left join: it keeps all the rows from the table on the left, with both the data sets remaining available in QlikView’s memory.

Left keep Script

The following script is used to create the resulting data sets with left keep command.

productsales:
LOAD ProductID, 
     ProductCategory, 
     SaleAmount
FROM
[C:\Qlikview\data\Productsales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

left keep(productsales)
productlists:
LOAD ProductID, 
     ProductCategory
FROM
[C:\Qlikview\data\product_lists.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Left keep Data

When we change the script as above and refresh the data in the report using Control+R, we get the following data in the sheet objects.

keep_chart_leftkeep

Right Keep

Right keep is similar to right join: it keeps all the rows from the table on the right, with both the data sets remaining available in QlikView’s memory.

Right keep Script

The following script is used to create the resulting data sets with the right keep command.

productsales:
LOAD ProductID, 
     ProductCategory, 
     SaleAmount
FROM
[C:\Qlikview\data\Productsales.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

right keep(productsales)
productlists:
LOAD ProductID, 
     ProductCategory
FROM
[C:\Qlikview\data\product_lists.csv]
(txt, codepage is 1252, embedded labels, delimiter is ',', msq);

Right keep Data

When we change the script as above and refresh the data in the report using Control+R, we get the following data in the sheet objects.

keep_chart_rightkeep

QlikView – Concatenation

The Concatenation feature in QlikView is used to append the rows from one table to another, even when the tables have different numbers of columns. It differs from both the Join and Keep commands, as it does not merge the matching rows from the two tables into one row.

Input Data

Let us consider the following two CSV data files, which are used as input for further illustrations. Please note the second data set has an additional column named Country.

SalesRegionOld.csv
ProductID,ProductCategory,Region,SaleAmount
1,Outdoor Recreation,Europe,4579
2,Clothing,Europe,4125
3,Costumes & Accessories,South Asia,6521
4,Athletics,South Asia,4125
5,Personal Care,Australia,5124
6,Arts & Entertainment,North AMerica,1245
7,Hardware,South America,456

SalesRegionNew.csv
ProductID,ProductCategory,Region,Country,SaleAmount
6,Arts & Entertainment,North AMerica,USA,1245
7,Hardware,South America,Brazil,456
8,Home & Garden,South America,Brazil,241
9,Food,South Asia,Singapore,1247
10,Home & Garden,South Asia,China,5462
11,Office Supplies,Australia,Australia,577

Load Script

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file. Then we edit the commands in the script to apply the concatenation between the tables.

concatenate_load_script

Next, we load the above data to QlikView’s memory and create a Table Box by using the menu Layout → New Sheet Objects → Table Box, where we choose all the available fields to be displayed as shown below.

concatenate_select_table

Concatenated Data

On completing the above steps, we get the Table Box displayed as shown below. Please note the duplicate rows for product IDs 6 and 7. Concatenate does not eliminate the duplicates.

concatenated_table_chart

QlikView – Master Calendar

In QlikView, we often need to create a calendar reference object that can be linked to any data set present in QlikView’s memory. For example, you may have a table that captures the sales amount and the sales date, but does not store the weekday or quarter corresponding to that date. In such a scenario, we create a Master Calendar, which supplies the additional date fields such as Quarter, Day, etc. as required by any data set.

Input Data

Let us consider the following CSV data file, which is used as input for further illustrations.

SalesDate,SalesVolume
3/28/2012,3152
3/30/2012,2458
3/31/2012,4105
4/8/2012,6245
4/10/2012,5816
4/11/2012,3522

Load Script

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file.

1_mc_load_script

Next, we load the above data to QlikView’s memory and create a Table Box by using the menu Layout → New Sheet Objects → Table Box, where we choose all the available fields to be displayed as shown below.

2_mc_initial_data

Create Master Calendar

Next, we create the Master Calendar by writing the following script in the script editor. Here we use the table DailySales as a resident table from which we capture the Maximum and Minimum dates. We load each of the dates within this range using the second load statement above the resident load. Finally, we have a third load statement, which extracts the year, quarter, month etc. from the SalesDate values.

3_mc_calendar_script

Select Fields

After creating the complete load script along with the Master Calendar, we create a table box to view the data using the menu Layout → New Sheet Objects → Table Box.

4_mc_select_fields

Final Data

The final output shows a table with the Quarter and Month values, which are created using the sales data and the Master Calendar.

5_mc_final_data

QlikView – Mapping Tables

A mapping table is a table created to map column values between two tables. It is also called a lookup table, as it is only used to look up a related value from another table.

Input Data

Let us consider the following input data file, which represents the sales values in different regions.

ProductID,ProductCategory,Region,SaleAmount
1,Outdoor Recreation,Europe,4579
2,Clothing,Europe,4125
3,Costumes & Accessories,South Asia,6521
4,Athletics,South Asia,4125
5,Personal Care,Australia,5124
6,Arts & Entertainment,North AMerica,1245
7,Hardware,South America,456
8,Home & Garden,South America,241
9,Food,South Asia,1247
10,Home & Garden,South Asia,5462
11,Office Supplies,Australia,577

The following data represents the countries and their regions.

Region,Country
Europe,Germany
Europe,Italy
South Asia,Singapore
South Asia,Korea
North AMerica,USA
South America,Brazil
South America,Peru
South Asia,China
South Asia,Sri Lanka

Load Script

The above data is loaded to QlikView memory by using the script editor. Open the script editor from the File menu or press Control+E. Choose the Table Files option from the Data from Files tab and browse for the file containing the above data. Click OK and press Control+R to load the data into QlikView’s memory.

1_mt_load_script

Create Table Box

Let us create two table boxes, one for each of the above tables, as shown below. Here we cannot get the value of Country in the sales region report.

2_mt_table_boxes

Create the Mapping Table

The following script produces the mapping table, which maps the region value from the sales table with the country value from the MapCountryRegion table.

3_mt_mapping_script

Table Chart

On completing the above steps and creating a Table Box to view the data, we get the Country column along with the other columns from the Sales table.

4_mt_final_data

QlikView – Circular Reference

A circular reference occurs when we can traverse from one table to another using two or more different paths. This means you can join Table1 with Table2 directly using a column, or you can first join Table1 with Table3 and then Table3 with Table2. Such a loop can lead to incorrect results in a data model that loads all three tables, so QlikView prevents loading the data into its memory once it recognizes a circular reference.

Input Data

Let us consider the following CSV data files, which are used as input for further illustrations.

SalesCountries:
ProductID,ProductCategory,Country,SaleAmount
1,Outdoor Recreation,Italy,4579
2,Clothing,USA,4125
3,Costumes & Accessories,South Korea,6521
4,Athletics,Japan,4125
5,Personal Care,Brazil,5124
6,Arts & Entertainment,China,1245
7,Hardware,South America,456
8,Home & Garden,Peru,241
9,Food,India,1247
10,Home & Garden,Singapore,5462
11,Office Supplies,Hungary,577

ProductCountry:
ProductID, Country
3,Brazil
3,China
2,Korea
1,USA
2,Singapore
7,Sri Lanka
1,Italy

Load Script

We load the above input data using the script editor, which is invoked by pressing Control+E. Choose the option Table Files and browse for the Input file.

1_cr_load_script

Data Load

After creating the above script, we load the data to QlikView’s memory using the command Control+R. This is when we get the error prompt mentioning the presence of a circular loop in the tables being loaded.

2_warning

Data Model

To find the exact cause of the above warning, we can look at the data model by using the menu command for the table viewer – Control+T. The following screen comes up, which clearly shows the circular reference. Here, the join between RegionCountry and SalesRegion can be achieved directly using the field Region. It can also be achieved by first going to the table ProductCountry using the field Country and then mapping ProductID with SalesRegion.

3_cr_data_model

Resolving Circular Reference

The above circular reference can be resolved by renaming some of the columns in the data sets, so that QlikView does not automatically form an association between the tables using the column names. For this, we rename the Country column in RegionCountry to SalesCountry. In the data set ProductCountry, we rename the Country column to ProductCountry.

4_cr_load_script_rectified

Rectified Data Model

The rectified data model after renaming the columns as above can be seen using the command Control+T. Now we can see that the relationships between the tables do not form a loop.

5_cr_data_model_rectified

Pressing Control+R to reload the data does not give us the warning anymore, and we can use this data to create reports.




SPARK IN HADOOP…ALL YOU NEED TO KNOW

Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of waiting time between queries and waiting time to run the program.

Spark was introduced by the Apache Software Foundation to speed up the Hadoop computing process.

Contrary to common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently use it for more types of computations, which include interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark began in 2009 as one of Hadoop’s sub-projects, developed in UC Berkeley’s AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and has been a top-level Apache project since February 2014.

Features of Apache Spark

Apache Spark has the following features.

  • Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
  • Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
  • Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’; it also supports SQL queries, streaming data, machine learning (ML) and graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop components.

Spark Built on Hadoop

There are three ways of Spark deployment as explained below.

  • Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
  • Hadoop Yarn − Hadoop Yarn deployment means, simply, that Spark runs on Yarn without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
  • Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch a Spark job in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

Components of Spark

The following illustration depicts the different components of Spark.

Components of Spark

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
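
As a rough, minimal sketch of this idea (not part of this tutorial’s own example; in Spark 1.3 and later the SchemaRDD abstraction evolved into the DataFrame): in the Spark shell, where sc is available, structured data can be registered as a temporary table and queried with SQL. The Person case class and its rows below are hypothetical.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)                 // hypothetical schema
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 29))).toDF()

people.registerTempTable("people")                         // expose the data to SQL
sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()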

Spark Streaming

Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
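
A minimal sketch of this mini-batch model, assuming the Spark Streaming library is on the classpath and a hypothetical text source on localhost:9999 (not part of this tutorial’s example):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a streaming context over the existing SparkContext with 1-second batches.
val ssc = new StreamingContext(sc, Seconds(1))

// Each batch of lines from the socket becomes an RDD; word counts are computed per batch.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming job is stopped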

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework on top of Spark, built on the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
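
As a small, hedged illustration of the ALS algorithm mentioned above (the user/product ratings below are invented for this sketch):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A tiny, made-up user/product rating matrix.
val ratings = sc.parallelize(Seq(
  Rating(1, 101, 5.0), Rating(1, 102, 1.0),
  Rating(2, 101, 4.0), Rating(2, 103, 2.0)
))

// Train a matrix-factorization model: rank 10, 10 iterations, regularization 0.01.
val model = ALS.train(ratings, 10, 10, 0.01)

// Predict how user 2 would rate product 102.
model.predict(2, 102)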

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
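
A minimal GraphX sketch (the vertex and edge data are invented for illustration):

import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Vertices carry a name; edges carry a simple integer weight.
val vertices = sc.parallelize(Seq[(VertexId, String)]((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)

// Basic graph queries: number of edges and the in-degree of every vertex.
graph.numEdges
graph.inDegrees.collect()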

Apache Spark – RDD

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
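
For example, both creation paths look like this in the Spark shell, where sc is the SparkContext (the HDFS path is hypothetical):

// 1. Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage (local file, HDFS, S3, ...).
val lines = sc.textFile("hdfs:///data/input.txt")

numbers.count()   // actions such as count() materialize the RDD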

Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.

Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g., HDFS). Although this framework provides numerous abstractions for accessing a cluster’s computational resources, users still want more.

Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

Iterative applications reuse intermediate results across multiple computations in multi-stage jobs. The following illustration explains how the current framework works while doing iterative operations on MapReduce. This incurs substantial overheads due to data replication, disk I/O, and serialization, which make the system slow.

Iterative Operations on MapReduce

Interactive Operations on MapReduce

In interactive use, a user runs ad-hoc queries on the same subset of data. Each query performs disk I/O on the stable storage, which can dominate application execution time.

The following illustration explains how the current framework works while doing the interactive queries on MapReduce.

Interactive Operations on MapReduce

Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing. This means it stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network and disk.

Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

The illustration given below shows the iterative operations on Spark RDD. It stores intermediate results in distributed memory instead of stable storage (disk), making the system faster.

Note − If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), then it will store those results on the disk.

Iterative Operations on Spark RDD

Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDD. If different queries are run on the same set of data repeatedly, this particular data can be kept in memory for better execution times.

Interactive Operations on Spark RDD

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
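
A small sketch of persisting an RDD, assuming the word-count style input file used elsewhere in this tutorial:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("input.txt").flatMap(line => line.split(" "))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); other levels spill to disk or replicate.
words.persist(StorageLevel.MEMORY_ONLY)

words.count()   // the first action computes the RDD and keeps its partitions in memory
words.count()   // later actions reuse the cached partitions instead of re-reading the file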

Apache Spark – Installation

Since Spark is closely tied to the Hadoop ecosystem, it is better to install Spark on a Linux-based system. The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation

Java installation is mandatory for installing Spark. Try the following command to verify the Java version.

$java -version 

If Java is already installed on your system, you will see the following response −

java version "1.7.0_71" 
Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, install Java before proceeding to the next step.

Step 2: Verifying Scala installation

You need the Scala language to implement Spark, so let us verify the Scala installation using the following command.

$scala -version

If Scala is already installed on your system, you will see the following response −

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don’t have Scala installed on your system, proceed to the next step for Scala installation.

Step 3: Downloading Scala

Download the latest version of Scala by visiting the following link: Download Scala. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.

Step 4: Installing Scala

Follow the steps given below to install Scala.

Extract the Scala tar file

Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands for moving the Scala software files, to respective directory (/usr/local/scala).

$ su - 
Password: 
# cd /home/Hadoop/Downloads/ 
# mv scala-2.11.6 /usr/local/scala 
# exit 

Set PATH for Scala

Use the following command for setting PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command for verifying Scala installation.

$scala -version

If Scala is already installed on your system, you will see the following response −

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark

Download the latest version of Spark by visiting the following link: Download Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extracting Spark tar

The following command is used for extracting the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz 

Moving Spark software files

The following commands are used for moving the Spark software files to the respective directory (/usr/local/spark).

$ su - 
Password:  

# cd /home/Hadoop/Downloads/ 
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark 
# exit 

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are located to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation

Write the following command for opening Spark shell.

$spark-shell

If Spark is installed successfully, then you will find the following output.

Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0 
      /_/  
		
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) 
Type in expressions to have them evaluated. 
Spark context available as sc  
scala> 

Apache Spark – Core Programming

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines. RDDs can be created in two ways: one is by referencing datasets in external storage systems, and the second is by applying transformations (e.g. map, filter, reduceByKey, join) on existing RDDs.

The RDD abstraction is exposed through a language-integrated API. This simplifies programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.

Spark Shell

Spark provides an interactive shell − a powerful tool to analyze data interactively. It is available in either Scala or Python language. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.

Open Spark Shell

The following command is used to open Spark shell.

$ spark-shell

Create simple RDD

Let us create a simple RDD from the text file. Use the following command to create a simple RDD.

scala> val inputfile = sc.textFile("input.txt")

The output for the above command is

inputfile: org.apache.spark.rdd.RDD[String] = input.txt MappedRDD[1] at textFile at <console>:12

The Spark RDD API introduces a few transformations and a few actions to manipulate RDDs.

RDD Transformations

An RDD transformation returns a pointer to a new RDD and allows you to create dependencies between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD.

Spark is lazy, so nothing will be executed unless you call an action that triggers job creation and execution. The word-count snippet later in this chapter shows this in practice.

Therefore, an RDD transformation is not a set of data but a step in a program (possibly the only step) telling Spark how to get data and what to do with it.

Commonly used RDD transformations include map, filter, flatMap, groupByKey and reduceByKey. A minimal sketch is given below.
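
The following hedged sketch (reusing the input.txt sample from this chapter) strings a few of these transformations together; note that nothing runs until an action is called.

val lines  = sc.textFile("input.txt")
val words  = lines.flatMap(line => line.split(" "))   // one line -> many words
val longer = words.filter(word => word.length > 3)    // keep only the longer words
val pairs  = longer.map(word => (word, 1))            // build (word, 1) pairs
// No job has run yet: each step only records a new RDD and its dependency on the parent RDD.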

Actions

Actions return values to the driver program or write data out; commonly used actions include count, collect, reduce, take and saveAsTextFile. A short sketch is given below.
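
A minimal sketch in the Spark shell, where sc is the SparkContext; each call below triggers a job (the output path "nums" is hypothetical).

val nums = sc.parallelize(1 to 10)

nums.count()                  // 10 – number of elements
nums.take(3)                  // Array(1, 2, 3) – first elements returned to the driver
nums.reduce(_ + _)            // 55 – aggregate all elements
nums.saveAsTextFile("nums")   // write the partitions out as text files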

Programming with RDD

Let us see the implementation of a few RDD transformations and actions in RDD programming with the help of an example.

Example

Consider a word count example − it counts each word appearing in a document. Consider the following text as input; it is saved as an input.txt file in the home directory.

input.txt − input file.

people are not as beautiful as they look, 
as they walk or as they talk.
they are only as beautiful  as they love, 
as they care as they share.

Follow the procedure given below to execute the given example.

Open Spark-Shell

The following command is used to open the Spark shell. Generally, Spark is built using Scala; therefore, a Spark program runs in a Scala environment.

$ spark-shell

If the Spark shell opens successfully, you will find the following output. The last line of the output, “Spark context available as sc”, means that the Spark shell has automatically created a SparkContext object with the name sc. The SparkContext object must exist before starting the first step of a program.

Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0 
      /_/  
		
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) 
Type in expressions to have them evaluated. 
Spark context available as sc 
scala>

Create an RDD

First, we have to read the input file using the Spark-Scala API and create an RDD.

The following command is used for reading a file from a given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile("") method is the absolute path of the input file. However, if only the file name is given, it means that the input file is in the current location.

scala> val inputfile = sc.textFile("input.txt")

Execute Word count Transformation

Our aim is to count the words in a file. Create a flat map for splitting each line into words (flatMap(line => line.split(" "))).

Next, read each word as a key with a value ‘1’ (<key, value> = <word, 1>) using the map function (map(word => (word, 1))).

Finally, reduce those keys by adding the values of similar keys (reduceByKey(_+_)).

The following command is used for executing the word count logic. After executing this, you will not find any output, because this is not an action but a transformation: it points to a new RDD and tells Spark what to do with the given data.

scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_);

Current RDD

While working with an RDD, if you want to know about the current RDD, use the following command. It will show you a description of the current RDD and its dependencies, which is useful for debugging.

scala> counts.toDebugString

Caching the Transformations

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Use the following command to store the intermediate transformations in memory.

scala> counts.cache()

Applying the Action

Applying an action such as saveAsTextFile stores the result of all the transformations in a text file. The String argument for the saveAsTextFile("") method is the absolute path of the output folder. Try the following command to save the output in a text file. In the following example, the ‘output’ folder is in the current location.

scala> counts.saveAsTextFile("output")

Checking the Output

Open another terminal and go to the home directory (where Spark is executed in the other terminal). Use the following commands for checking the output directory.

[hadoop@localhost ~]$ cd output/ 
[hadoop@localhost output]$ ls -1 
 
part-00000 
part-00001 
_SUCCESS

The following command is used to see the output from the part-00000 file.

[hadoop@localhost output]$ cat part-00000

Output

(people,1) 
(are,2) 
(not,1) 
(as,8) 
(beautiful,2) 
(they, 7) 
(look,1) 

The following command is used to see the output from the part-00001 file.

[hadoop@localhost output]$ cat part-00001 

Output

(walk, 1) 
(or, 1) 
(talk, 1) 
(only, 1) 
(love, 1) 
(care, 1) 
(share, 1) 

Un-persist the Storage

Before un-persisting, if you want to see the storage space used by this application, use the following URL in your browser.

http://localhost:4040

You will see the following screen, which shows the storage space used by the applications running on the Spark shell.

storage space

If you want to un-persist the storage space of a particular RDD, use the following command.

scala> counts.unpersist()

You will see the output as follows −

15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list 
15/06/27 00:57:33 INFO BlockManager: Removing RDD 9 
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1 
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from memory (free 280061810) 
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0 
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from memory (free 280062106) 
res7: counts.type = ShuffledRDD[9] at reduceByKey at <console>:14

For verifying the storage space in the browser, use the following URL.

http://localhost:4040/

You will see the following screen, which shows the storage space used by the applications running on the Spark shell.

Storage space for application

Apache Spark – Deployment

spark-submit is a shell command used to deploy a Spark application on a cluster. It works with all the respective cluster managers through a uniform interface; therefore, you do not have to configure your application for each one.

Example

Let us take the same word count example that we used before with shell commands. Here, we consider the same example as a Spark application.

Sample Input

The following text is the input data, and the file is named in.txt.

people are not as beautiful as they look, 
as they walk or as they talk. 
they are only as beautiful  as they love, 
as they care as they share.

Look at the following program −

SparkWordCount.scala

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark._  

object SparkWordCount { 
   def main(args: Array[String]) { 

      val sc = new SparkContext( "local", "Word Count", "/usr/local/spark", Nil, Map(), Map()) 
		
      /* local = master URL; Word Count = application name; */  
      /* /usr/local/spark = Spark Home; Nil = jars; Map = environment */ 
      /* Map = variables to work nodes */ 
      /*creating an inputRDD to read text file (in.txt) through Spark context*/ 
      val input = sc.textFile("in.txt") 
      /* Transform the inputRDD into countRDD */ 
		
      val count = input.flatMap(line => line.split(" ")) 
      .map(word => (word, 1)) 
      .reduceByKey(_ + _) 
       
      /* saveAsTextFile method is an action that effects on the RDD */  
      count.saveAsTextFile("outfile") 
      System.out.println("OK"); 
   } 
}

Save the above program into a file named SparkWordCount.scala and place it in a user-defined directory named spark-application.

Note − While transforming the inputRDD into countRDD, we use flatMap() to tokenize the lines (from the text file) into words, the map() method to pair each word with a count of 1, and the reduceByKey() method to add up the counts for each word.

Use the following steps to submit this application. Execute all steps in the spark-application directory through the terminal.

Step 1: Download Spark JAR

The Spark core jar is required for compilation; therefore, download spark-core_2.10-1.3.0.jar from the following link: Spark core jar, and move the jar file from the download directory to the spark-application directory.

Step 2: Compile program

Compile the above program using the command given below. This command should be executed from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar is a Hadoop support jar taken from Spark library.

$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala

Step 3: Create a JAR

Create a jar file of the Spark application using the following command. Here, wordcount is the file name of the jar file.

jar -cvf wordcount.jar SparkWordCount*.class spark-core_2.10-1.3.0.jar /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar

Step 4: Submit spark application

Submit the spark application using the following command −

spark-submit --class SparkWordCount --master local wordcount.jar

If it is executed successfully, you will find the output given below. The OK in the following output is printed by the last line of the program and serves as a marker for the user. If you read the following output carefully, you will find different things, such as −

  • successfully started service ‘sparkDriver’ on port 42954
  • MemoryStore started with capacity 267.3 MB
  • Started SparkUI at http://192.168.1.217:4040
  • Added JAR file:/home/hadoop/piapplication/count.jar
  • ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s
  • Stopped Spark web UI at http://192.168.1.217:4040
  • MemoryStore cleared
15/07/08 13:56:04 INFO Slf4jLogger: Slf4jLogger started 
15/07/08 13:56:04 INFO Utils: Successfully started service 'sparkDriver' on port 42954. 
15/07/08 13:56:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.217:42954] 
15/07/08 13:56:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB 
15/07/08 13:56:05 INFO HttpServer: Starting HTTP Server 
15/07/08 13:56:05 INFO Utils: Successfully started service 'HTTP file server' on port 56707. 
15/07/08 13:56:06 INFO SparkUI: Started SparkUI at http://192.168.1.217:4040 
15/07/08 13:56:07 INFO SparkContext: Added JAR file:/home/hadoop/piapplication/count.jar at http://192.168.1.217:56707/jars/count.jar with timestamp 1436343967029 
15/07/08 13:56:11 INFO Executor: Adding file:/tmp/spark-45a07b83-42ed-42b3b2c2-823d8d99c5af/userFiles-df4f4c20-a368-4cdd-a2a7-39ed45eb30cf/count.jar to class loader 
15/07/08 13:56:11 INFO HadoopRDD: Input split: file:/home/hadoop/piapplication/in.txt:0+54 
15/07/08 13:56:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result sent to driver 
 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11), which is now runnable 
15/07/08 13:56:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11) 
15/07/08 13:56:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s 
15/07/08 13:56:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkPi.scala:11, took 2.892996 s
OK 
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook 
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.217:4040 
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler 
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-b2c2823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as root for deletion. 
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared 
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped 
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped 
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext 
15/07/08 13:56:14 INFO Utils: Shutdown hook called 
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3b2c2-823d8d99c5af 
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!  

Step 5: Checking output

After successful execution of the program, you will find the directory named outfile in the spark-application directory.

The following commands are used for opening and checking the list of files in the outfile directory.

$ cd outfile 
$ ls 
Part-00000 part-00001 _SUCCESS

The commands for checking output in part-00000 file are −

$ cat part-00000 
(people,1) 
(are,2) 
(not,1) 
(as,8) 
(beautiful,2) 
(they, 7) 
(look,1)

The commands for checking output in part-00001 file are −

$ cat part-00001 
(walk, 1) 
(or, 1) 
(talk, 1) 
(only, 1) 
(love, 1) 
(care, 1) 
(share, 1)

Go through the following section to know more about the ‘spark-submit’ command.

Spark-submit Syntax

spark-submit [options] <app jar | python file> [app arguments]

Options

Commonly used options include --master (cluster manager URL), --deploy-mode, --class (the application’s entry point), --name, --jars, --conf and --executor-memory.

Advanced Spark Programming

Spark provides two different types of shared variables − one is broadcast variables and the second is accumulators.

  • Broadcast variables − used to efficiently distribute large values.
  • Accumulators − used to aggregate information from a particular collection.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node, a copy of a large input dataset, in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage.

The data broadcast this way is cached in serialized form and is deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code given below shows this −

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

Output −

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster, so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after its broadcast, in order to ensure that all nodes get the same value of the broadcast variable.
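
Continuing the snippet above, tasks read the broadcast value through the value method instead of capturing the original array:

// Each task looks up the values via broadcastVar.value instead of shipping the array with every closure.
sc.parallelize(Seq(0, 1, 2)).map(i => broadcastVar.value(i)).collect()   // Array(1, 2, 3)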

Accumulators

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE − this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

The code given below shows an accumulator being used to add up the elements of an array −

scala> val accum = sc.accumulator(0) 
 
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

If you want to see the output of above code then use the following command −

scala> accum.value 

Output

res2: Int = 10 

Numeric RDD Operations

Spark allows you to do different operations on numeric data, using one of the predefined API methods. Spark’s numeric operations are implemented with a streaming algorithm that allows building the model, one element at a time.

These operations are computed and returned as a StatCounter object by calling the stats() method.
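
For example, with an RDD of doubles:

val nums  = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

val stats = nums.stats()   // org.apache.spark.util.StatCounter
stats.mean                 // 2.5
stats.stdev                // standard deviation
stats.max                  // 4.0

nums.sum()                 // the individual numeric methods are also available directly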

 

Data Analysis Process

Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making.

Types of Data Analysis

Several data analysis techniques exist, encompassing various domains such as business, science and social science, under a variety of names. The major data analysis approaches are −

  • Data Mining
  • Business Intelligence
  • Statistical Analysis
  • Predictive Analytics
  • Text Analytics

Data Mining

Data Mining is the analysis of large quantities of data to extract previously unknown, interesting patterns of data, unusual data and the dependencies. Note that the goal is the extraction of patterns and knowledge from large amounts of data and not the extraction of data itself.

Data mining analysis involves computer science methods at the intersection of the artificial intelligence, machine learning, statistics, and database systems.

The patterns obtained from data mining can be considered as a summary of the input data that can be used in further analysis or to obtain more accurate prediction results by a decision support system.

Business Intelligence

Business Intelligence techniques and tools are for acquisition and transformation of large amounts of unstructured business data to help identify, develop and create new strategic business opportunities.

The goal of business intelligence is to allow easy interpretation of large volumes of data to identify new opportunities. It helps in implementing an effective strategy based on insights that can provide businesses with a competitive market-advantage and long-term stability.

Statistical Analysis

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

In data analysis, two main statistical methodologies are used −

  • Descriptive statistics − Data from the entire population or a sample is summarized with numerical descriptors, such as Mean and Standard Deviation for continuous data, and Frequency and Percentage for categorical data.
  • Inferential statistics − Patterns in the sample data are used to draw inferences about the represented population, accounting for randomness. These inferences can be answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation), and modeling relationships within the data (e.g., regression analysis).

Predictive Analytics

Predictive Analytics uses statistical models to analyze current and historical data for forecasting (making predictions) about future or otherwise unknown events. In business, predictive analytics is used to identify risks and opportunities that aid decision-making.

Text Analytics

Text Analytics, also referred to as Text Mining or Text Data Mining, is the process of deriving high-quality information from text. Text mining usually involves structuring the input text, deriving patterns within the structured data using means such as statistical pattern learning, and finally evaluating and interpreting the output.

Data Analysis Process

Data Analysis was defined by the statistician John Tukey in 1961 as “Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”

Thus, data analysis is a process for obtaining large, unstructured data from various sources and converting it into information that is useful for −

  • Answering questions
  • Testing hypotheses
  • Making decisions
  • Disproving theories

Data Analysis with Excel

Microsoft Excel provides several means and ways to analyze and interpret data. The data can be from various sources. The data can be converted and formatted in several ways. It can be analyzed with the relevant Excel commands, functions and tools – encompassing Conditional Formatting, Ranges, Tables, Text functions, Date functions, Time functions, Financial functions, Subtotals, Quick Analysis, Formula Auditing, Inquire Tool, What-if Analysis, Solvers, Data Model, PowerPivot, PowerView, PowerMap, etc.

Data Analysis is a process of collecting, transforming, cleaning, and modeling data with the goal of discovering the required information. The results so obtained are communicated, suggesting conclusions and supporting decision-making. Data visualization is at times used to portray the data for ease of discovering the useful patterns in it. In this context, the terms Data Modeling and Data Analysis are used interchangeably.

Data Analysis Process consists of the following phases that are iterative in nature −

  • Data Requirements Specification
  • Data Collection
  • Data Processing
  • Data Cleaning
  • Data Analysis
  • Communication

Data Analysis Process

Data Requirements Specification

The data required for analysis is based on a question or an experiment. Based on the requirements of those directing the analysis, the data necessary as inputs to the analysis is identified (e.g., Population of people). Specific variables regarding a population (e.g., Age and Income) may be specified and obtained. Data may be numerical or categorical.

Data Collection

Data Collection is the process of gathering information on targeted variables identified as data requirements. The emphasis is on ensuring accurate and honest collection of data. Data Collection ensures that data gathered is accurate such that the related decisions are valid. Data Collection provides both a baseline to measure and a target to improve.

Data is collected from various sources ranging from organizational databases to the information in web pages. The data thus obtained, may not be structured and may contain irrelevant information. Hence, the collected data is required to be subjected to Data Processing and Data Cleaning.

Data Processing

The data that is collected must be processed or organized for analysis. This includes structuring the data as required for the relevant Analysis Tools. For example, the data might have to be placed into rows and columns in a table within a Spreadsheet or Statistical Application. A Data Model might have to be created.

Data Cleaning

The processed and organized data may be incomplete, contain duplicates, or contain errors. Data Cleaning is the process of preventing and correcting these errors. There are several types of Data Cleaning that depend on the type of data. For example, while cleaning the financial data, certain totals might be compared against reliable published numbers or defined thresholds. Likewise, quantitative data methods can be used for outlier detection that would be subsequently excluded in analysis.

Data Analysis

Data that is processed, organized and cleaned would be ready for the analysis. Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements. Data Visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.

Statistical data models such as Correlation and Regression Analysis can be used to identify the relations among the data variables. These models, which are descriptive of the data, are helpful in simplifying analysis and communicating results.

The process might require additional Data Cleaning or additional Data Collection, and hence these activities are iterative in nature.

Communication

The results of the data analysis are to be reported in a format as required by the users to support their decisions and further action. The feedback from the users might result in additional analysis.

The data analysts can choose data visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users. The analysis tools provide facilities to highlight the required information with color codes and formatting in tables and charts.


 

R Programming

What is R Programming?

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is open source, distributed under the GNU license, and is widely used for statistical programming, business intelligence, graphical presentation and reporting. It is a leading choice among data scientists for data analysis and representation, and it is well suited to statistical calculations on arrays, lists, vectors and matrices – in short, a tool for data analysis.

Local Environment Setup

R can be installed on Windows and Linux.

Windows installation: https://cran.r-project.org/bin/windows/base

Linux installation: https://cran.r-project.org/bin/linux/

When R starts, it prints a message such as −

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Use install.packages() at the R prompt to install a required package. For example, the following command will install the plotrix package, which is required for 3D charts.

> install.packages("plotrix")

R Command Prompt

Once the local environment is set up, you can start R by typing the following command at your shell prompt −

$ R

This launches the R interpreter with a > prompt, at which point you can type your commands.

> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"                                                                   

The first statement defines a string variable myString, to which we assign (<-) the string "Hello, World!". The next statement uses print() to display the value stored in myString.

R Script File
Usually, an R program is written in a script file and then executed at the Windows or Linux command prompt using the R interpreter called Rscript.

Let us write a script file −

# My program in R Programming
myString <- "Hello, World!"
print(myString)

Save the above code in a file named hello.R and execute it at the Windows command prompt −

$ Rscript hello.R

It produces the following result.

[1] "Hello, World!"

Comments
Comments are helping text in your R program and are ignored by the interpreter while executing your program. A single-line comment is written using # at the beginning of the statement, as follows −

# My first program in R Programming

The next blog will be about R data types −
Vectors, Lists, Matrices, Arrays, Factors and Data Frames

MongoDB

THE QUICK GUIDE

MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concept of collections and documents.

Database

Database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.

Collection

Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purpose.

Document

A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.

The following table shows the relationship of RDBMS terminology with MongoDB.

RDBMS            MongoDB
Database         Database
Table            Collection
Tuple/Row        Document
Column           Field
Table Join       Embedded Documents
Primary Key      Primary Key (default key _id provided by MongoDB itself)

Database Server and Client
mysqld/Oracle    mongod
mysql/sqlplus    mongo

Sample Document

The following example shows the document structure of a blog site, which is simply a set of comma-separated key-value pairs.

{
   _id: ObjectId(7df78ad8902c)
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100, 
   comments: [	
      {
         user:'user1',
         message: 'My first comment',
         dateCreated: new Date(2011,1,20,2,15),
         like: 0 
      },
      {
         user:'user2',
         message: 'My second comments',
         dateCreated: new Date(2011,1,25,7,45),
         like: 5
      }
   ]
}

_id is a 12-byte hexadecimal number that assures the uniqueness of every document. You can provide _id while inserting the document; if you don't provide it, MongoDB provides a unique id for every document. Of these 12 bytes, the first 4 bytes are the current timestamp, the next 3 bytes the machine id, the next 2 bytes the process id of the MongoDB server, and the remaining 3 bytes a simple incremental value.

MongoDB – Advantages

Any relational database has a typical schema design that shows the number of tables and the relationships between these tables. In MongoDB, there is no concept of relationship.

Advantages of MongoDB over RDBMS

  • Schema less − MongoDB is a document database in which one collection holds different documents. Number of fields, content and size of the document can differ from one document to another.
  • Structure of a single object is clear.
  • No complex joins.
  • Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query language that’s nearly as powerful as SQL.
  • Tuning.
  • Ease of scale-out − MongoDB is easy to scale.
  • Conversion/mapping of application objects to database objects not needed.
  • Uses internal memory for storing the (windowed) working set, enabling faster access of data.

Why Use MongoDB?

  • Document Oriented Storage − Data is stored in the form of JSON style documents.
  • Index on any attribute
  • Replication and high availability
  • Auto-sharding
  • Rich queries
  • Fast in-place updates
  • Professional support by MongoDB

Where to Use MongoDB?

  • Big Data
  • Content Management and Delivery
  • Mobile and Social Infrastructure
  • User Data Management
  • Data Hub

MongoDB – Environment

Let us now see how to install MongoDB on Windows.

Install MongoDB On Windows

To install MongoDB on Windows, first download the latest release of MongoDB from https://www.mongodb.org/downloads. Make sure you get correct version of MongoDB depending upon your Windows version. To get your Windows version, open command prompt and execute the following command.

C:\>wmic os get osarchitecture
OSArchitecture
64-bit
C:\>

32-bit versions of MongoDB only support databases smaller than 2GB and are suitable only for testing and evaluation purposes.

Now extract your downloaded file to the c:\ drive or any other location. Make sure the name of the extracted folder is mongodb-win32-i386-[version] or mongodb-win32-x86_64-[version]. Here, [version] is the version of MongoDB you downloaded.

Next, open the command prompt and run the following command.

C:\>move mongodb-win64-* mongodb
   1 dir(s) moved.
C:\>

If you have extracted MongoDB to a different location, go to that path using the command cd FOLDER/DIR and then run the above process.

MongoDB requires a data folder to store its files. The default location for the MongoDB data directory is c:\data\db. So you need to create this folder using the Command Prompt. Execute the following command sequence.

C:\>md data
C:\>md data\db

If you have to install MongoDB at a different location, then you need to specify an alternate path for \data\db by setting the --dbpath option of mongod.exe. To do so, issue the following commands.

In the command prompt, navigate to the bin directory present in the MongoDB installation folder. Suppose my installation folder is D:\set up\mongodb

C:\Users\XYZ>d:
D:\>cd "set up"
D:\set up>cd mongodb
D:\set up\mongodb>cd bin
D:\set up\mongodb\bin>mongod.exe --dbpath "d:\set up\mongodb\data" 

This will show a "waiting for connections" message on the console output, which indicates that the mongod.exe process is running successfully.

Now, to run the MongoDB shell, you need to open another command prompt and issue the following command.

D:\set up\mongodb\bin>mongo.exe
MongoDB shell version: 2.4.6
connecting to: test
>db.test.save( { a: 1 } )
>db.test.find()
{ "_id" : ObjectId(5879b0f65a56a454), "a" : 1 }
>

This shows that MongoDB is installed and running successfully. The next time you run MongoDB, you need to issue only the following commands.

D:\set up\mongodb\bin>mongod.exe --dbpath "d:\set up\mongodb\data" 
D:\set up\mongodb\bin>mongo.exe

Install MongoDB on Ubuntu

Run the following command to import the MongoDB public GPG key −

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

Create a /etc/apt/sources.list.d/mongodb.list file using the following command.

echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' 
   | sudo tee /etc/apt/sources.list.d/mongodb.list

Now issue the following command to update the repository −

sudo apt-get update

Next install the MongoDB by using the following command −

sudo apt-get install mongodb-10gen=2.2.3

In the above installation, 2.2.3 is the MongoDB version used at the time of writing. Always make sure to install the latest version. MongoDB is now installed successfully.

Start MongoDB

sudo service mongodb start

Stop MongoDB

sudo service mongodb stop

Restart MongoDB

sudo service mongodb restart

To use MongoDB run the following command.

mongo

This will connect you to the running MongoDB instance.

MongoDB Help

To get a list of commands, type db.help() in the MongoDB client. This will give you a list of database-level help commands.

MongoDB Statistics

To get stats about the MongoDB server, type the command db.stats() in the MongoDB client. This will show the database name, and the number of collections and documents in the database.

MongoDB – Data Modelling

Data in MongoDB has a flexible schema. Documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

Some considerations while designing Schema in MongoDB

  • Design your schema according to user requirements.
  • Combine objects into one document if you will use them together. Otherwise separate them (but make sure there should not be need of joins).
  • Duplicate the data (but within limits), because disk space is cheap compared to compute time.
  • Do joins on write, not on read.
  • Optimize your schema for most frequent use cases.
  • Do complex aggregation in the schema.

Example

Suppose a client needs a database design for a blog/website; let us see the differences between the RDBMS and MongoDB schema designs. The website has the following requirements.

  • Every post has the unique title, description and url.
  • Every post can have one or more tags.
  • Every post has the name of its publisher and total number of likes.
  • Every post has comments given by users along with their name, message, date-time and likes.
  • On each post, there can be zero or more comments.

In an RDBMS schema, the design for the above requirements will have a minimum of three tables.

In the MongoDB schema, the design will have one collection, post, with the following structure −

{
   _id: POST_ID
   title: TITLE_OF_POST, 
   description: POST_DESCRIPTION,
   by: POST_BY,
   url: URL_OF_POST,
   tags: [TAG1, TAG2, TAG3],
   likes: TOTAL_LIKES, 
   comments: [	
      {
         user:'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES 
      },
      {
         user:'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      }
   ]
}

So, while showing the data, in an RDBMS you need to join three tables, whereas in MongoDB the data is shown from one collection only.

MongoDB – Create Database

In this chapter, we will see how to create a database in MongoDB.

The use Command

MongoDB's use DATABASE_NAME command is used to create a database. The command creates a new database if it doesn't exist; otherwise, it returns the existing database.

Syntax

Basic syntax of use DATABASE statement is as follows −

use DATABASE_NAME

Example

If you want to create a database with the name <mydb>, then the use DATABASE statement would be as follows −

>use mydb
switched to db mydb

To check your currently selected database, use the command db −

>db
mydb

If you want to check your databases list, use the command show dbs.

>show dbs
local     0.78125GB
test      0.23012GB

Your created database (mydb) is not present in the list. To display the database, you need to insert at least one document into it.

>db.movie.insert({"name":"tutorials point"})
>show dbs
local      0.78125GB
mydb       0.23012GB
test       0.23012GB

In MongoDB, the default database is test. If you haven't created any database, collections are stored in the test database.

MongoDB – Drop Database

In this chapter, we will see how to drop a database using MongoDB command.

The dropDatabase() Method

MongoDB's db.dropDatabase() command is used to drop an existing database.

Syntax

Basic syntax of dropDatabase() command is as follows −

db.dropDatabase()

This will delete the selected database. If you have not selected any database, it will delete the default 'test' database.

Example

First, check the list of available databases by using the command, show dbs.

>show dbs
local      0.78125GB
mydb       0.23012GB
test       0.23012GB
>

If you want to delete the new database <mydb>, then the dropDatabase() command would be as follows −

>use mydb
switched to db mydb
>db.dropDatabase()
>{ "dropped" : "mydb", "ok" : 1 }
>

Now check the list of databases.

>show dbs
local      0.78125GB
test       0.23012GB
>

MongoDB – Create Collection

In this chapter, we will see how to create a collection using MongoDB.

The createCollection() Method

MongoDB's db.createCollection(name, options) method is used to create a collection.

Syntax

Basic syntax of createCollection() command is as follows −

db.createCollection(name, options)

In the command, name is the name of the collection to be created. options is a document used to specify the configuration of the collection.

Parameter Type Description
Name String Name of the collection to be created
Options Document (Optional) Specify options about memory size and indexing

The options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use −

Field Type Description
capped Boolean (Optional) If true, enables a capped collection. Capped collection is a fixed size collection that automatically overwrites its oldest entries when it reaches its maximum size. If you specify true, you need to specify size parameter also.
autoIndexId Boolean (Optional) If true, automatically creates an index on the _id field. Default value is false.
size number (Optional) Specifies a maximum size in bytes for a capped collection. If capped is true, then you need to specify this field also.
max number (Optional) Specifies the maximum number of documents allowed in the capped collection.

While inserting a document, MongoDB first checks the size field of the capped collection, and then it checks the max field.

Examples

Basic syntax of createCollection() method without options is as follows −

>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>

You can check the created collection by using the command show collections.

>show collections
mycollection
system.indexes

The following example shows the syntax of the createCollection() method with a few important options −

>db.createCollection("mycol", { capped : true, autoIndexId : true, size : 
   6142800, max : 10000 } )
{ "ok" : 1 }
>

In MongoDB, you don't need to create a collection explicitly. MongoDB creates the collection automatically when you insert a document.

>db.tutorialspoint.insert({"name" : "tutorialspoint"})
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

MongoDB – Drop Collection

In this chapter, we will see how to drop a collection using MongoDB.

The drop() Method

MongoDB’s db.collection.drop() is used to drop a collection from the database.

Syntax

Basic syntax of drop() command is as follows −

db.COLLECTION_NAME.drop()

Example

First, check the available collections in your database mydb.

>use mydb
switched to db mydb
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

Now drop the collection with the name mycollection.

>db.mycollection.drop()
true
>

Again, check the list of collections in the database.

>show collections
mycol
system.indexes
tutorialspoint
>

The drop() method returns true if the selected collection is dropped successfully; otherwise it returns false.

MongoDB – Datatypes

MongoDB supports many datatypes. Some of them are listed below, followed by a short example document −

  • String − This is the most commonly used datatype to store the data. String in MongoDB must be UTF-8 valid.
  • Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit depending upon your server.
  • Boolean − This type is used to store a boolean (true/ false) value.
  • Double − This type is used to store floating point values.
  • Min/ Max keys − This type is used to compare a value against the lowest and highest BSON elements.
  • Arrays − This type is used to store arrays or lists, or multiple values in one key.
  • Timestamp − This can be handy for recording when a document has been modified or added.
  • Object − This datatype is used for embedded documents.
  • Null − This type is used to store a Null value.
  • Symbol − This datatype is used identically to a string; however, it’s generally reserved for languages that use a specific symbol type.
  • Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own date time by creating object of Date and passing day, month, year into it.
  • Object ID − This datatype is used to store the document’s ID.
  • Binary data − This datatype is used to store binary data.
  • Code − This datatype is used to store JavaScript code into the document.
  • Regular expression − This datatype is used to store regular expression.
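
As a quick illustration, the following hedged sketch inserts one document into a hypothetical sample collection using several of the datatypes listed above (string, integer, boolean, double, array, date, null and an embedded object); the collection and field names are placeholders.

>db.sample.insert({
   name: "datatype demo",           // String
   count: 25,                       // Integer
   active: true,                    // Boolean
   price: 9.99,                     // Double
   tags: ["demo", "types"],         // Array
   createdOn: new Date(),           // Date
   remarks: null,                   // Null
   meta: { version: 1 }             // Object (embedded document)
})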

MongoDB – Insert Document

In this chapter, we will learn how to insert document in MongoDB collection.

The insert() Method

To insert data into a MongoDB collection, you need to use MongoDB's insert() or save() method.

Syntax

The basic syntax of insert() command is as follows −

>db.COLLECTION_NAME.insert(document)

Example

>db.mycol.insert({
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
})

Here mycol is our collection name, as created in the previous chapter. If the collection doesn't exist in the database, MongoDB will create it and then insert the document into it.

In the inserted document, if we don’t specify the _id parameter, then MongoDB assigns a unique ObjectId for this document.

_id is a 12-byte hexadecimal number that is unique for every document in a collection. The 12 bytes are divided as follows −

_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 
   3 bytes incrementer)

To insert multiple documents in a single query, you can pass an array of documents in insert() command.

Example

>db.post.insert([
   {
      title: 'MongoDB Overview', 
      description: 'MongoDB is no sql database',
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 100
   },
	
   {
      title: 'NoSQL Database', 
      description: "NoSQL database doesn't have tables",
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 20, 
      comments: [	
         {
            user:'user1',
            message: 'My first comment',
            dateCreated: new Date(2013,11,10,2,35),
            like: 0 
         }
      ]
   }
])

To insert a document you can also use db.post.save(document). If you don't specify _id in the document, the save() method works the same as the insert() method. If you specify _id, save() replaces the whole data of the document containing that _id, as shown in the sketch below.
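
As a minimal sketch, assuming the post collection above already contains a document with the given _id (the ObjectId value here is only a placeholder), the following save() call replaces that document entirely; if no document with that _id exists, a new one is inserted.

>db.post.save({
   _id: ObjectId("507f191e810c19729de860ea"),   // placeholder _id of an existing document
   title: 'MongoDB Save Example',
   by: 'tutorials point'
})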

MongoDB – Query Document

In this chapter, we will learn how to query document from MongoDB collection.

The find() Method

To query data from a MongoDB collection, you need to use MongoDB's find() method.

Syntax

The basic syntax of find() method is as follows −

>db.COLLECTION_NAME.find()

The find() method will display all the documents in a non-structured way.

The pretty() Method

To display the results in a formatted way, you can use the pretty() method.

Syntax

>db.mycol.find().pretty()

Example

>db.mycol.find().pretty()
{
   "_id": ObjectId(7df78ad8902c),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": "100"
}
>

Apart from the find() method, there is the findOne() method, which returns only one document; a short example follows.
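
For instance, a small sketch against the mycol collection used above −

>// Returns the first matching document instead of a cursor
>db.mycol.findOne({"by": "tutorials point"})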

RDBMS Where Clause Equivalents in MongoDB

To query documents on the basis of some condition, you can use the following operations.

Operation Syntax Example RDBMS Equivalent
Equality {<key>:<value>} db.mycol.find({"by":"tutorials point"}).pretty() where by = 'tutorials point'
Less Than {<key>:{$lt:<value>}} db.mycol.find({"likes":{$lt:50}}).pretty() where likes < 50
Less Than Equals {<key>:{$lte:<value>}} db.mycol.find({"likes":{$lte:50}}).pretty() where likes <= 50
Greater Than {<key>:{$gt:<value>}} db.mycol.find({"likes":{$gt:50}}).pretty() where likes > 50
Greater Than Equals {<key>:{$gte:<value>}} db.mycol.find({"likes":{$gte:50}}).pretty() where likes >= 50
Not Equals {<key>:{$ne:<value>}} db.mycol.find({"likes":{$ne:50}}).pretty() where likes != 50

AND in MongoDB

Syntax

In the find() method, if you pass multiple keys separated by ',', MongoDB treats it as an AND condition. Alternatively, you can use the explicit $and operator. Following is the basic syntax of AND −

>db.mycol.find(
   {
      $and: [
         {key1: value1}, {key2:value2}
      ]
   }
).pretty()

Example

Following example will show all the tutorials written by ‘tutorials point’ and whose title is ‘MongoDB Overview’.

>db.mycol.find({$and:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty() {
   "_id": ObjectId(7df78ad8902c),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": "100"
}

For the above example, the equivalent where clause will be where by = 'tutorials point' AND title = 'MongoDB Overview'. You can pass any number of key-value pairs in the find clause; an example of the implicit comma form follows.
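
For example, the same query can be written without the $and operator by simply listing both conditions in one query document (a sketch based on the mycol data above) −

>// Implicit AND: multiple key-value pairs in a single query document
>db.mycol.find({"by": "tutorials point", "title": "MongoDB Overview"}).pretty()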

OR in MongoDB

Syntax

To query documents based on the OR condition, you need to use the $or keyword. Following is the basic syntax of OR −

>db.mycol.find(
   {
      $or: [
         {key1: value1}, {key2:value2}
      ]
   }
).pretty()

Example

Following example will show all the tutorials written by ‘tutorials point’ or whose title is ‘MongoDB Overview’.

>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
   "_id": ObjectId(7df78ad8902c),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": "100"
}
>

Using AND and OR Together

Example

The following example will show the documents that have likes greater than 10 and whose title is either 'MongoDB Overview' or whose by is 'tutorials point'. The equivalent SQL where clause is where likes > 10 AND (by = 'tutorials point' OR title = 'MongoDB Overview').

>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"},
   {"title": "MongoDB Overview"}]}).pretty()
{
   "_id": ObjectId(7df78ad8902c),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": "100"
}
>

MongoDB – Update Document

MongoDB's update() and save() methods are used to update documents in a collection. The update() method updates the values in an existing document, while the save() method replaces the existing document with the document passed to save().

MongoDB Update() Method

The update() method updates the values in the existing document.

Syntax

The basic syntax of update() method is as follows −

>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider the mycol collection has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

Following example will set the new title ‘New MongoDB Tutorial’ of the documents whose title is ‘MongoDB Overview’.

>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

By default, MongoDB will update only a single document. To update multiple documents, you need to set a parameter ‘multi’ to true.

>db.mycol.update({'title':'MongoDB Overview'},
   {$set:{'title':'New MongoDB Tutorial'}},{multi:true})

MongoDB Save() Method

The save() method replaces the existing document with the new document passed in the save() method.

Syntax

The basic syntax of MongoDB save() method is shown below −

>db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})

Example

Following example will replace the document with the _id ‘5983548781331adf45ec7’.

>db.mycol.save(
   {
      "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point New Topic",
         "by":"Tutorials Point"
   }
)
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point New Topic",
   "by":"Tutorials Point"}
>

MongoDB – Delete Document

In this chapter, we will learn how to delete a document using MongoDB.

The remove() Method

MongoDB's remove() method is used to remove a document from the collection. The remove() method accepts two parameters: one is the deletion criteria and the second is the justOne flag.

  • deletion criteria − (Optional) The criteria according to which documents will be removed.
  • justOne − (Optional) If set to true or 1, only one document is removed.

Syntax

Basic syntax of remove() method is as follows −

>db.COLLECTION_NAME.remove(DELETION_CRITERIA)

Example

Consider the mycol collection has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

Following example will remove all the documents whose title is ‘MongoDB Overview’.

>db.mycol.remove({'title':'MongoDB Overview'})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

Remove Only One

If there are multiple records and you want to delete only the first matching record, then set the justOne parameter in the remove() method, as shown below.

>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)
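
A minimal sketch, assuming the mycol collection used above contains more than one document with the same title −

>// Removes only the first document whose title matches
>db.mycol.remove({'title':'MongoDB Overview'}, 1)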

Remove All Documents

If you don't specify deletion criteria, MongoDB will delete all documents from the collection. This is the equivalent of SQL's truncate command.

>db.mycol.remove()
>db.mycol.find()
>

MongoDB – Projection

In MongoDB, projection means selecting only the necessary data rather than selecting whole of the data of a document. If a document has 5 fields and you need to show only 3, then select only 3 fields from them.

The find() Method

MongoDB's find() method, explained in MongoDB Query Document, accepts a second optional parameter that is the list of fields you want to retrieve. When you execute the find() method, it displays all fields of a document. To limit this, you need to set a list of fields with the value 1 or 0: 1 is used to show a field, while 0 is used to hide it.

Syntax

The basic syntax of find() method with projection is as follows −

>db.COLLECTION_NAME.find({},{KEY:1})

Example

Consider the collection mycol has the following data −

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

Following example will display the title of the document while querying the document.

>db.mycol.find({},{"title":1,_id:0})
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
{"title":"Tutorials Point Overview"}
>

Please note that the _id field is always displayed while executing the find() method; if you don't want this field, you need to set it to 0.

MongoDB – Limit Records

In this chapter, we will learn how to limit records using MongoDB.

The Limit() Method

To limit the records in MongoDB, you need to use the limit() method. The method accepts one number-type argument, which is the number of documents that you want to be displayed.

Syntax

The basic syntax of limit() method is as follows −

>db.COLLECTION_NAME.find().limit(NUMBER)

Example

Consider the collection mycol has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example will display only two documents while querying the collection.

>db.mycol.find({},{"title":1,_id:0}).limit(2)
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
>

If you don’t specify the number argument in limit() method then it will display all documents from the collection.

MongoDB Skip() Method

Apart from the limit() method, there is one more method, skip(), which also accepts a number-type argument and is used to skip that number of documents.

Syntax

The basic syntax of skip() method is as follows −

>db.COLLECTION_NAME.find().limit(NUMBER).skip(NUMBER)

Example

Following example will display only the second document.

>db.mycol.find({},{"title":1,_id:0}).limit(1).skip(1)
{"title":"NoSQL Overview"}
>

Please note, the default value in skip() method is 0.

MongoDB – Sort Records

In this chapter, we will learn how to sort records in MongoDB.

The sort() Method

To sort documents in MongoDB, you need to use the sort() method. The method accepts a document containing a list of fields along with their sorting order. To specify the sorting order, 1 and -1 are used: 1 for ascending order and -1 for descending order.

Syntax

The basic syntax of sort() method is as follows −

>db.COLLECTION_NAME.find().sort({KEY:1})

Example

Consider the collection mycol has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

Following example will display the documents sorted by title in the descending order.

>db.mycol.find({},{"title":1,_id:0}).sort({"title":-1})
{"title":"Tutorials Point Overview"}
{"title":"NoSQL Overview"}
{"title":"MongoDB Overview"}
>

Please note, if you don’t specify the sorting preference, then sort() method will display the documents in ascending order.

MongoDB – Indexing

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. This scan is highly inefficient and requires MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

The ensureIndex() Method

To create an index, you need to use the ensureIndex() method of MongoDB.

Syntax

The basic syntax of the ensureIndex() method is as follows −

>db.COLLECTION_NAME.ensureIndex({KEY:1})

Here, key is the name of the field on which you want to create the index, and 1 is for ascending order. To create an index in descending order you need to use -1.

Example

>db.mycol.ensureIndex({"title":1})
>

In the ensureIndex() method you can pass multiple fields, to create an index on multiple fields.

>db.mycol.ensureIndex({"title":1,"description":-1})
>

The ensureIndex() method also accepts a list of options (which are optional). Following is the list, with a short example combining a few of them after it −

Parameter Type Description
background Boolean Builds the index in the background so that building an index does not block other database activities. Specify true to build in the background. The default value is false.
unique Boolean Creates a unique index so that the collection will not accept insertion of documents where the index key or keys match an existing value in the index. Specify true to create a unique index. The default value is false.
name string The name of the index. If unspecified, MongoDB generates an index name by concatenating the names of the indexed fields and the sort order.
dropDups Boolean Creates a unique index on a field that may have duplicates. MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key. Specify true to create unique index. The default value is false.
sparse Boolean If true, the index only references documents with the specified field. These indexes use less space but behave differently in some situations (particularly sorts). The default value is false.
expireAfterSeconds integer Specifies a value, in seconds, as a TTL to control how long MongoDB retains documents in this collection.
v index version The index version number. The default index version depends on the version of MongoDB running when creating the index.
weights document The weight is a number ranging from 1 to 99,999 and denotes the significance of the field relative to the other indexed fields in terms of the score.
default_language string For a text index, the language that determines the list of stop words and the rules for the stemmer and tokenizer. The default value is english.
language_override string For a text index, specify the name of the field in the document that contains the language to override the default language. The default value is language.
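
As a hedged sketch, the following call combines a few of these options: it builds a unique index on the title field in the background and gives it an explicit name (the index name used here is only illustrative) −

>db.mycol.ensureIndex(
   {"title": 1},
   {unique: true, background: true, name: "title_unique_idx"}
)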

MongoDB – Aggregation

Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. In SQL, count(*) with group by is the equivalent of MongoDB aggregation.

The aggregate() Method

For aggregation in MongoDB, you should use the aggregate() method.

Syntax

Basic syntax of aggregate() method is as follows −

>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

Example

In the collection you have the following data −

{
   _id: ObjectId(7df78ad8902c)
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
},
{
   _id: ObjectId(7df78ad8902d)
   title: 'NoSQL Overview', 
   description: 'No sql database is very fast',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 10
},
{
   _id: ObjectId(7df78ad8902e)
   title: 'Neo4j Overview', 
   description: 'Neo4j is no sql database',
   by_user: 'Neo4j',
   url: 'http://www.neo4j.com',
   tags: ['neo4j', 'database', 'NoSQL'],
   likes: 750
},

Now, from the above collection, if you want to display a list stating how many tutorials are written by each user, you will use the following aggregate() method −

> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{
   "result" : [
      {
         "_id" : "tutorials point",
         "num_tutorial" : 2
      },
      {
         "_id" : "Neo4j",
         "num_tutorial" : 1
      }
   ],
   "ok" : 1
}
>

The equivalent SQL query for the above use case would be select by_user, count(*) from mycol group by by_user.

In the above example, we have grouped documents by the field by_user, and on each occurrence of by_user the previous value of the sum is incremented. Following is a list of available aggregation expressions.

Expression Description Example
$sum Sums up the defined value from all documents in the collection. db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg Calculates the average of all given values from all documents in the collection. db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min Gets the minimum of the corresponding values from all documents in the collection. db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$max Gets the maximum of the corresponding values from all documents in the collection. db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
$push Inserts the value into an array in the resulting document. db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet Inserts the value into an array in the resulting document but does not create duplicates. db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first Gets the first document from the source documents according to the grouping. Typically this makes sense only together with a previously applied $sort stage. db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
$last Gets the last document from the source documents according to the grouping. Typically this makes sense only together with a previously applied $sort stage. db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])

Pipeline Concept

In a UNIX shell, a pipeline means the possibility of executing an operation on some input and using the output as the input for the next command, and so on. MongoDB supports the same concept in its aggregation framework. There is a set of possible stages, each of which takes a set of documents as input and produces a resulting set of documents (or the final resulting JSON document at the end of the pipeline), which can in turn be used as input for the next stage, and so on. A small multi-stage sketch follows the list of stages below.

Following are the possible stages in aggregation framework −

  • $project − Used to select some specific fields from a collection.
  • $match − This is a filtering operation and thus this can reduce the amount of documents that are given as input to the next stage.
  • $group − This does the actual aggregation as discussed above.
  • $sort − Sorts the documents.
  • $skip − With this, it is possible to skip forward in the list of documents by a given number of documents.
  • $limit − This limits the number of documents to look at, by the given number starting from the current position.
  • $unwind − This is used to unwind documents that use arrays. When using an array, the data is in a way pre-joined, and this operation undoes that so that there are individual documents again. Thus, with this stage we increase the number of documents for the next stage.
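
A small sketch of a multi-stage pipeline over the mycol collection used above: it first filters documents with $match, then groups them with $group, and finally orders the result with $sort. The threshold of 50 likes is only an illustrative value.

>db.mycol.aggregate([
   {$match: {likes: {$gt: 50}}},                                  // keep only documents with more than 50 likes
   {$group: {_id: "$by_user", total_likes: {$sum: "$likes"}}},    // group by author and add up the likes
   {$sort: {total_likes: -1}}                                     // order authors by total likes, descending
])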

MongoDB – Replication

Replication is the process of synchronizing data across multiple servers. Replication provides redundancy and increases data availability with multiple copies of data on different database servers. Replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.

Why Replication?

  • To keep your data safe
  • High (24*7) availability of data
  • Disaster recovery
  • No downtime for maintenance (like backups, index rebuilds, compaction)
  • Read scaling (extra copies to read from)
  • Replica set is transparent to the application

How Replication Works in MongoDB

MongoDB achieves replication by the use of replica sets. A replica set is a group of mongod instances that host the same data set. In a replica set, one node is the primary node that receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. A replica set can have only one primary node.

  • Replica set is a group of two or more nodes (generally minimum 3 nodes are required).
  • In a replica set, one node is primary node and remaining nodes are secondary.
  • All data replicates from primary to secondary node.
  • At the time of automatic failover or maintenance, an election is held and a new primary node is elected.
  • After the recovery of a failed node, it joins the replica set again and works as a secondary node.

In a typical MongoDB replication setup, the client application always interacts with the primary node, and the primary node then replicates the data to the secondary nodes.

Replica Set Features

  • A cluster of N nodes
  • Any one node can be primary
  • All write operations go to primary
  • Automatic failover
  • Automatic recovery
  • Consensus election of primary

Set Up a Replica Set

In this tutorial, we will convert a standalone MongoDB instance to a replica set. Following are the steps to convert it −

  • Shut down the already running MongoDB server.
  • Start the MongoDB server by specifying the --replSet option. Following is the basic syntax of --replSet −
mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet "REPLICA_SET_INSTANCE_NAME"

Example

mongod --port 27017 --dbpath "D:\set up\mongodb\data" --replSet rs0
  • It will start a mongod instance with the name rs0, on port 27017.
  • Now start the command prompt and connect to this mongod instance.
  • In Mongo client, issue the command rs.initiate() to initiate a new replica set.
  • To check the replica set configuration, issue the command rs.conf(). To check the status of the replica set, issue the command rs.status(). A minimal shell sketch of these steps follows this list.
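
The following is a minimal shell sketch of these steps, assuming the mongod instance started above with --replSet rs0 is running on localhost −

>// Connect with the mongo shell, then initiate the replica set
>rs.initiate()
>// Inspect the configuration and the status of the replica set
>rs.conf()
>rs.status()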

Add Members to Replica Set

To add members to a replica set, start mongod instances on multiple machines. Then start a mongo client and issue the command rs.add().

Syntax

The basic syntax of rs.add() command is as follows −

>rs.add(HOST_NAME:PORT)

Example

Suppose your mongod instance name is mongod1.net and it is running on port 27017. To add this instance to the replica set, issue the command rs.add() in the Mongo client.

>rs.add("mongod1.net:27017")
>

You can add a mongod instance to the replica set only when you are connected to the primary node. To check whether you are connected to the primary or not, issue the command db.isMaster() in the mongo client.
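
For example −

>// The ismaster field of the result is true only when you are connected to the primary
>db.isMaster().ismaster
true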

MongoDB – Sharding

Sharding is the process of storing data records across multiple machines and it is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.

Why Sharding?

  • In replication, all writes go to the master node
  • Latency-sensitive queries still go to the master
  • A single replica set is limited to 12 nodes
  • Memory can't be large enough when the active dataset is big
  • Local disk is not big enough
  • Vertical scaling is too expensive

Sharding in MongoDB

Sharding in MongoDB is implemented using a sharded cluster, which has three main components (a minimal setup sketch follows the list below) −

  • Shards − Shards are used to store data. They provide high availability and data consistency. In production environment, each shard is a separate replica set.
  • Config Servers − Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards. In production environment, sharded clusters have exactly 3 config servers.
  • Query Routers − Query routers are basically mongo instances that interface with client applications and direct operations to the appropriate shard. The query router processes and targets the operations to shards and then returns results to the clients. A sharded cluster can contain more than one query router to divide the client request load; a client sends requests to one query router. Generally, a sharded cluster has many query routers.
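
As a hedged setup sketch only (the exact commands depend on your MongoDB version and cluster layout), sharding is typically enabled per database and per collection from a mongos query router; the database name and shard key below are placeholders −

>// Run these against a mongos instance
>sh.enableSharding("mydb")                      // allow collections in mydb to be sharded
>sh.shardCollection("mydb.post", {"by": 1})     // shard the post collection on the 'by' field
>sh.status()                                    // inspect the sharded cluster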

MongoDB – Create Backup

In this chapter, we will see how to create a backup in MongoDB.

Dump MongoDB Data

To create a backup of a database in MongoDB, you should use the mongodump command. This command will dump the entire data of your server into the dump directory. There are many options available by which you can limit the amount of data or create a backup of your remote server.

Syntax

The basic syntax of mongodump command is as follows −

>mongodump

Example

Start your mongod server. Assuming that your mongod server is running on localhost and port 27017, open a command prompt, go to the bin directory of your mongodb instance, and type the command mongodump.

>mongodump

The command will connect to the server running at 127.0.0.1 and port 27017 and back up all the data of the server to the directory /bin/dump/.

Following is a list of available options that can be used with the mongodump command.

Syntax Description Example
mongodump --host HOST_NAME --port PORT_NUMBER This command will backup all databases of the specified mongod instance. mongodump --host tutorialspoint.com --port 27017
mongodump --dbpath DB_PATH --out BACKUP_DIRECTORY This command will backup only the specified database at the specified path. mongodump --dbpath /data/db/ --out /data/backup/
mongodump --collection COLLECTION --db DB_NAME This command will backup only the specified collection of the specified database. mongodump --collection mycol --db test

Restore data

To restore backup data, MongoDB's mongorestore command is used. This command restores all of the data from the backup directory.

Syntax

The basic syntax of mongorestore command is −

>mongorestore


MongoDB – Deployment

When you are preparing a MongoDB deployment, you should try to understand how your application is going to hold up in production. It’s a good idea to develop a consistent, repeatable approach to managing your deployment environment so that you can minimize any surprises once you’re in production.

The best approach incorporates prototyping your set up, conducting load testing, monitoring key metrics, and using that information to scale your set up. The key part of the approach is to proactively monitor your entire system – this will help you understand how your production system will hold up before deploying, and determine where you will need to add capacity. Having insight into potential spikes in your memory usage, for example, could help put out a write-lock fire before it starts.

To monitor your deployment, MongoDB provides some of the following commands −

mongostat

This command checks the status of all running mongod instances and returns counters of database operations. These counters include inserts, queries, updates, deletes, and cursors. The command also shows when you're hitting page faults and shows your lock percentage; these indicate that you're running low on memory, hitting write capacity, or having some other performance issue.

To run the command, start your mongod instance. In another command prompt, go to the bin directory of your mongodb installation and type mongostat.

D:\set up\mongodb\bin>mongostat


mongotop

This command tracks and reports the read and write activity of a MongoDB instance on a per-collection basis. By default, mongotop returns information every second, which you can change as needed. You should check that this read and write activity matches your application's intention, and that you're not firing too many writes to the database at a time, reading too frequently from disk, or exceeding your working set size.

To run the command, start your mongod instance. In another command prompt, go to the bin directory of your mongodb installation and type mongotop.

D:\set up\mongodb\bin>mongotop


To make mongotop return information less frequently, specify a number of seconds after the mongotop command.

D:\set up\mongodb\bin>mongotop 30

The above example will return values every 30 seconds.

Apart from the MongoDB tools, 10gen provides a free, hosted monitoring service, MongoDB Management Service (MMS), that provides a dashboard and gives you a view of the metrics from your entire cluster.

MongoDB – Java

In this chapter, we will learn how to set up the MongoDB Java driver.

Installation

Before you start using MongoDB in your Java programs, you need to make sure that you have the MongoDB Java driver and Java set up on the machine. You can check the Java tutorial for Java installation on your machine. Now, let us check how to set up the MongoDB Java driver.

  • You need to download the mongo.jar driver. Make sure to download the latest release of it.
  • You need to include the mongo.jar into your classpath.

Connect to Database

To connect to a database, you need to specify the database name; if the database doesn't exist, MongoDB creates it automatically.

Following is the code snippet to connect to the database −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class ConnectToDB { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
   
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 
      System.out.println("Credentials ::"+ credential);     
   } 
}

Now, let’s compile and run the above program to create our database myDb as shown below.

$javac ConnectToDB.java 
$java ConnectToDB

On executing, the above program gives you the following output.

Connected to the database successfully 
Credentials ::MongoCredential{
   mechanism = null, 
   userName = 'sampleUser', 
   source = 'myDb', 
   password = <hidden>, 
   mechanismProperties = {}
}

Create a Collection

To create a collection, the createCollection() method of the com.mongodb.client.MongoDatabase class is used.

Following is the code snippet to create a collection −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class CreatingCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      //Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      //Creating a collection 
      database.createCollection("sampleCollection"); 
      System.out.println("Collection created successfully"); 
   } 
}

On compiling, the above program gives you the following result −

Connected to the database successfully 
Collection created successfully

Getting/Selecting a Collection

To get/select a collection from the database, the getCollection() method of the com.mongodb.client.MongoDatabase class is used.

Following is the program to get/select a collection −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 

import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class selectingCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Creating a collection 
      System.out.println("Collection created successfully"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("myCollection"); 
      System.out.println("Collection myCollection selected successfully"); 
   }
}

On compiling, the above program gives you the following result −

Connected to the database successfully 
Collection created successfully 
Collection myCollection selected successfully

Insert a Document

To insert a document into MongoDB, the insert() method of the com.mongodb.client.MongoCollection class is used.

Following is the code snippet to insert a document −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 

import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class InsertingDocument { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection"); 
      System.out.println("Collection sampleCollection selected successfully");

      Document document = new Document("title", "MongoDB") 
      .append("id", 1)
      .append("description", "database") 
      .append("likes", 100) 
      .append("url", "http://www.tutorialspoint.com/mongodb/") 
      .append("by", "tutorials point");  
      collection.insertOne(document); 
      System.out.println("Document inserted successfully");     
   } 
}

On compiling, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Document inserted successfully

Retrieve All Documents

To select all documents from the collection, the find() method of the com.mongodb.client.MongoCollection class is used. This method returns a cursor, so you need to iterate over this cursor.

Following is the program to select all documents −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase;  

import java.util.Iterator; 
import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class RetrievingAllDocuments { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential;
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Retrieving a collection 
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully"); 

      // Getting the iterable object 
      FindIterable<Document> iterDoc = collection.find(); 
      int i = 1; 

      // Getting the iterator 
      Iterator it = iterDoc.iterator(); 
    
      while (it.hasNext()) {  
         System.out.println(it.next());  
      i++; 
      }
   } 
}

On compiling, the above program gives you the following result −

Document{{
   _id = 5967745223993a32646baab8, 
   title = MongoDB, 
   id = 1, 
   description = database, 
   likes = 100, 
   url = http://www.tutorialspoint.com/mongodb/, by = tutorials point
}}  
Document{{
   _id = 7452239959673a32646baab8, 
   title = RethinkDB, 
   id = 2, 
   description = database, 
   likes = 200, 
   url = http://www.tutorialspoint.com/rethinkdb/, by = tutorials point
}}

Update Document

To update a document in the collection, the updateOne() method of the com.mongodb.client.MongoCollection class is used.

Following is the program to update the first document that matches the given filter −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 
import com.mongodb.client.model.Filters; 
import com.mongodb.client.model.Updates; 

import java.util.Iterator; 
import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class UpdatingDocuments { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection 
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection myCollection selected successfully"); 

      collection.updateOne(Filters.eq("id", 1), Updates.set("likes", 150));       
      System.out.println("Document update successfully...");  
      
      // Retrieving the documents after the update
      // Getting the iterable object
      FindIterable<Document> iterDoc = collection.find();

      // Getting the iterator
      Iterator<Document> it = iterDoc.iterator();

      while (it.hasNext()) {
         System.out.println(it.next());
      }
   }  
}

On compiling and executing, the above program gives you the following result −

Document updated successfully...
Document {{
   _id = 5967745223993a32646baab8, 
   title = MongoDB, 
   id = 1, 
   description = database, 
   likes = 150, 
   url = http://www.tutorialspoint.com/mongodb/, by = tutorials point
}}

Delete a Document

To delete a document from the collection, you need to use the deleteOne() method of the com.mongodb.client.MongoCollection class.

Following is the program to delete a document −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 
import com.mongodb.client.model.Filters;  

import java.util.Iterator; 
import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class DeletingDocuments { 
   
   public static void main( String args[] ) {  
   
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 );
      
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully"); 

      // Deleting the documents 
      collection.deleteOne(Filters.eq("id", 1)); 
      System.out.println("Document deleted successfully...");  
      
      // Retrieving the documents after deletion
      // Getting the iterable object
      FindIterable<Document> iterDoc = collection.find();
      int i = 1;

      // Getting the iterator
      Iterator<Document> it = iterDoc.iterator();

      while (it.hasNext()) {
         System.out.println("Remaining Document: " + i);
         System.out.println(it.next());
         i++;
      }
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Document deleted successfully...

Dropping a Collection

To drop a collection from a database, you need to use the drop() method of the com.mongodb.client.MongoCollection class.

Following is the program to delete a collection −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase;  

import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class DroppingCollection { 
   
   public static void main( String args[] ) {  

      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully");

      // Dropping a Collection 
      collection.drop(); 
      System.out.println("Collection dropped successfully");
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Collection dropped successfully

Listing All the Collections

To list all the collections in a database, you need to use the listCollectionNames() method of the com.mongodb.client.MongoDatabase class.

Following is the program to list all the collections of a database −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class ListOfCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 

      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 
      System.out.println("Collection created successfully"); 
      for (String name : database.listCollectionNames()) { 
         System.out.println(name); 
      } 
   }
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully
Collections in the database myDb:
myCollection 
myCollection1 
myCollection5

The remaining MongoDB methods save(), limit(), skip(), sort(), etc. work in the same way as explained earlier in this tutorial.

MongoDB – PHP

To use MongoDB with PHP, you need the MongoDB PHP driver. Download the latest release of the driver from the PHP driver download page. Then unzip the archive, put php_mongo.dll in your PHP extension directory (“ext” by default), and add the following line to your php.ini file −

extension = php_mongo.dll

Make a Connection and Select a Database

To make a connection, you need to specify the database name. If the database doesn’t exist, MongoDB creates it automatically.

Following is the code snippet to connect to the database −

<?php
   // connect to mongodb
   $m = new MongoClient();
	
   echo "Connection to database successfully";
   // select a database
   $db = $m->mydb;
	
   echo "Database mydb selected";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected

Create a Collection

Following is the code snippet to create a collection −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->createCollection("mycol");
   echo "Collection created succsessfully";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection created successfully

Insert a Document

To insert a document into MongoDB, the insert() method is used.

Following is the code snippet to insert a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected succsessfully";
	
   $document = array( 
      "title" => "MongoDB", 
      "description" => "database", 
      "likes" => 100,
      "url" => "http://www.tutorialspoint.com/mongodb/",
      "by" => "tutorials point"
   );
	
   $collection->insert($document);
   echo "Document inserted successfully";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Document inserted successfully

Find All Documents

To select all documents from the collection, the find() method is used.

Following is the code snippet to select all documents −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected succsessfully";

   $cursor = $collection->find();
   // iterate cursor to display title of documents
	
   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
MongoDB

Update a Document

To update a document, you need to use the update() method.

In the following example, we will update the title of the inserted document to MongoDB Tutorial. Following is the code snippet to update a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected succsessfully";

   // now update the document
   $collection->update(array("title"=>"MongoDB"), 
      array('$set'=>array("title"=>"MongoDB Tutorial")));
   echo "Document updated successfully";
	
   // now display the updated document
   $cursor = $collection->find();
	
   // iterate cursor to display title of documents
   echo "Updated document";
	
   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Document updated successfully
Updated document
MongoDB Tutorial

Delete a Document

To delete a document, you need to use the remove() method.

In the following example, we will remove the documents that have the title MongoDB Tutorial. Following is the code snippet to delete a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected succsessfully";
   
   // now remove the document
   $collection->remove(array("title"=>"MongoDB Tutorial"), array("justOne" => false));
   echo "Documents deleted successfully";
   
   // now display the available documents
   $cursor = $collection->find();
	
   // iterate cursor to display title of documents
   echo "Updated document";
	
   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Documents deleted successfully

In the above example, the second parameter is an options array; its justOne field determines whether only the first matching document should be removed.

The remaining MongoDB methods findOne(), save(), limit(), skip(), sort(), etc. work in the same way as explained above.

MongoDB – Relationships

Relationships in MongoDB represent how various documents are logically related to each other. Relationships can be modeled via Embedded and Referenced approaches. Such relationships can be either 1:1, 1:N, N:1 or N:N.

Let us consider the case of storing addresses for users. So, one user can have multiple addresses making this a 1:N relationship.

Following is the sample document structure of user document −

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "name": "Tom Hanks",
   "contact": "987654321",
   "dob": "01-01-1991"
}

Following is the sample document structure of address document −

{
   "_id":ObjectId("52ffc4a5d85242602e000000"),
   "building": "22 A, Indiana Apt",
   "pincode": 123456,
   "city": "Los Angeles",
   "state": "California"
}

Modeling Embedded Relationships

In the embedded approach, we will embed the address document inside the user document.

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin",
   "address": [
      {
         "building": "22 A, Indiana Apt",
         "pincode": 123456,
         "city": "Los Angeles",
         "state": "California"
      },
      {
         "building": "170 A, Acropolis Apt",
         "pincode": 456789,
         "city": "Chicago",
         "state": "Illinois"
      }
   ]
}

This approach maintains all the related data in a single document, which makes it easy to retrieve and maintain. The whole document can be retrieved in a single query such as −

>db.users.findOne({"name":"Tom Benzamin"},{"address":1})

Note that in the above query, db and users are the database and collection respectively.

The drawback is that if the embedded document keeps growing in size, it can impact the read/write performance.

Modeling Referenced Relationships

This is the approach of designing a normalized relationship. In this approach, both the user and the address documents are maintained separately, but the user document contains a field that references the address document’s id field.

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin",
   "address_ids": [
      ObjectId("52ffc4a5d85242602e000000"),
      ObjectId("52ffc4a5d85242602e000001")
   ]
}

As shown above, the user document contains the array field address_ids, which holds the ObjectIds of the corresponding addresses. Using these ObjectIds, we can query the address documents and get the address details from there. With this approach, we need two queries: first to fetch the address_ids field from the user document, and second to fetch these addresses from the address collection.

>var result = db.users.findOne({"name":"Tom Benzamin"},{"address_ids":1})
>var addresses = db.address.find({"_id":{"$in":result["address_ids"]}})

MongoDB – Database References

As seen in the last chapter on MongoDB relationships, to implement a normalized database structure in MongoDB, we use the concept of Referenced Relationships, also referred to as Manual References, in which we manually store the referenced document’s id inside the other document. However, in cases where a document contains references from different collections, we can use MongoDB DBRefs.

DBRefs vs Manual References

As an example scenario, where we would use DBRefs instead of manual references, consider a database where we are storing different types of addresses (home, office, mailing, etc.) in different collections (address_home, address_office, address_mailing, etc). Now, when a user collection’s document references an address, it also needs to specify which collection to look into based on the address type. In such scenarios where a document references documents from many collections, we should use DBRefs.

Using DBRefs

There are three fields in DBRefs −

  • $ref − This field specifies the collection of the referenced document
  • $id − This field specifies the _id field of the referenced document
  • $db − This is an optional field and contains the name of the database in which the referenced document lies

Consider a sample user document having DBRef field address as shown in the code snippet −

{
   "_id":ObjectId("53402597d852426020000002"),
   "address": {
   "$ref": "address_home",
   "$id": ObjectId("534009e4d852427820000002"),
   "$db": "tutorialspoint"},
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin"
}

The address DBRef field here specifies that the referenced address document lies in address_home collection under tutorialspoint database and has an id of 534009e4d852427820000002.

The following code dynamically looks in the collection specified by the $ref parameter (address_home in our case) for a document with the id specified by the $id parameter of the DBRef.

>var user = db.users.findOne({"name":"Tom Benzamin"})
>var dbRef = user.address
>db[dbRef.$ref].findOne({"_id":(dbRef.$id)})

The above code returns the following address document present in address_home collection −

{
   "_id" : ObjectId("534009e4d852427820000002"),
   "building" : "22 A, Indiana Apt",
   "pincode" : 123456,
   "city" : "Los Angeles",
   "state" : "California"
}

MongoDB – Covered Queries

In this chapter, we will learn about covered queries.

What is a Covered Query?

As per the official MongoDB documentation, a covered query is a query in which −

  • All the fields in the query are part of an index.
  • All the fields returned in the query are in the same index.

Since all the fields present in the query are part of an index, MongoDB matches the query conditions and returns the result using the same index without actually looking inside the documents. Since indexes are present in RAM, fetching data from indexes is much faster as compared to fetching data by scanning documents.

Using Covered Queries

To test covered queries, consider the following document in the users collection −

{
   "_id": ObjectId("53402597d852426020000002"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "gender": "M",
   "name": "Tom Benzamin",
   "user_name": "tombenzamin"
}

We will first create a compound index for the users collection on the fields gender and user_name using the following query −

>db.users.ensureIndex({gender:1,user_name:1})

Now, this index will cover the following query −

>db.users.find({gender:"M"},{user_name:1,_id:0})

That is to say, for the above query, MongoDB would not go looking into the database documents. Instead, it would fetch the required data from the indexed data, which is very fast.

Since our index does not include the _id field, we have explicitly excluded it from the result set of our query, as MongoDB returns the _id field in every query by default. Therefore, the following query would not have been covered by the index created above −

>db.users.find({gender:"M"},{user_name:1})

Lastly, remember that an index cannot cover a query if −

  • Any of the indexed fields is an array
  • Any of the indexed fields is a subdocument
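
For instance, if the user documents also carried a hypothetical hobbies array field, an index on that field would be a multikey index, and even a query that projects only indexed fields could not be covered by it. A minimal mongo shell sketch, assuming such a hobbies field exists −

>db.users.ensureIndex({hobbies:1})
>db.users.find({hobbies:"reading"},{hobbies:1,_id:0}).explain()   // "indexOnly" : false, since a multikey index cannot cover the query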

MongoDB – Analyzing Queries

Analyzing queries is a very important aspect of measuring how effective the database and indexing design is. We will learn about the frequently used $explain and $hint queries.

Using $explain

The $explain operator provides information on the query, indexes used in a query and other statistics. It is very useful when analyzing how well your indexes are optimized.

In the last chapter, we had already created an index for the users collection on fields gender and user_name using the following query −

>db.users.ensureIndex({gender:1,user_name:1})

We will now use $explain on the following query −

>db.users.find({gender:"M"},{user_name:1,_id:0}).explain()

The above explain() query returns the following analyzed result −

{
   "cursor" : "BtreeCursor gender_1_user_name_1",
   "isMultiKey" : false,
   "n" : 1,
   "nscannedObjects" : 0,
   "nscanned" : 1,
   "nscannedObjectsAllPlans" : 0,
   "nscannedAllPlans" : 1,
   "scanAndOrder" : false,
   "indexOnly" : true,
   "nYields" : 0,
   "nChunkSkips" : 0,
   "millis" : 0,
   "indexBounds" : {
      "gender" : [
         [
            "M",
            "M"
         ]
      ],
      "user_name" : [
         [
            {
               "$minElement" : 1
            },
            {
               "$maxElement" : 1
            }
         ]
      ]
   }
}

We will now look at the fields in this result set −

  • The true value of indexOnly indicates that this query has used indexing.
  • The cursor field specifies the type of cursor used. BTreeCursor type indicates that an index was used and also gives the name of the index used. BasicCursor indicates that a full scan was made without using any indexes.
  • n indicates the number of matching documents returned.
  • nscannedObjects indicates the total number of documents scanned.
  • nscanned indicates the total number of documents or index entries scanned.

Using $hint

The $hint operator forces the query optimizer to use the specified index to run a query. This is particularly useful when you want to test performance of a query with different indexes. For example, the following query specifies the index on fields gender and user_name to be used for this query −

>db.users.find({gender:"M"},{user_name:1,_id:0}).hint({gender:1,user_name:1})

To analyze the above query using $explain −

>db.users.find({gender:"M"},{user_name:1,_id:0}).hint({gender:1,user_name:1}).explain()

MongoDB – Atomic Operations

MongoDB does not support multi-document atomic transactions. However, it does provide atomic operations on a single document. So if a document has a hundred fields, an update statement will either update all of them or none, thus maintaining atomicity at the document level.
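
As an illustrative sketch (anticipating the products document used in the next section), a single update statement that touches several fields is applied atomically — either every modification is written or none of them is −

>db.products.update(
   {_id:1},
   {$set:{product_name:"Samsung S4"}, $inc:{product_available:-1}}   // both changes succeed or fail together
)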

Model Data for Atomic Operations

The recommended approach to maintain atomicity would be to keep all the related information, which is frequently updated together in a single document using embedded documents. This would make sure that all the updates for a single document are atomic.

Consider the following products document −

{
   "_id":1,
   "product_name": "Samsung S3",
   "category": "mobiles",
   "product_total": 5,
   "product_available": 3,
   "product_bought_by": [
      {
         "customer": "john",
         "date": "7-Jan-2014"
      },
      {
         "customer": "mark",
         "date": "8-Jan-2014"
      }
   ]
}

In this document, we have embedded the information of the customer who buys the product in the product_bought_by field. Now, whenever a new customer buys the product, we will first check whether the product is still available using the product_available field. If it is available, we will reduce the value of the product_available field as well as insert the new customer’s embedded document in the product_bought_by field. We will use the findAndModify command for this functionality because it searches and updates the document in one go.

>db.products.findAndModify({ 
   query:{_id:1,product_available:{$gt:0}}, 
   update:{ 
      $inc:{product_available:-1}, 
      $push:{product_bought_by:{customer:"rob",date:"9-Jan-2014"}} 
   }    
})

Our approach of using an embedded document and the findAndModify query makes sure that the product purchase information is updated only if the product is available. And the whole of this transaction, being in the same query, is atomic.

In contrast, consider a scenario where we keep the product availability and the purchase information separately. In this case, we will first check whether the product is available using the first query. Then, in the second query, we will update the purchase information. However, it is possible that between the executions of these two queries, some other user has purchased the product and it is no longer available. Without knowing this, our second query will update the purchase information based on the result of our first query. This will make the database inconsistent, because we have sold a product which is not available.
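
A minimal sketch of this non-atomic, two-step approach (using hypothetical product_stock and product_purchases collections) makes the race window visible −

>// step 1 − check availability
>var p = db.product_stock.findOne({_id:1, product_available:{$gt:0}})
>// another client may buy the last unit right here, between the check and the update
>db.product_stock.update({_id:1}, {$inc:{product_available:-1}})
>db.product_purchases.insert({product_id:1, customer:"rob", date:"9-Jan-2014"})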

MongoDB – Advanced Indexing

Consider the following document of the users collection −

{
   "address": {
      "city": "Los Angeles",
      "state": "California",
      "pincode": "123"
   },
   "tags": [
      "music",
      "cricket",
      "blogs"
   ],
   "name": "Tom Benzamin"
}

The above document contains an address sub-document and a tags array.

Indexing Array Fields

Suppose we want to search user documents based on the user’s tags. For this, we will create an index on tags array in the collection.

Creating an index on array in turn creates separate index entries for each of its fields. So in our case when we create an index on tags array, separate indexes will be created for its values music, cricket and blogs.

To create an index on tags array, use the following code −

>db.users.ensureIndex({"tags":1})

After creating the index, we can search on the tags field of the collection like this −

>db.users.find({tags:"cricket"})

To verify that proper indexing is used, use the following explain command −

>db.users.find({tags:"cricket"}).explain()

The above command resulted in “cursor” : “BtreeCursor tags_1” which confirms that proper indexing is used.

Indexing Sub-Document Fields

Suppose that we want to search documents based on city, state and pincode fields. Since all these fields are part of address sub-document field, we will create an index on all the fields of the sub-document.

For creating an index on all the three fields of the sub-document, use the following code −

>db.users.ensureIndex({"address.city":1,"address.state":1,"address.pincode":1})

Once the index is created, we can search for any of the sub-document fields utilizing this index as follows −

>db.users.find({"address.city":"Los Angeles"})

Remember that the query expression has to follow the order of the index specified. So the index created above would support the following queries −

>db.users.find({"address.city":"Los Angeles","address.state":"California"})

It will also support the following query −

>db.users.find({"address.city":"LosAngeles","address.state":"California",
   "address.pincode":"123"})

MongoDB – Indexing Limitations

In this chapter, we will learn about Indexing Limitations and its other components.

Extra Overhead

Every index occupies some space as well as causes an overhead on each insert, update and delete. So if you rarely use your collection for read operations, it makes sense not to use indexes.

RAM Usage

Since indexes are stored in RAM, you should make sure that the total size of the indexes does not exceed the RAM limit. If the total size exceeds the available RAM, some of the indexes will be swapped out of memory, causing performance loss.
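
You can check how much space the indexes of a collection occupy from the mongo shell and compare it with the available RAM; a minimal sketch, using the users collection from the earlier chapters −

>db.users.totalIndexSize()       // combined size of all indexes on the collection, in bytes
>db.users.stats().indexSizes     // per-index size breakdown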

Query Limitations

Indexing can’t be used in queries which use −

  • Regular expressions or negation operators like $nin, $not, etc.
  • Arithmetic operators like $mod, etc.
  • $where clause

Hence, it is always advisable to check the index usage for your queries.
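
For example, the index usage of a query with a negation operator can be verified with explain(); a minimal sketch, assuming the index on gender created in the earlier chapters −

>db.users.find({gender:{$nin:["F"]}}).explain()   // inspect the "cursor" and "indexOnly" fields in the output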

Index Key Limits

Starting from version 2.6, MongoDB will not create an index if the value of existing index field exceeds the index key limit.

Inserting Documents Exceeding Index Key Limit

MongoDB will not insert any document into an indexed collection if the indexed field value of this document exceeds the index key limit. Same is the case with mongorestore and mongoimport utilities.

Maximum Ranges

  • A collection cannot have more than 64 indexes.
  • The length of the index name cannot be longer than 125 characters.
  • A compound index can have maximum 31 fields indexed.

MongoDB – ObjectId

We have been using MongoDB Object Id in all the previous chapters. In this chapter, we will understand the structure of ObjectId.

An ObjectId is a 12-byte BSON type having the following structure −

  • The first 4 bytes represent the seconds since the Unix epoch
  • The next 3 bytes are the machine identifier
  • The next 2 bytes consist of the process id
  • The last 3 bytes are a random counter value

MongoDB uses ObjectIds as the default value of the _id field of each document, generated when the document is created. This complex combination makes all the _id fields unique.
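
A quick way to observe this default behaviour, using a hypothetical scratch collection named demo −

>db.demo.insert({name:"sample"})
>db.demo.findOne()   // the returned document carries an auto-generated ObjectId in its _id field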

Creating New ObjectId

To generate a new ObjectId use the following code −

>newObjectId = ObjectId()

The above statement returned the following uniquely generated id −

ObjectId("5349b4ddd2781d08c09890f3")

Instead of MongoDB generating the ObjectId, you can also provide a 12-byte id −

>myObjectId = ObjectId("5349b4ddd2781d08c09890f4")

Creating Timestamp of a Document

Since the _id ObjectId by default stores the 4-byte timestamp, in most cases you do not need to store the creation time of any document. You can fetch the creation time of a document using getTimestamp method −

>ObjectId("5349b4ddd2781d08c09890f4").getTimestamp()

This will return the creation time of this document in ISO date format −

ISODate("2014-04-12T21:49:17Z")

Converting ObjectId to String

In some cases, you may need the value of ObjectId in a string format. To convert the ObjectId in string, use the following code −

>newObjectId.str

The above code will return the string format of the ObjectId −

5349b4ddd2781d08c09890f3

MongoDB – Map Reduce

As per the MongoDB documentation, Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. MongoDB uses mapReduce command for map-reduce operations. MapReduce is generally used for processing large data sets.

MapReduce Command

Following is the syntax of the basic mapReduce command −

>db.collection.mapReduce(
   function() {emit(key,value);},  //map function
   function(key,values) {return reduceFunction}, {   //reduce function
      out: collection,
      query: document,
      sort: document,
      limit: number
   }
)

The map-reduce function first queries the collection, then maps the result documents to emit key-value pairs, which is then reduced based on the keys that have multiple values.

In the above syntax −

  • map is a javascript function that maps a value with a key and emits a key-value pair
  • reduce is a javascript function that reduces or groups all the documents having the same key
  • out specifies the location of the map-reduce query result
  • query specifies the optional selection criteria for selecting documents
  • sort specifies the optional sort criteria
  • limit specifies the optional maximum number of documents to be returned

Using MapReduce

Consider the following document structure storing user posts. The document stores user_name of the user and the status of post.

{
   "post_text": "tutorialspoint is an awesome website for tutorials",
   "user_name": "mark",
   "status":"active"
}

Now, we will use a mapReduce function on our posts collection to select all the active posts, group them on the basis of user_name and then count the number of posts by each user using the following code −

>db.posts.mapReduce( 
   function() { emit(this.user_name,1); }, 
	
   function(key, values) {return Array.sum(values)}, {  
      query:{status:"active"},  
      out:"post_total" 
   }
)

The above mapReduce query outputs the following result −

{
   "result" : "post_total",
   "timeMillis" : 9,
   "counts" : {
      "input" : 4,
      "emit" : 4,
      "reduce" : 2,
      "output" : 2
   },
   "ok" : 1,
}

The result shows that a total of 4 documents matched the query (status:”active”), the map function emitted 4 documents with key-value pairs and finally the reduce function grouped mapped documents having the same keys into 2.

To see the result of this mapReduce query, use the find operator −

>db.posts.mapReduce( 
   function() { emit(this.user_name,1); }, 
   function(key, values) {return Array.sum(values)}, {  
      query:{status:"active"},  
      out:"post_total" 
   }
	
).find()

The above query gives the following result which indicates that both users tom and mark have two posts in active states −

{ "_id" : "tom", "value" : 2 }
{ "_id" : "mark", "value" : 2 }

In a similar manner, MapReduce queries can be used to construct large, complex aggregation queries. The use of custom JavaScript functions makes MapReduce very flexible and powerful.
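
As a minimal sketch of combining the optional parameters described above, the same map and reduce functions can be restricted to a subset of the input documents with the limit option (the output collection name post_total_sample is illustrative) −

>db.posts.mapReduce(
   function() { emit(this.user_name,1); },
   function(key, values) {return Array.sum(values)}, {
      query:{status:"active"},
      limit:100,                  // only the first 100 matching posts are processed
      out:"post_total_sample"
   }
)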

MongoDB – Text Search

Starting from version 2.4, MongoDB supports text indexes to search inside string content. The text search uses stemming techniques to look for the specified words in the string fields, dropping stop words like a, an, the, etc. At present, MongoDB supports around 15 languages.

Enabling Text Search

Initially, text search was an experimental feature, but starting from version 2.6, it is enabled by default. However, if you are using a previous version of MongoDB, you have to enable text search with the following code −

>db.adminCommand({setParameter:true,textSearchEnabled:true})

Creating Text Index

Consider the following document under posts collection containing the post text and its tags −

{
   "post_text": "enjoy the mongodb articles on tutorialspoint",
   "tags": [
      "mongodb",
      "tutorialspoint"
   ]
}

We will create a text index on post_text field so that we can search inside our posts’ text −

>db.posts.ensureIndex({post_text:"text"})

Using Text Index

Now that we have created the text index on post_text field, we will search for all the posts having the word tutorialspoint in their text.

>db.posts.find({$text:{$search:"tutorialspoint"}})

The above command returned the following result documents having the word tutorialspoint in their post text −

{ 
   "_id" : ObjectId("53493d14d852429c10000002"), 
   "post_text" : "enjoy the mongodb articles on tutorialspoint", 
   "tags" : [ "mongodb", "tutorialspoint" ]
}
{
   "_id" : ObjectId("53493d1fd852429c10000003"), 
   "post_text" : "writing tutorials on mongodb",
   "tags" : [ "mongodb", "tutorial" ] 
}

If you are using old versions of MongoDB, you have to use the following command −

>db.posts.runCommand("text",{search:" tutorialspoint "})

Using Text Search highly improves the search efficiency as compared to normal search.

Deleting Text Index

To delete an existing text index, first find the name of index using the following query −

>db.posts.getIndexes()

After getting the name of your index from above query, run the following command. Here, post_text_text is the name of the index.

>db.posts.dropIndex("post_text_text")

MongoDB – Regular Expression

Regular Expressions are frequently used in all languages to search for a pattern or word in any string. MongoDB also provides functionality of regular expression for string pattern matching using the $regex operator. MongoDB uses PCRE (Perl Compatible Regular Expression) as regular expression language.

Unlike text search, we do not need to do any configuration or command to use regular expressions.

Consider the following document structure under posts collection containing the post text and its tags −

{
   "post_text": "enjoy the mongodb articles on tutorialspoint",
   "tags": [
      "mongodb",
      "tutorialspoint"
   ]
}

Using regex Expression

The following regex query searches for all the posts containing string tutorialspoint in it −

>db.posts.find({post_text:{$regex:"tutorialspoint"}})

The same query can also be written as −

>db.posts.find({post_text:/tutorialspoint/})

Using regex Expression with Case Insensitive

To make the search case insensitive, we use the $options parameter with the value i. The following command will look for strings having the word tutorialspoint, irrespective of case −

>db.posts.find({post_text:{$regex:"tutorialspoint",$options:"i"}})

One of the results returned from this query is the following document which contains the word tutorialspoint in different cases −

{
   "_id" : ObjectId("53493d37d852429c10000004"),
   "post_text" : "hey! this is my post on TutorialsPoint", 
   "tags" : [ "tutorialspoint" ]
} 

Using regex for Array Elements

We can also use the concept of regex on an array field. This is particularly important when we implement the functionality of tags. So, if you want to search for all the posts having tags beginning with the word tutorial (either tutorial, tutorials, tutorialpoint, or tutorialphp), you can use the following code −

>db.posts.find({tags:{$regex:"^tutorial"}})

Optimizing Regular Expression Queries

  • If the document fields are indexed, the query will make use of the indexed values to match the regular expression. This makes the search very fast as compared to scanning the whole collection with the regular expression.
  • If the regular expression is a prefix expression, all the matches are required to start with a certain string of characters. For example, if the regex expression is ^tut, then the query has to search only for those strings that begin with tut, as shown in the sketch after this list.
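
A minimal sketch of the second point, assuming an ordinary ascending index is created on the post_text field (separate from the text index of the previous chapter) −

>db.posts.ensureIndex({post_text:1})
>db.posts.find({post_text:{$regex:"^enjoy"}}).explain()   // the index bounds are restricted to strings starting with "enjoy"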

Working with RockMongo

RockMongo is a MongoDB administration tool with which you can manage your server, databases, collections, documents, indexes, and a lot more. It provides a very user-friendly way of reading, writing, and creating documents. It is similar to the phpMyAdmin tool for PHP and MySQL.

Downloading RockMongo

You can download the latest version of RockMongo from here: http://rockmongo.com/downloads

Installing RockMongo

Once downloaded, you can unzip the package in your server root folder and rename the extracted folder to rockmongo. Open any web browser and access the index.php page from the folder rockmongo. Enter admin/admin as username/password respectively.

Working with RockMongo

We will now be looking at some basic operations that you can perform with RockMongo.

Creating New Database

To create a new database, click Databases tab. Click Create New Database. On the next screen, provide the name of the new database and click on Create. You will see a new database getting added in the left panel.

Creating New Collection

To create a new collection inside a database, click on that database from the left panel. Click on the New Collection link on top. Provide the required name of the collection. Do not worry about the other fields of Is Capped, Size and Max. Click on Create. A new collection will be created and you will be able to see it in the left panel.

Creating New Document

To create a new document, click on the collection under which you want to add documents. When you click on a collection, you will be able to see all the documents within that collection listed there. To create a new document, click on the Insert link at the top. You can enter the document’s data either in JSON or array format and click on Save.

Export/Import Data

To import/export data of any collection, click on that collection and then click on Export/Import link on the top panel. Follow the next instructions to export your data in a zip format and then import the same zip file to import back data.

MongoDB – GridFS

GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, video files, etc. It is kind of a file system to store files but its data is stored within MongoDB collections. GridFS has the capability to store files even greater than its document size limit of 16MB.

GridFS divides a file into chunks and stores each chunk of data in a separate document, each with a maximum size of 255 KB.

GridFS by default uses two collections, fs.files and fs.chunks, to store the file’s metadata and its chunks. Each chunk is identified by its unique _id ObjectId field. The fs.files document serves as the parent; the files_id field in each fs.chunks document links the chunk to its parent.

Following is a sample document of fs.files collection −

{
   "filename": "test.txt",
   "chunkSize": NumberInt(261120),
   "uploadDate": ISODate("2014-04-13T11:32:33.557Z"),
   "md5": "7b762939321e146569b07f72c62cca4f",
   "length": NumberInt(646)
}

The document specifies the file name, chunk size, uploaded date, and length.

Following is a sample document of fs.chunks document −

{
   "files_id": ObjectId("534a75d19f54bfec8a2fe44b"),
   "n": NumberInt(0),
   "data": "Mongo Binary Data"
}

Adding Files to GridFS

Now, we will store an mp3 file in GridFS using the put command. For this, we will use the mongofiles.exe utility present in the bin folder of the MongoDB installation folder.

Open your command prompt, navigate to the mongofiles.exe in the bin folder of MongoDB installation folder and type the following code −

>mongofiles.exe -d gridfs put song.mp3

Here, gridfs is the name of the database in which the file will be stored. If the database is not present, MongoDB will automatically create a new database on the fly. song.mp3 is the name of the file uploaded. To see the file’s document in the database, you can use the find query −

>db.fs.files.find()

The above command returned the following document −

{
   _id: ObjectId('534a811bf8b4aa4d33fdf94d'), 
   filename: "song.mp3", 
   chunkSize: 261120, 
   uploadDate: new Date(1397391643474), md5: "e4f53379c909f7bed2e9d631e15c1c41",
   length: 10401959 
}

We can also see all the chunks present in fs.chunks collection related to the stored file with the following code, using the document id returned in the previous query −

>db.fs.chunks.find({files_id:ObjectId('534a811bf8b4aa4d33fdf94d')})

In this case, the query returned 40 documents, meaning that the whole mp3 file was divided into 40 chunks of data.
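
To read the stored file back out of GridFS onto the local disk, the same mongofiles utility provides a get command; a minimal sketch, assuming the song.mp3 file stored above −

>mongofiles.exe -d gridfs get song.mp3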

MongoDB – Capped Collections

Capped collections are fixed-size circular collections that follow the insertion order to support high performance for create, read, and delete operations. By circular, it means that when the fixed size allocated to the collection is exhausted, it will start deleting the oldest document in the collection without providing any explicit commands.

Capped collections restrict updates to the documents if the update results in an increased document size. Since capped collections store documents in the order of disk storage, this ensures that the document size does not exceed the size allocated on the disk. Capped collections are best for storing log information, cache data, or any other high-volume data.

Creating Capped Collection

To create a capped collection, we use the normal createCollection command but with capped option as true and specifying the maximum size of collection in bytes.

>db.createCollection("cappedLogCollection",{capped:true,size:10000})

In addition to collection size, we can also limit the number of documents in the collection using the max parameter −

>db.createCollection("cappedLogCollection",{capped:true,size:10000,max:1000})

If you want to check whether a collection is capped or not, use the following isCapped command −

>db.cappedLogCollection.isCapped()

If there is an existing collection which you are planning to convert to capped, you can do it with the following code −

>db.runCommand({"convertToCapped":"posts",size:10000})

This code would convert our existing collection posts to a capped collection.

Querying Capped Collection

By default, a find query on a capped collection will display results in insertion order. But if you want the documents to be retrieved in reverse order, use the sort command as shown in the following code −

>db.cappedLogCollection.find().sort({$natural:-1})

There are few other important points regarding capped collections worth knowing −

  • We cannot delete documents from a capped collection.
  • There are no default indexes present in a capped collection, not even on _id field.
  • While inserting a new document, MongoDB does not have to actually look for a place to accommodate new document on the disk. It can blindly insert the new document at the tail of the collection. This makes insert operations in capped collections very fast.
  • Similarly, while reading documents MongoDB returns the documents in the same order as present on disk. This makes the read operation very fast.

MongoDB – Auto-Increment Sequence

MongoDB does not have out-of-the-box auto-increment functionality, like SQL databases. By default, it uses the 12-byte ObjectId for the _id field as the primary key to uniquely identify the documents. However, there may be scenarios where we may want the _id field to have some auto-incremented value other than the ObjectId.

Since this is not a default feature in MongoDB, we will programmatically achieve this functionality by using a counters collection as suggested by the MongoDB documentation.

Using Counter Collection

Consider the following products document. We want the _id field to be an auto-incremented integer sequence starting from 1,2,3,4 upto n.

{
  "_id":1,
  "product_name": "Apple iPhone",
  "category": "mobiles"
}

For this, create a counters collection, which will keep track of the last sequence value for all the sequence fields.

>db.createCollection("counters")

Now, we will insert the following document in the counters collection with productid as its key −

{
  "_id":"productid",
  "sequence_value": 0
}

The field sequence_value keeps track of the last value of the sequence.

Use the following code to insert this sequence document in the counters collection −

>db.counters.insert({_id:"productid",sequence_value:0})

Creating Javascript Function

Now, we will create a function getNextSequenceValue which will take the sequence name as its input, increment the sequence number by 1 and return the updated sequence number. In our case, the sequence name is productid.

>function getNextSequenceValue(sequenceName){

   var sequenceDocument = db.counters.findAndModify({
      query:{_id: sequenceName },
      update: {$inc:{sequence_value:1}},
      new:true
   });
	
   return sequenceDocument.sequence_value;
}

Using the Javascript Function

We will now use the function getNextSequenceValue while creating a new document and assigning the returned sequence value as document’s _id field.

Insert two sample documents using the following code −

>db.products.insert({
   "_id":getNextSequenceValue("productid"),
   "product_name":"Apple iPhone",
   "category":"mobiles"
})

>db.products.insert({
   "_id":getNextSequenceValue("productid"),
   "product_name":"Samsung S3",
   "category":"mobiles"
})

As you can see, we have used the getNextSequenceValue function to set value for the _id field.

To verify the functionality, let us fetch the documents using find command −

>db.products.find()

The above query returned the following documents having the auto-incremented _id field −

{ "_id" : 1, "product_name" : "Apple iPhone", "category" : "mobiles"}

{ "_id" : 2, "product_name" : "Samsung S3", "category" : "mobiles" }