CloudFormation is an AWS service that can spin up resources on AWS from declarative templates. If someone wants to be an AWS expert, CloudFormation is an essential service to master.
Before we jump into writing a CloudFormation template, let’s take a brief look at how AWS infrastructure was managed before CloudFormation existed.
Without CloudFormation, automating a process is time-consuming: you either log in to the AWS console and provision servers manually, or spend the time building your own tools to assist with automation.
As we know, the first 70% of SQL is pretty straightforward but the remaining 30% can be pretty tricky.
So, in this blog, I will cover some popular hard SQL interview questions to help people sharpen their skills.
Part 1: how much a key metric, e.g., monthly active users (MAU), changes between months. Suppose we have a table named ‘logins’ that records a user_id and a date for every login.
Q: Find the month-over-month percentage change for monthly active users.
Solution:
WITH mau AS (
    SELECT
        DATE_TRUNC('month', date) AS month_timestamp,
        COUNT(DISTINCT user_id) AS mau
    FROM logins
    GROUP BY DATE_TRUNC('month', date)
)
SELECT
    a.month_timestamp AS previous_month,
    a.mau AS previous_mau,
    b.month_timestamp AS current_month,
    b.mau AS current_mau,
    ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS mau_pct_change
FROM mau a
JOIN mau b
    ON a.month_timestamp = b.month_timestamp - INTERVAL '1 month';
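The CTE computes MAU for each month, and the self-join pairs every month with the following month so the percentage change can be calculated in a single pass. (The original snippet was truncated after the third SELECT column; the remaining columns, the join condition, and the Postgres-style INTERVAL arithmetic above are one straightforward way to complete it.)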
As a BI Analyst working at an online travel agency, turning customer behavior data into meaningful insights is a business-as-usual task.
Google Analytics is a popular web analytics service for tracking website traffic, so for a BI & Reporting team, knowing how to interpret Google Analytics data is an essential skill.
On the technical side, standard SQL can be used in Google BigQuery (a cloud-based data warehousing platform) to generate insights from Google Analytics data.
Let’s look at some sample queries. First, some dynamic values need to be understood before we write the first query:
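To give a flavor of what these queries look like, here is a minimal sketch against Google’s public GA sample dataset in BigQuery (the table, date range, and fields are illustrative assumptions, not taken from the original post):

-- Daily sessions and pageviews from the public Google Analytics sample dataset
SELECT
    date,
    SUM(totals.visits) AS sessions,
    SUM(totals.pageviews) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'  -- assumed date range
GROUP BY date
ORDER BY date;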
The maximum length of a Google Analytics payload is 8192 bytes. It is useful to check whether some of your hits are approaching this value, because if the payload length exceeds the limit, the hit is never sent to GA.
Today I will show you how to send the payload size to GA as a custom dimension with each hit, using Google Tag Manager.
Before starting, create a new hit-scoped custom dimension in GA named ‘Hit Payload Length’ and note its index, which will be used in the next step.
Then, create a custom task…
Spark is an open-source Apache project built to handle big data in the many use cases where the volume of data is massive.
It is a general-purpose distributed data processing engine, suitable for use in a wide range of circumstances, and it can use data stored in a variety of formats, including parquet files.
On top of the Spark core data processing engine, there are libraries for SQL, machine learning, etc. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets.
It has an extensive set of developer libraries and APIs and supports languages such as…
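To make the SQL-batch-job use case concrete, here is a minimal Spark SQL sketch that aggregates over a parquet data set (the path, view name, and columns are hypothetical):

-- Expose a parquet data set to Spark SQL (the path is a hypothetical example)
CREATE TEMPORARY VIEW events
USING parquet
OPTIONS (path '/data/events.parquet');

-- A typical batch aggregation across the whole data set
SELECT event_type, COUNT(*) AS event_count
FROM events
GROUP BY event_type
ORDER BY event_count DESC;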
Let me give you an example here to answer this question.
Every company has many departments, and different departments use different tools to store their data. For example, the marketing team may use HubSpot.
So, within one company, we have different departments storing different types of data.
However, insightful information drawn from that large amount of data is needed to make business decisions.
What can we do?
Maybe we could connect to all the databases every time we need to generate a report. However, that would cost a large amount of time, and this is where the term data integration comes in.
Data integration is a process…
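As a toy illustration of the idea, an integration job might consolidate customer records from several source systems into a single table (all table and column names below are invented for illustration):

-- Consolidate customer records from two hypothetical source systems
CREATE TABLE integrated_customers AS
SELECT email, full_name, 'crm' AS source_system
FROM crm.customers
UNION ALL
SELECT email, contact_name AS full_name, 'hubspot' AS source_system
FROM marketing.hubspot_contacts;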
Cloud computing is the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.
On-premise:
Cloud providers:
Different kinds of hosting:
In my last blog, I wrote about why we need a data warehouse.
It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.
It stores current and historical data that are used for reporting and analysis.
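To make that concrete, warehouses are commonly organized around fact and dimension tables; here is a minimal, hypothetical star-schema sketch (all names are invented for illustration):

-- A hypothetical dimension table describing customers (e.g., sourced from the CRM)
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    email        VARCHAR(255),
    segment      VARCHAR(50)
);

-- A hypothetical fact table with one row per order, joined to the dimension
CREATE TABLE fact_orders (
    order_id     INT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    order_date   DATE,
    revenue      DECIMAL(10, 2)
);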
However, here is the problem: how do we actually design one?
1. Define Business Requirements
Because a data warehouse touches all areas of a company, all departments need to be on board with the design. Each department needs to understand what the benefits of a data warehouse are and what results they can expect from it.
Some objectives we can focus on:
In my last blog, I demonstrated the data pipeline we can use with CrUX to analyze site performance. That was from a BI developer’s perspective.
However, what a company, and especially its leadership team, wants is the final dashboard generated by the BI department, so that a management plan can be drawn up.
I have already written about how to query BigQuery and which site-speed metrics we can use, in the CrUX introduction blog and the public dataset analysis blog.
So in this blog, I will show you what kind of dashboard we can generate after the steps of data collection from the Google public dataset…
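As a reminder of the shape of those queries, here is a minimal sketch against the public CrUX dataset in BigQuery (the origin and the month are placeholder assumptions):

-- First Contentful Paint density histogram for one origin, from the public CrUX dataset
SELECT
    fcp.start AS fcp_start_ms,
    SUM(fcp.density) AS density
FROM `chrome-ux-report.all.201907`,
    UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE origin = 'https://www.example.com'  -- placeholder origin
GROUP BY fcp.start
ORDER BY fcp.start;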
Currently a Data Analyst at a subsidiary of Webjet, with experience in applying data science techniques to business.