Data-driven companies look for business and data analysts to leverage the enormous amounts of data generated by their customers. Insights from big data have helped industries stay efficient and competitive, which has driven an exponential rise in the demand for expert big data professionals.
Data analytics revolves around quantitative methods for solving business and social problems. So, apart from your programming and database skills, it is important to brush up on the basics of statistics before your data analyst interview. Here is a list of the most frequently asked interview questions, with answers, to help you stand out during the interview.
So let’s get to it –
What is an Outlier?
An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. There are two kinds of outliers – univariate and multivariate.
We detect outliers with these methods:
- Box plot method – An outlier is a data point located outside the fences (“whiskers”) of the boxplot, e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
- Standard deviation method – If a value falls outside the range mean ± 3 × standard deviation, it is an outlier.
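Both detection methods can be sketched in a few lines of Python. This is a minimal illustration on a made-up sample; the function names are ours, not a standard API:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Boxplot fence method: flag points beyond k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

def sigma_outliers(data, k=3):
    """Standard deviation method: flag points more than k std devs from the mean."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mu) > k * sd]

sample = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(sample))  # → [95]
```

Note that on very small samples the 3-sigma rule can miss an obvious outlier, because the outlier itself inflates the standard deviation; the boxplot fences are more robust in that situation.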
Name the best tools used for data analysis.
Some of the most used and popular data analysis tools are:
- Microsoft PowerBI
- Rapid Miner
- R Programming
- Google Fusion Tables
- Google Search Operators
What is data validation?
Data validation is the process of checking the accuracy and quality of source data before using, importing, or processing it. It is a form of data cleaning, and different types of validation are used depending on destination constraints or objectives. A few of the benefits of data validation are:
- Ensuring that data collected from different sources meets business requirements
- Ensuring Data Accuracy
- Better Decision Making
- Better Strategy and Enhanced Market Goals
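As a simple illustration, a validation check against destination constraints might look like the sketch below. The schema and field names are hypothetical:

```python
def validate_record(record, schema):
    """Check one record against expected field types; return a list of problems."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"missing: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type: {field}")
    return errors

schema = {"id": int, "email": str, "age": int}
print(validate_record({"id": 1, "email": "a@b.com", "age": 30}, schema))  # → []
print(validate_record({"id": "1", "email": "a@b.com"}, schema))  # → ['wrong type: id', 'missing: age']
```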
What is Time Series Analysis?
A time series is an ordered sequence of values of a variable at equally spaced time intervals. Time series commonly arise in industrial data such as agricultural production, stock market trades, and retail purchases. Time series analysis, or trend analysis, helps us understand the underlying factors and structure that produced the observed data. It is used in many applications: once a model is fitted, we can forecast, monitor, or even apply feedback and feedforward control.
Based on how observations are collected, data is commonly classified into three types:
- Time series data: A set of observations on the values that a variable takes at different times.
- Cross-sectional data: Data of one or more variables, collected at the same point in time.
- Pooled data: A combination of time series data and cross-sectional data.
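A minimal sketch of trend analysis, assuming a small made-up sales series, is a trailing moving average that smooths short-term fluctuation to expose the underlying trend:

```python
def moving_average(series, window=3):
    """Smooth a time series with a simple trailing moving average."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

sales = [10, 12, 11, 15, 14, 18, 17, 21]
print(moving_average(sales))  # rising trend is visible in the smoothed values
```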
Differentiate between variance and covariance.
Variance measures the spread of a data set around its mean value. It is calculated as the probability-weighted average of squared deviations from the expected value. The larger the variance, the farther the numbers in the set are from the mean; conversely, a smaller variance means the numbers are closer to the mean.
Covariance measures how two random variables change together. For instance, covariance can describe how the returns on two different investments move relative to each other over a period of time.
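Both quantities can be computed directly from their definitions. This sketch uses the sample versions (dividing by n − 1) on made-up data:

```python
def variance(xs):
    """Sample variance: average squared deviation from the mean (n - 1 divisor)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    """Sample covariance: average product of paired deviations from each mean."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
print(variance(a))       # → 2.5
print(covariance(a, b))  # → 5.0 (positive: a and b move together)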
What are the best ways to manage multi-source problems?
To tackle multi-source problems, we need to:
- Restructure schemas in order to accomplish schema integration
- Identify similar data records and combine them into one record with all the useful attributes
- Use floating-point numerical representations and application-sensitive notions of matching or near-matching
- Use a powerful visualization tool
What are R-Squared and Adjusted R-Squared?
R-Squared measures the proportion of the variation in the dependent variable explained by all or selected independent variables in the model. Not every independent variable helps explain the dependent (target) variable or contributes to predicting it.
Mathematically, R-Squared = 1 − (SSres / SStot), where SSres is the residual sum of squares and SStot is the total sum of squares.
Adjusted R-Squared measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable, penalizing variables that do not. It can be calculated as Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)], where n is the number of observations and k is the number of independent variables.
What is Normal Distribution?
Also known as the Gaussian distribution or the bell curve, the normal distribution is a probability distribution that fits many natural phenomena such as heights, measurement errors, and IQ scores. It describes how the values of a variable are distributed: most observations cluster around the central peak, and the probabilities for values further from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.
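One handy consequence is the empirical (68–95) rule, which we can verify by simulation. This sketch assumes IQ-like scores with mean 100 and standard deviation 15:

```python
import random

random.seed(42)
# Draw 100,000 samples from a normal distribution (mean 100, std dev 15).
samples = [random.gauss(100, 15) for _ in range(100_000)]

mu, sigma = 100, 15
within_1sd = sum(mu - sigma <= x <= mu + sigma for x in samples) / len(samples)
within_2sd = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in samples) / len(samples)
print(round(within_1sd, 2))  # ≈ 0.68
print(round(within_2sd, 2))  # ≈ 0.95
```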
Name some of the most useful statistical methods used in data analysis.
The statistical methods that are mostly used by data analysts are:
- Linear Regression
- Resampling Methods
- Subset Selection
- Sample Size Determination
- Hypothesis Testing
- Standard Deviation
What is the difference between data mining and data profiling?
Data mining is the practice of automatically discovering patterns and trends that go beyond simple query and reporting techniques while searching large stores of data. It uses mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).
Some of the properties of Data Mining are:
- Automatic discovery of patterns
- Prediction of likely outcomes
- Creation of actionable information
- Focus on large data sets and databases
Data Profiling, on the other hand, deals with evaluating data sets and identifying anomalies and problems such as missing data, inconsistent data, and duplicate records. These anomalies need a cleanup, or the data should go through a conforming process, before it is delivered. Data Profiling focuses on checking the relationships between tables and complex business rules, helps us better understand the data source, and prevents downstream problems.
What is Data Cleansing? What are its best practices?
Data cleansing (or data cleaning) is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Incorrect data can make outcomes and algorithms unreliable, even though they may look correct.
There is no single absolute data cleaning process; the steps vary with the dataset. A few of the best practices in data cleaning are:
- Removal of duplicate or irrelevant observations
- Fixing structural errors like strange naming conventions, typos, or incorrect capitalization
- Filtering unwanted outliers
- Handling missing data properly
- Validation and QA
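Several of these practices can be sketched in one small cleaning pass. The records and field names below are made up for illustration:

```python
def clean_records(records):
    """Minimal cleaning pass: normalize casing, drop missing-name rows, deduplicate."""
    seen, cleaned = set(), []
    for rec in records:
        name = (rec.get("name") or "").strip().title()  # fix whitespace / capitalization
        if not name:
            continue  # drop records missing the required field
        key = (name, rec.get("city"))
        if key in seen:
            continue  # drop duplicate observations
        seen.add(key)
        cleaned.append({"name": name, "city": rec.get("city")})
    return cleaned

raw = [
    {"name": "alice ", "city": "NY"},
    {"name": "Alice", "city": "NY"},   # duplicate after normalization
    {"name": None, "city": "LA"},      # missing value
]
print(clean_records(raw))  # → [{'name': 'Alice', 'city': 'NY'}]
```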
What Is Standard Deviation?
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance and is denoted by σ.
The more the data points disperse away from the mean, the higher the deviation within the data set, and thus the higher the standard deviation. Standard deviation is often applied to the annual rate of return of an investment, where it shows the historical volatility of that investment. For instance, a volatile stock shows a high standard deviation, while stable stocks show lower standard deviations.
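The volatility comparison can be illustrated with two made-up return series using Python's standard library:

```python
import statistics

# Hypothetical monthly returns for a stable and a volatile investment.
stable = [0.02, 0.01, 0.02, 0.015, 0.018]
volatile = [0.10, -0.08, 0.15, -0.12, 0.09]

# The volatile series has a much higher standard deviation.
print(round(statistics.stdev(stable), 4))
print(round(statistics.stdev(volatile), 4))
```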
Why is version control important?
Version control is important and is required for the following:
- To compare files, identify differences, and consolidate the changes seamlessly.
- To keep track of application development by identifying stages of development, testing, QA, and production.
- To maintain a complete history of project files that can be used as backups.
- To track changes made in the code of different files.
What is a good data model?
A data model can be considered good and developed if it:
- Makes near-accurate predictions of outcomes
- Adapts well to growing data and changing business needs
- Scales in proportion to changes in data size
- Offers tangible and profitable results
What are clustered and non-clustered indexes in SQL? Explain the difference between the two.
Indexes speed up finding the rows associated with key values in a table or view in a database. A clustered index is faster to read because the rows are physically stored in index order, and there can be only one per table. Non-clustered indexes are created separately from the table, which makes reads slower; unlike clustered indexes, a table can have many non-clustered indexes.
Here are few striking differences between the two types of SQL indexes:
| Parameters | Clustered Index | Non-Clustered Index |
| --- | --- | --- |
| Used for | Sorting the records and physically storing the index with the data | Creating a logical order for data rows, using pointers to the physical data files |
| Storing | Stores data pages in the leaf nodes of the index | Never stores data pages in the leaf nodes of the index |
| Size | Large | Smaller compared to a clustered index |
| Type of key | Primary key of the table | Unique constraint on the table |
| Main feature | Improves the performance of data retrieval | Should be created on columns used in joins |
| Additional disk space | Not required | Required to store the index separately |
| Data access | Faster | Slower compared to the clustered index |
You have 10 bags of marbles with 10 marbles in each bag. All but one bag has marbles which weigh 10g each. The exception’s marbles weigh 11g each. How would you determine which bag has 11g marbles using a weighing scale only once?
Take a different number of marbles from each bag: 1 from the first bag, 2 from the second, and so on, up to all 10 from the tenth bag, and weigh them all together in a single weighing. If every marble weighed 10 g, the 55 marbles would weigh 550 g. Subtract 550 from the actual weight: each 11 g marble adds exactly 1 g of excess, so the difference in grams equals the number of the bag from which you took that many marbles. That is the heavier bag.
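The reasoning above can be checked with a short simulation (the function name is ours):

```python
def find_heavy_bag(weights):
    """weights[i] is the per-marble weight of bag i+1 (one bag is 11 g, the rest 10 g)."""
    # Take i+1 marbles from bag i+1 and weigh them all together, once.
    total = sum((i + 1) * w for i, w in enumerate(weights))
    expected = 10 * sum(range(1, len(weights) + 1))  # 550 g if every marble were 10 g
    return total - expected  # excess grams = bag number

bags = [10] * 10
bags[6] = 11  # bag 7 holds the 11 g marbles
print(find_heavy_bag(bags))  # → 7
```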
The frequently asked interview questions featured in this article come from questions our learners faced during their recent data analyst interviews. They are handpicked to cover important topics that are likely to be asked, so you can treat this as a checklist of concepts to know before your next data analyst interview.
If you are curious about learning data analytics and data science in depth to stay at the forefront of fast-paced technological advancements, check out Jain University & IIHT’s PG Program in Big Data Development. Happy learning!