Applied Statistics For Data Analysis/science - Programming


Applied Statistics For Data Analysis/science - Programming - Nairaland


Applied Statistics For Data Analysis/science by ibromodzi: 8:38am On Dec 11, 2020

Data analysis involves inspecting data to gain insights that inform conclusions and support decision making. Its theoretical framework is built on statistics and mathematics, while its implementation relies heavily on computer science. These three fields gave birth to data analysis/science, so this tutorial series highlights the important statistical concepts used in different data analytics tasks, with implementations in whichever statistical tools readers prefer.

I have come in contact with a sizeable number of people who are just finding their way into data science, and I've noticed that most of them take a top-down approach: they concentrate first on using different statistical tools while paying little or no attention to the underlying theoretical concepts on which those tools are built. As a result, many start to question why they got into data science in the first place, because they find it difficult to pinpoint the kind of problems they are solving with these tools or the exact questions they are trying to answer with the data.
If you find yourself in this category, don't be discouraged. I was once in the same dilemma. Just take a deep breath, grab your tools, and follow along in this series.


Re: Applied Statistics For Data Analysis/science by ibromodzi: 6:40pm On Dec 12, 2020

Topics to cover
We are going to cover several topics spanning descriptive and inferential statistics.
Descriptive statistics
1. Types of variables
2. Measures of central tendency
3. Measures of spread
4. Graphs and plots

Inferential statistics
1. Hypothesis testing
2. Parametric assumptions
3. Sample inference (one sample and two samples)
4. Chi-square test of independence
5. One-Way ANOVA
6. Linear regression and correlation
7. Logistic regression

Tools to use
The tremendous improvement in technology has made it possible to implement almost any statistical concept using different tools. In light of this, you can follow this series using any tool(s) of your choice, but personally, I'll be combining a number of Python libraries together with Microsoft Excel.

Resources:

https://www.pythonfordatascience.org/home
https://online.stat.psu.edu/stat200/home
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php


Re: Applied Statistics For Data Analysis/science by ibromodzi: 3:54pm On Dec 14, 2020

Series 1: Variables
Objectives
At the end of this series, you should be able to:
1. Define variables
2. Know different types of variables
3. Know the difference between variables and constants
4. Know what explanatory and response variables are

What are variables?

Variables are characteristics that are measured and can take on different values. In other words, a variable is something that varies between cases or observations. In contrast, a constant remains unchanged for all observations in a research study. Let's look at a few examples to understand this better.

Example 1: A researcher wants to study the relationship between educational qualification and the level of awareness of COVID-19 protocols in a sample of 100 male passengers. The variables are:
(a) Educational qualification, which could range from none to tertiary education
(b) Awareness level, which could be measured using a Likert scale
We also have 100 observations/cases, and biological sex (male) is a constant because every passenger in the sample is male.
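To make the variable/constant distinction concrete, here is a minimal Python sketch. The records and the helper name `split_variables_and_constants` are made up for illustration; they are not real survey data. A column counts as a variable if it takes more than one distinct value across the observations.

```python
# Hypothetical records for illustration only (not real survey data).
passengers = [
    {"sex": "male", "education": "secondary", "awareness": 4},
    {"sex": "male", "education": "tertiary", "awareness": 5},
    {"sex": "male", "education": "none", "awareness": 2},
    {"sex": "male", "education": "primary", "awareness": 3},
]


def split_variables_and_constants(records):
    """Return (variables, constants) based on how many distinct values each column has."""
    variables, constants = [], []
    for name in records[0]:
        distinct_values = {record[name] for record in records}
        if len(distinct_values) > 1:
            variables.append(name)   # the value changes between observations
        else:
            constants.append(name)   # the same value for every observation
    return variables, constants


variables, constants = split_variables_and_constants(passengers)
print("Variables:", variables)
print("Constants:", constants)
```

Here `sex` comes out as a constant because every record has the same value, just as in Example 1.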

Types of Variables

1. Categorical variables: these are names or labels (e.g., gender, race, state) with no logical order, or with a logical order but inconsistent differences between groups (e.g., rankings). Categorical variables are also known as qualitative variables.

2. Numerical variables: these are variables with quantifiable measurements, e.g., height, weight, and average rainfall. They are also known as quantitative variables.

Example 2: A team of clinical researchers wants to study the relationship between age and obesity. Weight, which is used to assess obesity, can be quantified in kilograms or other units, so it is a quantitative (numerical) variable, and so is age. If the researchers also record each participant's gender, that is a category (or label) and is therefore a categorical (qualitative) variable.
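As a rough sketch of how this distinction shows up in code, the snippet below labels each column of a small table as numerical or categorical based on its Python type. The patient records and the function `variable_type` are hypothetical, and the rule is deliberately simple: a category that has been coded as a number (say 1 = male, 2 = female) would be mislabelled, so treat this as a first check, not a rule.

```python
# Hypothetical patient records, made up for illustration.
patients = [
    {"age": 34, "weight_kg": 82.5, "gender": "female"},
    {"age": 51, "weight_kg": 97.0, "gender": "male"},
    {"age": 28, "weight_kg": 64.2, "gender": "female"},
]


def variable_type(values):
    """Label a column 'numerical' if every value is a number, else 'categorical'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"


types = {
    name: variable_type([patient[name] for patient in patients])
    for name in patients[0]
}
print(types)
```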

Variables can also be grouped into explanatory (independent) and response (dependent) variables. In this case, we are trying to use one variable to predict or explain the differences in another variable.

Example 3: A researcher wants to predict nutritional status using racial origin. The researcher then takes a random sample of 100 individuals of different races. The explanatory variable here is race and the response variable is nutritional status.

The next series is going to be on how to describe different variables.


Re: Applied Statistics For Data Analysis/science by ibromodzi: 7:46pm On Dec 18, 2020

Series 2: Measures of Central Tendency
Objectives
At the end of this series, you should be able to:
1. Know what a measure of central tendency is
2. Describe your data using measures of central tendency
3. Know the appropriate measure to use in describing your data

What are measures of central tendency?
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within it. For this reason, measures of central tendency are also referred to as measures of central location. Because they summarize our data, they are also classed as summary statistics. The most commonly encountered measure of central tendency is the mean (also called the average), but it is not the only one: we also have the median and the mode. Now let us look at what each measure tells us about our data.

Mean
The mean is the average of all the values in the dataset; it is calculated by summing the values and dividing by the number of values. For example, if we collect the ages of five boys as 9, 12, 7, 8, and 10, the mean is the sum of these values (46) divided by the number of boys (5), which gives 9.2. The mean therefore describes the central (typical) value of your data rather than the most common one, and it is often not a value actually observed in your data; no boy here is 9.2 years old.
Despite being the most commonly used measure of central tendency, the mean has two major drawbacks:

(a) Susceptibility to outliers: Outliers are unusually large or small values compared to the rest of the data set. For example, suppose the wages of developers in a tech company are 30k, 37k, 45k, 16k, 25k, and 100k. The mean wage for these six staff is 253k / 6 ≈ 42.2k. However, careful inspection of the raw data reveals that the mean might not be the best measure to describe this data, as five of the six wages fall in the 16k - 45k range. The mean is being pulled upward mainly by the one unusually large value (100k). In this situation, a more robust measure of central tendency, such as the median (33.5k here), should be considered.

(b) Skewness: When the dataset has a long tail on one side, the mean loses its ability to provide the best central location because the values in the tail drag it away from the typical value.
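The two worked examples above can be checked with Python's built-in `statistics` module. This is only a small sketch: the variable names are mine, and removing the 100k wage is done purely to show how much one outlier moves the mean compared with the median.

```python
import statistics

ages = [9, 12, 7, 8, 10]              # ages of the five boys
wages_k = [30, 37, 45, 16, 25, 100]   # developer wages, in thousands

mean_age = statistics.mean(ages)            # 46 / 5
mean_wage = statistics.mean(wages_k)        # 253 / 6
median_wage = statistics.median(wages_k)    # average of the two middle wages

# Drop the outlier (100k) to see how each measure reacts.
wages_without_outlier = [w for w in wages_k if w != 100]
mean_without = statistics.mean(wages_without_outlier)
median_without = statistics.median(wages_without_outlier)

print(f"Mean age: {mean_age}")
print(f"Wages with outlier:    mean = {mean_wage:.1f}k, median = {median_wage}k")
print(f"Wages without outlier: mean = {mean_without}k, median = {median_without}k")
```

The mean drops from about 42.2k to 30.6k when the outlier is removed, while the median only moves from 33.5k to 30k, which is why the median is the safer summary for data like this.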

Median
When our data is arranged in order of magnitude, the median is the middle score. If there is an even number of values, it is the average of the two middle scores. It is less affected by skewed data and outliers than the mean.

Mode
The mode is the most frequent score in our data set.
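One thing worth knowing is that a data set can have more than one mode when several values tie for the highest frequency. A short sketch using the standard library (the lists are made-up examples; `statistics.multimode` needs Python 3.8 or newer):

```python
import statistics

scores = [4, 2, 1, 9, 3, 6, 4]
single_mode = statistics.mode(scores)        # 4 appears twice, everything else once

tied_scores = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(tied_scores)  # both 1 and 2 appear twice

print(f"Mode of {scores}: {single_mode}")
print(f"Modes of {tied_scores}: {all_modes}")
```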

Let's see how we can get these measures using Python:

import statistics  # for calculating our mean, median and mode

def calc_measure():
    # Let's create a sample list
    num_list = [4, 2, 1, 9, 3, 6, 4]

    # We can calculate the measures here
    mean = statistics.mean(num_list)
    median = statistics.median(num_list)
    mode = statistics.mode(num_list)

    print(num_list)
    print(f"The mean is {mean}")
    print(f"The median is {median}")
    print(f"The mode is {mode}")

calc_measure()

Challenge: implement mean, median and mode in Python without using any library


Re: Applied Statistics For Data Analysis/science by Nobody: 12:31pm On Dec 20, 2020

well done boss

Re: Applied Statistics For Data Analysis/science by StevDesmond(m): 1:11pm On Dec 20, 2020

Following

Re: Applied Statistics For Data Analysis/science by Kingray10: 11:45pm On Dec 21, 2020

I want to learn data analysis, please can you guide me through

I don't know where to start with.
It's kind of broad...

Re: Applied Statistics For Data Analysis/science by kingreign(m): 1:18pm On Feb 06, 2021

StevDesmond:
Following

Hello, you tried contacting me via the mail feature. Pls contact me via my mobile number. 07034581213.

Re: Applied Statistics For Data Analysis/science by ibromodzi: 6:57am On Feb 22, 2021

Hmmm, it has been a while here.

Re: Applied Statistics For Data Analysis/science by ibromodzi: 7:00am On Feb 22, 2021

Kingray10:
I want to learn data analysis, please can you guide me through
I don't know where to start with.
It's kind of broad...

Where exactly do you need help? You can go through the Chronicles thread created by sir Ejiod, it has a load of resources that can help you. Meanwhile, you can send a pm via email if you want to have a discussion.

Re: Applied Statistics For Data Analysis/science by Kingray10: 1:03am On Mar 12, 2021

ibromodzi:

Where exactly do you need help? You can go through the Chronicles thread created by sir Ejiod, it has a load of resources that can help you. Meanwhile, you can send a pm via email if you want to have a discussion.

Thanks.

Re: Applied Statistics For Data Analysis/science by noob03saibot(m): 3:42am On Apr 06, 2021

Nice thread. Really informative. Would want to ask, please got any idea why pca is used? And how can one make meaning / interpretations from it? Reason why I ask is, why is it useful since I can't seem to interpret it by seeing which Cluster of people belong to which group unlike before pca is performed on the dataset.

Re: Applied Statistics For Data Analysis/science by SuperKlean(m): 8:45pm On Apr 08, 2021

ibromodzi:

Where exactly do you need help? You can go through the Chronicles thread created by sir Ejiod, it has a load of resources that can help you. Meanwhile, you can send a pm via email if you want to have a discussion.

please can I pm you too? I need help also as I'm interested in data analysis. I just started python but I need someone to put me through, thanks.

Re: Applied Statistics For Data Analysis/science by Felixitie(m): 10:09pm On Apr 08, 2021

SuperKlean:
please can I pm you too? I need help also as I'm interested in data analysis. I just started python but I need someone to put me through, thanks.

You need good knowledge of excel, sql, powerbi or Tableau.. Start from excel. GL

Re: Applied Statistics For Data Analysis/science by SuperKlean(m): 10:34pm On Apr 08, 2021

Felixitie:

You need good knowledge of excel, sql, powerbi or Tableau.. Start from excel. GL

thanks a lot

Re: Applied Statistics For Data Analysis/science by Nobody: 12:56am On Apr 09, 2021

noob03saibot:
Nice thread. Really informative. Would want to ask, please got any idea why pca is used? And how can one make meaning / interpretations from it? Reason why I ask is, why is it useful since I can't seem to interpret it by seeing which Cluster of people belong to which group unlike before pca is performed on the dataset.

I want to believe the meaning of the abbreviated “PCA” is principal component factor analysis. It is used to show interrelationships of items and how they load on a factor. It is also used to reduce the number of items in a scale. For cluster of people in a group why not MGA.

Re: Applied Statistics For Data Analysis/science by Nobody: 12:57am On Apr 09, 2021

noob03saibot:
Nice thread. Really informative. Would want to ask, please got any idea why pca is used? And how can one make meaning / interpretations from it? Reason why I ask is, why is it useful since I can't seem to interpret it by seeing which Cluster of people belong to which group unlike before pca is performed on the dataset.

Hope it’s clear

Re: Applied Statistics For Data Analysis/science by ibromodzi: 5:22pm On Apr 09, 2021

noob03saibot:
Nice thread. Really informative. Would want to ask, please got any idea why pca is used? And how can one make meaning / interpretations from it? Reason why I ask is, why is it useful since I can't seem to interpret it by seeing which Cluster of people belong to which group unlike before pca is performed on the dataset.

Principal Component Analysis is used to reduce the dimensionality of a dataset that contains a lot of variables. PCA lets you bring down the number of variables by creating new representative variables (principal components) that are uncorrelated with one another and can be used in your analysis.

With PCA, you are able to reduce the dimension of your dataset while minimizing information loss at the same time.
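To make this a bit more concrete, here is a toy sketch of PCA on just two features, written with only the standard library so the steps are visible. Everything in it is an assumption for illustration: the height/weight numbers are made up, the function names `fit_pca_2d` and `transform` are mine, and the closed-form eigenvalue formula only works for the 2x2 case. In real work you would normally standardize the features first and use a library implementation.

```python
import math


def fit_pca_2d(rows):
    """Fit PCA on (x, y) pairs: returns the feature means, components and variances."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Entries of the 2x2 sample covariance matrix [[var_x, cov], [cov, var_y]].
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix (largest first).
    half_trace = (var_x + var_y) / 2
    root = math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)
    variances = (half_trace + root, half_trace - root)
    # Eigenvector that belongs to the largest eigenvalue.
    if cov != 0:
        direction = (cov, variances[0] - var_x)
    else:
        direction = (1.0, 0.0) if var_x >= var_y else (0.0, 1.0)
    length = math.hypot(direction[0], direction[1])
    pc1 = (direction[0] / length, direction[1] / length)
    pc2 = (-pc1[1], pc1[0])  # perpendicular to pc1
    return {"mean": (mean_x, mean_y), "components": (pc1, pc2), "variances": variances}


def transform(model, rows, n_components=1):
    """Project rows onto the first n_components using an already fitted model."""
    mean_x, mean_y = model["mean"]
    kept = model["components"][:n_components]
    return [
        tuple((x - mean_x) * cx + (y - mean_y) * cy for cx, cy in kept)
        for x, y in rows
    ]


# Hypothetical (height in cm, weight in kg) pairs, for illustration only.
train = [(150, 50), (160, 58), (165, 63), (170, 68), (175, 72), (180, 80)]
model = fit_pca_2d(train)
explained = model["variances"][0] / sum(model["variances"])

# New data is projected with the SAME fitted model; PCA is not fitted again.
new_rows = [(168, 66)]
new_scores = transform(model, new_rows)

print("First component (weights on height, weight):", model["components"][0])
print(f"Share of variance kept by one component: {explained:.3f}")
print("Score of the new row:", new_scores[0])
```

Two things to note. The numbers inside the first component show how much each original feature contributes to the new variable, which is the usual starting point for interpreting it. And because the new row is projected with the model fitted on the training data, it ends up with exactly the same columns the downstream model was built on.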

Re: Applied Statistics For Data Analysis/science by ibromodzi: 5:23pm On Apr 09, 2021

SuperKlean:
please can I pm you too? I need help also as I'm interested in data analysis. I just started python but I need someone to put me through, thanks.

No problem

Re: Applied Statistics For Data Analysis/science by noob03saibot(m): 4:47am On Apr 13, 2021

ibromodzi:

Principal Component Analysis is used to reduce the dimensionality of a dataset that contains a lot of variables. PCA lets you bring down the number of variables by creating new representative variables (principal components) that are uncorrelated with one another and can be used in your analysis.

With PCA, you are able to reduce the dimension of your dataset while minimizing information loss at the same time.

Brother, I know all these. What I am trying to find out is how we interpret / read meaning into it. Most of the features are compressed, and in the process we lose a lot of meaning from the data.

Let's say, for instance, I have a dataset that contains 1,000 rows and 35 features (first name, last name, age, height, weight, salary, total balance, etc.). Before I perform PCA on the dataset, I can read and interpret the features. I can tell that row number 1 has the observation (John, Smith, 39, 1.7, 80kg, 20000, 100000, etc.), and the rows go down to row 1,000.

Now when I perform PCA on this dataset, the 35 columns might shrink to 5 features. The feature names would vanish. How do we read meaning into the new output that comes out? It's almost impossible to make meaning out of it.

Another problem I discovered is that once I build my model on the PCA output, I find it difficult to reuse that model on another dataset that I would love to make predictions on. It says something like the new dataset doesn't have the same features as those used to build the model, even when I have performed PCA on the new dataset I want to predict.

So that's why I am asking: what's really the use of this PCA when I can't make meaning out of its results, and I am finding it really difficult to reuse a model built from a dataset that I used PCA on?

Don't know if you understand me. Sorry for the lengthy read.

Re: Applied Statistics For Data Analysis/science by noob03saibot(m): 4:50am On Apr 13, 2021

ejibaba:

I want to believe the meaning of the abbreviated “PCA” is principal component factor analysis. It is used to show interrelationships of items and how they load on a factor. It is also used to reduce the number of items in a scale. For cluster of people in a group why not MGA.

Yeah, principal component analysis. Don't get you, sorry, please what do you mean by MGA?

Re: Applied Statistics For Data Analysis/science by Nobody: 8:58am On Apr 13, 2021

noob03saibot:

Yeah, principal component analysis. Don't get you, sorry, please what do you mean by MGA?

Multi group analysis. Since you intend measuring differences in groups. I hope I got you right?

Re: Applied Statistics For Data Analysis/science by Nobody: 9:19am On Apr 13, 2021

noob03saibot:

Brother, I know all these. What I am trying to find out is how we interpret / read meaning into it. Most of the features are compressed, and in the process we lose a lot of meaning from the data.

Let's say, for instance, I have a dataset that contains 1,000 rows and 35 features (first name, last name, age, height, weight, salary, total balance, etc.). Before I perform PCA on the dataset, I can read and interpret the features. I can tell that row number 1 has the observation (John, Smith, 39, 1.7, 80kg, 20000, 100000, etc.), and the rows go down to row 1,000.

Now when I perform PCA on this dataset, the 35 columns might shrink to 5 features. The feature names would vanish. How do we read meaning into the new output that comes out? It's almost impossible to make meaning out of it.

Another problem I discovered is that once I build my model on the PCA output, I find it difficult to reuse that model on another dataset that I would love to make predictions on. It says something like the new dataset doesn't have the same features as those used to build the model, even when I have performed PCA on the new dataset I want to predict.

So that's why I am asking: what's really the use of this PCA when I can't make meaning out of its results, and I am finding it really difficult to reuse a model built from a dataset that I used PCA on?

Don't know if you understand me. Sorry for the lengthy read.

PCA is not the right technique for your analysis. However, I will come down to the level of your analysis and explain it this way. First, look at columns, not rows, so check the way the data was coded. Second, you got 5 variables from 35. This means the other factors did not load sufficiently to be counted as part of the features. So when rotated, only 5 features were valid to be counted as part of your model. I also guess you didn't specify the number of factors you needed. So if it produced 5 features, it means the other features are not supposed to be part of the model, and only 5 variables out of the 35 are valid. Third, when the names vanish, you are expected to check the items in the 5 features and take an action, and that action depends on the basis on which the features were obtained. If it is theoretical, you are expected to check the items and be sure they allow you to rename the vanished features based on previous theoretical underpinnings. I will stop at this point. But you used a wrong example, so I really can't explain more.

Re: Applied Statistics For Data Analysis/science by noob03saibot(m): 3:44pm On Apr 13, 2021

ejibaba:

Multi group analysis. Since you intend measuring differences in groups. I hope I got you right?

OK, gotcha. One more thing, why is it necessary to scale our data before conducting kmeans?

I noticed something, though people say it's wrong: any time I don't scale the data (standard scaler), I find it really easy to understand it. That is after picking my n_clusters from my elbow plot.
I then put up some features in a scatterplot then set hue= Clusters.

When I do something like the above without scaling, the features are very easy to understand and easily grouped. For instance, if I don't scale it, I might go with a figure decided by the elbow plot, let's say 2 Clusters.

Then I pair some features in a scatterplot. Let's say features, "credit amount" and "age". Then set hue= Clusters. Now, in the chart I might see two different colors representing some interesting findings. Example green color may indicate people with a credit amount of below 5000, then yellow may indicate people with a credit amount of more than 5,000. This is so easy to discover and make meaning of by mere looking at the chart, the scatterplot.

However, if I scale the data and try the above, it's always difficult for me to make out any meaningful thing from what I plotted.

That's why I am asking, is it really necessary to scale our data before carrying out kmeans elbow plot.
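One way to see what scaling changes: k-means groups points by Euclidean distance, so a feature measured in thousands (credit amount) can swamp a feature measured in tens (age). The sketch below uses made-up customers and a hypothetical helper `credit_share` to show how much of the squared distance between two customers comes from credit amount, before and after z-score standardization. It is only an illustration of the distance calculation, not a full k-means run.

```python
import statistics

# Hypothetical (credit_amount, age) pairs, for illustration only.
customers = [(1000, 25), (1200, 60), (4000, 26), (9000, 45), (7000, 33), (2500, 52)]


def standardize(rows):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def credit_share(p, q):
    """Fraction of the squared Euclidean distance that comes from the first feature."""
    credit_part = (p[0] - q[0]) ** 2
    age_part = (p[1] - q[1]) ** 2
    return credit_part / (credit_part + age_part)


scaled = standardize(customers)

# Compare the first two customers: similar credit amounts, very different ages.
raw_share = credit_share(customers[0], customers[1])
scaled_share = credit_share(scaled[0], scaled[1])

print(f"Unscaled: {raw_share:.1%} of the squared distance comes from credit amount")
print(f"Scaled:   {scaled_share:.1%} of the squared distance comes from credit amount")
```

Without scaling, almost all of the distance comes from credit amount even for two customers 35 years apart in age, so the clusters end up splitting on credit amount alone. That would explain why the unscaled plot looks so clean along that one axis, while the scaled version lets age influence the grouping too.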

Sorry I am asking too many questions o. @ Ibromodzi, I don't mean to derail your thread. I am a huge but silent follower of your posts.

Re: Applied Statistics For Data Analysis/science by noob03saibot(m): 3:45pm On Apr 13, 2021

ejibaba:

PCA is not the right technique for your analysis. However, I will come down to the level of your analysis and explain it this way. First, look at columns, not rows, so check the way the data was coded. Second, you got 5 variables from 35. This means the other factors did not load sufficiently to be counted as part of the features. So when rotated, only 5 features were valid to be counted as part of your model. I also guess you didn't specify the number of factors you needed. So if it produced 5 features, it means the other features are not supposed to be part of the model, and only 5 variables out of the 35 are valid. Third, when the names vanish, you are expected to check the items in the 5 features and take an action, and that action depends on the basis on which the features were obtained. If it is theoretical, you are expected to check the items and be sure they allow you to rename the vanished features based on previous theoretical underpinnings. I will stop at this point. But you used a wrong example, so I really can't explain more.

hmmm! OK o. Would do more digging into it. Thanks man

Re: Applied Statistics For Data Analysis/science by Nobody: 4:03pm On Apr 13, 2021

noob03saibot:
hmmm! OK o. Would do more digging into it. Thanks man

Pls do.

PCA, EFA, CFA and CCA are all self explanatory. Just know when to use them the rest is history.

