
Data Science Tutorial For Beginners With Python Programming Language - Programming - Nairaland


Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 8:41am On Aug 29, 2016
Hello NL,
I will attempt to share the little I know about "Data Science" using the Python programming language.

The tutorial will walk you through a typical Data Science problem where we will scrape web data, then analyze and visualize the dataset to extract value from it.

I am not sure whether anyone out there will be interested?


Here is the content of the tutorial:-
1) Introduction
2) Web Data Scraping
3) Data Cleaning
4) Data Analysis and Visualization


PS: At the end of this tutorial, you will be able to pick a small dataset available online and, using Python language, quickly calculate descriptive statistics and show their results with basic charts and tables.


Re: Data Science Tutorial For Beginners With Python Programming Language by Nasa28(m): 8:42am On Aug 29, 2016
following
Re: Data Science Tutorial For Beginners With Python Programming Language by Favorite1: 8:48am On Aug 29, 2016
Following
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 8:56am On Aug 29, 2016
OK, let's start with an introduction.
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 9:36am On Aug 29, 2016
[size=20pt]Introduction[/size]

Even though this tutorial is for beginners, I will assume some basic knowledge of the following:-
~ Mathematics/Statistics
~ Use of Computer
~ Python programming


[size=15pt]What is "Data Science"?[/size]
There are several definitions of "Data Science" available. I am going to use a simple one: "Data Science involves extracting and interpreting data effectively and presenting it in a simple, non-technical language to the end-users" (source: Edureka.co).

According to the data science Venn diagram on Drewconway.com, data science lies at the intersection of:
• Hacking skills
• Math and statistics knowledge
• Substantive expertise



There’s a joke that says "a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician".


[size=15pt]Tutorial Tool[/size]
There are several tools for data science, including:-
~ Python
~ R
~ MATLAB
~ SAS
~ Julia
~ SQL
~ RapidMiner
~ DataRobot
~ Weka
~ SPSS
Any tool you decide to choose is good. However, I will use the Python programming language for this tutorial, because it is one of the top data science tools out there for crunching data. It is closely followed by R.


[size=15pt]Tutorial Dataset[/size]
The dataset we are going to use for this tutorial is the Birthday list on the NairaLand homepage (i.e. NairaLand Forum Members' Birthday Data).

[img]http://2.bp..com/-FrimAg6bIts/V7XZg_rir3I/AAAAAAAABFY/JKfvcnRiHusAEWTTfF7Rytj0fWQ2duvVACLcB/s1600/Sample_NL_Birthdays.bmp[/img]




Using the Birthday dataset we will attempt to answer some questions like:-
a) How many members are celebrating their birthdays today?
b) Who is the oldest and who is the youngest member celebrating his/her birthday today?
c) What is the average age of the celebrants?
d) How old will each celebrant be in 10 years?
e) How old was each celebrant when NairaLand was established?


[size=15pt]Python Packages[/size]
Since we are using Python programming, let me list the packages/libraries needed for this tutorial and what we are going to use them for.
~ re, requests, BeautifulSoup: libraries for Scraping and Cleaning the data
~ pandas, datetime: libraries for Analyzing and Visualizing the data

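You need to have them installed on your Python distribution using "pip install package_name" (re and datetime already ship with Python, BeautifulSoup is installed under the name beautifulsoup4, and the "lxml" parser used later needs the lxml package). Or get a Python distribution (such as Anaconda Python or Enthought Canopy) that has all the packages installed by default.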
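
If you are not sure whether the packages are already installed, a quick check like the sketch below may help. The helper name find_missing is my own (hypothetical), and the list assumes the import names used in this tutorial (BeautifulSoup is imported as "bs4"):

```python
import importlib.util

# Import names used in this tutorial; note that BeautifulSoup is imported as "bs4"
needed = ["re", "datetime", "requests", "bs4", "lxml", "pandas"]


def find_missing(module_names):
    """Return the module names that cannot be found on this Python installation."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing(needed)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules are available.")
```

Anything reported as missing can then be installed with pip as described above.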

That is it for this class.
Next we will look at how to scrape the birthday data from the Nairaland home page.


Re: Data Science Tutorial For Beginners With Python Programming Language by Stconvict(m): 1:10pm On Aug 29, 2016
Following...
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 2:14pm On Aug 29, 2016
I forgot to mention the applications or uses of Data Science.

Here are some Applications/Uses of Data Science:-

~ Internet Search
~ Digital Advertisements (Targeted Advertising and re-targeting)
~ Recommender Systems
~ Image Recognition
~ Speech Recognition
~ Gaming
~ Price Comparison Websites
~ Airline Route Planning
~ Fraud and Risk Detection
~ Delivery logistics
~ Self Driving Cars
~ Robots

Apart from the applications mentioned above, data science is also used in Marketing, Finance, Human Resources, Health Care, Government Policies and every possible industry where data gets generated.

Read more at Analytics Vidhya


Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 7:11am On Aug 30, 2016
[size=20pt]Web Data Scraping - Scrape birthday data from Nairaland home page[/size]

In this class, we are going to collect our dataset for this tutorial. Recall that our dataset is the birthday list of Nairaland members on the home page. The home page url is at: https://www.nairaland.com/home. If you scroll down the page, you will see the list of members having their Birthday today!

Fine, we now know where our data is located on the Web. So we need to collect it for Data Science purposes. We could simply go to the web page, then copy and edit the Birthday list by hand for our analysis.

But since we are going to collect this birthday dataset over a long period of time (probably for one year) on a daily basis, copying and editing the list by hand will not be efficient. So we need a way to automate the boring process by doing what is known as web scraping.

Web scraping is a software technique for extracting information from websites: a program downloads content from the Web and processes it. The technique mostly focuses on transforming unstructured data on the web (HTML format) into structured data (a database or spreadsheet). For example, Google runs many web scraping programs to index web pages for its search engine.


[size=15pt]Understanding the Birthday dataset[/size]
Before we start collecting the dataset, let's try to understand the nature of the dataset.

The Birthday list is in this format: rodbel(29), Sirolad(29), mokei(27)...

The first word is the username of the member, followed by the member's age in parentheses. That is: member_username(age).

The format above isn't friendly for Data Science. Tabular datasets are more suitable for Data Science, so we need to clean it into a tabular format useful in Python data science.

Note: If you inspect the html of the Birthday list, you should see that it is contained in a cell of a table tag (<td> ......... </td>).

In Summary: We want to scrape data from this format "rodbel(29), Sirolad(29), mokei(27)" into tabular format.
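
To make the goal concrete, here is a minimal sketch of that conversion on the sample string from above. It is only an illustration using capture groups in the 're' module, and it assumes usernames consist of letters, digits and underscores only; it is not the exact code used later in this tutorial:

```python
import re

raw = "rodbel(29), Sirolad(29), mokei(27)"

# Capture the username (word characters) and the age (digits inside parentheses) separately
pattern = re.compile(r"(\w+)\((\d+)\)")

# Each match becomes one row: (username, age as an integer)
rows = [(username, int(age)) for username, age in pattern.findall(raw)]

# Print the rows as a small table
print(f"{'Username':<10} Age")
for username, age in rows:
    print(f"{username:<10} {age}")
```

Each tuple in rows is one row of the table we are after: a Username column and an Age column.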




[size=15pt]Warning before Scraping a website[/size]

There are a few points that we need to go over before we start scraping.

~ Always check the website’s terms and conditions before you scrape it. They usually have terms that limit how often you can scrape or what you can scrape.

~ Because your script will run much faster than a human can browse, make sure you don’t hammer their website with lots of requests. This may even be covered in the terms and conditions of the website.

~ You can get into legal trouble if you overload a website with your requests or you attempt to use it in a way that violates the terms and conditions you agreed to.

~ Websites change all the time, so your scraper will break some day. Know this: You will have to maintain your scraper if you want it to keep working.

~ Unfortunately the data you get from websites can be a mess. As with any data parsing activity, you will need to clean it up to make it useful to you.
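
One simple way to avoid hammering a website is to force a minimum pause between requests. The sketch below is only an illustration: the class name Throttle and the interval are my own (hypothetical) choices, and a suitable pause depends on the website you are scraping:

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Demo with a very short interval; a real scraper would use a much longer one
throttle = Throttle(0.1)
for page_number in range(3):
    throttle.wait()  # call this right before each request
    print("would fetch page", page_number)
```

In a real script you would call throttle.wait() just before each requests.get(...) call.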

With that out of the way, let’s start scraping!

Let's extract the Birthday list first using these Python modules: re, requests, and BeautifulSoup. Then in the next class, we will clean the dataset into tabular format.

The code below does the extraction for us.

# import the libraries we are going to use

# libraries for Scraping and Cleaning the data
import re
import requests
from bs4 import BeautifulSoup

# libraries for Analyzing and Visualizing the data
import pandas as pd
from datetime import datetime


# Scraping out the raw html code of nairaland home page
url = "https://www.nairaland.com/home"
raw_html = requests.get(url) # returns a response object holding the page's complete html code

# print (raw_html.text)

raw_data = raw_html.text # save the html text in an object

soup_data = BeautifulSoup(raw_data, "lxml") # use the BeautifulSoup module (with the lxml parser) to parse the html and save it in an object

# lets display only the part of the data we need. It is contained in the cells of the table tag (<td>)
# calling soup_data("td") is a shortcut for soup_data.find_all("td")
soup_data("td")
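
If you want to try the <td> extraction without touching the live website, a small offline sketch like this one may help. The HTML fragment is made up (hypothetical) and the real page markup may differ; it also uses Python's built-in "html.parser" instead of "lxml", so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ
sample_html = """
<table>
  <tr><td>Some unrelated cell</td></tr>
  <tr><td>rodbel(29), Sirolad(29), mokei(27)</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")  # built-in parser, no lxml needed

# find_all("td") is the long form of calling soup("td")
cell_texts = [cell.get_text(strip=True) for cell in soup.find_all("td")]

for text in cell_texts:
    print(text)
```

Only the second cell holds the birthday list, which is why the cleaning step later filters the cells with a regular expression.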



The completed code is available in Jupyter/IPython notebook at: http://nbviewer.jupyter.org/github/forum2k9/NairaLand-Members-Birthday-Data/blob/master/NairaLand_Members_Birthday_Data.ipynb


In the next class, we will Clean the data into a friendly (tabular) format.

Re: Data Science Tutorial For Beginners With Python Programming Language by noordean(m): 10:40am On Aug 30, 2016
Good job boss. Please where is the Nairaland's terms and conditions located? can't see it
Re: Data Science Tutorial For Beginners With Python Programming Language by kingofthejungle(m): 1:22pm On Aug 30, 2016
u have a nice blog though
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 1:39pm On Aug 30, 2016
noordean:
Good job boss.
Please where is the Nairaland's terms and conditions located?
can't see it

I also searched but couldn't find it, except for the posting Rules.
Re: Data Science Tutorial For Beginners With Python Programming Language by tajoo: 2:53pm On Aug 30, 2016
thanks umar...im totally a novice but i find it interesting
Re: Data Science Tutorial For Beginners With Python Programming Language by Fedric(m): 7:42pm On Aug 30, 2016
umaryusuf:


I also searched couldn't it except for the posting Rules
bro what programming language can i use to create a forum like nairaland? Python or PHP?
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 5:46am On Aug 31, 2016
Before I post today's class on Data Cleaning, let me mention this....

The web scraping technique explained above can be used to extract virtually any kind of data from the web. For example: you can extract phone numbers or email addresses from a web page using web scraping; all you have to do is define the exact pattern of what you intend to extract.

Weather data, Stock data, Social/Media data etc can all be scraped from the web.
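
As a rough illustration of "defining the exact pattern", the sketch below pulls email addresses and phone numbers out of a made-up piece of text. Both patterns are deliberately simple assumptions of mine (real emails and phone formats vary a lot), so treat them as a starting point only:

```python
import re

# Made-up text standing in for the text of a scraped page
text = "Contact admin@example.com or sales@example.org, or call 0803-123-4567."

# A deliberately simple email pattern: name, "@", then a dotted domain
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# An assumed phone format of 4-3-4 digits separated by dashes
phone_pattern = re.compile(r"\b\d{4}-\d{3}-\d{4}\b")

emails = email_pattern.findall(text)
phones = phone_pattern.findall(text)

print(emails)
print(phones)
```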
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 7:35am On Aug 31, 2016
[size=15pt]Data Cleaning - Clean the data into a friendly format[/size]
Let's clean our data by stripping out all irrelevant text and keeping only the birthday list in the format: Username, Age. The result will be saved in a CSV file.


# lets read out the text only, ignoring the td tags themselves
for data in soup_data("td"):
    print (data.text)

# Obviously, we don't need every text above. So use the 're' module to extract only the relevant birthday list
# Note: I will ignore those members whose ages are not displayed, so that we don't have to deal with NaN values in our data

member_found = [] # stays empty if no birthday list is found

re_match = r"\w+\(\d+\)" # one or more word characters, followed by '(', followed by one or more digits, followed by ')'

for data in soup_data("td"):
    data_found = re.findall(re_match, data.text)

    if data_found:
        member_found = data_found

print (member_found)

# Lets further clean up the list to separate Usernames from age
# Use list comprehension to replace the closing parenthesis ")" with empty "" in member_found above

member_found_replaced = [x.replace(")", "") for x in member_found] # replaces ")" by ""

print (member_found_replaced)

# Now split "member_found_replaced" based on '(' between the usernames and age
# we use a for loop to loop through each item of the "member_found_replaced" list above

for y in member_found_replaced:
    member_cleaned = y.split("(")
    print (member_cleaned)


# we first declare "member_cleaned" as an empty dictionary, so we can add each username/age pair above into it

member_cleaned = {}

for y in member_found_replaced:
    temp_data = y.split("(")

    member_cleaned[temp_data[0]] = int(temp_data[1])

print (member_cleaned)

# convert the dictionary "member_cleaned" above into a Pandas DataFrame
# Note: in python 3, we have to convert the dictionary items into a list to work with Pandas DataFrame


# define the column names
columns_name = ["Username", "Age"]

# df = pd.DataFrame(member_cleaned.items(), columns = columns_name ) # this is for python 2
df = pd.DataFrame(list(member_cleaned.items()), columns = columns_name )

df

# Lets add a column for today's date

# using the datetime module


todays_date = datetime.now().date()

df["Date"] = todays_date

df

# Lets save the dataframe into a csv file
# we name the csv file with the current date, i.e: 14/08/2016 will be 20160814 for the file name

csv_name = todays_date.strftime("%Y%m%d")

df.to_csv(csv_name + ".csv")


After you have collected the dataset for months, you can then merge all the CSV files into one using the pandas concat() function. concat() takes in a list of dataframes (one per CSV file) to be merged together.
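
A minimal sketch of that merge is shown below. So that it runs on its own, it first writes two small made-up daily files into a temporary folder; the file names and values are hypothetical. Note that these sample files are written without the index column; if your CSVs were saved with the default index, you may want to pass index_col=0 to read_csv:

```python
import glob
import os
import tempfile

import pandas as pd

# Build two small hypothetical daily files so the sketch runs on its own
folder = tempfile.mkdtemp()
pd.DataFrame(
    {"Username": ["rodbel", "Sirolad"], "Age": [29, 29], "Date": "2016-08-30"}
).to_csv(os.path.join(folder, "20160830.csv"), index=False)
pd.DataFrame(
    {"Username": ["mokei"], "Age": [27], "Date": "2016-08-31"}
).to_csv(os.path.join(folder, "20160831.csv"), index=False)

# Read every daily CSV (sorted by file name, hence by date) into a list of dataframes
csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
daily_frames = [pd.read_csv(path) for path in csv_files]

# Stack the daily dataframes on top of each other with a fresh row index
merged = pd.concat(daily_frames, ignore_index=True)

print(merged)
```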

As mentioned earlier, the complete source code is on IPython notebook at: http://nbviewer.jupyter.org/github/forum2k9/NairaLand-Members-Birthday-Data/blob/master/NairaLand_Members_Birthday_Data.ipynb

In the next class, we will discuss how to Analyze and Visualize the dataset.


Re: Data Science Tutorial For Beginners With Python Programming Language by Fedric(m): 8:03am On Aug 31, 2016
great job, bro. your student is fully present.
Re: Data Science Tutorial For Beginners With Python Programming Language by Stconvict(m): 11:07pm On Aug 31, 2016
Thanks Umar!
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 6:22am On Sep 01, 2016
No one has asked any question so far! Well, it means there is no problem with the classes above.

We had to go through the above process of Data Scraping/Collection and Cleaning due to the nature and location of our dataset. "We have to cook our data before we eat it".

In cases where you already have cooked data, you won't need to pass through the collection and cleaning process.

So at this point, each day you run the script, it will extract the day's Nairaland members' birthday list and save it in a CSV file. So by the time you have run the script for one week, you will have 7 different CSV files containing the birthday lists for those seven days.

By the time you have run it for 1 year, you will have enough data to make analyses, visualizations and predictions.



In the last class, we will do some basic analyses and visualizations.

[img]http://4.bp..com/-zYjoae56TRo/V7dYxe7rYoI/AAAAAAAABF4/2xP5Q--oNqM1HdwDQJbEQDLI6Fp69kl0ACLcB/s1600/areaplot.png[/img]

To Analyze and Visualize our data, below are some of the questions we are going to answer:-

a) How many members are celebrating their birthdays today?

b) Who is the oldest and youngest member celebrating his/her birthdays today?

c) What is the average age of the celebrants?

d) How old will each celebrant be in 10 years?

e) How old was each celebrant when NairaLand was established?

Till then, drop your questions or comments/suggestions.
Re: Data Science Tutorial For Beginners With Python Programming Language by neahyo(m): 8:51am On Sep 01, 2016
Thumbs up to you bro for sharing your code. I use R for my analysis but I'm also a Python enthusiast; the Python code for scraping and cleaning the data is really esoteric. My questions are these:
1. After collecting the data, I decided to scrape and clean it using Excel (I can't use Python since I'm still a learner). What is the Python command for importing data which is already in CSV format?
2. How do I subset the command in order to answer objectives 1-5?
3. I'm proficient in R but I'm really passionate about learning Python. How can you help me? I have downloaded some materials already, but they are not helping enough.
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 12:31pm On Sep 01, 2016
Good to hear you are proficient in R. I would love to see you recode this tutorial in R, so we have an R version to compare! Maybe you can post it here in the near future.

I also want to challenge other gurus that use other packages like MATLAB, SPSS, Excel, Julia, Java etc. to kindly provide their versions of this tutorial.





In response to your questions:-

1) There are several commands for working with Excel/CSV files in Python. But since we used the pandas library in this tutorial, the pandas commands to import are:-
~ Excel is: read_excel()
~ CSV is: read_csv()
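
Here is a minimal sketch of read_csv() in action. To keep it self-contained, the CSV content is made up and read from memory; with a real file you would pass the file path instead (for example pd.read_csv("20160830.csv"), where the file name is only an illustration):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file that was cleaned in Excel
csv_text = """Username,Age
rodbel,29
Sirolad,29
mokei,27
"""

# read_csv accepts a file path or a file-like object; StringIO keeps the sketch self-contained
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the imported data
print(df.dtypes)   # column types that pandas inferred
```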



2) The next class will answer the question on: How do I subset the command in order to answer objectives 1-5.



3) I will advise you to use the same approach you used when learning R to learn Python.

But keep at the back of your mind that Python is a general-purpose language, unlike R, so don't waste your time learning Python packages that are not useful for a Data Scientist.

Here is a quick learning path to follow:-
~ Learn Python Basics
~ Learn Python Regular Expression
~ Learn Python Object Oriented Programming
~ Learn Python Data Science libraries


Re: Data Science Tutorial For Beginners With Python Programming Language by Stconvict(m): 5:51am On Sep 03, 2016
Have you considered Julia Umar?
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 8:31am On Sep 03, 2016
Stconvict:
Have you considered Julia Umar?

Yeah, that is another good one for Data Science. Are you using it?
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 9:02am On Sep 03, 2016
[size=15pt]Data Analysis and Visualization - Analyze and Visualize the data[/size]

In this last class "Analyze and Visualize the data", we will do some kind of Quick Data Exploration.

Following is the library we will use: pandas

Remember that we have imported it earlier in our code. So we use its commands to explore the dataset by attempting to answer some useful questions such as:-

a) How many members are celebrating their birthdays today?
b) Who is the oldest and youngest member celebrating his/her birthdays today?
c) What is the average age of the celebrants?
d) How old will each celebrant be in 10 years?
e) How old was each celebrant when NairaLand was established?


Use the describe() function to get the summary statistics of numerical fields. Use the sort_values() function to find the oldest and youngest ages.
Create a new column for the ages in ten years' time, and another column for the ages in 2005.

For plotting, let's plot the 10 youngest members.

Note: If you are using IPython (Jupyter) notebook, to display the plot within the notebook you have to call this magic command: %matplotlib inline

# Checking the statistical summary of the age column
df.describe()

# First 10 oldest members celebrating
df.sort_values(by="Age", ascending=False)[:10]

# First 10 youngest members celebrating
df.sort_values(by="Age", ascending=True)[:10]

# to answer, How old will each celebrant be in 10 years?
df["Age_10_Plus"] = df["Age"] + 10
df

# age at 2005 when NairaLand was established
# the gap is the current year minus 2005 (11 years when this was written in 2016)
df["Age_at_2005"] = df["Age"] - (todays_date.year - 2005)
df

# First 10 youngest members celebrating
youngest_10 = df.sort_values(by="Age", ascending=True)[:10]

# To display the plot within the Jupyter notebook (this magic command only works in IPython/Jupyter)
%matplotlib inline

# vertical bar plot
youngest_10.plot(x="Username", y="Age", kind="bar", title="10 Youngest Members Celebrating")

# horizontal bar plot
youngest_10.plot(x="Username", y="Age", kind="barh", title="10 Youngest Members Celebrating")

# Area plot, just to compare the three columns
df.plot.area()

# box plot on df for the three columns, if there are outliers you will see them
"""In statistics, an outlier is an observation point that is distant from other observations.
An outlier may be due to variability in the measurement or it may indicate experimental error;
the latter are sometimes excluded from the data set."""

df.plot.box()
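
describe() already reports the count, mean, minimum and maximum, but if you prefer to answer questions (a) to (c) one by one, a sketch like the following may help. The small dataframe is made up (hypothetical usernames and ages) so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical sample standing in for one day's cleaned birthday list
df = pd.DataFrame({"Username": ["rodbel", "Sirolad", "mokei", "ada"],
                   "Age": [29, 29, 27, 35]})

celebrant_count = len(df)                           # a) number of celebrants
oldest = df.loc[df["Age"].idxmax(), "Username"]     # b) oldest member
youngest = df.loc[df["Age"].idxmin(), "Username"]   # b) youngest member
average_age = df["Age"].mean()                      # c) average age

print(celebrant_count, oldest, youngest, average_age)
```

Note that idxmax() and idxmin() return only the first match, so if several members share the oldest or youngest age, only one of them is reported.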



Horizontal Bar Plot
[img]http://3.bp..com/-pdRUjRMyt2s/V7dYvoIlGnI/AAAAAAAABF0/uLJITBL427Y1j_1Kp9LDkAIALhVfXO6xACLcB/s1600/10%2Byoungest%2Bmembers%2Bcelebrating.png[/img]

Area Plot
[img]http://4.bp..com/-zYjoae56TRo/V7dYxe7rYoI/AAAAAAAABF4/2xP5Q--oNqM1HdwDQJbEQDLI6Fp69kl0ACLcB/s1600/areaplot.png[/img]

Box Plot
[img]http://3.bp..com/-DYfHvDBUCkA/V7dYxzGqoFI/AAAAAAAABF8/aFORbu_xiQ03Kcj1iwwSEqP2_wO1NPg5ACLcB/s1600/boxplot.png[/img]
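
A common convention for deciding which points a box plot shows as outliers (and the default in matplotlib, which pandas uses for plotting) is the 1.5 x IQR rule: anything more than 1.5 times the interquartile range below the first quartile or above the third quartile is flagged. Here is a minimal sketch of that rule with made-up ages, using only the standard library:

```python
import statistics

# Made-up ages; 75 is far from the rest
ages = [20, 22, 23, 25, 27, 29, 30, 75]

# Quartiles with linear interpolation between data points
q1, _median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [age for age in ages if age < lower_fence or age > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```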


As mentioned before, the complete code notebook is at: http://nbviewer.jupyter.org/github/forum2k9/NairaLand-Members-Birthday-Data/blob/master/NairaLand_Members_Birthday_Data.ipynb

That is it.

I hope you will now be able to pick a small dataset available online and use Python to quickly calculate descriptive statistics and show the results with basic charts and tables.

Good luck in your data science career.


Re: Data Science Tutorial For Beginners With Python Programming Language by Stconvict(m): 1:57pm On Sep 03, 2016
umaryusuf:


Yeah, that is another good one for Data Science. ARe you using it?
Yeah. I'm Nypro. Omaar Yosif. grin
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 3:45pm On Sep 03, 2016
Stconvict:

Yeah. I'm Nypro. Omaar Yosif. grin

Lol! Since it's you, I challenge you to reproduce this thread with Julia!


Re: Data Science Tutorial For Beginners With Python Programming Language by Stconvict(m): 7:58pm On Sep 03, 2016
umaryusuf:


Lol! Since its you, I challenge you to reproduce this thread with Julia?
I would have but I'm currently working on some project.
I will definitely write this in Julia once I'm done. So challenge accepted. grin


Re: Data Science Tutorial For Beginners With Python Programming Language by LoveDecay(m): 7:36am On Sep 04, 2016
Good work Umar, you have done quite well.

I don't know why Seun won't provide Nairaland datasets. At least from the politics section. Reddit gave out their 2015 dataset.

I am into text analytics.


Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 5:22pm On Sep 04, 2016
LoveDecay:
Good work Umar, you have done quite well.

I dont know why Seun wont provide nairaland datasets. At least from the politics section. Reddit gave out their 2015 data set.

I am into text analytics.


You are right boss.

Almost all large websites like Twitter, Facebook, Google and StackOverflow provide APIs to access their data in a structured manner. If you can get what you need through an API, that is almost always the preferred approach over web scraping.
Re: Data Science Tutorial For Beginners With Python Programming Language by ibnquasale(m): 4:32am On Sep 05, 2016
umaryusuf:


Lol! Since its you, I challenge you to reproduce this thread with Julia?

Stconvict:

I would have but I'm currently working on some project.
I will definitely write this in Julia once I'm done. So challenge accepted. grin


Good job Omaar, My super admin! Nypro and Julia. This is Ahmed
Re: Data Science Tutorial For Beginners With Python Programming Language by umaryusuf(m): 7:15am On Sep 06, 2016
ibnquasale:

Good job Omaar, My super admin! Nypro and Julia. This is Ahmed

You are welcome boss.
Re: Data Science Tutorial For Beginners With Python Programming Language by Fosi: 9:55pm On Oct 24, 2016
Hmmmm............Good to know.

I just started learning about Hadoop 3 days ago (because I need to build a data lake) and the more I read, the more I realized that I need to know other programming languages like Python and R.

Kudos to you for sharing your knowledge with people.
