Sunday, October 20, 2019

Data Wrangling: Joins




Joins (Merge):
We merge (join) two DataFrames by using merge().
The merge operation is similar to a JOIN in SQL. When joining with on, both DataFrames must contain the key column(s) under the same name.
The arguments to merge() let us perform a natural (inner) join, left join, right join and full outer join:
- left: DataFrame1
- right: DataFrame2
- on: column name(s) to join on. Must be found in both the left and right DataFrame objects.
- how: type of join to perform: left, right, outer or inner. The default is an inner join.
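
As a minimal sketch of these arguments, the snippet below merges two small DataFrames on a shared column. The sample data and the column names (id, name, score) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# 'on' names the key column present in both frames.
# 'how' is omitted, so the default inner join is used.
inner = pd.merge(left, right, on="id")
print(inner)
```

Only the ids found in both frames (2 and 3) survive, because the default join is inner.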

Types of Merge:

  1. Natural Join: Keeps only the rows that match in both DataFrames.
                                  how='inner'
  2. Full Outer Join: Keeps all rows from both DataFrames.
                                 how='outer'
  3. Left Outer Join: Keeps all rows of the left DataFrame x and only those from the right DataFrame y that match.
                                how='left'
  4. Right Outer Join: Keeps all rows of the right DataFrame y and only those from the left DataFrame x that match.
                                how='right'
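
The four join types can be compared side by side with a small sketch. The frames x and y and their columns (key, x_val, y_val) are hypothetical, chosen so that each frame has one key the other lacks.

```python
import pandas as pd

# Hypothetical frames: 'a' exists only in x, 'd' exists only in y.
x = pd.DataFrame({"key": ["a", "b", "c"], "x_val": [1, 2, 3]})
y = pd.DataFrame({"key": ["b", "c", "d"], "y_val": [20, 30, 40]})

# Run the same merge once per join type and keep the results by name.
results = {
    how: pd.merge(x, y, on="key", how=how)
    for how in ["inner", "outer", "left", "right"]
}

for how, merged in results.items():
    # Show which keys each join type keeps.
    print(how, sorted(merged["key"]))
```

Rows that have no partner in the other frame are filled with NaN in the columns that come from that frame.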


Venn Diagrams for better understanding:

1. Natural Join: [Figure: Venn diagram]

2. Full-Outer Join: [Figure: Venn diagram]

3. Right-Outer Join: [Figure: Venn diagram]

4. Left-Outer Join: [Figure: Venn diagram]





Data Wrangling

Data Wrangling: Group By, Join, Combine, Pivot, Melt & Reshape.


Data Wrangling:
Data wrangling is the process of cleaning and reshaping raw data until it is fit to be fed into an analytical algorithm. It is also called data munging.

1. Melt: Reshaping data from wide to long in pandas can be done with melt().
  • Step 1: Create a DataFrame: The data shown below is in wide format.

  • Step 2: Use melt() as shown below. The resulting data will be in long format.
    • id_vars=['countires']: the id columns that are left unaltered, i.e. the countries in this case.
    • var_name='metrics': the former column names are collected into a column named metrics.
    • value_name='values': the cell values are collected into a column named values.
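
A minimal sketch of the melt step, assuming a small wide table. The sample values and the metric columns (gdp, population) are made up; the id column is spelled countires only to match the argument shown above.

```python
import pandas as pd

# Hypothetical wide data: one row per country, one column per metric.
wide_df = pd.DataFrame({
    "countires": ["A", "B"],   # spelled as in the text above
    "gdp": [10, 20],
    "population": [30, 40],
})

# Wide to long: keep 'countires' as the id column, move the metric
# column names into 'metrics' and their cell values into 'values'.
long_df = pd.melt(
    wide_df,
    id_vars=["countires"],
    var_name="metrics",
    value_name="values",
)
print(long_df)
```

Two countries times two metrics gives four rows in the long result.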
2. Pivot: The reverse of melt(), i.e. from long to wide.
  • Step 1: Create a DataFrame: The data shown below is in long format.
  • Step 2: Use pivot() as shown below. The resulting data will be in wide format.
    • index='countires': the column used as the index.
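
A minimal sketch of the pivot step. Only index='countires' comes from the text above; the columns='metrics' and values='values' arguments and the sample data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long data: one row per (country, metric) pair.
long_df = pd.DataFrame({
    "countires": ["A", "A", "B", "B"],   # spelled as in the text above
    "metrics": ["gdp", "population", "gdp", "population"],
    "values": [10, 30, 20, 40],
})

# Long to wide: one row per country, one column per metric.
wide_df = long_df.pivot(index="countires", columns="metrics", values="values")
print(wide_df)
```

pivot() expects each index/columns pair to appear only once; duplicated pairs raise an error, in which case pivot_table() with an aggregation is the usual alternative.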

3. Group By: Similar to group by in SQL.
  • Step 1: Create a DataFrame: 

  • Step 2: Use groupby() as shown below.
    first(): returns the first entry of each group formed.






  • Step 3: mean() can be used to find the mean of each group.

  • Step 4: unstack() can also be applied to the grouped result to reshape it from long to wide.
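
The group by steps can be sketched as follows. The DataFrame and its columns (team, year, points) are hypothetical and only stand in for the data used in the original post.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for illustration.
df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "year": [2018, 2019, 2018, 2019],
    "points": [10, 14, 7, 9],
})

grouped = df.groupby("team")

# first(): the first entry of each group.
first_rows = grouped.first()

# mean(): the mean of a column within each group.
team_mean = grouped["points"].mean()

# Group by two columns, then unstack() moves the inner index level
# ('year') into columns, giving a wide table of group means.
by_team_year = df.groupby(["team", "year"])["points"].mean().unstack()

print(first_rows)
print(team_mean)
print(by_team_year)
```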



Group By Example



Example:
  1. Getting mean score of a group using groupby().
  2. Getting sum of score of a group using groupby().
  3. Descriptive statistics of a group using groupby().
  4. Group the entire dataframe by Subject and Exam and then find the sum of score of students.
Answers:
Create DataFrame:

1. Getting mean score of a group using groupby().

2. Getting sum of score of a group using groupby().

3. Descriptive statistics of a group using groupby().

4. Group the entire dataframe by Subject and Exam and then find the sum of score of students.
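
The four answers above can be sketched in one short script. The student names, the Score column name and all values are made up for illustration; only the Subject and Exam grouping columns come from the task description.

```python
import pandas as pd

# Hypothetical student data (names and scores are invented).
df = pd.DataFrame({
    "Name": ["Asha"] * 4 + ["Ben"] * 4,
    "Subject": ["Maths", "Maths", "Science", "Science"] * 2,
    "Exam": ["Sem1", "Sem2"] * 4,
    "Score": [60, 70, 80, 90, 50, 40, 70, 60],
})

# 1. Mean score of each Subject group.
mean_by_subject = df.groupby("Subject")["Score"].mean()

# 2. Sum of scores of each Subject group.
sum_by_subject = df.groupby("Subject")["Score"].sum()

# 3. Descriptive statistics (count, mean, std, min, quartiles, max) per group.
stats_by_subject = df.groupby("Subject")["Score"].describe()

# 4. Group by Subject and Exam, then sum the scores.
sum_by_subject_exam = df.groupby(["Subject", "Exam"])["Score"].sum()

print(mean_by_subject)
print(sum_by_subject)
print(stats_by_subject)
print(sum_by_subject_exam)
```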

Saturday, October 19, 2019

Let's Get Started!




1. Running Jupyter:
You can run your programs in Jupyter. It can be accessed from notebooks.azure.com, where you can create libraries. The pandas and numpy packages come preinstalled there, so you do not have to install them separately.

2. Importing libraries using 'import' statement and alias:
To import the pandas and numpy libraries under their usual aliases, use the following statements:
  • import pandas as pd
  • import numpy as np

3. Print statement:
To print a message, use the print() function:
  • print('hello')

4. Loops using 'for' and 'while':
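
A minimal sketch of both loop types. The variable names and values are chosen only for illustration.

```python
# 'for' loop: visit each item of a sequence in turn.
numbers = [1, 2, 3]
for_total = 0
for n in numbers:
    for_total += n

# 'while' loop: repeat as long as the condition stays true.
count = 0
while count < 3:
    count += 1

print(for_total, count)
```

A for loop suits a known collection of items, while a while loop suits repetition that depends on a condition changing inside the loop body.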

6. Python 2 Vs Python 3:
There are differences between the Python versions. An example is given below:
  • print 'hello' - In Python 2 this statement prints hello, but in Python 3 it raises a SyntaxError, because print is a function there and must be called as print('hello').

7. Dictionary and Lists:
  • Dictionary: Just as a language dictionary holds word-meaning pairs, a Python dictionary holds key-value pairs. To declare a dictionary:
          >>> xdict = {'x': 1, 'y': 2}  # 'x' is a key and 1 is its value
          >>> xdict
  Output: {'x': 1, 'y': 2}

  • List: The list is Python's built-in general-purpose sequence and fills the role that arrays play in other languages. To declare a list:
          >>> xlist = [1, 2, 3, 4]
          >>> xlist
  Output: [1, 2, 3, 4]
          >>> for x in xlist:
          ...     print(x)
  Output: 1 2 3 4 (one value per line)
          >>> ylist = [1, [2, 3], (4, 5), False, 'No']
          >>> ylist
  Output: [1, [2, 3], (4, 5), False, 'No']



Python




   Python: Data Analytics!

 This blog focuses on how to analyze data using Python, working up towards Data Science. It covers Python concepts from basic to intermediate level, each explained with an example, along with some real-life examples. Larger examples will be added soon, as I am currently working on them, and more topics will follow, including some advanced key techniques for Data Science with Python.
Feel free to comment and share your opinions.
Comment your doubts, I will try to clear them.
Happy Coding!!



Pandas: Pandas is a package that loads data (for example, from a CSV file) into a DataFrame (discussed in this blog) and allows us to do various things:
  • Calculate statistics about the data, e.g. mean, median, min, max and so on.
  • Clean the data, e.g. removing missing values, removing duplicate values, filtering rows or columns by some condition and so on.
  • Visualize the data with the help of Matplotlib (discussed later), plotting it as pie charts, histograms, bubble charts and more.
  • Store the cleaned, transformed data back into CSV or another format.
NumPy: NumPy (Numerical Python) is a package for scientific computing with Python. It works with N-dimensional arrays and matrices and provides linear algebra, random number generation, Fourier transforms, etc.
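
As a rough sketch of the tasks listed above, the snippet below reads a small CSV, computes statistics, cleans the data, writes it back out as CSV text, and multiplies a small NumPy matrix. The CSV content and column names are invented, and the CSV is read from an in-memory string so that no file is needed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV with one missing score and one duplicated row.
csv_text = "name,score\nAnn,80\nBob,\nAnn,80\nCy,70\n"
df = pd.read_csv(io.StringIO(csv_text))

# Statistics: count, mean, min, max, quartiles of the score column.
stats = df["score"].describe()

# Cleaning: drop rows with missing values, then drop duplicate rows.
clean = df.dropna().drop_duplicates()

# Store the cleaned data back as CSV text.
csv_out = clean.to_csv(index=False)

# NumPy: a 2-D array and a matrix product (linear algebra).
arr = np.array([[1, 2], [3, 4]])
product = arr @ arr

print(stats)
print(clean)
print(product)
```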




My Kaggle:
I have analyzed a few datasets on Kaggle as well. Do check them out and follow me ;)
PS: I just started :P
https://www.kaggle.com/heebahsaleem



My Publications:
Improved Image Steganography Algorithm using Huffman Code:
https://www.ijcaonline.org/archives/volume147/number12/25702-2016911242
[Foundation of Computer Science(FS), NY, USA, Volume 147 - Number 12, Tariq H, Saleem H]


My LinkedIn Profile:
https://www.linkedin.com/in/heebah-saleem-202615102/



More on Python will be added to this page soon ;)
Until then, happy coding :)