Saturday, November 2, 2019

BeautifulSoup

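When we want to download a web page and extract data from it in code, instead of copying the data from the site by hand, we can use BeautifulSoup.
Before BeautifulSoup can parse anything, the page has to be downloaded. The code below does that with the requests library: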
                         
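    from requests import get

    # Download the page; get() returns a Response object.
    url = 'http://dataquestio.github.io/web-scraping-pages/simple.html'
    response = get(url)
    # Show the first 500 characters of the page's HTML.
    print(response.text[:500])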

where,
  • response=get(url) : the server's response to the request for url is stored in response
  • [:500] : slices the first 500 characters of the page text (characters, not words).

1.



2. The response is an object. It has a status_code property that tells us whether the page was downloaded successfully.


3. A status_code of 200 means the page was downloaded successfully. A code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.
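
The status code rule can be written as a small check. This is only a sketch: describe_status is a hypothetical helper name, not part of requests, and in a real script the argument would be response.status_code.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code by the range it falls in."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 600:
        return "error"
    return "other"  # for example 1xx informational or 3xx redirect codes


# With a real download this would be describe_status(response.status_code).
for code in (200, 404, 500):
    print(code, describe_status(code))
```

requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx codes, so a hand-written check like this is mostly useful for logging or custom handling.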


4. To print the content of the page, use the response's content property.



5. Once we know what the downloaded page contains, we can parse it with BeautifulSoup and look at the different parts of the document.
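
A minimal sketch of the parsing step. The page_content bytes below are an assumed stand-in for response.content so that the example runs without a network connection; the HTML is illustrative, not a copy of a downloaded page.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content (bytes). This HTML is an assumption
# made for the example.
page_content = b"""<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

# "html.parser" is the parser that ships with Python's standard library.
soup = BeautifulSoup(page_content, "html.parser")

# prettify() returns the parsed document as an indented string.
print(soup.prettify())

# Tags can be reached by name; .string gives the text of a simple tag.
print(soup.title.string)
```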



6. We can select all the elements at the top level of the page by using the children property of soup. children returns an iterator rather than a list, so we call the list function on it.
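
A short sketch of walking the tree with children, again using an assumed inline HTML string in place of a downloaded page. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed example HTML, standing in for a downloaded page.
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""
soup = BeautifulSoup(html, "html.parser")

# children is an iterator, so wrap it in list() to index or count it.
top_level = list(soup.children)
print([type(item).__name__ for item in top_level])

# Keep only real tags, skipping the doctype and whitespace strings.
top_tags = [item for item in top_level if isinstance(item, Tag)]
html_tag = top_tags[0]

# The same property works one level down, on the html tag itself.
child_names = [c.name for c in html_tag.children if isinstance(c, Tag)]
print(child_names)
```

Filtering on Tag avoids relying on the exact position of each item, since whitespace between tags also shows up as a child.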


7. We can use the get_text method to extract all the text inside a tag.
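
For instance, assuming a small inline HTML string (not taken from a real page), get_text collects the text of a tag together with the text of any tags nested inside it:

```python
from bs4 import BeautifulSoup

# Assumed example HTML with a nested <b> tag inside the paragraph.
html = "<html><body><p>Here is some <b>simple</b> content.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup.p is the first p tag; get_text() joins all text nested inside it.
paragraph_text = soup.p.get_text()
print(paragraph_text)
```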


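8. If we want to go straight to a particular kind of tag, we can use the find_all method, which returns a list of every instance of that tag on the page. To get only the first instance, we can use the find method instead.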
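
A brief sketch of the difference between find_all and find, using an assumed two-paragraph HTML string:

```python
from bs4 import BeautifulSoup

# Assumed example HTML with two paragraphs.
html = """<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result containing every matching tag.
all_paragraphs = soup.find_all("p")
print([p.get_text() for p in all_paragraphs])

# find returns only the first match (or None if nothing matches).
first_paragraph = soup.find("p")
print(first_paragraph.get_text())
```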


9. We can also search for tags by class and id, by passing the class_ and id arguments to find_all.
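
A sketch of searching by class and id. The HTML and its class and id values are assumptions chosen for the example; class_ has a trailing underscore because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Assumed example HTML; the class and id values are illustrative.
html = """<html><body>
<div>
<p class="inner-text first-item" id="first">First inner paragraph.</p>
<p class="inner-text">Second inner paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# All p tags whose class list contains "outer-text".
outer_paragraphs = soup.find_all("p", class_="outer-text")
print([p.get_text() for p in outer_paragraphs])

# Any tag with the given class, regardless of tag name.
first_items = soup.find_all(class_="first-item")
print(len(first_items))

# Any tag with the given id.
by_id = soup.find_all(id="first")
print(by_id[0].get_text())
```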



10. We can also search for items using CSS selectors, for example to find all tags of a given kind that are inside a div. Some common selector patterns:


  • p a : finds all a tags inside of a p tag.
  • body p a : finds all a tags inside of a p tag inside of a body tag.
  • html body : finds all body tags inside of an html tag.
  • p.outer-text : finds all p tags with a class of outer-text.
  • p#first : finds all p tags with an id of first.
  • body p.outer-text : finds any p tags with a class of outer-text inside of a body tag.
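
Selectors like the ones listed above are passed to the select method of a BeautifulSoup object, which returns a list of matching tags. A minimal sketch, assuming an illustrative HTML string whose class and id values are made up for the example:

```python
from bs4 import BeautifulSoup

# Assumed example HTML; the class and id values are illustrative.
html = """<html><body>
<div>
<p class="inner-text" id="first">First inner paragraph.</p>
<p class="inner-text">Second inner paragraph.</p>
</div>
<p class="outer-text" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# p tags anywhere inside a div.
in_div = soup.select("div p")
print([p.get_text() for p in in_div])

# p tags with the class outer-text.
outer = soup.select("p.outer-text")
print(len(outer))

# The p tag with the id first.
first = soup.select("p#first")
print(first[0].get_text())

# p tags with class outer-text inside of a body tag.
outer_in_body = soup.select("body p.outer-text")
print(len(outer_in_body))
```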

Below is an example that uses BeautifulSoup:
Extracting and Scraping Weather Data





