BeautifulSoup
When we want to download a web page and extract data from it programmatically, instead of copying the data from the site by hand, we use BeautifulSoup.
BeautifulSoup parses a page that has already been downloaded, so we first fetch the page with the requests library:
from requests import get
url = 'http://dataquestio.github.io/web-scraping-pages/simple.html'
response = get(url)
print(response.text[:500])
where,
- response = get(url) : sends a GET request to url and stores the server's response in response.
- [:500] : slices the response text so that only its first 500 characters are printed.
2. The response is an object. It has a status_code property that tells us whether the page was downloaded successfully.
3. A status_code of 200 means the page was downloaded successfully. A code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.
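The status code rule above can be sketched as a small helper. The name classify_status is hypothetical and not part of requests; it only inspects the integer that response.status_code would hold, so it runs without a network connection.

```python
def classify_status(status_code: int) -> str:
    """Classify an HTTP status code by its leading digit (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 600:
        return "error"
    return "other"


# With a real request this would be classify_status(response.status_code).
for code in (200, 404, 500):
    print(code, classify_status(code))
```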
5. Once we have seen what is inside the page, we parse it with BeautifulSoup and look at the different parts of the document.
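A minimal sketch of this parsing step, assuming the bs4 package is installed. The HTML string is an illustrative stand-in for response.text, not the actual page content, so the example runs offline.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text (assumed content, for illustration only).
html = """
<html>
  <head><title>A simple example page</title></head>
  <body><p>Here is some simple content for this page.</p></body>
</html>
"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string.
print(soup.prettify())

# Tags can be reached by name as attributes of the soup object.
print(soup.title.get_text())
```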
6. We can select all the top-level elements by using the children property of soup. children returns an iterator rather than a list, so we call the list function on it.
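A short sketch of walking the tree with children, again on an assumed HTML string rather than the downloaded page. The string is written without whitespace between tags, because whitespace would otherwise show up as extra text nodes among the children.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption, not the page from the URL above).
html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# children is an iterator, so wrap it in list() before indexing into it.
top_level = list(soup.children)
html_tag = top_level[0]

# The same property works on any tag, one level at a time.
child_names = [child.name for child in html_tag.children]
print(child_names)
```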
7. We can use the get_text method to extract all the text inside a tag.
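For instance, get_text also collects the text of nested tags. The HTML below is an assumed example.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with a nested <b> tag inside the paragraph.
html = "<html><body><p>Here is <b>some</b> content.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag; get_text joins all text inside it.
p = soup.find("p")
text = p.get_text()
print(text)
```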
8. If we want to extract every instance of a tag at once, we can use the find_all method, which returns a list of matches. The find method returns only the first match.
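A small sketch comparing find_all and find on an assumed page with two paragraphs.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with two <p> tags.
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match as a list, so we can loop over or index it.
paragraphs = soup.find_all("p")
texts = [p.get_text() for p in paragraphs]
print(texts)

# find returns only the first match.
first = soup.find("p")
print(first.get_text())
```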
9. We can also search for any tag that has a given class or id by passing the class_ and id arguments to find_all.
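A sketch of searching by class and id. The class and id values here are made up for illustration. Note the trailing underscore in class_, which is needed because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the class and id values are assumptions.
html = """
<html><body>
  <p class="inner-text" id="first">First paragraph.</p>
  <p class="outer-text">Second paragraph.</p>
  <p class="outer-text" id="third">Third paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# All <p> tags with the class outer-text.
outer = soup.find_all("p", class_="outer-text")
outer_texts = [p.get_text() for p in outer]
print(outer_texts)

# Any tag whose id is first.
by_id = soup.find_all(id="first")
print(by_id[0].get_text())
```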
10. We can also search for items using CSS selectors, through the select method. For example, the selector div p finds all p tags that are inside a div. Other examples:
- p a : finds all a tags inside of a p tag.
- body p a : finds all a tags inside of a p tag inside of a body tag.
- html body : finds all body tags inside of an html tag.
- p.outer-text : finds all p tags with a class of outer-text.
- p#first : finds all p tags with an id of first.
- body p.outer-text : finds any p tags with a class of outer-text inside of a body tag.
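A few of these selectors can be tried with the select method, which returns a list of matching tags. The HTML is an assumed example that reuses the outer-text class and the first id from the list above.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption, built to match the selectors above).
html = """
<html><body>
  <div>
    <p class="inner-text" id="first">Inside the div. <a href="#">A link</a></p>
  </div>
  <p class="outer-text">Outside the div.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

in_div = soup.select("div p")        # p tags inside a div
outer = soup.select("p.outer-text")  # p tags with class outer-text
first = soup.select("p#first")       # p tags with id first
links = soup.select("body p a")      # a tags inside a p inside body

print([p.get_text() for p in outer])
print([a.get_text() for a in links])
```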