Web scraping 101

In this project we will explore the DOM and interact with it. We will also learn and reinforce HTML and some Python best practices, such as requirements.txt and pip.

Pre Requisites

Go ahead and read these:

HTML
We will try to understand the DOM

Python Extras

pip

pip is the Python package manager. From the web:

is the standard package manager for Python. It allows you to install and manage additional packages that are not part of the Python standard library

requirements.txt

“Requirements files” are files containing a list of items to be installed using pip install like so:

pip install -r requirements.txt

Requirements are meant for (but not limited to):

Requirements files are used to hold the result from pip freeze for the purpose of achieving repeatable installations
```
pip freeze > requirements.txt
pip install -r requirements.txt
```
Requirements files are used to force pip to properly resolve dependencies. Versions of pip before 20.3 did not have true dependency resolution and simply used the first specification found for a project. E.g. if pkg1 requires pkg3>=1.0 and pkg2 requires pkg3>=1.0,<=2.0, and pkg1 is resolved first, such a pip only uses pkg3>=1.0 and can easily end up installing a version of pkg3 that conflicts with the needs of pkg2. To solve this problem, place pkg3>=1.0,<=2.0 (i.e. the correct specification) directly in your requirements file along with the other top-level requirements:
```
pkg1
pkg2
pkg3>=1.0,<=2.0
```
Requirements files are used to force pip to install an alternate version of a sub-dependency. For example, suppose ProjectA in your requirements file requires ProjectB, but the latest version (v1.3) has a bug. You can force pip to accept earlier versions like so:
```
ProjectA
ProjectB<1.3
```

There are other ways of achieving the same result, but we will leave those for later.

Web Scraping with bs4 and requests

We will be using requests to GET the HTML and bs4 to parse it.

requests:

will be used to make HTTP requests (GET in this project) and retrieve the HTML content of a web page

bs4:

is a Python library for pulling data out of HTML and XML files
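
As a minimal sketch of how the two libraries fit together: requests downloads the page and bs4 parses it. The helper names `fetch_html` and `summarize` are made up for illustration and are not part of the project skeleton.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def summarize(html: str) -> dict:
    """Parse HTML and collect the title, every href, and the <a> count."""
    soup = BeautifulSoup(html, "html.parser")
    has_title = soup.title is not None and soup.title.string is not None
    title = soup.title.string.strip() if has_title else None
    # find_all(href=True) matches any tag that carries an href attribute.
    hrefs = [tag["href"] for tag in soup.find_all(href=True)]
    return {"title": title, "hrefs": hrefs, "a_count": len(soup.find_all("a"))}


if __name__ == "__main__":
    # Inline sample so the sketch runs without network access.
    sample = (
        "<html><head><title> Demo </title></head>"
        '<body><a href="/x">X</a><a>no link</a></body></html>'
    )
    print(summarize(sample))
```

For the real pages you would pass `fetch_html("http://ufm.edu/Portal")` to `summarize` instead of the inline sample.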

Your project

For every item here you must display the results in a very understandable way:

<YOUR NAME GOES HERE>
=============================
1. Portal
# item_title: <result>
GET the title and print it: <result>
---------------------------------------
GET the Complete Address of UFM: <result>
------------------------------------------
.
.
.
find all properties that have href (link to somewhere):
- <result 1>
- <result 2>
- <result 3>
=============================
2. Estudios

# ----- : separator between items

# ===== : separator between parts

# 1. Title: Title of the section

# use '-' if it's a list

It must be possible to pass an argument to your app to specify which part to run. If no argument is provided, it will default to "run all parts".

# default to run all parts
python3 soup.py

# run part 1
python3 soup.py 1

# run part 2
python3 soup.py 2

# run part 3
python3 soup.py 3
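
One possible way to wire this up is a small dispatch table keyed by the argument. This is only a sketch: the `part_N` functions and `select_parts` are hypothetical placeholders, and the real parts would do the scraping.

```python
import sys


def part_1() -> str:
    return "1. Portal"  # placeholder; the real part would scrape /Portal


def part_2() -> str:
    return "2. Estudios"  # placeholder


def part_3() -> str:
    return "3. CS"  # placeholder


def part_4() -> str:
    return "4. Directorio"  # placeholder


# Maps the command-line argument to the function that runs that part.
PARTS = {"1": part_1, "2": part_2, "3": part_3, "4": part_4}


def select_parts(argv: list) -> list:
    """Return the functions to run; no argument means every part."""
    if not argv:
        return list(PARTS.values())
    if argv[0] not in PARTS:
        raise SystemExit(f"Unknown part {argv[0]!r}; choose one of: {', '.join(PARTS)}")
    return [PARTS[argv[0]]]


if __name__ == "__main__":
    for part in select_parts(sys.argv[1:]):
        print("=" * 29)  # separator between parts
        print(part())
```

Keeping the selection logic in a function that takes the argument list makes it easy to check without launching the script.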

NOTE: If a result exceeds 30 lines, display "Output exceeds 30 lines, sending output to: <logfile>" and write the output to a text file inside logs/. Example format:

$ python3 soup.py 1
=============================
1. Portal
GET the title and print it: Output exceeds 30 lines, sending output to: logs/1portal_GET_the_title_and_print_it.txt


$ ls logs/1portal_GET_the_title_and_print_it.txt

$ cat logs/1portal_GET_the_title_and_print_it.txt

Date of generation: Mon Sep  9 22:58:30 CST 2019
================================================

Universidad Francisco Marroquín

These log files will not be tracked by git.
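
The 30-line rule can be sketched as a helper that either returns the line to print or writes the result to a log file. The names `show` and `log_name` are hypothetical, and the file-name scheme is inferred from the single example above, so treat it as an assumption.

```python
import time
from pathlib import Path

MAX_LINES = 30


def log_name(part: str, item: str) -> str:
    """Build a name like '1portal_GET_the_title_and_print_it.txt' (assumed scheme)."""
    return f"{part.replace('. ', '').lower()}_{item.replace(' ', '_')}.txt"


def show(part: str, item: str, result: str, log_dir: Path = Path("logs")) -> str:
    """Return the line to print; results over MAX_LINES go to a log file."""
    if len(result.splitlines()) <= MAX_LINES:
        return f"{item}: {result}"
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / log_name(part, item)
    header = f"Date of generation: {time.strftime('%a %b %d %H:%M:%S %Z %Y')}"
    # Header, underline of the same width, blank line, then the full result.
    path.write_text(f"{header}\n{'=' * len(header)}\n\n{result}\n", encoding="utf-8")
    return f"{item}: Output exceeds {MAX_LINES} lines, sending output to: {path.as_posix()}"


if __name__ == "__main__":
    print(show("1. Portal", "GET the title and print it", "Universidad Francisco Marroquín"))
```

Adding `logs/` to `.gitignore` is one way to keep the generated files out of version control.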

1. Portal

using "http://ufm.edu/Portal"

GET the title and print it
GET the Complete Address of UFM
GET the phone number and info email
GET all items that are part of the upper nav menu (id: menu-table)
find all properties that have href (link to somewhere)
GET href of "UFMail" button
GET href of "MiU" button
GET the links of all <img> (note that images carry their link in the src attribute, not href)
count all <a>

1.1 Extra points

From all (<a>) Create a csv file (logs/extra_as.csv) with the following columns: Text, href

example:

<ul><li><a target="_blank" rel="nofollow noreferrer noopener" class="external text" href="https://www.ufm.edu/english/">UFM Key Projects</a></li>

Text	href
UFM Key Projects	https://www.ufm.edu/english/
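
A minimal sketch of the CSV step, using the standard csv module. It assumes you have already collected `(text, href)` pairs from the `<a>` tags; `write_links_csv` is a hypothetical helper name, and the comma delimiter is an assumption (the example above only shows the two columns).

```python
import csv
import io


def write_links_csv(links, stream) -> None:
    """Write (text, href) pairs as CSV rows under a 'Text,href' header."""
    writer = csv.writer(stream)
    writer.writerow(["Text", "href"])
    for text, href in links:
        writer.writerow([text.strip(), href])


if __name__ == "__main__":
    # In-memory buffer so the sketch runs without touching the file system.
    buffer = io.StringIO()
    write_links_csv([("UFM Key Projects", "https://www.ufm.edu/english/")], buffer)
    print(buffer.getvalue())
```

To produce the actual file, open it with `open("logs/extra_as.csv", "w", newline="", encoding="utf-8")` and pass the file object as `stream`; `newline=""` is what the csv module documentation recommends for files.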

2. Estudios

using "http://ufm.edu/Estudios"

now navigate to /Estudios (better if you obtain href from the DOM)
display all items from "topmenu" (8 in total)
display ALL "Estudios" (Doctorados/Maestrias/Posgrados/Licenciaturas/Baccalaureus)
display from "leftbar" all <li> items (4 in total)
GET and display all available social media with their links (href) "class=social pull-right"
count all <a> (just display the count)

3. CS

using "https://fce.ufm.edu/carrera/cs/"

GET title
GET and display the href
Download the "FACULTAD de CIENCIAS ECONOMICAS" logo. (you need to obtain the link dynamically)
GET the following <meta>: "title", "description" ("og")
count all <a> (just display the count)
count all <div> (just display the count)
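
For the `<meta>` item, a sketch along these lines may help. It assumes the page follows the Open Graph convention of `<meta property="og:title" content="...">`, which you should confirm in the page source; `og_meta` is a hypothetical helper and the sample HTML is invented.

```python
from bs4 import BeautifulSoup


def og_meta(html: str, names=("title", "description")) -> dict:
    """Return the content of each <meta property="og:NAME"> tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for name in names:
        tag = soup.find("meta", attrs={"property": f"og:{name}"})
        found[name] = tag.get("content") if tag else None
    return found


if __name__ == "__main__":
    # Invented sample markup, not copied from the real page.
    sample = (
        '<head><meta property="og:title" content="Sample title">'
        '<meta property="og:description" content="Sample description"></head>'
    )
    print(og_meta(sample))
```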

4. Directorio

using "https://www.ufm.edu/Directorio"

Sort all emails alphabetically (href="mailto:[email protected]") in a list, dump it to logs/4directorio_emails.txt
Count all emails that start with a vowel. (just display the count)
Group in a JSON all rows that share the same address (ignore the room number) using the address as key, and dump it to logs/4directorio_address.json

{
    "Edificio Academico":[
        "Arquitectura",
        "Ciencias Economicas",
        .
        .
        .
        "Crédito Educativo"
    ],
    "Centro Estudiantil":[
        "Admisiones",
        .
        .
        .
         "Desarrollo"
    ],
    .
    .
    .
}
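
The three items above can be sketched with the standard library once the hrefs and table rows have been scraped. Everything here is illustrative: the helper names are hypothetical, the sample emails and room numbers are invented, and the "Building, Room" address layout is an assumption you must check against the real table.

```python
import json
from collections import defaultdict

VOWELS = set("aeiou")


def sorted_emails(hrefs) -> list:
    """Keep only mailto: links, strip the prefix, and sort case-insensitively."""
    emails = [h[len("mailto:"):].strip() for h in hrefs if h.lower().startswith("mailto:")]
    return sorted(emails, key=str.lower)


def count_vowel_emails(emails) -> int:
    """Count emails whose first character is a vowel."""
    return sum(1 for email in emails if email[:1].lower() in VOWELS)


def group_by_address(rows) -> dict:
    """Group entity names by building, dropping the room part of the address."""
    groups = defaultdict(list)
    for entity, address in rows:
        building = address.split(",")[0].strip()  # ASSUMED "Building, Room" layout
        groups[building].append(entity)
    return dict(groups)


if __name__ == "__main__":
    # Invented sample data; the real values come from the Directorio page.
    rows = [
        ("Arquitectura", "Edificio Academico, A-101"),
        ("Admisiones", "Centro Estudiantil, B-2"),
        ("Ciencias Economicas", "Edificio Academico, A-205"),
    ]
    # ensure_ascii=False keeps accented characters readable in the dump.
    print(json.dumps(group_by_address(rows), indent=4, ensure_ascii=False))
```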

Try to correlate each Faculty with its Dean or Director in a JSON, and dump it to logs/4directorio_deans.json

{
    "Facultad de Arquitectura": {
            "Dean/Director": "Roberto Quevedo",
            "email": "[email protected]",
            "Phone Number": "2338-7709"
        },
    "Facultad de Ciencias Económicas": {
        "Dean/Director": "Mónica Rio Nevado de Zelaya",
        "email": "[email protected]",
        "Phone Number": "2338-7723 2338-7724"
    }
    .
    .
    .
}

GET the directory entries from all 3-column tables and generate a CSV with these columns (Entity, FullName, Email), then dump it to logs/4directorio_3column_tables.csv

Entity	FullName	Email
Rector	Gabriel Calzada Álvarez	[email protected]
Campus Madrid	Gonzalo Melián	[email protected]
Alumni	Marcela Porta	[email protected]

5. Extra

Complete Dockerfile
Create README section for Dockerfile under Usage Dockerfile
Add CI to your own repo.

Start your project

In order to start your project:

You MUST fork this repository into your own personal repo on GitHub.
You will need to use git and commit every once in a while. Every commit must have a meaningful message.

to start using it:

# clone
git clone <your own personal repo URL>
# install dependencies
pip install -r requirements.txt
# run it
python soup.py
# or
./soup.py

Every time you complete an "item", make sure to mark it as done [x].

Usage Dockerfile

To use Docker, first build the image:

docker build --rm -f "Dockerfile" -t soup:prod .

And then:

docker run -it soup:prod

Delivery

FORK IT!!
This will be developed individually
You will send the response via miU
You will respond only with the URL of your git repo (preferably with git tags).
Your name (username) MUST have commits in the git log.
It must compile and work!
READ the whole README.md first

danisnowman / soup
