soup's Introduction

Web scraping 101

In this project we will learn how the DOM works and how to interact with it. Along the way we will reinforce HTML and some Python best practices such as pip and requirements.txt.



Prerequisites

Go ahead and read these:

  • HTML
  • We will try to understand the DOM

Python Extras

pip

pip is the Python package manager. From the web:

pip is the standard package manager for Python. It allows you to install and manage additional packages that are not part of the Python standard library.

requirements.txt

“Requirements files” are files containing a list of items to be installed using pip install like so:

pip install -r requirements.txt

Requirements files are meant for (but not limited to):

  1. Requirements files are used to hold the result from pip freeze for the purpose of achieving repeatable installations

    pip freeze > requirements.txt
    pip install -r requirements.txt
  2. Requirements files are used to force pip to properly resolve dependencies. Older versions of pip (before 20.3) did not have true dependency resolution; they simply used the first specification found for a project. E.g. if pkg1 requires pkg3>=1.0 and pkg2 requires pkg3>=1.0,<=2.0, and pkg1 is resolved first, such a pip will only use pkg3>=1.0 and could easily end up installing a version of pkg3 that conflicts with the needs of pkg2. To solve this problem, you can place pkg3>=1.0,<=2.0 (i.e. the correct specification) directly into your requirements file along with the other top-level requirements:

    pkg1
    pkg2
    pkg3>=1.0,<=2.0
  3. Requirements files are used to force pip to install an alternate version of a sub-dependency. For example, suppose ProjectA in your requirements file requires ProjectB, but the latest version (v1.3) has a bug, you can force pip to accept earlier versions like so:

    ProjectA
    ProjectB<1.3

There are other ways of achieving the same result, but we will leave those for later.
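To make item 2 concrete, here is a minimal sketch in plain Python of why the narrower specification matters. It is not how pip resolves anything internally; the helper names and the list of pkg3 releases are made up for illustration.

```python
def parse(version):
    """Turn a dotted version such as '2.5' into (2, 5) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pkg1(version):
    """pkg1 requires pkg3>=1.0."""
    return parse(version) >= (1, 0)


def satisfies_pkg2(version):
    """pkg2 requires pkg3>=1.0,<=2.0."""
    return (1, 0) <= parse(version) <= (2, 0)


# Hypothetical releases of pkg3, for illustration only.
candidates = ["0.9", "1.4", "2.0", "2.5"]

ok_for_pkg1 = [v for v in candidates if satisfies_pkg1(v)]
ok_for_both = [v for v in candidates if satisfies_pkg1(v) and satisfies_pkg2(v)]

print(ok_for_pkg1)  # still contains 2.5, which pkg2 cannot use
print(ok_for_both)  # only versions acceptable to both packages
```

Pinning pkg3>=1.0,<=2.0 in requirements.txt corresponds to choosing from the second list.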


Web Scraping with bs4 and requests

We will use requests to GET the HTML and bs4 to parse it.

requests will be used to make HTTP requests (GET by default) and retrieve the content of an HTML web page.

bs4 (Beautiful Soup) is a Python library for pulling data out of HTML and XML files.
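A minimal sketch of how the two libraries fit together, assuming both are installed (for example through requirements.txt). The function names and the sample HTML are placeholders for illustration; the sample lets the sketch run without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """GET a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def page_title(html):
    """Parse HTML with bs4 and return the <title> text, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Offline sample so the sketch runs without touching the network.
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
print(page_title(sample_html))
```

With network access, the same helper could be fed a live page: page_title(fetch_html("http://ufm.edu/Portal")).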


Your project

For every item here you must display the results in a very understandable way:

<YOUR NAME GOES HERE>
=============================
1. Portal
# item_title: <result>
GET the title and print it: <result>
---------------------------------------
GET the Complete Address of UFM: <result>
------------------------------------------
.
.
.
find all properties that have href (link to somewhere):
- <result 1>
- <result 2>
- <result 3>
=============================
2. Estudios
# ----- : separator between items

# ===== : separator between parts

# 1. Title: Title of the section

# use '-' if it's a list
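One possible way to produce this layout is a small formatting helper. This is only a sketch: the function name, the (label, result) input shape, and the fixed separator widths are assumptions, since the separators in the sample above vary in length.

```python
PART_SEPARATOR = "=" * 29   # ===== : separator between parts
ITEM_SEPARATOR = "-" * 39   # ----- : separator between items


def format_part(number, title, items):
    """Return the text for one part.

    `items` is a list of (label, result) pairs. A list result is rendered
    as one '- ' line per element; anything else goes on the label's line.
    """
    lines = [PART_SEPARATOR, f"{number}. {title}"]
    for index, (label, result) in enumerate(items):
        if index > 0:
            lines.append(ITEM_SEPARATOR)
        if isinstance(result, list):
            lines.append(f"{label}:")
            lines.extend(f"- {entry}" for entry in result)
        else:
            lines.append(f"{label}: {result}")
    return "\n".join(lines)


print(format_part(1, "Portal", [
    ("GET the title and print it", "<result>"),
    ("find all properties that have href (link to somewhere)",
     ["<result 1>", "<result 2>"]),
]))
```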

You can pass an argument to your app to specify which part to run. If no argument is provided, it defaults to "run all parts".

# default to run all parts
python3 soup.py

# run part 1
python3 soup.py 1

# run part 2
python3 soup.py 2

# run part 3
python3 soup.py 3
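The dispatch could be sketched with sys.argv as shown here. The part functions are hypothetical stand-ins that only return their title, and rejecting an unknown argument with an error message is an assumption (the assignment does not say what should happen in that case).

```python
import sys


def part_1():
    return "1. Portal"


def part_2():
    return "2. Estudios"


def part_3():
    return "3. CS"


def part_4():
    return "4. Directorio"


# Maps the command-line argument to the function that runs that part.
PARTS = {"1": part_1, "2": part_2, "3": part_3, "4": part_4}


def select_parts(argv):
    """Return the part functions to run for the given argument list."""
    if len(argv) < 2:
        return list(PARTS.values())  # no argument: run all parts
    choice = argv[1]
    if choice not in PARTS:
        raise SystemExit(f"Unknown part {choice!r}; expected one of: {', '.join(PARTS)}")
    return [PARTS[choice]]


if __name__ == "__main__":
    for part in select_parts(sys.argv):
        print(part())
```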

  • NOTE: If the result exceeds 30 lines, display "Output exceeds 30 lines, sending output to: <logfile>" and send the output to a text file inside logs/. Example format:
$ python3 soup.py 1
=============================
1. Portal
GET the title and print it: Output exceeds 30 lines, sending output to: logs/1portal_GET_the_title_and_print_it.txt


$ ls logs/1portal_GET_the_title_and_print_it.txt

$ cat logs/1portal_GET_the_title_and_print_it.txt

Date of generation: Mon Sep  9 22:58:30 CST 2019
================================================

Universidad Francisco Marroquín

These log files will not be tracked by git.
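A minimal sketch of the 30-line rule, assuming each result is available as a single string. The function names and the log_dir parameter are illustrative; the file name pattern and the "Date of generation" header imitate the example above.

```python
import time
from pathlib import Path

MAX_LINES = 30


def log_name(part_number, part_title, item_label):
    """Build a name such as '1portal_GET_the_title_and_print_it.txt'."""
    return f"{part_number}{part_title.lower()}_{item_label.replace(' ', '_')}.txt"


def emit(part_number, part_title, item_label, result, log_dir="logs"):
    """Return the line to display, writing over-long results to a log file."""
    if len(result.splitlines()) <= MAX_LINES:
        return f"{item_label}: {result}"
    log_path = Path(log_dir) / log_name(part_number, part_title, item_label)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    header = f"Date of generation: {time.strftime('%a %b %d %H:%M:%S %Z %Y')}"
    log_path.write_text(f"{header}\n{'=' * len(header)}\n\n{result}\n", encoding="utf-8")
    return (f"{item_label}: Output exceeds {MAX_LINES} lines, "
            f"sending output to: {log_path.as_posix()}")


# A short result is displayed directly and no file is written.
print(emit(1, "Portal", "GET the title and print it", "Universidad Francisco Marroquín"))
```

Listing logs/ in .gitignore is one common way to keep the generated files out of git.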


1. Portal

using "http://ufm.edu/Portal"

  • GET the title and print it
  • GET the Complete Address of UFM
  • GET the phone number and info email
  • GET all items that are part of the upper nav menu (id: menu-table)
  • find all properties that have href (link to somewhere)
  • GET href of "UFMail" button
  • GET href of "MiU" button
  • get hrefs of all <img>
  • count all <a>
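Several of these items map onto a handful of bs4 calls. The sketch below runs against made-up markup rather than the real page, so every tag, id usage, and URL in SAMPLE_HTML is a placeholder; only the id menu-table comes from the list above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><head><title>Sample portal</title></head>
<body>
  <table id="menu-table"><tr>
    <td><a href="/a">First</a></td>
    <td><a href="/b">Second</a></td>
  </tr></table>
  <a id="no-link">Anchor without href</a>
  <img src="/logo.png" alt="logo">
  <link rel="stylesheet" href="/style.css">
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of every link inside the element with id="menu-table".
menu_items = [a.get_text(strip=True) for a in soup.select("#menu-table a")]

# Every tag that carries an href attribute, whatever the tag name.
hrefs = [tag["href"] for tag in soup.find_all(href=True)]

# An <img> normally points at its file through src rather than href.
image_sources = [img["src"] for img in soup.find_all("img", src=True)]

# Number of <a> tags, with or without an href.
anchor_count = len(soup.find_all("a"))

print(menu_items, hrefs, image_sources, anchor_count)
```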

1.1 Extra points

  • From all <a> tags, create a CSV file (logs/extra_as.csv) with the following columns: Text, href

example:

<ul><li><a target="_blank" rel="nofollow noreferrer noopener" class="external text" href="https://www.ufm.edu/english/">UFM Key Projects</a></li>
Text             | href
UFM Key Projects | https://www.ufm.edu/english/
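Writing the file could be sketched with the standard csv module. The helper name is an assumption, and the rows are presumed to have been collected from the parsed <a> tags beforehand; an in-memory buffer is used here so the sketch does not touch the disk.

```python
import csv
import io


def write_links_csv(rows, stream):
    """Write (text, href) pairs to `stream` under a 'Text,href' header."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Text", "href"])
    writer.writerows(rows)


# In the real script these pairs would come from the parsed <a> tags.
rows = [("UFM Key Projects", "https://www.ufm.edu/english/")]

buffer = io.StringIO()
write_links_csv(rows, buffer)
print(buffer.getvalue())
```

To write the actual file, pass a handle opened with open("logs/extra_as.csv", "w", newline="", encoding="utf-8") instead of the buffer.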

2. Estudios

using "http://ufm.edu/Estudios"

  • now navigate to /Estudios (better if you obtain href from the DOM)
  • display all items from "topmenu" (8 in total)
  • display ALL "Estudios" (Doctorados/Maestrias/Posgrados/Licenciaturas/Baccalaureus)
  • display from "leftbar" all <li> items (4 in total)
  • get and display all available social media with its links (href) "class=social pull-right"
  • count all <a> (just display the count)
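Two techniques from this list can be sketched as follows: resolving an href taken from the DOM against the page URL, and selecting by the two classes social and pull-right. The markup is invented for illustration, and looking the link up by its visible text "Estudios" is an assumption about the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://ufm.edu/Portal"

# Hypothetical markup; the real page will differ.
SAMPLE_HTML = """
<a href="/Estudios">Estudios</a>
<div class="social pull-right">
  <a href="https://example.com/one">One</a>
  <a href="https://example.com/two">Two</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the link by its visible text, then resolve it against the base URL.
estudios_link = soup.find("a", string="Estudios")
estudios_url = urljoin(BASE_URL, estudios_link["href"])

# ".social.pull-right" matches elements that carry both classes.
social_links = {
    a.get_text(strip=True): a["href"]
    for a in soup.select(".social.pull-right a")
}

print(estudios_url)
print(social_links)
```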

3. CS

using "https://fce.ufm.edu/carrera/cs/"

  • GET title
  • GET and display the href
  • Download the "FACULTAD de CIENCIAS ECONOMICAS" logo. (you need to obtain the link dynamically)
  • GET following <meta>: "title", "description" ("og")
  • count all <a> (just display the count)
  • count all <div> (just display the count)
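A sketch of reading the "og" <meta> tags and locating the logo URL from the markup instead of hard-coding it. The sample HTML is invented, and finding the logo through its alt text is an assumption that may not hold on the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://fce.ufm.edu/carrera/cs/"

# Hypothetical markup; attribute values are placeholders.
SAMPLE_HTML = """
<html><head>
  <meta property="og:title" content="Sample title">
  <meta property="og:description" content="Sample description">
</head><body>
  <img class="logo" src="/img/logo.png" alt="FACULTAD de CIENCIAS ECONOMICAS">
  <div><div></div></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def og_content(soup, name):
    """Return the content of <meta property="og:NAME">, or None if missing."""
    tag = soup.find("meta", attrs={"property": f"og:{name}"})
    return tag["content"] if tag and tag.has_attr("content") else None


og = {name: og_content(soup, name) for name in ("title", "description")}

# Match the logo on its alt text (an assumption about the real page),
# then turn the relative src into an absolute URL.
logo = soup.find("img", alt=lambda value: value and "CIENCIAS ECONOMICAS" in value)
logo_url = urljoin(PAGE_URL, logo["src"])

div_count = len(soup.find_all("div"))

print(og, logo_url, div_count)
```

The image bytes could then be fetched with requests.get(logo_url).content and written to a file opened in binary mode.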

4. Directorio

using "https://www.ufm.edu/Directorio"

  • Sort all emails alphabetically (href="mailto:[email protected]") in a list, dump it to logs/4directorio_emails.txt
  • Count all emails that start with a vowel. (just display the count)
  • Group in a JSON all rows that share the same address (ignore the room number), using the address as the key, and dump it to logs/4directorio_address.json
{
    "Edificio Academico":[
        "Arquitectura",
        "Ciencias Economicas",
        .
        .
        .
        "Crédito Educativo"
    ],
    "Centro Estudiantil":[
        "Admisiones",
        .
        .
        .
        "Desarrollo"
    ],
    .
    .
    .
}
  • Try to correlate in a JSON Faculty Dean and Directors, and dump it to logs/4directorio_deans.json
{
    "Facultad de Arquitectura": {
        "Dean/Director": "Roberto Quevedo",
        "email": "[email protected]",
        "Phone Number": "2338-7709"
    },
    "Facultad de Ciencias Económicas": {
        "Dean/Director": "Mónica Rio Nevado de Zelaya",
        "email": "[email protected]",
        "Phone Number": "2338-7723 2338-7724"
    }
    .
    .
    .
}
  • GET the directory from all 3-column tables and generate a CSV with these columns (Entity, FullName, Email), and dump it to logs/4directorio_3column_tables.csv
Entity        | FullName                | Email
Rector        | Gabriel Calzada Álvarez | [email protected]
Campus Madrid | Gonzalo Melián          | [email protected]
Alumni        | Marcela Porta           | [email protected]
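The email and address items can be sketched with the standard library once the values have been pulled out of the page. The helper names, the example.com addresses, and the room codes are all made up; in particular, treating everything after the first comma of an address as the room is an assumption about the table format.

```python
import json
from collections import defaultdict


def sorted_emails(hrefs):
    """Extract addresses from 'mailto:' hrefs and sort them alphabetically."""
    emails = [h[len("mailto:"):].strip() for h in hrefs if h.startswith("mailto:")]
    return sorted(emails, key=str.lower)


def count_vowel_initial(emails):
    """Count the emails whose first character is a vowel."""
    return sum(1 for email in emails if email and email[0].lower() in "aeiou")


def group_by_address(rows):
    """Map each building to the entities located there.

    `rows` holds (entity, address) pairs. The room is assumed to follow
    the first comma of the address, which may not match the real table.
    """
    grouped = defaultdict(list)
    for entity, address in rows:
        building = address.split(",", 1)[0].strip()
        grouped[building].append(entity)
    return dict(grouped)


# Placeholder data, for illustration only.
hrefs = ["mailto:zeta@example.com", "https://example.com",
         "mailto:alfa@example.com", "mailto:omega@example.com"]
rows = [("Arquitectura", "Edificio Academico, A-1"),
        ("Admisiones", "Centro Estudiantil, B-2"),
        ("Ciencias Economicas", "Edificio Academico, A-3")]

emails = sorted_emails(hrefs)
print(emails, count_vowel_initial(emails))
# ensure_ascii=False keeps accented names readable in the dump.
print(json.dumps(group_by_address(rows), ensure_ascii=False, indent=4))
```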

5. Extra

  • Complete Dockerfile
  • Create README section for Dockerfile under Usage Dockerfile
  • Add CI to your own repo.

Start your project

In order to start your project:

  • you MUST fork this repository into your own personal repo in github

  • you will need to use git and commit regularly; every commit must have a meaningful message.

  • to start using it:

    # clone
    git clone <your own personal repo URL>
    # install dependencies
    pip install -r requirements.txt
    # run it
    python soup.py
    # or
    ./soup.py
  • every time you complete an "item", make sure to mark it as done [x]

Usage Dockerfile

In order to use Docker, run first:

docker build --rm -f "Dockerfile" -t soup:prod .

And then:

docker run -it soup:prod

Delivery

  • FORK IT!!
  • This will be developed individually
  • You will send the response via miU
  • You will respond only with the URL of your git repo (preferably using git tags).
  • your name (username) MUST have commits in the git log.
  • it must compile & work!
  • READ the whole README.md first

Contributors

danisnowman, jmarcos-cano
