Code Monkey home page Code Monkey logo

nick-sort-lab2's Introduction

lab: sorting runtimes

In this lab you will measure just how much faster the built-in sorted function is than the functions you implemented on the last homework assignment. You will also learn about how to use git submodules to connect multiple git repos together.

You will need a partner for portions of this lab. (And recall that you must do partner work either in class or the QCL.)

Tasks

Task 1: initialize the repo

Do not clone this repository. Instead:

  1. Create a new empty repo. Ensure that the name of your repo is different than the name of your partner's repo (so that you can fork it later).

  2. Copy the contents of this repo to your new repo with the following commands

    $ git clone https://github.com/mikeizbicki/lab-sorting-runtimes
    $ cd lab-sorting-runtimes
    $ git remote rm origin
    $ git remote add origin <URL>
    $ git push origin master
    

    where <URL> is the url to your newly created repo.

    NOTE: In the past I used $URL syntax to denote a "variable" that you need to substitute into a command. Both the $URL and <URL> style syntaxes are commonly used.

    NOTE: At this point you should ensure you understand what the git remote commands are doing fully. I will stop giving these commands to you explicitly in the future.

Task 2: setup the submodule

Run the runtimes.py file. You should get a ModuleNotFoundError that looks something like

$ python3 runtimes.py
Traceback (most recent call last):
  File "runtimes.py", line 9, in <module>
    from sorting.sorting import merge_sorted, quick_sorted
ModuleNotFoundError: No module named 'sorting'

NOTE: In the code block above, I put both the command and the entire output of the command. Before the code block, I put a short summary of my steps using English intermixed with inline-code. When reporting error messages, it is standard to have a short English language summary and a full output of the error message in a code block. (Even if the output is hundreds of lines long, people will still include the full output, since all of those lines provide important information.) If you report your errors in this same way (e.g. on github issues/stackoverflow), you will be more likely to get good help from people.

The problem is that the runtimes.py file is trying to load the functions that you wrote in the sorting homework, but it can't find those files. The most obvious way to get access to those files is to clone your sorting homework into the repository. BUT DON'T DO THIS! It is never a good idea to clone one git repository inside another repository directly. Instead, we will use something called a git submodule which will allow the two git repositories to "talk" to each other when needed.

You can initialize a git submodule by running the command:

$ git submodule add <URL>

where <URL> is the url to your sorting homework.

If this command works successfully, then the command

$ ls -a

should have a new sorting directory in its output and a file .gitmodules. If you run

$ ls sorting

then you should see all of the contents of your sorting repo. Running the command

$ python3 runtimes.py

should now generate no errors. If you run

$ git status

you should see that the sorting repo and the .gitmodules file have has been added into the staging area. You should commit theses changes and push them to github now.

If you visit the github website for your lab repo, you should see that the sorting folder is now visible in the repo, and it will have a symbol next to it that looks like @ f43eacf where the hash number after the @ is the hash of the commit of the sorting repo and will be different. Clicking on the folder should take you to the url for your sorting repo and away from this repo. There will be no indication on your sorting repo that it is a git submodule in another repo.

Task 3: runtimes of sorting algorithms on random lists

Run the command

$ python3 runtimes.py --max_x=5

This will use the timeit module on 5 different lists of increasing size to measure the runtimes of the built-in sort function (which uses timsort) versus the sorting functions that you implemented in your last homework. The output of this command, however, is currently not nicely formatted. At the end of the file is a line labeled FIXME 1. Follow the instructions so that the runtimes.py file generates output in markdown table format.

NOTE: It is very common to write python code that generates code in another programming language, especially markdown or html.

After you've completed the FIXME, run the following command

$ python3 runtimes.py --max_x=22

and copy/paste the resulting table into this README file below this line.

timsort merge_sorted quick_sorted
n**0.00e+00 4.46e-06 3.20e-06 2.42e-06
n**1.00e+00 2.54e-06 1.18e-05 1.13e-05
n**2.00e+00 2.46e-06 1.88e-05 1.65e-05
n**3.00e+00 3.21e-06 4.04e-05 3.51e-05
n**4.00e+00 4.39e-06 8.23e-05 7.72e-05
n**5.00e+00 7.29e-06 1.83e-04 1.86e-04
n**6.00e+00 1.59e-05 4.16e-04 4.15e-04
n**7.00e+00 2.92e-05 9.15e-04 9.90e-04
n**8.00e+00 6.08e-05 1.88e-03 2.40e-03
n**9.00e+00 1.39e-04 4.26e-03 5.12e-03
n**1.00e+01 2.92e-04 9.23e-03 1.19e-02
n**1.10e+01 6.34e-04 2.02e-02 2.46e-02
n**1.20e+01 1.40e-03 4.37e-02 5.23e-02
n**1.30e+01 3.19e-03 9.82e-02 1.22e-01
n**1.40e+01 6.75e-03 2.00e-01 2.49e-01
n**1.50e+01 1.55e-02 4.41e-01 5.21e-01
n**1.60e+01 2.50e-02 8.56e-01 1.14e+00
n**1.70e+01 7.94e-02 2.02e+00 2.43e+00
n**1.80e+01 1.87e-01 4.14e+00 5.37e+00
n**1.90e+01 4.26e-01 8.96e+00 1.15e+01
n**2.00e+01 9.91e-01 1.90e+01 2.46e+01
n**2.10e+01 2.00e+00 4.01e+01 5.45e+01
n**2.20e+01 4.96e+00 8.43e+01 1.22e+02

You should observe that python's built-in sort function is 10-100x faster than yours. All functions have the same wort-case asymptotic complexity (i.e. $\Theta(n \log n)$), but python's built-in sorting function uses lots of optimization tricks to achieve this extra speedup. Native python code is not very good at performing these types of optimizations, and so the built-in function is written in the C programming language. (You can find the source code of the built-in function on github). Functions that must be very fast are generally written in C instead of Python. One of the differences between a computer sciencee major and a data science major is that the CS major focuses on how to write these fast functions, and the DS major focuses on how to use these fast functions to accomplish interesting tasks.

Push your changes to github and verify that the table is being displayed correctly.

Task 4: cloning

Fork your partner's repo and clone your partner's repo to the lambda server. In this forked repo, re-run the command

$ python3 runtimes.py --max_x=22

You should get a ModuleNotFoundError again.

The problem is that when you clone a repo with submodules, the submodules are by default not cloned as well.

ASIDE: Facebook announced in 2014 that the size of the git repo for facebook.com was 54GB, and it is undoubtably much larger today. This 54GB only includes the code directly responsible for facebook.com, and not any of the libraries that facebook.com depends on. Facebook's releases many public libraries on github at https://github.com/facebook, and most of these are going to be included in the main facebook.com repo as submodules. Some popular examples include the react web framework and pytorch machine learning library. Since these are their own separate git repos (and therefore submodules of facebook.com), developers can work on these libraries without needing access to the main facebook.com repo. Or if they are working at Facebook and do have access, they don't need to waste their time downloading all of the 54GB of code for facebook.com just to work on react or pytorch.

The main repo for facebook.com is not public, and not even hosted with github. Recall that one of the advantages of git is that it is a distributed version control system not tied to any hosting provider. Facebook only uses github for its public repos, and uses an internal git service for storing its proprietary repos like the one for facebook.com.

Now run the command

$ ls

and notice that the sorting folder does exist in your forked repo. The problem is that it is empty, and you can verify that with the command

$ ls sorting

We need to populate or hydrate this folder, which is currently unpopulated or dehydrated. You can do that with the following sequence of commands:

$ git submodule init
$ git submodule update

Now, you should be able to run the original command successfully

$ python3 runtimes.py --max_x=22

Task 5: runtime of sorting an already sorted list

Remain in the forked repo for this task.

We will now measure the runtime of the sort functions on data that has already been sorted. Do this by adding the --input=sorted argument to get the following

$ python3 runtimes.py --max_x=23 --input=sorted
Traceback (most recent call last):
  File "runtimes.py", line 38, in <module>
    xs = FIXME
NameError: name 'FIXME' is not defined

You'll notice that you get a NameError because the word FIXME is undefined.

NOTE: It is common practice in python to put the word FIXME as a variable in a code path that you haven't implemented yet. If that code path accidentally gets run, then python will raise the NameError to let you know that you are running code that you shouldn't be running. From python's perspective, there's nothing special about the word FIXME: it's just a variable name that has not been defined.

Follow the instructions in the comments to provide a proper definition of xs, then rerun the command above to generate a markdown table of runtimes. Copy/paste the table into the README file below this line.

timsort merge_sorted quick_sorted
n**0.00e+00 4.59e-06 2.93e-06 3.39e-06
n**1.00e+00 2.61e-06 1.14e-05 1.10e-05
n**2.00e+00 2.53e-06 1.68e-05 1.34e-05
n**3.00e+00 2.60e-06 3.31e-05 3.00e-05
n**4.00e+00 2.68e-06 5.80e-05 5.26e-05
n**5.00e+00 3.53e-06 1.26e-04 1.40e-04
n**6.00e+00 4.19e-06 2.77e-04 3.11e-04
n**7.00e+00 5.71e-06 6.25e-04 8.10e-04
n**8.00e+00 1.04e-05 1.33e-03 1.66e-03
n**9.00e+00 1.98e-05 2.95e-03 3.53e-03
n**1.00e+01 3.56e-05 6.26e-03 8.01e-03
n**1.10e+01 8.50e-05 1.34e-02 1.75e-02
n**1.20e+01 1.72e-04 2.78e-02 3.78e-02
n**1.30e+01 3.37e-04 5.93e-02 8.54e-02
n**1.40e+01 5.90e-04 1.24e-01 1.76e-01
n**1.50e+01 1.21e-03 2.61e-01 3.76e-01
n**1.60e+01 2.73e-03 5.39e-01 8.30e-01
n**1.70e+01 5.63e-03 1.17e+00 1.63e+00
n**1.80e+01 1.09e-02 2.21e+00 3.50e+00
n**1.90e+01 2.19e-02 4.66e+00 7.10e+00
n**2.00e+01 4.68e-02 1.04e+01 1.50e+01
n**2.10e+01 9.05e-02 2.11e+01 3.22e+01
n**2.20e+01 1.85e-01 4.33e+01 6.80e+01
n**2.30e+01 3.59e-01 9.04e+01 1.43e+02

You should notice that the built-in sorted function ran much faster on this input, but your merge_sorted and quick_sorted functions have essentially the same runtimes. This is because TimSort is designed to not have to resort already sorted data, and its best-case runtime is therefore $\Theta(n)$ instead of $\Theta(n\log n)$. It turns out that in practice, data is often partially sorted, an so TimSort is much faster than even highly optimized mergesort or quicksort implementations.

Add/commit your changes and push them to the forked github repo. Issue a pull request to your partner with your changes, and accept your partner's pull request when you receive it.

Submission

Submit the link to both your and your partner's github repos to sakai.

nick-sort-lab2's People

Watchers

 avatar

Forkers

cameronshir11

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.