
andz's Issues

With or without inheritance?

Following the example of R. Sedgewick and K. Wayne's implementation philosophy (from their work on implementing and explaining data structures and algorithms, whose code you can find here: https://github.com/kevin-wayne/algs4), I could remove the usage of inheritance to model relationships between data structures that, in theory, are connected by a parent-child relationship.

With inheritance

Advantages

  1. Parent-child relationships between data structures would be more obvious in certain cases

    1. Example 1: Min- and max-heaps have certain operations in common
    2. Example 2: Many operations of a BST also work for an RBT
  2. Less redundancy and code duplication

Disadvantages

  1. Dependency on another implementation

  2. Highly coupled modules

  3. Theoretical parent-child relationships are not always best represented and implemented as such.

  4. Example 1: a tree is a special type of graph, but it's usually implemented quite differently from a graph, hence it may not be a good idea to derive a tree from a graph.

    This is one example where, in theory, the data structures are connected by a parent-child relationship, but, in practice, it doesn't make much sense to implement them that way.

  5. Solution 1: use inheritance whenever it really makes sense

    1. Example 1: BST <- RBT
    2. Example 2: Heap <- MinHeap (see the sketch below)
      1. Problem 1: Duplication when testing min- and max-heap implementations, because they inherit methods already implemented in the abstract class Heap, which need to be tested for both min- and max-heaps.
        1. Note 1: in general, test duplication would happen even more often without inheritance
    3. Problem 1: Inconsistencies between using and not using inheritance across the andz code base.
  6. Complex hierarchies
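
To make the trade-off concrete, here is a minimal Python sketch of the "with inheritance" option (hypothetical class and method names, not the current andz code), where the abstract Heap owns the sift-up logic shared by min- and max-heaps, and each subclass only supplies the comparison:

    from abc import ABC, abstractmethod

    class Heap(ABC):
        def __init__(self):
            self.heap = []

        @abstractmethod
        def _should_be_above(self, a, b) -> bool:
            """The comparison that distinguishes min-heaps from max-heaps."""

        def add(self, value) -> None:
            # Shared by all subclasses: append the value, then sift it up.
            self.heap.append(value)
            i = len(self.heap) - 1
            while i > 0 and self._should_be_above(self.heap[i], self.heap[(i - 1) // 2]):
                parent = (i - 1) // 2
                self.heap[i], self.heap[parent] = self.heap[parent], self.heap[i]
                i = parent

    class MinHeap(Heap):
        def _should_be_above(self, a, b) -> bool:
            return a < b

    class MaxHeap(Heap):
        def _should_be_above(self, a, b) -> bool:
            return a > b

The testing duplication mentioned in Problem 1 shows up immediately: add must be tested through both MinHeap and MaxHeap, even though it's written only once.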


Without inheritance

Advantages

  1. Every data structure (and file) would be self-contained.
  2. Data structures are loosely coupled
    1. Implementations can change more freely

Disadvantages

  1. More duplication and redundancy
    1. Note 1: this would happen not only in the code but in the tests too!
  2. An operation implemented in one data structure may be missing from another, similar one
    1. Note 1: this is the problem of forgetting to implement an operation!

Matrices

Matrices I could implement (which could also be useful for implementing graphs). Note that these types of matrices may derive from a common, more generic matrix data structure.

Operations on Matrices

  • Determinant
  • Product, sum, subtraction and division of matrices
  • Matrix power
  • Inverse
  • Find eigenvectors
  • Find eigenvalues
  • Find spectral radius
  • Is identity?
  • Is matrix diagonally dominant?
  • Is matrix square? (For this and the diagonal-dominance check, see the sketch after this list.)
  • Is matrix positive-definite (or positive semi-definite)?
  • Is matrix (as)symmetric?
  • Is matrix sparse/dense?
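
As a taste of the simpler predicates in the list above, here is a minimal Python sketch, assuming matrices are represented as lists of row lists (a hypothetical representation, no NumPy):

    def is_square(m: list) -> bool:
        # A matrix is square if the number of columns in every row
        # equals the number of rows.
        return all(len(row) == len(m) for row in m)

    def is_diagonally_dominant(m: list) -> bool:
        # Row diagonal dominance: |m[i][i]| >= sum of |m[i][j]| for j != i.
        assert is_square(m)
        n = len(m)
        return all(
            abs(m[i][i]) >= sum(abs(m[i][j]) for j in range(n) if j != i)
            for i in range(n)
        )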

Testing numerical algorithms

Not all methods of MinMaxHeap which accept indices assert that the indices are within bounds

For consistency, I should assert, in all methods that accept "indices", that the indices are valid, e.g. by using something of the form assert self.is_good_index(my_index). Examples of methods that currently do not do that: _index_of_max, _index_of_min, etc.

Make sure that this assertion is actually appropriate in each method: it may happen that it's "ok" that, in a call or recursive sub-call to a function, the indices are no longer valid!

Check also the cases for BinaryHeap, MinHeap and MaxHeap.
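
A minimal sketch of the intended pattern (is_good_index follows the issue's naming; everything else is hypothetical):

    class MinMaxHeap:
        def __init__(self):
            self.heap = []

        def is_good_index(self, i: int) -> bool:
            # An index is valid iff it points inside the underlying list.
            return isinstance(i, int) and 0 <= i < len(self.heap)

        def _index_of_min(self, i: int) -> int:
            assert self.is_good_index(i)  # the guard this issue asks for
            # ... rest of the original implementation ...
            raise NotImplementedError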

Maximum-Flow Problem

Careful: These explanations are incomplete!!!

What is a flow network?

A flow network is a directed graph G = (V, E), such that

  1. Each edge (u, v) ∈ E has a non-negative capacity c(u, v) ≥ 0
  2. If (u, v) is an edge ∈ E, then (v, u) ∉ E
  3. If (u, v) ∉ E, then c(u, v) = 0
  4. There's a source node s and a sink node t.

Point 2 means that there are no antiparallel edges. See below for how to convert a graph with antiparallel edges into a flow network.

Flow

A flow in a flow network G = (V, E) is a real-valued function f : V × V → ℝ that satisfies two properties:

  1. Capacity constraint: ∀ u, v ∈ V : 0 ≤ f(u, v) ≤ c(u, v)

    That is, for every two vertices u and v in V, the flow between them is greater than or equal to zero and smaller than or equal to the capacity of the edge that connects u to v.

  2. Flow conservation: ∀ u ∈ V - {s, t} : ∑_{v ∈ V} f(u, v) = ∑_{v ∈ V} f(v, u)

The value of a flow f is denoted by |f| and is defined by: |f| = ∑_{v ∈ V} f(s, v) - ∑_{v ∈ V} f(v, s)

That is the total flow out of the source s minus the total flow into the source s.

Each edge in the following flow network is labeled with a label of the form f/c, where f denotes the flow and c denotes the capacity.

[figure: a flow network with edges labeled f/c]

Maximum Flow Problem

Given a flow network G = (V, E), a source node s and a sink node t, find a flow of maximum value.

Antiparallel Edges

If antiparallel edges are present in the flow network, we remove them in the following way (by splitting one edge of each antiparallel pair with an intermediate vertex):

[figure: removing antiparallel edges by introducing an intermediate vertex]

Multiple Sources or Sinks

If we have multiple sources or sinks, we add, respectively, a supersource or a supersink, as follows:

[figure: adding a supersource and a supersink]

Algorithms for maximum flow

Implement a few important algorithms that find the maximum flow, such as the Ford–Fulkerson method and the Edmonds–Karp algorithm described below.

Notions and Terminology

Augmenting path

An augmenting path is a simple path from the source s to the sink t in the residual network, along which every edge has positive residual capacity.

Examples???

Ford–Fulkerson method

function Ford–Fulkerson(G, s, t):
	initialize flow f to 0
	while there's an augmenting path p in the residual network Gf:
		augment flow f along path p
	return f

where the residual network Gf contains, for each edge (u, v) ∈ E, the edge (u, v) with residual capacity c(u, v) - f(u, v) and the reverse edge (v, u) with residual capacity f(u, v); an edge belongs to Gf only if its residual capacity is positive.

The Edmonds–Karp algorithm is a specialization of the Ford–Fulkerson method that always augments along a shortest augmenting path, found with BFS, which bounds its running time by O(V * E^2).
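
A hedged Python sketch of Edmonds–Karp (the graph representation — a dict mapping each node to a dict of edge capacities — and all names are assumptions for illustration, not an existing andz API):

    from collections import deque

    def edmonds_karp(capacity: dict, s, t) -> float:
        # capacity[u][v] is the capacity c(u, v); missing edges have capacity 0.
        # Build the residual network, adding reverse edges with 0 residual capacity.
        residual = {u: dict(adj) for u, adj in capacity.items()}
        for u, adj in capacity.items():
            for v in adj:
                residual.setdefault(v, {}).setdefault(u, 0)

        max_flow = 0
        while True:
            # BFS: find a *shortest* augmenting path with positive residual capacity.
            parent = {s: None}
            queue = deque([s])
            while queue and t not in parent:
                u = queue.popleft()
                for v, cf in residual[u].items():
                    if cf > 0 and v not in parent:
                        parent[v] = u
                        queue.append(v)
            if t not in parent:
                return max_flow  # no augmenting path left, so the flow is maximum

            # Compute the bottleneck (minimum residual capacity) along the path.
            bottleneck = float("inf")
            v = t
            while parent[v] is not None:
                bottleneck = min(bottleneck, residual[parent[v]][v])
                v = parent[v]

            # Augment: decrease forward residual capacities, increase reverse ones.
            v = t
            while parent[v] is not None:
                u = parent[v]
                residual[u][v] -= bottleneck
                residual[v][u] += bottleneck
                v = u
            max_flow += bottleneck

For example, edmonds_karp({'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}, 's', 't') returns 4.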

Here's the Wikipedia article about the topic:

Maximum Flow Problem

References

Related problems and concepts

Resources

  • Introduction to Algorithms (3rd ed.), by CLRS, chapter 26.

Is TDD the most appropriate process to develop software for this project?

This project wasn't started as a TDD project because, at that time, I didn't know much about these techniques.

Questions

  • Why would TDD be (or not be) the most appropriate process for developing software for this project?

    • Advantages?

    • Disadvantages?

  • Are there better alternatives for this project? Why? Examples?

Resources

The following resources apparently contain useful information to answer the previous questions.

Order-Statistics Trees and Interval Trees (by augmenting RBTs)

References

  • Chapter 14 of "Introduction to Algorithms" by CLRS, on augmenting the RBT data structure.

  • Slides by prof. Papadopoulou


The BST has a rank method which works in linear time, instead of the optimal logarithmic time that is achievable...

Add support for:

  • Order-statistics tree (OS-RANK, OS-SELECT) in O(log n) time, by augmenting each node x of the RBT with a field called size, which stores the number of nodes in the subtree rooted at x (including x itself). This field must be maintained, though.

  • Interval trees


Order Statistics Tree

The OS-SELECT(T, i) method (which returns the ith smallest element in T) could look like this:

OS-SELECT(T, i): // we're looking for the ith smallest element in T
    return OS-SELECT-AUX(T.root, i) // we start from the root

OS-SELECT-AUX(x, i):
    r = x.left.size + 1 // x.left.size because the elements on the left precede x
    if r == i: // x is at the ith position
        return x
    else if i < r: // search in the left subtree of x
        return OS-SELECT-AUX(x.left, i)
    else: // i > r
        // since the subtree rooted at x contains r elements
        // that come before x's right subtree in an inorder tree walk,
        // the ith smallest element in the subtree rooted at x
        // is the (i - r)th smallest element in x's right subtree
        return OS-SELECT-AUX(x.right, i - r)

OS-RANK(T, x) (which returns the number of elements that precede x in an inorder tree walk of T, plus 1 to account for x itself) could instead look like this:

OS-RANK(T, x):
    r = x.left.size + 1
    y = x
    while y != T.root:
        // if y is a right child
        // then both its parent and all the nodes in its parent's left subtree
        // come before y in an inorder tree walk of T
        if y == y.parent.right: 
            r = r + y.parent.left.size + 1
        y = y.parent
    return r

We need to maintain the field size when the RBT is modified during insertions and deletions. During an insertion, when we go down the path to the position of the new node, we increase the size of all the nodes along this path by 1. Going up, to fix the height of the tree and keep it balanced, we do at most 2 rotations. During these rotations we also need to update the sizes of the nodes involved in the rotation, and only those, since a rotation is a local operation. For example, in a left-rotate, we could do something like this (omitting the handling of nils and parent pointers):

VERIFY THAT THIS PSEUDOCODE IS CORRECT!!!

OS-LEFT-ROTATE(T, x):
    y = x.right
    x.right = y.left
    y.left.parent = x
    ...

    // the size of y becomes the size of its parent x
    y.size = x.size 

    // the new size of x is the sum of the sizes of its two children, plus 1
    x.size = x.left.size + x.right.size + 1
    ...

During a deletion, we do at most 3 rotations. While going up doing these rotations, we also need to decrease the sizes of the nodes along the path, all the way up to the root.

We can also easily implement other operations, such as OS-ITH-SUCCESSOR and OS-ITH-PREDECESSOR, by simply combining OS-RANK and OS-SELECT. For example, OS-ITH-SUCCESSOR(T, x, i) can be implemented as follows:

OS-ITH-SUCCESSOR(T, x, i):
    r = OS-RANK(T, x)
    return OS-SELECT-AUX(T.root, r + i) // select the (r + i)th element from the tree

OS-ITH-PREDECESSOR can be implemented similarly, by passing r - i as the second parameter to OS-SELECT-AUX.

All the operations mentioned so far run in O(log n) time.

Interval Trees

Idea: store intervals in the nodes of an RBT organized according to the low endpoints of the intervals. In addition, augment each node of the RBT with a field max, which is the maximum endpoint value in the subtree rooted at that node, i.e.

x.max = max(x.interval.high, x.left.max, x.right.max)

We need to maintain this new field max during insertions and deletions. But this can be achieved easily, since it's local information.

The INTERVAL-SEARCH procedure looks like this:

INTERVAL-SEARCH(T, i):
    x = T.root
    while x != T.nil and i does not overlap x.interval:
        if x.left != T.nil and x.left.max >= i.low:
            x = x.left // move left if the maximum endpoint in the left subtree is >= i.low
        else:
            x = x.right
    return x
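
The overlap test above is the standard one for closed intervals; a minimal Python sketch (hypothetical interval objects with low and high fields):

    def overlaps(i, j) -> bool:
        # Closed intervals [i.low, i.high] and [j.low, j.high] overlap
        # iff each one starts no later than the other one ends.
        return i.low <= j.high and j.low <= i.high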

Other operations that could be useful are:

  1. INTERVAL-SEARCH-EXACTLY (check the solutions to the 2nd homework of ands2).

  2. INTERVAL-SEARCH-ALL(T, i), which searches for all the intervals in T which overlap the interval i. This operation can be implemented by imitating the deletion of each found overlapping interval (check the solutions to the 2nd homework of ands2).

Should additional non-standard methods of a particular data-structure be provided?

For example, in the case of heaps, I'm currently providing an additional non-standard method, delete, for all binary heaps (MinHeap, MaxHeap and MinMaxHeap), but delete is usually not an operation provided by the specification of a binary heap because, in general, it requires linear time (for a heap).

Another example is the TST. Here too I'm providing a deletion method, but ternary-search-tree specifications usually do not list "deletion" as a standard operation.

Possible solution

  • One solution would be to not provide additional non-standard methods in the standard implementation of a data structure, but to provide a subclass whose purpose is to enhance the standard implementation with the additional operations, or maybe to simply enhance the standard implementation directly (for example, in the case of the BST, I already implemented a BSTImproved, whose purpose was to re-implement the "insertion" operation).

    • Problem: is it worth creating an additional data structure or implementation, maybe just for one method?

      • Observation: at least, it seems to make more sense than the current implementations (Why ???)

Resources

Complexity analysis of algorithms

This project also aims at implementing algorithms that are asymptotically faster than their alternatives. Nevertheless, asymptotically slower alternatives are not excluded: on the contrary, they may serve as a form of comparison with the theoretically faster ones (for the purpose of better understanding complexity analysis, given that this project is mainly educational). Moreover, in practice, algorithms with worse asymptotic complexity may have better runtime performance, at least for certain inputs and/or on certain architectures.

The goals of this issue (which is going to be permanent) are:

  1. Ensure asymptotically fastest algorithms for a particular problem are implemented before the slower ones.

  2. Implement interesting alternatives to the asymptotically fastest algorithms. Interesting alternatives are e.g. the ones that in practice (at least for certain inputs) perform better than the asymptotically fastest algorithms.

  3. List and point to resources that may help in analyzing the implemented algorithms.

  4. Also take into consideration the space complexity of algorithms, which, so far, has not been analyzed much.

  5. Describe (usual) techniques to compare asymptotic complexities of two algorithms. This may be useful e.g. if one wants to order (increasingly or decreasingly) algorithms according to their asymptotic growth rate [OPTIONAL].

  6. List facts about the asymptotic behavior of functions. For example, polynomials grow slower than exponentials [OPTIONAL].


Asymptotic complexity of common algorithms

General explanations about asymptotic notation

Order functions according to their asymptotic growth rate

Inconsistencies between exceptions that are raised in the methods of TST

The TST class raises a ValueError in the insert method if not key, but raises a TypeError for the same reason in the methods delete, search_recursively and search.

ValueError should be raised when an input has the expected type but an invalid value, whereas TypeError should be raised when the type of an input is not the expected one. So, I should raise ValueError when, for example, the input has length 0 but should have length > 0, and I should raise TypeError if, for example, a check isinstance(my_input, my_required_type) is false.

http://stackoverflow.com/questions/8297526/proper-exception-to-raise-if-none-encountered-as-argument
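
A minimal sketch of the intended convention (the method body is elided; the names follow the issue):

    def insert(self, key: str, value: object) -> None:
        # TypeError: the input is not of the expected type.
        if not isinstance(key, str):
            raise TypeError("key must be a string")
        # ValueError: the type is right, but the value is invalid.
        if not key:
            raise ValueError("key must be a non-empty string")
        # ... actual insertion logic ...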

Make "invariant" methods which contain assertions external methods to the related class

For example, in the class LinearProbingHashTable, I have defined a method called __invariants, whose purpose is to check that the preconditions, postconditions and invariants of the data structure (or of its methods) are maintained. This method could be defined as an external function, as I'm already doing for heaps and binary search trees.

Another example where there's this problem is the TST class. I may also want to add a similar function for DisjointSetsForest.
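
A minimal sketch of the external-function approach (the fields t.keys and t.values and the exact check are hypothetical, for illustration only):

    def has_linear_probing_invariants(t) -> bool:
        # External check, callable both from tests and from assert statements:
        # the keys and values tables must always have the same capacity.
        return len(t.keys) == len(t.values)

    # Usage, e.g. inside a method of LinearProbingHashTable:
    # assert has_linear_probing_invariants(self)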

Searching algorithms in AI

I had already implemented BFS and DFS, even though they are currently removed (temporarily) from this project, since the graph data structure was also removed (because I had to fix a few things with it). I would also like to add other algorithms, like:

  • Uniform-cost search (also called cheapest-first search)

    • This is a variant of the famous Dijkstra's algorithm where, instead of starting with a queue full of nodes, it starts with only one item and inserts new items as they are discovered. This can avoid some "decrease-key" operations. (See the sketch after this list.)

    • Dijkstra's algorithm finds the shortest paths between a source node and all the other nodes in a graph, whereas UCS only finds the shortest path between a source and a destination node.

  • Depth-limited search

  • Bidirectional search

  • Iterative-deepening (depth-first) search

  • Best-first search

    These algorithms are usually used for path finding in combinatorial search. There's also a greedy version of best-first search, where the paths that are predicted to be closer to the goal are chosen first.

  • Beam search

  • A* (not a greedy version of a best-first search)

    • A (A* without admissible heuristics)
  • IDA* (or iterative-deepening A*)

  • Alpha-Beta pruning
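
A minimal sketch of uniform-cost search, as described in the first item of the list above (hedged: the graph is assumed to be a dict mapping each node to a list of (neighbor, edge cost) pairs, with comparable node labels; all names are illustrative):

    import heapq

    def uniform_cost_search(graph: dict, source, destination):
        # Start with only the source in the frontier and insert nodes lazily
        # as they are discovered, instead of using decrease-key operations.
        frontier = [(0, source)]
        explored = set()
        while frontier:
            cost, node = heapq.heappop(frontier)
            if node in explored:
                continue  # stale entry: a cheaper path was already expanded
            if node == destination:
                return cost
            explored.add(node)
            for neighbor, edge_cost in graph.get(node, []):
                if neighbor not in explored:
                    heapq.heappush(frontier, (cost + edge_cost, neighbor))
        return None  # the destination is unreachable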

Local search

Interesting articles

Possibly useful resources

  • Chapter 3.5 of Russell and Norvig.
  • Slides of prof. Gambardella

Inconsistencies between the acceptable values in the key-value pair of a HeapNode when adding a HeapNode to a Heap

I've just noticed that the method search_by_value in the class Heap raises a ValueError if the input val is None.

The problem is that, theoretically, it's possible to add a HeapNode with value None to the heap (either max, min or min-max) using the add method defined in the Heap class: in that same method, if the input is already a HeapNode, no further checks are performed.

Two possible solutions:

  1. Allow HeapNode objects with None values to be added to the Heap.
  2. Make sure that HeapNode objects with None values are not added to the heap, by modifying the add method (see the sketch below).

Note: the tests have not caught this bug!
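
A minimal sketch of solution 2 (the names HeapNode, self.heap and _push_up are assumptions based on the issue's description, not the exact andz code):

    def add(self, x) -> None:
        # Reject HeapNode objects whose value is None, so that
        # search_by_value's ValueError can never refer to a node
        # that was legitimately added.
        if isinstance(x, HeapNode):
            if x.value is None:
                raise ValueError("a HeapNode's value cannot be None")
            node = x
        else:
            node = HeapNode(x)  # hypothetical constructor
        self.heap.append(node)
        self._push_up(len(self.heap) - 1)  # hypothetical sift-up helper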

Current heaps should be called "binary heaps"

So far, what I've actually implemented are binary heaps, but I don't think I've emphasized the fact that these are binary heaps, and not, for example, Fibonacci or binomial heaps.

Optional operations for DisjointSetsForest

  • size (number of elements in the data structure)
  • contains (checks if an element is in the data structure)
  • sets (number of disjoint sets in the data structure: may be useful, for example, to understand how far we are from having just one set, etc.)
  • Erase or clear the data structure
  • Deletion of an element from a set (and from the data structure)
  • Deletion of a whole (disjoint) set
  • Pretty-print a set (similar to how we're pretty-printing a BST)
  • Counting elements in a set (this could also be a property)

Type hints for the return values of some of TST's methods are incorrect or missing

For example, the delete method has a return type hint of TSTNode, but it actually returns the value of the node being deleted, not the node itself. The node should actually never be returned to the user, since it's only useful to the implementation of the data structure.

There's also at least one private method which is missing its return value's type hint...

Should the implementation details of DSForests be hidden from the clients?

Currently, the implementation of DSForests seems to assume that the user knows about DSNode, and thus can access its public fields. Is this a good thing in this case? My first intuition is that the user doesn't care about these implementation details, and therefore they should be hidden. Note: if that's the case, we may also need to change the interface of certain methods, e.g. find, which, instead of returning a DSNode, would return the original value stored in the corresponding representative (see the sketch below).

For example, Kruskal's MST algorithm uses DSForests, but it doesn't really care about implementation details; it only cares about the main functionalities and advantages that DSForests offers:

  1. Find if two elements belong to the same set in amortized constant time
  2. The nature of the DS, i.e. that you have a forest of disjoint sets and you can union them.
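
A minimal sketch of the value-returning interface (a hypothetical wrapper, assuming the existing find returns a DSNode with a value field):

    def find_value(self, x):
        # Hide DSNode from clients: return the representative's original
        # value instead of the internal node object.
        node = self.find(x)  # assumed to return a DSNode (or None)
        return None if node is None else node.value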

Graph data structure

The current design of the graph data structure is very poor. I should do a bit of research on the best ways of implementing a graph data structure, not by just considering a few particular cases: it should be as flexible and general as possible.

  1. In some algorithms that run on graphs, I sometimes need workarounds to provide certain functionality (which functionality?) with the current implementation (which was actually temporarily removed from this repository because of this issue).

  2. I should provide the adjacency-list (linked-list), adjacency-matrix and maybe the edge-list implementations, so that, depending on the problem, we can choose whichever representation suits the situation best.

    1. In the case of an adjacency-matrix representation, should I create a "matrix" data structure first?
    2. Would this create too many dependencies, or could it actually be a good idea because, e.g., this graph data structure could then also be used in numerical computing problems?

    Check this other issue related to the implementation of a matrix data structure.

  3. How should I deal with directed and undirected graphs?

    1. Should these be two separate and independent data structures, or should this be an option of a single data structure?
  4. How should I deal with sparse and dense graphs?

  5. Even though efficiency is not one of the main goals of this project, this data structure should be implemented as efficiently as possible, i.e. known algorithms with optimal or good performance should clearly be preferred over a "random", "custom" implementation.

Possible supported operations

Possibly useful resources

Algorithms under algorithms/greedy/huffamn.py are broken

Since I've changed something in the interface of the HeapNode class, I realized that the code under huffamn.py doesn't work anymore.

I have temporarily removed huffman.py.

I should maybe have started this project using test-driven development, but I was not aware of all of these development processes at the time, and, anyway, I didn't have much time to do it that way.

Inconsistencies in the implementation of BST and RBT

Since RBT derives from BST, it has inherited all of its implementation problems, so I'm just going to talk about the problems in the context of the BST.

Problems

  1. The BST allows duplicate keys. It also supports associating a value with a key, but this probably does not make much sense combined with duplicates: if the user associates a value with a duplicated key, a search may return a different value than the one he's expecting. In other words, the current BST implementation is not really a map, yet it allows you to associate a value with a key, which may not make much sense.

  2. The BST allows you to insert, search and delete elements either by key or by BSTNode, and this can cause a few problems.

    1. When searching or deleting using a BSTNode, the inputted BSTNode...

      1. could be corrupted, and this could break some supposed invariants
      2. might not belong to the tree, and there's no way of detecting this except by searching through all the nodes, which would be too expensive.
  3. The return value of the search and delete operations is a BSTNode, which contains a value. Is it a good idea to expose the internal details to the client?

    1. If I decide not to make this BST implementation a map, then probably the most sensible thing to do when searching and deleting is, respectively, to return a boolean value (to indicate whether the key is in the tree) and to not return anything (since there's no value associated with the key, there's no need to return anything!)

    2. If I decide to make this BST a map, then, when searching and deleting, I can return the corresponding value if the key is in the map, or, e.g., None if it's not.

  4. Currently, BSTNode objects are equal only if they are the same object. This could make sense if we let the user know about BSTNode objects, as is currently the case, but if I decide to hide the implementation details of the BST, i.e. hide BSTNode as much as possible (and I probably should!), then it no longer makes sense to consider BSTNode objects equal only if they are the same object: they should be equal if they have the same key (in the case of a non-map BST).

Possible Solutions

  1. Do not allow duplicates, and thus create a BSTMap!

  2. Or maybe I should have a BST base class which allows duplicates but whose nodes only contain keys (i.e. no values). In this case, I should probably hide all implementation details (like the existence of BSTNode) and adapt the BSTNode implementation accordingly. In this context, two BSTNodes would be equal not if they are the same object, but if they contain the same key (see the sketch below).

    • Later, I could implement a BSTMap which derives from this implementation (?) and which does not allow duplicates (PREFERENCE)!
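
A minimal sketch of the key-based equality from point 4 (a hypothetical key-only BSTNode):

    class BSTNode:
        def __init__(self, key):
            self.key = key
            self.left = self.right = self.parent = None

        def __eq__(self, other):
            # Two nodes are equal if they hold the same key,
            # not only if they are the very same object.
            return isinstance(other, BSTNode) and self.key == other.key

        def __hash__(self):
            return hash(self.key)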

Which type of testing is being used for this project?

The tests currently provided are unit tests, i.e. they only test functions, classes or objects defined in one module: they do not test the interactions and dependencies between modules or packages (which would be integration testing).

Assertion error when checking if a MinMaxHeap is a MinMaxHeap inside the deletion operation

When I was running all the tests locally, using the script ./run_tests.sh, at a certain point an assertion (i.e. assert is_min_max_heap(self)) in the deletion operation was not satisfied. Apparently, this only occurred locally and when I ran all the tests.

In general, there was some strange behavior going on.

Verify that the algorithms for deleting an element from a MinMaxHeap (and also from a MaxHeap and a MinHeap) really work, by thinking about all possible cases!

Why does the Graph class have the field dfs_global_time?

Essentially for convenience. This variable is used, as the name suggests, in the DFS algorithm to keep track of the global time, i.e. it's a counter used to mark when a node is first and last visited. This variable has nothing to do with the Graph class, except that the DFS algorithm runs on a graph. In my opinion, it should be separate from the Graph class, but I haven't yet found a convincing solution to this problem. The only one that comes to my mind right now (and that came to my mind when I first created the Graph class) is to make this variable a module-level global, present only in the file where the DFS algorithm is defined, but I'm still not sure this is a good design either...

Why can't I create an undirected (or directed) graph?

My idea was that a graph is directed by default, and that, to create an undirected graph, you just add each endpoint of a (virtual) undirected edge to the other endpoint's adjacency list.

For example, if A and B are the endpoints of an undirected edge, then you add A to B's adjacency list and B to A's adjacency list. Of course, you need to do this for all edges. For this reason, I decided to create two functions, add_directed_edge and add_undirected_edge, that do this automatically under the hood for you (see the sketch below).

This seems to solve all the problems, but that might not be the case. For example, if you want to know whether a Graph is directed or undirected, you need to check all the (virtual) edges, i.e. all the adjacency lists, etc.

There are also other solutions to this problem, but they could introduce other problems, and so on, making the graph data structure ever more complex, and maybe giving it a worse design.
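
A minimal sketch of the two functions (the adjacency-list representation, a dict mapping each node to a list of neighbors, is an assumption for illustration):

    def add_directed_edge(adj: dict, u, v) -> None:
        # A directed edge appears only in the source's adjacency list.
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])  # make sure v exists as a node

    def add_undirected_edge(adj: dict, u, v) -> None:
        # An undirected edge is simulated with two directed edges.
        add_directed_edge(adj, u, v)
        add_directed_edge(adj, v, u)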

TST's deletion is buggy

After visualizing the behaviour of a TST data structure using this tool, I realized that my implementation of deletion is buggy.

_delete_fix keeps deleting nodes, going up towards the root while u has no children, but it doesn't consider the fact that the current node u may contain a value and thus should not be deleted.

This may not be the only problem! I need to think about all the possible cases very carefully. Have a look at the implementation by galles.

Add tail-recursive versions of the recursive functions

The concept of a tail-recursive function is especially important in functional programming languages. A tail-recursive function can be executed in an optimized way, by reusing the existing stack frame when the recursive call is located in the tail position (see the example after the list below).

In the folder ands/algorithms/recursion, there are a few examples of recursive functions. The idea of this issue is to promote the:

  1. provision of a tail-recursive alternative implementation of the existing recursive functions (which are not already tail-recursive), and

  2. documentation of the functions as tail-recursive or not.
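
As an example of both points, here is a classic recursive function and a tail-recursive rewrite of it (note, as a caveat, that CPython does not perform tail-call optimization, so in Python the rewrite mainly has documentation value):

    def factorial(n: int) -> int:
        # Not tail-recursive: the multiplication happens *after*
        # the recursive call returns.
        return 1 if n == 0 else n * factorial(n - 1)

    def factorial_tail(n: int, accumulator: int = 1) -> int:
        # Tail-recursive: the recursive call is the last operation,
        # and the partial result is carried in the accumulator.
        return accumulator if n == 0 else factorial_tail(n - 1, n * accumulator)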

Resources

Possibly useful resources to implement this

PageRank using the power method

Page Rank

  1. Suppose we have a graph g.
  2. Let n be the number of nodes in this graph.
  3. Let G be the connectivity matrix representing the connections of this graph g.
    1. g_ij = 1, if there's an (outgoing) link from j to i.
    2. g_ij = 0, otherwise.
  4. Let c_j be the out-degree (the number of links going out) of node j,
  5. and let r_j similarly be the in-degree.
  6. Let p be the probability that the random walk of a user follows a specific link.
    1. A typical value for this is p = 0.85.
    2. Then 1 - p is the probability that some arbitrary page is chosen.
    3. Hence δ = (1 - p) / n is the probability that a particular random page (among these arbitrary pages) is chosen.
  7. Let A be an n × n matrix. This is the transition probability (or stochastic) matrix (of the Markov chain) of the graph g.
    1. a_ij = p * (g_ij / c_j) + δ, if c_j != 0

      1. The ratio g_ij / c_j is the proportion of the outgoing links from j to i over all of j's out-links. Note that g_ij could be 0, if there's no link from j to i.
    2. a_ij = 1 / n, if c_j == 0

      1. Assigning 1 / n when c_j == 0 makes sense because, if there are no outgoing links from page j, there's 0 probability of following a specific link from j, so each of the n pages is chosen at random with probability 1 / n.
    3. All elements of A are strictly between 0 and 1.

    4. All the column sums of A are equal to 1.

Power method

  1. Good for matrices which are

    1. large
    2. sparse
    3. well-conditioned???
  2. The method consists in repeating x = Ax (for some initial x_0, e.g. such that ||x_0|| = 1), until the absolute difference between the next x and the current x is "small enough".

    Input: A, x_0, termination condition
    Output: v_k (the iterate when the termination condition is met)
    for k = 1, 2, ... until the termination condition is met do:
        v_temp = A * v_{k-1}
        v_k = v_temp / ||v_temp||
    end

    1. The reason for the repeated normalization is that there is nothing in the definition of an eigenvector preventing our iterates v_k from growing in magnitude, which would indeed happen for large eigenvalues, accelerating the growth of roundoff errors.

    2. The termination condition depends on the purpose of the computation...!?

    3. How do we know that matrix A has an eigenvalue equal to 1? (Answer: A is column-stochastic, i.e. all its column sums are 1, so the Perron–Frobenius theorem tells us that this matrix has an eigenvalue 1.)

      1. If it does, how do we know that it's the dominant eigenvalue and thus the solution that we find corresponds to the dominant eigenvector?

      2. Does this problem have a unique solution x?

      3. With a few adjustments to the basic model, A indeed has an eigenvalue 1, it's the dominant eigenvalue, it has algebraic multiplicity 1 (?), and the problem has a unique positive real solution x. Proof?

      4. In the case of PageRank, A has a dominant eigenvalue equal to 1, but in most cases we don't know the dominant eigenvalue's value. In those cases, we can use the Rayleigh quotient, defined for any given vector v by: rq = (v^T * A * v) / (v^T * v).

        1. If v were an eigenvector of A, then rq would simply be the corresponding eigenvalue. We can update the power method pseudocode above by adding the line λ_1^(k) = v_k^T * A * v_k after the normalization step. Note that we don't have to divide by v_k^T * v_k, since we have already normalized v_k.
    4. An initial approximation to x could be a vector whose entries are 1/n.

    5. The Perron-Frobenius theorem applies to matrices like A:

      1. It says that x = Ax has a non-zero solution, which is unique up to a scaling factor. If the scaling factor is chosen so that ∑_i x_i = 1, then the x we obtain at the end is the state vector of the Markov chain and is also Google's PageRank.
  3. It's the only possible approach for large n, where n is the number of pages.

  4. Matrices G (the connectivity matrix) and A (the transition probability matrix of a Markov chain, formed from G) can, in practice, never be formed explicitly.

  5. To preserve the sparsity of G, we can write A = p*G*D + e*z^T, where

    1. D is the diagonal matrix formed as follows:
      d_jj = 1/c_j, if c_j != 0;
      d_jj = 0, if c_j == 0.

    2. e is a column vector of all 1s,

    3. z is the vector with components:
      z_j = δ, if c_j != 0;
      z_j = 1/n, if c_j == 0.
      Thus the matrix e*z^T accounts for the random choices of web pages that do not follow links.

    4. x = Ax can thus be rewritten as

      x - Ax = 0
      (I - A)x = 0
      (I - (p*G*D + e*z^T))x = 0
      (I - p*G*D - e*z^T)x = 0
      Ix - p*G*Dx - e*z^Tx = 0
      (I - p*G*D)x - e*z^Tx = 0
      (I - p*G*D)x = e*(z^Tx)

      Now let z^Tx = γ, so we have

      (I - p*G*D)x = eγ

      We don't know γ, because it depends on the unknown x.
      Temporarily let γ = 1.
      Let Q = I - p*G*D.
      So we end up with:

      Qx = e

      As long as p is strictly less than one, the coefficient matrix Q is nonsingular. Why???

      Now we can solve the linear system Qx = e for x; since we set γ = 1 arbitrarily, we rescale the solution at the end so that ∑_i x_i = 1.

  6. The power method can also be implemented without actually forming the transition probability (or Markov) matrix A = p*G*D + e*z^T, thus preserving sparsity (see the sketch after this list).

    1. Let B = p*G*D.
    2. Let c be the vector containing the out-degree of each node j, for j = 1..n. That is, it's the vector whose elements are the column sums of G.
    3. Define a vector z, with |z| = |c|, and fill each of its elements as follows:
      1. z_j = (1 - p)/n, if c_j != 0
      2. z_j = 1/n, if c_j == 0
    4. Start with the vector x = e/n.
    5. Repeat x = B*x + e*(z^T*x) until x settles down to several decimal places.
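
A hedged NumPy sketch of this sparsity-preserving iteration (dense arrays for brevity; all names are illustrative, and in a real implementation G would be a scipy.sparse matrix):

    import numpy as np

    def pagerank_power(G: np.ndarray, p: float = 0.85, tol: float = 1e-10) -> np.ndarray:
        # G[i, j] == 1 iff there is a link from page j to page i.
        n = G.shape[0]
        c = G.sum(axis=0)  # out-degrees, i.e. the column sums of G
        d = np.divide(1.0, c, out=np.zeros(n), where=(c != 0))
        B = p * G * d      # equivalent to p*G*D, with D = diag(d)
        z = np.where(c != 0, (1.0 - p) / n, 1.0 / n)
        x = np.full(n, 1.0 / n)  # initial guess: the uniform distribution
        while True:
            # x = B*x + e*(z^T*x); the scalar z @ x broadcasts over the vector e.
            x_next = B @ x + z @ x
            if np.abs(x_next - x).max() < tol:
                return x_next
            x = x_next

Since A is column-stochastic and x_0 sums to 1, every iterate remains a probability vector, so no explicit renormalization is needed in this variant.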

References

Notes

Is the MinPriorityQueue in Queue.py implemented "correctly"?

The MinPriorityQueue is a data structure implemented using a MinHeap and it could potentially be used in different algorithms.

So the idea behind opening this issue is to try to understand whether this implementation can be used in a flexible way: ideally, it should be usable whenever any algorithm needs a "min-priority queue", so it should be as simple as possible (i.e. without many secondary enhancements) and it should be as decoupled as possible from other data structures.
