
andz's Issues

With or without inheritance?

Following the example of R. Sedgewick and K. Wayne's implementation philosophy (from their work on implementing and explaining data structures and algorithms, whose code you can find here: https://github.com/kevin-wayne/algs4), I could remove the usage of inheritance to model relationships between data structures that, in theory, are connected by a parent-child relationship.

With inheritance

Advantages

  1. Parent-child relationships between data structures would be more obvious in certain cases

    1. Example 1: Min- and max-heaps have certain operations in common
    2. Example 2: Many operations of a BST also work for an RBT
  2. Less redundancy and code duplication

Disadvantages

  1. Dependency on another implementation

  2. Highly coupled modules

  3. Theoretical parent-child relationships are not always best represented and implemented as such.

  4. Example 1: a tree is a special type of graph, but it's usually implemented quite differently from a graph, hence it may not be a good idea to derive a tree from a graph.

    This is one example where, in theory, the data structures are connected by a parent-child relationship, but, in practice, it doesn't make much sense to implement them that way.

  5. Solution 1: use inheritance whenever it really makes sense

    1. Example 1: BST <- RBT
    2. Example 2: Heap <- MinHeap (see the sketch below)
      1. Problem 1: Duplication when testing min- and max-heap implementations, because they inherit methods already implemented in the abstract class Heap, which need to be tested for both min- and max-heaps.
        1. Note 1: in general, test duplication would happen even more often without inheritance
    3. Problem 1: Inconsistencies between using and not using inheritance across the andz code base.
  6. Complex hierarchies
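
To make the trade-off concrete, here is a minimal Python sketch of the "with inheritance" option (hypothetical class and method names, not the current andz code), where the abstract Heap owns the sift-up logic shared by min- and max-heaps, and each subclass only supplies the comparison:

    from abc import ABC, abstractmethod

    class Heap(ABC):
        def __init__(self):
            self.heap = []

        @abstractmethod
        def _should_be_above(self, a, b) -> bool:
            """The comparison that distinguishes min-heaps from max-heaps."""

        def add(self, value) -> None:
            # Shared by all subclasses: append the value, then sift it up.
            self.heap.append(value)
            i = len(self.heap) - 1
            while i > 0 and self._should_be_above(self.heap[i], self.heap[(i - 1) // 2]):
                parent = (i - 1) // 2
                self.heap[i], self.heap[parent] = self.heap[parent], self.heap[i]
                i = parent

    class MinHeap(Heap):
        def _should_be_above(self, a, b) -> bool:
            return a < b

    class MaxHeap(Heap):
        def _should_be_above(self, a, b) -> bool:
            return a > b

The testing duplication mentioned in Problem 1 shows up immediately: add must be tested through both MinHeap and MaxHeap, even though it's written only once.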


Without inheritance

Advantages

  1. Every data structure (and file) would be self-contained.
  2. Data structures are loosely coupled
    1. Implementations can change more freely

Disadvantages

  1. More duplication and redundancy
    1. Note 1: this would happen not only in the code but in the tests too!
  2. An operation implemented in one data structure may be missing from another, similar one
    1. Note 1: this is the problem of forgetting to implement an operation!

Matrices

Matrices I could implement (which could also be useful for implementing graphs). Note that these types of matrices may derive from a common, more generic matrix data structure.

Operations on Matrices

  • Determinant
  • Product, sum, subtraction and division of matrices
  • Matrix power
  • Inverse
  • Find eigenvectors
  • Find eigenvalues
  • Find spectral radius
  • Is identity?
  • Is matrix diagonally dominant?
  • Is matrix square? (For this and the diagonal-dominance check, see the sketch after this list.)
  • Is matrix positive-definite (or positive semi-definite)?
  • Is matrix (as)symmetric?
  • Is matrix sparse/dense?
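
As a taste of the simpler predicates in the list above, here is a minimal Python sketch, assuming matrices are represented as lists of row lists (a hypothetical representation, no NumPy):

    def is_square(m: list) -> bool:
        # A matrix is square if the number of columns in every row
        # equals the number of rows.
        return all(len(row) == len(m) for row in m)

    def is_diagonally_dominant(m: list) -> bool:
        # Row diagonal dominance: |m[i][i]| >= sum of |m[i][j]| for j != i.
        assert is_square(m)
        n = len(m)
        return all(
            abs(m[i][i]) >= sum(abs(m[i][j]) for j in range(n) if j != i)
            for i in range(n)
        )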

Testing numerical algorithms

Not all methods of MinMaxHeap which accept indices assert that the indices are within bounds

For consistency, I should assert, in all methods that accept "indices", that the indices are valid, e.g. by using something of the form assert self.is_good_index(my_index). Examples of methods that currently do not do that: _index_of_max, _index_of_min, etc.

Make sure that this assertion is actually appropriate in each method: it may happen that it's "ok" that, in a call or recursive sub-call to a function, the indices are no longer valid!

Check also the cases for BinaryHeap, MinHeap and MaxHeap.
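
A minimal sketch of the intended pattern (is_good_index follows the issue's naming; everything else is hypothetical):

    class MinMaxHeap:
        def __init__(self):
            self.heap = []

        def is_good_index(self, i: int) -> bool:
            # An index is valid iff it points inside the underlying list.
            return isinstance(i, int) and 0 <= i < len(self.heap)

        def _index_of_min(self, i: int) -> int:
            assert self.is_good_index(i)  # the guard this issue asks for
            # ... rest of the original implementation ...
            raise NotImplementedError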

Maximum-Flow Problem

Careful: These explanations are incomplete!!!

What is a flow network?

A flow network is a directed graph G = (V, E), such that

  1. Each edge (u, v) ∈ E has a non-negative capacity c(u, v) ≥ 0
  2. If (u, v) is an edge ∈ E, then (v, u) ∉ E
  3. If (u, v) ∉ E, then c(u, v) = 0
  4. There's a source node s and a sink node t.

Point 2 means that there are no antiparallel edges. See below for how to convert a graph with antiparallel edges into a flow network.

Flow

A flow in a flow network G = (V, E) is a real-valued function f : V × V → ℝ that satisfies two properties:

  1. Capacity constraint: ∀ u, v ∈ V : 0 ≤ f(u, v) ≤ c(u, v)

    That is, for every two vertices u and v in V, the flow between them is greater than or equal to zero and smaller than or equal to the capacity of the edge that connects u to v.

  2. Flow conservation: ∀ u ∈ V - {s, t} : ∑_{v ∈ V} f(u, v) = ∑_{v ∈ V} f(v, u)

The value of a flow f is denoted by |f| and is defined by: |f| = ∑_{v ∈ V} f(s, v) - ∑_{v ∈ V} f(v, s)

That is the total flow out of the source s minus the total flow into the source s.

Each edge in the following flow network is labeled with a label of the form f/c, where f denotes the flow and c denotes the capacity.

[figure: a flow network with edges labeled f/c]

Maximum Flow Problem

Given a flow network G = (V, E), a source node s and a sink node t, find a flow of maximum value.

Antiparallel Edges

If antiparallel edges are present in the flow network, we remove them in the following way (by splitting one edge of each antiparallel pair with an intermediate vertex):

[figure: removing antiparallel edges by introducing an intermediate vertex]

Multiple Sources or Sinks

If we have multiple sources or sinks, we add, respectively, a supersource or a supersink, as follows:

[figure: adding a supersource and a supersink]

Algorithms for maximum flow

Implement a few important algorithms that find the maximum flow, such as the Ford–Fulkerson method and the Edmonds–Karp algorithm described below.

Notions and Terminology

Augmenting path

An augmenting path is a simple path from the source s to the sink t in the residual network, along which every edge has positive residual capacity.

Examples???

Ford–Fulkerson method

function Ford–Fulkerson(G, s, t):
	initialize flow f to 0
	while there's an augmenting path p in the residual network Gf:
		augment flow f along path p
	return f

where the residual network Gf contains, for each edge (u, v) ∈ E, the edge (u, v) with residual capacity c(u, v) - f(u, v) and the reverse edge (v, u) with residual capacity f(u, v); an edge belongs to Gf only if its residual capacity is positive.

The Edmonds–Karp algorithm is a specialization of the Ford–Fulkerson method that always augments along a shortest augmenting path, found with BFS, which bounds its running time by O(V * E^2).
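
A hedged Python sketch of Edmonds–Karp (the graph representation — a dict mapping each node to a dict of edge capacities — and all names are assumptions for illustration, not an existing andz API):

    from collections import deque

    def edmonds_karp(capacity: dict, s, t) -> float:
        # capacity[u][v] is the capacity c(u, v); missing edges have capacity 0.
        # Build the residual network, adding reverse edges with 0 residual capacity.
        residual = {u: dict(adj) for u, adj in capacity.items()}
        for u, adj in capacity.items():
            for v in adj:
                residual.setdefault(v, {}).setdefault(u, 0)

        max_flow = 0
        while True:
            # BFS: find a *shortest* augmenting path with positive residual capacity.
            parent = {s: None}
            queue = deque([s])
            while queue and t not in parent:
                u = queue.popleft()
                for v, cf in residual[u].items():
                    if cf > 0 and v not in parent:
                        parent[v] = u
                        queue.append(v)
            if t not in parent:
                return max_flow  # no augmenting path left, so the flow is maximum

            # Compute the bottleneck (minimum residual capacity) along the path.
            bottleneck = float("inf")
            v = t
            while parent[v] is not None:
                bottleneck = min(bottleneck, residual[parent[v]][v])
                v = parent[v]

            # Augment: decrease forward residual capacities, increase reverse ones.
            v = t
            while parent[v] is not None:
                u = parent[v]
                residual[u][v] -= bottleneck
                residual[v][u] += bottleneck
                v = u
            max_flow += bottleneck

For example, edmonds_karp({'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}, 's', 't') returns 4.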

Here's the Wikipedia article about the topic:

Maximum Flow Problem

References

Related problems and concepts

Resources

  • Introduction to Algorithms (3rd ed.), by CLRS, chapter 26.

Is TDD the most appropriate process to develop software for this project?

This project wasn't started as a TDD project because, at that time, I didn't know much about these techniques.

Questions

  • Why would TDD be (or not be) the most appropriate process for developing software for this project?

    • Advantages?

    • Disadvantages?

  • Are there better alternatives for this project? Why? Examples?

Resources

The following resources apparently contain useful information to answer the previous questions.

Order-Statistics Trees and Interval Trees (by augmenting RBTs)

References

  • Chapter 14 of "Introduction to Algorithms" by CLRS, on augmenting the RBT data structure.

  • Slides by prof. Papadopoulou


The BST has a rank method which works in linear time, instead of the optimal logarithmic time that is achievable...

Add support for:

  • Order-statistics tree (OS-RANK, OS-SELECT) in O(log n) time, by augmenting each node x of the RBT with a field called size, which stores the number of nodes in the subtree rooted at x (including x itself). This field must be maintained, though.

  • Interval trees


Order Statistics Tree

The OS-SELECT(T, i) method (which returns the ith smallest element in T) could look like this:

OS-SELECT(T, i): // we're looking for the ith smallest element in T
    return OS-SELECT-AUX(T.root, i) // we start from the root

OS-SELECT-AUX(x, i):
    r = x.left.size + 1 // x.left.size because the elements on the left precede x
    if r == i: // x is at the ith position
        return x
    else if i < r: // search in the left subtree of x
        return OS-SELECT-AUX(x.left, i)
    else: // i > r
        // since the subtree rooted at x contains r elements
        // that come before x's right subtree in an inorder tree walk,
        // the ith smallest element in the subtree rooted at x
        // is the (i - r)th smallest element in x's right subtree
        return OS-SELECT-AUX(x.right, i - r)

OS-RANK(T, x) (which returns the number of elements that precede x in an inorder tree walk of T, plus 1 to account for x itself) could instead look like this:

OS-RANK(T, x):
    r = x.left.size + 1
    y = x
    while y != T.root:
        // if y is a right child
        // then both its parent and all the nodes in its parent's left subtree
        // come before y in an inorder tree walk of T
        if y == y.parent.right: 
            r = r + y.parent.left.size + 1
        y = y.parent
    return r

We need to maintain the field size when the RBT is modified during insertions and deletions. During an insertion, when we go down the path to the position of the new node, we increase the size of all the nodes along this path by 1. Going up, to fix the height of the tree and keep it balanced, we do at most 2 rotations. During these rotations we also need to update the sizes of the nodes involved in the rotation, and only those, since a rotation is a local operation. For example, in a left-rotate, we could do something like this (omitting the handling of nils and parent pointers):

VERIFY THAT THIS PSEUDOCODE IS CORRECT!!!

OS-LEFT-ROTATE(T, x):
    y = x.right
    x.right = y.left
    y.left.parent = x
    ...

    // the size of y becomes the size of its parent x
    y.size = x.size 

    // the new size of x is the sum of the sizes of its two children, plus 1
    x.size = x.left.size + x.right.size + 1
    ...

During a deletion, we do at most 3 rotations. While going up doing these rotations, we also need to decrease the sizes of the nodes along the path, all the way up to the root.

We can also easily implement other operations, such as OS-ITH-SUCCESSOR and OS-ITH-PREDECESSOR, by simply combining OS-RANK and OS-SELECT. For example, OS-ITH-SUCCESSOR(T, x, i) can be implemented as follows:

OS-ITH-SUCCESSOR(T, x, i):
    r = OS-RANK(T, x)
    return OS-SELECT-AUX(T.root, r + i) // select the (r + i)th element from the tree

OS-ITH-PREDECESSOR can be implemented similarly, by passing r - i as the second parameter to OS-SELECT-AUX.

All the operations mentioned so far run in O(log n) time.

Interval Trees

Idea: store intervals in the nodes of an RBT organized according to the low endpoints of the intervals. In addition, augment each node of the RBT with a field max, which is the maximum endpoint value in the subtree rooted at that node, i.e.

x.max = max(x.interval.high, x.left.max, x.right.max)

We need to maintain this new field max during insertions and deletions. But this can be achieved easily, since it's local information.

The INTERVAL-SEARCH procedure looks like this:

INTERVAL-SEARCH(T, i):
    x = T.root
    while x != T.nil and i does not overlap x.interval:
        if x.left != T.nil and x.left.max >= i.low:
            x = x.left // move left if the maximum endpoint in the left subtree is >= i.low
        else:
            x = x.right
    return x
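
The overlap test above is the standard one for closed intervals; a minimal Python sketch (hypothetical interval objects with low and high fields):

    def overlaps(i, j) -> bool:
        # Closed intervals [i.low, i.high] and [j.low, j.high] overlap
        # iff each one starts no later than the other one ends.
        return i.low <= j.high and j.low <= i.high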

Other operations that could be useful are:

  1. INTERVAL-SEARCH-EXACTLY (check the solutions to the 2nd homework of ands2).

  2. INTERVAL-SEARCH-ALL(T, i), which searches for all the intervals in T which overlap the interval i. This operation can be implemented by imitating the deletion of each found overlapping interval (check the solutions to the 2nd homework of ands2).

Should additional non-standard methods of a particular data-structure be provided?

For example, in the case of heaps, I'm currently providing an additional non-standard method, delete, for all binary heaps (MinHeap, MaxHeap and MinMaxHeap), but delete is usually not an operation provided by the specification of a binary heap because, in general, it requires linear time (for a heap).

Another example is the TST. Here too I'm providing a deletion method, but ternary-search-tree specifications usually do not list "deletion" as a standard operation.

Possible solution

  • One solution would be to not provide additional non-standard methods in the standard implementation of a data structure, but to provide a subclass whose purpose is to enhance the standard implementation with the additional operations, or maybe to simply enhance the standard implementation directly (for example, in the case of the BST, I already implemented a BSTImproved, whose purpose was to re-implement the "insertion" operation).

    • Problem: is it worth creating an additional data structure or implementation, maybe just for one method?

      • Observation: at least, it seems to make more sense than the current implementations (Why ???)

Resources

Complexity analysis of algorithms

This project also aims at implementing algorithms that are asymptotically faster than their alternatives. Nevertheless, asymptotically slower alternatives are not excluded: on the contrary, they may serve as a form of comparison with the theoretically faster ones (for the purpose of better understanding complexity analysis, given that this project is mainly educational). Moreover, in practice, algorithms with worse asymptotic complexity may have better runtime performance, at least for certain inputs and/or on certain architectures.

The goals of this issue (which is going to be permanent) are:

  1. Ensure asymptotically fastest algorithms for a particular problem are implemented before the slower ones.

  2. Implement interesting alternatives to the asymptotically fastest algorithms. Interesting alternatives are e.g. the ones that in practice (at least for certain inputs) perform better than the asymptotically fastest algorithms.

  3. List and point to resources that may help in analyzing the implemented algorithms.

  4. Also take into consideration the space complexity of algorithms, which, so far, has not been analyzed much.

  5. Describe (usual) techniques to compare asymptotic complexities of two algorithms. This may be useful e.g. if one wants to order (increasingly or decreasingly) algorithms according to their asymptotic growth rate [OPTIONAL].

  6. List facts about the asymptotic behavior of functions. For example, polynomials grow slower than exponentials [OPTIONAL].


Asymptotic complexity of common algorithms

General explanations about asymptotic notation

Order functions according to their asymptotic growth rate

Inconsistencies between exceptions that are raised in the methods of TST

The TST class raises a ValueError in the insert method if not key, but raises a TypeError for the same reason in the methods delete, search_recursively and search.

ValueError should be raised when an input has the expected type but an invalid value, whereas TypeError should be raised when the type of an input is not the expected one. So, I should raise ValueError when, for example, the input has length 0 but should have length > 0, and I should raise TypeError if, for example, a check isinstance(my_input, my_required_type) is false.

http://stackoverflow.com/questions/8297526/proper-exception-to-raise-if-none-encountered-as-argument
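
A minimal sketch of the intended convention (the method body is elided; the names follow the issue):

    def insert(self, key: str, value: object) -> None:
        # TypeError: the input is not of the expected type.
        if not isinstance(key, str):
            raise TypeError("key must be a string")
        # ValueError: the type is right, but the value is invalid.
        if not key:
            raise ValueError("key must be a non-empty string")
        # ... actual insertion logic ...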

Make "invariant" methods which contain assertions external methods to the related class

For example, in the class LinearProbingHashTable, I have defined a method called __invariants, whose purpose is to check that the preconditions, postconditions and invariants of the data structure (or of its methods) are maintained. This method could be defined as an external function, as I'm already doing for heaps and binary search trees.

Another example where there's this problem is the TST class. I may also want to add a similar function for DisjointSetsForest.
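
A minimal sketch of the external-function approach (the fields t.keys and t.values and the exact check are hypothetical, for illustration only):

    def has_linear_probing_invariants(t) -> bool:
        # External check, callable both from tests and from assert statements:
        # the keys and values tables must always have the same capacity.
        return len(t.keys) == len(t.values)

    # Usage, e.g. inside a method of LinearProbingHashTable:
    # assert has_linear_probing_invariants(self)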

Searching algorithms in AI

I had already implemented BFS and DFS, even though they are currently removed (temporarily) from this project, since the graph data structure was also removed (because I had to fix a few things with it). I would also like to add other algorithms, like:

  • Uniform-cost search (also called cheapest-first search)

    • This is a variant of the famous Dijkstra's algorithm where, instead of starting with a queue full of nodes, it starts with only one item and inserts new items as they are discovered. This can avoid some "decrease-key" operations. (See the sketch after this list.)

    • Dijkstra's algorithm finds the shortest paths between a source node and all the other nodes in a graph, whereas UCS only finds the shortest path between a source and a destination node.

  • Depth-limited search

  • Bidirectional search

  • Iterative-deepening (depth-first) search

  • Best-first search

    These algorithms are usually used for path finding in combinatorial search. There's also a greedy version of best-first search, where the paths that are predicted to be closer to the goal are chosen first.

  • Beam search

  • A* (not a greedy version of a best-first search)

    • A (A* without admissible heuristics)
  • IDA* (or iterative-deepening A*)

  • Alpha-Beta pruning
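
A minimal sketch of uniform-cost search, as described in the first item of the list above (hedged: the graph is assumed to be a dict mapping each node to a list of (neighbor, edge cost) pairs, with comparable node labels; all names are illustrative):

    import heapq

    def uniform_cost_search(graph: dict, source, destination):
        # Start with only the source in the frontier and insert nodes lazily
        # as they are discovered, instead of using decrease-key operations.
        frontier = [(0, source)]
        explored = set()
        while frontier:
            cost, node = heapq.heappop(frontier)
            if node in explored:
                continue  # stale entry: a cheaper path was already expanded
            if node == destination:
                return cost
            explored.add(node)
            for neighbor, edge_cost in graph.get(node, []):
                if neighbor not in explored:
                    heapq.heappush(frontier, (cost + edge_cost, neighbor))
        return None  # the destination is unreachable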

Local search

Interesting articles

Possibly useful resources

  • Chapter 3.5 of Russell and Norvig.
  • Slides of prof. Gambardella

Inconsistencies between the acceptable values in the key-value pair of a HeapNode when adding a HeapNode to a Heap

I've just noticed that the method search_by_value in the class Heap raises a ValueError if the input val is None.

The problem is that, theoretically, it's possible to add a HeapNode with value None to the heap (either max, min or min-max) using the add method defined in the Heap class: in that same method, if the input is already a HeapNode, no further checks are performed.

Two possible solutions:

  1. Allow HeapNode objects with None values to be added to the Heap.
  2. Make sure that HeapNode objects with None values are not added to the heap, by modifying the add method (see the sketch below).

Note: the tests have not caught this bug!
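
A minimal sketch of solution 2 (the names HeapNode, self.heap and _push_up are assumptions based on the issue's description, not the exact andz code):

    def add(self, x) -> None:
        # Reject HeapNode objects whose value is None, so that
        # search_by_value's ValueError can never refer to a node
        # that was legitimately added.
        if isinstance(x, HeapNode):
            if x.value is None:
                raise ValueError("a HeapNode's value cannot be None")
            node = x
        else:
            node = HeapNode(x)  # hypothetical constructor
        self.heap.append(node)
        self._push_up(len(self.heap) - 1)  # hypothetical sift-up helper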

Current heaps should be called "binary heaps"

So far, what I've actually implemented are binary heaps, but I don't think I've emphasized the fact that these are binary heaps, and not, for example, Fibonacci or binomial heaps.

Optional operations for DisjointSetsForest

  • size (number of elements in the data structure)
  • contains (checks if an element is in the data structure)
  • sets (number of disjoint sets in the data structure: may be useful, for example, to understand how far we are from having just one set, etc.)
  • Erase or clear the data structure
  • Deletion of an element from a set (and from the data structure)
  • Deletion of a whole (disjoint) set
  • Pretty-print a set (similar to how we're pretty-printing a BST)
  • Counting elements in a set (this could also be a property)

Type hints for the return values of some of TST's methods are incorrect or missing

For example, the delete method has a return type hint of TSTNode, but it actually returns the value of the node being deleted, not the node itself. The node should actually never be returned to the user, since it's only useful to the implementation of the data structure.

There's also at least one private method which is missing its return value's type hint...

Should the implementation details of DSForests be hidden from the clients?

Currently, the implementation of DSForests seems to assume that the user knows about DSNode, and thus can access its public fields. Is this a good thing in this case? My first intuition is that the user doesn't care about these implementation details, and therefore they should be hidden. Note: if that's the case, we may also need to change the interface of certain methods, e.g. find, which, instead of returning a DSNode, would return the original value stored in the corresponding representative (see the sketch below).

For example, Kruskal's MST algorithm uses DSForests, but it doesn't really care about implementation details; it only cares about the main functionalities and advantages that DSForests offers:

  1. Find if two elements belong to the same set in amortized constant time
  2. The nature of the DS, i.e. that you have a forest of disjoint sets and you can union them.
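
A minimal sketch of the value-returning interface (a hypothetical wrapper, assuming the existing find returns a DSNode with a value field):

    def find_value(self, x):
        # Hide DSNode from clients: return the representative's original
        # value instead of the internal node object.
        node = self.find(x)  # assumed to return a DSNode (or None)
        return None if node is None else node.value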

Graph data structure

The current design of the graph data structure is very poor. I should do a bit of research on the best ways of implementing a graph data structure, not by just considering a few particular cases: it should be as flexible and general as possible.

  1. In some algorithms that run on graphs, I sometimes need workarounds to provide certain functionality (which functionality?) with the current implementation (which was actually temporarily removed from this repository because of this issue).

  2. I should provide the adjacency-list (linked-list), adjacency-matrix and maybe the edge-list implementations, so that, depending on the problem, we can choose whichever representation suits the situation best.

    1. In the case of an adjacency-matrix representation, should I create a "matrix" data structure first?
    2. Would this create too many dependencies, or could it actually be a good idea because, e.g., this graph data structure could then also be used in numerical computing problems?

    Check this other issue related to the implementation of a matrix data structure.

  3. How should I deal with directed and undirected graphs?

    1. Should these be two separate and independent data structures, or should this be an option of a single data structure?
  4. How should I deal with sparse and dense graphs?

  5. Even though efficiency is not one of the main goals of this project, this data structure should be implemented as efficiently as possible, i.e. known algorithms with optimal or good performance should clearly be preferred over a "random", "custom" implementation.

Possible supported operations

Possibly useful resources

Algorithms under algorithms/greedy/huffamn.py are broken

Since I've changed something in the interface of the HeapNode class, I realized that the code under huffamn.py doesn't work anymore.

I have temporarily removed huffman.py.

I should maybe have started this project using test-driven development, but I was not aware of all of these development processes at the time, and, anyway, I didn't have much time to do it that way.

Inconsistencies in the implementation of BST and RBT

Since RBT derives from BST, it has inherited all of its implementation problems, so I'm just going to talk about the problems in the context of the BST.

Problems

  1. The BST allows duplicate keys. It also supports associating a value with a key, but this probably does not make much sense combined with duplicates: if the user associates a value with a duplicated key, a search may return a different value than the one he's expecting. In other words, the current BST implementation is not really a map, yet it allows you to associate a value with a key, which may not make much sense.

  2. The BST allows you to insert, search and delete elements either by key or by BSTNode, and this can cause a few problems.

    1. When searching or deleting using a BSTNode, the inputted BSTNode...

      1. could be corrupted, and this could break some supposed invariants
      2. might not belong to the tree, and there's no way of detecting this except by searching through all the nodes, which would be too expensive.
  3. The return value of the search and delete operations is a BSTNode, which contains a value. Is it a good idea to expose the internal details to the client?

    1. If I decide not to make this BST implementation a map, then probably the most sensible thing to do when searching and deleting is, respectively, to return a boolean value (to indicate whether the key is in the tree) and to not return anything (since there's no value associated with the key, there's no need to return anything!)

    2. If I decide to make this BST a map, then, when searching and deleting, I can return the corresponding value if the key is in the map, or, e.g., None if it's not.

  4. Currently, BSTNode objects are equal only if they are the same object. This could make sense if we let the user know about BSTNode objects, as is currently the case, but if I decide to hide the implementation details of the BST, i.e. hide BSTNode as much as possible (and I probably should!), then it no longer makes sense to consider BSTNode objects equal only if they are the same object: they should be equal if they have the same key (in the case of a non-map BST).

Possible Solutions

  1. Do not allow duplicates, and thus create a BSTMap!

  2. Or maybe I should have a BST base class which allows duplicates but whose nodes only contain keys (i.e. no values). In this case, I should probably hide all implementation details (like the existence of BSTNode) and adapt the BSTNode implementation accordingly. In this context, two BSTNodes would be equal not if they are the same object, but if they contain the same key (see the sketch below).

    • Later, I could implement a BSTMap which derives from this implementation (?) and which does not allow duplicates (PREFERENCE)!
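
A minimal sketch of the key-based equality from point 4 (a hypothetical key-only BSTNode):

    class BSTNode:
        def __init__(self, key):
            self.key = key
            self.left = self.right = self.parent = None

        def __eq__(self, other):
            # Two nodes are equal if they hold the same key,
            # not only if they are the very same object.
            return isinstance(other, BSTNode) and self.key == other.key

        def __hash__(self):
            return hash(self.key)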

Which type of testing is being used for this project?

The tests currently provided are unit tests, i.e. they only test functions, classes or objects defined in one module: they do not test the interactions and dependencies between modules or packages (which would be integration testing).

Assertion error when checking if a MinMaxHeap is a MinMaxHeap inside the deletion operation

When I was running all the tests locally, using the script ./run_tests.sh, at a certain point an assertion (i.e. assert is_min_max_heap(self)) in the deletion operation was not satisfied. Apparently, this only occurred locally and when I ran all the tests.

In general, there was some strange behavior going on.

Verify that the algorithms for deleting an element from a MinMaxHeap (and also from a MaxHeap and a MinHeap) really work, by thinking about all possible cases!

Why does the Graph class have the field dfs_global_time?

Essentially for convenience. This variable is used, as the name suggests, in the DFS algorithm to keep track of the global time, i.e. it's a counter used to mark when a node is first and last visited. This variable has nothing to do with the Graph class, except that the DFS algorithm runs on a graph. In my opinion, it should be separate from the Graph class, but I haven't yet found a convincing solution to this problem. The only one that comes to my mind right now (and that came to my mind when I first created the Graph class) is to make this variable a module-level global, present only in the file where the DFS algorithm is defined, but I'm still not sure this is a good design either...

Why can't I create an undirected (or directed) graph?

My idea was that a graph is directed by default, and that, to create an undirected graph, you just add each endpoint of a (virtual) undirected edge to the other endpoint's adjacency list.

For example, if A and B are the endpoints of an undirected edge, then you add A to B's adjacency list and B to A's adjacency list. Of course, you need to do this for all edges. For this reason, I decided to create two functions, add_directed_edge and add_undirected_edge, that do this automatically under the hood for you (see the sketch below).

This seems to solve all the problems, but that might not be the case. For example, if you want to know whether a Graph is directed or undirected, you need to check all the (virtual) edges, i.e. all the adjacency lists, etc.

There are also other solutions to this problem, but they could introduce other problems, and so on, making the graph data structure ever more complex, and maybe giving it a worse design.
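
A minimal sketch of the two functions (the adjacency-list representation, a dict mapping each node to a list of neighbors, is an assumption for illustration):

    def add_directed_edge(adj: dict, u, v) -> None:
        # A directed edge appears only in the source's adjacency list.
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])  # make sure v exists as a node

    def add_undirected_edge(adj: dict, u, v) -> None:
        # An undirected edge is simulated with two directed edges.
        add_directed_edge(adj, u, v)
        add_directed_edge(adj, v, u)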

TST's deletion is buggy

After visualizing the behaviour of a TST data structure using this tool, I realized that my implementation of deletion is buggy.

_delete_fix keeps deleting nodes, going up towards the root while u has no children, but it doesn't consider the fact that the current node u may contain a value and thus should not be deleted.

This may not be the only problem! I need to think about all the possible cases very carefully. Have a look at the implementation by galles.

Add tail-recursive versions of the recursive functions

The concept of a tail-recursive function is especially important in functional programming languages. A tail-recursive function can be executed in an optimized way, by reusing the existing stack frame when the recursive call is located in the tail position (see the example after the list below).

In the folder ands/algorithms/recursion, there are a few examples of recursive functions. The idea of this issue is to promote the:

  1. provision of a tail-recursive alternative implementation of the existing recursive functions (which are not already tail-recursive), and

  2. documentation of the functions as tail-recursive or not.
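
As an example of both points, here is a classic recursive function and a tail-recursive rewrite of it (note, as a caveat, that CPython does not perform tail-call optimization, so in Python the rewrite mainly has documentation value):

    def factorial(n: int) -> int:
        # Not tail-recursive: the multiplication happens *after*
        # the recursive call returns.
        return 1 if n == 0 else n * factorial(n - 1)

    def factorial_tail(n: int, accumulator: int = 1) -> int:
        # Tail-recursive: the recursive call is the last operation,
        # and the partial result is carried in the accumulator.
        return accumulator if n == 0 else factorial_tail(n - 1, n * accumulator)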

Resources

Possibly useful resources to implement this

PageRank using the power method

Page Rank

  1. Suppose we have a graph g.
  2. Let n be the number of nodes in this graph.
  3. Let G be the connectivity matrix representing the connections of this graph g.
    1. g_ij = 1, if there's an (outgoing) link from j to i.
    2. g_ij = 0, otherwise.
  4. Let c_j be the out-degree (the number of links going out) of node j,
  5. and let r_j similarly be the in-degree.
  6. Let p be the probability that the random walk of a user follows a specific link.
    1. A typical value for this is p = 0.85.
    2. Then 1 - p is the probability that some arbitrary page is chosen.
    3. Hence δ = (1 - p) / n is the probability that a particular random page (among these arbitrary pages) is chosen.
  7. Let A be an n × n matrix. This is the transition probability (or stochastic) matrix (of the Markov chain) of the graph g.
    1. a_ij = p * (g_ij / c_j) + δ, if c_j != 0

      1. The ratio g_ij / c_j is the proportion of the outgoing links from j to i over all of j's out-links. Note that g_ij could be 0, if there's no link from j to i.
    2. a_ij = 1 / n, if c_j == 0

      1. Assigning 1 / n when c_j == 0 makes sense because, if there are no outgoing links from page j, there's 0 probability of following a specific link from j, so each of the n pages is chosen at random with probability 1 / n.
    3. All elements of A are strictly between 0 and 1.

    4. All the column sums of A are equal to 1.

Power method

  1. Good for matrices which are

    1. large
    2. sparse
    3. well-conditioned???
  2. The method consists in repeating x = Ax (for some initial x_0, e.g. such that ||x_0|| = 1), until the absolute difference between the next x and the current x is "small enough".

    Input: A, x_0, termination condition
    Output: v_k (the iterate when the termination condition is met)
    for k = 1, 2, ... until the termination condition is met do:
        v_temp = A * v_{k-1}
        v_k = v_temp / ||v_temp||
    end

    1. The reason for the repeated normalization is that there is nothing in the definition of an eigenvector preventing our iterates v_k from growing in magnitude, which would indeed happen for large eigenvalues, accelerating the growth of roundoff errors.

    2. The termination condition depends on the purpose of the computation...!?

    3. How do we know that matrix A has an eigenvalue equal to 1? (Answer: A is column-stochastic, i.e. all its column sums are 1, so the Perron–Frobenius theorem tells us that this matrix has an eigenvalue 1.)

      1. If it does, how do we know that it's the dominant eigenvalue and thus the solution that we find corresponds to the dominant eigenvector?

      2. Does this problem have a unique solution x?

      3. With a few adjustments to the basic model, A indeed has an eigenvalue 1, it's the dominant eigenvalue, it has algebraic multiplicity 1 (?), and the problem has a unique positive real solution x. Proof?

      4. In the case of PageRank, A has a dominant eigenvalue equal to 1, but in most cases we don't know the dominant eigenvalue's value. In those cases, we can use the Rayleigh quotient, defined for any given vector v by: rq = (v^T * A * v) / (v^T * v).

        1. If v were an eigenvector of A, then rq would simply be the corresponding eigenvalue. We can update the power method pseudocode above by adding the line λ_1^(k) = v_k^T * A * v_k after the normalization step. Note that we don't have to divide by v_k^T * v_k, since we have already normalized v_k.
    4. An initial approximation to x could be a vector whose entries are 1/n.

    5. The Perron-Frobenius theorem applies to matrices like A:

      1. It says that x = Ax has a non-zero solution, which is unique up to a scaling factor. If the scaling factor is chosen so that ∑_i x_i = 1, then the x we obtain at the end is the state vector of the Markov chain and is also Google's PageRank.
  3. It's the only possible approach for large n, where n is the number of pages.

  4. Matrices G (the connectivity matrix) and A (the transition probability matrix of a Markov chain, formed from G) can, in practice, never be formed explicitly.

  5. To preserve the sparsity of G, we can write A = p*G*D + e*z^T, where

    1. D is the diagonal matrix formed as follows:
      d_jj = 1/c_j, if c_j != 0;
      d_jj = 0, if c_j == 0.

    2. e is a column vector of all 1s,

    3. z is the vector with components:
      z_j = δ, if c_j != 0;
      z_j = 1/n, if c_j == 0.
      Thus the matrix e*z^T accounts for the random choices of web pages that do not follow links.

    4. x = Ax can thus be rewritten as

      x - Ax = 0
      (I - A)x = 0
      (I - (p*G*D + e*z^T))x = 0
      (I - p*G*D - e*z^T)x = 0
      Ix - p*G*Dx - e*z^Tx = 0
      (I - p*G*D)x - e*z^Tx = 0
      (I - p*G*D)x = e*(z^Tx)

      Now let z^Tx = γ, so we have

      (I - p*G*D)x = eγ

      We don't know γ, because it depends on the unknown x.
      Temporarily let γ = 1.
      Let Q = I - p*G*D.
      So we end up with:

      Qx = e

      As long as p is strictly less than one, the coefficient matrix Q is nonsingular. Why???

      Now we can solve the linear system Qx = e for x; since we set γ = 1 arbitrarily, we rescale the solution at the end so that ∑_i x_i = 1.

  6. The power method can also be implemented without actually forming the transition probability (or Markov) matrix A = p*G*D + e*z^T, thus preserving sparsity (see the sketch after this list).

    1. Let B = p*G*D.
    2. Let c be the vector containing the out-degree of each node j, for j = 1..n. That is, it's the vector whose elements are the column sums of G.
    3. Define a vector z, with |z| = |c|, and fill each of its elements as follows:
      1. z_j = (1 - p)/n, if c_j != 0
      2. z_j = 1/n, if c_j == 0
    4. Start with the vector x = e/n.
    5. Repeat x = B*x + e*(z^T*x) until x settles down to several decimal places.
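
A hedged NumPy sketch of this sparsity-preserving iteration (dense arrays for brevity; all names are illustrative, and in a real implementation G would be a scipy.sparse matrix):

    import numpy as np

    def pagerank_power(G: np.ndarray, p: float = 0.85, tol: float = 1e-10) -> np.ndarray:
        # G[i, j] == 1 iff there is a link from page j to page i.
        n = G.shape[0]
        c = G.sum(axis=0)  # out-degrees, i.e. the column sums of G
        d = np.divide(1.0, c, out=np.zeros(n), where=(c != 0))
        B = p * G * d      # equivalent to p*G*D, with D = diag(d)
        z = np.where(c != 0, (1.0 - p) / n, 1.0 / n)
        x = np.full(n, 1.0 / n)  # initial guess: the uniform distribution
        while True:
            # x = B*x + e*(z^T*x); the scalar z @ x broadcasts over the vector e.
            x_next = B @ x + z @ x
            if np.abs(x_next - x).max() < tol:
                return x_next
            x = x_next

Since A is column-stochastic and x_0 sums to 1, every iterate remains a probability vector, so no explicit renormalization is needed in this variant.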

References

Notes

Is the MinPriorityQueue in Queue.py implemented "correctly"?

The MinPriorityQueue is a data structure implemented using a MinHeap and it could potentially be used in different algorithms.

So the idea behind opening this issue is to try to understand whether this implementation can be used in a flexible way: ideally, it should be usable whenever any algorithm needs a "min-priority queue", so it should be as simple as possible (i.e. without many secondary enhancements) and it should be as decoupled as possible from other data structures.
