
comp3005-project-template's Introduction

COMP3005 Queries Code (ver 1.3)

The following queries.py file is the template code for COMP3005 Database Project 1, which will be used for the autograder.

This README is an important guide to effectively using this repository and finishing this project.

It is important that you THOROUGHLY read this README to properly submit your Project without penalties.

Updates (04/25/2024)

- Added a try/except clause around the cursor execution of each query.

Starter Code:

  • queries.py
    • This will be the file you will write your queries in AND submit.
  • dbexport.sql
    • This is an example of an exported database. It can be used for testing, by both you and me.

Modules and Dependencies:

This project, and in turn the autograder and starter code, uses psycopg3 in an Ubuntu 22.04 Linux environment.

Task:

As per the project guidelines,

"Design a database that stores a soccer events dataset spanning multiple competitions and seasons. The provided dataset is in JSON format and can be downloaded from https://github.com/statsbomb/open-data/tree/0067cae166a56aa80b2ef18f61e16158d6a7359a1. The documentation of the dataset is also available in the above URL. After designing the database, you need to import the data from the JSON files into your database."

Once your database has been designed, you are then tasked with exporting it to an .sql file named "dbexport" (i.e., dbexport.sql) - this can be accomplished using pg_dump; see the example below.
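For example (an illustration only; adjust names and options to your setup), assuming the default postgres user and a database named project_database, a plain-format, UTF-8 export could be produced with:

pg_dump -U postgres -h localhost -p 5432 --encoding=UTF8 --format=p project_database > dbexport.sql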

Once the previous two steps are accomplished, your task is to write your SQL queries ONLY in the prompted space inside each Q_# method (where # is the question number).

This task must be completed in an Ubuntu Linux environment (see Modules and Dependencies above).

Testing:

The queries.py file and your exported dbexport.sql file should be within the same directory.

The queries.py file includes the following code snippet,

root_database_name = "project_database"
query_database_name = "query_database"
db_username = 'postgres'
db_password = '1234'
db_host = 'localhost'
db_port = '5432'

To briefly explain, these variables are used to connect to the root database named "project_database" and to your query-execution database named "query_database". For each query, the code creates a database named "query_database", imports your dbexport.sql file into it, executes the query, and then drops the database (so that no alterations carry over and affect later results). Two databases are needed because a connection cannot drop the database it is currently connected to - hence one root connection and one query-execution database. A simplified sketch of this flow is shown below.
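The following is a simplified sketch of that flow for your own understanding - it is NOT the template's actual code, and the placeholder query is an assumption. It assumes psycopg3, the psql client, and the default connection values above.

import os
import subprocess
import psycopg

# Root connection; autocommit is required for CREATE/DROP DATABASE.
root = psycopg.connect(dbname="project_database", user="postgres", password="1234",
                       host="localhost", port="5432", autocommit=True)
with root.cursor() as cur:
    cur.execute("CREATE DATABASE query_database")                  # 1. create the query database

    env = os.environ.copy()
    env["PGPASSWORD"] = "1234"
    subprocess.run("psql -h localhost -U postgres -d query_database -f dbexport.sql",
                   shell=True, check=True, env=env)                # 2. import your dump

    with psycopg.connect(dbname="query_database", user="postgres", password="1234",
                         host="localhost", port="5432") as qconn:
        with qconn.cursor() as qcur:
            qcur.execute("SELECT 1")                               # 3. run a query (placeholder)
            print(qcur.fetchone())

    cur.execute("DROP DATABASE query_database")                    # 4. drop it so later queries start clean
root.close()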

You may change these values to test on your end, but under no circumstances should the final deliverable contain values different from these initial ones. Mismatched values will result in penalties applied to your final submission.

Expected Output when Testing:

While testing, the output consists of your query times.

INC simply means incomplete.

Any fatal error in a query will be reported to you before the times are printed.

Warning:

As the autograder also connects to your databases, to reiterate: any change to the initial values of the connection variables will result in your code submission being penalized. You may change these values for your own testing purposes (e.g., if you have a different password), but do so at your own discretion.

  • What else will PENALIZE your submission?
    • Any additional submissions in your repository.
    • Other print statements.
    • Outrageously large json_loader folders - that is, do not include the entire dataset in your folder.
    • Other alterations to code other than the query executions (please view ACADEMIC INTEGRITY below).

Final Deliverable:

Your source code file(s) that map and load the existing JSON dataset from the JSON files into your database. This code must be stored in a directory named "json_loader". Therefore, in your submission repository, you are only submitting the script "queries.py", your dbexport.sql, ".gitattributes" if you track dbexport.sql with Git LFS, and the "json_loader" directory. Any additional submissions will penalize the entire code submission.

GitHub restricts file size: any file exceeding 100.00 MB simply cannot be added to your repository - which, in this case, likely means your dbexport.sql. Luckily, you can use Git LFS to your advantage. Git LFS (Large File Storage) allows large files to be pushed by storing small reference pointers to the data - this is beneficial for both you and the grader.

Steps:

  • Install git and git lfs in your Linux terminal using the following commands.

    • sudo apt-get install git-all
    • curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    • sudo apt-get install git-lfs

    Next, you will set up your GitHub username and email,

    • git config --global user.name "Your Name"
    • git config --global user.email "Your Email"
  • Have a directory based on your submission; in other words, have a folder on your system containing all the files that will be in your repository submission. This is easier and more organized for you, and it also lets you run the git commands without worrying about affecting anything outside of it.

[screenshot]

  • Open a terminal in this directory, and then run the following line; this will create a HIDDEN .git folder in your folder:
    • git init

[screenshot]

  • Then, run the following line to track your dbexport.sql as a smaller LFS reference file (see the .gitattributes note below):
    • git lfs track "*.sql"

[screenshot]
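    After running the track command, Git LFS records the rule in a .gitattributes file at the root of your folder. Its contents should look roughly like the line below - make sure .gitattributes gets added and committed along with your other files:

    *.sql filter=lfs diff=lfs merge=lfs -text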

  • Once this is done, you can then run the following commands in your terminal.

    • git branch -M main
    • git remote add origin "your submission .git link"
    • git add "your files that you will add to your submission"
      • Example: git add queries.py
  • Once you add your files, you can check your additions to the commit by doing git status

    • BEFORE

    [screenshot]

    • AFTER ADDING

    [screenshot]

  • Then, you can commit using git commit -m "Commit Message"

  • And then finally, you can push to your repository, git push -u origin main

  • The terminal will then prompt you for your GitHub username and password; however, as of August 13th, 2021, GitHub no longer accepts account passwords for terminal pushes. Therefore, please follow GitHub's documentation on creating a personal access token to set up your own authentication token.

  • It will loop and continuously ask for your username and password depending on how many files you are pushing; see the optional workaround below.
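    If the repeated prompts become tedious, one optional workaround (not an official step of this guide) is to let Git cache your credentials after the first successful prompt:

    • git config --global credential.helper store

    Note that "store" keeps the token in plain text in your home directory, so only use it on your own machine.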

Your submission should then look like the following (if you used Git LFS):

[screenshot]

  • If your database export is less than 100 MB and you didn't track it with LFS, your repository would look like the picture above, just without the .gitattributes.

Before You Submit:

It is imperative that you test your queries.py in your Linux environment with your database before submitting it. pgAdmin4 is a good tool to use for testing your queries, yet it is NOT case-sensitive, whereas the autograder (and psql) is.

The Linux environment's locale encoding is UTF-8 - thus, ensure that your database is exported in UTF-8 (see the quick check below).
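As a quick sanity check (not an official requirement), a plain-format pg_dump file records its encoding near the top, so you can grep for it:

head -n 20 dbexport.sql | grep client_encoding
# should print: SET client_encoding = 'UTF8';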

Ensure you only have the ESSENTIAL files and folders. These are your json_loader directory, queries.py, dbexport.sql, and your .gitattributes if needed.

Bugs and Questions:

If you run into any fatal errors or bugs, please consult the closed issues first, as your problem might already have been solved. If it hasn't, and/or you also have questions, please feel free to open an issue! If need be, you can also email me at [email protected] - although I will respond more quickly to GitHub issues.

CARLETON ACADEMIC INTEGRITY

Any alterations to the code, such as modifying the time, will be flagged for suspicion of cheating - and thus will be reviewed by the staff and, if need be, the Dean. To review the Integrity Violation Attributes of Carleton University, please view https://carleton.ca/registrar/academic-integrity/


comp3005-project-template's Issues

Benchmarking queries internally to PSQL instead of through psycopg

Hi, would it be possible to switch the benchmark from just wrapping the single psycopg execution in time.time() to using an in-SQL benchmark? The one I've been using for tests is:

CREATE OR REPLACE FUNCTION bench(query TEXT, iterations INTEGER = 500)
RETURNS TABLE(median FLOAT) AS $$
DECLARE
  _start TIMESTAMPTZ;
  _end TIMESTAMPTZ;
  _delta DOUBLE PRECISION;
BEGIN
  CREATE TEMP TABLE IF NOT EXISTS _bench_results (
      elapsed DOUBLE PRECISION
  );

  -- Warm the cache
  FOR i IN 1..5 LOOP
    EXECUTE query;
  END LOOP;

  -- Run test and collect elapsed time into _bench_results table
  FOR i IN 1..iterations LOOP
    _start = clock_timestamp();
    EXECUTE query;
    _end = clock_timestamp();
    _delta = 1000 * ( extract(epoch from _end) - extract(epoch from _start) );
    INSERT INTO _bench_results VALUES (_delta);
  END LOOP;

  RETURN QUERY SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY elapsed) FROM _bench_results;
  DROP TABLE IF EXISTS _bench_results;

END
$$
LANGUAGE plpgsql;

then executing with:

cursor.execute(f"SELECT median FROM bench(%s)", (query,))
result = cursor.fetchone()
print(f"Query #{str(i).zfill(2)} took {round(result[0], 2)}ms")

The benefit is that a single execution time spiking because of some Python performance annoyance isn't possible, because the reported time is derived from multiple executions run inside the SQL engine.

Then, for the actual query-correctness results, the regular execution can still be done and written to a CSV.

Connection error

Should we change the dbname, password, host and port in queries.py?
There are three places where these parameters are used.
If we don't change these default settings, an error will occur:
connection failed: FATAL: password authentication failed for user "postgres"
connection to server at "127.0.0.1", port 5432 failed: FATAL: password authentication failed for user "postgres"

Error Running Queries with no explanation

Dear Gabriel.

I'm having an issue running query number 1. When I run the queries on the created database using pgAdmin, it works with absolutely no problem: no errors, no warnings.

The moment I insert the query into the queries.py file, it tells me that the "match" table does not exist. This is impossible as it exists in the DB, and I ran the queries in PgAdmin with no issue as described before.

The DB name is project_database, as was instructed, and I am running the code on a VM with Ubuntu, everything else is working.

Not sure what to do here; is there a step I am missing? I have inserted my query directly into the string for the query in queries.py.

Any help would be appreciated; I am very confused. Again, the query functions fine in pgAdmin.

Version error on compressed dbexport.sql

Hey Gabe, for our final project submission we compressed the dbexport as instructed; however, when attempting to run it on the course VM, I run into this error:
[screenshot]

None of our groupmates really understand how GitHub works in this capacity; do you know what this error means or how we can fix it?

Role of dbexport.sql

Just to make sure I'm doing it correctly:

  • I've created an empty database called "project_database"
  • then, exported the main database (populated using the JSON files) to a file called query_database
  • then deleted the dbexport.sql file and replaced "dbexport.sql" in load_database(cursor, conn) with "query_database"
  • then added queries to Q_n functions

The third step isn't mentioned in README or project document but I found it necessary. Is this how it should be?

queries.py usage

Hello,

I have been attempting to test my queries with the provided queries.py in a Linux environment. Although the work I have done across the term was done entirely on Windows, I believe I have set up psycopg3 and PostgreSQL correctly on an Ubuntu VM; however, I am consistently running into problems running the file. The current issue I am having is an authentication error, despite modifying the passwords in the queries.py file to those of my previous databases (they would be changed back to '1234' before submission). In case this is relevant, the database setup script, queries.py, and the exported database were all made/worked on in Windows and later transferred to Linux. Any guidance on how to effectively execute the file, or any essential knowledge on the setup and usage of PostgreSQL on Linux that might easily be missed, would be appreciated.

Thanks

Error occurs when referencing path to dbexport.sql

I'm running into errors when dbexport.sql is run by the script because my project path has parentheses in it. Would you be able to add quotes around the path to avoid this?

# BEFORE
command = f'psql -h {host} -U {user} -d {query_database_name} -a -f {os.path.join(dir_path, "dbexport.sql")}'

# AFTER
command = f'psql -h {host} -U {user} -d {query_database_name} -a -f \'{os.path.join(dir_path, "dbexport.sql")}\''

Thanks!

queries.py not connecting to tables

Hello!

I've been trying to run the queries.py file but have been receiving:
[ERROR] Error getting time. current transaction is aborted, commands ignored until end of transaction block

After adding a debug statement to find the exception I received the output:
relation "players" does not exist
LINE 3: FROM players p
^
[ERROR] Error getting time.
current transaction is aborted, commands ignored until end of transaction block

I have a database named "project_database" that contains the players table. Additionally, I ran the query in PostgreSQL and it ran with no errors. I was wondering if there is any reason why the Python file isn't seeing the table.

Thank you!

CSV Encoding

Could you modify the write_csv function in the autograder with the following:

# with open(filename, 'w', newline='') as csvfile:
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:

The default for me is not UTF-8. Without the change I was getting errors for all non-ASCII names.

Thanks

GitHub submission format & other questions

Hi there,
Just wanted to clarify a few things.

  1. My .sql file is larger than GitHub's maximum 100 MB single-file limit, so I had to compress it. Can I assume that the TA will unzip it before running the autograder?
  2. The README states that any files in our submission other than queries.py, dbexport.sql, and the json_loader folder with our database creation script inside will void the entire submission. But we also need to submit a project report PDF and diagram image files for the project. Are we just intended to put our code-related files in a sub-folder in our repo or something?
  3. Does queries.py judge query correctness? It gives us runtimes but I'm not sure if the speeds are actually good or if the queries are returning as expected.

Thanks for your time,
Joey V.

queries.py: System environment variables not passed to subprocess.run(), psql is not a recognized command

When running queries.py I was getting the error: "'psql' is not recognized as an internal or external command, operable program or batch file." I believe this is because the environment variables were not being used when initializing the subprocess.

To solve this issue, I've changed the code from:

env = {'PGPASSWORD': password}
subprocess.run(command, shell=True, check=True, env=env)

To:

env = os.environ.copy()
env['PGPASSWORD'] = password
subprocess.run(command, shell=True, check=True, env=env)

This resolves this issue on my machine.

pgAdmin total failure on course VM

Hey, just posting this here: having some issues running pgAdmin4 on the VM.

It was working fine yesterday, but now I get this error when I run pgadmin4:
[screenshot]

When I check my disk usage using df -h, it tells me that 0 bytes are free out of 9.8 GB of space (9.3 GB used). This makes no sense and was not the case just a few short hours ago.

I don't understand why pgAdmin just randomly crapped out on me, but now I can't run anything; I can't even run the autograder or anything else on my system, and all connections are being rejected.

[screenshot]

Do you know any way to fix this? I can't find anything online. I'm thinking it has something to do with temp files being permanently stored by mistake? I don't know anything about Linux, so I really am not sure how to fix this.

thanks,

Ben Seguin

duplicating values with each query

It seems like when the queries are being run, some of the values in the database are being duplicated. When looking at the outputs in the CSV files for each query and comparing them to the tables produced when I run the queries in pgAdmin4 on my own test database (which I set up with the same dbexport.sql dump file), the values are multiplied by the query number. For example, the values in Q_2.csv are twice the values of the output in pgAdmin4 on my test database, the values in Q_3.csv are three times the values, and the values in Q_4.csv are four times the values compared to my test database. I assume this might be an error somewhere in the queries.py file, so I wanted to mention it.

Database fails to be deleted due to dead connection

Line 45 of queries.py forwards cursor and conn to drop_database. However, the cursor that's passed in is the same one that's initialized at the top level under __name__ == "__main__". Its corresponding conn is closed on line 56 of queries.py. This causes any follow-up usage of cursor (i.e., on the next attempt to drop & create the database) to fail with the error 'the connection is closed'.

I can PR a fixed copy if you'd like, but it should be as simple as localizing the connection and cursor to the function where they're used.
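For reference, a hypothetical sketch of that fix (the signature and defaults here are illustrative, not the template's actual code):

import psycopg

def drop_database(name="query_database"):
    # Open a short-lived root connection just for the drop, instead of reusing
    # the top-level cursor whose connection may already have been closed.
    with psycopg.connect(dbname="project_database", user="postgres", password="1234",
                         host="localhost", port="5432", autocommit=True) as conn:
        with conn.cursor() as cur:
            cur.execute(f"DROP DATABASE IF EXISTS {name}")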

[ERROR] Error Getting Time

Hi Gabriel,

I've been stuck on the following error for a little bit and had a question. After running queries.py, I get the following:
[screenshot]
This error message is vague, so I was wondering if it would be acceptable to add a line of code to print the exception in the except block in get_time(). This is only so I can debug why I can't get the autograder to work, and I wanted to ask before making any changes because of the Academic Integrity policy. If not, it would be great if you had an idea of what could be causing this; however, I'm assuming it is something within my code.

Thanks,
Steven

Can't Push From Remote

Hi there,
I have been attempting to push to main from the Ubuntu terminal using "git push -u origin main" as instructed, but nothing happens. The "git status" command returns everything correctly, and I know the token I'm using is right because I can pull just fine. However, "git push -u origin main" just returns nothing, then immediately prompts me for my user/password again in an infinite loop. Any ideas?

Thanks,
Joey V.

PSQL error

I ran into the same trouble as another student who posted a thread in the Project 1 Q&A. The professor told him he needed to submit an issue here, but he didn't post it, so I will pass on what he said:
"
Hello, I have a few concerns.

1. My queries.py, when run, contains the wrong password, so there's an incorrect password error when I run the script, as it is set to 1234; however, this works when I change the password to postgres.

2. When I use the following command as shown in the spec:
C:\Program Files\PostgreSQL\13\bin>pg_dump.exe --file "C:\Users\user\Desktop\db_dump.sql" --host "localhost" --port "5433" --username "postgres" --verbose --format=p "postgres"
it doesn't work, but if I change the port to 5432 it works, except the db_dump.sql only contains header information and not the database schema information. However, this command worked for me:
pg_dump.exe --file "C:\Users\sheri\Documents\db_dump.sql" --host "localhost" --port "5432" --username "postgres" --verbose --format=p "project_database"

3. When I've made these changes and I run queries.py, it says "internal or external command psql not found". I fixed this by adding psql to my environment variables on my machine, but it still doesn't work and gives that error even though I can run psql anywhere on my machine.

4. Even though I shouldn't modify queries.py, I modified this line so the psql command could work and I could test it:
psql_path = r'C:\Program Files\PostgreSQL\16\bin\psql.exe'
command = f'"{psql_path}" -h {host} -U {user} -d "query_database" -a -f "{os.path.join(dir_path, "dbexport.sql")}"'
print(command)
env = {'PGPASSWORD': password}
subprocess.run(command, shell=True, check=True, env=env)
And now psql works when I run it; however, I run into this error:
[screenshot]
Some clarification to these issues or regarding the queries.py and the autograder would be much appreciated.
"

I'm getting the error "connection failure: Connection refused Is the server running on that host and accepting TCP/IP connections?"

Hi,
I posted this question in the Q&A but read that I should ask queries.py-related questions here.
Basically, I was able to insert data into the project_database we created using psycopg2.

However, I am now trying to run the queries.py file in a virtual Linux environment to test my queries and receiving the following error:

connection failure: Connection refused Is the server running on that host and accepting TCP/IP connections?

I have not changed any of the code except for entering the queries in their respective locations, and the name of our database is project_database. Also, I have not been able to change the password for the user 'postgres' to '1234', so I have tried changing the db_password value to my own password for testing purposes.

queries.py Full of Errors

Hi there,
I have been consistently running into problem after problem attempting to run queries.py, and now I am completely stuck.

Here is what I have done so far:

  1. Made a psycopg3 script that creates a db "project_database" of my own design, then populates it with the data from the JSON files.
  2. Found out that "queries.py" doesn't work in Windows. Entered a Ubuntu environment from one of my class' VMs, installed postgreSQL & pgAdmin.
  3. Ran my script and confirmed that the database is correct and populated using pgAdmin.
  4. Exported the db to a .sql file using "pg_dump -U postgres -h localhost project_database >> dbexport.sql" (inside Ubuntu to get UTF-8 encoding, which was a whole issue in itself). The result is a whopping 180 MB in size, causing more problems related to my VM's allocated disk space as well as GitHub's maximum single-file size.
  5. Found a way to run "queries.py" inside a shared folder.

Finally, queries.py actually runs, but now it is riddled with errors. I can see it going over all of my tables and all of the data using the correct names, but I can see a bunch of error prints along the way. I have attached an image of the end of the print.

I apologize for the harsh tone, but I have spent more time just trying to understand what is expected from me and getting the autograder to work than actually doing the project, which I think is a major problem. Please let me know what I can do so I can finally get started on queries and help my groupmates understand how to start.

Thanks for your time,
Joey V.
