Data Analysis in Base Python - Recap

Introduction

In this section, you started by learning about how Python code interacts with files, and how to use the open built-in function to read data from files on disk into Python objects in memory. Then you learned about the CSV and JSON formats for serializing data, and how to use the built-in csv and json Python modules to parse and extract data from files using those formats.

Key Takeaways

Base Python

While there are many popular third-party libraries used for data analysis with Python, you can create simple, powerful analyses with just the built-in language data types and modules. When working with data in base Python:

  • Data is stored in familiar data structures such as lists and dictionaries, which are often nested
  • Analysis is performed using familiar control structures such as for loops and if statements
  • Files are opened using the open built-in function, and CSV or JSON files are parsed using the built-in csv or json modules
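
As a minimal sketch of this style, the example below summarizes a small nested data structure using only a for loop and an if statement. The records list, its name and scores keys, and the threshold of 80 are made-up sample values, not data from this section.

```python
# Hypothetical sample data: a list of dictionaries, each holding a nested list
records = [
    {"name": "Ada", "scores": [90, 85]},
    {"name": "Grace", "scores": [70, 65]},
    {"name": "Linus", "scores": [88, 92]},
]

# Collect the names of everyone whose average score is at least 80
high_scorers = []
for record in records:
    average = sum(record["scores"]) / len(record["scores"])
    if average >= 80:
        high_scorers.append(record["name"])

print(high_scorers)  # ['Ada', 'Linus']
```

The same pattern (loop over rows, test a condition, accumulate a result) applies once the records come from a parsed file rather than from a literal in the code.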

Loading Data from Files

As you were initially learning to work with Python, you often used variables that were declared in the code of a Jupyter Notebook, manipulated by that code, and then ceased to exist once the notebook was shut down. These are variables "in memory". In many cases data scientists want to work with larger data files that persist after the code has finished running. This is data "on disk". Python has built-in functionality to move data between variables in memory and files on disk.

Key takeaways:

  • Variables in memory are the current active variables in your code, which go away if you shut down the kernel or shut down your computer
  • Files on disk are the things you see printed out when you type ls in the terminal, which persist when you shut down the kernel or shut down your computer
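  • File objects in Python have no built-in concept of "editing" a line in place. You are typically either reading from the first line to the last line of the file (sometimes all at once), or writing lines one after another. To change a file, you usually read its contents, modify the data in memory, and write the result back out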
  • In Python, the typical way to interact with these files is to use the open built-in function
    • The first step is identifying the path to the file, which is a string. It can be an absolute path, or a relative path from the current working directory of the running code to the location of the file on the computer. It is represented as path_to_file in the examples below
    • To open a file for reading, that looks like:
      with open(path_to_file) as f:
          # Read data from file object f
    • To open a file for writing, that looks like:
      with open(path_to_file, "w") as f:
          # Write data to the file object f
    • The with...: syntax means that the file is automatically closed after the with block ends. If you use the syntax f = open(path_to_file) instead, you also need to call f.close() when you are finished working with the file
    • You will find that some third-party libraries handle opening and closing the files for you. In this case, you only need to identify the file path and do not need to call open and .close yourself
    • By default, the open function works in text mode, which suits files encoded as plain text, such as an unstructured .txt file, CSV file, or JSON file. If you are working with a binary file, such as a JPEG image, you need to specify a mode of "rb" to read or "wb" to write when you call open
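
The reading, writing, and binary modes described above can be sketched as one round trip. This is an illustration only: the file name example.txt and the temporary directory are assumptions chosen so the snippet runs anywhere without touching existing files.

```python
import os
import tempfile

# Hypothetical file location: a temporary directory keeps the example self-contained
path_to_file = os.path.join(tempfile.mkdtemp(), "example.txt")

# Write two lines of text; "w" creates the file or overwrites an existing one
with open(path_to_file, "w") as f:
    f.write("first line\n")
    f.write("second line\n")

# Read the file back one line at a time, stripping the trailing newline
lines = []
with open(path_to_file) as f:
    for line in f:
        lines.append(line.rstrip("\n"))

# Read the same file as raw bytes using the "rb" mode
with open(path_to_file, "rb") as f:
    raw = f.read()

print(lines)      # ['first line', 'second line']
print(type(raw))  # <class 'bytes'>
```

Each with block closes its file as soon as the block ends, which is why the file can be reopened immediately in a different mode.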

CSV

CSV stands for "comma-separated values". It is possibly the most common data serialization format used by data scientists, and we will use it very frequently in this course.

Key takeaways:

  • CSV is a file format. The file names typically end with the extension .csv
  • CSV is a plain text format. This means that it is encoded as text that is readable without specialized software, and you can explore the contents using VS Code, Vim, or any other general-purpose editor
  • CSV is a delimited format. This means that there are characters in the text of the file that are intended to separate the pieces of data from one another
    • As you might assume from the name, the most common delimiter is a comma. A line of comma-delimited CSV content representing four pieces of data might look like:
      10,2,1.5,"Firstname Lastname"
      
    • It is also possible to use some other delimiter. One of the most common is a tab \t, to the extent that there is a name "TSV" (tab-separated values) for that specific kind of file. You could also use a pipe |, semicolon ;, or really any other character as long as the markup is consistent and the code knows how to interpret it. Technically all of these files would still be referred to as CSVs, even though the delimiter is not a comma
  • In Python, we can use the csv module to parse data from CSV files (see the csv module documentation)
    • If the data file does not contain headings, the csv.reader function works well. It returns a reader iterable that produces a Python list of strings for each row of the file. There is a matching csv.writer function for writing data to files rather than reading data from files
    • If the data file does contain headings, the csv.DictReader class works well. It is a reader iterable that produces a Python dictionary for each row of the file, using the headings as keys. There is a matching csv.DictWriter class for writing
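
A minimal sketch of both readers follows. The CSV content is made up for illustration, and io.StringIO stands in for a file object returned by open so that the snippet needs no file on disk.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file opened with open()
csv_text = 'name,age\n"Lastname, Firstname",30\nOther Person,25\n'

# csv.reader yields one list of strings per row, including the header row
rows = list(csv.reader(io.StringIO(csv_text)))

# csv.DictReader uses the first row as keys and yields one dictionary per row
records = list(csv.DictReader(io.StringIO(csv_text)))

# The delimiter argument handles tab-separated (or otherwise delimited) content
tsv_text = "name\tage\nOther Person\t25\n"
tsv_records = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

print(rows[1])                # ['Lastname, Firstname', '30']
print(records[0]["name"])     # Lastname, Firstname
print(tsv_records[0]["age"])  # 25
```

Note that every value comes back as a string, so numeric columns need an explicit conversion such as int or float before any arithmetic. When reading from a real file, the csv documentation recommends opening it with newline="".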

JSON

JSON stands for "JavaScript Object Notation". It is a widely adopted format for storing and transferring data between applications, particularly on the web.

Key takeaways:

  • JSON is a file format. The file names typically end with the extension .json
  • JSON is a plain text format, like CSV
  • The markup of JSON is more complex than that of CSV. While it typically does contain comma delimiters, it also uses square brackets ([ and ]) for lists and curly braces ({ and }) for collections of key-value pairs
    • For example, the same line represented as CSV above might look like this in JSON:
      [10, 2, 1.5, "Firstname Lastname"]
      
    • Unlike CSV (which mainly works for "flat" tabular data) it is also possible to store nested data with JSON. For example, this comes from the json docs:
      ["foo", {"bar": ["baz", null, 1.0, 2]}]
      
  • In Python, we can use the json module to parse data from JSON files (see the json module documentation)
    • Typically you will want to load the entire file contents at once using json.load
    • You can also use json.dump to serialize an object in memory and write it to a file on disk
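
The json.dump and json.load calls above can be sketched as a round trip using the nested example from the json docs. The file name example.json and the temporary directory are assumptions made so the snippet is self-contained.

```python
import json
import os
import tempfile

# Hypothetical file location in a temporary directory
path_to_file = os.path.join(tempfile.mkdtemp(), "example.json")

# The nested example from the json docs, written as Python objects
# (Python's None is serialized as JSON null)
data = ["foo", {"bar": ["baz", None, 1.0, 2]}]

# Serialize the object in memory and write it to a file on disk
with open(path_to_file, "w") as f:
    json.dump(data, f)

# Parse the entire file back into Python lists and dictionaries
with open(path_to_file) as f:
    loaded = json.load(f)

print(loaded[1]["bar"])  # ['baz', None, 1.0, 2]
```

Unlike the csv module, json restores numbers, nulls, and nested structure directly, so the loaded object here compares equal to the original data.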

Conclusion

A major part of data science is loading the data, and you just learned the fundamental building blocks of how to do this with Python! In future sections we will cover additional libraries and data formats, but you already have the knowledge to dig up interesting data sets and use Python to answer questions about the data.

