dataengineering-task's Introduction

TAIYŌAI INC.

Data Engineering Trial Task

Objective:

Find, standardize, and continuously update data regarding construction and infrastructure projects and tenders in the state of California.

Part 1: Research and Data Sourcing

Task: Research and identify 5-10 reliable data sources about construction and infrastructure projects and tenders in California.

Methodology: Use a combination of online research and language models (e.g., OpenAI's GPT models) to identify these sources. Explicitly state how and why you used GPT or similar models in your research process.

Part 2: Data Extraction and Standardization

Task: Using the sources in the provided Table 1 and your own list, propose methods for scraping data with language-model-based tools such as the OpenAI API, Mistral 7B, Llama 2, or other open-source models.

Requirements:

  • Demonstrate how you can build data products (DPs) to scrape data from multiple sources.
  • Standardize the scraped data according to the guidelines provided in Table 2.
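
As a rough illustration of the standardization requirement above, the sketch below maps raw records from different sources onto one shared record shape. Table 2 is not reproduced here, so the field names (`title`, `published_date`, `budget_usd`), the accepted date formats, and the per-source `field_map` are all hypothetical placeholders and should be replaced with the actual Table 2 guidelines.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Dict, Optional

# Hypothetical input date formats; extend per source as needed.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


@dataclass
class StandardRecord:
    """Hypothetical target schema; the real one comes from Table 2."""
    source: str
    title: str
    published_date: Optional[str]  # ISO 8601 date, or None if unparseable
    budget_usd: Optional[float]


def parse_date(raw: Optional[str]) -> Optional[str]:
    """Try each known format and return an ISO 8601 date string."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_budget(raw: Optional[str]) -> Optional[float]:
    """Strip currency symbols and thousands separators, then convert."""
    if not raw:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def standardize(source: str, raw: Dict[str, str],
                field_map: Dict[str, str]) -> StandardRecord:
    """Map one raw record to the standard shape.

    field_map maps standard field names to that source's own column names.
    """
    def get(field: str) -> Optional[str]:
        return raw.get(field_map.get(field, ""))

    return StandardRecord(
        source=source,
        title=(get("title") or "").strip(),
        published_date=parse_date(get("published_date")),
        budget_usd=parse_budget(get("budget_usd")),
    )
```

Keeping one `field_map` per source means that adding a new source is mostly a configuration change rather than new parsing code, which is one way to address the scalability criterion.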

Part 3: Automation and Continuous Updating

Task: Propose a system for automating the data scraping and standardization processes.

Details:

Explain how the data sources will be continuously updated.

Describe the use of cron jobs or similar scheduling tools for ongoing data updates. Ensure your methodology meets the standards of a production environment.
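
One possible shape for a cron-driven update job is sketched below: an entry point that retries a failing task a few times, logs each attempt, and returns a non-zero exit code on final failure so the scheduler or a monitor can detect it. The function names, the retry counts, and the crontab line in the comment (including its paths) are assumptions made for illustration, not part of the task description.

```python
import logging
import time
from typing import Callable

# Example crontab entry (hypothetical paths), running daily at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/pipeline/run_update.py >> /var/log/pipeline.log 2>&1

logger = logging.getLogger("update_job")


def run_with_retries(task: Callable[[], int], attempts: int = 3,
                     delay_seconds: float = 0.0) -> int:
    """Run task until it succeeds or attempts are exhausted.

    task is expected to return the number of records it processed.
    """
    for attempt in range(1, attempts + 1):
        try:
            count = task()
            logger.info("attempt %d succeeded, %d records", attempt, count)
            return count
        except Exception:
            logger.exception("attempt %d failed", attempt)
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"task failed after {attempts} attempts")


def main(task: Callable[[], int]) -> int:
    """Return a process exit code: 0 on success, 1 on failure."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    try:
        run_with_retries(task)
    except RuntimeError:
        return 1
    return 0
```

In a real script, the value returned by `main` would be passed to `sys.exit` so that cron, or whatever wraps it, sees the failure.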

Evaluation Criteria

  • Scalability: Ability to scrape multiple sources effectively.
  • Adherence to Standards: Conformity with the provided data standards; deviations will be penalized.
  • Automation and Continuity: Quality of the proposal for continuous data updating, including details on cron monitoring and production environment suitability.
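
For the cron monitoring mentioned above, one simple approach is a staleness check: record the time of each source's last successful run and flag any source that has not succeeded recently. The sketch below assumes those timestamps are already available as a dictionary; the source names and the 26-hour threshold (a daily job plus some slack) are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_stale_sources(last_success: Dict[str, datetime], now: datetime,
                       max_age: timedelta = timedelta(hours=26)) -> List[str]:
    """Return, sorted by name, the sources whose last success is older than max_age."""
    return sorted(
        name for name, finished_at in last_success.items()
        if now - finished_at > max_age
    )


# Usage with made-up timestamps:
now = datetime(2024, 3, 16, 12, 0, tzinfo=timezone.utc)
runs = {
    "source_a": datetime(2024, 3, 16, 2, 0, tzinfo=timezone.utc),  # 10 hours ago
    "source_b": datetime(2024, 3, 14, 2, 0, tzinfo=timezone.utc),  # 58 hours ago
}
stale = find_stale_sources(runs, now)  # ["source_b"]
```

A check like this can itself run on a schedule and send an alert when the returned list is non-empty.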

Deliverables

Candidates should share a Google Drive folder containing:

  1. Python Scripts: The actual code used for data scraping and standardization.
  2. Documentation: Detailed explanations of the scripts and methodologies.
  3. Sample Datasets: Examples of the data extracted and standardized.
  4. Production Environment Plan: A document detailing the implementation of cron monitoring and how the system will operate in a production environment.

Notes to Candidates

  • Pay close attention to the data standards and ensure your methods are scalable and suitable for a production environment.
  • Clearly articulate your use of AI or machine learning models, specifically in the context of data sourcing and any preprocessing tasks.
  • Demonstrate a thoughtful approach to continuous data updating and monitoring.
