TAIYŌAI INC.
Data Engineering Trial Task
Objective:
Find, standardize, and continuously update data regarding construction and infrastructure projects and tenders in the state of California.
Part 1: Research and Data Sourcing
Task: Research and identify 5-10 reliable data sources about construction and infrastructure projects and tenders in California.
Methodology: Use a combination of online research and language models (e.g., OpenAI's GPT models) to identify these sources. Explicitly state how and why you used GPT or similar models in your research process.
Part 2: Data Extraction and Standardization
Task: From the provided Table 1 and your own list of sources, suggest methods to scrape data using language-model-based tools such as the OpenAI API, Mistral 7B, Llama 2, or other open-source models.
Requirements:
- Demonstrate how you can build data products (DPs) to scrape data from multiple sources.
- Standardize the scraped data according to the guidelines provided in Table 2.
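As a point of reference, a standardization step might look like the sketch below. Table 2 is not reproduced here, so the target field names (`project_name`, `agency`, `estimated_value_usd`, `bid_due_date`) and the raw input keys are illustrative assumptions, not the actual standard; a submission would replace them with the fields Table 2 defines.

```python
import re
from datetime import datetime

def standardize_record(raw: dict) -> dict:
    """Map one scraped record onto an illustrative standard schema.

    Sources disagree on field names and formats; each mapping below is
    an example of the kind of cleanup a real Table 2 would require.
    All field names here are hypothetical.
    """
    # Normalize a dollar figure like "$1,250,000" to an integer.
    digits = re.sub(r"[^\d]", "", raw.get("est_value", ""))
    estimated_value = int(digits) if digits else None

    # Normalize a US-style date like "03/15/2024" to ISO 8601.
    try:
        bid_due = datetime.strptime(raw.get("due_date", ""), "%m/%d/%Y").date().isoformat()
    except ValueError:
        bid_due = None

    return {
        "project_name": raw.get("title", "").strip(),
        "agency": raw.get("owner", "").strip(),
        "estimated_value_usd": estimated_value,
        "bid_due_date": bid_due,
    }

record = standardize_record({
    "title": "  I-5 Bridge Retrofit ",
    "owner": "Caltrans",
    "est_value": "$1,250,000",
    "due_date": "03/15/2024",
})
```

A per-source function of this shape lets each scraper emit whatever it finds, while a single shared schema enforces the standard downstream.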
Part 3: Automation and Continuous Updating
Task: Propose a system for automating the data scraping and standardization processes.
Details:
- Explain how the data sources will be continuously updated.
- Describe the use of cron jobs or similar scheduling tools for ongoing data updates.
- Ensure your methodology meets production-environment standards.
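For illustration, a cron-based schedule could be as simple as the crontab entry below; the script path and log path are hypothetical, and a production setup would add failure alerting on top of this.

```
# Illustrative crontab entry (paths are hypothetical):
# run the scrape-and-standardize pipeline daily at 02:00,
# appending stdout and stderr to a log file for monitoring.
0 2 * * * /usr/bin/python3 /opt/pipeline/scrape_and_standardize.py >> /var/log/pipeline.log 2>&1
```

Candidates are free to propose alternatives (systemd timers, Airflow, etc.) as long as the monitoring story is addressed.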
Evaluation Criteria
- Scalability: Ability to scrape multiple sources effectively.
- Adherence to Standards: Conformity with the provided data standards; penalties for deviation.
- Automation and Continuity: Quality of the proposal for continuous data updating, including details on cron monitoring and production environment suitability.
Deliverables
Candidates should share a Google Drive folder containing:
- Python Scripts: The actual code used for data scraping and standardization.
- Documentation: Detailed explanations of the scripts and methodologies.
- Sample Datasets: Examples of the data extracted and standardized.
- Production Environment Plan: A document detailing the implementation of cron monitoring and how the system will operate in a production environment.
Notes to Candidates
- Pay close attention to the data standards and ensure your methods are scalable and suitable for a production environment.
- Clearly articulate your use of AI or machine learning models, specifically in the context of data sourcing and any preprocessing tasks.
- Demonstrate a thoughtful approach to continuous data updating and monitoring.