TAIYŌAI INC.
Data Engineering Trial Task
Objective:
Find, standardize, and continuously update data regarding construction and infrastructure projects and tenders in the state of California.
Part 1: Research and Data Sourcing
Task: Research and identify 5-10 reliable data sources about construction and infrastructure projects and tenders in California.
Methodology: Use a combination of online research and language models (e.g., OpenAI's GPT models) to identify these sources. Explicitly state how and why you used GPT or similar models in your research process.
Part 2: Data Extraction and Standardization
Task: From the provided Table 1 and your own list of sources, suggest methods to scrape data using language-model-based tools such as the OpenAI API, Mistral 7B, Llama 2, or other open-source models.
Requirements:
- Demonstrate how you can build data products (DPs) to scrape data from multiple sources.
- Standardize the scraped data according to the guidelines provided in Table 2.
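As a point of reference, a standardization step might look like the sketch below. Table 2 is not reproduced here, so the target field names (`project_name`, `agency`, `estimated_value_usd`, `bid_due_date`) and the raw input keys are illustrative assumptions, not the actual standard; a submission would replace them with the fields Table 2 defines.

```python
import re
from datetime import datetime

def standardize_record(raw: dict) -> dict:
    """Map one scraped record onto an illustrative standard schema.

    Sources disagree on field names and formats; each mapping below is
    an example of the kind of cleanup a real Table 2 would require.
    All field names here are hypothetical.
    """
    # Normalize a dollar figure like "$1,250,000" to an integer.
    digits = re.sub(r"[^\d]", "", raw.get("est_value", ""))
    estimated_value = int(digits) if digits else None

    # Normalize a US-style date like "03/15/2024" to ISO 8601.
    try:
        bid_due = datetime.strptime(raw.get("due_date", ""), "%m/%d/%Y").date().isoformat()
    except ValueError:
        bid_due = None

    return {
        "project_name": raw.get("title", "").strip(),
        "agency": raw.get("owner", "").strip(),
        "estimated_value_usd": estimated_value,
        "bid_due_date": bid_due,
    }

record = standardize_record({
    "title": "  I-5 Bridge Retrofit ",
    "owner": "Caltrans",
    "est_value": "$1,250,000",
    "due_date": "03/15/2024",
})
```

A per-source function of this shape lets each scraper emit whatever it finds, while a single shared schema enforces the standard downstream.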
Part 3: Automation and Continuous Updating
Task: Propose a system for automating the data scraping and standardization processes.
Details:
- Explain how the data sources will be continuously updated.
- Describe the use of cron jobs or similar scheduling tools for ongoing data updates.
- Ensure your methodology meets production-environment standards.
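For illustration, a cron-based schedule could be as simple as the crontab entry below; the script path and log path are hypothetical, and a production setup would add failure alerting on top of this.

```
# Illustrative crontab entry (paths are hypothetical):
# run the scrape-and-standardize pipeline daily at 02:00,
# appending stdout and stderr to a log file for monitoring.
0 2 * * * /usr/bin/python3 /opt/pipeline/scrape_and_standardize.py >> /var/log/pipeline.log 2>&1
```

Candidates are free to propose alternatives (systemd timers, Airflow, etc.) as long as the monitoring story is addressed.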
Evaluation Criteria
- Scalability: Ability to scrape multiple sources effectively.
- Adherence to Standards: Conformity with the provided data standards; penalties for deviation.
- Automation and Continuity: Quality of the proposal for continuous data updating, including details on cron monitoring and production environment suitability.
Deliverables
Candidates should share a Google Drive folder containing:
- Python Scripts: The actual code used for data scraping and standardization.
- Documentation: Detailed explanations of the scripts and methodologies.
- Sample Datasets: Examples of the data extracted and standardized.
- Production Environment Plan: A document detailing the implementation of cron monitoring and how the system will operate in a production environment.
Notes to Candidates
- Pay close attention to the data standards and ensure your methods are scalable and suitable for a production environment.
- Clearly articulate your use of AI or machine learning models, specifically in the context of data sourcing and any preprocessing tasks.
- Demonstrate a thoughtful approach to continuous data updating and monitoring.