Code Monkey home page Code Monkey logo

weave_buitrago's Introduction

Data Engineering Coding Challenge

tl;dr: The challenge is to create a data pipeline that will ingest a UniProt XML file (data/Q9Y261.xml) and store the data in a Neo4j graph database.

⚠️ To apply, email [email protected] with 1) the link to your solution repo and 2) your resume.

Task

Read the XML file Q9Y261.xml located in the data directory. The XML file contains information about a protein. The task is to create a data pipeline that will ingest the XML file and store as much information as possible in a Neo4j graph database.

Requirements & Tools

  • Use Apache Airflow or a similar workflow management tool to orchestrate the pipeline
  • The pipeline should run on a local machine
  • Use open-source tools as much as possible

Source Data

Please use the XML file provided in the data directory. The XML file is a subset of the UniProt Knowledgebase.

The XML contains information about proteins, associdated genes and other biological entities. The root element of the XML is uniprot. Each uniprot element contains a entry element. Each entry element contains various elements such as protein, gene, organism and reference. Use this for the graph data model.

The full XML schema is available here.

Neo4j Target Database

Please run a Neo4j database locally. You can download Neo4j from https://neo4j.com/download-center/ or run it in Docker:

docker run \
  --publish=7474:7474 --publish=7687:7687 \
  --volume=$HOME/neo4j/data:/data \
  neo4j:latest

Getting Started with Neo4j: https://neo4j.com/docs/getting-started/current/

Data Model

The data model should be a graph data model. The graph should contain nodes for proteins, genes, organisms, references, and more. The graph should contain edges for the relationships between these nodes. The relationships should be based on the XML schema. For example, the protein element contains a recommendedName element. The recommendedName element contains a fullName element. The fullName element contains the full name of the protein. The graph should contain an edge between the protein node and the fullName node.

Here is an example for the target data model:

Example Data Model

Assessed Criteria

⚠️ The solution will not be assessed based on correctness of the data model with respect to biological entities. This requires domain knowledge that we do not expect you to have.

We will assess the solution based on the following criteria:

  • The solution captures most of the data from the XML
  • The solution makes use of general purpose open-source tools
  • The solution can be scaled to handle larger datasets

Example Code

In the example_code directory, you will find some example Python code for loading data to Neo4j.

Submission

Please commit your solution to a new repository on GitHub.

Feel free to use this repository as a starting point or to start from scratch. Include a README.md file that describes how to run the solution. Please also include a description how to set up and reproduce the environment required to run the solution.

Finally, email [email protected] with 1) the link to your solution repo and 2) your resume

weave_buitrago's People

Contributors

mpreusse avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.