
Hello! Welcome to Jack's page

My name is Chih-Hsu Lin. You can call me Jack.
I am a senior data scientist at C3.ai (NYSE: AI) in the San Francisco Bay Area.
I love data and have placed in the 🏆 top 3%-6% in 3 Kaggle competitions (top 2.9% of active users).
I received a 🎓 PhD with a quantitative concentration.
During my 3-month internship at Illumina, a top-tier biotech company (>7,000 employees), I reduced manual curation time by 94% with a machine learning pipeline and led an 11-person team to win 🏆 1st place in a business case competition.

Projects

  1. Designing Interpretable Neural Networks by Prior Knowledge to Predict Cancer Drug Targets. 2021. <Code>
    📖Published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
    ❓Problem (classification): How to predict personalized drug targets? How to design a better neural network architecture?
    🤔Why it's important: Better algorithms can accelerate therapeutic development and explain their predictions to earn people's trust.
    📝What I did: I invented and implemented a new, interpretable neural network algorithm in PyTorch that converges 35% faster, uses 200 times fewer parameters, and marginally outperforms (AUROC>0.88) a traditional neural network.
    💡Findings: Leveraging high-quality prior knowledge can build efficient, robust and interpretable neural networks.
    📂Data type: tabular data
    🛠️Skills: Deep learning, PyTorch, statistical tests

  2. Analysis of 5,500 Data Science Jobs. 2020. <Blog>. 1,100+ views in a week.
    ❓Problem: What skills do data science jobs need?
    🤔Why it's important: Understanding hiring trends helps job seekers.
    📝What I did: I extracted and cleaned 5,500 job descriptions from the internet. I summarized the results and generated interactive plots to investigate the skills, locations, and differences between data analysts, data scientists, and data engineers.
    💡Findings:

    📂Data type: tabular and text
    🛠️Skills: Plotly, Seaborn, web scraping

  3. Kaggle Recursion Cellular Image Classification. 2019. 🏆 Top 3.0% (26/866) <Code & Solution>
    ❓Problem (multiclass classification): How to classify 1,108 treatments based on the images of 4 different cell types?
    🤔Why it's important: Accurate and precise image classification can expedite the drug discovery process and improve the understanding of drug effects on cells.
    📝What I did: I built a deep learning pipeline using state-of-the-art convolutional neural networks to achieve an accuracy of 0.97757.
    💡Findings: Different cell types produce quite distinct images, so cell type-specific models are necessary.
    📂Data type: 6-channel images
    🛠️Skills: PyTorch, data augmentation, image processing, convolutional neural networks

  4. Accelerating Variant Triaging by Machine Learning. 2019. Internship.
    ❓Problem (classification): How to predict clinically relevant variants to automate the triaging process?
    🤔Why it's important: Successful predictions can reduce turnaround time and provide timely information to facilitate clinical decisions.
    📝What I did: I parsed JSON files and converted them to tabular data. I cleaned and merged internal data with external data. I developed a machine learning pipeline that reduced manual curation time by 94%. I presented the results as a poster at an international annual conference (8,500 attendees, ~250 exhibiting companies).
    💡Findings: Communication is key to customizing the pipeline to colleagues' needs and existing frameworks.
    📂Data type: JSON and tabular data + external data collection and cleaning
    🛠️Skills: Scikit-learn

  5. Multimodal Network Diffusion Predicts Future Disease-Gene-Chemical Associations. 2018. <Code>
    🎓 PhD thesis published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
    ❓Problem (edge prediction): How to predict future disease-gene-drug interactions based on existing network data?
    🤔Why it's important: Predicting interactions between diseases, genes and drugs can accelerate the drug development process.
    📝What I did: I merged and cleaned data from 3 databases and generated a network of 215,000+ drug-gene-disease associations. I implemented and validated graph-based kernel machine learning methods in Python to predict associations with >90% precision.
    💡Findings: Adding more data improves performance only if the method is good enough.
    📂Data type: graph/network
    🛠️Skills: graph kernel machine learning algorithms (self-implemented), graph building and analysis

  6. Kaggle Mercedes-Benz Greener Manufacturing. 2017. 🏆 Top 4.9% (188/3,835)
    ❓Problem (regression): How to predict the time a car takes to pass the manufacturing test based on anonymized car features?
    🤔Why it's important: Successful predictions can lead to speedier testing and lower carbon dioxide emissions without reducing Daimler’s standards.
    📝What I did: I applied dimensionality reduction methods to compress 386 anonymized variables. I developed a machine learning pipeline using gradient boosting and ensemble methods to achieve an R² of 0.55227, only 0.00323 below first place.
    💡Findings: Overfitting to the public leaderboard may lead to a poor ranking on the final leaderboard.
    📂Data type: tabular data (anonymized features)
    🛠️Skills: Scikit-learn, dimensionality reduction, stacking, gradient boosting, XGBoost

  7. Kaggle Sberbank Russian Housing Market. 2017. 🏆 Top 6.1% (201/3,274)
    ❓Problem (regression): How to predict Russian house prices based on house features and location under the country’s volatile economy?
    🤔Why it's important: Successful predictions can provide more certainty to the market in an uncertain economy.
    📝What I did: I built a machine learning pipeline to predict house prices using gradient boosting, artificial neural network models, and ensemble methods.
    💡Findings: Filtering outliers can improve predictions.
    📂Data type: tabular data + external data collection and cleaning
    🛠️Skills: Scikit-learn, gradient boosting, XGBoost, fully connected neural network, Keras, ensemble
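
As a rough illustration of the network-diffusion idea behind project 5, here is a minimal sketch of random walk with restart on a toy graph. It is not the published method: the node names, the `restart` value, and the `diffuse` helper are all hypothetical, chosen only to show how a score seeded at one node spreads to connected nodes.

```python
def diffuse(edges, seeds, restart=0.5, n_iter=100):
    """Random walk with restart over an undirected graph (illustrative only)."""
    # Build an adjacency map from the undirected edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Start with all probability mass spread evenly over the seed nodes.
    start = {node: 0.0 for node in neighbors}
    for seed in seeds:
        start[seed] = 1.0 / len(seeds)
    score = dict(start)

    for _ in range(n_iter):
        # With probability `restart`, jump back to the seed distribution.
        nxt = {node: restart * start[node] for node in neighbors}
        # Otherwise, each node passes its score equally to its neighbors.
        for node, nbrs in neighbors.items():
            share = (1.0 - restart) * score[node] / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        score = nxt
    return score


# Hypothetical toy network: one disease, two genes, one drug.
edges = [
    ("disease_A", "gene_1"),
    ("gene_1", "drug_X"),
    ("disease_A", "gene_2"),
]
scores = diffuse(edges, seeds=["disease_A"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, `drug_X` has no direct edge to `disease_A` but still receives a nonzero score through `gene_1`, which is the kind of indirect signal an edge-prediction method can rank.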

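The JSON-to-table step mentioned in project 4 could look something like the sketch below. The record layout, the field names, and the `flatten` helper are assumptions made for illustration; the internship data format is not described here.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical records; the real file layout is not described in the text.
raw = """
[
  {"id": "v1", "gene": "GENE_A", "annotation": {"impact": "high", "score": 0.9}},
  {"id": "v2", "gene": "GENE_B", "annotation": {"impact": "low"}}
]
"""
rows = [flatten(rec) for rec in json.loads(raw)]

# Use the union of all keys as columns so missing fields become empty cells.
columns = sorted({key for row in rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
table = buffer.getvalue()
```

Taking the union of keys across records keeps the table rectangular even when some records lack a field, which matters before merging with other tabular sources.
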
Jack Chih-Hsu Lin's Projects

cellular_image_classification

Kaggle competition 2019: Recursion Cellular Image Classification. The team's solution ranked 26/866 (top 3.0%; silver medal).

gmplot

Plotting data on google maps, the easy (stupid) way.

pretrained-models.pytorch

Pretrained ConvNets for pytorch: NASNet, ResNeXt, ResNet, InceptionV4, InceptionResnetV2, Xception, DPN, etc.

scipy

Scipy library main repository
