
computer-vision-course's People

Contributors

adhiiisetiawan, ai-ank, albertkao227, alvanli, aman06012003, anindyadeep, asusevski, atayloraerospace, bellabf, charchit7, diwakarbasnet, fariddinar, farrosalferro, hwaseem04, johko, jvthunder, kfahn22, lulmer, mattmdjaga, merveenoyan, minemile, mkrolick, mmhamdy, nemesisalm, psetinek, seshupavan, sezan92, snehilsanyal, vasugupta9, zekrom-7780


computer-vision-course's Issues

Video & Video Processing

Hello everyone,

Together with my collaborators, we've drafted a curriculum for our section. We're keen on getting feedback from the HF team during our writing. Here's what we've got so far:

Proposed Curriculum

INTRODUCTION (introduction-to-video.mdx)

  • What is a Video? – Delving into the Basic Concept (Assigned to @DiwakarBasnet, Estimated: by Nov 25th)
  • Role of Video Processing in AI and CV (Assigned to @wonhyeongseo, Estimated: by Nov 25th)

VIDEO PROCESSING BASICS (video-processing-basics.mdx)

  • Video vs. Image – Exploring Key Processing Differences (Assigned to @DiwakarBasnet, Estimated: by Nov 25th)
  • Temporal Continuity & Motion Estimation (Assigned to @DiwakarBasnet, Estimated: by Nov 25th)
  • Key Techniques: Stabilization, Background Subtraction, and Object Tracking
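
To give a flavour of the background-subtraction technique above, a minimal OpenCV sketch (the video path is a placeholder; this is only an illustration, not content from the chapter):

    import cv2

    # Minimal background-subtraction sketch; "input.mp4" is a placeholder path.
    cap = cv2.VideoCapture("input.mp4")
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Foreground mask: moving pixels are white, static background is black
        fg_mask = subtractor.apply(frame)
        cv2.imshow("foreground", fg_mask)
        if cv2.waitKey(30) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()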

DEEP LEARNING FOR VIDEO (deep-learning-for-video.mdx)

  • Overview of Previous SOTA Models – Including 3D CNNs like (2+1)D Resnets and Two Stream Nets (Assigned to @sazio , Time: To be specified)
  • Transformers in Video Processing – Highlighting differences in ViT for Image vs. Video (Assigned to @DiwakarBasnet and @sazio , Time: To be specified)
  • Key Applications: Action Recognition and Video Captioning

PRACTICAL APPLICATIONS & CHALLENGES (applications-and-challenges.mdx)

  • Practical Applications: Surveillance, Streaming, Autonomous Driving, AR/VR, Healthcare, and Sports Analytics (Note: Might be too much; consider narrowing down for depth, Assigned to @sanae-a11y, Estimated: To be specified)
  • Challenges: Real-time Constraints & Privacy Concerns (Assigned to @wonhyeongseo, Estimated: by Nov 25th)

HANDS-ON DEMO (Jupyter Notebooks)

  • Simple Video Stabilization
  • Object Tracking
  • Action Recognition
  • Video Summarization

cc. @DiwakarBasnet @sazio @sanae-a11y @jungnerd
bcc. @johko @lunarflu @merveenoyan

With our curriculum now finalized and our initial meeting behind us, we're eager to write the contents. Here are some things to keep in mind while writing:

add notebook examples

https://github.com/johko/computer-vision-course/blob/main/chapters/en/unit4/multimodal-models/transfer_learning.mdx

Task | Description | Model | Notebook
Fine-tune CLIP | Fine-tuning CLIP on a custom dataset | openai/clip-vit-base-patch32 | CLIP notebook
VQA | Answering a question in natural language based on an image | dandelin/vilt-b32-finetuned-vqa | VQA notebook
Image-to-Text | Describing an image in natural language | Salesforce/blip-image-captioning-large | Text 2 Image notebook
Open-set object detection | Detect objects by natural language input | Grounding DINO | Grounding DINO notebook
Assistant (GPT-4V like) | Instruction tuning in the multimodal field | LLaVA | LLaVa notebook

Unit 8 - 3D Vision, Scene Rendering and Reconstruction: Draft outline

Hi. This is a really deep subject, with much more potential material than can be covered easily by a small team.
As a result, I've adapted work by @CyWiz57 to give a minimal outline covering some of the current hot topics - view synthesis, NeRFs, and Gaussian splatting.

Introduction:

  • Overview of 3D Vision
  • Brief history of 3D Vision
  • 3D Vision applications

Terminologies and Basics:

  • Linear algebra and transformations.
  • Camera models: Pinhole and lens distortion models.
  • Representations: Point clouds, meshes, implicit surfaces, volumetric data.

3D Vision:

  • Depth estimation from single images (MiDaS)
  • Structure from Motion - Estimating camera positions from multiple images (COLMAP, SuperGlue)
  • Multiview stereo (COLMAP)
  • Volume Rendering
  • Neural Radiance Fields (NeRF)
  • Rendering Signed Distance Fields (NeuS)
  • Sparse View 3D and View Synthesis (PixelNeRF, Zero123)
  • Neural Light Fields (Light Field Networks)
  • Point Cloud Processing (PointNet)
  • Generative 3D Models (Point-E, DreamFusion, Magic3D)
  • Gaussian Splatting
  • Datasets (ShapeNet, Objaverse)

Team members @jfozard @julien-blanchon @CyWiz57 - possibly also Luke, sanae, hwaseem04.

Please let me know if there are things missing, or that can be cut easily.

Common Pre-Trained Models (ResNet, etc.) Discussion

Hello. This issue is for discussion about the chapter Common Pre-Trained Models

My thoughts

In the following paragraphs, I am adding my thoughts. This will be finalized after discussion.

What do you think should be added?

The chapter assumes the reader knows a fair amount about CNN algorithms. Our job now is to address the major architectures built on CNNs.

Architectures to be added

  • VGG16/19 (one of the first very deep CNNs)
  • ResNet (residual connections enabled much deeper networks)
  • GoogLeNet (Inception module)
  • ResNeXt
  • ConvNeXt

How would you like to explain?

  • I think the chapter should be diagram-heavy and, if possible, include some implementations of the architectures.

Please let me know your thoughts

z-axis rotation matrix not right in Basics of Linear Algebra for 3D Data

First of all thanks for this course :)

Please note that in Basics of Linear Algebra for 3D Data, where the z-axis rotation matrix is introduced, it is a copy of the y-axis one.

Now it is:

- Rotation around the Y-axis

$$ R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

We are sure you can use the example snippet above and figure out how to implement a rotation around the Y-axis.😎😎

- Rotation around the Z-axis

$$ R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

Again, can you use the last code snippet and implement a rotation around the Z-axis❓

Should be:

- Rotation around the Y-axis

$$ R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

We are sure you can use the example snippet above and figure out how to implement a rotation around the Y-axis.😎😎

- Rotation around the Z-axis

$$ R_z(\beta) = \begin{pmatrix} \cos \beta & -\sin \beta & 0 & 0 \\ \sin \beta & \cos \beta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

Again, can you use the last code snippet and implement a rotation around the Z-axis❓
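
For reference, a minimal NumPy sketch of the corrected Z-axis rotation (it mirrors the structure of the chapter's snippets but is not copied from them):

    import numpy as np

    def rotation_z(beta):
        """Homogeneous 4x4 rotation around the Z-axis (angle in radians)."""
        c, s = np.cos(beta), np.sin(beta)
        return np.array([
            [c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1],
        ])

    # Rotating a homogeneous point 90 degrees around Z: (1, 0, 0) -> (0, 1, 0)
    point = np.array([1.0, 0.0, 0.0, 1.0])
    print(rotation_z(np.pi / 2) @ point)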

Thanks again

Image and Imaging Outline

Hello everyone,
We finally have an outline for the Image and Imaging chapter.

1. What is an image? [image.mdx]

  • Definition of Image
  • Difference between Image and Video
  • Image-specific challenges (i.e. comparing images to other types of data)

2. Fundamentals of Imaging and its technical aspects of images [imaging.mdx]

  • Image formation: Bayer Grid / Filter, Digital Image Sensing, Color space representations
  • Image representation: types of image representations, RGB, image as a matrix
  • Output of images: image resolution, pixel/voxel, dimensions, channels, dynamic range, bit depth, file compression, image formats, metadata

3. How we imaged everything/When in doubt, image it. [extension-image.mdx]

  • Challenges in imaging real-life
  • Imaging everything (how we extend sensors to make images from things that are not usually thought of as images)
  • Perspective on what we obtained from this and its impact on us (i.e. we have imaged outer space and electrons; we have an MRI machine capable of imaging inside of us without ever touching 🤯)
  • Example(s) of how image-acquired characteristics of images come from what types of sensors they record, challenges/difficulties in computer vision change according to the image/data, not all data is RGB

Our biggest concern was the possible overlap of our content with the “Feature Detection/Feature Extraction” chapter, especially regarding image transformations, filters, and convolution. We opted to leave those out, given the proposed outline for Feature Detection.

Let us know what you think!

Team members - @bellabf, @seshu-pavan, @mfidabel, @froestiago
bcc - @johko @lunarflu @merveenoyan @MKhalusova

Best,
Image and Imaging team

General Architecture: Draft Outline

Hey Everyone

This issue is for discussion about the chapter General Architecture. Our team planned to keep it as simple as possible while also making it informative.

General Architecture of CNNs:

  • What is a convolutional filter?
  • Implementation of convolution in PyTorch (also JAX)
  • Filters and their Math
  • Various types of Convolutions
  • Layers to build ConvNet
    - Convolution Layer
    - Pooling
    - Normalization
    - Fully Connected layers
  • Implementation of a CNN in PyTorch (probably JAX as well), along the lines of the sketch below
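
A minimal sketch of what the CNN-in-PyTorch item could boil down to (layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    # Minimal ConvNet covering the layer types listed above:
    # convolution, normalization, pooling, and a fully connected head.
    class SimpleConvNet(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
                nn.BatchNorm2d(16),                          # normalization
                nn.ReLU(),
                nn.MaxPool2d(2),                             # pooling
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, num_classes)     # fully connected layer

        def forward(self, x):
            x = self.features(x).flatten(1)
            return self.classifier(x)

    model = SimpleConvNet()
    print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])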

Open to all suggestions for improving the draft.
Team members: @jucamohedano @youssefadr @alperenunlu
bcc : @johko @merveenoyan @lunarflu

Image Classification task is missing in Unit 6

Image classification falls under basic CV tasks, but as of now it is missing from Unit 6. I am planning to add a chapter to Unit 6. The typical structure I am planning is:

Image Classification

  • Overview of Image Classification
    • Introduction to the concept and importance of image classification in computer vision
    • Brief presentation of popular models.
  • Example Application of Image Classification
  • Hands-on Notebook
    • Using pretrained CNN model (such as VGG, ResNet) for image classification problem
    • Using pretrained ViT model for image classification problem
  • Real-life example of image classification (such as medical image classification) with notebooks.
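
To give a flavour of the pretrained-ViT hands-on bullet, a minimal sketch the notebook could start from (the image path is a placeholder):

    from transformers import pipeline
    from PIL import Image

    # Zero-setup classification with a pretrained ViT checkpoint.
    classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
    image = Image.open("example.jpg")  # placeholder image path
    print(classifier(image, top_k=3))  # e.g. [{'label': ..., 'score': ...}, ...]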

Model Optimization Chapter 15

Hi everyone, the team and I just discussed the model optimization chapter and drafted an outline for it. Overall, we plan it like this:

Module 1: Introduction to model optimization for deployment

What is model optimization?
Why is it important for deployment in computer vision?
Different types of model optimization techniques
Trade-offs between accuracy, performance, and resource usage

Module 2: Model compression techniques

Quantization
Pruning
Knowledge distillation
Low-rank approximation
Model compression with hardware accelerators
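
As a first taste of the quantization topic above, a minimal PyTorch dynamic-quantization sketch (weights-only; static quantization and QAT, which matter more for CNNs, need extra setup and can be covered in the module itself):

    import os
    import torch
    import torch.nn as nn

    # Post-training dynamic quantization: nn.Linear weights are stored in int8
    # and dequantized on the fly at inference time.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    def size_mb(m, path="tmp.pt"):
        torch.save(m.state_dict(), path)
        return os.path.getsize(path) / 1e6

    print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")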

Module 3: Model deployment considerations

Different deployment platforms (cloud, edge, mobile)
Model serialization and packaging
Model serving and inference
Best practices for deployment in production

Module 4: Model optimization tools and frameworks

TensorFlow Model Optimization Toolkit (TMO)
PyTorch Quantization
ONNX Runtime
OpenVINO
TensorRT
Edge TPU

Module 5: Model optimization case studies (Hands on, choose one is okay)

Deploying a real-time object detection model on a mobile device
Deploying a semantic segmentation model on an edge device
Deploying a face recognition model for authentication in the cloud
Case studies of using model optimization techniques for specific deployment challenges

Module 6: Future trends in model optimization for deployment

Federated learning
Continual learning
Model compression for new hardware architectures

By “module” we mean a section within a chapter; for example, chapter 1 of the NLP course has 10 modules. We plan to structure this chapter the same way, with each module covering the topics listed above.

As for Module 5, we are still discussing which hands-on use case to implement, but as a starter we plan on object detection on a mobile device.

Let us know if the outline needs to be revised, or if any of you have suggestions; that would be very helpful. @johko @lunarflu @merveenoyan

cc: @mfidabel

Unit 2 - CNN | Transfer Learning/Fine-Tuning Draft Outline

Hey everyone.

These are the proposed sections for the Transfer Learning/Fine-Tuning chapter.

Transfer Learning/Fine-Tuning

  • What is Transfer Learning?
  • What is Fine-Tuning?
  • Fine-Tuning with Torchvision [VGG16, ResNet50]
  • Torchvision vs Timm
  • Fine-Tuning with Timm [VGG16, ResNet50]
  • Detailed Fine-Tuning tutorial on an image classification task. [Adapted on Different Models]
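
To give a sense of what the "Fine-Tuning with Timm" section could show, a minimal sketch (the dataset, class count, and dummy training step are placeholders):

    import timm
    import torch

    # Load a pretrained ResNet-50 with a fresh classification head for a
    # hypothetical 10-class dataset.
    model = timm.create_model("resnet50", pretrained=True, num_classes=10)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    # One illustrative training step on dummy data (replace with a real DataLoader).
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()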

Open to suggestions for further improvement of the draft.

Trends and New Architectures - Draft Outline

Hello👋 ,

This is a draft outline for the Trends and New Architectures chapter. I think it'd be better to call it alternative rather than new. Below I'll give a brief overview of the chapter content.

🔹 On Innovation: An Introduction

🔹 Case Study: ViT vs. Image Transformer

Image Transformer was an early attempt by the authors of Attention Is All You Need to introduce transformers into computer vision. It would be interesting to compare it to the now-established ViT (both models come from Google Brain).

🔹 Why Alternative Architectures?

  • The limitations of CNNs
  • The limitations of ViTs

🔹 Hiera

Paper: Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
A hierarchical vision transformer introduced by Meta's FAIR. It improves modern hierarchical vision transformers not by adding new components, but by questioning several of the vision-specific components in these architectures. Meta wrote a Twitter thread about it.

  • Overview
  • Why it matters

🔹 Hyena

Paper: Multi-Dimensional Hyena for Spatial Inductive Bias
Initially introduced in NLP, the Hyena layer offers a replacement for the transformer's self-attention with subquadratic complexity. A variant of it, the Multi-Dimensional Hyena (Hyena N-D) layer, boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT.

  • Overview
  • Why it matters

🔹 I-JEPA

Paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
The first AI model based on Yann LeCun's vision for more human-like AI. The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a non-generative approach for self-supervised learning from images. It delivers strong performance on multiple computer vision tasks and is much more computationally efficient than other widely used computer vision models.

  • Overview
  • Why it matters

🔹 RMT

Paper: RMT: Retentive Networks Meet Vision Transformers
Retentive Network (RetNet) is a new architecture that ambitiously aims to replace Transformers and become the new foundation architecture for large language models. Inspired by RetNet, RMT is an attempt to extend the retention mechanism to a 2D form and introduce it to visual tasks.

  • Overview
  • Why it matters

🔹 Trends And Research Directions

  • Open Vocabulary Learning: aims to enable models to recognize and classify objects in images and videos without being explicitly trained on those categories. This is in contrast to traditional computer vision approaches, which require a large dataset of labeled images for each category.

  • General foundation models for computer vision: In NLP, language models predicting the next token have proven to be a good foundation model that can be fine-tuned for various tasks. In computer vision, the open question is which model and loss objective have the potential to serve as the foundation model for different computer vision tasks.

  • Domain-specific models: CV models that have been trained on a domain-specific dataset or task achieve higher accuracy and performance on that task than general computer vision models. This has many applications, for example, in medicine and healthcare.

🔹 Summary

This part offers a summary of the chapter and refers to other research work that was not mentioned in the chapter.


Notes:

  • This is just a glimpse of the various new and amazing work done in computer vision. I'm sure there is a lot to add to this chapter and improve it, so your feedback is highly appreciated.
  • We try to make the chapter as short and interesting as possible, but the trends section still needs more work in my opinion.

Other Resources

Multimodal Models - CLIP and relatives

Hello!

Inspired by #19 and #28, my fellow collaborators and I have also outlined a course curriculum for our section, but we would like some input and feedback from the HF team before we finalize it and start working on it. This is our chosen structure so far.

Introduction

  • Motivation for multimodality
  • History of multimodal models
  • Self supervised learning enabling multi-modality

CLIP

  • Intro to CLIP (ELI5)
  • Theory behind CLIP (contrastive loss, embeddings, etc.)
  • Variations of CLIP backbones
  • How tokenisation and embeddings work in CLIP
  • Applications of CLIP:
    • Search and retrieve
    • Zero-shot classification (see the sketch after this outline)
    • CLIP guidance (using CLIP in other models to guide generation, DALL-E, SD, etc.)
  • Fine-tuning CLIP (OpenCLIP, and other variants?)
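
As a quick illustration of the zero-shot classification application above, a minimal sketch with 🤗 Transformers (the image path and label prompts are placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("cat.jpg")  # placeholder image
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity scores, softmaxed over the candidate labels
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))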

Losses/ self supervised learning

  • Contrastive
  • Non contrastive
  • Triplet
  • One or two other ones
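
For the contrastive case, the CLIP-style objective is usually written as a symmetric InfoNCE loss over a batch of $N$ matched image-text pairs (standard formulation, added here just for reference):

$$ \mathcal{L}_{\text{img} \to \text{txt}} = -\frac{1}{N} \sum_{k=1}^{N} \log \frac{\exp\left(\mathrm{sim}(v_k, t_k)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(v_k, t_j)/\tau\right)} $$

where $v_k$ and $t_k$ are the image and text embeddings, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a learned temperature; the text-to-image term is defined analogously and the two are averaged.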

Relatives

  • Image-bind
  • BLIP
  • OWL-VIT
  • Flamingo (IDEFICS)
  • LLaVa

Practical applications & challenges

  • Applications
    • Image search engine based on textual prompts
    • Downstream tasks on embeddings eg classification, clustering etc
    • Visual question answering systems
  • Challenges
    • Data bias/ out of distribution data
    • Hard to get enough data -> leads to using noisy internet data

References:

@mattmdjaga @froestiago

GANs, Diffusion Models, Generative Tasks (txt2img, img2img, inpainting)

Greetings everyone,
Inspired by #19, my fellow collaborators and I have also outlined a course curriculum for our section, but we would like some input and feedback from the HF team before we finalise it and start working on it. This is our chosen structure so far.

INTRODUCTION

  • What are generative vision models and how do they differ from other models?
  • Different types of generative models/tasks?
  • Prerequisites and resources to help

GANS & VAEs

  • VAEs theory (Theory)
  • Idea behind GANs, generator and discriminator (Theory & code)
    • DCGAN as the main implementation
  • Simple explanation, showcase and external resources:
    • StyleGAN
    • CycleGAN
    • VQGAN

Diffusion models

  • Theory of diffusion models and how they differ from GANs (limitations of GANs)
  • Evolution / what made diffusion models work: DDPM, latent diffusion
  • Using stable diffusion
    • Basic structure of SD
    • How to use txt2img, img2img, inpainting
  • Simple explanation, showcase and external resources
    • Dreambooth
    • LoRA, show how to use, link to fine-tuning yourself
    • ControlNet, show how to use

PRACTICAL APPLICATIONS & CHALLENGES

  • Real-time Constraints and Privacy Concerns
  • Bias concerns

CC: @hwaseem04 , @mattmdjaga, @charchit7
BCC: @johko , @lunarflu , @merveenoyan

We would like to know whether this course structure is suitable or whether changes are needed. In particular, we are interested to know:

  • Do we need to create separate files for each topic or squish it together into a big jupyter notebook? (essentially structural help)
  • How much code is too much? For GANs, we've thought of sticking with DCGAN for from-scratch code, and for the rest just referring to pre-trained models and showcasing how to use them directly.
  • Is the chosen curriculum sufficient, or do we need to add more to it? Also, if it's too much, what can be truncated or removed, considering that there is already an existing diffusers course out there?

Thanks,
Sarthak (and @hwaseem04 , @mattmdjaga, @charchit7)

Common Vision Transformers (SWIN) Chapter

After discussions with @alanahmet, @SuryaKrishna02, @Mkrolick, and @sulphatet, here are our proposed subdivisions for learning common vision transformers, specifically the Shifted Windows (SWIN) architecture.

Introduction

This section will provide the limitations of ViT and the rationale for using SWIN over ViT.

  • Background on Transformers. (A brief background of how transformers work in CV, since we assume this will primarily be tackled in the preceding Transformers Architecture + ViT chapter)
  • Limitations of ViT and the emergence of SWIN Transformer. (Quadratic complexity, absolute positional encoding, etc.)
  • Motivation for SWIN Transformer. (The problems it addresses from ViT)

SWIN Transformer Architecture and Its Advantages

This section will provide the theoretical aspect of SWIN's advantages to have better performance over ViT.

  • Overview
  • Window-based self-attention (linear complexity and its efficiency when scaling)
  • Hierarchical Representation / MSA and how it enhances feature representation
  • Relative position bias

Deconstructing SWIN

This section will now go into the technical (and coding) aspect of the architecture; primarily on breaking down the key layers to understand how it is implemented and why it is that way.

  • Deconstruct key layers (Patch embedding, W-MSA / SW-MSA, Relative Position Bias, and Patch Merging)
  • Implement the key layers mentioned above
  • Plot attention maps of W-MSA / SW-MSA to get a better understanding of the model's perception

Application of SWIN

This section is where SWIN will be fine-tuned to classify a custom dataset via Hugging Face.

  • Finetune SWIN, ViT, and CNN-based architectures for classification of a custom dataset.
  • Record each model's performance and compare
  • Discussion of SWIN for other downstream tasks like object detection and segmentation
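
For the fine-tuning step, a minimal setup sketch with 🤗 Transformers (the checkpoint and the 3-class head are assumptions for illustration):

    from transformers import AutoImageProcessor, AutoModelForImageClassification

    # The original ImageNet head is dropped (ignore_mismatched_sizes) so a fresh
    # head sized for the custom dataset can be trained.
    checkpoint = "microsoft/swin-tiny-patch4-window7-224"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=3,
        ignore_mismatched_sizes=True,
    )
    # From here, training proceeds as usual (e.g. with the Trainer API or a torch loop).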

Conclusion

  • Wrap up
  • Advancements on SWIN
  • Advancements on Computer Vision models in general (e.g. self-supervised methods)

Please let us know your suggestions on this proposal.

Should we use safetensors?

I wondered if we should add an official recommendation to use the safetensors saving format wherever possible.

But I have to admit that I'm not that familiar with it, so I don't know how much overhead it would be in cases where we cannot use an HF library like transformers.
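
For what it's worth, using safetensors directly (outside transformers) looks fairly lightweight; a minimal sketch of how it could work, not tested against the course content:

    import torch
    from safetensors.torch import load_file, save_file

    # Saving/loading a plain PyTorch state dict with safetensors; no HF library
    # beyond `safetensors` itself is needed.
    model = torch.nn.Linear(10, 2)
    save_file(model.state_dict(), "model.safetensors")

    state_dict = load_file("model.safetensors")
    model.load_state_dict(state_dict)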

Issue with rendering the course

If we try to render the course to preview how our added content looks, it throws the following error:

sarthak@kde:~/Desktop/computer-vision-course$ doc-builder preview computer-vision-course chapters/ --not_python_module
Initial build docs for computer-vision-course chapters/ /tmp/tmp0uqdjoxf/computer-vision-course/main/en
Building the MDX files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:00<00:00, 1288.27it/s]
Traceback (most recent call last):
  File "/home/sarthak/anaconda3/bin/doc-builder", line 8, in <module>
    sys.exit(main())
  File "/home/sarthak/anaconda3/lib/python3.9/site-packages/doc_builder/commands/doc_builder_cli.py", line 47, in main
    args.func(args)
  File "/home/sarthak/anaconda3/lib/python3.9/site-packages/doc_builder/commands/preview.py", line 171, in preview_command
    source_files_mapping = build_doc(
  File "/home/sarthak/anaconda3/lib/python3.9/site-packages/doc_builder/build_doc.py", line 405, in build_doc
    sphinx_refs = check_toc_integrity(doc_folder, output_dir)
  File "/home/sarthak/anaconda3/lib/python3.9/site-packages/doc_builder/build_doc.py", line 460, in check_toc_integrity
    raise RuntimeError(
RuntimeError: The following files are not present in the table of contents:
- en/Unit 5 - Generative Models/variational_autoencoders
- en/Unit 5 - Generative Models/README
- en/Unit 11  - Zero Shot Computer Vision/README
- en/Unit 2 - Convolutional Neural Networks/README
- en/Unit 1 - Fundamentals/README
- en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/README
- en/Unit 4 - Mulitmodal Models/README
- en/Unit 9 - Model Optimization/README
- en/Unit 6 - Basic CV Tasks/README
- en/Unit 7 - Video and Video Processing/README
- en/Unit 13 - Outlook/README
- en/Unit 3 - Vision Transformers/README
- en/Unit 12 - Ethics and Biases/README
- en/Unit 10 - Synthetic Data Creation/README
Add them to chapters/_toctree.yml.

Explanation: This is because there have been README files added to each chapter. However, these README files are not present in the _toctree.yml.

Why it's important: Being able to render the course locally is important, as it can give us a rough overview of how the content looks.

Possible solutions could be:

  • Remove the README files for the time being
  • Add them to the toctree, and also make sure that anyone who adds chapter content updates the toctree, making it easier for others to render the course

Open for discussion from other members ✌️

Multi-label Image Classification Colab Notebook Error

This notebook doesn't work
https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb

When running this line of code:
notebook_launcher(train, (model_name,8,5,5e-5), num_processes = 2)

The following error is raised:
File "", line 10, in train_transforms
labels = torch.tensor(batch['classes'])
ValueError: expected sequence of length 1 at dim 1 (got 2)
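
A guess at the cause, with a sketch of one possible fix (not verified against the notebook; num_classes is a placeholder): batch['classes'] appears to be a ragged list of label-id lists, which torch.tensor cannot stack, so a multi-hot encoding may help:

    import torch

    num_classes = 20  # placeholder; use the dataset's actual label count

    def encode_multi_hot(class_lists):
        # class_lists looks like [[3], [3, 7], ...]; ragged lists cannot be
        # stacked into a tensor directly, so build a multi-hot matrix instead.
        labels = torch.zeros(len(class_lists), num_classes)
        for i, class_ids in enumerate(class_lists):
            labels[i, class_ids] = 1.0
        return labels

    print(encode_multi_hot([[3], [3, 7]]))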

Zero-shot Computer Vision - Draft Outline

This is an early draft outlining the Zero-shot Computer Vision chapter. Below I'll give a brief overview of the chapter content.
Presentation slides for the main concepts, with some (little 😃) detail, are also available.

🔹 Introduction

This section will basically lay the ground for the rest of the chapter. Each subsection was initially a section of its own, but then we thought that it would be better to merge them together under one heading.

  • On Generalization
  • Zero-shot Learning (ZSL), History and Definitions
  • Comparison With Other Techniques: This part aims to differentiate between zero-shot learning and some other methods such as Open Set Recognition (OSR), Domain Adaptation, and Out of Distribution (OOD) Detection
  • Relationship with Transfer Learning: This part discusses how zero-shot learning is related to transfer learning, and differentiates between homogeneous and heterogeneous transfer learning. It will only cover the parts related to ZSL, as there is already a transfer learning chapter.

🔹 Side Box: How Humans Recognize New Objects

This is a collapsible box for the interested reader about how humans are good at identifying new, unseen objects and why the same is not true for machines.

🔹 Zero-shot Learning methods

🔹 Zero-shot Learning with CLIP and friends

  • How is CLIP Different From Previous Approaches: CLIP has been introduced in previous chapters. Here, we will discuss briefly (I hope) the parts related to zero-shot learning.

🔹 Zero-shot Learning in Computer Vision

This part illustrates how zero-shot learning can be used in the context of many different computer vision tasks. CV tasks were introduced in previous chapters.

Zero-shot Object Recognition/ Image Classification

  • Methods
  • Code Example

Zero-shot Object Detection

  • Methods
  • Code Example
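
As one option for the code example here, a minimal OWL-ViT sketch via the 🤗 pipeline (checkpoint and labels chosen just for illustration):

    from PIL import Image
    from transformers import pipeline

    # Zero-shot object detection: candidate labels are free-form text,
    # no category-specific training needed. Image path is a placeholder.
    detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
    image = Image.open("street.jpg")
    predictions = detector(image, candidate_labels=["a traffic light", "a bicycle", "a person"])
    for pred in predictions:
        print(pred["label"], round(pred["score"], 3), pred["box"])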

Zero-shot Instance Segmentation

  • Methods
  • Code Example

Other CV Tasks

Besides the three most common CV tasks mentioned above, in this section, we may discuss other interesting CV tasks in the ZSL context.

🔹 Advantages of Zero-shot Learning

This section discusses why zero-shot learning is important. This will span a paragraph or two at most.

🔹 Applications of Zero-shot Learning

This section aims to provide some real-world applications of zero-shot learning in a computer vision context. There are no specific ones yet.

🔹 Challenges and Limitations of Zero-shot Learning

  • Bias
  • Domain Shift
  • Hubness
  • Semantic Loss

🔹 Frontiers

This is a paragraph or a little bit more mentioning the current state-of-the-art and/or recent experimental approaches in zero-shot learning.

🔹 Chapter Summary

This is another paragraph (or two 😅) that aims to condense the main ideas discussed in the chapter and Key Takeaways.

👩‍💻 Hands-on Notebook

This is a hands-on notebook that shows two things:

  1. The implementation of a classic ZSL algorithm, for example, ESZSL from scratch.
  2. The implementation of a ZSL pipeline using CLIP or another friend from scratch.

Notes

  1. The chapter contents may seem overwhelming but we hope that we will get to a much shorter and dense version when we start working on the details.
  2. The Algorithms part is the most volatile (high probability of change) one. There are a lot of ZSL algorithms out there and we are trying to choose a representative sample showing different approaches.
  3. We will make sure that the ratio of plain ZSL to CV ZSL remains in a reasonable range.

Other Resources:

Notebook in wrong subdirectory

File computer-vision-course/chapters/en/'Unit 3 - Vision Transformers'/KnowledgeDistillation.ipynb should be moved to computer-vision-course/notebooks/en/'Unit 3 - Vision Transformers'/KnowledgeDistillation.ipynb.

Will open PR to fix

Chapter Discussion: Fundamentals of Computer Vision

Hello everyone,

The topic Fundamentals of Computer Vision is going to be the first section of the course. Our team planned to keep it as engaging and informative as possible, so we came up with a simple theme of explaining the "What, Why, and How" of computer vision and arrived at the following curriculum.

  1. Why Computer Vision? [motivation.mdx]
  • The perspective of vision
  • Significance of Vision to Humans
  • The motivation behind creating artificial systems capable of simulating human vision and cognition
  2. What is computer vision? [defination.mdx]
  • The core of what computer vision is
  • Historical definition
  • Deep learning and the computer vision renaissance
  • Distinction between human and computer vision systems
  • Relation to the field of machine learning and image processing
  • Introduction to image understanding at different levels
  3. Computer Vision in the Wild (How) [applications.mdx]
  • Problems, challenges, and objectives of CV in the real world (examples of problems that computer vision tries to address)
  • A high-level overview of computer vision systems with examples (one or two)
  • Ethical considerations

Team members - @seshu-pavan @bellabf, @froestiago @aman06012003 @chiho-5
bcc - @johko @lunarflu @merveenoyan @MKhalusova

We have one question for now: are you expecting us to write coded tutorials? We created this to include storytelling, definitions and images mostly.

We are well aware that this is an iterative project, so any feedback, tips and suggestions the HF team provides will give us a chance to
improve our progress.

Best,
Fundamentals Team

Multimodal Transfer Learning: Draft outline

Hi CV course contributors,
We would love to hear your feedback on the multimodal transfer learning section of the course.
Here's the current general outline, with some of the thinking we've done as a team.
What do you think of the following outline:

1. Introduction

  • Introduce fine-tuning methods such as:
    • Zero-shot learning
    • Few-shot learning
    • Full fine-tuning
    • Parameter-Efficient Fine Tuning (PEFT)
  • Discuss the current rapid innovation within a new domain.
  • Introduce a multimodal task like VQA for the notebook (depending on compute requirements/specifications this could also be img2text).
  • Present procedure for loading and inspecting a select dataset.

2. Zero-shot multi-modality

  • Introduction to large language-vision models like BLIP-2, SAM, LLaVA, etc., that offer zero-shot capabilities.
  • Demonstration of loading a model and running zero-shot inference on a task.
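
For the zero-shot demonstration, a minimal sketch of what the notebook could open with (a smaller BLIP captioning checkpoint is used here purely as a stand-in for BLIP-2/LLaVA-scale models; the image path is a placeholder):

    from PIL import Image
    from transformers import pipeline

    # Zero-shot image captioning with a pretrained checkpoint, no fine-tuning.
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    image = Image.open("example.jpg")
    print(captioner(image))  # e.g. [{'generated_text': '...'}]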

3. Full finetuning

  • Explaining the concept of fine-tuning all layers within a model.
  • Computation of parameters to be modified followed by the fine-tuning process.
  • Inference on the fine-tuned model with a comparison to zero-shot results.

Discussion Point: Full fine-tuning may be impractical for multimodal models due to their size. Should we still include it for educational value or focus solely on PEFT?

4. Parameter efficient fine tuning (PEFT)

  • Clarification on 1 or 2 PEFT methods, potentially focusing on LoRA and Llama-Adapters.
  • Fine-tune a language-vision model from earlier using the PEFT methods
  • Calculation of the number of parameters modified by each PEFT method.
  • Run inference on same task as earlier with fine-tuned model for comparison.
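
A minimal LoRA sketch, mostly to illustrate the parameter-calculation point above (checkpoint, rank, and target modules are assumptions, not final choices):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForImageClassification

    # Wrap a vision model so that only low-rank adapter weights (injected into
    # the attention projections) are trainable.
    base = AutoModelForImageClassification.from_pretrained(
        "google/vit-base-patch16-224-in21k", num_labels=10
    )
    config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.1)
    model = get_peft_model(base, config)

    # Directly reports the "modified parameters" figure, e.g.
    # "trainable params: ... || all params: ... || trainable%: ..."
    model.print_trainable_parameters()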

5. Final remarks

  • Compare and discuss results of previous sections.
  • Discussion on the benefits and challenges of each method.
  • Insights into future developments within the field.

How to include "What you'll learn" section for this course?

Hello everyone,
Our PR for Fundamentals of Computer Vision was merged a few days back. One thing we still need to address, based on your feedback on our chapter outline, is building a demo using Gradio to give learners a taste of what they'll learn. One of our teammates, @aman06012003, created a simple Cat vs Dog classifier and deployed it on Hugging Face Spaces; we would like you to take a look at it and give feedback.

Once the demo is finalized, there are two ways to include it, taking the Hugging Face Audio Course as a reference. One is to create a new .mdx file in our fundamentals folder. The other is to create a new chapter, “Welcome to the course”, where we add what you'll learn, community notes, etc. We are still determining the optimal path, so please guide us.

Team members - @seshu-pavan , @bellabf , @aman06012003
bcc - @MKhalusova @johko @merveenoyan @lunarflu

Best,
Fundamentals team

Visualization (the difference between convolution and transformer)

Went through your proposed curriculum, and it is really amazing.

Just a suggestion: you could also look into this recent interpretability work for ViT.
For example: CVPR-2023

When it comes to ViT, intermediate interpretations are not well explored since the field is still emerging, so it would be really helpful to the community if you could add the above to your curriculum.

Transformers Architecture + ViT

Here's our (w/ @Anindyadeep, @kaustubh-s1) outline for the chapter on Transformers and Vision Transformers 🤗

Chapter Layout

  • Introduction

    • Convolution & CNNs: limits, inductive bias, difference with attention; welcome transformers
    • Prerequisites, e.g. the attention mechanism (see a nice blog post here for more refs)
  • Transformers and Vision Transformers

    • The origin of transformers + commonalities and differences between the original transformer (NLP applications) and ViT (maybe something visual)
    • Patches, i.e. Make attention work with visual data
    • Technical section w/ how the encoding process is done (add vanilla ViT code)
  • Pre-Trained Models, Finetuning etc

    • Welcome timm (not clear: do we need to stick to timm or transformers for vision?)
  • Interpretability + visualizing effects of inductive biases

    • #35
      • CNNs → feature maps or feature viz (maximally activating stimuli) à la the circuits thread
      • Transformers → attention maps (on patches)
  • ViTs at scale and real-world scenarios

    • How ViT got dominant in this space and how it is used in other areas (possible applications w/ mini-projects)
      • segmentation
      • object detection
      • etc ..
  • Research

    • Foundation models for ViTs
    • Few Shot Applications
    • more (any ideas?)
  • Other resources

  • Conclusions

    • Wrapping up
    • more ideas (?)

Selecting Frameworks and Implementing a MultiFramework Approach for the Course

Framework Choice

To create a comprehensive course, we must select and commit to specific frameworks. I propose focusing on PyTorch and JAX due to their popularity and versatile applications.

MultiFramework Vs MonoFramework Approach

Consider using two or more frameworks for each lesson, providing both JAX and PyTorch code options. This enables learners to choose their preferred framework and facilitates JAX adoption for those already familiar with PyTorch.


Let's collect suggestions and insights on this matter.

Synthetic Data Creation - Proposed Outline

INTRODUCTION (introduction-to-synthetic-data.mdx)

  • What is synthetic data?
  • Why would you use synthetic data?

GENERATING SYNTHETIC DATA (generating-synthetic-data.mdx)

  • CAD & Blender / GLSL
  • Deep Generative Models

PRACTICAL APPLICATIONS & CHALLENGES (practical-applications-and-challenges.mdx)

  • Synthetic Faces
  • Synthetic Animals
  • Synthetic Objects
  • Medical Imaging (identification of tumors, macular degeneration, Alzheimer's, liver disease)
  • Plant Disease (tomato, cotton, soybeans)
  • Traffic signs / emergency vehicles (autonomous driving)
  • Industrial waste sorting
  • Identification of fakes (images, speech)

HANDS-ON DEMOS

RESOURCES (resources.mdx)

  • Datasets (FaceSynthetics, ShapeNet / ABO, Synthetic Animals Dataset, leaf disease (tomato & cotton plants, crop disease Ghana), CIFAKE, traffic signs, PET scans)
  • Relevant research

Definition of image

Hi,

I am starting the course and I have just joined the community on Discord. There might be a typo in the definition of an image in Unit 1.

(screenshot of the image definition from Unit 1)

Shouldn't it be $n=2$ and not $n=1$ ?
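
For context, the usual convention defines a (grayscale) image as a function of two spatial variables (quoting the standard form, not the exact wording from Unit 1):

$$ I(x, y), \qquad (x, y) \in \mathbb{R}^2 $$

i.e. $n = 2$ spatial coordinates (plus $m$ channels for color images), which supports the $n = 2$ reading.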

Thanks

Unit 12 Ethics and Bias : Draft Outline

Hey fellow CV Course Contributors and Reviewers 🤗

This issue discusses an initial draft for the unit Ethics and Bias. We read a few posts and blogs, searched through datasets, and created this simple and brief outline. Please feel free to share your views on it, so we can improve this unit.

We prepared this unit by combining theoretical concepts, case studies, and practical examples, and finally close with Hugging Face's mission and efforts to promote Ethical AI for Society. The structure is slightly inspired by the HF Audio Course.

1. Introduction

The ImageNet Roulette Case Study and Implications

2. Ethics and Bias in AI

What are ethics and bias? Why do they matter? Include short examples and reflect on the previous ImageNet Roulette example.

3. How bias creeps into AI models (text, vision, speech)

Give one example for each modality with implications, and mention the HF space: https://huggingface.co/spaces/society-ethics/

4. Types of bias in AI

5. Bias evaluation in CV models and Metrics

Some example case studies:

We can also include other studies involving discrimination based on gender, ethnicity, and other factors.

Example to cite/mention, using learnings from the blog Bias in Text-to-Image Models: https://huggingface.co/blog/ethics-soc-4

6. Bias mitigation in CV models

7. HuggingFace's efforts: Ethics and Society

  • This will be the closing chapter of the unit, so we include Hugging Face's efforts in promoting Ethics for Society :D

  • Talk about Hugging Face's mission for transparency and reproducibility (model cards, datasets, evaluate), and other aspects

  • 6 categories of submissions for HF Spaces: Rigorous, Consentful, Socially Conscious, Sustainable, Inclusive, and Inquisitive
    https://huggingface.co/spaces/society-ethics/about

References:

  1. Ethics and Society Letter HuggingFace 1 to 5
  2. HuggingFace Audio Course for Structure (great flow of theory, practical examples, and notebooks)
  3. Ethics course Fast.ai
  4. Kaggle microcourse on Introduction to AI Ethics
  5. Montreal Ethics AI for Computer Vision
  6. Ethical Dimensions of Computer Vision Datasets

Please let us know about the content, suggestions are highly welcome 🤗 🚀 🔥

CV Tasks: Segmentation, Detection, etc...

Hey everyone!

As discussed in the discord previously, here's a rough template and summary of what we think could be a good start for this section. Upon further discussion with the team we wanted to also get some input from the HF team before we seriously start implementing everything.

Various use-cases of CV and their details

  • This section will briefly touch on the areas that make massive use of computer vision, such as segmenting images both semantically and instance-wise, and object detection in different scenarios.
  • A little bit about the various architectures used in these cases, like YOLO, ResNet, RCNN, Faster-RCNN, etc.
  • Intro to few-shot classification and its applications.
  • Intro to deeper concepts like Class Incremental Learning and Few-Shot Class Incremental Learning (FSCIL)? Not sure if we want to add this; we would need some input from the maintainers @merveenoyan @lunarflu @johko

Notebooks

  • Since this section seems more like hands-on code, people can quickly refer to the main application and reproduce it at their end if required, so providing notebooks seems sensible here.

  • For starters, these notebooks can be direct implementations of these CV tasks using torchvision and transformers.
    Then, for more detail on the inner workings, I would want to show an actual implementation of, let's say, U-Net for segmentation on a standard dataset. This would be more like replicating that architecture in vision scenarios. Do we want to do this?

Let us know if this sounds like a plan and we can iterate and improve upon this 🤗!
@johko @lunarflu @merveenoyan

Team members- @adhiiisetiawan @sarthak247 @vijiv11

Unit 4, Chapter 1 Fusion of Text and Vision: Draft Outline

Hey fellow CV Course Contributors and Reviewers 🤗

This issue discusses an initial draft for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. We feel that since this is an introductory section, it will contain little code; most of the emphasis will be on content and setting the stage for later sections in the unit. We would like this unit to be short and crisp, at most 3 sections, unless other additions such as spaces/demos are required.

Thought process: the previous unit is Unit 3 on Vision Transformers, and the next is Unit 5 on Generative Models. Content in this unit will therefore build on Unit 3's transformer models (not traditional approaches to these tasks, so we will refrain from adding too many historical aspects) and will also serve as a precursor to the later sections as well as to Unit 5 on Generative Models.

1. Introduction

  • Why Multimodality?
  • Real-world data is multimodal (it is always a combination of different modalities)
  • Short example of the human sensory feedback system (humans make decisions based on different sensory inputs and feedback)
  • Multimodal in what sense? Data? Models? Fusion technique? Are spectrograms an example of multimodal data representation? (Input is multimodal, output is multimodal, or input and output are of different modalities; this part is the foundation for multimodal tasks and models.)
  • Why data is multimodal in many real-life scenarios, how real-life content is multimodal, and why that is essential for search (examples from Google and Bing)
  • Some cool applications and examples of multimodality (Robotics: Vision Language Action models like RT2, RTX, Palm-E)

2. Multimodal Tasks and Models

A brief overview of different tasks and models (with more emphasis on those tasks that will be taken up in other sections of the course, like #29 and #28).

Briefly mention the tasks and models (task, input and output, models with links or spaces). We can include other examples like text-to-speech and speech-to-text among the tasks and add a one-liner referring to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.

Tasks and Models (each task, its input and output, and around 3-4 model names to go with it):

  • Document Visual Question Answering (text + vision), Models: LayoutLM, Nougat, Donut.
  • Image to Text, Visual Question Answering Models: Deplot, Pix2Struct, VILT, TrOCR, BLIP
  • Text to Image (synthesis and generation) SD, IF etc
  • Image and Video Captioning
  • Text to Video Models: CLIP-VTT etc

We can also create an infographic that divides the models into different categories like text + vision, text + vision + audio, more than 3 modalities etc, like a chart or hierarchy.

Mention everything related to tasks on vision + X (audio, text) and focus on Vision Language Models (text + vision) in the next section.

3. Vision Language Models

  • Introduction to Vision Language Models (brief, mechanism)
  • Cool applications and examples (multimodal chatbots like GILL, LLaVA, Video-ChatGPT, and some cool applications being developed in #29)
  • Emphasize on tasks that involve CLIP and relatives #29
  • A brief ending of the introduction section which sets the stage for next sections like CLIP and relatives and fine-tuning.

References:

  1. Awesome Self-Supervised Multimodal Learning
  2. HF Tasks
  3. Multi Modal Machine Learning Course, CMU
  4. Meta's ImageBind
  5. Multimodal Machine Learning: A Survey and Taxonomy
  6. Recent blog by Chip Huyen

Please feel free to share your views on the outline 🤗 🚀 🔥

Image Reconstruction/Enhancement with Real-time Image Processing

a) Image Reconstruction/Enhancement

  1. Image Transformation through Domains:
  • Fourier Transform
  • Discrete Fourier Transform
  • Convolution Theorem and Filtering in Frequency Domain
  • Discrete Cosine Transform
  • Haar Transform
  2. Image Enhancement:
  • Operations in Frequency Domain vs Spatial Domain (Point Processing Techniques/Spatial Filtering)
  • Histogram Equalization
  • Histogram Matching
  • Denoising/Deblurring etc.
  • Upsampling/Super-Resolution
  3. Image Reconstruction/Enhancement using Deep Learning
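
For reference, the convolution theorem listed under "Image Transformation through Domains" above, in its standard form:

$$ \mathcal{F}\{ f * h \} = \mathcal{F}\{ f \} \cdot \mathcal{F}\{ h \} $$

i.e. spatial-domain filtering with a kernel $h$ is equivalent to point-wise multiplication of the image and kernel spectra, which is what frequency-domain filtering exploits.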

b) Development of a Real-time Image Processing Framework

  1. Data Acquisition
  2. Pre-processing (including Image Reconstruction/Enhancement)
  3. Object Detection/Segmentation (including model optimization using pruning/quantization etc)
  4. Post-processing
  5. Containerization/Storage

c) Optional: Use case in real world applications: Medical Imaging

LaTeX and <Tip> blocks seem to not render correctly

Description:
While accessing the preliminary rendered content of the "Welcome" unit in the computer vision course hosted at this link, @bellabf and I encountered two main issues that significantly impact the readability and educational value of the material.

  1. LaTeX Equations Not Displaying Correctly: The LaTeX equations embedded within the course content are not rendered properly. It appears that equations enclosed in single dollar signs $...$ or $$...$$ do not show up as expected. I'm not sure if the problem lies with the Markdown renderer or another aspect of the documentation infrastructure.

  2. <Tip> Block Quotes Not Visible: Additionally, the custom <Tip>...</Tip> block quotes intended to highlight important information or tips are not visible at all. This makes it difficult to distinguish essential pointers or notes that could enhance understanding of the course material.

Questions:

  • Has anyone else encountered these issues, or is it an isolated case specific to the viewing website ?
  • Are there any known fixes or workarounds for the LaTeX equation rendering problem? Specifically, does switching to double dollar signs $$...$$ consistently resolve similar issues?
  • Regarding the <Tip> block quotes not showing up, is this a known limitation of the current "beta" documentation platform, or could it be an oversight in the HTML/CSS styling?

Possible Solutions:

  • For the <Tip> block quotes, it might be worthwhile to review the CSS associated with these elements or ensure that the custom HTML tags are correctly interpreted by the Markdown renderer.

Request:
We are reaching out to the documentation team and the course maintainers to seek guidance on these issues.

Example:
Here is a screenshot I took of one of the affected chapters (Unit 13: Hyena). Unit 1: Images, Unit 13: Hiera, and I suppose all chapters that contain equations/tips are affected as well:
(screenshot of the affected page)

Feature Extraction Outline

I decided to join the Feature Extraction team and helped prepare an outline for the section. So here it is:

Feature Detection

  • What are features?
  • What makes a good feature?
  • Simple Feature Detection (e.g. Harris Corner Detection, Hough Transform, Shi-Tomasi)
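
For the Harris corner detection bullet, a minimal OpenCV sketch along the lines of the OpenCV feature2d tutorial in the resources (the image path is a placeholder):

    import cv2
    import numpy as np

    # Harris corner detection; "chessboard.png" is a placeholder image.
    image = cv2.imread("chessboard.png")
    gray = np.float32(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))

    # blockSize: neighbourhood size, ksize: Sobel aperture, k: Harris free parameter
    response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

    # Mark strong corners in red and save the result
    image[response > 0.01 * response.max()] = [0, 0, 255]
    cv2.imwrite("corners.png", image)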

Feature Description

  • How can we represent features in data structures?
  • What makes a good descriptor (think of different image conditions, rotation, translation, etc.)
  • Example algorithm: SIFT & SURF

Feature Matching

  • How can we match detected features from image-to-image?
  • Examples (Brute Force, FLANN, maybe others)
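
And a minimal matching sketch (ORB + brute-force Hamming matching is used here just because it is compact; image paths are placeholders):

    import cv2

    # Two views of the same scene (placeholder paths).
    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

    # Detect keypoints and compute binary descriptors with ORB.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force matching on Hamming distance, keeping only cross-checked matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    # Draw the 30 best matches side by side.
    result = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)
    cv2.imwrite("matches.png", result)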

CNN Features + Visualization

  • How do ANNs (especially CNNs) detect features?
  • What do the features “look like”?
  • Some nice visualizations, maybe some DeepDream for fun

Resources:
https://docs.opencv.org/4.8.0/db/d27/tutorial_py_table_of_contents_feature2d.html
https://towardsdatascience.com/image-feature-extraction-traditional-and-deep-learning-techniques-ccc059195d04
https://fpcv.cs.columbia.edu/

The CNN Features part is probably the most controversial, as it talks about CNNs before the CNN chapter. But then again, it is also a really interesting topic related to feature extraction. I also didn't find anything about it in the CNN outline, so maybe we can write about it here and, if needed, move it to the CNN section later on.

Notebooks and .mdx files
For now we will start writing separate .mdx files for each of the topics, mainly for ease of parallel working, but depending on how extensive they are, we might also merge some of them later on.

We also plan to have two notebooks, one about Feature Extraction, which gives a walkthrough with a complete example and one for CNN visualization part (if we do this here)

Let us know what you think @merveenoyan , @lunarflu :)

A view synthesis chapter?

Hi 👋
I think the chapter about 3D vision is quite overloaded. I suggest dedicating a separate chapter to view synthesis to do both topics justice.

For example (just a suggestion), the content of the two chapters could look like this:

3D Computer Vision:

  • 3D Object Detection
  • 3D Human Pose Estimation
  • Point Cloud Classification
  • 3D Object Tracking

View Synthesis:

  • Scene Reconstruction
  • NeRF
  • Gaussian Splatting

Computer Vision Projects with Gradio Section

Hello everyone, 👋🏻

Let's outline the section on hands-on interactive projects with Gradio.

Topics Draft

  • Intro to using Gradio.
  • Using Gradio Interface class and 🤗 Transformers API integration.
  • Using Gradio Blocks API.
  • Gradio CV specific Components.
  • Gradio CV projects.
    • Gradio CV projects (Detection, Segmentation, Generation, etc.) [1 ... N].

There should be more projects based on the content of other sections.
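
For orientation, a minimal sketch of the Interface + 🤗 Transformers combination mentioned above (the checkpoint is just an example classifier):

    import gradio as gr
    from transformers import pipeline

    classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

    def classify(image):
        preds = classifier(image, top_k=3)
        return {p["label"]: float(p["score"]) for p in preds}

    # Simple web demo: upload an image, get the top-3 predicted labels.
    demo = gr.Interface(
        fn=classify,
        inputs=gr.Image(type="pil"),
        outputs=gr.Label(num_top_classes=3),
        title="Image classification demo",
    )
    demo.launch()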

Before finalizing let's gather thoughts and discuss.

Add all relevant links to github

We've got some links (spreadsheet etc.) in the Discord channel. It could be nice to move them here somewhere and have one place with everything.

Unit 3 Chapter 3 contents: Transfer Learning and Fine-Tuning Vision Transformers

Hey, sorry for the delayed issue. Together with our fellow collaborators (@alanahmet, @sezan92), we managed to get our curriculum out for Transfer Learning and Fine-tuning with Vision Transformers 🤗

Chapter Layout

  • A small introduction to knowledge transfer. Why training vision models from scratch is not always the solution.
  • Transfer Learning vs Fine-tuning.
  • Transfer Learning in depth with 🤗 transformers / torch code.
    • Removing the last layer and adding an MLP
    • Removing the last layers and adding an MLP + additionally re-training a small percentage of the layers
  • Fine Tuning in depth with 🤗 transformers / torch code
    • General Fine-tuning (i.e. take a small transformer and full fine-tune it)
    • Fine-tuning using PEFT (example: LoRA). This should include a small intro on Parameter Efficient Techniques.
  • When to use Fine-tuning and Transfer Learning.
  • Conclusion.
  • References.
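
To make the "removing the last layer and adding an MLP" bullet concrete, a minimal transfer-learning sketch (checkpoint, head size, and class count are placeholders, not final choices):

    import torch
    import torch.nn as nn
    from transformers import ViTModel

    # Freeze a pretrained ViT backbone and train only a new MLP head on the [CLS] token.
    backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    for param in backbone.parameters():
        param.requires_grad = False

    head = nn.Sequential(
        nn.Linear(backbone.config.hidden_size, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
    )

    pixel_values = torch.randn(2, 3, 224, 224)  # stand-in for a processed batch
    with torch.no_grad():
        cls_token = backbone(pixel_values).last_hidden_state[:, 0]
    logits = head(cls_token)
    print(logits.shape)  # torch.Size([2, 10])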

Let's discuss more on this and re-iterate on the chapter content if needed.
