CSCI-GA.3033-102: Learning with Large Language and Vision Models

Course Information

Instructor: Saining Xie

Time: 04:55PM-06:55PM

Syllabus (TBD; the following are potential topics.)

Neural networks and deep learning. Large-scale optimization. Language modeling: n-gram models, word embeddings, LSTMs, transformers, BERT, GPT. Vision modeling: convnets and vision transformers. Classification and detection (Mask R-CNN, DETR), segmentation (FCN, SAM), motion, depth. Vision self-supervised learning. Generative models (GANs, VAEs, diffusion models). Alignment (RLHF). Multimodal learning and vision-language models (CLIP, BLIP, Flamingo).

Prerequisites

Note: This course is not an introductory AI/machine learning class. It is an advanced graduate seminar with a strong emphasis on research. Participants should already have a solid foundation in deep learning and computer vision, as this background is necessary for active engagement in class discussions and successful completion of the final project.

**Students are expected to have completed at least one of these courses: 1) Deep Learning, 2) Machine Learning, or 3) Computer Vision.**

Python programming; deep learning programming with PyTorch or JAX
Foundations of machine learning
Foundations of deep learning
Linear algebra
Probability and statistics

Structure

The course combines lectures, paper-reading seminars, and semester-long hands-on projects. Students will be organized into groups of 4-5 that collaborate on projects and take part in presentations and discussions during the paper seminars.