
CSCI-GA.3033-102 Special Topic - Learning with Large Language and Vision Models


Welcome to Special Topic - Learning with Large Language and Vision Models. This graduate-level course examines the influence of large foundation models across a range of fields. Rather than focusing on a single domain, it explores the intersection of multiple areas, such as language and vision. As multimodal learning becomes increasingly central to today’s AI landscape, the course is structured to provide a solid understanding of state-of-the-art models and techniques, including multimodal representation learning, generative AI for text and images, and joint embedding models. It also aims to equip students with the practical skills needed to pursue further research in this rapidly evolving field.

Teaching Team

Saining Xie

Office Hours: Monday 3:00 - 4:00 pm

Calendly for Booking

Sai Charitha Akula

Office Hours: Friday 12:00 - 1:00 pm. 60 Fifth Ave, 402.

Raviteja Chukkapalli

Office Hours: Friday 12:00 - 1:00 pm. 60 Fifth Ave, 402.


Students are expected to have a solid mathematics background and strong programming skills, and to have completed at least one of these courses: 1) Deep Learning, 2) Machine Learning, or 3) Computer Vision. Other requirements include: Python programming; algorithms and data structures (CSCI-UA.102); deep learning programming with PyTorch or JAX; foundations of machine learning; foundations of deep learning; linear algebra; probability and statistics (DS-GA.1002, MATH-UA.140, MATH-UA.235).


When: Monday, 4:55-6:55PM ET

Where: 251 Mercer St (Warren Weaver) Room 312 109

Format: The course will adopt a hybrid format. Initially, the instructor will provide lectures to offer a broad overview and context. Some lectures will also address non-technical elements, like strategies for reading academic papers. Following this, the class will seamlessly shift to student-led presentations and panel discussions, utilizing Alec Jacobson and Colin Raffel’s role-play seminar approach.

Discord Group: We will use Discord to facilitate discussion. You can find the Discord link on Brightspace.

Students auditing the course should email the instructor or any of the TAs to get access to the Discord server.

Class Schedule

A regularly updated schedule of individual classes and topics can be found on the Calendar page.


Grading will be based on three activities:

  1. Early assignment (10%)
  2. Semester-long project (60%)
  3. Paper review and panel discussion (30%)
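As a rough illustration of how the three weights combine, the Python sketch below computes a weighted total. The function name `final_grade`, the dictionary keys, and the 0-100 score scale are assumptions made for this example; they are not part of the official grading procedure.

```python
# Weights taken from the grading breakdown above, as integer percentages.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "early_assignment": 10,
    "project": 60,
    "review_and_panel": 30,
}


def final_grade(scores):
    """Combine per-activity scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Weighted sum, divided by 100 because the weights are percentages.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    example = {"early_assignment": 100, "project": 90, "review_and_panel": 80}
    print(final_grade(example))  # prints 88.0
```

Because the project carries 60% of the weight, a 10-point change in the project score moves the total by 6 points, while the same change in the early assignment moves it by only 1.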

1. Early assignment

This small warm-up exercise aims to give you hands-on experience and prepare you for the class. More information is available here. Please follow the instructions to submit your assignment.

2. Semester-long project

The main deliverable of the course is a semester-long project, designed to give you the open-ended opportunity to either:

  1. Build an LLVM-powered application or demo. LLVMs are powerful tools for solving exciting real-world problems. Through prompting techniques, compositional methods, and API interactions, LLVM inference engines can function as ready-to-use tools. They can be employed to automate processes, perform data analytics, generate captivating art, or simply build something cool.

  2. Conduct a research project. Should you wish to explore the research aspects of LLVMs more thoroughly, we invite you to undertake a research project tailored to your interests. Your focus could be on identifying a specific research topic within the realms of computer vision, NLProc, and machine learning. You can conduct comparative studies to uncover the limitations of current LLVMs, or enhance the overall design of LLVMs—be it through optimizing data pipelines, training objectives, or architectures.

Be aware that the line separating a demo from a research project can be somewhat indistinct; the instructor will assist you in appropriately categorizing your project idea. Additionally, there is no grading preference for either application/demo or research projects, so feel free to select the option that most excites you!

Project logistics

Both project formats may be done in teams of 1-3 students.

We will organize your project progress into two key milestones: (1) a preliminary proposal, and (2) a final submission/presentation. The dates for these milestones will be disclosed soon.

The preliminary proposal should sketch out the research question or application you’re keen to explore, along with the methodology you intend to employ. This should feature a concise overview of the LLVMs you aim to utilize, as well as a list of potential metrics for evaluating success.

For the final submission, both a write-up and a code repository are required, regardless of the project format. We will schedule a presentation/poster session for each team to present their work during the final week of the semester. Additional specifics will be provided soon, but anticipate the following:

  • Application / demo submissions should include a functional demo of your application, possibly through platforms like Gradio or Streamlit. This should be accompanied by a brief written explanation that outlines the problem you’re addressing, the LLVM(s) you’ve employed, and your implementation and evaluation process.

  • Research project submissions should include a final report resembling a research paper (4 to 9 pages, excluding references) and a code repository that allows your findings to be replicated. Clear and succinct writing is crucial; any lack of clarity or unnecessary complexity may lead to point deductions. For projects involving multiple contributors, a delineation of each participant’s role is mandatory. All submissions must be LaTeX-formatted and provided in PDF format (exceptions must be approved by the instructor). Using user-friendly web platforms like Overleaf is strongly encouraged.

Further details will soon be accessible here. We will also refresh the page with a compilation of possible project ideas and resources.

3. Paper Review and Panel Discussion

  1. Paper reviews. Before each class, you will be assigned 1-2 papers. Read these papers carefully and write a review of them. Your review should follow the style of a conference review, for example for CVPR or NeurIPS.

Paper reviews are due at 11:00 AM EDT on Gradescope on the day of the lecture.

  2. Paper discussions. During each class discussion, there is a panel of 4-5 students (you are expected to sign up for at least two panels). The student panelists lead a discussion moderated by the instructors. Everyone else is expected to participate by asking the panel questions. More details to follow.

4. Class attendance and participation

Attendance will be recorded at each class.

Late Submission Policy

  • Each student has 3 grace days for submitting assignments late without penalty, and may use them at their convenience. Some examples of how a student could use these grace days:
    • Student_1 submits 3 assignments each of which is one day late. They will not be penalized for any assignment.
    • Student_2 submits 1 assignment which is 1 day late and another assignment which is 2 days late. They will not be penalized for these two assignments.
    • Student_3 submits 1 assignment 3 days late. They will not be penalized for this single assignment.
  • Once a student exhausts their 3 grace days, they will receive:
    • 75% grade if their assignment is late by one day
    • 50% grade if their assignment is late by two days
    • 0% grade if their assignment is late by more than two days
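The policy above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `apply_late_policy` is hypothetical, and the sketch assumes that any remaining grace days are applied first and that only the leftover late days are penalized. The policy text does not spell out this partial-coverage case, so check with the teaching team before relying on it.

```python
from typing import Tuple

GRACE_DAYS = 3  # total grace days per student, from the policy above


def apply_late_policy(days_late: int, grace_days_left: int) -> Tuple[float, int]:
    """Return (grade multiplier, grace days remaining) for one assignment.

    Assumption (not stated in the policy): remaining grace days are
    consumed first, and only the leftover late days are penalized.
    """
    if days_late < 0 or grace_days_left < 0:
        raise ValueError("days_late and grace_days_left must be non-negative")

    used = min(days_late, grace_days_left)
    penalized_days = days_late - used

    if penalized_days == 0:
        multiplier = 1.0   # on time, or fully covered by grace days
    elif penalized_days == 1:
        multiplier = 0.75  # one day late after grace days run out
    elif penalized_days == 2:
        multiplier = 0.50  # two days late after grace days run out
    else:
        multiplier = 0.0   # more than two days late

    return multiplier, grace_days_left - used


if __name__ == "__main__":
    # Student_2 from the examples: one assignment 1 day late, another 2 days late.
    left = GRACE_DAYS
    for days in (1, 2):
        multiplier, left = apply_late_policy(days, left)
        print(days, multiplier, left)
```

In this sketch, Student_2 receives a multiplier of 1.0 on both assignments and ends with no grace days left, which matches the example in the list above.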