TY - GEN
T1 - Learning Program Representations for Food Images and Cooking Recipes
AU - Papadopoulos, Dim P.
AU - Mora, Enrique
AU - Chepurko, Nadiia
AU - Huang, Kuan Wei
AU - Ofli, Ferda
AU - Torralba, Antonio
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online at http://cookingprograms.csail.mit.edu.
AB - In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online at http://cookingprograms.csail.mit.edu.
KW - Datasets and evaluation
KW - Recognition: detection
KW - Vision + language
KW - categorization
KW - retrieval
UR - https://www.scopus.com/pages/publications/85138902242
U2 - 10.1109/CVPR52688.2022.01606
DO - 10.1109/CVPR52688.2022.01606
M3 - Conference contribution
AN - SCOPUS:85138902242
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 16538
EP - 16548
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Y2 - 19 June 2022 through 24 June 2022
ER -