• We tackle the problem of one-shot segmentation: finding and segmenting a previously unseen object in a cluttered scene based on a single instruction example. We propose a baseline architecture combining a Siamese embedding for detection with a U-net for segmentation and evaluate it on a novel dataset, which we call $\textit{cluttered Omniglot}$. Using oracle models with access to various amounts of ground-truth information, we show that in this kind of visual search task, detection and segmentation are two intertwined problems, the solution to each of which helps solving the other. We therefore introduce $\textit{MaskNet}$, an improved model that sequentially attends to different locations, generates segmentation proposals to mask out background clutter and selects among the segmented objects. Our findings suggest that such image recognition models based on an iterative refinement of object detection and foreground segmentation may help improving both detection and segmentation in highly cluttered scenes.