We investigate machine learning (ML) techniques for predicting the number of
galaxies (N_gal) that occupy a halo, given the halo's properties. These types
of mappings are crucial for constructing the mock galaxy catalogs necessary for
analyses of large-scale structure. The ML techniques proposed here distinguish
themselves from traditional halo occupation distribution (HOD) modeling as they
do not assume a prescribed relationship between halo properties and N_gal. In
addition, our ML approaches depend only on parent halo properties (as do HOD
methods), an advantage over subhalo-based approaches because subhalos are
difficult to identify correctly. We test two algorithms: support vector
machines (SVM) and k-nearest-neighbour (kNN) regression. We take
galaxies and halos from the Millennium simulation and predict N_gal by training
our algorithms on the following six halo properties: number of particles, M_200,
\sigma_v, v_max, half-mass radius and spin. For Millennium, our predicted N_gal
values have a mean squared error (MSE) of ~0.16 for both SVM and kNN. Our
predictions match the overall distribution of halos reasonably well and
reproduce the galaxy correlation function at large scales to within ~5-10%. In
addition, we
demonstrate a feature selection algorithm to isolate the halo parameters that
are most predictive, a useful technique for understanding the mapping between
halo properties and N_gal. Lastly, we investigate the use of these ML-based
approaches to make mock catalogs for different galaxy subpopulations (e.g.
blue, red, high M_star, low M_star). Given its non-parametric nature as well as
its powerful
predictive and feature selection capabilities, machine learning offers an
interesting alternative for creating mock catalogs.
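
As a rough illustration of the kNN regression and MSE evaluation described above, the following minimal sketch trains on synthetic data using only the Python standard library. It is not the pipeline used in this work: the toy halos, the two features (log mass and spin), the assumed relation between mass and N_gal, and all function names (standardize, knn_predict, mse, make_toy_halos, run_demo) are hypothetical and chosen only for illustration. A practical analysis would more likely use a library implementation and the six halo properties listed above.

```python
import math
import random


def standardize(train, test):
    """Scale features to zero mean, unit variance using training-set statistics."""
    n_feat = len(train[0])
    means = [sum(r[j] for r in train) / len(train) for j in range(n_feat)]
    stds = []
    for j in range(n_feat):
        var = sum((r[j] - means[j]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_feat)] for r in rows]

    return scale(train), scale(test)


def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x (Euclidean)."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    return sum(y_train[i] for i in order[:k]) / k


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def make_toy_halos(n, rng):
    """Hypothetical halos: (log10 mass, spin) -> N_gal. Not simulation data."""
    X, y = [], []
    for _ in range(n):
        log_mass = rng.uniform(11.0, 14.0)
        spin = rng.uniform(0.0, 0.1)
        X.append([log_mass, spin])
        y.append(float(round(10 ** (log_mass - 12.5))))  # assumed toy relation
    return X, y


def run_demo(k=5, seed=0):
    """Train on toy halos, predict N_gal for held-out halos, return the MSE."""
    rng = random.Random(seed)
    X_train, y_train = make_toy_halos(400, rng)
    X_test, y_test = make_toy_halos(100, rng)
    X_train, X_test = standardize(X_train, X_test)
    y_pred = [knn_predict(X_train, y_train, x, k) for x in X_test]
    return mse(y_test, y_pred)


print("toy test-set MSE:", run_demo())
```

Features are standardized before the distance computation so that quantities with very different ranges (here log mass and spin) contribute comparably to the Euclidean distance. The MSE printed by this sketch depends entirely on the assumed toy relation and should not be compared with the Millennium results quoted above.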