1. P. Schäuble, Multimedia information retrieval: content-based information retrieval from large text and audio databases. Springer Science & Business Media, 2012, vol. 397.
  2. P. Natarajan, P. Natarajan, S. Wu, X. Zhuang, A. Vazquez Reina, S. N. Vitaladevuni, K. Tsourides, C. Andersen, R. Prasad, G. Ye, D. Liu, S.-F. Chang, I. Saleemi, M. Shah, Y. Ng, B. White, L. Davis, A. Gupta, and I. Haritaoglu, “BBN VISER TRECVID 2012 multimedia event detection and multimedia event recounting systems,” in Proceedings of TRECVID 2012. NIST, USA, 2012.
  3. Z. Lan, L. Jiang, S.-I. Yu, C. Gao, S. Rawat, Y. Cai, S. Xu, H. Shen, X. Li, Y. Wang, W. Sze, Y. Yan, Z. Ma, N. Ballas, D. Meng, W. Tong, Y. Yang, S. Burger, F. Metze, R. Singh, B. Raj, R. Stern, T. Mitamura, E. Nyberg, and A. Hauptmann, “Informedia @ TRECVID 2013,” in Proceedings of TRECVID 2013. NIST, USA, 2013.
  4. H. Cheng, J. Liu, S. Ali, O. Javed, Q. Yu, A. Tamrakar, A. Divakaran, H. S. Sawhney, R. Manmatha, J. Allan et al., “SRI-Sarnoff AURORA system at TRECVID 2012: Multimedia event detection and recounting,” in Proceedings of TRECVID, 2012.
  5. M. Janvier, X. Alameda-Pineda, L. Girin, and R. Horaud, “Sound representation and classification benchmark for domestic robots,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 6285–6292.
  6. M. Janvier, X. Alameda-Pineda, L. Girin, and R. Horaud, “Sound-event recognition with a companion humanoid,” in 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012). IEEE, 2012, pp. 104–111.
  7. W. K. Edwards and E. D. Mynatt, “An architecture for transforming graphical interfaces,” in Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’94. New York, NY, USA: ACM, 1994, pp. 39–47.
  8. A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 24th European Signal Processing Conference 2016, Budapest, Hungary, 2016.
  9. J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014.
  10. M. Yang and J. Kang, “Psychoacoustical evaluation of natural and urban sounds in soundscapes,” The Journal of the Acoustical Society of America, vol. 134, no. 1, pp. 840–851, 2013.
  11. K. Hiramatsu and K. Minoura, “Response to urban sounds in relation to the residents’ connection with the sound sources,” in Proc. of Internoise 2000, 2000.
  12. D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: an IEEE AASP challenge,” in 2013 IEEE WASPAA.
  13. K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26–30, 2015, pp. 1015–1018.
  14. J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014.
  15. J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  16. A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, submitted.
  17. S. Ntalampiras, “A transfer learning framework for predicting the emotional content of generalized sound events,” The Journal of the Acoustical Society of America, vol. 141, no. 3, pp. 1694–1701, 2017.
  18. R. A. Stevenson and T. W. James, “Affective auditory stimuli: Characterization of the International Affective Digitized Sounds (IADS) by discrete emotional categories,” Behavior Research Methods, 2008.
  19. S. Frühholz, W. Trost, and S. A. Kotz, “The sound of emotions: Towards a unifying neural network perspective of affective sound processing,” Neuroscience & Biobehavioral Reviews, vol. 68, pp. 96–110, 2016.
  20. A. Darvishi, E. Munteanu, V. Guggiana, and H. Schauer, “Designing environmental sounds based on the results of interaction between objects in the real world,” in Human-Computer Interaction–INTERACT, vol. 95, pp. 38–42.
  21. R. Schafer, The Soundscape: Our Sonic Environment and the Tuning of the World. Inner Traditions/Bear, 1993.
  22. B. Gygi, G. R. Kidd, and C. S. Watson, “Spectral-temporal factors in the identification of environmental sounds,” The Journal of the Acoustical Society of America, vol. 115, no. 3, pp. 1252–1265, 2004.
  23. A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman, “Visually indicated sounds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405–2413.
  24. D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proceedings of the 21st ACM International Conference on Multimedia, ser. MM ’13. New York, NY, USA: ACM, 2013, pp. 223–232.
  25. T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks,” arXiv preprint arXiv:1410.8586, 2014.
  26. J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero, “A survey of video datasets for human action and activity recognition,” Computer Vision and Image Understanding, vol. 117, no. 6, pp. 633–659, 2013.
  27. R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
  28. S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining.”
  29. J. Zhong, Y. Cheng, S. Yang, and L. Wen, “Music sentiment classification integrating audio with lyrics,” Journal of Information & Computational Science, vol. 9, no. 1, pp. 35–44.
  30. R. W. Picard, “Computer learning of subjectivity,” ACM Computing Surveys (CSUR), vol. 27, no. 4, pp. 621–623, 1995.
  31. M. Soleymani, Y.-H. Yang, Y.-G. Jiang, and S.-F. Chang, “ASM’15: The 1st international workshop on affect and sentiment in multimedia,” in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, p. 1349.
  32. W. Davies, M. Adams, N. Bruce, R. Cain, A. Carlyle, P. Cusack, D. Hall, K. Hume, A. Irwin, P. Jennings, M. Marselle, C. Plack, and J. Poxon, “Perception of soundscapes: An interdisciplinary approach,” Applied Acoustics, vol. 74, no. 2, pp. 224–231, February 2013.
  33. Ö. Axelsson, M. E. Nilsson, and B. Berglund, “A principal components model of soundscape perception,” The Journal of the Acoustical Society of America, vol. 128, no. 5, pp. 2836–2846, 2010.
  34. D. Stowell and M. D. Plumbley, “An open dataset for research on audio field recording archives: freefield1010,” CoRR, vol. abs/1309.5275, 2013.
  35. H. Lei, J. Choi, A. Janin, and G. Friedland, “User verification: Matching the uploaders of videos across accounts,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 2404–2407.
  36. S. Chu, S. Narayanan, and C. C. J. Kuo, “Environmental sound recognition with time-frequency audio features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, Aug 2009.
  37. B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, “Yaafe, an easy to use and efficient audio feature extraction software,” in Proceedings of the 11th International Society for Music Information Retrieval Conference, Utrecht, The Netherlands, August 9-13 2010, pp. 441–446.
  38. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  39. K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2015, pp. 1–6.
  40. A. Neviarouskaya, H. Prendinger, and M. Ishizuka, “SentiFul: Generating a reliable lexicon for sentiment analysis,” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, Sept. 2009, pp. 1–6.
  41. A. Datta, M. Shah, and N. D. V. Lobo, “Person-on-person violence detection in video data,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 1. IEEE, 2002, pp. 433–438.
  42. C. Guastavino, “The ideal urban soundscape: Investigating the sound quality of French cities,” Acta Acustica united with Acustica, vol. 92, no. 6, pp. 945–951, 2006.