(Original entry at Jordi Pons’ blog)
Here is my first personal AMA interview! But wait, what’s an AMA interview? AMA stands for “Ask Me Anything” in Reddit jargon. After reading this interview you will know a bit more about my life and way of thinking. This interview is a dissemination effort by the María de Maeztu program (which funds my PhD research) and the AI Grant (which supports our Freesound Datasets project). Let’s start!
Q: Hi! Who are you?
A: Hi there! My name is Jordi Pons and, although I have been living in Barcelona for many years now, my (family) origins are in Olot: a small town on the Catalan side of the Pyrenees. These days I am pursuing a PhD on (music) audio tech at the Music Technology Group of the Universitat Pompeu Fabra in Barcelona. In addition to my daytime academic job, in my spare evenings I regularly join a local debate group where we aim to understand and discuss the social implications of (our?) recent technology developments. Besides that, I love playing basketball and going to open mic sessions in cheap bars with my friends – not to mention that I’m a fervent Barça fan, a bad guitarist, and an even worse chess player.
Q: Your PhD is supported by the María de Maeztu program, which aims to promote data-driven knowledge extraction research within your university’s department. What’s the frame of your research?
A: One fundamental goal in our research field is to automatically structure sound and music collections, in order to facilitate access to and retrieval of their audio content. Back when I started my PhD, the state of the art had not yet succeeded in automatically labeling large collections of audio recordings with the most common semantic labels used in practical applications – e.g., for music collections we are interested in labeling semantic concepts such as the genre, lead instrument, musical key, or mood of each musical piece. These kinds of tasks have only been partly solved for small audio collections with a small number of classes for each concept. Known obstacles to advancing this type of research include: (i) the lack of sufficiently large audio collections with appropriate labels that can be used for machine learning; and (ii) the lack of adequate and robust audio features that are relevant to the classification tasks to be performed. During my PhD, we are tackling both problems: (i) by building Freesound Datasets – see further details below; and (ii) by exploring feature-learning-based methods – which has been the main focus of my research, and has led us to experiment with musically motivated deep learning architectures.
Q: What is your experience in the context of the broader María de Maeztu program and your work at the department?
A: Having the María de Maeztu program running in our department facilitated several collaborations that would otherwise have been more challenging. For example, during the in-house Doctoral Students Workshop I was able to connect with many department researchers who were working on related areas – and these connections have been very useful when seeking advice in later stages of my research. Another example: since the María de Maeztu program encourages collaborations among department researchers, we took the initiative to collaborate with another student funded by María de Maeztu and write a conference paper together. Having a project like María de Maeztu that involves the whole department definitely helps establish collaborations – because a connection already exists through the project.
Q: What was your AI Grant Application for?
A: To build Freesound Datasets: a platform for the collaborative creation of open audio datasets labeled by humans. Inspired by the undeniable contribution of the ImageNet dataset to the computer vision field, our goal is to build a huge dataset for audio AI. Given that Freesound is one of the biggest open libraries of Creative Commons sounds on earth (with more than 350k sounds), we rely on this source of open data for the creation of datasets – which will facilitate data sharing and reproducible research.
Q: How is your project going?
A: A few months ago we launched the Freesound Datasets platform so that people all over the world can contribute to it. However, the project has been alive for more than one year now. As a result of creating and testing the platform, we were able to collect a significant amount of data. Reliable metadata were collected for more than 18k sounds, which are now available for a Kaggle competition. Although we are far from our ambitious goals, during this first year we have laid the groundwork for scaling up the corpus creation process over the following years.
Q: What was an unexpectedly difficult part of your project?
A: An unexpected challenge was annotating sounds against a finite but large set of audio categories. Given that the number of audio categories we consider is huge (more than 600), most annotators are not aware of the many possibilities our comprehensive dictionary of audio categories offers. Let’s say we want to annotate a violin being played with the pizzicato technique. It is reasonable to assume that an annotator will tag the sound as violin, but they will not use the pizzicato tag unless they are aware that such a category exists. To overcome this issue, we designed an interface that makes the user aware of every related category in our set of possible audio categories.
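To make the violin/pizzicato example concrete, here is a minimal sketch of how an annotation tool might surface related categories. It is only an illustration under stated assumptions, not the actual Freesound Datasets implementation: the tiny `TAXONOMY` dictionary and the helper names `find_parent` and `related_categories` are hypothetical.

```python
# Hypothetical mini-taxonomy: each category maps to its more specific children.
# The real category set described above is far larger (600+ categories).
TAXONOMY = {
    "Musical instrument": ["Bowed string instrument", "Guitar"],
    "Bowed string instrument": ["Violin", "Cello"],
    "Violin": ["Pizzicato"],
    "Cello": [],
    "Guitar": [],
    "Pizzicato": [],
}


def find_parent(category, taxonomy=TAXONOMY):
    """Return the parent of `category`, or None if it is a root category."""
    for parent, children in taxonomy.items():
        if category in children:
            return parent
    return None


def related_categories(category, taxonomy=TAXONOMY):
    """Collect the parent, siblings, and children of a chosen category.

    An interface could show these next to the annotator's first choice,
    so that more specific tags (e.g. "Pizzicato") are not overlooked.
    """
    if category not in taxonomy:
        raise KeyError(f"Unknown category: {category}")
    parent = find_parent(category, taxonomy)
    siblings = []
    if parent is not None:
        siblings = [c for c in taxonomy[parent] if c != category]
    return {
        "parent": parent,
        "siblings": siblings,
        "children": list(taxonomy[category]),
    }


if __name__ == "__main__":
    # An annotator picks "Violin"; the tool suggests what else may apply.
    print(related_categories("Violin"))
```

With a lookup like this, choosing "Violin" would immediately expose "Pizzicato" as a more specific option, which is the kind of awareness the interface aims for.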
Q: What tools do you use?
A: These days, I spend most of my time writing Python and LaTeX files – although I started programming in Matlab and PHP. For many years I used Theano with Lasagne for my deep learning experiments, but now I use TensorFlow. However, I have to say that most of my colleagues use Keras and some others are super excited about using PyTorch. Which one is the best? Here is my take: MILA is no longer developing Theano; TensorFlow is used by lots of people, and therefore there is a huge community of users posting questions online; Keras’ abstractions help keep Python scripts very compact; and PyTorch seems to be very handy and easy to read, since its logic is similar to NumPy’s.
Q: What’s the biggest mistake you often see researchers make?
A: I think there are two kinds of researchers in the computer science field: (i) researchers who spend most of their time reading, writing, and thinking – and, as a result, have little time for actually “coding this new crazy idea”; and (ii) researchers who “code any crazy idea” – and, as a result, their ideas are less thought through and they are likely to end up reinventing the wheel or making minor contributions. I think good researchers are the ones who can read the state of the art effectively in order to properly connect the dots, but who also spend time coding and gaining hands-on experience.
Q: How did you get into coding?
A: I started coding websites with one of my best friends at high school! Our biggest project was a PHP-MySQL site (that we programmed from scratch following YouTube videos!) that was meant to let local retailers easily build a website. We were so excited that our friends got bored of us talking about it all the time! I searched the internet to see if there was any digital trace of our project – and I found this (incredibly nice) video ad my friend made!
Q: How do you think AI will affect our lives in the next 3-5 years?
A: There is a consensus within the research community that “task-oriented” AI systems are likely to work when it is feasible to collect large amounts of training data. In line with that, I expect AI-based technologies to help minimize the amount of repetitive work that is required for some tasks, or to help people solve specific problems in their daily lives. I think one of the main limitations of current AI research prototypes (which are shaping the technologies we will be consuming in 2021) is that they require assuming that a finite set of events/actions is enough to interact with the world. For this reason, and because “task-oriented” AI systems have not yet fully impacted our lives, I don’t think technologies beyond “task-oriented AI” will be ready for mass consumption in such a short time span. In order to have AI-based technologies that can affect our daily lives in an unprecedented manner, we need more general systems – closer to what AGI (Artificial General Intelligence) stands for. In this direction, I expect unsupervised learning to play a decisive role in advancing AGI – due to its natural way of learning representations without assuming a finite set of events/actions under the hood.
Q: What are your goals for the coming year?
A: To keep having fun! One of the best parts of being a PhD student is that one has the freedom to choose what to do, and with whom.
Q: What’s a good book you’ve read recently?
A: On my bedroom shelf lives a book in Catalan entitled “Capitalisme i Sobirania” (in English: “Capitalism and Sovereignty”), which offers a critical dissection of capitalism from the perspective of building spaces of sovereignty. As I previously mentioned, I enjoy thinking about what the implications of current technology developments for our society might be. Along this journey, I have seen many people conclude that the main “tech fears” are associated with the way we consume tech. Given the way our economy is organized (capitalism), our technology consumption is basically driven by the market. As a result, cutting-edge technology is mainly delivered to our society through companies. Interesting fact? Given the high specialization of the tech sector (not everyone knows how to build an AI system), many people’s freedom (sovereignty) relies only on choosing whether to use a new technology or not (a choice that would eventually shape the market) – not on defining which technologies can be beneficial to our society. Therefore, note the power (and the responsibility!) that tech people have as architects who can build systems that can deeply transform our lives.
Q: Bonus question: what’s something that is considered normal today and will be seen as crazy in 10 years?
A: Interesting! Note that you are asking me to identify a “blind spot” in our society, as defined by Amin Maalouf in his book “Les Désorientés”. A “blind spot” is something that is hard to understand now, but that will be a prevalent idea in a few decades. For example: some decades ago it was accepted that women did not vote; or during the first industrial revolution there were no concerns about contaminating the environment. Following this idea, I think today’s “blind spot” is the digital footprint – given that most of our society is not aware of their digital “contamination”. I forecast that the coming generation will lead an ecological movement to fight data “contamination” – as a way to educate our society about the implications of their digital trace. Why is this not happening now? Because a large part of our society has lived their lives without using current technologies. As a result, our generation is not ready to lead such a movement, since most of our society does not comprehend the nature of the problem – given that most people were not educated to live in a digital world. It is important to note that most of today’s tech jobs did not even exist 30 years ago! Of course, the current level of tech comprehension/awareness will improve over the next generations – who will be ready to lead such a movement, since they will be educated to live with and to understand such technologies.