The fantastic futures of Slow AI and vibes-based search

Can AI weirdness help us to understand ourselves?

18 November 2024

In October I attended the Fantastic Futures conference in Canberra. Fantastic Futures is organised by AI4LAM, and brings together people from archives, libraries and museums to talk about how they are, or could be, using machine learning and other "AI" tools in their work.

When I arrived – indeed even when I booked to attend – I was feeling pretty jaded with AI. It has, of course, been the unavoidable topic of the last two years. But when the headline speakers were announced, I felt reassured. These are people I respect and in some cases know. They are thoughtful people deeply concerned about ongoing injustices and colonialism. There were certainly people speaking at and attending the conference who are enthusiasts and see the only ethical problem with AI as being how to ensure everyone has access to the tools. But the vibe of the conference was summed up in a phrase that started to bounce around the various talks: "Slow AI". This was particularly ironic given that several presentations were about projects with very tight externally-imposed deadlines for delivering something. But nearly everyone – including the people working on those projects – seemed committed to doing things right rather than merely fast.

Peter-Lucas Jones and Kathy Reid set the tone at the very beginning, with a nuanced and thoughtful conversation about the promise, dangers and weaknesses of automated speech-to-text (and to a lesser extent, text-to-speech) systems – with a particular focus on OpenAI's Whisper model. Peter-Lucas Jones leads Te Hiku Media, a Māori-owned company that started as a community radio station but now seems to be an intriguing cross between a media company, software startup, language preservation organisation, and activist centre. Kathy has researched the diversity (or lack thereof) in language model training data. She showed data revealing the same strong bias we see in many cultural products – the median voice in the training data is a 21-year-old Ohio man, and the further a speaker is from that, the more errors are made when transcribing and interpreting. Peter-Lucas gave us an astounding example of this when he said that models coming out of Silicon Valley – by their own admission – were only 50% accurate for translating Te Reo Māori into English. This was the trigger for Te Hiku to start building their own models.

I was planning to do a brief write-up of some of the sessions, but what I came away with from the conference wasn't really a list of interesting projects to find out more about. What I got was a new sense of the questions GLAM workers should ask about AI.

Grant Heinrich from the National Film and Sound Archive (who hosted the conference) told us about their "Bowerbird" project, attempting to build a speech recognition model based on Whisper but specific to the Australian accent. It sounds like they didn't really manage to do that, but they did train it to recognise Australian place names and other particular Australianisms. NFSA holds thirty linear years of audio, of which twenty linear years is accessible for automated transcription. They are working to do this in a way that allows them to do named entity recognition. This would allow a transcription to record, for example, when Paul Keating is speaking about Wollongong; someone interested in either of those named entities could then search across the collection and jump to the exact moment Paul Keating mentioned Wollongong in the clip. This is the kind of thing that could be done entirely manually, but... twenty linear years is a lot of audio, and NFSA's funding has been, let us say, "inconsistent" over the years.
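To make the mechanics a little more concrete, here is a minimal sketch of the general transcribe-then-tag approach, assuming the open-source openai-whisper and spaCy packages. The model choices and the audio filename are placeholders of my own – this is not NFSA's actual Bowerbird pipeline.

```python
# Minimal sketch: transcribe audio with Whisper, then run named entity
# recognition over each time-coded segment so mentions of people and places
# can be searched with a timestamp. Models and filename are placeholders.
import whisper   # openai-whisper
import spacy

asr = whisper.load_model("small")          # any Whisper checkpoint
nlp = spacy.load("en_core_web_sm")         # generic English NER model

result = asr.transcribe("interview.wav")   # hypothetical audio file

index = []
for segment in result["segments"]:
    doc = nlp(segment["text"])
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "GPE", "LOC"):   # people and places
            index.append({
                "entity": ent.text,
                "label": ent.label_,
                "start": segment["start"],           # seconds into the clip
                "end": segment["end"],
            })

# Every row where entity == "Wollongong" now gives a timestamp to jump to.
```

An off-the-shelf English NER model like this would, of course, stumble on exactly the Australianisms Bowerbird was trained to handle – which is presumably part of why building something local was worth the effort.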

I had a conversation with some people in the breaks about navigating the ethics of all this. Are there ethical uses of AI? Is there a point at which all the problematic issues are balanced out by the benefits – and who gets to decide that? Does it make a difference if the thing we want to do would in practical terms be impossible without the use of AI, not merely a bit less efficient? As well as the NFSA project, another example of this that I think is more problematic (though still an interesting experiment) is a trial the National Library of Australia has run. The NLA holds a large quantity of images with essentially no descriptive metadata. That is, they may have metadata about who the artist or photographer is, and perhaps the name of the image if it has one, but not a description of the image itself. This is extremely limiting from a discovery point of view, since it's difficult to do an effective search along the lines of "show me every image that includes a steam train". The NLA is experimenting with a possible solution by running image-to-text tools over a small "representative" set of images. The generated text descriptions are then used with CLIP to create embeddings that can be used to map a much larger sample of images (201,000 of them) to enable "similar to" discovery. This is obviously not nearly as accurate as it would be to manually create descriptions of every image and then run a more traditional full text search over those descriptions. But it's also not very realistic to think that the NLA will ever have the resources to create hundreds of thousands of detailed descriptions of these images. Does the benefit of being able to cheaply find images that are possibly similar to the search term outweigh the fact that OpenAI is a kleptomaniac imperialist company run by unscrupulous charlatans? Should we accept that fully cataloguing this material in a more complete and accurate way is not considered worthy of public funding? If nothing else, this project prompts us to ask such questions.
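As I understood the description, the pipeline looks roughly like the sketch below, assuming the Hugging Face transformers and sentence-transformers packages, with BLIP standing in for whatever image-to-text tool the NLA actually used. The model names and file paths are placeholders, not the NLA's actual choices.

```python
# Minimal sketch: caption a "representative" image with an off-the-shelf
# image-to-text model, embed the caption with CLIP, and use the shared
# text/image embedding space to rank a larger set of undescribed images.
# Model names and file paths are placeholders, not the NLA's pipeline.
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
clip = SentenceTransformer("clip-ViT-B-32")

# 1. Generate a text description for a representative image and embed it.
caption = captioner("representative.jpg")[0]["generated_text"]
caption_embedding = clip.encode(caption, convert_to_tensor=True)

# 2. Embed a larger set of uncatalogued images with CLIP's image encoder.
paths = sorted(Path("uncatalogued/").glob("*.jpg"))
image_embeddings = clip.encode([Image.open(p) for p in paths],
                               convert_to_tensor=True)

# 3. Rank the uncatalogued images by similarity to the generated description,
#    enabling "similar to" discovery without cataloguing each image by hand.
hits = util.semantic_search(caption_embedding, image_embeddings, top_k=5)[0]
for hit in hits:
    print(paths[hit["corpus_id"]].name, round(hit["score"], 3))
```

The key trick is that CLIP puts text and images in the same vector space, so a description generated for one image can be compared directly against images that have never been described at all.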

What is machine vision good for?

OpenAI's CLIP came up in a few other contexts, and I think might point to some answers. Peter Leonard from Stanford University asserted that CLIP is not all that good at straightforward categorisation, but can be very effective in identifying images that have some abstract concept in common. So classifying images in terms of their exact subject is something machine vision can struggle with. But returning a bunch of images that are "wistful", or that express "anticipation", turns out to be something it's quite good at. Leonard demonstrated with a clever example, showing us the results of a search over a collection of Scandinavian oil paintings for the term "비빔밥" (bibimbap). Unsurprisingly there are no images of Korean rice dishes with mixed toppings, nor indeed any descriptions written in Korean script, within the collection. But the search returned a group of images that noticeably had something in common. This is the uncanny AI Weirdness that Janelle Shane has leaned into in her work. Machines see the world differently to humans, often struggling to do things we find so basic as to be almost automatic, yet able in other contexts to perform feats that seem impossible.
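This kind of "vibes" search falls out of the same shared embedding space: any free-text query, however abstract or out-of-domain, lands somewhere in that space and ranks every image by how close it sits. Here is a minimal sketch, again assuming the sentence-transformers packaging of CLIP, with placeholder file paths and queries of my own choosing.

```python
# Minimal sketch of vibes-based search: embed a free-text query with CLIP's
# text encoder and rank pre-computed image embeddings by cosine similarity.
# The folder of paintings and the queries are placeholders for illustration.
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed the whole collection once, up front.
paths = sorted(Path("paintings/").glob("*.jpg"))   # hypothetical collection
image_embeddings = model.encode([Image.open(p) for p in paths],
                                convert_to_tensor=True)

# Any query - a mood, a concept, or something entirely out-of-domain -
# still produces a ranking of the collection.
for query in ["wistful", "anticipation", "비빔밥"]:
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, image_embeddings, top_k=3)[0]
    print(query, [paths[hit["corpus_id"]].name for hit in hits])
```

Nothing in this sketch "understands" Korean or Scandinavian art; the query simply lands near whichever images happen to sit closest in the embedding space, which is exactly where the weirdness comes from.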

Computational linguist Emily Bender (of Stochastic Parrots fame) wrote a couple of weeks ago:

Setting things up so that you get "the answer" to your question cuts off the user's ability to do the sense-making that is critical to information literacy. That sense-making includes refining the question, understanding how different sources speak to the question, and locating each source within the information landscape.

Information literacy and chatbots as search - Emily Bender

Her brief post neatly sums up many of my concerns about how the utility of AI has been framed in public discourse, particularly when it comes to information-seeking. Using machine vision and tools like CLIP to do the same things "more efficiently" isn't merely a dubious proposal given the limitations of the models. It also feels to me like a misunderstanding of what these tools offer us. After two days of hearing what various GLAM institutions are doing with AI, and talking to people in the breaks, it became clearer to me that the reason I've been so depressed about "AI" is that Silicon Valley boosters and university Vice Chancellors have profoundly limited imaginations. Nobody speaking at Fantastic Futures thought the machines were going to take our jobs. They all had a clear sense of the very real limitations of these technologies, and the dangers of embracing them uncritically. Yet there was also a feeling that machine learning and language models open up a range of new possibilities for GLAM.

Collection discovery: ¿Por qué no los dos?

Thinking about the NLA and Stanford experiments, it seems to me that these demonstrate completely new capabilities for search and discovery. Perhaps it's too much of a stretch, but we might now bring in something Kirsten Thorpe often speaks about: the language of interacting with GLAM collections is the language of colonisation. GLAM catalogue interfaces often invite us to "explore" and "discover" the collections. How do we start to think about better language than "discovery"? Bender might have an answer, with her language of "sense-making". What would happen if instead of trying to use these tools to do the same things "more efficiently", GLAM institutions used them to enhance existing paradigms with new ways of making sense of our collections? Commercial search tools have interests that push them towards presenting "answers". The frustration that many people feel towards Google and other search tools is not just that the web is now full of AI slop and The Algorithm seems to prefer it. It's also that what we in libraries call "known term" or, in the context of publications, "known title" searching is completely broken in present-day web search. Sometimes you don't want to search "related terms" – you want exactly the thing you searched for, with exactly the spelling you used.

In GLAMs we can and should refuse to choose between these types of sense-making within our collections. People should be able to ask of our collections "Do you have this exact thing?", "Show me everything about this category of X without any other kind of X", and "What do you have that has this vibe?". Traditional cataloguing and classification are extremely powerful for identifying everything in a collection that fits a certain category, and centuries of experience and tooling have given us very sophisticated ways to draw fine distinctions and disambiguate terms. In an age of "people who liked X also liked Y", and web search engines that reinforce power laws of popularity, it's sometimes easy to forget this. But there are limits and trade-offs. Traditional cataloguing can't nail down a vibe in any meaningful way. Consider a concept like "emu". A typical GLAM cataloguing system, based on controlled vocabularies and conceptual hierarchies, might return a bunch of search results specifically about the flightless bird, uses of its feathers and eggs, and perhaps something about The Emu War, when the Royal Australian Artillery was forced to retreat in the face of a stronger force. But in many Dreaming stories, as recounted by Tyson Yunkaporta, Emu represents a narcissistic streak in our natures, making trouble and refusing to work with others. If our cataloguing standards were based on Indigenous Australian worldviews, perhaps these attributes would be reflected in a search for "emu" – but given the right training data, might a CLIP-type AI model come up with similar kinds of associations?

The ability for AI tools to surface associations and correlations from their training data is one of the most powerful things about them. But it's also the most troublesome. The very same phenomenon that allows us to search an art collection by mood also tells US parole officers that Black prisoners are more likely to re-offend, tells recruiters that women are less likely to succeed in the role, and made it impossible for Safiya Umoja Noble to find anything that wasn't porn when she googled for fun things for "black girls" to do.

When you train your models on racist inputs, it's hardly surprising that they produce racist outputs. One way to try to deal with this is to paper over the cracks and hope nobody notices, like Google did in their notorious "Woke Nazis" incident. This is a losing proposition: you can't balance out the patterns the machine sees by trying to anticipate them and alter the output ahead of time. Indeed, I would argue that this is only even an issue because Google and others are operating on the model Emily Bender warned about. If you think there is only ever one "answer", one Truth, and the purpose of your tool is to provide it, then such absurdities are guaranteed.

I was surprised to hear both Kirsten Thorpe (Jumbunna Institute, UTS) and Honiana Love (Ngā Taonga) express hope that using AI in creating GLAM metadata may help reduce bias and support Indigenous cultural re-invigoration. But thinking about it more afterwards, perhaps this is what they were getting at. If instead of "discovering" answers, we think about what we do as helping users take a "sense-making" journey, then we open up new ways to use machines to help us to think and see. And if – for good reason – you don't trust what an institution like a state archive is telling you explicitly, it might make sense to ask what less obvious patterns appear in the archive.

Machine vision doesn't simply show us the same thing faster. More than anything else, it shows us the seams, flaws and inadvertent correlations in the data we feed it. It shows us what was always there, but we didn't see. Sometimes that's because the connection was too subtle for us. Sometimes it's because we didn't want to see it. Either way it seems to me that using this as an additional way to understand GLAM collections can be useful. Not because it will make anything more efficient, or even reveal hidden truths, but rather because it shows us another way to understand what our collections say, and perhaps what they do not.