
Vision RAG tool uses LLaVA to increase cultural accuracy in images

Read Time 8 mins | Written by: Cole


Over 100 journalists, software engineers, and product designers gathered for a hackathon at Columbia University to work on AI in journalism. Hacks/Hackers and the Brown Institute for Media Innovation at Columbia brought these journalists and technologists together to see what they could build with open source AI.

One team built a retrieval-augmented generation (RAG) tool for vision that uses Google Street View images to solve a big problem in AI image generation – cultural accuracy.

If you ask a text-to-image model to “make me an image of a market in Morocco,” you’ll likely get an image with caricatured architectural styles, clothing, and people that aren’t representative of the actual location – like the image below.

Vision RAG morocco accuracy image example

Notice that the prompted image above is a fantastical version of the actual locality. Here’s Judd Smith, Sr. Product Designer and M.S. Computational Design Practices ’25 candidate at Columbia GSAPP, on why that’s a big problem for journalists and society as a whole.

“The reality is these open source datasets are not magic and have not indexed the whole of human knowledge or our physical world, only that which is disproportionately submitted by users in the Global North.

As spatial practitioners, our utmost concern first and foremost is for the representation of our natural world and how people interact with these representations, whether they be organically generated (through photographic means) or synthetically generated (through text-to-image models).

If the rise of TTI models has experts projecting that by 2026 over 90% of the images we see online could be synthetically generated, what does that mean for the preservation of diversity in localities and cultural nuances? Will we let paradigms, tropes, and a collective imagination favoring a Western (predominantly European) perspective represent our natural world even when we're representing places far outside of that frame of reference?

Can this increase the speed of cultural erasure by what's not found in these representations?”

Virginia Zangs and Judd Smith wanted to solve this problem and see if they could get more accurate images of real places from generative AI. They teamed up with Zachary B. from Codingscape to build a solution.

Together, they built a vision RAG tool using the open source LLaVA and SDXL-Turbo models from Hugging Face. Connected to the Google Maps API, they were able to use Street View images to generate more accurate images of real-life locations. And they did all of this without fine-tuning a vision model.

Here’s what they built over the weekend hackathon.

RAG vision pipeline with prompt generation

Vision RAG prompt generation tool flow

The pipeline uses a structured approach to prompt generation – including prompt guardrails that steer outputs toward realistic imagery and away from things like dystopian themes.
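The write-up doesn't include the team's guardrail text, so here's a minimal illustrative sketch of the idea: a fixed instruction block prepended to every generated prompt. The wording and helper name below are assumptions, not the team's code.

```python
# Illustrative guardrails, not the team's actual wording: fixed instructions
# prepended to every generated prompt to steer the model toward realism.
PROMPT_GUARDRAILS = (
    "A realistic, present-day photograph of the location. "
    "Stay faithful to the described architecture, clothing, and street life. "
    "No dystopian, fantastical, or caricatured elements."
)

def apply_guardrails(scene_description: str) -> str:
    """Prepend the fixed guardrail instructions to a generated scene description."""
    return f"{PROMPT_GUARDRAILS} {scene_description}"
```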

Google Street View images of the chosen location are sent to LLaVA to describe them visually. Those attributes are used to create an optimized prompt that SDXL-Turbo turns into a more accurate final image of the location.

Here are the basic steps:

  • Enter a location and visual prompt details (e.g. weather, time of day, architecture style, clothing style).
  • Street View images of that location are pulled from the Google Maps API (a fetch sketch follows this list).
  • LLaVA examines the images and writes an optimized text prompt describing their visual elements.
  • The optimized prompt is sent to SDXL-Turbo, which generates the final image with more accurate context.
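The hackathon code isn't published, but the second step maps onto Google's Street View Static API. Here's a hedged sketch: the endpoint and request parameters are Google's, while the helper name and API key are placeholders.

```python
import requests

# Google's Street View Static API endpoint (the "Google Maps API" step above).
STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"

def fetch_street_view_images(location: str, api_key: str,
                             headings=(0, 90, 180, 270)) -> list[bytes]:
    """Pull one frame per compass heading so LLaVA sees the scene from several angles."""
    images = []
    for heading in headings:
        resp = requests.get(STREETVIEW_URL, params={
            "size": "640x640",     # maximum size on the standard tier
            "location": location,  # a street address or "lat,lng" string
            "heading": heading,    # camera compass direction in degrees
            "fov": 90,             # horizontal field of view
            "key": api_key,        # placeholder: your Google Maps API key
        }, timeout=30)
        resp.raise_for_status()
        images.append(resp.content)  # raw JPEG bytes
    return images
```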

Here’s a snapshot of how the pipeline functions using NYC as the location.

Optimized prompt to more accurate image of NYC

Vision RAG prompt for NYC real place images

In the flow above you can see the whole process – from location to the final augmented image. 

NYC is chosen as the location and Google Street View images of NYC are pulled from the Google Maps API. LLaVA is prompted to describe the location visually based on the Street View images.

Vision RAG LLaVA prompting
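That image-to-text step can be reproduced with the open source LLaVA checkpoints on Hugging Face. A minimal sketch, assuming the llava-hf/llava-1.5-7b-hf checkpoint and an illustrative prompt (the team's exact model size and wording aren't specified):

```python
import io
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the 7B LLaVA 1.5 checkpoint; the team's exact checkpoint isn't stated.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def describe_street_view(image_bytes: bytes) -> str:
    """Ask LLaVA to describe one Street View frame in the attributes the pipeline needs."""
    image = Image.open(io.BytesIO(image_bytes))
    # LLaVA 1.5 expects the USER/ASSISTANT chat format with an <image> token.
    prompt = (
        "USER: <image>\nDescribe this street scene: architecture style, clothing, "
        "weather, time of day, landmarks, and notable objects. ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()  # keep only the model's answer
```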

The visual description that LLaVA creates is used to generate an optimized text prompt to give SDXL-Turbo more detail for the final image output. 

Vision RAG image-to-text prompt for SDXL turbo

SDXL-Turbo turns the optimized prompt into a more accurate image of NYC that includes details from the Street View images.

Vision RAG final image NYC
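The generation step follows the standard diffusers recipe for SDXL-Turbo. A minimal sketch, with optimized_prompt standing in for LLaVA's output:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SDXL-Turbo from Hugging Face; fp16 keeps it within a single consumer GPU.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Placeholder: in the pipeline this is the LLaVA-derived text from the previous step.
optimized_prompt = "A realistic daytime photo of a Manhattan street ..."

image = pipe(
    prompt=optimized_prompt,
    num_inference_steps=1,  # SDXL-Turbo is distilled for single-step sampling
    guidance_scale=0.0,     # the model card disables classifier-free guidance
).images[0]
image.save("nyc_generated.png")
```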

Create a more accurate image of Paris

Here’s a more compact look at generating an image of Paris.

Vision RAG tool more accurate image of paris

The image on the left is the first output generated with Paris as the location plus other important visual attributes like:

  • Time of day: Daytime 
  • Weather: Sunny 
  • Apparent region: Urban 
  • Architecture style: Neoclassical 
  • Clothing style: Modern 
  • Identified landmarks: None 
  • Notable objects: Buildings, street, fence, trees, sidewalk, streetlights, vehicles, pedestrians, balcony, windows, doorway, street sign, metal structure, staircase, sky.

You can see the Street View images as a composite in the middle and the final image on the right is the result of the optimized prompt.

It looks more like Paris than the initial image – notice the streets and architecture have more details from the actual location.
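To make the Paris example concrete, here's one hypothetical way attributes like those listed above could be assembled into the optimized prompt. The template and field names are assumptions, not the team's code:

```python
# Hypothetical prompt assembly; the fields mirror the attribute list above.
attributes = {
    "location": "Paris",
    "time_of_day": "daytime",
    "weather": "sunny",
    "region": "urban",
    "architecture": "Neoclassical",
    "clothing": "modern",
    "objects": "buildings, trees, sidewalks, streetlights, vehicles, pedestrians",
}

optimized_prompt = (
    f"A realistic {attributes['time_of_day']} photo of an {attributes['region']} "
    f"street in {attributes['location']}: {attributes['weather']} weather, "
    f"{attributes['architecture']} architecture, people in {attributes['clothing']} "
    f"clothing, featuring {attributes['objects']}."
)
print(optimized_prompt)
```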

How the Vision RAG user interface could look

Vision RAG tool user interface

The team also started prototyping a user interface. Here, you can choose a location, describe the scene visually, and get an instant view of the final output.

Vision RAG for location accuracy tech stack

Several publicly available APIs and open source tools made this all possible:

  • LLaVA model: open source vision model for image-to-text
  • SDXL-Turbo model: open source text-to-image model for synthetic image generation
  • Google Maps API: precise location data and access to Street View images
  • Google Street View and Satellite imagery: visual context to ground the generated images

 

Why is AI image accuracy especially important in journalism? 

Let’s take a moment to reinforce why cultural accuracy is important for AI in journalism. AI image generators constantly misrepresent cultures and locations – which is especially harmful if these tools are used in journalism.

“A dataset is no different than a map, a photograph, or a piece of art: what is not included within the encoded information is just as important as what has been included within the frame, whatever type of information (be it visual, textual, or symbolic) you are working with,” says Judd Smith, Sr. Product Designer and Columbia GSAPP student.

“Photographic representation, like all of these things, is a careful act of framing something slightly out of context. If AI-generated imagery became an accepted way of representing data, places, concepts, and people at scale in a "trusted source" context like journalism, I think our perception of trust would continue to move in the wrong direction, and that could become a breeding ground for undesired consequences.”

Trust is one of the biggest issues in generative AI. And there’s a long list of reasons why AI datasets can’t be trusted for cultural accuracy in journalism:

  1. Ethnic diversity – AI models consistently overrepresent white people, even in countries where they're not the majority.
  2. Architectural styles – Generic or Western-style buildings often replace distinct local architecture.
  3. Cultural clothing and symbols – These models often default to Western attire and symbols, erasing local cultural elements.
  4. Environmental context – AI frequently mismatches environments, like putting tropical features in arctic settings.
  5. Social interactions and customs – Local customs are often replaced with Western norms, misrepresenting how people actually interact.
  6. Linguistic and textual representation – AI tends to use English or nonsensical text instead of local languages.
  7. Socioeconomic diversity – These tools often skew towards either overly wealthy or overly impoverished depictions.

If we're using AI in journalism or other influential media, we need to be extremely cautious and fix these problems. These misrepresentations reinforce prejudices, erase cultural identities, and spread misinformation. 

Judd continues, “I think at the core of all this, generative AI tools – whether prompt-based or text-to-image generation – are tipping the scales on where we situate trust digitally, and at what scales. Trust is increasingly being deferred to an algorithm through the sheer velocity and scale of these datasets and their user bases. That trust now originates from synthetic generation rather than from a person we're asked to trust speaks volumes about where we have moved as a society, but I think there is a shift in the scales of trust we prioritize.

This question of trust and data undergirds all of our research and efforts we do, and we're thankful for people like Laura Kurgan, Director of the M.S. Computational Design Practices program, who have fought to create spaces for thought like this to develop and disperse. 

There are many things computational systems can do and few they cannot. However, understanding the scale of their potential usage and its consequences, we must continue to ask "what tasks are right to offload to a system?" in both subjective and objective use cases.

Our human experiences and a computer's efficiency do not always align no matter what tech leaders might tell you!”

Want to learn more about AI in Journalism? 

You can read more about the Hugging Face and Codingscape sponsored Hacks/Hackers Hackathon for AI in Journalism at Columbia University here.

Cole

Cole is Codingscape's Content Marketing Strategist & Copywriter.