Yesterday, I wanted to play around with some AI image-understanding tools, and I kept hearing about SAM and LLaVA. So I decided to give them a shot. I’m no expert, but I figured, why not?
First, I needed to get my hands on the models, so my starting point was downloading stuff. I had to do some digging to find where to actually download SAM (Segment Anything Model) and LLaVA (Large Language and Vision Assistant). It wasn’t super straightforward; there are a lot of similar-sounding projects out there.
Getting SAM Ready
After some struggles, I got everything I needed and started with SAM. I already had a basic Python setup, nothing fancy. I realized I needed a specific version of PyTorch, so I went ahead and did this:
pip install torch==2.2.2 torchvision==0.17.2
Then, I had to grab the SAM model checkpoint itself and download it to my machine:
wget
I put the checkpoint file in a directory I named ‘models’. Stay organized, right?
Running the basic example code for SAM was pretty simple: I just pointed it at an image, and it tried to segment everything it could find. It was cool to see it highlight different objects, even if it wasn’t always perfect.
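For reference, the “segment everything” script I ended up with looked roughly like this. It’s a minimal sketch, assuming the segment-anything package is installed; the checkpoint filename is the one from the official ViT-H release, and the image path and the area threshold are my own placeholders:

```python
from typing import Dict, List


def keep_large_masks(masks: List[Dict], min_area: int = 500) -> List[Dict]:
    """SAM's generator returns one dict per mask; 'area' is its pixel count.
    Tiny masks are mostly noise, so drop anything below the threshold."""
    return [m for m in masks if m["area"] >= min_area]


def segment_everything(image_path: str,
                       checkpoint: str = "models/sam_vit_h_4b8939.pth") -> List[Dict]:
    """Run SAM's automatic mask generator over a whole image.
    Needs the segment-anything package, OpenCV, and the downloaded checkpoint."""
    import cv2
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    masks = SamAutomaticMaskGenerator(sam).generate(image)
    return keep_large_masks(masks)
```

Each mask dict also carries a `bbox` and a `segmentation` array, which is what makes the later combination with LLaVA even thinkable.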
Bringing in LLaVA
Next up was LLaVA. This one was trickier; it felt like there were more moving parts. I followed some more instructions, cloning the LLaVA repository and installing its dependencies:
git clone
pip install -e .
That covered the installation and settings. Then came the weights, and they were HUGE! Several gigabytes, so the download took a while. I had to get a specific “delta” weights file and then “apply” it to a base Vicuna model. Honestly, I’m still not 100% sure what all that means, but I followed the steps and it seemed to work.
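From what I can tell, the “delta” idea is simpler than it sounds: the released file holds the difference between the full LLaVA weights and the base Vicuna weights, and “applying” it just adds that difference back, tensor by tensor. The real step is done by a script in the LLaVA repo; this toy sketch (plain Python lists standing in for weight tensors) is only meant to show the concept:

```python
from typing import Dict, List


def apply_delta(base: Dict[str, List[float]],
                delta: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Recover full weights by adding delta[name] to base[name]
    element-wise for every tensor name in the base model."""
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }
```

Distributing a delta instead of the full model was reportedly a licensing workaround: you bring your own base weights, and the delta alone is useless without them.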
Putting them Together (or Trying To)
This is where I started to get a little lost. I wanted to see if I could use SAM to identify objects and then have LLaVA describe them. I found some examples online, but they were pretty rough and needed a lot of coding.
I managed to get SAM to generate masks for an image, but I couldn’t cleanly feed them into LLaVA. It’s not easy, and I can only run everything on my local machine.
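For the record, here is the rough shape of the glue code I was attempting. The function names are my own sketch, and the LLaVA call is left as a stub (you pass in whatever captioning function you manage to wire up); the only concrete piece is the cropping, since SAM reports each mask’s bbox as [x, y, width, height]:

```python
from typing import Callable, List, Sequence


def crop_bbox(image: Sequence[Sequence], bbox: List[int]) -> List[List]:
    """Crop an [x, y, w, h] box out of a row-major image (a list of pixel rows)."""
    x, y, w, h = bbox
    return [list(row[x:x + w]) for row in image[y:y + h]]


def describe_objects(image: Sequence[Sequence],
                     masks: List[dict],
                     describe: Callable[[List[List]], str]) -> List[str]:
    """For each SAM mask dict, crop out its bounding box and ask the
    captioner (e.g. a LLaVA wrapper prompted with 'Describe this object.')
    to describe the crop."""
    return [describe(crop_bbox(image, m["bbox"])) for m in masks]
```

With a real LLaVA wrapper plugged in as `describe`, this is the SAM-finds-it, LLaVA-names-it loop I was after; with a dummy function, it at least lets you check the cropping logic without loading either model.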

My Takeaways (So Far)
- It’s messy! There’s a lot of setup involved, and it’s easy to get confused by different versions and dependencies.
- It’s powerful! Even with my limited understanding, I could see the potential of these models.
- It’s a work in progress. I definitely need to spend more time learning how to properly combine these tools.
So, that’s my “sam and luca” adventure for now. It’s more like “SAM and LLaVA,” but hey, I’m learning. I’m going to keep playing around with this stuff and see what I can create. Maybe next time, I’ll have a proper demo to show off!