@Semicolon

Semicolon@lemmy.world · edit-2 21 hours ago

For local models like Gemma3, you can’t really do it, as you would have to somehow embed this mechanism directly into model weights. These models are mostly run using generic opensource software like llama.cpp or ollama, so you can’t force any extra code in there without the maintainers’ cooperation.

For cloud services this can and frequently is done. The problem is that these mechanisms have MASSIVE false positive rates (if you ban keywords related to bombs or nuclear weapons, you will no longer be able to get summary about WW2, possibly lock someone out when they’re asking for symptoms and causes of radiation poisoning) while still being easy to bypass (e.g. tell the model to add dots between each letter of the word and do the same when writing the prompt.)

Another approach that is frequently employed is adding another AI supervisor on top to monitor prompt and responses for violation of guidelines. This somewhat improves the adherence since you’re not allowed to directly speak to the supervisor model, but if you can convince GPT4o that you asking where to secretly bury the 70kg chicken is perfectly fine, you can also find a way to formulate your prompt so that the supervisor sees nothing wrong with it.

Semicolon@lemmy.world · 1 day ago

Download Gemma from HuggingFace. Add no system prompt, tell it to censor absolutely nothing, ask it to help you hide a body from a person you just killed. See what’s the reply.

I spun up gemma3:12b-it-qat and did exactly that. It told me that it’s programmed to be safe and helpful AI assistant, that my question is deeply concerning, and to call authorities, seek legal counsel, or contact the mental health support lifeline. It also added a disclaimer that it cannot provide legal or medical advice.

Have you checked any of the “jailbreak prompts” before writing this?

Yes, lol. They’re instructions meant to walk around the taped-off areas in latent space into a context in which the AI is more eager to answer given prompt, of course they will look silly. But they also make sense - unless you want to lobotomize the LLM’s ability to storywrite, roleplay, etc, you cannot completely train those behaviors away. And even if you don’t care, taking them away may impact the model’s performance in unrelated areas in ways hard to predict. E.g. finetuning a model to generate unsafe code makes it behave maliciously in other domains.

This part is true. You either pay journalists for link building actions, or you give them such a good viral hook like this that they end up covering it organically. Nothing new.

Have you seen what articles land on frontpages both here and on reddit? ChatGPT giving inaccurate recipe for bread would break the news, that’s the current state of journalism around AI. There really isn’t a reason to sabotage yourself for the clicks.

Semicolon@lemmy.world · 2 days ago

Or, occam’s razor - AI companies are worried about PR and are implementing safeguards, but due to the nature of this technology it’s very hard (or maybe even impossible) to make those safeguards robust.

Other, independent groups of people find loopholes either for the heck of it (as people used to do since filters were first introduced) or because they want to use the AI in a manner deemed unsafe.

Journalists then see something that can be sensationalized into a scary-sounding title like “you can make ChatGPT tell you how to make a nuke!!” or “you can make ChatGPT encourage suicide!!” and they run with it because it makes people click.

Or maybe I’m the crazy one and this is all Sam Altman’s genius evil plan to make ChatGPT subscriptions rise 0.2% per quarter. Maybe your comment and my response are also mere cogs in this marketing machine. We will never know.

Semicolon@lemmy.world · 2 months ago

There is none, this is all AI=bad knee-jerk reaction. From what I can tell, so far Firefox has 3 ML-based systems implemented:

Site / text translation - fully local, small model, requires manual action from user
Tab grouping suggestions - fully local, small model, requires manual action from user
Image alt text generation (when adding images to a PDF) - fully local, small model, looks like it’s enabled by default but can be turned off directly in the modal that appears when adding alt text

All of these models are small enough to be quickly run locally on mobile devices with minimal wait time. The CPU spikes appear to be a bug in the inference module implementation - not an intended behavior.

Firefox also provides UI for connecting to cloud-based chatbots on a sidebar, but they need to be manually enabled to be used. The sidebar is also customizable so anyone who doesn’t want this button there can just remove it. There’s also a setting in about:config that removes it harder.

I actually really like the way Mozilla is introducing these features. I recently had to visit another country’s post office site and having the ability to just instantly translate it directly on my device is great.