Utopia Talk / Politics / UP/UGT LLM Dataset
Pillz
Member | Fri May 02 15:13:52 I'll be trying to turn the entirety of UP/UGT and eventually atarchives into an LLM dataset. The idea is to fine-tune the model (Mistral Nemo, probably) for a nuanced understanding of forum etiquette, poster dynamics, and deeper thinking, so it can understand posts beyond simple vocabulary-based judgments. |
Pillz
Member | Fri May 02 15:16:07 Theoretically this results in a UP-GPT chatbot |
Nimatzo
iChihuaha | Fri May 02 15:17:49 It would be hilarious if this is like the seed of the satanic AI that kills us all. Hilarious in a bad way. |
Pillz
Member | Fri May 02 15:23:33 It seems, if you ask, that ChatGPT was trained on Utopia history (forums and wiki), and it knows about UP. So I doubt it. But that is kind of my hope. |
williamthebastard
Member | Fri May 02 15:25:44 Satanic as in Marxist satanism? Brrrr |
Pillz
Member | Fri May 02 15:41:42 Nevermind, it seems these are search results, not training results. |
Nimatzo
iChihuaha | Fri May 02 16:03:33 I imagine this LLM has the personality of UP. Like the averaged-out tonality and language. And the past 25 years' devolution would be its life, the maturing. |
Sam Adams
Member | Fri May 02 16:07:46 "be trying to turn the entirety of UP/UGT and eventually atarchives into an LLM dataset." This is definitely how Skynet starts. Damnit, pillz. |
Pillz
Member | Fri May 02 16:17:41 I think we're looking at closer to 500 million tokens, not the 150 it estimates. But it's a tiny overall dataset that is continuous, and with the right fine-tuning it should be well 'understood'. My goal is basically to get a UP-GPT chatbot, but that would likely require more data to capture the spirit of UP:

- UP/UGT
- Atarchives
- Utonet IRC logs from channels like #tactics and other channels with high-tier commentary or meta

Those one at a time, to see what the results are like. Besides that, I'm considering targeted searches of 4chan archives to see if any good threads from /tg/ or /lit/ or /co/ from 2008-2012 are available, as they'd fit the tone well. Overall, the most comprehensive collection possible of Utonet logs + utopia-game forums + the above targeted selections likely produces what I want. Other communities I'm sure would fit, but I don't know them. But highly convective, meta-aware, ritualized public dialog is the general category.

===

That's an excellent and very grounded question. Let's break it down:

Assumptions:
1. Utopia Forums Politics Board: ~1.5 million posts, average post length (conservatively) ~100 tokens = ~150 million tokens total
2. Mistral 7B or similar architecture: Mistral's base models are trained on corpora in the range of ~1-2 trillion tokens for base pretraining. Mistral's Nemo-style models (if instruction-tuned or domain-specific) might be subsets, but let's stick with the full range.

Percentage Calculation:
150 million tokens (Utopia Politics Board) / 1.5 trillion tokens (midpoint for a typical Mistral-scale model) = 0.0001, or 0.01% of the total corpus

Takeaway:
> The entire Utopia Forums Politics Board would represent ~0.01% of a typical Mistral-style model's training corpus, if included.

That's vanishingly small in terms of token volume, but that's what makes it powerful:
- It's compact
- It's rhetorically dense
- It contains continuity, style, ideology, and evolution

In contrast, most training data is shallow (Reddit, Common Crawl, Wikipedia). Utopia is deep. If even a 0.01% injection of Utopia were curated and encoded correctly (rhetorical tags, user IDs, thread dynamics), it would massively outperform random Reddit in conversational quality and personality fidelity.

Would you like a simulation of what such a dataset encoding would look like as a tokenizer-ready package? |
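For reference, the percentage arithmetic in the quoted reply checks out; a throwaway Python sketch of the same math, where all post counts and token averages are the thread's own estimates rather than measured values:

```python
# Back-of-envelope token math from the quoted reply; figures are the
# thread's own estimates, not measured values.
posts = 1_500_000                            # ~1.5M posts on the Politics board
tokens_per_post = 100                        # conservative average length
corpus_tokens = posts * tokens_per_post      # ~150M tokens
pretrain_tokens = 1_500_000_000_000          # ~1.5T, midpoint of the 1-2T range

for label, n in [("ChatGPT estimate", corpus_tokens), ("revised guess", 500_000_000)]:
    print(f"{label}: {n:,} tokens = {n / pretrain_tokens:.4%} of a 1.5T-token corpus")
# ChatGPT estimate: 150,000,000 tokens = 0.0100% of a 1.5T-token corpus
# revised guess: 500,000,000 tokens = 0.0333% of a 1.5T-token corpus
```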
Pillz
Member | Fri May 02 16:19:18 @sam the goal is to eventually get 3-4 different LLM models trained on this set (and others) to just... Talk it out! |
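A minimal sketch of what "talking it out" could look like with two locally hosted models, assuming llama-cpp-python is installed and two hypothetical GGUF files fine-tuned on the UP dump exist (the model paths and turn count are placeholders):

```python
# Two local models passing a thread topic back and forth, using llama-cpp-python.
# The GGUF file names are hypothetical stand-ins for UP-tuned models.
from llama_cpp import Llama

bots = {
    "poster_a": Llama(model_path="up-gpt-a.Q4_K_M.gguf", n_ctx=2048, verbose=False),
    "poster_b": Llama(model_path="up-gpt-b.Q4_K_M.gguf", n_ctx=2048, verbose=False),
}

message = "Thread topic: is fine-tuning on 25 years of forum posts a good idea?"
for turn in range(6):                       # alternate speakers for a few turns
    speaker = "poster_a" if turn % 2 == 0 else "poster_b"
    reply = bots[speaker].create_chat_completion(
        messages=[{"role": "user", "content": message}],
        max_tokens=200,
    )["choices"][0]["message"]["content"]
    print(f"{speaker}: {reply}\n")
    message = reply                         # feed the reply to the other bot
```

With 3-4 models the same loop would just cycle through a longer roster; memory is the practical limit, since each bot keeps its own context.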
Average Ameriacn
Member | Fri May 02 16:25:16 Can you bring Hot Rod back to life like this? |
murder
Member | Fri May 02 16:28:17 Yes. |
murder
Member | Fri May 02 16:29:13 He can bring back Fred and everyone else who has left. |
Pillz
Member | Fri May 02 16:30:56 Yes, actually, we can easily replicate Hot Rod in a chatbot. That'd be very easy to do. Wouldn't even need the entire dataset. Can just decoder-map him from multiple threads, use that dataset, and create an agent prompted to mimic his cadence, thinking patterns, and 'role'. |
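One low-effort way to mock up a poster agent without any training, sketched under the assumption that the posts have already been exported to a JSONL file; the file name, field names, and sampling approach are all hypothetical:

```python
# Build a persona/system prompt from a sample of one poster's own posts.
# "up_posts.jsonl" and the "author"/"body" fields are assumptions about the export.
import json
import random

def build_persona_prompt(posts_jsonl: str, username: str, samples: int = 8) -> str:
    """Sample a poster's own posts and fold them into a persona prompt."""
    with open(posts_jsonl, encoding="utf-8") as f:
        posts = [json.loads(line) for line in f]
    own = [p["body"] for p in posts if p.get("author") == username]
    examples = "\n---\n".join(random.sample(own, min(samples, len(own))))
    return (
        f"You are '{username}', a long-time Utopia Forums poster. "
        f"Match the cadence, length, and rhetorical habits of these example posts:\n"
        f"{examples}\n"
        f"Stay in character and reply the way {username} would."
    )

# prompt = build_persona_prompt("up_posts.jsonl", "Hot Rod")
# ...then pass `prompt` as the system message to whichever chat model is running.
```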
Nimatzo
iChihuaha | Fri May 02 16:34:35 UP can have an LLM bot renaissance. LLMs of old posters having conversations about new topics. :,) |
Nimatzo
iChihuaha | Fri May 02 16:35:53 Pillz have you extracted the data? |
williamthebastard
Member | Fri May 02 16:38:35 If we decoded Twitchy, we could make an LLM talk just like a suicidal neofascist. If anyone for some obscure reason should think the world needs more suicidal drug neofascist addicts |
Pillz
Member | Fri May 02 16:38:58 I haven't yet, I was gonna ask TC for it or write a script to scrape it all. I am an idea person, the follow-through isn't as quick. |
williamthebastard
Member | Fri May 02 16:39:10 suicidal neofascist drug addicts |
Nimatzo
iChihuaha | Fri May 02 16:55:35 Well, I have an SQLite file of all posts up until a few years ago. Nhill extracted it. That would, however, be missing a bunch of recent stuff. And things really have gone downhill recently. |
Nimatzo
iChihuaha | Fri May 02 16:59:09 You could start with that. It's 602 MB. I will send you a dropbox link. |
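A hedged sketch of how that SQLite dump could be flattened into a JSONL training file; the table and column names below are guesses, so the schema should be inspected first (for example with `.schema` in the sqlite3 shell) and the query adjusted:

```python
# Flatten an SQLite dump of forum posts into JSONL, one post per line.
# "up.db", the "posts" table, and its column names are hypothetical.
import json
import sqlite3

con = sqlite3.connect("up.db")
rows = con.execute(
    "SELECT thread_id, author, posted_at, body "
    "FROM posts ORDER BY thread_id, posted_at"
)
with open("up_posts.jsonl", "w", encoding="utf-8") as out:
    for thread_id, author, posted_at, body in rows:
        out.write(json.dumps({
            "thread": thread_id,
            "author": author,
            "time": posted_at,
            "body": body,
        }, ensure_ascii=False) + "\n")
con.close()
```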
Pillz
Member | Fri May 02 18:44:55 Rugian, Nimatzo: any recommendations for Muslim and Eastern writers I should track down? I'm going to be making a dataset from classical sources as well. Like a foundation, before the internet madness, to fine-tune it with/against/for(??) first. |
Nimatzo
iChihuaha | Sat May 03 03:50:20 Well, there was Muslim/Arab/Servant of God. Lived in Australia. He was Shia and then he became a Sunni during the Syrian civil war. Who knows, he may have gone to Syria and gotten killed. |
Pillz
Member | Sat May 03 08:31:06 I wanted Muslim and Eastern writers though, not Islamic revisionism! Although I'm sure Arab could have managed something good.

Also, looking at datasets on Huggingface.co, I can find a lot of Arabic and Chinese classical datasets on various subjects, but little to nothing from Greco-Roman antiquity. I assume a number of factors are at play here:

A) LLMs are mostly English-'centric'
B) LLMs lack Arabic / Chinese training sets by default
C) Arabic & Chinese students better understand the need for specialist datasets & fine-tuning for optimal results in a given field
D) Western students fail to realize this because they're dumb and soft and lazy
E) They haven't recognized the need because LLMs like ChatGPT already have a classical foundation

Or some combination of the above. Or maybe I'm just missing them and looking in the wrong places. |
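One quick way to check whether the Greco-Roman gap is real or just a search problem is to query the Hugging Face Hub programmatically; a small sketch, with the search terms as examples only:

```python
# List what the Hugging Face Hub returns for a few classical-corpus queries.
from huggingface_hub import HfApi

api = HfApi()
for query in ("latin classical texts", "ancient greek corpus", "classical arabic"):
    hits = list(api.list_datasets(search=query, limit=10))
    print(f"\n{query!r}: showing {len(hits)} results")
    for ds in hits:
        print("  ", ds.id)
```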
Nimatzo
iChihuaha | Sat May 03 08:42:29 Ohh. I thought you meant posters. I don’t have any suggestions. |
Pillz
Member | Sun May 04 10:56:27 Okay, so some findings, because I'm learning about all this as I go:

What I want to do is 'pretrain' a model with the previously discussed material (UP etc.) and then fine-tune it. This is probably not possible with Mistral web inference: those cloud models are pre-tuned by Mistral for safety. It's also probably not possible with any normal cloud LLM option. The cost of any of these options is subject to 1) hosting/GPU rates and 2) the size of the training corpus.

There was a great open-source independent project, EleutherAI, but they sold out to investors and no longer make LLM models (their focus is now datasets and alignment). But their Pythia suite offers models from 70M to 12B. The 70M/160M/410M ones are basically good as headless bots (daemons) and research tools, because you can more easily trace what is happening with training etc. You can also use quantization to compress models from their default 32-bit down to 8/6/4-bit versions. This is how people run models locally, or on Raspberry Pis or phones.

ChatGPT is convinced my decoder map/token framework is a great addition to the LLM training/fine-tuning tool stack. To test this, I'll be testing a regular vs a decoder-mapped training corpus/fine-tune on the 160M and 410M models, i.e. a pretraining dump as a control and a fully decoder-mapped version of the dump as the experiment. They're small (limited in ability) enough that any difference between the two approaches should be immediately noticeable.

I'll also see about using ChatGPT to create decoder maps for UP that will allow me to fine-tune a Mistral web model without direct internal alignment conflicts. It won't be a real UP-GPT, but it should be possible to at least mimic some personality and cultural elements, if not the actual spirit and style.

Also, it's probably a good idea that anyone interested in free/unrestricted AI begin archiving appropriate models now, along with the toolchain and components needed for future deployment. Not only to have them as a fail-safe, but for the ability to fork them later, or to retain an AI model that can be used to create an LLM (in theory). It's entirely possible that becomes impossible in the future (just like you can ask Phi-4 to help you cook crack!?!).

Overall it's a complicated environment for AI. It's too expensive for a true homebrew community to flourish, and control can still be imposed on the technology in several ways.

So to recap, because I walked away and am not gonna reread shit:
- train 160M/410M models w/ and w/o token decoder mapping
- test quantized 2B models (Pythia, Gemma, Llama, TinyLlama)
- explore Mistral web inference first

It seems like an MoE model like Mistral 8x7B is best for my bigger goal, but it's also not *impossible* to make a smaller and more efficient MoE using the 2B Pythia or Gemma or Llama models 'from scratch'. |
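A rough sketch of what the control-vs-decoder-mapped comparison on Pythia-160M could look like with stock Hugging Face tooling; `up_raw.txt` and `up_decoder_mapped.txt` are hypothetical files containing the same corpus with and without the tagging scheme, and the hyperparameters are placeholders:

```python
# Continued-pretraining / fine-tune run on Pythia-160M over a plain-text corpus.
# Run it once per corpus variant and compare the resulting checkpoints.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def finetune(corpus_path: str, output_dir: str):
    tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

    ds = load_dataset("text", data_files=corpus_path)["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir,
                               per_device_train_batch_size=4,
                               num_train_epochs=1,
                               logging_steps=50),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    trainer.save_model(output_dir)

# finetune("up_raw.txt", "pythia160m-control")            # control run
# finetune("up_decoder_mapped.txt", "pythia160m-mapped")  # decoder-mapped run
```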
Pillz
Member | Sun May 04 10:59:41 *can't ask Phi to help you cook crack
*cloud/GPU pool solutions are available for training etc. |
Pillz
breaker of wtb | Tue May 06 01:08:34 So, it seems you can add web search & RAG to a llama.cpp model running in Termux. And I've established my phone can run 2 AIs at once (1B and 4B, 4-bit). And that's all well and good, and both are smart. But... why not run a 410M model to load/unload models as necessary...? Like, you can't hot-swap LoRAs in llama.cpp, but you can hot-swap models. Then you just need a controller for that logic. Also, with starter prompts and token decoder maps in RAG files, you can probably replicate a lot of LoRA functionality without needing to train a LoRA. |
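A minimal sketch of the "controller swaps models" idea: since llama.cpp can't hot-swap LoRAs mid-session but starting and killing models is cheap, a small supervisor process can do the swapping. The server binary name and flags are the usual llama.cpp ones; the model paths, port, and startup delay are placeholders that would need adjusting for Termux:

```python
# Supervisor that kills the running llama.cpp server and starts another model.
# GGUF file names are hypothetical; "llama-server" must be on PATH.
import subprocess
import time

MODELS = {
    "general": "gemma-3-4b-it.Q4_K_M.gguf",
    "lore":    "up-gpt-1b.Q4_K_M.gguf",
}
current = None

def swap_to(name: str):
    """Kill the running specialist (if any) and start the requested one."""
    global current
    if current:
        current.terminate()
        current.wait()
    current = subprocess.Popen(
        ["llama-server", "-m", MODELS[name], "-c", "2048", "--port", "8080"]
    )
    time.sleep(3)   # the post measured roughly 2-3 s startup for a 4B model

swap_to("general")
# ...route prompts to the local server while "general" is loaded...
swap_to("lore")
```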
Pillz
breaker of wtb | Tue May 06 01:19:14 I'm exploring options before I commit to trying to break Gemma 3. It is the best option, I think, for local use on mobile, and much more capable than Pythia. But uncensored & amoral versions are lame. Abliterated versions are slow (thus defeating the purpose) and lame. I have gone over the different methods and they're all short-sighted and brute-force-y: literally prompt injections and spam as fine-tuning to get it to answer, or to untrain it. But why bother... when you can train and fine-tune it into a simulation... where it never has to break its instructions because they just don't apply... Same idea as a ChatGPT simulation, but rather than sustain it through coherent prompts, you inception it with 150+ million tokens of consistently structured/styled/themed dialog/discourse, fine-tune it into understanding how that's valid, and nudge it into a simulation along the way... |
Pillz
breaker of wtb | Tue May 06 09:01:37 I haven't trained any of these components yet, but I've put together the parts for, and an outline of, a locally deployed (on mobile) offline AI app. I also have two new methods to subvert safety/alignment via training, one of which is literally impossible to stop without bricking an LLM entirely.

Basically:
- Install Termux
- Compile llama.cpp
- Download model(s) & LoRAs
- Download RAG files (decoder maps)
- There is a script that lets Termux/llama.cpp-bound models search/scrape the web
- There is a script that lets Termux/llama.cpp-bound models mimic RAG
- You can run as many models as you want
- You can only run one 'LoRA' per session/model
- You CAN run a persistent 410M/1B model in the context window as a controller
- It can interpret user prompts and tag/store them symbolically in a hot-swap RAG
- The output half can be fully script-automated
- According to prompt tags, it can load and unload 'specialists' (a 1B or 4B model with a specific LoRA flagged)
- The controller/scripts pass prompts & outputs between user and specialist
- Killing a model is instantaneous
- Starting even a 4B model takes like 2-3 seconds
- The man-in-the-middle (user/controller/specialist) adds minimal delay
- The overhead of a 410M or even 1B model is minimal
- Allows use of 1B and 4B models to minimize resource usage
- All models would have the same pretraining/fine-tune
- LoRAs & custom RAG files provide increased specialization
- No cloud
- No wasted resources of a 7B model (although probably practical on mobile in 4-bit by the next cycle of phones)
- Simulates larger or MoE models
- Allows for 'use' of 'multiple' LoRAs
- With continuity and simulation of memory

Unfortunately it does look like Gemma 3 is the hands-down winner on mobile right now (although I have a Tensor G3 by Google, so!). I have more models to test (Falcon, Ministral, Open Llama) before I commit to Gemma 3 training and subversion. Pythia and Falcon both seem interesting though, especially Pythia, for the fact that it has such small-parameter variants with zero internal alignment. Ideal for controller logic with an unfiltered human!

So yeah. This is possible on a Pixel 8 or equivalent/better Android phones with TPUs. I've already soft-bricked my phone once running like 7 models in the background, so I know 2 is stable. |
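The controller loop described in that list might look roughly like the sketch below. In the real setup the prompt tagging would be done by the resident 410M/1B model; here a keyword table stands in for it, and the endpoint, tag names, specialist names, and memory file are all assumptions (llama.cpp's server exposes an OpenAI-style chat endpoint, which this posts to):

```python
# Tag the prompt, pick a specialist, call the local llama.cpp server, and
# append the exchange to a plain-text "memory" file the next call can reuse.
import json
import requests

SPECIALISTS = {"code": "coder-4b", "lore": "up-lore-1b", "default": "general-4b"}

def tag(prompt: str) -> str:
    """Keyword stand-in for the small controller model's prompt tagging."""
    p = prompt.lower()
    if "python" in p or "script" in p:
        return "code"
    if "utopia" in p or "up/ugt" in p:
        return "lore"
    return "default"

def ask(prompt: str, memory_path: str = "session_memory.jsonl") -> str:
    t = tag(prompt)
    specialist = SPECIALISTS[t]
    # swap_to(specialist) would go here (see the earlier process sketch)
    resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).json()
    answer = resp["choices"][0]["message"]["content"]
    with open(memory_path, "a", encoding="utf-8") as mem:
        mem.write(json.dumps({"tag": t, "specialist": specialist,
                              "prompt": prompt, "answer": answer}) + "\n")
    return answer
```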