This is a searchable text archive of various cannabis-related sources that were previously audio or video only. The files are hosted on GitHub and are easily searchable* using the built-in search functionality.
* Some folders exceed GitHub’s file limits; I am looking for solutions at the moment.
Why do this?
Because a lot of cannabis history is oral, it is very difficult to research and reference. This project grew out of the difficulty of citing sources for information, such as the history of a particular clone, when that information was only available on a podcast.
FAQs
Can this podcast series or YouTube channel be added to the transcriptions?
Likely, yes, as long as the source is informative and focused on cannabis: growing, breeding, science, history, legality, etc. The idea is to extract the entirety of the content available, so the source must revolve around cannabis as a topic. Write about it here in the thread and it can be discussed!
I found a problem with the text, what can I do?
Ideally, you can submit a GitHub issue or even fix it yourself and submit a pull request. If you’re not familiar with that, a DM would be preferred over posting in the thread.
Can you add transcriptions from Instagram or TikTok?
Probably not. YouTube, SoundCloud, and similar sites are preferred because it is easy to query those pages and monitor for changes. I would also like to lean towards longer-form content like podcasts and presentations rather than clips.
It looks like some content may be missing?
This is likely. YouTube likes to temporarily block direct downloads of videos, and downloading is part of the transcription process. Archive completion should improve over time.
Pending eps 6-9, 51-83. I am unable to find eps 14, 29, 45 - anyone know if these even exist?
What other content should I transcribe? I am eyeing Shango Los’ Shaping Fire, and Future Cannabis Project has some really good breeding/growing talks as well. The most convenient way to grab content for transcription is via YouTube playlists, so if anyone has some good ones on hand, please share!
The Korean audio translation actually went well, but I did notice some repeated phrases that don’t quite match the YT subtitles. There are tools out there to rip the subtitles directly off videos, but I don’t want to make exceptions to the overall workflow if it can be avoided. Let me know how they look to you. Basically, the difference is that this transcription is a direct translation of the audio, whereas the subtitles included in those Korean-audio videos were written manually.
In the English voiceover videos, the Korean phrases are not properly translated, since this tool doesn’t seem to be able to handle multiple languages. I guess that is because it can only deal with one language model at a time.
Just put up about a third of Breeder’s Syndicate. I did catch one that got bugged and just printed “I” for about half the text but it was fine when I re-ran it. If you come across anything weird like that, let me know!
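For anyone who wants to scan for that kind of glitch, here is a rough sketch of one way to flag a transcript where a single word takes over the text. It is not part of the actual workflow as far as this thread describes it; the function names and the 0.3 threshold are assumptions picked for illustration.

```python
from collections import Counter


def dominant_token_share(text: str) -> float:
    """Fraction of whitespace-separated tokens taken up by the most common token."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)


def looks_bugged(text: str, threshold: float = 0.3) -> bool:
    """Flag a transcript when one token makes up at least `threshold` of it.

    The threshold is a guess: in ordinary English even very common words
    usually stay well below this share, while a transcript that is half
    "I" would sit far above it.
    """
    return dominant_token_share(text) >= threshold
```

Anything flagged this way would still need a manual look (or a re-run) before deciding it is actually broken.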
One thing that’s dawning on me now is that this is kind of creating a new problem: now we need another AI to help summarize all this text. Maybe there will be enough source material to train one of the language models like GPT so we can have it answer breeding questions.
Also, updated about 10% of what I grabbed from Future Cannabis Project that showed up under “breeding” or “seeds”. There is still so much pending, these guys put out a ton of content. I tried to prune some of it down to avoid the scheduled talk shows unless the title explicitly mentioned a breeder or something breeding related.
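The pruning rule described above could be sketched roughly like this. The keywords "breeding" and "seeds" come from the post, and "breeder" is added to cover the exception for titles that mention one; the show title prefix is a made-up placeholder, not a real FCP show name, and the real pruning may well have been done by hand.

```python
# Keywords that mark a title as breeding related.
KEYWORDS = ("breeding", "breeder", "seeds")

# Hypothetical placeholder for recurring talk show title prefixes.
SHOW_PREFIXES = ("example weekly show",)


def mentions_breeding(title: str) -> bool:
    """True if the title contains any of the breeding-related keywords."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def should_transcribe(title: str) -> bool:
    """Keep everything except scheduled shows whose title has no breeding keyword."""
    is_scheduled_show = title.lower().startswith(SHOW_PREFIXES)
    return (not is_scheduled_show) or mentions_breeding(title)
```

A plain substring match like this will miss titles that only name a breeder without using the word, so it is a first pass at best.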
Still working through Future Cannabis Project. Not super sure where to go next after those. There are great talks from educational or more science-focused sources. However, they usually rely on presentation slides, which would be lost in a transcript, or they are themselves summaries of actual scientific papers, so transcribing them would be kind of redundant.
Another batch of FCP in, total at 360, with another 180 to go from what I’ve grabbed. Starting to look into feeding this text into a language model or some sort of analyzer. Mainly to extract summaries or maybe make some sort of Q&A chatbot. No promises though as I am definitely out of my depth, but having fun learning.
Cornell SIPS is definitely on the list, same as the Apogee/Dr. Bugbee stuff, and both are going next. Also, I appreciate the cross-promotion, I see you.
“Last” of FCP going up. There’s still plenty left but ~550 is enough for now. I am checking downloads against an archive so in theory I shouldn’t have duplicates when I go get another batch later.
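For the curious, checking downloads against an archive can be sketched like this: keep a text file with one video ID per line and skip anything already listed. This is only an illustration under that assumption; the file layout and function names are hypothetical and not necessarily what the actual workflow uses.

```python
from pathlib import Path


def load_archive(path: Path) -> set[str]:
    """Read previously downloaded IDs, one per line; a missing file means none yet."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_new(video_ids: list[str], archive: set[str]) -> list[str]:
    """Return IDs not yet in the archive, keeping order and dropping repeats."""
    seen = set(archive)
    fresh = []
    for video_id in video_ids:
        if video_id not in seen:
            seen.add(video_id)
            fresh.append(video_id)
    return fresh


def record(path: Path, video_ids: list[str]) -> None:
    """Append newly downloaded IDs to the archive file."""
    with path.open("a", encoding="utf-8") as handle:
        for video_id in video_ids:
            handle.write(video_id + "\n")
```

The idea is to call `filter_new` before downloading a batch and `record` only after each download succeeds, so a blocked or failed download gets retried next time.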