Transcriptions are in progress, requests open!
Index
- The Pot Cast
- Youngsang Cho 2023 JADAM lectures
- Breeder’s Syndicate
- Shaping Fire
- Future Cannabis Project
- Calyx & Crew Podcast
- Restore Hemp Archive
- Cornell SIPS
- Apogee Instruments
- Grow From Your Heart Podcast
Pending
- Requests Open!
Last updated 2023-06-12
Original Post:
So this is an idea that’s been floating around my head for a while. There’s tons of breeding info out there in video or audio form, and for the vast majority of it the authors don’t provide transcriptions. This is frustrating for several reasons: I don’t always want to watch or listen to learn, there’s a lot of non-breeding-related chatter, and I can’t easily reference or share the information without scrubbing through it all over again.
Until the recent explosion of accessible deep learning AI, the options for transcribing this content were not great and usually paywalled. I’ll leave the nerdy stuff for the end of the post, but I came across a pretty decent method of transcription and I’d love to share it with my fellow OGers.
So here’s the first transcript I’ve done of The Pot Cast’s Ep 50 with Ryan Lee/Chimera (pt. 1)
I think the results are good enough. Obviously it’s not perfect, but it’s far better than nothing. I’d love to know what y’all think about this. Do you see yourself using this? Which podcasts or videos would you want transcribed?
Nerdy stuff
Quick preface: I am by no means an expert on anything related to this. I just find it fascinating and have loved tinkering with anything that leverages GPU compute. I’m a total casual when it comes to Python (or any programming), deep learning AI, and everything in between. So, as the pastebin says, this was made using Whisper (specifically, whisper-standalone-win), and I was able to transcribe the two-hour podcast in 10 minutes on an RTX 3080 using the large model - not bad at all. Plus, the workflow with the standalone app is dead simple: grab the audio/video, run the command, wait, get the output.
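For anyone curious what “get the output” looks like: Whisper emits the transcript as a list of segments, each with a start time, end time, and text, which is easy to post-process into a readable timestamped transcript. A minimal sketch in plain Python - the segment data below is made up for illustration, not from an actual run:

```python
def format_transcript(segments):
    """Turn Whisper-style segments (start, end, text) into timestamped lines."""
    def hms(seconds):
        # Convert float seconds to HH:MM:SS
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}"

    return "\n".join(
        f"[{hms(seg['start'])} - {hms(seg['end'])}] {seg['text'].strip()}"
        for seg in segments
    )

# Hypothetical segments, shaped like Whisper's JSON output
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back to the show."},
    {"start": 4.2, "end": 9.8, "text": " Today we're talking about breeding."},
]
print(format_transcript(segments))
```

The same loop could just as easily emit SRT or Markdown, which matters for the repository idea below.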
I think there would be some value in creating a public repository of the transcriptions, similar to how open-source projects handle translations. That way the info is easily indexed, and everyone can contribute corrections or additional transcriptions. It could even all be hosted straight on GitHub, with a GitHub Pages frontend for browsing. If there’s enough demand, I’d be more than glad to set that up to host all the requests you OGers want to see.
Additionally, I came across this project, which combines Whisper and pyannote-audio to split the transcription by speaker. It then generates an HTML site where the text follows along with the video, matched up to the timestamps - very cool! I’m new to Jupyter notebooks so I struggled a bit, but I managed to get it working as a regular Python script on Windows. The workflow is super hacky and manual, and the speaker identification is so-so. Something like this would be the endgame for the GitHub repo idea I mentioned above.
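The matching step in that kind of pipeline boils down to assigning each Whisper segment to whichever pyannote diarization turn overlaps it the most in time. A rough sketch of that idea - the data structures and speaker labels here are my own stand-ins, not the actual project’s code:

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose
    diarization turn overlaps it the most (measured in seconds)."""
    labeled = []
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            # Overlap between [seg.start, seg.end] and [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

# Hypothetical transcript segments and diarization turns
segments = [
    {"start": 0.0, "end": 5.0, "text": "Hey everyone, welcome back."},
    {"start": 5.0, "end": 12.0, "text": "Thanks for having me on."},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.8},
    {"speaker": "SPEAKER_01", "start": 4.8, "end": 12.5},
]
for seg in assign_speakers(segments, turns):
    print(f"{seg['speaker']}: {seg['text']}")
```

Because diarization and transcription boundaries rarely line up exactly, this greedy overlap match is also where the “so-so” speaker identification tends to show.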