Transcribing Breeding Info from Video/Audio sources

Transcriptions are in Progress, requests open!


Index

Pending

  • Requests Open!

Last updated 2023-06-12


Original Post:
So this is an idea that’s been floating around my head for a while. There’s tons of breeding info floating around in video or audio content, and for the vast majority of it the authors don’t provide transcriptions. This is frustrating to me for many reasons: I don’t always want to watch or listen to learn, there’s lots of non-breeding-related chatter, I can’t easily reference or share the info without having to scrub through it again, etc.

Until the recent explosion of accessible deep learning AI, the options for transcribing this content were not great and usually paywalled. I’ll leave the nerdy stuff for the end of the post but I came across a pretty decent method of transcription and I’d love to share this with my fellow OGers.

So here’s the first transcript I’ve done of The Pot Cast’s Ep 50 with Ryan Lee/Chimera (pt. 1)

Transcript Link

I think the results are good enough. Obviously it’s not perfect, but it’s far better than nothing. I’d love to know what y’all think about this. Do you see yourself using this? Which podcasts or videos would you want transcribed?


Nerdy stuff
Quick preface that I am by no means an expert on anything related to this. I just find it fascinating and have been loving tinkering with anything that leverages GPU compute. I am a total casual when it comes to Python (or any programming), DL AI, and everything in between. So, like the pastebin says, this was made using Whisper (specifically, whisper-standalone-win), and I was able to transcribe the 2-hour podcast in 10 minutes (roughly 12x real time) with an RTX 3080 using the large model - not bad at all. Plus, the workflow with the standalone app is dead simple: grab the audio/video, run the command, wait, get the output.
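For anyone who wants to script that grab/run/wait loop, here is a rough sketch of the "run the command" step. It assumes the reference `whisper` command from the openai-whisper package rather than the standalone Windows build mentioned above, whose executable name and options may differ. The file name, output folder, and the helper `build_whisper_command` are made up for illustration.

```python
import shlex
from pathlib import Path


def build_whisper_command(media_path, output_dir="transcripts", model="large"):
    """Return the argument list for one transcription run.

    Assumption: the reference `whisper` CLI (openai-whisper) is on PATH.
    The standalone Windows build may use a different executable and flags.
    """
    return [
        "whisper",
        str(Path(media_path)),
        "--model", model,             # model size, e.g. "large"
        "--output_format", "txt",     # plain-text transcript
        "--output_dir", str(Path(output_dir)),
    ]


cmd = build_whisper_command("episode_50.mp3")  # hypothetical file name
print(shlex.join(cmd))
# To actually transcribe, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list keeps file names with spaces intact, which comes up a lot with podcast episode titles.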

I think there would be some value in creating a public repository of the transcriptions, similar to how open source projects handle translations. That way the info is easily indexed, and everyone can contribute corrections or additional transcriptions. It could even just all be hosted straight on GitHub with a GitHub Pages frontend for browsing it. If there’s enough demand, I would be more than glad to get that set up to host all the requests you OGers want to see.

Additionally, I also came across this project which combines Whisper and pyannote-audio to split the transcription based on speakers. It then generates an HTML site where the text follows along with the video timestamps - very cool! I am new to Jupyter notebooks so I struggled a bit, but I managed to get it working as a regular Python script on Windows. The workflow is super hacky and manual, plus the speaker identification is so-so. Something like this would be the endgame for the GitHub repo idea I mentioned above.
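To give a feel for how transcript text and speaker labels can be combined, here is a minimal sketch that labels each transcript segment with whichever speaker turn overlaps it the most. This is not the linked project's code. The dictionary fields (`start`, `end`, `text`, `speaker`) and the sample data are assumptions for illustration, and real tools may merge the two outputs differently.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two time ranges (0.0 if they do not overlap)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker turn it overlaps most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Hypothetical transcription segments (times in seconds)
segments = [
    {"start": 0.0, "end": 4.0, "text": "Welcome back to the show."},
    {"start": 4.0, "end": 9.5, "text": "Thanks for having me."},
]
# Hypothetical speaker turns from a diarization step
turns = [
    {"start": 0.0, "end": 4.2, "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 10.0, "speaker": "SPEAKER_01"},
]

labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(f'[{seg["start"]:.1f}s] {seg["speaker"]}: {seg["text"]}')
```

Segments that overlap no turn at all stay marked as UNKNOWN, which makes the so-so spots easy to find and fix by hand.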

19 Likes

this.

i do like to listen to them but this also allows me to search faster. thanks, it’s a nice tool.

5 Likes

Very interesting, thank you

2 Likes

Ok, here we go! Nearly 50 episodes of The Pot Cast transcriptions just went up at: canna-transcripts/The Pot Cast at main · muf4/canna-transcripts · GitHub

Pending eps 6-9, 51-83. I am unable to find eps 14, 29, 45 - anyone know if these even exist?

What other content should I transcribe? I am eyeing Shango Los’ Shaping Fire, and Future Cannabis Project has some really good breeding/growing talks as well. The most convenient way to grab content for transcription is via YouTube playlists, so if anyone has some good ones on hand, please share!

Update: eps 1-72 up, pending 73-83

6 Likes

Youngsang Cho 2023 JADAM lecture tour in the US:

https://youtube.com/playlist?list=PLba0q5T16c0VFr272n7yCtiR8QcnSgviS

2 Likes

Ooh these are new to me, really awesome stuff in here. Added to pending requests! I’ll get this started after I finish up the last batch of Pot Casts.

1 Like

These are up!
Some notes though:

  • The Korean audio translation actually went well, but I did notice some repeated phrases that don’t quite match up with the YT subtitles. There are tools out there to rip the subtitles directly off videos, but I don’t want to make exceptions to the overall workflow if it can be avoided. Basically, the difference is that this transcription is a direct translation of the audio, while the subtitles in those Korean-audio videos were written manually. Let me know how they look to you.
  • In the English voiceover videos, the Korean phrases are not properly translated, since this tool doesn’t seem to be able to handle multiple languages in one file. I guess it can only deal with one language at a time.

1 Like

Thanks dude! I’ll look around for some other interesting and more obscure playlists for this project, I love it!

1 Like

Some more requests please

DocCalyx podcast:

https://m.youtube.com/@CalyxCrewPodcast/playlists

Matt Riot’s Breeders Syndicate:

https://m.youtube.com/@BreedersSyndicate/playlists

An incredible archive of historical interviews including Jack Herer, Dennis Peron and more:

https://m.youtube.com/@RestoreHemp/videos

2 Likes

Sweet suggestions! Breeder’s Syndicate was on my wishlist as well. Will try to get some of these over the weekend. Cheers!

2 Likes

Thank you! I’m so glad someone finally did this!

1 Like

Just put up about a third of Breeder’s Syndicate. I did catch one that got bugged and just printed “I” for about half the text but it was fine when I re-ran it. If you come across anything weird like that, let me know!
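A quick way to catch that kind of glitch across hundreds of files is to check whether a single word makes up an implausibly large share of a transcript. The sketch below is only a heuristic: the 0.4 threshold and the function names are assumptions, not something tuned on the real transcripts.

```python
from collections import Counter


def dominant_token_share(text):
    """Return the most common word and the fraction of all words it accounts for."""
    words = text.lower().split()
    if not words:
        return None, 0.0
    token, count = Counter(words).most_common(1)[0]
    return token, count / len(words)


def looks_bugged(text, threshold=0.4):
    """Flag a transcript when one word dominates it (threshold is an assumed cutoff)."""
    _, share = dominant_token_share(text)
    return share >= threshold


# Made-up examples: one normal sentence, one stuck on a repeated "I"
normal = "the plant was selected for its terpene profile and vigor"
glitched = "we crossed the two lines " + "I " * 50

print(looks_bugged(normal), looks_bugged(glitched))
```

Anything flagged this way is a candidate for a re-run, which was enough to fix the one case described above.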

One thing that’s dawning on me now is that this is kind of creating a new problem :laughing: Now we need another AI to help summarize all this text. Maybe there will be enough source material to train one of the language models like GPT so we can have it answer breeding questions :thinking:

Also, updated about 10% of what I grabbed from Future Cannabis Project that showed up under “breeding” or “seeds”. There is still so much pending, these guys put out a ton of content. I tried to prune some of it down to avoid the scheduled talk shows unless the title explicitly mentioned a breeder or something breeding related.
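The pruning described here boils down to a keyword check on video titles. A minimal sketch, assuming a hand-picked keyword list and made-up titles (neither is the actual list used for this batch):

```python
# Assumed keyword list; extend it with breeder names as needed
KEYWORDS = ("breeding", "breeder", "seed")


def is_relevant(title, keywords=KEYWORDS):
    """True when the title mentions any keyword, ignoring case."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical video titles
titles = [
    "Breeding for terpenes with a guest breeder",
    "Friday Night Live Talk Show",
    "Seed storage and germination",
]

relevant = [title for title in titles if is_relevant(title)]
print(relevant)
```

Substring matching means "seed" also catches "seeds" and "seedlings", at the cost of the occasional false positive.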

2 Likes

Shaping Fire and Breeder’s Syndicate fully up, Doc Calyx and RestoreHemp archive next on the list :muscle:

1 Like

The remaining ones requested here are up!

Still working through Future Cannabis Project. Not super sure where to go next after those. There are great talks from educational or more science-focused sources. However, they usually rely on presentation slides, which would be lost in a transcript, or they are summaries of actual scientific papers, which makes transcribing them kind of redundant.

Definitely open to more ideas or requests :thinking:

1 Like

Another batch of FCP in, total at 360, with another 180 to go from what I’ve grabbed. Starting to look into feeding this text into a language model or some sort of analyzer. Mainly to extract summaries or maybe make some sort of Q&A chatbot. No promises though as I am definitely out of my depth, but having fun learning.

2 Likes

Here’s some neat stuff @Northern_Loki found from Cornell:

2 Likes

Cornell SIPS is definitely on the list, same as the Apogee/Dr. Bugbee stuff - both are up next. Also, I appreciate the cross-promotion, I see you :face_with_monocle:

“Last” of FCP going up. There’s still plenty left but ~550 is enough for now. I am checking downloads against an archive so in theory I shouldn’t have duplicates when I go get another batch later.
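The archive check can be sketched as a plain text file of already-processed video IDs. Downloaders such as yt-dlp offer a `--download-archive` option along these lines; whether that is what was used here is not stated, so treat the file layout, IDs, and function names below as assumptions.

```python
import tempfile
from pathlib import Path


def load_archive(archive_path):
    """Read previously processed IDs, one per line (empty set if no file yet)."""
    path = Path(archive_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_new(video_ids, archive_path):
    """Keep only IDs not in the archive, dropping repeats within the batch too."""
    seen = load_archive(archive_path)
    new_ids = []
    for vid in video_ids:
        if vid not in seen:
            seen.add(vid)
            new_ids.append(vid)
    return new_ids


def record_done(video_id, archive_path):
    """Append an ID to the archive once its transcript is finished."""
    with open(archive_path, "a", encoding="utf-8") as fh:
        fh.write(video_id + "\n")


# Demo with a temporary archive file and made-up IDs
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.txt"
    first_batch = filter_new(["vid_a", "vid_b"], archive)
    for vid in first_batch:
        record_done(vid, archive)
    second_batch = filter_new(["vid_b", "vid_c"], archive)

print(first_batch, second_batch)
```

Recording an ID only after its transcript is done means an interrupted batch gets picked up again on the next run.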

1 Like

These really go by quicker than I expect. Cornell SIPS and Apogee transcripts up. Kept only the ones most related to growing/breeding for each.

1 Like

Thinking of doing another round of transcriptions to catch up on episodes from already transcribed series. Any suggestions for additions?

1 Like

Maybe some Mechoulam interviews or presentations?

This looks like good stuff too:

https://youtube.com/playlist?list=PLY7VMskICYhwGdVzs19ngXDBynfW3VlPu

1 Like