On-premise video indexing with visual data
Current video indexing is mostly done by converting speech to text and then indexing the transcript. This is done mainly to save compute resources, and because algorithms that index video by its visual content are not as mature as those for text.
e.g. Searching for a video of an old man with red shoes wouldn't yield results under speech-to-text indexing unless the audio in the video happened to describe that scene.
It is said that even YouTube hasn't indexed all of its videos by visual data, due to the sheer number of videos. There are APIs available, like Microsoft's, which claim to enable video indexing with visual metadata, but they are not on-premise solutions, so using them can rack up large bills when the number of videos is high.
There is a need gap for on-premise video indexing technology.
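To make the contrast with transcript search concrete, here's a toy sketch of what a visual index could look like: an inverted index mapping visual labels (as an on-premise object detector might emit per frame) to timestamps. The class name, labels, and detector output format are all illustrative assumptions, not any particular product's API:

```python
from collections import defaultdict

class VisualIndex:
    """Toy inverted index: visual label -> list of (video_id, timestamp)."""

    def __init__(self):
        self.index = defaultdict(list)

    def add_frame(self, video_id, timestamp, labels):
        # 'labels' would come from an on-premise object detector in practice.
        for label in labels:
            self.index[label].append((video_id, timestamp))

    def search(self, query_labels):
        # Return locations where ALL query labels co-occur in one frame.
        hits = None
        for label in query_labels:
            locs = set(self.index.get(label, []))
            hits = locs if hits is None else hits & locs
        return sorted(hits or [])

idx = VisualIndex()
idx.add_frame("vid1", 12.0, ["old man", "red shoes", "park bench"])
idx.add_frame("vid1", 40.5, ["dog"])
idx.add_frame("vid2", 3.2, ["red shoes"])

print(idx.search(["old man", "red shoes"]))  # → [('vid1', 12.0)]
```

A speech-to-text index has the same shape but is keyed on transcribed words, so the "old man with red shoes" query only matches if someone said those words on the audio track.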
I was just noodling over something similar the other day with regard to the "speech-to-text" bit. I know that when you speed up videos on YouTube, the audio speeds up as well. I always wondered whether the speech-to-text algorithms were able to transcribe that. My first thought was "the NSA can probably do this".
YouTube does use speech-to-text for indexing its videos. The transcriptions are associated with timestamps, so if playback is sped up the text still syncs fine with the audio.