AI Speech to Text FAQ


What is Oracle Cloud Infrastructure Speech?

OCI Speech is a service that uses automatic speech recognition (ASR) to convert speech to text. The service enables developers, business units, content providers, tinkerers, and other users to transcribe audio files. With OCI Speech, users can transcribe call center calls or meetings, generate closed captions, and index and search audio and video content.

Why should I use OCI Speech?

You should use OCI Speech if you need a fast, accurate, time-stamped transcription service. If you are using OCI to store your audio files, you will also enjoy lower latencies and no network costs associated with transcription.

How do I get started with OCI Speech?

Start here to create your first transcription, or read more about the service here.


What transcription services do you support?

We currently support file-based asynchronous transcription. We do not offer real-time transcription at this time.

What languages are currently supported?

Transcription comes with pretrained models for the following languages: English, Spanish, and Portuguese.

Are the files I transcribed used by OCI to improve the service (or for anything else)?

No. We only transcribe your content and keep no information from the file.

What else should I know about the service?

Like any other transcription service, the quality of the output depends on the quality of the input audio file. Speakers' accents, background noise, switching between languages, using fusion languages (such as Spanglish), and multiple people speaking simultaneously can all affect the quality of transcription. We are constantly working to improve the performance of the service to provide more accurate transcriptions for all inputs and speakers.

Can OCI Speech automatically detect the language in the file?

Not currently (but soon).

What input file formats do you support?

We support single-channel, 16-bit PCM WAV audio files with a 16 kHz sample rate. We recommend Audacity (GUI) or ffmpeg (command line) for audio transcoding. Additional audio formats are coming soon.
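To check a file against the requirements above before uploading it, a short script along these lines may help. This is a minimal sketch that uses only Python's standard wave module; the function name check_wav is hypothetical and is not part of any OCI SDK.

```python
import wave

# Input requirements as stated in this FAQ.
REQUIRED_CHANNELS = 1        # single-channel (mono)
REQUIRED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit PCM
REQUIRED_SAMPLE_RATE = 16000 # 16 kHz


def check_wav(path):
    """Return a list of human-readable problems; an empty list means the file matches."""
    problems = []
    # wave.open raises wave.Error if the file is not a WAV file it can read.
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != REQUIRED_SAMPLE_RATE:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    return problems
```

Calling check_wav("meeting.wav") (a placeholder file name) returns an empty list when the file is mono, 16-bit, and 16 kHz; otherwise each entry names one mismatch to fix by transcoding.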

What output formats do you support?

We support JSON (the default) and SRT (optional, at no additional cost).

Billing and pricing

How will I be charged?

We use precision billing: the rate is $0.50 per hour of transcription, but usage is measured and aggregated in seconds. For example, if you upload three files lasting 3,600 seconds, 4,575 seconds, and 1,421 seconds, your monthly bill is the total number of seconds (9,596) divided by 3,600 (the number of seconds in an hour), multiplied by $0.50. In other words, 9,596 / 3,600 x $0.50 = approximately $1.33.
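The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name transcription_cost is hypothetical, the rate is the $0.50 per hour quoted above, and rounding to cents is an assumption about how the figure would be displayed.

```python
SECONDS_PER_HOUR = 3600
PRICE_PER_HOUR_USD = 0.50  # rate quoted in this FAQ


def transcription_cost(durations_seconds, price_per_hour=PRICE_PER_HOUR_USD):
    """Sum the audio durations in seconds, convert to hours, and apply the hourly rate."""
    total_seconds = sum(durations_seconds)
    transcription_hours = total_seconds / SECONDS_PER_HOUR
    return transcription_hours * price_per_hour


if __name__ == "__main__":
    files = [3600, 4575, 1421]  # the three durations from the example above
    print(sum(files))                           # total seconds billed
    print(round(transcription_cost(files), 2))  # cost in USD, rounded to cents (assumption)
```

Because usage is aggregated in seconds, a file that runs a few seconds past the hour adds only a fraction of a cent rather than a full additional hour.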

What is the billable metric for OCI Speech?

We named our billable metric “transcription hour.” Transcription hour measures the number of audio hours transcribed during a given month of the service.

Are there any setup charges or minimum service commitments with Speech?

No. OCI Speech does not have any setup charges or minimum service commitments. And there’s no hardware required.

Do you offer any free hours to try out the service?

Yes. We offer five hours of free transcription every month per tenancy.

Do you charge more for punctuation or SRT?

No. Punctuation and SRT output are both included at no extra charge. Note that storing SRT files may increase your storage fees.

Other technical questions

What devices will be supported by OCI Speech?

Speech works with any recording device, and is not device-specific.

My file is not a WAV file. How should I convert my file to WAV?

We recommend using the ffmpeg utility with the following command: ffmpeg -i <input.ext> -fflags +bitexact -acodec pcm_s16le -ac 1 -ar 16000 <output.wav>.
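If you have many files to convert, the same command can be driven from a script. The sketch below only wraps the ffmpeg invocation shown above; the function names are hypothetical, and it assumes ffmpeg is installed and available on your PATH.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build the argument list for the ffmpeg command recommended in this FAQ."""
    return [
        "ffmpeg",
        "-i", src,              # input file in any format ffmpeg can read
        "-fflags", "+bitexact",
        "-acodec", "pcm_s16le", # 16-bit PCM
        "-ac", "1",             # single channel
        "-ar", "16000",         # 16 kHz sample rate
        dst,                    # output WAV file
    ]


def convert_to_wav(src, dst):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

For example, convert_to_wav("call.mp3", "call.wav") (placeholder file names) runs the same conversion as the command above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.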

I am getting the following error message: Either the bucket named “undefined” does not exist in the namespace <namespace> or you are not authorized to access it. How do I fix that?

See the Speech Policy Setup.