Commit 7ebf6a1

committed
update README
1 parent 04ad459 commit 7ebf6a1

1 file changed

+23
-13
lines changed

README.md

Lines changed: 23 additions & 13 deletions
@@ -2,7 +2,7 @@
22

33
Extract hardcoded subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.
44

5-
Input video with hardcoded subtitles:
5+
Input a video with hardcoded subtitles:
66

77
<p float="left">
88
<img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
@@ -12,43 +12,48 @@ Input video with hardcoded subtitles:
1212
```python
1313
import videocr
1414

15-
print(videocr.get_subtitles('video.avi', lang='HanS'))
15+
print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
1616
```
1717

1818
Output:
1919

2020
```
2121
0
22-
00:00:00,000 --> 00:00:02,711
23-
-谢谢 … 你 好 -谢谢
24-
Thank you...Hi. Thanks.
22+
00:00:01,042 --> 00:00:02,877
23+
喝 点 什么 ?
24+
What can I get you?
2525
2626
1
27-
00:00:02,794 --> 00:00:04,879
28-
喝 点 什么 ?
29-
What can I get you?
27+
00:00:03,044 --> 00:00:05,463
28+
我 不 知道
29+
Um, I'm not sure.
3030
3131
2
32-
00:00:05,046 --> 00:00:12,554
32+
00:00:08,091 --> 00:00:10,635
3333
休闲 时 光 …
3434
For relaxing times, make it...
3535
3636
3
37-
00:00:12,804 --> 00:00:14,723
37+
00:00:10,677 --> 00:00:12,595
3838
三 得 利 时 光
3939
Bartender, Bob Suntory time.
4040
4141
4
42-
00:00:16,474 --> 00:00:19,144
42+
00:00:14,472 --> 00:00:17,142
43+
我 要 一 杯 伏特 加
4344
Un, I'll have a vodka tonic.
4445
4546
5
46-
00:00:19,394 --> 00:00:20,687
47+
00:00:18,059 --> 00:00:19,019
4748
谢谢
4849
Laughs Thanks.
4950
5051
```
5152

53+
## Performance
54+
55+
The OCR process runs in parallel and is CPU intensive. It takes 3 minutes on my dual-core laptop to extract subtitles from a 20-second video. You may want more cores for longer videos.
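
As a rough back-of-the-envelope sketch of what those figures imply, the snippet below extrapolates from the single measurement quoted above (3 minutes for a 20-second video on 2 cores). The function name `estimate_processing_minutes` is hypothetical, and the linear scaling with video length and core count is an assumption, not something measured here; real throughput will depend on resolution, frame rate, and language data.

```python
# Figures quoted in the README: 3 minutes for a 20-second video on 2 cores.
MEASURED_PROCESSING_SECONDS = 3 * 60
MEASURED_VIDEO_SECONDS = 20
MEASURED_CORES = 2


def estimate_processing_minutes(video_seconds, cores=MEASURED_CORES):
    """Very rough estimate of OCR time in minutes.

    ASSUMPTION: time grows linearly with video length and shrinks
    linearly with the number of cores. This is an illustration only.
    """
    if video_seconds < 0 or cores < 1:
        raise ValueError("video_seconds must be >= 0 and cores >= 1")
    # Core-seconds of work needed per second of video (18 for the quoted figures).
    core_seconds_per_video_second = (
        MEASURED_PROCESSING_SECONDS * MEASURED_CORES / MEASURED_VIDEO_SECONDS
    )
    return video_seconds * core_seconds_per_video_second / cores / 60


# The quoted measurement itself: 20 seconds on 2 cores.
print(estimate_processing_minutes(20, cores=2))
# A 10-minute video on a hypothetical 8-core machine.
print(estimate_processing_minutes(600, cores=8))
```

Under these assumptions, each second of video costs about 18 core-seconds, which is why longer videos benefit from more cores.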
56+
5257
## API
5358

5459
```python
@@ -71,7 +76,11 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
7176

7277
- `lang`
7378

74-
Language of the subtitles in the video. Besides `eng` for English, all language codes on [this page](https://github.com/tesseract-ocr/tessdata_best/tree/master/script) are supported.
79+
The language of the subtitles in the video. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.
80+
81+
Note that you can use more than one language. For example, `'hin+eng'` means using Hindi and English together for recognition. More details are available in the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#using-multiple-languages).
82+
83+
Language data files will be automatically downloaded to your `$HOME/tessdata` directory when necessary. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).
7584

7685
- `time_start` and `time_end`
7786

@@ -92,3 +101,4 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
92101
- `use_fullframe`
93102

94103
By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.
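
To picture what "bottom half of each frame" means, here is a minimal sketch that treats a frame as a list of pixel rows. The helpers `bottom_half` and `select_region` are hypothetical names for illustration; how videocr actually crops frames internally is not shown in this README, so treat this as an assumption about the general idea rather than the library's implementation.

```python
def bottom_half(frame):
    """Return the rows from the vertical midpoint down to the last row."""
    return frame[len(frame) // 2:]


def select_region(frame, use_fullframe=False):
    """Pick the region handed to OCR: the whole frame or only its bottom half."""
    return frame if use_fullframe else bottom_half(frame)


# A toy 4-row "frame"; each row is filled with its own row index.
frame = [[row] * 3 for row in range(4)]

print(select_region(frame))                      # [[2, 2, 2], [3, 3, 3]]
print(len(select_region(frame, use_fullframe=True)))  # 4
```

Subtitles placed in the top half of the picture would fall outside the default region in this sketch, which is the situation `use_fullframe` is meant for.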
104+
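
For the `lang` parameter described above, a combined value is simply the individual codes joined with `+`. The helper `build_lang` below is a hypothetical convenience, not part of videocr; the commented-out call reuses the `get_subtitles` signature from the usage example and assumes videocr is installed and that `video.avi` exists.

```python
def build_lang(*codes):
    """Join Tesseract language codes or script names with '+'."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


print(build_lang("chi_sim", "eng"))  # chi_sim+eng
print(build_lang("HanS"))            # HanS

# Hypothetical usage (requires videocr and a real video file):
# import videocr
# print(videocr.get_subtitles('video.avi', lang=build_lang('hin', 'eng')))
```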
