koreader

unit	util.utf8: improve CJK character detection	3 years ago

Aleksa Sarai 6f1b70e5eb util.utf8: improve CJK character detection

Previously the CJK character detection defined only characters in the
range U+4000..U+AFFF as "CJK characters". This excludes a very large
number of CJK characters within the BMP, let alone the two whole planes
dedicated to rarer CJK characters (the SIP and TIP). As a result, many
Chinese, Japanese, and Korean characters were not detected as CJK
characters.

While slightly less elegant-looking, it is far more accurate to compute
the codepoint from the utf8 character and then check whether it falls
within one of the defined CJK blocks. This is not future-proof against
CJK ideograph extensions added in later Unicode versions, but there is
no way to accurately predict such changes, so this is the best we can do
without accidentally treating characters explicitly defined as non-CJK
in Unicode as CJK.
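
A minimal Lua sketch of the approach described above: decode the
codepoint from a utf8 character, then test it against a table of block
ranges. The names utf8_codepoint, is_cjk and CJK_RANGES are hypothetical
and are not taken from this commit. The range table is only a small
illustrative subset of the Unicode CJK-related blocks, and the decoder
assumes well-formed UTF-8 input.

```lua
-- Illustrative subset of CJK-related Unicode blocks (not exhaustive).
local CJK_RANGES = {
    { 0x1100,  0x11FF  }, -- Hangul Jamo
    { 0x3000,  0x303F  }, -- CJK Symbols and Punctuation
    { 0x3040,  0x309F  }, -- Hiragana
    { 0x30A0,  0x30FF  }, -- Katakana
    { 0x3400,  0x4DBF  }, -- CJK Unified Ideographs Extension A
    { 0x4E00,  0x9FFF  }, -- CJK Unified Ideographs
    { 0xAC00,  0xD7AF  }, -- Hangul Syllables
    { 0xF900,  0xFAFF  }, -- CJK Compatibility Ideographs
    { 0x20000, 0x2A6DF }, -- CJK Unified Ideographs Extension B (SIP)
}

-- Decode the first UTF-8 sequence in `c` to its codepoint.
-- Assumes `c` is a well-formed UTF-8 character (no validation is done).
local function utf8_codepoint(c)
    local b1, b2, b3, b4 = c:byte(1, 4)
    if not b1 then
        return nil
    elseif b1 < 0x80 then
        return b1
    elseif b1 < 0xE0 then
        return (b1 - 0xC0) * 0x40 + (b2 - 0x80)
    elseif b1 < 0xF0 then
        return (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
    else
        return (b1 - 0xF0) * 0x40000 + (b2 - 0x80) * 0x1000
             + (b3 - 0x80) * 0x40 + (b4 - 0x80)
    end
end

-- Return true if the character's codepoint lies in one of the ranges.
local function is_cjk(c)
    local cp = utf8_codepoint(c)
    if not cp then return false end
    for _, range in ipairs(CJK_RANGES) do
        if cp >= range[1] and cp <= range[2] then
            return true
        end
    end
    return false
end
```

Under the old U+4000..U+AFFF check, characters such as "あ" (U+3042) and
"한" (U+D55C) fall outside the range; with a block table like the one
above they are matched.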

While we're at it, copy Lua 5.3's utf8.charpattern constant definition
so that we can more easily write utf8 iterators with string.gmatch (at
least in the interim until there is a rework of utf8 handling in
KOReader and everything is rebuilt on top of utf8proc).
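
As a rough illustration of why such a pattern is convenient, the sketch
below iterates over utf8 characters with string.gmatch. The constant
name UTF8_CHAR_PATTERN and the helper utf8_chars are hypothetical, not
the names used in this commit. The pattern is written with decimal
escapes and, unlike Lua 5.3's utf8.charpattern, leaves out the NUL byte
so that it also loads on Lua 5.1; it assumes valid UTF-8 input.

```lua
-- One lead byte (ASCII or 0xC2..0xF4) followed by any continuation
-- bytes (0x80..0xBF). NUL is deliberately excluded for Lua 5.1.
local UTF8_CHAR_PATTERN = "[\1-\127\194-\244][\128-\191]*"

-- Return an iterator over the utf8 characters of `text`.
local function utf8_chars(text)
    return text:gmatch(UTF8_CHAR_PATTERN)
end

-- Collect the characters of a mixed ASCII / Latin / CJK string:
-- "a", U+00E9, U+4E2D, U+D55C.
local chars = {}
for c in utf8_chars("a\195\169\228\184\173\237\149\156") do
    chars[#chars + 1] = c
end
```

Each iteration yields one whole multi-byte character rather than a
single byte, so per-character checks can be written as a plain for loop.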

Some unit tests are added for Korean and Japanese text, and the existing
unit tests needed a minor adjustment to handle the fact that
isSplittable now correctly detects CJK punctuation as a character to
compare against the forbidden split rules.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>