docs: initial user-guide pass for string functions (#2635)
* docs: initial pass

* docs: add sample data
agoose77 authored Aug 11, 2023
1 parent 3f33c57 commit e2e5df6
Showing 6 changed files with 498 additions and 0 deletions.
11 changes: 11 additions & 0 deletions docs/_toc.yml
@@ -90,6 +90,17 @@ subtrees:
- file: user-guide/how-to-math-gpu
  title: "On GPUs [todo]"

- file: user-guide/how-to-strings
  title: "Working with strings"
  subtrees:
    - entries:
        - file: user-guide/how-to-strings-read-binary
          title: "Reading UTF-8 binary streams"
        - file: user-guide/how-to-strings-extract-substrings
          title: "Extracting substrings with regex"
        - file: user-guide/how-to-strings-split-and-join
          title: "Splitting and joining"

- file: user-guide/how-to-filter
  title: "Filtering data"
  subtrees:
Binary file added docs/samples/Android.head.log.gz
Binary file not shown.
141 changes: 141 additions & 0 deletions docs/user-guide/how-to-strings-extract-substrings.md
@@ -0,0 +1,141 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# How to extract substrings using regular expressions

+++

Let's consider the following log data

```{code-cell} ipython3
import awkward as ak
lines = ak.from_iter(
[
"12-17 19:31:36.263 1795 1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r",
"12-17 19:31:36.263 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
"12-17 19:31:36.264 1795 1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0\r",
"12-17 19:31:36.264 1795 2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823\r",
"12-17 19:31:36.264 1795 2750 I HwLightsService: back light level before map = 823\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
"12-17 19:31:36.264 1795 1825 V KeyguardServiceDelegate: onScreenTurnedOn()\r",
"12-17 19:31:36.264 1795 1825 I WindowManger_keyguard: onScreenTurnedOn()\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Display ready!\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Finished business...\r",
"12-17 19:31:36.264 2852 3328 D KeyguardService: Caller checkPermission fail\r",
"12-17 19:31:36.264 2852 3328 D KeyguardService: KGSvcCall onScreenTurnedOn.\r",
"12-17 19:31:36.264 2852 3328 D KeyguardViewMediator: notifyScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 I PhoneStatusBar: onScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 D KGWallpaper_Magazine: getNextIndex: 0; from 5 to 5; size: 44\r",
"12-17 19:31:36.265 2852 2852 I HwLockScreenReporter: report msg is :{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg}\r",
"12-17 19:31:36.265 2852 2852 W HwLockScreenReporter: report result = falsereport type:162 msg:{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg, channelId: 05}\r",
"12-17 19:31:36.265 2852 2852 I OucScreenOnCounter: Screen already turned on at: 1481974212\r",
"12-17 19:31:36.267 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
"12-17 19:31:36.270 1795 16500 I HwActivityManagerService: Split enqueueing broadcast [callerApp]:ProcessRecord{580cfb2 5224:com.huawei.health:DaemonService/u0a99}\r",
"12-17 19:31:36.271 2852 2852 I EventCenter: EventCenter Get :android.com.huawei.bone.NOTIFY_SPORT_DATA\r",
"12-17 19:31:36.275 7741 7741 D Mms_TX_NOTIFY: Get no-perm notification callback android.intent.action.SCREEN_ON\r",
"12-17 19:31:36.275 7741 7741 D Mms_TX_NOTIFY: ScreenState present\r",
"12-17 19:31:36.275 5224 5283 I Step_HSNH: 20002302|upDateHealthNotification()|89|2.98|4180\r",
"12-17 19:31:36.276 2883 2996 I HwSystemManager: ITrafficInfo:ITrafficInfo create 301updateBytes = 1769320345\r",
"12-17 19:31:36.278 5224 5283 I Step_HSNH: 20002302|rebuild notification\r",
"12-17 19:31:36.279 2852 2925 I EventCenter: ContentChange for slot: 1\r",
"12-17 19:31:36.279 2852 2852 I HwBrightnessController: onChange selfChange:false uri.toString():content://settings/system/screen_auto_brightness mIsObserveAutoBrightnessChange:true\r",
"12-17 19:31:36.279 1795 1825 D FpDataCollector: case xxx, not a fingerprint unlock \r",
"12-17 19:31:36.280 1795 1825 D PowerManagerService: ready=true,policy=3,wakefulness=1,wksummary=0x11,uasummary=0x1,bootcompleted=true,boostinprogress=false,waitmodeenable=false,mode=true,manual=33,auto=-1,adj=0.0userId=0\r",
"12-17 19:31:36.280 1795 1825 I PowerManager_screenOn: PowerManagerNotifier onWakefulnessChangeFinished mInteractiveChanging=true, mInteractive=true\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessUtils: APS brightness=20.0,ConvertToPercentage=0.21667233\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessUtils: getSeekBarProgress isAutoMode:true current brightness:20 percentage:0.21667233\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessController: updateSlider1 seekBarProgress:2167\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessController: updateSlider2 seekBarProgress:2167\r",
"12-17 19:31:36.280 2852 2852 I ToggleSlider: mSeekListener onProgressChanged progress:2167 fromUser:false\r",
"12-17 19:31:36.281 2852 2852 I ToggleSlider: mSeekListener onProgressChanged progress:2167 fromUser:false\r",
"12-17 19:31:36.282 3626 3753 I LogCollectService: msg = 103 received\r",
"12-17 19:31:36.283 1795 11747 I NotificationManager: enqueueNotificationInternal: pkg=com.huawei.health id=10010 notification=Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x2 color=0x00000000 vis=PRIVATE)\r",
"12-17 19:31:36.284 1795 1795 I NotificationManager: enqueueNotificationInternal: n.getKey = 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.285 1795 2750 D HW_DISPLAY_EFFECT: presently, hw_update_color_temp_for_rg_led interface not achieved.\r",
"12-17 19:31:36.285 3466 3466 I Contacts: DialpadFragment mBroadcastReceiver action:android.intent.action.SCREEN_ON\r",
"12-17 19:31:36.289 3608 3608 D InCall : InCallActivity - mScreenOnReceiver mCallEndOptionsDialog = null\r",
"12-17 19:31:36.295 1795 1795 V NotificationService: disableEffects=null canInterrupt=false once update: false\r",
"12-17 19:31:36.297 2852 2852 I StatusBar: onNotificationPosted: StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)) important=2, post=1481974296283, when=1481531589202, vis=0, userid=0\r",
"12-17 19:31:36.297 2852 2852 D StatusBar: updateNotification(StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)))\r",
"12-17 19:31:36.298 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.299 2852 2852 I StatusBarIconView: updateTint: tint=0\r",
"12-17 19:31:36.300 2852 2852 D StatusBar: No peeking: unimportant notification: 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D StatusBar: applyInPlace=true shouldPeek=false alertAgain=true\r",
"12-17 19:31:36.301 2852 2852 I NotificationGroupManager: onEntryUpdated:0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 I NotificationGroupManager: onEntryAdded:0|com.huawei.health|10010|null|10099, group=0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D StatusBar: reusing notification for key: 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.301 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.302 2852 2852 I StatusBarIconView: updateTint: tint=0\r",
"12-17 19:31:36.304 16628 16628 I TotemWeather: RetryTaskController:mTaskList is null\r",
"12-17 19:31:36.311 2852 2852 I HwPhoneStatusBar: updateNotificationShade\r",
"12-17 19:31:36.311 2852 2852 I PhoneStatusBar: updateNotificationShade\r",
"12-17 19:31:36.311 2852 2852 I PhoneStatusBar: removeNotificationChildren\r",
"12-17 19:31:36.311 2852 2852 I HwNotificationIconAreaController: showNotificationAll\r",
"12-17 19:31:36.313 31949 31967 I PushService: main{1} PushService.onStartCommand(PushService.java:87) Push Service Start by userEvent\r",
]
)
```

The {mod}`ak.str` module provides the {func}`ak.str.extract_regex` function, which decomposes an array of strings into an array of records. Each field of the newly created records corresponds to a named group in the regular expression. Let's define a regular expression to match our log lines

```{code-cell} ipython3
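pattern = (
    # Timestamp
    r"(?P<datetime>\d\d-\d\d\s\d\d:\d\d:\d\d)\."
    # Fractional seconds (followed by one or more whitespace characters)
    r"(?P<datetime_frac>\d\d\d)\s+"
    # Two four-digit integers (in logcat's threadtime format these are
    # the process and thread IDs)
    r"(?P<i0>\d\d\d\d)\s+"
    r"(?P<i1>\d\d\d\d)\s"
    # String category (a single-letter log level)
    r"(?P<category>\w)\s"
    # String message
    r"(?P<message>.*)"
)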
```

Does this match the first line?

```{code-cell} ipython3
lines[0]
```

Let's use the {mod}`re` module to parse this line with the pattern above

```{code-cell} ipython3
import re
match = re.match(pattern, lines[0])
match.groupdict()
```
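
Not every line has to match. As a minimal standard-library sketch, the snippet below applies a whitespace-tolerant variant of the pattern (the name `sketch_pattern` and the use of `\s+` are assumptions made so that the example does not depend on exact spacing) to two lines copied from the sample data. The second line contains five-digit integers, so `re.match` returns `None` for it.

```python
import re

# Hypothetical, whitespace-tolerant variant of the pattern defined above
sketch_pattern = (
    r"(?P<datetime>\d\d-\d\d\s\d\d:\d\d:\d\d)\."
    r"(?P<datetime_frac>\d\d\d)\s+"
    r"(?P<i0>\d\d\d\d)\s+"
    r"(?P<i1>\d\d\d\d)\s"
    r"(?P<category>\w)\s"
    r"(?P<message>.*)"
)

four_digit_ids = "12-17 19:31:36.264 1795 1825 D DisplayPowerController: Display ready!\r"
five_digit_ids = "12-17 19:31:36.304 16628 16628 I TotemWeather: RetryTaskController:mTaskList is null\r"

matched = re.match(sketch_pattern, four_digit_ids)
unmatched = re.match(sketch_pattern, five_digit_ids)

# A successful match exposes the named groups as a dictionary
print(matched.groupdict())
# A failed match is simply `None`
print(unmatched)
```

Note that `.` matches the trailing carriage return, so the `message` group of the first line still ends in `\r`.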

Let's now apply {func}`ak.str.extract_regex` to our array of lines using this pattern

```{code-cell} ipython3
structured = ak.str.extract_regex(lines, pattern)
structured
```

The type of `structured` is an "optional record of optional fields". This is because either the match itself can fail (producing the outer option), or individual groups may be missing (producing the inner options). If we know that either all groups succeed or all groups fail, then we can lift the inner options outside the record. To do this, we need to decompose the record and rebuild it with {func}`ak.zip`, which provides a special `optiontype_outside_record` argument.

```{code-cell} ipython3
fields = ak.fields(structured)
contents = ak.unzip(structured)
result = ak.zip(dict(zip(fields, contents)), optiontype_outside_record=True)
result
```
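
The idea of moving the option from the fields to the record can be illustrated in plain Python. This is a conceptual sketch, not Awkward Array's implementation: a record is a `dict`, a missing value is `None`, and the hypothetical helper `lift_options` treats any record with a missing field as a missing record, which is the assumption made in the text above.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def lift_options(records: list[Optional[Record]]) -> list[Optional[dict[str, str]]]:
    """Hypothetical helper: move missing-ness from the fields to the whole record."""
    lifted = []
    for record in records:
        if record is None or any(value is None for value in record.values()):
            # The match failed, or at least one group is missing
            lifted.append(None)
        else:
            lifted.append(dict(record))
    return lifted


option_of_record_of_options = [
    {"category": "I", "message": "onScreenTurnedOn"},
    None,  # the whole match failed
    {"category": "D", "message": None},  # a single group is missing
]

print(lift_options(option_of_record_of_options))
```

After lifting, every record that survives has no missing fields, so downstream code only needs to check for a missing record.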
144 changes: 144 additions & 0 deletions docs/user-guide/how-to-strings-read-binary.md
@@ -0,0 +1,144 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Read strings from a binary stream

Awkward Array represents arrays of strings as ragged lists of [code-units](https://en.wikipedia.org/wiki/UTF-8). As such, successive strings are closely packed in memory, leading to high-performance operations.

+++

Let's imagine that we want to read some logging output that is stored in a text file, for example [a subset of logs from the Android Application framework](https://zenodo.org/record/8196385).

```{code-cell} ipython3
import gzip
import itertools
import pathlib
# Preview logs
log_path = pathlib.Path("..", "samples", "Android.head.log.gz")
with gzip.open(log_path, "rt") as f:
    for line in itertools.islice(f, 8):
        print(line, end="")
```

To begin with, we can read the decompressed log file into a NumPy array of {data}`np.uint8` dtype, and convert the resulting array to an Awkward Array

```{code-cell} ipython3
import awkward as ak
import numpy as np
with gzip.open(log_path, "rb") as f:
    # `gzip.open` doesn't return a true file descriptor that NumPy can ingest,
    # so we read the decompressed bytes into memory instead.
    arr = np.frombuffer(f.read(), dtype=np.uint8)

raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()
```
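
The read-into-memory approach can also be tried without the sample file. The sketch below uses only the standard library: it compresses two made-up log lines in memory and reads them back as raw bytes. Each byte is an integer between 0 and 255, which is what the `uint8` dtype stores. The contents of `fake_log` are invented for illustration.

```python
import gzip
import io

# Invented stand-in for the compressed log file
fake_log = "first line\r\nsecond line\r\n"
compressed = gzip.compress(fake_log.encode("utf-8"))

# `gzip.open` accepts a file-like object as well as a path
with gzip.open(io.BytesIO(compressed), "rb") as f:
    raw = f.read()

# Iterating over `bytes` yields unsigned 8-bit integers (code-units)
code_units = list(raw)
print(code_units[:5])
```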

Awkward Array doesn't support scalar values, so we can't treat these characters as a single string. Instead, we need at least one dimension. Let's unflatten our array of characters to form a length-1 array that holds a single list of characters.

```{code-cell} ipython3
array_of_chars = ak.unflatten(raw_bytes, len(raw_bytes))
array_of_chars
```

We can then ask Awkward Array to treat this array of lists of characters as an array of strings, using {func}`ak.enforce_type`

```{code-cell} ipython3
string = ak.enforce_type(array_of_chars, "string")
string.type.show()
```

The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array

```{code-cell} ipython3
string.layout
```

The `__array__` parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer {class}`ak.contents.ListOffsetArray` is "string". The inner {class}`ak.contents.NumpyArray` also has an `__array__` parameter, this time with a value of `char`. In Awkward Array, an array of strings *must* look like this layout: a list with the `__array__="string"` parameter wrapping a {class}`ak.contents.NumpyArray` with the `__array__="char"` parameter.
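
As a conceptual sketch in plain Python (not the library's implementation), a list-offset layout of strings can be pictured as one contiguous buffer of code-units plus a sequence of offsets that marks where each string starts and stops. The names `content` and `offsets` are chosen to echo the two parts of a {class}`ak.contents.ListOffsetArray`; the helper `string_at` is hypothetical.

```python
words = ["one", "", "\u00c5", "four"]

# One contiguous buffer holding every string's UTF-8 code-units back to back
content = b"".join(word.encode("utf-8") for word in words)

# offsets[i]:offsets[i + 1] delimits the i-th string inside `content`
offsets = [0]
for word in words:
    offsets.append(offsets[-1] + len(word.encode("utf-8")))


def string_at(index: int) -> str:
    """Hypothetical helper: decode the index-th string from the packed buffer."""
    start, stop = offsets[index], offsets[index + 1]
    return content[start:stop].decode("utf-8")


print(offsets)
print([string_at(i) for i in range(len(words))])
```

The empty string costs no bytes in `content`; it only repeats an offset. The two-byte character advances the offset by two even though it is a single code-point.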

+++

A single (very long) string isn't much use. Let's split this string at the line boundaries

```{code-cell} ipython3
split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines
```

Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string

```{code-cell} ipython3
lines = split_at_newlines[0]
lines
```
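
If a file uses `\r\n` line endings, splitting at `\n` alone leaves a trailing `\r` on every line, and a final newline produces an empty last entry. The plain-Python sketch below shows this with `str.split`, used here as a stand-in on the assumption that {func}`ak.str.split_pattern` treats the separator the same way; the sample text is invented.

```python
text = "first line\r\nsecond line\r\n"

# Splitting on "\n" keeps the carriage returns and yields an empty final piece
pieces = text.split("\n")
print(pieces)

# Drop the empty trailing piece and strip the carriage returns
cleaned = [piece.rstrip("\r") for piece in pieces if piece]
print(cleaned)
```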

In the low-level layout, we can see that these lines are still just variable-length lists

```{code-cell} ipython3
lines.layout
```

## Bytestrings vs strings

+++

In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte-level. The {mod}`ak.str` submodule provides a suite of vectorised operations that operate at the code-point (*not* code-unit) level, such as computing the string length. Consider the following simple string

```{code-cell} ipython3
large_code_point = ak.Array(["Å"])
```

In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although this looks like a single character, if we look at the layout it's clear that the number of code-units is in fact two

```{code-cell} ipython3
large_code_point.layout
```

This is reflected in the {func}`ak.num` function

```{code-cell} ipython3
ak.num(large_code_point)
```

The {mod}`ak.str` module provides a function for computing the length of a string

```{code-cell} ipython3
ak.str.length(large_code_point)
```

Clearly _this_ function is code-point aware.
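
The same distinction exists in plain Python, where `len` of a `str` counts code-points and `len` of its UTF-8 encoding counts code-units. A short standard-library sketch follows; the example characters are chosen for illustration, and the second one is assumed to be the precomposed form of the character used above.

```python
samples = {
    "ASCII letter": "A",        # U+0041
    "A with ring": "\u00c5",    # U+00C5
    "euro sign": "\u20ac",      # U+20AC
    "emoji": "\U0001f600",      # U+1F600
}

for name, char in samples.items():
    code_points = len(char)
    code_units = len(char.encode("utf-8"))
    print(f"{name}: {code_points} code-point, {code_units} code-unit(s)")
```

Each sample is a single code-point, yet the number of UTF-8 code-units grows from one to four.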

+++

If one wants to drop the UTF-8 string abstraction, and instead deal with strings as raw byte arrays, there is the `bytes` type

```{code-cell} ipython3
large_code_point_bytes = ak.enforce_type(large_code_point, "bytes")
large_code_point_bytes
```

The layout of this array has different `"bytestring"` and `"byte"` parameters

```{code-cell} ipython3
large_code_point_bytes.layout
```

Many of the functions in the {mod}`ak.str` module treat bytestrings and strings differently; in the latter case, strings are often manipulated in terms of code-points instead of code-units. Consider {func}`ak.str.length` for this array

```{code-cell} ipython3
ak.str.length(large_code_point_bytes)
```

This is clearly counting the bytes (code-units), not code-points.
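
One practical consequence, sketched here in plain Python rather than with Awkward Array: slicing at the code-unit level can cut a multi-byte code-point in half, leaving bytes that are no longer valid UTF-8. The example word is chosen for illustration.

```python
encoded = "\u00c5ngstr\u00f6m".encode("utf-8")
print(len(encoded))

# Keep only the first byte: half of the two-byte sequence for U+00C5
truncated = encoded[:1]
try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)

# Slicing the decoded string works on whole code-points instead
first_code_point = encoded.decode("utf-8")[:1]
print(first_code_point)
```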