Update Notes
parent
d6d4fa5f82
commit
5fab8b951e
@ -0,0 +1,105 @@
|
||||
# Notes for PR(daneroo) feature/listing branch
|
||||
|
||||
I don't expect this to be merged as is. I mostly wanted to share my experiments.
|
||||
|
||||
If any of it is useful, I can craft proper commits with smaller features.
|
||||
|
||||
*I am a first time contributor, and not very experienced with Go, so would appreciate any feedback both on content and process.*
|
||||
|
||||
## Overview
|
||||
|
||||
When I first started to use the project, I experienced instability, and would need to restart the download process a large number of times to get to the end. The process would timeout after 1000-4000 images. To test, I use two accounts one with ~700 images, and one with 31k images.
|
||||
|
||||
The timeouts were mostly coming from the `navLeft()` iterator, so I disabled actual downloading and fucused on traversal. (`-list` option). I also added an option (`-all`) to bypass the incremental behavior of `$dldir/.lastDone`. I then added `-vt (VerboseTiming`, to isolate timing metrics reporting specifically.
|
||||
|
||||
The last main issue was a degradation of performance over time. Probably due to resource leaks in the webapp, especially when iterating at high speed. I found that the simplest solution to this was simply to force a reload every 1000 iterations. This typically takes about `1s` and avoids the `10X` slowdown I was obeserving.
|
||||
|
||||
Following are details of each of the major changes.
|
||||
|
||||
## `navLeft()`
|
||||
|
||||
- By adding the `-list` option, its functionality is isolated.
|
||||
- Remove `WaitReady("body",..)`, as it has no effect (The document stays loaaded when we navigate from image to image), and replace it with a loop (until `location != prevLocation`).
|
||||
- When only listing, this is the critical part of the loop, so I spent some time experimenting with the timing of the chromedp interactions to optimize total throughput. What seemed to work best was an exponential backoff, so that we can benefit from the *most likely* fast navigation, but don't overtax the browser when we experiene the occasional longer naviagation delay (which can go up to multiple seconds).
|
||||
|
||||
- We also pass in `prevLoaction` from `navN()`, to support the new `navLeft()` loop, to optimize the throughput.
|
||||
|
||||
## `navN()`
|
||||
|
||||
- Addition of timing code
|
||||
- Support listing only (`-list`)
|
||||
- Reload page every (`batchSize=1000`) items.
|
||||
|
||||
## Termination Criteria
|
||||
|
||||
We explicitly fetch the lastPhoto, on the main album page (immediately after authentication), and rely on that, instead of `navLeft()` failing to navigate (`location == prevLocation`). The lastPhoto is determined with a CSS selector evaluation (`a[href^="./photo/"]`), which is fast and should be stable over time, as it uses the `<a href=.. >` detail navigation mechanism of the main album page. This opens another possibility for iteration in the album page itself. (see `listFromAlbum` below)
|
||||
|
||||
Another issue, although the `.lasDone` being captured on a successful run is useful, as photos are listed in exif date order, if older photos are added to the album they would not appear in subsequent runs. So it would be useful in general to rerun over the entire album (`-all`).
|
||||
|
||||
## Headless
|
||||
|
||||
If we don't need the authentication flow, (and we persist the auth creds in the profile `-dev`), Chrome can be brought up in `-headless` mode, which considerably reduces memory footprint (`>2Gb` to `<500Mb`), and margninally (23%) increases throughput for the listing experiment.
|
||||
|
||||
## `listFromAlbum()`
|
||||
|
||||
This is another experiment where listing was entirely performed in the main album page (by scrolling incrementally). This is even faster. It would also allow iterating either forwards or backwards through the album.
|
||||
|
||||
In a further experiment I would like to use this process as a coordinating mechanism and perform the actual downloads in separate (potentially multiple) `tabs/contexts`.
|
||||
|
||||
### Performance: An Argument for periodic page reload and headless mode
|
||||
|
||||
Notice how the latency grows quickly without page reload (ms per iteration): [66, 74, 89, 219, 350, 1008,...], and the cumulative rate drops below `4 items/s` after `6000` items.
|
||||
|
||||
```bash
|
||||
$ time ./gphotos-cdp -dev -n 6001 -list
|
||||
21:36:26 . Avg Latency (last 1000 @ 1000): Marginal: 66.43ms Cumulative: 66.43ms
|
||||
21:37:40 . Avg Latency (last 1000 @ 2000): Marginal: 73.99ms Cumulative: 70.21ms
|
||||
21:39:09 . Avg Latency (last 1000 @ 3000): Marginal: 88.97ms Cumulative: 76.46ms
|
||||
21:42:48 . Avg Latency (last 1000 @ 4000): Marginal: 218.75ms Cumulative: 112.03ms
|
||||
21:48:38 . Avg Latency (last 1000 @ 5000): Marginal: 350.23ms Cumulative: 159.67ms
|
||||
22:05:27 . Avg Latency (last 1000 @ 6000): Marginal: 1008.58ms Cumulative: 301.16ms
|
||||
22:05:27 Rate (6001): 3.32/s Avg Latency: 301.11ms
|
||||
```
|
||||
|
||||
Whereas with page reloading every 1000 images, it is stable at `<60ms` with a sustained rate of `17 items/s`:
|
||||
|
||||
```bash
|
||||
$ time ./gphotos-cdp -dev -vt -list -n 6001
|
||||
14:33:22 . Avg Latency (last 1000 @ 1000): Marginal: 58.88ms Cumulative: 58.88ms
|
||||
14:34:21 . Avg Latency (last 1000 @ 2000): Marginal: 57.60ms Cumulative: 58.24ms
|
||||
14:35:21 . Avg Latency (last 1000 @ 3000): Marginal: 57.93ms Cumulative: 58.14ms
|
||||
14:36:17 . Avg Latency (last 1000 @ 4000): Marginal: 54.93ms Cumulative: 57.33ms
|
||||
14:37:16 . Avg Latency (last 1000 @ 5000): Marginal: 56.24ms Cumulative: 57.12ms
|
||||
14:38:16 . Avg Latency (last 1000 @ 6000): Marginal: 58.73ms Cumulative: 57.38ms
|
||||
14:38:16 Rate (6001): 17.43/s Avg Latency: 57.38ms
|
||||
```
|
||||
|
||||
When invoked in `-headless` mode we can reduce latency to `~47ms` or `21 items/s`, also this reduces memory footprint from `>2Gb` to about `500Mb`
|
||||
|
||||
```bash
|
||||
15:37:44 Rate (6001): 21.27/s Avg Latency: 47.02ms
|
||||
```
|
||||
|
||||
When using `listFromAlbum()`, we can roughly double the iterations speed again to `38 items/s` or `25ms/item`
|
||||
|
||||
```bash
|
||||
15:24:55 Rate (6001): 38.60/s Avg Latency: 25.91ms
|
||||
```
|
||||
|
||||
*All of these times were obtained on a MacBook Air (2015) / 8G RAM.*
|
||||
|
||||
## Starting Chrome with another profile
|
||||
|
||||
```bash
|
||||
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --user-data-dir=/var/folders/bw/7rvbq3q92g5bn4lv4hrh5qv40000gn/T/gphotos-cdp
|
||||
```
|
||||
|
||||
## TODO
|
||||
|
||||
*Move to Document as done.*
|
||||
|
||||
- Review Nomenclature, start,end,newest,last,left,right...
|
||||
- Refactor common chromedp uses in navLeft,navRight,navToEnd,navToLast
|
||||
- `navLeft()` - throw error if loc==prevLoc
|
||||
- flags: `-verify`, `-incr/-optimimistic`
|
||||
- Listing with DOM selectors, and scrolling, while downloading in separate table
|
@ -1,144 +0,0 @@
|
||||
# Notes for feature/listing PR
|
||||
|
||||
## TODO
|
||||
|
||||
*Move to Document as done.*
|
||||
|
||||
- Nomenclature, start,end,newest
|
||||
- `navLeft()`: stabilize and accelerate
|
||||
- flag `-list`
|
||||
- remove `WaitReady("body",..)`
|
||||
- pass in prevousLocation (&string ?)
|
||||
- throw error if loc==prevLoc
|
||||
- exponential backoff (capped), timeout (default with param?)
|
||||
- `navN()` batch timing and reload, `-batch` batchSize flag, default 1000
|
||||
- flags: `-all`, `-list`, `-batch`, `-vt`, `-verify`
|
||||
- flag `-vt`: (verboseTiming) Marginal Latency/Batch, navLeft: log.warning >3 iterations or >1s,1m
|
||||
- skip if already downloaded (-optimistic)
|
||||
- Termination criteria: lastPhoto(ctx)
|
||||
- lastPhoto: explain Selector
|
||||
- setup After Auth
|
||||
- add to `navN()`
|
||||
- Future listing with DOM selectors, and scrolling
|
||||
|
||||
## Starting another profile
|
||||
|
||||
```bash
|
||||
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --user-data-dir=/var/folders/bw/7rvbq3q92g5bn4lv4hrh5qv40000gn/T/gphotos-cdp
|
||||
```
|
||||
|
||||
## Watching
|
||||
|
||||
- Show Histograms of navLeft <https://github.com/jamiealquiza/tachymeter>
|
||||
|
||||
```bash
|
||||
cd ~/Downloads
|
||||
for d in gphotos-cdp*; do echo $d $(find $d -type f|wc -l); done
|
||||
while true; do cat gphotos-cdp/.lastdone ; echo; sleep 1; done
|
||||
```
|
||||
|
||||
## Optimizing total time (peru)
|
||||
|
||||
- document.querySelector('a[href^="./photo/"]').href; // first photo
|
||||
- document.querySelector('[aria-label="View previous photo"]')
|
||||
- document.querySelector('[aria-label="View next photo"]');
|
||||
- Pass in prevLocation to NavLeft
|
||||
- ? Remove chromedp.WaitReady("body", chromedp.ByQuery)
|
||||
|
||||
### An Argument for Batching
|
||||
|
||||
```bash
|
||||
$ time ./gphotos-cdp -dev -n 6001 -list
|
||||
21:35:11 Session Dir: /var/folders/bw/7rvbq3q92g5bn4lv4hrh5qv40000gn/T/gphotos-cdp
|
||||
21:35:11 Download Dir: /Users/daniel/Downloads/gphotos-cdp
|
||||
21:35:14 Last Photo: ./photo/AF1QipMWCRwtfPHED45inXlRNjeg28oyrfhW5UA60Qjg (first on Landing Page)
|
||||
21:35:19 Nav to the end sequence is started because location is https://photos.google.com/photo/AF1QipNS8lcfYzAxV1ji4yY5noyzUUlznuNE4h-qktrJ
|
||||
21:35:19 NavToLast iteration: location is https://photos.google.com/photo/AF1QipNS8lcfYzAxV1ji4yY5noyzUUlznuNE4h-qktrJ
|
||||
21:36:26 . Marginal Avg Latency (last 1000 @ 1000): 66.43ms Cumulative Avg Latency: 66.43ms
|
||||
21:37:40 . Marginal Avg Latency (last 1000 @ 2000): 73.99ms Cumulative Avg Latency: 70.21ms
|
||||
21:39:09 . Marginal Avg Latency (last 1000 @ 3000): 88.97ms Cumulative Avg Latency: 76.46ms
|
||||
21:42:48 . Marginal Avg Latency (last 1000 @ 4000): 218.75ms Cumulative Avg Latency: 112.03ms
|
||||
21:48:38 . Marginal Avg Latency (last 1000 @ 5000): 350.23ms Cumulative Avg Latency: 159.67ms
|
||||
22:05:27 . Marginal Avg Latency (last 1000 @ 6000): 1008.58ms Cumulative Avg Latency: 301.16ms
|
||||
22:05:27 Rate (6001): 3.32/s Avg Latency: 301.11ms
|
||||
OK
|
||||
1815.896s
|
||||
```
|
||||
|
||||
With page reloading every 1000 images:
|
||||
|
||||
```bash
|
||||
$ time ./gphotos-cdp -dev -n 6001 -list
|
||||
21:24:45 Session Dir: /var/folders/bw/7rvbq3q92g5bn4lv4hrh5qv40000gn/T/gphotos-cdp
|
||||
21:24:45 Download Dir: /Users/daniel/Downloads/gphotos-cdp
|
||||
21:24:48 Last Photo: ./photo/AF1QipMWCRwtfPHED45inXlRNjeg28oyrfhW5UA60Qjg (first on Landing Page)
|
||||
21:24:52 Nav to the end sequence is started because location is https://photos.google.com/photo/AF1QipNS8lcfYzAxV1ji4yY5noyzUUlznuNE4h-qktrJ
|
||||
21:24:52 NavToLast iteration: location is https://photos.google.com/photo/AF1QipNS8lcfYzAxV1ji4yY5noyzUUlznuNE4h-qktrJ
|
||||
21:25:55 . Marginal Avg Latency (last 1000 @ 1000): 62.11ms Cumulative Avg Latency: 62.11ms
|
||||
21:27:02 . Marginal Avg Latency (last 1000 @ 2000): 65.13ms Cumulative Avg Latency: 63.62ms
|
||||
21:28:06 . Marginal Avg Latency (last 1000 @ 3000): 62.51ms Cumulative Avg Latency: 63.25ms
|
||||
21:29:10 . Marginal Avg Latency (last 1000 @ 4000): 63.23ms Cumulative Avg Latency: 63.25ms
|
||||
21:30:16 . Marginal Avg Latency (last 1000 @ 5000): 63.60ms Cumulative Avg Latency: 63.32ms
|
||||
21:31:19 . Marginal Avg Latency (last 1000 @ 6000): 62.00ms Cumulative Avg Latency: 63.10ms
|
||||
21:31:21 Rate (6001): 15.85/s Avg Latency: 63.09ms
|
||||
OK
|
||||
396.036s
|
||||
```
|
||||
|
||||
## Just Listing
|
||||
|
||||
```bash
|
||||
# in batches
|
||||
|
||||
time for i in `seq 1 31`; do echo =-=- iteration $i start; time ./gphotos-cdp -v -dev -n 1000; echo =-=- iteration $i done; done
|
||||
|
||||
# From Scratch
|
||||
2019/12/01 00:23:09 .. navLeft so far: 5m0.158025009s
|
||||
2019/12/01 00:23:09 navLeft took 5m0.158049209s
|
||||
2019/12/01 00:23:09 NavN iteration (7312): location is https://photos.google.com/photo/AF1QipORJSC-4iLwGPXtXiVMZsf2ZG8u6-XOegoijUDW
|
||||
OK
|
||||
4098.325s
|
||||
|
||||
# Appended:
|
||||
2019/12/01 01:49:40 .. navLeft so far: 5m0.354725675s
|
||||
2019/12/01 01:49:40 navLeft took 5m0.354868401s
|
||||
2019/12/01 01:49:40 NavN iteration (7493): location is https://photos.google.com/photo/AF1QipN98MoLk2V3gkGxzJuSe8nWqy6U7nudpBxLkuFs
|
||||
OK
|
||||
4562.572s
|
||||
# Append
|
||||
2019/12/01 03:12:15 navLeft took 438.758367ms
|
||||
2019/12/01 03:12:15 NavN iteration (5193): location is https://photos.google.com/photo/AF1QipP0NnL5F31yWiEMgQdx75VaDXMI-099eN-ePTEx
|
||||
2019/12/01 03:12:15 Marking https://photos.google.com/photo/AF1QipP0NnL5F31yWiEMgQdx75VaDXMI-099eN-ePTEx as done
|
||||
^C
|
||||
1087.231s
|
||||
|
||||
# Append
|
||||
2019/12/01 03:16:27 NavN iteration (2190): location is https://photos.google.com/photo/AF1QipPuPZJ-R6UTVgipHD8R-URdoB66hakxX0C4s1U7
|
||||
2019/12/01 03:16:27 Marking https://photos.google.com/photo/AF1QipPuPZJ-R6UTVgipHD8R-URdoB66hakxX0C4s1U7 as done
|
||||
2019/12/01 03:16:27 .. navLeft so far: 7.674008ms
|
||||
2019/12/01 03:16:27 .. navLeft so far: 109.60942ms
|
||||
2019/12/01 03:16:27 .. navLeft so far: 262.405953ms
|
||||
2019/12/01 03:16:27 .. navLeft so far: 369.708306ms
|
||||
2019/12/01 03:16:27 .. navLeft so far: 477.549416ms
|
||||
2019/12/01 03:16:27 .. navLeft so far: 579.120186ms
|
||||
2019/12/01 03:16:28 .. navLeft so far: 686.254498ms
|
||||
^C
|
||||
264.910s
|
||||
# Append
|
||||
2019/12/01 03:20:28 NavN iteration (2574): location is https://photos.google.com/photo/AF1QipMl1tMQVHpxfkgpPMEd_Ko8eiTE8IsyvFX0f0Rf
|
||||
2019/12/01 03:20:28 Marking https://photos.google.com/photo/AF1QipMl1tMQVHpxfkgpPMEd_Ko8eiTE8IsyvFX0f0Rf as done
|
||||
2019/12/01 03:20:28 navLeft took 196.956415ms
|
||||
^C
|
||||
177.288s
|
||||
|
||||
7312 + 7493 + 5193 + 2190 + 2574 + 4000 + 614
|
||||
+1000 67.465s, 76.261s, 71.233s, 74.092s
|
||||
+614
|
||||
|
||||
```
|
||||
|
||||
https://photos.google.com/lr/photo/AICs3TPpjUDPZuOJ7u_AWLKaVnXW-zrVAOWbUEhdZpQWvKKczTTYYEsR4zfqSwUDULOH1s3W18TneUiB_QFDidcpKr63abnUgA
|
||||
|
||||
https://photos.google.com/u/2/photo/AF1QipMWCRwtfPHED45inXlRNjeg28oyrfhW5UA60Qjg
|
||||
AF1QipMWCRwtfPHED45inXlRNjeg28oyrfhW5UA60Qjg
|
||||
|
Loading…
Reference in New Issue