pull/2/merge
Daniel Lauzon 5 years ago committed by GitHub
commit b4174faabd
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -0,0 +1,104 @@
# Notes for PR(daneroo) feature/listing branch
I don't expect this to be merged as-is. I mostly wanted to share my experiments and get feedback.
If any of it is useful, I can craft proper commits with smaller features.
*I am a first-time contributor, and not very experienced with Go, so would appreciate any feedback both on content and process.*
## Overview
When I first started to use the project, I experienced instability and would need to restart the download process a large number of times to get to the end. The process would timeout after 1000-4000 images. To test, I use two accounts one with ~700 images, and one with 31k images.
The timeouts were mostly coming from the `navLeft()` iterator, so I disabled actual downloading and focused on traversal. (`-list` option). I also added an option (`-all`) to bypass the incremental behavior of `$dldir/.lastDone`. I then added `-vt (VerboseTiming`, to isolate timing metrics reporting specifically.
The last main issue was the degradation of performance over time. Probably due to resource leaks in the webapp, especially when iterating at high speed. I found that the simplest solution to this was simply to force a reload every 1000 iterations. This typically takes about `1s` and avoids the `10X` slowdown I was observing.
The following are details of each of the major changes.
## `navLeft()`
- By adding the `-list` option, its functionality is isolated.
- Remove `WaitReady("body",..)`, as it has no effect (The document stays loaded when we navigate from image to image), and replace it with a loop (until `location != prevLocation`).
- When only listing, this is the critical part of the loop, so I spent some time experimenting with the timing of the chromedp interactions to optimize total throughput. What seemed to work best was an exponential backoff, so that we can benefit from the *most likely* fast navigation, but don't overtax the browser when we experience the occasional longer navigation delay (which can go up to multiple seconds).
- We also pass in `prevLoaction` from `navN()`, to support the new `navLeft()` loop, to optimize the throughput.
## `navN()`
- Addition of timing code
- Support listing only (`-list`)
- Reload page every (`batchSize=1000`) items.
## Termination Criteria
We explicitly fetch the lastPhoto, on the main album page (immediately after authentication), and rely on that, instead of `navLeft()` failing to navigate (`location == prevLocation`). The lastPhoto is determined with a CSS selector evaluation (`a[href^="./photo/"]`), which is fast and should be stable over time, as it uses the `<a href=.. >` detail navigation mechanism of the main album page. This opens another possibility for iteration on the album page itself. (see `listFromAlbum` below)
Another issue, although the `.lasDone` being captured on a successful run is useful, as photos are listed in EXIF date order. If older photos are added to the album they would not appear in subsequent runs. So it would be useful in general to rerun over the entire album (`-all`).
## Headless
If we don't need the authentication flow, (and we persist the auth creds in the profile `-dev`), Chrome can be brought up in `-headless` mode, which considerably reduces memory footprint (`>2Gb` to `<500Mb`), and marginally (23%) increases throughput for the listing experiment.
## `listFromAlbum()`
This is another experiment where the listing was entirely performed in the main album page (by scrolling incrementally). This is even faster. It would also allow iterating either forward or backward through the album.
In a further experiment, I would like to use this process as a coordinating mechanism and perform the actual downloads in separate (potentially multiple) `tabs/contexts`.
### Performance: An Argument for a periodic page reload and headless mode
Notice how the latency grows quickly without page reload (ms per iteration): [66, 74, 89, 219, 350, 1008,...], and the cumulative rate drops below `4 items/s` after `6000` items.
```bash
$ time ./gphotos-cdp -dev -n 6001 -list
21:36:26 . Avg Latency (last 1000 @ 1000): Marginal: 66.43ms Cumulative: 66.43ms
21:37:40 . Avg Latency (last 1000 @ 2000): Marginal: 73.99ms Cumulative: 70.21ms
21:39:09 . Avg Latency (last 1000 @ 3000): Marginal: 88.97ms Cumulative: 76.46ms
21:42:48 . Avg Latency (last 1000 @ 4000): Marginal: 218.75ms Cumulative: 112.03ms
21:48:38 . Avg Latency (last 1000 @ 5000): Marginal: 350.23ms Cumulative: 159.67ms
22:05:27 . Avg Latency (last 1000 @ 6000): Marginal: 1008.58ms Cumulative: 301.16ms
22:05:27 Rate (6001): 3.32/s Avg Latency: 301.11ms
```
Whereas with page reloading every 1000 images, it is stable at `<60ms` with a sustained rate of `17 items/s`:
```bash
$ time ./gphotos-cdp -dev -vt -list -n 6001
14:33:22 . Avg Latency (last 1000 @ 1000): Marginal: 58.88ms Cumulative: 58.88ms
14:34:21 . Avg Latency (last 1000 @ 2000): Marginal: 57.60ms Cumulative: 58.24ms
14:35:21 . Avg Latency (last 1000 @ 3000): Marginal: 57.93ms Cumulative: 58.14ms
14:36:17 . Avg Latency (last 1000 @ 4000): Marginal: 54.93ms Cumulative: 57.33ms
14:37:16 . Avg Latency (last 1000 @ 5000): Marginal: 56.24ms Cumulative: 57.12ms
14:38:16 . Avg Latency (last 1000 @ 6000): Marginal: 58.73ms Cumulative: 57.38ms
14:38:16 Rate (6001): 17.43/s Avg Latency: 57.38ms
```
When invoked in `-headless` mode we can reduce latency to `~47ms` or `21 items/s`, also this reduces memory footprint from `>2Gb` to about `500Mb`
```bash
15:37:44 Rate (6001): 21.27/s Avg Latency: 47.02ms
```
When using `listFromAlbum()`, we can roughly double the iterations speed again to `38 items/s` or `25ms/item`
```bash
15:24:55 Rate (6001): 38.60/s Avg Latency: 25.91ms
```
*All of these times were obtained on a MacBook Air (2015) / 8G RAM.*
## Starting Chrome with another profile
```bash
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --user-data-dir=/var/folders/bw/7rvbq3q92g5bn4lv4hrh5qv40000gn/T/gphotos-cdp
```
## TODO
*Move to Document as done.*
- Review Nomenclature, start,end,newest,last,left,right...
- Refactor common chromedp uses in navLeft,navRight,navToEnd,navToLast
- `navLeft()` - throw error if loc==prevLoc
- flags: `-verify`, `-incr/-optimimistic`
- Listing with DOM selectors, and scrolling, while downloading in separate table

@ -26,6 +26,7 @@ import (
"fmt"
"io/ioutil"
"log"
"net/url"
"os"
"os/exec"
"path/filepath"
@ -41,12 +42,17 @@ import (
)
var (
nItemsFlag = flag.Int("n", -1, "number of items to download. If negative, get them all.")
devFlag = flag.Bool("dev", false, "dev mode. we reuse the same session dir (/tmp/gphotos-cdp), so we don't have to auth at every run.")
dlDirFlag = flag.String("dldir", "", "where to write the downloads. defaults to $HOME/Downloads/gphotos-cdp.")
startFlag = flag.String("start", "", "skip all photos until this location is reached. for debugging.")
runFlag = flag.String("run", "", "the program to run on each downloaded item, right after it is dowloaded. It is also the responsibility of that program to remove the downloaded item, if desired.")
verboseFlag = flag.Bool("v", false, "be verbose")
// TODO(daneroo): New flags after -v: order if merge?
nItemsFlag = flag.Int("n", -1, "number of items to download. If negative, get them all.")
devFlag = flag.Bool("dev", false, "dev mode. we reuse the same session dir (/tmp/gphotos-cdp), so we don't have to auth at every run.")
dlDirFlag = flag.String("dldir", "", "where to write the downloads. defaults to $HOME/Downloads/gphotos-cdp.")
startFlag = flag.String("start", "", "skip all photos until this location is reached. for debugging.")
runFlag = flag.String("run", "", "the program to run on each downloaded item, right after it is dowloaded. It is also the responsibility of that program to remove the downloaded item, if desired.")
verboseFlag = flag.Bool("v", false, "be verbose")
verboseTimingFlag = flag.Bool("vt", false, "be verbose about timing")
listFlag = flag.Bool("list", false, "Only list, do not download any images")
allFlag = flag.Bool("all", false, "Ignore -start and <dlDir>/.lastDone, start from oldest photo, implied by -list")
headlessFlag = flag.Bool("headless", false, "Start chrome browser in headless mode (cannot do Auth this way).")
)
var tick = 500 * time.Millisecond
@ -60,6 +66,13 @@ func main() {
log.Print("-start only allowed in dev mode")
return
}
if *listFlag {
*allFlag = true
}
if *startFlag != "" && *allFlag {
log.Print("-start is ignored if -all (implied by -list)")
return
}
s, err := NewSession()
if err != nil {
log.Print(err)
@ -68,6 +81,7 @@ func main() {
defer s.Shutdown()
log.Printf("Session Dir: %v", s.profileDir)
log.Printf("Download Dir: %v", s.dlDir)
if err := s.cleanDlDir(); err != nil {
log.Print(err)
@ -101,6 +115,10 @@ type Session struct {
// really) that was downloaded. If set, it is used as a sentinel, to indicate that
// we should skip dowloading all items older than this one.
lastDone string
// lastPhoto is our termination criteria. It is the first photo in tha album, and hence
// it is the last photo to be fetched in the current traversal order.
// this field is not initialized in NewSession, because it can only be determined after authentication (in firstNav)
lastPhoto string
}
// getLastDone returns the URL of the most recent item that was downloaded in
@ -141,6 +159,12 @@ func NewSession() (*Session, error) {
if err != nil {
return nil, err
}
// skip navigating to s.lastDone if (-all (implied by -list))
if lastDone != "" && *allFlag {
log.Printf("Skipping lastDone (-all): %v", lastDone)
lastDone = ""
}
s := &Session{
profileDir: dir,
dlDir: dlDir,
@ -150,7 +174,7 @@ func NewSession() (*Session, error) {
}
func (s *Session) NewContext() (context.Context, context.CancelFunc) {
ctx, cancel := chromedp.NewExecAllocator(context.Background(),
allocatorOpts := []chromedp.ExecAllocatorOption{
chromedp.NoFirstRun,
chromedp.NoDefaultBrowserCheck,
chromedp.UserDataDir(s.profileDir),
@ -177,7 +201,13 @@ func (s *Session) NewContext() (context.Context, context.CancelFunc) {
chromedp.Flag("enable-automation", true),
chromedp.Flag("password-store", "basic"),
chromedp.Flag("use-mock-keychain", true),
)
}
if *headlessFlag {
allocatorOpts = append(allocatorOpts, chromedp.Flag("headless", true))
allocatorOpts = append(allocatorOpts, chromedp.Flag("disable-gpu", true))
}
ctx, cancel := chromedp.NewExecAllocator(context.Background(), allocatorOpts...)
s.parentContext = ctx
s.parentCancel = cancel
ctx, cancel = chromedp.NewContext(s.parentContext)
@ -258,6 +288,14 @@ func (s *Session) login(ctx context.Context) error {
// 2) if the last session marked what was the most recent downloaded photo, it navigates to it
// 3) otherwise it jumps to the end of the timeline (i.e. the oldest photo)
func (s *Session) firstNav(ctx context.Context) error {
// fetch lastPhoto before navigating to specific photoURL
lastPhoto, err := lastPhoto(ctx)
if err != nil {
return err
}
s.lastPhoto = lastPhoto
if *startFlag != "" {
chromedp.Navigate(*startFlag).Do(ctx)
chromedp.WaitReady("body", chromedp.ByQuery).Do(ctx)
@ -280,6 +318,47 @@ func (s *Session) firstNav(ctx context.Context) error {
return nil
}
// lastPhoto return the URL for the first image in the album. It is meant to be our termination criteria, as the photos are traversed in reverse order
func lastPhoto(ctx context.Context) (string, error) {
// This should be our TerminationCriteria
// extract most recent photo URL
// expr := `document.querySelector('a[href^="./photo/"]').href;` // first photo
// This selector matches <a href="./photo/*"/> elements
// - where the ^= operator matches prefix of the href attribute
sel := `a[href^="./photo/"]` // first photo on the landing page
var href string
ok := false
if err := chromedp.AttributeValue(sel, "href", &href, &ok).Do(ctx); err != nil {
return "", err
}
if !ok {
return "", errors.New("lastPhoto: Unable to find most recent photo")
}
if last, err := absURL(href); err != nil {
return "", err
} else {
log.Printf("Last Photo: %s (first on Landing Page)", last)
return last, nil
}
}
func absURL(href string) (string, error) {
// if href is a relative url, e.g.: "./photo/AF1QipMOl0XXrO9WPSv5muLRBFpbyzGsdnrqUqtF8f73"
// we need an absolute url: https://photos.google.com/photo/AF1QipMOl0XXrO9WPSv5muLRBFpbyzGsdnrqUqtF8f73
u, err := url.Parse(href)
if err != nil {
return "", err
}
// the base url could be fetched from the current location, but the login flow already assume that we are at this url
base, err := url.Parse("https://photos.google.com/")
if err != nil {
return "", err
}
absURL := base.ResolveReference(u).String()
return absURL, nil
}
// navToEnd waits for the page to be ready to receive scroll key events, by
// trying to select an item with the right arrow key, and then scrolls down to the
// end of the page, i.e. to the oldest items.
@ -303,6 +382,9 @@ func navToEnd(ctx context.Context) error {
time.Sleep(tick)
}
// Not used for now. Just for experimenting alternative navigation
// listFromAlbum(ctx)
// try jumping to the end of the page. detect we are there and have stopped
// moving when two consecutive screenshots are identical.
var previousScr, scr []byte
@ -328,6 +410,65 @@ func navToEnd(ctx context.Context) error {
return nil
}
var allPhotoIds = map[string]bool{}
// Not used for now. Just for experimenting alternative navigation
func listFromAlbum(ctx context.Context) error {
// try looping through all pages
// allPhotoIds := map[string]bool{} // temprarily global
start := time.Now()
deadline := time.Now().Add(5 * time.Second)
sel := `a[href^="./photo/"]` // first photo on the landing page
for {
var attrs []map[string]string
if err := chromedp.AttributesAll(sel, &attrs).Do(ctx); err != nil {
log.Printf("navToEnd: error %s", err)
} else {
atLeastOne := false
for _, a := range attrs {
href, err := absURL(a["href"]) // could check that href attr exists
if err != nil {
log.Printf("navToEnd absURL error: %s", err)
return err
}
if _, ok := allPhotoIds[href]; !ok {
allPhotoIds[href] = true
if *verboseFlag {
log.Printf("listFromAlbum: + %s (total:%d)", href, len(allPhotoIds))
}
atLeastOne = true
deadline = time.Now().Add(10 * time.Second)
}
}
if time.Now().After(deadline) {
break
}
if !atLeastOne {
// log.Printf("navToEnd: none added (total:%d) - %s left", len(allPhotoIds), time.Until(deadline))
}
}
// ArrowRight is faster overall and doesn't miss any entries
chromedp.KeyEvent(kb.ArrowRight).Do(ctx)
// PageDown is slower overall and misses some entries
// chromedp.KeyEvent(kb.PageDown).Do(ctx)
// sleeping doesn't change anything with ArrowRight
// time.Sleep(10 * time.Millisecond)
// TODO(daneroo): revisit this...
if *nItemsFlag > 0 && len(allPhotoIds) > *nItemsFlag {
break
}
}
log.Printf("listFromAlbum done adding items, found: %d in: %s", len(allPhotoIds), time.Since(start))
if *verboseTimingFlag {
log.Printf("Rate (%d): %.2f/s Avg Latency: %.2fms", len(allPhotoIds), float64(len(allPhotoIds))/time.Since(start).Seconds(), time.Since(start).Seconds()*1000.0/float64(len(allPhotoIds)))
}
return nil
}
// navToLast sends the "\n" event until we detect that an item is loaded as a
// new page. It then sends the right arrow key event until we've reached the very
// last item.
@ -355,6 +496,7 @@ func navToLast(ctx context.Context) error {
if location == prevLocation {
break
}
log.Printf("NavToLast iteration: location is %v", location)
prevLocation = location
}
return nil
@ -375,9 +517,49 @@ func doRun(filePath string) error {
}
// navLeft navigates to the next item to the left
func navLeft(ctx context.Context) error {
func navLeft(ctx context.Context, prevLocation *string) error {
start := time.Now()
deadline := start.Add(15 * time.Second) // sleep max of 15 Seconds
maxBackoff := 500 * time.Millisecond
backoff := 10 * time.Millisecond // exp backoff: 10,20,40,..,maxBackoff,maxBackoff,..
// var prevLocation string
var location string
chromedp.KeyEvent(kb.ArrowLeft).Do(ctx)
chromedp.WaitReady("body", chromedp.ByQuery)
// I don't think this actually waits for anything, body is already ready
// ...usualy takes <10us
// chromedp.WaitReady("body", chromedp.ByQuery)
n := 0
for {
if err := chromedp.Location(&location).Do(ctx); err != nil {
return err
}
n++
if location != *prevLocation {
break
}
// Should return an error, but this is conflated with Termination Condition
if time.Now().After(deadline) {
log.Printf("NavLeft timed out after: %s (%d)", time.Since(start), n)
break
}
time.Sleep(backoff)
if backoff < maxBackoff {
// calculate next exponential backOff (capped to maxBackoff)
backoff = backoff * 2
if backoff > maxBackoff {
backoff = maxBackoff
}
}
}
elapsed := time.Since(start)
if *verboseTimingFlag && (n > 10 || elapsed.Seconds() > 10) {
log.Printf(". navLeft n:%d elapsed: %s", n, elapsed)
}
return nil
}
@ -419,6 +601,7 @@ func startDownload(ctx context.Context) error {
for _, ev := range []*input.DispatchKeyEventParams{&down, &up} {
if *verboseFlag {
log.Printf("Event: %+v", *ev)
// log.Printf("Event: %+v", (*ev).Type)
}
if err := ev.Do(ctx); err != nil {
return err
@ -532,29 +715,85 @@ func (s *Session) navN(N int) func(context.Context) error {
return nil
}
var location, prevLocation string
totalDuration := 0.0 // for reporting
batchDuration := 0.0 // for reporting
batchSize := 1000 // We force a browser reload every batchSize iterations
for {
start := time.Now()
if err := chromedp.Location(&location).Do(ctx); err != nil {
return err
}
// This is still active as a termination criteria, but is pre-empted by location == s.lastPhoto below
if location == prevLocation {
break
}
prevLocation = location
filePath, err := s.dlAndMove(ctx, location)
if err != nil {
return err
}
if err := doRun(filePath); err != nil {
return err
if !*listFlag {
filePath, err := s.dlAndMove(ctx, location)
if err != nil {
return err
}
if err := doRun(filePath); err != nil {
return err
}
} else {
// This is to test the completeness of the listFromAlbum
// _, ok := allPhotoIds[location]
// if !ok {
// log.Printf("Listing (%d): %v found:%v", n, location, ok)
// }
if *verboseFlag {
log.Printf("Listing (%d): %v", n, location)
}
}
n++
if N > 0 && n >= N {
break
}
if err := navLeft(ctx); err != nil {
// Termination criteria: s.lastPhoto reached
if location == s.lastPhoto {
log.Printf("Last photo reached (%d): %v", n, location)
break
}
if err := navLeft(ctx, &prevLocation); err != nil {
return err
}
totalDuration += time.Since(start).Seconds()
batchDuration += time.Since(start).Seconds()
// Reload page on batch boundary
if n%batchSize == 0 {
// This is where we reload the page - this reduces accumulated latency significantly
// Page reload taked about 1.3 seconds
reloadStart := time.Now()
if err := chromedp.Reload().Do(ctx); err != nil {
if *verboseTimingFlag {
log.Printf(". Avg Latency (last %d @ %d): Marginal: %.2fms Cumulative: %.2fms",
batchSize, n, batchDuration*1000.0/float64(batchSize), totalDuration*1000.0/float64(n))
}
log.Printf("Failed to reload Page at n:%d", n)
return err
}
if *verboseTimingFlag {
log.Printf(". Avg Latency (last %d @ %d): Marginal: %.2fms Cumulative: %.2fms Page Reloaded:%s",
batchSize, n, batchDuration*1000.0/float64(batchSize), totalDuration*1000.0/float64(n), time.Since(reloadStart))
}
if *verboseFlag {
log.Printf("Reloaded page at: %s in %s", location, time.Since(reloadStart))
}
batchDuration = 0.0 // reset Statistics
}
}
if *verboseTimingFlag {
log.Printf("Rate (%d): %.2f/s Avg Latency: %.2fms", n, float64(n)/totalDuration, totalDuration*1000.0/float64(n))
}
return nil
}

Loading…
Cancel
Save