Long conjunctive query phrases make TCAT not start #334

xmacex · 2018-09-30T20:07:02Z

Hi. The following, long conjunctive query phrase #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem will make TCAT not restart after the routine reload which is performed shortly after a query bin is added, as observed in logs/controller.php

2018-09-30 19:38:02 enforcing reload of config for track
2018-09-30 19:38:02 sending a TERM signal to track for 10060
2018-09-30 19:38:02 waiting for graceful exit of script track with pid 10060

I am waiting for the process dmitcat_track.php to respawn via the normal cronjob by monitoring its appearance in ps afx |grep 'var/www/dmi-tcat' |grep "track" listing. It won't be there. Removing the bin from the usual web UI brings the dmitcat_track.php back, after the routine reload is done (in a dozen or so seconds).

The query phrase is long, 142 characters. If I arbitrarily split it into two by placing a disjuctive comma in the middle #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem the same issue still persists.

However, if I further split these into two disjunctions each, #Chinabigbrother #Citizenrating, #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship, #Digitalcensorship #Socialcreditsystem, things work as usual. Of course any actual query design is out the window by this point (if it ever was there, but that's an another matter).

I observed the unabridged query phrase go into the database. The bin gets it's entries in tcat_query_bins_* tables, as usual.

I am not particularly expert in PHP / SQL webapp debugging, so if someone can please guide me how to get more information about the program flow, where to request debug messages, throw PHP exceptions and stack traces, or maybe even run the dmitcat_track.php under XDebug or another debugging tool, I would appreciate it.

I hope I will be able the send a pull request once I have a better idea where to get more information what goes wrong.

The text was updated successfully, but these errors were encountered:

xmacex · 2018-09-30T20:31:18Z

Oops sorry I was neglecting the logs/track.error.log, silly me. With this TCAT query bin in effect, we get a HTTP/1.1 406 Not Acceptable response, and the dmitcat_track.php terminates silently terminates. I won't be able to paste in the whole thing from the log file for security reasons, sorry. I'm afraid there is more TCAT source code reading for me to do...

xmacex · 2018-09-30T22:39:01Z

Twitter Standard streaming API request parameters says about track

Each phrase must be between 1 and 60 bytes, inclusive.

Does it make any sense to document this issue like so, without writing this test against TCAT's codebase

<?php
/**
 * See https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334 for the issue
 */
use PHPUnit\Framework\TestCase;

final class QueryPhraseTest extends TestCase
{
    /** 
     * Test that any phrase is within the Twitter API spec
     * https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters.html
     *
     * @dataProvider phraseProvider
     */
    public function testPhraseLengthIsWithinTwitterApiTrackSpecs($phrase)
    {
        $minLength = 0;
        $maxLength = 60;
        $this->assertGreaterThanOrEqual($minLength, strlen($phrase));
        $this->assertLessThanOrEqual($maxLength, strlen($phrase));
    }

    /**
     * Data providers
     */
    public function phraseProvider()
    {
        return [
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating",
             "#Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship",
             "#Digitalcensorship #Socialcreditsystem"],
        ];
    }
}

Which currently gives us

PHPUnit 7.3.5 by Sebastian Bergmann and contributors.

FF. 3 / 3 (100%)

Time: 122 ms, Memory: 10.00MB

There were 2 failures:

QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #0 ('#Chinabigbrother #Citizenrati...system')
Failed asserting that 142 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set URL expander exits with error #1 ('#Chinabigbrother #Citizenrati...credit', '#Chinablackmirror #Digitaldic...system')
Failed asserting that 64 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

FAILURES!
Tests: 3, Assertions: 12, Failures: 2.

Anyway, towards a solution I would be inclined to fortify the web UI from

dmi-tcat/capture/index.php

Lines 488 to 493 in 85504de

    
           if(params['type']=='track') { 
        
               var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length); 
        
               if(!_nrOfPhrases) { 
        
                   alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases."); 
        
                   return false; 
        
               }

to something like

                if(params['type']=='track') {
                    var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length);
                    if(!_nrOfPhrases) {
                        alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases.");
                        return false;
                    }
                    var _lenOfPhrases = validateLengthOfPhrases(_newphrases);
                    if(!_lenOfPhrases) {
                        alert("One or more of the phrases exceeds the allowed length (60 characters) to the Twitter API. Please shorten your phrases.");
                        return false;

with validateLengthOfPhrases(phrases) being something like phrases.split(',').every(phrase => phrase.length <= 60). Some fortification further down the stack would be good too, I imagine, before phrases are accepted into the database.

xmacex · 2018-09-30T22:45:46Z

Or alternatively fortify the web UI in here

dmi-tcat/capture/index.php

Line 750 in 85504de

function validateQuery(query,type) {

ErikBorra · 2018-10-01T06:40:31Z

Hi @xmacex,

thanks for submitting this issue.

As you stated, the phrase does not make any sense as it would only track tweets which contain exactly "#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem".

TCAT should indeed validate the length of phrases, although it should not insert arbitrary comma's thus modifying the phrases input by the user. Instead, the user could get a warning indicating that the phrase to track is too long.

Best,

Erik

xmacex · 2018-10-01T08:32:50Z

@ErikBorra I suggest let's keep the query design (Rogers 2017; and of course a whole body of literature in information retrieval) issues separate from input string validation. Of course as STS scholars this is a moment of reflection about how this divide exactly gets constantly made and re-made in the everyday practices 😆

xmacex · 2018-10-01T08:43:34Z

Anyway I'd be happy to work on a pull request. I imagine input validation at JS as suggested in #334 (comment), but if someone can nudge me towards what would be a good place to introduce some fortification further down the stack, in I'll do that too while I'm on it. The function create_new_bin

dmi-tcat/capture/query_manager.php

Line 42 in 4613a15

function create_new_bin($params) {

looks like a good candidate

xmacex · 2018-10-02T12:38:14Z

Working on it here https://github.com/xmacex/dmi-tcat/tree/phrases_must_not_exceed_60chrs_each. I really don't know how to provide tests for Javascript functions buried in PHP 😟

xmacex · 2018-10-03T07:35:53Z

A terminological observation about myself: I seem to have converged towars speaking of kitten,lizard,'large aardvark',anteater as one query containing four phrases.

dentoir · 2018-11-12T10:54:48Z

Hi @xmacex

Sorry for the late reply on this and thank you for your pull request. It looks sound in principle. Some quick remarks before I can merge this. Maybe we should add a maximum length to the @username in the follow role as well (which appears to be 15 characters). There is a third location where (individual) phrases are validated, which is validate_capture_phrases() in common/functions.php and is called by CLI scripts such as search.php. Finally, could you detach your unit test from the pull request? It would leave a script in the tree which has non-present dependencies.

xmacex · 2018-11-13T16:09:49Z

Thanks @dentoir. The Twitter API documentation for follow does not seem to state that the lengths of usernames are limited as API query time (of course on new user registration, yes). Exceeding the 60 characters limit for a track, the Twitter API responds with a HTTP/1.1 406 Not Acceptable. Does it give an error in the HTTP response, and freak out TCAT if asked to follow a user name too long? I am not sure if I can currently verify that myself.

I implemented validate_capture_phrases() hopefully as suggested in commit df582ee, sorry I missed that. Thanks for pointing it out. If someone can take a quick review of the PR #335, I'll detach the test suite and nudge the pull request.

xmacex mentioned this issue Oct 2, 2018

Phrases must not exceed 60chrs each #335

Merged

niczem added the bug label Jan 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long conjunctive query phrases make TCAT not start #334

Long conjunctive query phrases make TCAT not start #334

xmacex commented Sep 30, 2018

xmacex commented Sep 30, 2018 •

edited

Loading

xmacex commented Sep 30, 2018

xmacex commented Sep 30, 2018

ErikBorra commented Oct 1, 2018 •

edited

Loading

xmacex commented Oct 1, 2018

xmacex commented Oct 1, 2018

xmacex commented Oct 2, 2018 •

edited

Loading

xmacex commented Oct 3, 2018

dentoir commented Nov 12, 2018

xmacex commented Nov 13, 2018

Long conjunctive query phrases make TCAT not start #334

Long conjunctive query phrases make TCAT not start #334

Comments

xmacex commented Sep 30, 2018

xmacex commented Sep 30, 2018 • edited Loading

xmacex commented Sep 30, 2018

xmacex commented Sep 30, 2018

ErikBorra commented Oct 1, 2018 • edited Loading

xmacex commented Oct 1, 2018

xmacex commented Oct 1, 2018

xmacex commented Oct 2, 2018 • edited Loading

xmacex commented Oct 3, 2018

dentoir commented Nov 12, 2018

xmacex commented Nov 13, 2018

xmacex commented Sep 30, 2018 •

edited

Loading

ErikBorra commented Oct 1, 2018 •

edited

Loading

xmacex commented Oct 2, 2018 •

edited

Loading