Index.php still reachable/indexable despite pretty URLs being set

So I noticed that Google Search Console is indexing Domain.com URLs that contain index.php.

The thing is, I’ve had pretty URLs set up following the documentation this whole time. When I go to domain.com, it does not show index.php in the URL any more, and when I go to sub-pages, index.php isn’t part of the hyperlink.

HOWEVER, when I go directly to a Domain.com URL that contains index.php, it does NOT redirect to a URL without index.php; index.php stays in the URL.

So Google is indexing bad pages, and as far as I can tell the .htaccess rewrite presented is not sufficient. I’ve tried a few permutations (which I have since forgotten the details of, sorry). One configuration seemed to work, but when I went to the dashboard to modify the attributes of a page, I got load errors, and those errors went away when I removed the modifications.

Anyways, these are the contents of my .htaccess:

<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteBase /
	RewriteCond %{REQUEST_FILENAME} !-f
	RewriteCond %{REQUEST_FILENAME}/index.html !-f
	RewriteCond %{REQUEST_FILENAME}/index.php !-f
	RewriteRule . index.php [L]
</IfModule>

I couldn’t quite land on a single reliable configuration, so I was hoping someone could chime in. I’m also on v8.5.12, and upgrading to v9 is not currently possible due to addon limitations.

Anyways, is anyone able to help with the rewrite contents I need in .htaccess please?
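
One possible direction for the rewrite question above, offered as a sketch rather than a confirmed fix. The stock Concrete block only routes non-file requests to index.php; nothing in it redirects a request that a visitor or crawler makes directly for /index.php or /index.php/some-page. A redirect keyed on %{THE_REQUEST} (the raw request line the client sent) fires only for those direct requests, not for Concrete's own internal rewrite. The GET/HEAD restriction and the exclusion of /index.php/ccm and /index.php/dashboard paths are assumptions, added on the guess that the dashboard load errors came from POST or AJAX calls being redirected; nothing in this thread confirms that. The block would need to sit above the existing Concrete rewrite block.

```apache
<IfModule mod_rewrite.c>
	RewriteEngine On
	# Match only when the client itself asked for /index.php...
	# THE_REQUEST is the original request line, so Concrete's internal
	# rewrite to index.php never satisfies this condition (no redirect loop).
	# Assumption: limit to GET/HEAD so form posts are never redirected.
	RewriteCond %{THE_REQUEST} ^(?:GET|HEAD)\s+/index\.php(?:[/?]\S*)?\s
	# Assumption: leave Concrete's own tool and dashboard URLs alone.
	RewriteCond %{REQUEST_URI} !^/index\.php/(?:ccm|dashboard)(?:/|$)
	# /index.php -> /   and   /index.php/some-page -> /some-page
	# (the query string is carried over by default).
	RewriteRule ^index\.php(?:/(.*))?$ /$1 [R=301,L]
</IfModule>
```

While testing, R=302 may be the safer choice, since browsers cache 301 responses. With a TLS-terminating reverse proxy in front of Apache, the scheme or host in the generated Location header may also need checking.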

One thing to check is if your sitemap.xml needs regenerating? Just in case it’s got index.php entries in there confusing Google.

My sitemap.xml is generating properly; I knew that before even checking :wink: I think this is more of an Apache configuration question (maybe one specific to Concrete CMS?).

Are you using canonical urls? Google shouldn’t index sitename.com/index.php if canonical is set to sitename.com.

Yeah, I am using canonical URLs. I have followed the documentation and steps needed for all of that. A human clicking around the website never seems to encounter index.php. However, if I manually go to a Domain.com URL with index.php in it, the page still loads with index.php in the URL, which is likely how Google is finding and indexing those pages. The sitemap doesn’t list any index.php entries at all!

And yes the .htaccess file does include the code that Concrete CMS spits out. Also keep in mind I’m using v8.5.12, so I don’t know if what is generated in v9.x is different or not (there’s addons blocking me from upgrading to v9 right now).

Yeah, I’m still not seeing why, despite the .htaccess and other changes I’ve made, the /index.php content is still being served, and I’m now seeing this on three Concrete CMS websites (all v8.5.x).

I’ve found some threads alluding to this being a Concrete problem (from many years ago) and I’m really not sure what I can actually do at this point to improve this. This is actually hurting the SEO of these websites, so really do need a solution here! Anyone? :frowning:

You mention the .htaccess includes what Concrete CMS spits out - but does it contain other stuff, or have you modified it? Just wondering if trying just exactly what Concrete provides would be worth testing. Because yeah, otherwise I’m stumped.

(Sorry that I took so long to reply, lost track of this thread.)

Here are the contents of .htaccess (below this there are “Redirect 301” and “Redirect 410” directives, redacted due to sensitivity):

<IfModule mod_deflate.c>
# Compress HTML, CSS, JavaScript, Text, XML and fonts
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/rss+xml
AddOutputFilterByType DEFLATE application/vnd.ms-fontobject
AddOutputFilterByType DEFLATE application/x-font
AddOutputFilterByType DEFLATE application/x-font-opentype
AddOutputFilterByType DEFLATE application/x-font-otf
AddOutputFilterByType DEFLATE application/x-font-truetype
AddOutputFilterByType DEFLATE application/x-font-ttf
AddOutputFilterByType DEFLATE application/x-javascript
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE font/opentype
AddOutputFilterByType DEFLATE font/otf
AddOutputFilterByType DEFLATE font/ttf
AddOutputFilterByType DEFLATE image/svg+xml
AddOutputFilterByType DEFLATE image/x-icon
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/javascript
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/xml

# Remove browser bugs (only needed for really old browsers)
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
Header append Vary User-Agent
</IfModule>

# set browser caching
ExpiresActive On
ExpiresDefault M1209600
#ExpiresDefault "access 1 year"
ExpiresByType image/gif M1209600
ExpiresByType image/png M1209600
#ExpiresByType image/jpeg M1209600
ExpiresByType image/jpg M1209600
ExpiresByType image/jpeg M1209600
ExpiresByType image/x-icon M1209600
ExpiresByType application/pdf M1209600
ExpiresByType application/x-javascript M1209600
#ExpiresByType text/x-javascript M1209600
ExpiresByType text/x-javascript M1209600
ExpiresByType text/plain M1209600
ExpiresByType text/css M1209600
# end browser caching
# TIME CHEAT SHEET
#      300   5 MIN
#      600  10 MIN
#      900  15 MIN


# -- concrete5 urls start --
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteBase /
	RewriteCond %{REQUEST_FILENAME} !-f
	RewriteCond %{REQUEST_FILENAME}/index.html !-f
	RewriteCond %{REQUEST_FILENAME}/index.php !-f
	RewriteRule . index.php [L]
</IfModule>
# -- concrete5 urls end --

Any thoughts on the above @EvanCooper ?

Nothing really, other than it definitely looks custom compared to the default stuff generated by the core. It might also be somewhere in the server configuration. I don’t believe this happens for most installations, which leads me to believe that somewhere along the line it’s a configuration issue.

Okay, but is there any particular direction you can point me in next? I’ve worked through a substantial number of configuration changes for the reverse proxy in front of the host (nginx) and the HTTPD on the host itself (apache2), as well as adjustments to how the rewrite is done within .htaccess, both starting from the lines Concrete CMS itself generates and going through oodles of examples online on the index.php rewrite topic. So far I have not found a reliable solution, and I have reverted to the rewrite rules Concrete CMS outputs. Oh, and I believe I have also tried with caching turned on and off within Concrete CMS.

As for the customisations within .htaccess itself, that’s in response to website analysis tools and their recommendations.

Full disclosure I’m not an Apache / Nginx / webserver expert, but my hunch is that your problem might not actually be at the webserver level.

What I’m thinking is you might have a hardcoded internal link or something in your site that contains index.php, and Google sees that and is like “oh, index.php you say? What else is under here? Why, it’s the whole site! Let us index!”

Or something equally simple. So my recommendation would be to restart your investigation from the outside in: start with Google instead of your webserver.

You can hit up their Google Search Console Tools:
https://search.google.com/search-console/about

As a triage measure, I think you can set things in that tool to not index, so you might be able to say “don’t index anything under index.php, Google” and it will honor that and you at least stop the problem at the surface level, even if the underlying cause is still present.
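
A server-side version of that triage, as a minimal sketch assuming Apache 2.4 with mod_headers loaded (neither is confirmed in this thread): send an X-Robots-Tag: noindex header only when the raw request line asked for /index.php. Pretty-URL requests should be unaffected, because THE_REQUEST holds what the client sent rather than the internally rewritten path.

```apache
<IfModule mod_headers.c>
	# THE_REQUEST looks like "GET /index.php/some-page HTTP/1.1".
	# Tag only responses to requests that literally asked for /index.php...
	<If "%{THE_REQUEST} =~ m#^\S+\s+/index\.php[/?\s]#">
		Header set X-Robots-Tag "noindex"
	</If>
</IfModule>
```

A crawler has to be allowed to fetch a URL to see this header, so a robots.txt Disallow covering the same index.php URLs would hide it.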

But again, even though I don’t know for certain, I would say odds are this isn’t a Concrete-specific issue, because otherwise we’d be hearing about it from left, right and center, you know?

  1. No worries about not knowing “everything about everything” :stuck_out_tongue_winking_eye:
  2. One of the example hosts isn’t publicly accessible (dev system), the other is (and is Google Indexed). And in my testing both present/render websites with the /index.php/ aspect being included. So I hear you on the Google seeing it and being all “GIMMEEE GOBLBBLEE GOBBLE GOBBLE”, I am still perplexed by the dev system results too.
  3. I have tried a bunch of noindex methods in robots.txt, with limited success. It has improved the indexing aspect, but it still doesn’t feel like the “best” solution. I even tried 301s for index.php, and from what I remember that started causing undesirable problems (but I forget which at the moment).
  4. I hear you on the “why aren’t more people talking about this?” angle, and I’ve asked that myself lots. I’ve dug so deep all over the inter-butts (even many Concrete-specific threads) assuming I’m doing it wrong to come to the conclusion that I really don’t think I am.
  5. Hardcoded links? Yeah, that’s def possible on the publicly accessible site, but on the Dev one I really do not believe I have that going on. I doubt I have it on the public site either, but I’m not as confident about that as I am about the Dev site.

Either way, thanks for your thoughts. Someday I might figure it out, dunno.

You could modify your /index.php file. I did this years ago before canonical URLs existed on 5.x legacy sites. Interesting things can be done here, but no one seems to talk about it. :sunglasses:

I’d start with the dev site (the comment block can be removed) …

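<?php /* index.php - with mods */
// 301 redirect *just* /index.php to / (for crawlers and clean URLs).
// REQUEST_URI includes the query string and any path after index.php, so
// /index.php?cID=1 and /index.php/some-page are NOT redirected by this check.
if ($_SERVER['SCRIPT_NAME'] === $_SERVER['REQUEST_URI']) {
    header('Location: https://' . $_SERVER['HTTP_HOST'] . '/', true, 301);
    exit;
}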
/*
echo '<pre>'. print_r($_SERVER, true). '</pre>';
-----
  /    (no index.php)
[SCRIPT_NAME] => /index.php
[REQUEST_URI] => /
[SCRIPT_URL] => /
[PHP_SELF] => /index.php
-----
  /index.php    (the only one where SCRIPT_NAME = REQUEST_URI)
[SCRIPT_NAME] => /index.php
[REQUEST_URI] => /index.php
[QUERY_STRING] =>
[SCRIPT_URL] => /index.php
[PHP_SELF] => /index.php
-----
  /index.php/some-page?q=bogotest
[SCRIPT_NAME] => /index.php
[REQUEST_URI] => /index.php/some-page?q=bogotest
[QUERY_STRING] => q=bogotest
[SCRIPT_URL] => /index.php/some-page
[PHP_SELF] => /index.php/some-page
*/

require 'concrete/dispatcher.php';

Works on several of my 9.2.x hosts with Pretty URLs. Should also work without.

Let us know here how it goes…

While that may be an option, I really would rather not modify core Concrete CMS files (and yes I’m sure index.php doesn’t really change). If the issue originates within Concrete CMS, well then we should have that changed upstream by the CMS Devs. If not, well then change something else (.htaccess? HTTPD config? dunno!)

I do appreciate you sharing, as it may help others (yay!) but I myself am likely to opt out of that method. :slight_smile:

I’ve never seen it changed in over 10 years, and I wouldn’t consider it part of the core either. In this case I’d look at it more like adding a custom route.

This is the perfect place to do it - why involve the overhead of loading the CMS when you are just doing a redirect? Besides, it’s only 4 lines of PHP code.

You could try a feature or pull request, but I would be surprised if it made it into the core.

Again, I’ve done some very interesting/custom things in index.php in the past, among them logging DDoS attacks and stopping them from crashing the web/database server.

It’s also not specific to the type of webserver (Apache/Nginx/IIS/etc).

I should also point out that https://SOMESITE/index.php?cID=1 could also be indexed by a crawler (and is not dependent on Pretty URLs). :stuck_out_tongue_winking_eye:

I’d love to hear @andrew’s opinion on this subject.